The Function of Punctuation in the Automatic Identification of Chinese Academic Papers Online
Zou Yongli¹, Wang Hao²
1. School of Information Management, Sun Yat-sen University, Guangzhou 510006; 2. Huafa Group, Zhuhai 519020
Abstract: As the number of academic papers published online keeps growing, it is important to find more efficient ways to identify those papers with general search engines. To provide ideas for the automatic identification of online Chinese academic papers, this paper presents a comparative study of punctuation use in Chinese academic papers and news reports. Two corpora were built and analyzed: one comprising 6,906 academic papers and another comprising 16,316 news reports. A comparison of total punctuation counts, relative usage rates, and average usage counts between the two document types reveals both similarities and differences in punctuation usage. The similarities appear at the macro, relative level and in two stable sequences of punctuation usage. The differences appear at the micro, absolute level: independent-sample non-parametric tests show that Chinese academic papers and news reports differ significantly in their use of all 14 kinds of punctuation analyzed in this study. The findings were tested in the NSIRS, a system previously developed by the authors, to which a punctuation analysis module was added to evaluate how well punctuation identifies academic papers. Retrieval experiments show that the punctuation characteristics of academic papers do have an identifying effect and can be used to improve the retrieval precision of academic articles online.
Received: 27 June 2017
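
The three measures named in the abstract (total punctuation counts, relative usage rates, and average usage counts per document) can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `punctuation_profile`, the mark set `MARKS`, and the sample sentences are all assumptions made for illustration, and the abstract does not list the 14 marks actually analyzed.

```python
from collections import Counter

# Hypothetical mark set for illustration only; the abstract does not
# list the 14 kinds of punctuation analyzed in the study.
MARKS = "，。、；：？！“”（）《》"


def punctuation_profile(docs):
    """Return (totals, relative, average) for the marks in MARKS.

    totals   -- raw count of each mark over all documents
    relative -- each mark's share of all counted marks
    average  -- each mark's mean count per document
    """
    totals = Counter()
    for text in docs:
        totals.update(ch for ch in text if ch in MARKS)
    n_marks = sum(totals.values())
    relative = {m: (totals[m] / n_marks if n_marks else 0.0) for m in MARKS}
    average = {m: (totals[m] / len(docs) if docs else 0.0) for m in MARKS}
    return totals, relative, average


if __name__ == "__main__":
    # Toy stand-ins for the two corpora (invented sentences, not study data).
    papers = ["本文提出一种方法；实验表明，该方法有效。", "结果（见表1）显示：精度提高。"]
    news = ["记者获悉，会议今日召开！", "他说：“我们很高兴。”"]
    for name, corpus in (("papers", papers), ("news", news)):
        totals, relative, average = punctuation_profile(corpus)
        print(name, dict(totals))
```

The per-document counts from each corpus could then be passed to an independent-sample non-parametric test to check whether the two document types differ for a given mark. The abstract does not name the specific test, so that step is left out here.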