Automatic Labeling of Semantic Clauses in Research Articles
Huang Wenbin¹, Wang Yueqian¹, Bu Yi¹, Che Shangkun²
1. Department of Information Management, Peking University, Beijing 100871
2. School of Economics and Management, Tsinghua University, Beijing 100084
Abstract Analyzing the semantic structure of research articles supports tasks such as information extraction and retrieval. This paper describes that structure by applying machine learning techniques to recognize the semantic type of each discourse segment in an article. We extracted the macro structure of research articles and the syntactic and lexical information of each discourse segment as input features, and trained five models: support vector machines (SVM), conditional random fields (CRF), random forests (RF), a gradient boosting classifier (GBC), and a stochastic gradient descent classifier (SGD). We then integrated the three best-performing models (CRF, SVM, and GBC) into a bagging model for classifying all discourse segments in the full text. Experimental results showed that the bagging model achieved higher accuracy and F-score than the baseline model when classifying discourse segments from both full texts and results sections. Furthermore, a topic-clustering experiment demonstrated the effectiveness of the model for topic detection, a common task in text mining.
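The abstract does not state how the predictions of the three models are combined. As a minimal sketch, assuming a per-segment majority vote with ties broken by a fixed model order, the integration step might look like the following. The function name `majority_vote`, the tie-breaking rule, and the segment labels are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Dict, List, Sequence


def majority_vote(predictions: Dict[str, Sequence[str]],
                  priority: List[str]) -> List[str]:
    """Combine per-segment labels from several models by majority vote.

    Assumption: ties are broken in favour of the model that appears
    first in `priority`. The paper may use a different rule.
    """
    lengths = {len(p) for p in predictions.values()}
    if len(lengths) != 1:
        raise ValueError("all models must label the same number of segments")
    n_segments = lengths.pop()

    combined = []
    for i in range(n_segments):
        # Count how many models chose each label for segment i.
        votes = Counter(predictions[m][i] for m in priority)
        top = max(votes.values())
        # The first model (in priority order) whose label reached the
        # top count decides the ensemble label.
        for m in priority:
            if votes[predictions[m][i]] == top:
                combined.append(predictions[m][i])
                break
    return combined


# Hypothetical labels for four discourse segments (illustrative only).
preds = {
    "CRF": ["fact", "method", "result", "implication"],
    "SVM": ["fact", "result", "result", "hypothesis"],
    "GBC": ["goal", "method", "result", "result"],
}
ensemble = majority_vote(preds, priority=["CRF", "SVM", "GBC"])
print(ensemble)
```

In the last segment all three models disagree, so the label from the first model in `priority` is kept; ordering that list by validation performance would be one reasonable choice.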
Received: 13 May 2020