Research on Automatic Labeling of Clause-Level Semantic Types in Academic Papers

doi:10.3772/j.issn.1000-0135.2021.06.007

Journal of the China Society for Scientific and Technical Information (情报学报), 2021, Vol. 40, Issue 6: 621-629

Automatic Labeling of Semantic Clauses in Research Articles

Huang Wenbin¹, Wang Yueqian¹, Bu Yi¹, Che Shangkun²

1. Department of Information Management, Peking University, Beijing 100871
2. School of Economics and Management, Tsinghua University, Beijing 100084

Abstract: Analyzing the semantic structure of research articles supports tasks such as information extraction and retrieval. This paper describes the semantic structure of research articles by applying machine learning techniques to recognize the semantic types of their discourse segments. We extracted the macro structure of the articles, together with the syntactic and lexical information of each discourse segment, as input features and trained five models: support vector machines (SVM), conditional random fields (CRF), random forests (RF), a gradient boosting classifier (GBC), and a stochastic gradient descent classifier (SGD). We integrated the three best-performing models (CRF, SVM, and GBC) into a bagging model for classifying all discourse segments in the full text. Experimental results showed that the bagging model outperformed the baseline model, with higher accuracy and F-score, when classifying discourse segments from both the full text and the result sections. Furthermore, a topic-clustering experiment demonstrated the effectiveness of the model for topic detection, a common text-mining task.
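
The abstract does not say how the three selected models were combined, so the following is only a minimal sketch of one common option, majority voting over per-segment labels. The function name `majority_vote`, the tie-breaking rule (the earliest-listed model wins), and the example labels are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-segment labels from several models by majority vote.

    predictions[m][i] is the label that model m assigns to segment i.
    Ties go to the earliest-listed model (an assumption of this sketch).
    """
    if not predictions:
        raise ValueError("at least one model's predictions are required")
    n_segments = len(predictions[0])
    if any(len(p) != n_segments for p in predictions):
        raise ValueError("all models must label the same number of segments")

    combined = []
    for i in range(n_segments):
        votes = [model_preds[i] for model_preds in predictions]
        counts = Counter(votes)
        top = max(counts.values())
        # The first label, in model order, that reaches the top count wins.
        winner = next(label for label in votes if counts[label] == top)
        combined.append(winner)
    return combined


# Hypothetical outputs of three models for four discourse segments.
crf_labels = ["method", "result", "result", "implication"]
svm_labels = ["method", "method", "result", "fact"]
gbc_labels = ["goal", "result", "result", "hypothesis"]

ensemble_labels = majority_vote([crf_labels, svm_labels, gbc_labels])
print(ensemble_labels)
```

Listing the models in order of individual performance means that a three-way disagreement falls back to the strongest single model, which is one simple way to resolve ties.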

Key words: research article; semantic labeling; text classification; machine learning; clustering

Received: 13 May 2020

Cite this article:

Huang Wenbin, Wang Yueqian, Bu Yi, et al. Automatic Labeling of Semantic Clauses in Research Articles[J]. Journal of the China Society for Scientific and Technical Information, 2021, 40(6): 621-629.

URL:

https://qbxb.istic.ac.cn/EN/10.3772/j.issn.1000-0135.2021.06.007 OR https://qbxb.istic.ac.cn/EN/Y2021/V40/I6/621
