Research on the Structure Recognition of Academic Texts Under Different Characteristics
Wang Dongbo¹, Gao Ruiqing¹, Ye Wenhao¹, Zhou Xin², Zhu Danhao³
1. College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095; 2. Department of Information Management, Nanjing University, Nanjing 210093; 3. Department of Computer Science and Technology, Nanjing University, Nanjing 210093
Abstract With the emergence of a large number of full-text scientific papers, extracting the useful information they contain is not only beneficial to knowledge-based organizations but also useful for the accurate retrieval of academic literature. Recognizing the structure of academic texts is the basis for this work, because it helps readers and systems understand documents in depth and at the semantic level, and it promotes research on academic text mining. This paper takes the structural functions of academic texts as its research object and uses 1579 papers from the Journal of the Association for Information Science and Technology as the dataset. Three types of models were compared, namely a bidirectional long short-term memory neural network, a support vector machine, and conditional random fields, and conditional random fields were selected for the subsequent experiments. On this basis, functional structure recognition of academic texts was transformed into the problem of labeling a sequence of sentence units. The best model achieved an average F-measure of 92.88% on the open test, and the effect of different features on structure recognition was explored. The experimental results showed that lexical information in the chapter titles and feature words in the chapters play an important role in functional structure recognition of academic texts, and satisfactory results were produced. However, the length of the structure affected the performance of the conditional random fields method. Finally, the causes of recognition errors are summarized, together with the limitations of the study and plans for further work.
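The sentence-unit framing mentioned in the abstract can be illustrated with a minimal sketch. It turns a paper, given as (chapter title, sentence) pairs, into one feature dictionary per sentence, in the style commonly fed to conditional random field toolkits. Everything here is an assumption made for illustration: the cue-word list `CUE_WORDS`, the feature names, the sample sentences, and the helper functions are hypothetical and are not the authors' actual feature templates.

```python
from typing import Dict, List, Tuple

# Hypothetical cue ("feature") words; the paper's real lists are not given here.
CUE_WORDS = {"propose", "dataset", "results", "conclude", "previous"}

# One sentence unit: (title of the chapter it belongs to, sentence text).
Sentence = Tuple[str, str]


def sentence_features(doc: List[Sentence], i: int) -> Dict[str, float]:
    """Build an illustrative feature dict for the i-th sentence of a paper."""
    title, text = doc[i]
    tokens = [tok.strip(".,;:()") for tok in text.lower().split()]
    feats: Dict[str, float] = {
        "bias": 1.0,
        # Relative position of the sentence within the paper, in [0, 1].
        "position": round(i / max(len(doc) - 1, 1), 2),
        # Sentence length in whitespace-separated tokens.
        "length": float(len(tokens)),
    }
    # Lexical information from the chapter title.
    for word in title.lower().split():
        feats[f"title:{word}"] = 1.0
    # Cue words that occur in the sentence itself.
    for tok in tokens:
        if tok in CUE_WORDS:
            feats[f"cue:{tok}"] = 1.0
    # Mark the beginning and end of the sentence sequence.
    if i == 0:
        feats["BOS"] = 1.0
    if i == len(doc) - 1:
        feats["EOS"] = 1.0
    return feats


def doc_to_sequence(doc: List[Sentence]) -> List[Dict[str, float]]:
    """Convert a whole paper into a sequence of per-sentence feature dicts."""
    return [sentence_features(doc, i) for i in range(len(doc))]


# Toy paper with three sentence units (invented for illustration).
doc = [
    ("1 Introduction", "We propose a model."),
    ("3 Experiments", "The dataset has 1579 papers."),
    ("5 Conclusion", "We conclude that titles help."),
]
seq = doc_to_sequence(doc)
```

A sequence model would then be trained on such feature sequences paired with one structure label per sentence. Keeping title words and cue words as separate feature groups makes it straightforward to switch either group off and compare their contributions.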
Received: 25 February 2018