A Study on Chinese Terminology Recognition of Theory and Method from Information Science: Based on Deep Learning
Wang Hao1,2, Deng Sanhong1,2, Su Xinning1,2, Guan Qin1,2
1. School of Information Management, Nanjing University, Nanjing 210023; 2. Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210093

Abstract Research on theories and methods drives the continued development of any discipline, so it is important to understand how current theories and methods are applied and how they evolve within a subject area. This paper uses terminology recognition, a branch of named entity recognition, to study the theory and method terms of information science. About 20,000 information science articles published over the past 20 years were collected as a large-scale corpus for training and testing Bi-LSTM-CRFs, a deep learning model. The experiments verify the model’s feasibility and examine how each experimental variable affects its performance, with the aim of maximizing recognition performance. The results show that, for complex entities such as theory and method terms, recognition on a character-segmented corpus is better than on a word-segmented corpus. Term length also affects recognition: when a term is too long (word count ≥ 6), performance drops markedly. The size of the training corpus is positively correlated with recognition performance, so a larger corpus yields better recognition. The type and number of entities directly affect the results, and entities with distinctive word-formation features are recognized better. In the feature-introduction experiments, apart from the pinyin feature, the part-of-speech, word-length, and word-vector features all improve the F1 value, and the improvements from the word-vector and part-of-speech features are obvious.
Received: 26 July 2019