Automatic Discipline Classification for Scientific Papers Based on a Deep Pre-training Language Model
Luo Pengcheng1,2, Wang Yibo2, Wang Jimin1
1. Department of Information Management, Peking University, Beijing 100871; 2. Peking University Library, Beijing 100871
Abstract To support discipline-related intelligence and literature services, this paper explores the use of deep pre-trained language models to automatically classify scientific papers according to the Ministry of Education's discipline classification. We constructed a literature classification model based on BERT and ERNIE and evaluated it on a dataset of about 100,000 journal papers from 21 first-level disciplines in the humanities and social sciences. We compared the model with traditional machine learning methods (e.g., Naïve Bayes and Support Vector Machines) and typical deep learning methods (i.e., Convolutional Neural Networks and Recurrent Neural Networks). The methods based on deep pre-trained language models performed best: ERNIE reached a top-1 accuracy of 75.56% and a top-2 accuracy of 89.35%. The classifier that took the title, keywords, and abstract of a paper together as input achieved the best result. Relatively independent disciplines were classified accurately; for example, the F1 score of Sports Science was 0.98. Disciplines that overlap heavily with others were classified less accurately; for example, the F1 scores of Theoretical Economics and Applied Economics were around 0.6. The paper also discusses disciplinary intersection, model application, and optimization.
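The abstract reports top-1 and top-2 accuracy and per-discipline F1 scores. As a minimal sketch of how such metrics are commonly computed (not the authors' evaluation code), the Python below uses hypothetical helper names `top_k_accuracy` and `f1_for_class` and a toy set of scores for three made-up classes; none of the numbers come from the paper.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int], k: int) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the top k.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in ranked
    return hits / len(labels)


def f1_for_class(predicted: Sequence[int],
                 labels: Sequence[int], cls: int) -> float:
    """Per-class F1: harmonic mean of precision and recall for one class."""
    tp = sum(p == cls and y == cls for p, y in zip(predicted, labels))
    fp = sum(p == cls and y != cls for p, y in zip(predicted, labels))
    fn = sum(p != cls and y == cls for p, y in zip(predicted, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy classifier scores for 4 papers over 3 hypothetical disciplines
# (illustrative values only, not data from the paper).
toy_scores = [
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.3, 0.6],
    [0.6, 0.1, 0.3],
]
toy_labels = [0, 1, 2, 1]

# The top-1 prediction is the highest-scoring class for each paper.
toy_predicted = [max(range(len(row)), key=lambda i: row[i]) for row in toy_scores]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
print(f"top-1 accuracy: {top1:.2f}")  # 0.50
print(f"top-2 accuracy: {top2:.2f}")  # 0.75
for cls in range(3):
    print(f"F1 for class {cls}: {f1_for_class(toy_predicted, toy_labels, cls):.2f}")
```

Top-2 accuracy counts a paper as correct when its true discipline is either of the two highest-ranked predictions, so it is always at least as high as top-1 accuracy. This helps explain why the top-2 figure is notably higher for a label set in which some disciplines overlap.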
Received: 02 September 2019