|
|
Research on Literature Classification Methods Based on Multiple Feature Correlation and a Graph Attention Network Model: A Case Study of Chinese Medical Literature
Chen Shuaipu1,2,3, Qian Yuxing1,2,3, Qian Zhiqiang1, Liu Zhenghao1,2,3, Zhang Zhijian1,2,3
1. School of Information Management, Wuhan University, Wuhan 430072
2. Institute of Big Data, Wuhan University, Wuhan 430072
3. Center for Studies of Information Resources, Wuhan University, Wuhan 430072
|
|
Abstract The increasingly complex and fine-grained themes of scientific literature pose a significant challenge for efficient classification. A potential solution is automatic literature classification technology, which enables intelligent management of information resources and efficient retrieval for scientific research. In response, this research presents the Hierarchical Text Classification Network based on Multiple Feature Correlation and Graph Attention Network (HTCN-MCGAT) to overcome the limitations of traditional methods. The HTCN-MCGAT model comprises three integral components: (1) the text representation and enhancement module redesigns the fine-tuning stage of the Bidirectional Encoder Representations from Transformers (BERT) pre-trained model to enhance the representation of the current document at two levels: internal character correlations within abstracts, titles, and keywords, and correlations with external documents; (2) the label association modeling module employs a Graph Attention Network to model the hierarchical structure of the label set and the semantic relationships among labels; and (3) the hierarchical interaction classification module incorporates a hierarchical fusion attention mechanism and a multi-task-learning-based hierarchical classification framework that combines global and local information to integrate high-level features for classification. The proposed model is applied to the Chinese medical literature domain and evaluated through a series of experiments. The results demonstrate the HTCN-MCGAT model’s superior performance over traditional literature classification methods, improving the F1-score by 4.34%-13.21%. This research offers an optimized approach to literature classification from the perspectives of text-semantic enrichment and hierarchical relationship modeling. The findings hold potential for application not only to literature classification tasks but also to broader hierarchical classification problems.
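
As a reading aid only, the minimal sketch below illustrates how the three components described in the abstract could fit together: a BERT-based text encoder, a single-head graph attention layer over the label hierarchy, and an attention-based fusion of text and label features for classification. It is not the authors' implementation; the module names, dimensions, fusion scheme, and the toy self-loop adjacency are illustrative assumptions. Only the use of the Hugging Face bert-base-chinese checkpoint reflects the paper's setup.

```python
# Hypothetical sketch of an HTCN-MCGAT-style pipeline (PyTorch + transformers).
# All design details beyond "BERT encoder + GAT over labels + fusion attention"
# are assumptions made for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer


class LabelGATLayer(nn.Module):
    """Single-head graph attention over the label hierarchy (illustrative)."""

    def __init__(self, dim: int):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)
        self.a = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, label_emb: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # label_emb: (L, d) label vectors; adj: (L, L) hierarchy adjacency (1 = connected)
        h = self.W(label_emb)                                        # (L, d)
        L = h.size(0)
        pairs = torch.cat(
            [h.unsqueeze(1).expand(L, L, -1), h.unsqueeze(0).expand(L, L, -1)], dim=-1
        )                                                            # (L, L, 2d)
        e = F.leaky_relu(self.a(pairs).squeeze(-1))                  # attention scores
        e = e.masked_fill(adj == 0, float("-inf"))                   # keep only hierarchy edges
        alpha = torch.softmax(e, dim=-1)                             # normalise over neighbours
        return F.elu(alpha @ h)                                      # (L, d) updated label vectors


class HTCNSketch(nn.Module):
    """Hypothetical text-label fusion classifier in the spirit of HTCN-MCGAT."""

    def __init__(self, num_labels: int, hidden: int = 768):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-chinese")
        self.label_emb = nn.Embedding(num_labels, hidden)
        self.label_gat = LabelGATLayer(hidden)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask, label_adj):
        # (1) Text representation: [CLS] vector of the concatenated title/abstract/keywords
        text_vec = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]                                    # (B, d)
        # (2) Label association modelling over the label hierarchy
        labels = self.label_gat(self.label_emb.weight, label_adj)   # (L, d)
        # (3) Fusion attention: weight label vectors by their relevance to the text
        scores = text_vec @ labels.T                                 # (B, L)
        fused = torch.softmax(scores, dim=-1) @ labels               # (B, d)
        return self.classifier(text_vec + fused)                     # (B, L) logits


if __name__ == "__main__":
    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    model = HTCNSketch(num_labels=5)
    batch = tokenizer(["示例医学文献摘要"], return_tensors="pt", padding=True)
    adj = torch.eye(5)  # toy hierarchy: self-loops only, stands in for real parent-child links
    logits = model(batch["input_ids"], batch["attention_mask"], adj)
    print(logits.shape)  # torch.Size([1, 5])
```

In a real setting the adjacency matrix would encode the parent-child links of the classification scheme, and the multi-task global/local classification heads described in component (3) would replace the single linear classifier used here.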
|
Received: 06 May 2023
|
|
|
|