Knowledge Entity Extraction Method Combining Semantic Enhancement and Knowledge Distillation for Academic Literature
Wang Yulong 1,2, Qin Chunxiu 1,2, Ma Xubu 1,2, Lyu Shuyue 1,2, Li Fan 1,2
1. School of Economics and Management, Xidian University, Xi'an 710126; 2. Shaanxi Information Resources Research Center, Xi'an 710126
|
|
Abstract Accurately identifying and extracting diverse knowledge entities from large volumes of academic literature is crucial for meeting the needs of researchers and advancing fine-grained knowledge discovery. To address data sparsity and the imbalanced distribution of domain-specific entities in academic literature, this study proposes an improved method that combines semantic enhancement and knowledge distillation. First, the method introduces a semantically enhanced teacher model. An embedding representation that integrates SciBERT, a pre-trained language model based on BERT (bidirectional encoder representations from transformers), with ELMo (embeddings from language models) effectively combines global semantics with dynamic word-level information, yielding more comprehensive semantic representations and strengthening the teacher model's ability to capture complex contextual information in domain-specific academic literature. Moreover, a domain-specific pre-trained word embedding model is used to select the top-n words or phrases most semantically related to the knowledge entities, and attention and gating mechanisms then dynamically weight this enhanced semantic information, effectively mitigating data sparsity and the difficulty of modeling long-tail entity categories. Next, a set of heterogeneous single-entity teacher models generates probability distributions over the aggregated dataset, and these distributions guide the training of a student model. Finally, the effectiveness of the proposed method is validated on three publicly available datasets from the field of materials science. Experimental results show that the proposed method achieves the highest micro-F1 and macro-F1 scores on all three datasets and exhibits strong robustness and generalization, particularly under entity data sparsity and imbalance.
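The abstract only outlines the architecture; as a rough illustration, the PyTorch sketch below shows one possible reading of two of its components: gated, attention-weighted fusion of SciBERT/ELMo token embeddings with top-n related-word vectors, and a soft-label distillation loss in which teacher probability distributions guide the student tagger. All class names, tensor dimensions, the temperature T, and the weighting coefficient alpha are illustrative assumptions, not the authors' released implementation.

# Minimal sketch (PyTorch), assuming SciBERT/ELMo embeddings and aggregated
# teacher probabilities are precomputed; names and dimensions are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSemanticFusion(nn.Module):
    """Fuse contextual embeddings with top-n related-word vectors via attention and a gate."""

    def __init__(self, bert_dim=768, elmo_dim=1024, aug_dim=200, hidden=256):
        super().__init__()
        self.proj_ctx = nn.Linear(bert_dim + elmo_dim, hidden)  # SciBERT (+) ELMo
        self.proj_aug = nn.Linear(aug_dim, hidden)               # domain word vectors
        self.gate = nn.Linear(2 * hidden, hidden)

    def forward(self, scibert_emb, elmo_emb, aug_embs):
        # scibert_emb: (B, T, 768); elmo_emb: (B, T, 1024); aug_embs: (B, T, n, 200)
        ctx = self.proj_ctx(torch.cat([scibert_emb, elmo_emb], dim=-1))  # (B, T, H)
        aug = self.proj_aug(aug_embs)                                    # (B, T, n, H)
        # Attention: each token attends over its n semantically related words/phrases.
        scores = torch.einsum("bth,btnh->btn", ctx, aug) / ctx.size(-1) ** 0.5
        attn = F.softmax(scores, dim=-1)
        aug_ctx = torch.einsum("btn,btnh->bth", attn, aug)               # (B, T, H)
        # Gate decides how much enhancement information each token absorbs.
        g = torch.sigmoid(self.gate(torch.cat([ctx, aug_ctx], dim=-1)))
        return g * ctx + (1.0 - g) * aug_ctx

def distillation_loss(student_logits, teacher_probs, gold_labels, T=2.0, alpha=0.5):
    """Soft-label KD: KL(teacher || student) at temperature T plus hard cross-entropy."""
    # student_logits: (N, C); teacher_probs: (N, C), aggregated over the
    # heterogeneous single-entity teachers; gold_labels: (N,)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        teacher_probs,
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, gold_labels, ignore_index=-100)
    return alpha * soft + (1.0 - alpha) * hard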
|
Received: 02 September 2024
|
|
|
|
|
|
|