Automatic Generative Information Science Term Extraction and Multidimensional Linked Knowledge Mining
Hu Haotian 1,2,3, Deng Sanhong 2,3, Kong Ling 4,5, Yan Xiaohui 2,3, Yang Wenxia 2,3, Wang Dongbo 3,5, Shen Si 3,6
1. Jiangsu Academy of Agricultural Sciences, Nanjing 210014; 2. School of Information Management, Nanjing University, Nanjing 210023; 3. Key Laboratory of Data Engineering and Knowledge Services in Provincial Universities (Nanjing University), Nanjing 210023; 4. School of Information Management, Shandong University of Technology, Zibo 255049; 5. College of Information Management, Nanjing Agricultural University, Nanjing 210095; 6. School of Economics & Management, Nanjing University of Science & Technology, Nanjing 210094
Abstract: Information science terms carry the foundational knowledge and core concepts of the discipline. Organizing and analyzing these terms from a conceptual perspective is important for advancing the discipline and supporting downstream knowledge mining tasks. Facing the rapidly growing volume of scientific literature, automatic term extraction has replaced manual screening, but existing methods rely heavily on large-scale annotated datasets and are difficult to transfer to low-resource scenarios. This paper proposes a generative term extraction method for information science (GTX-IS), which reformulates the traditional extractive task based on sequence labeling as a sequence-to-sequence generative task. By combining few-shot learning strategies with supervised fine-tuning, the method improves task-specific text generation and can extract information science terms with relatively high accuracy in low-resource labeled-data scenarios. Based on the extraction results, this paper further conducts term discovery and multidimensional knowledge mining in the field of information science. Full-text scientometric and informetric methods are applied to statistically analyze and mine the terms' occurrence frequency, life cycle, and co-occurrence information along the dimensions of the terms themselves, inter-term associations, and temporal information. Social network analysis, combined with temporal features, is used to enrich dynamic journal profiles from the perspective of terms and to explore research hotspots, evolutionary trajectories, and future development trends in information science. In term extraction experiments, the proposed method outperforms all 13 mainstream generative and extractive models, demonstrating strong few-shot learning ability and offering a new approach to domain-specific information extraction.
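The abstract describes recasting extractive sequence labeling as sequence-to-sequence generation, trained with few-shot demonstrations and supervised fine-tuning. The following is a minimal, hypothetical sketch of what such a reformulation can look like in practice; the prompt template, the example sentences, and the helper names (`build_prompt`, `build_target`, `parse_generation`) are illustrative assumptions, not the paper's actual GTX-IS implementation.

```python
# Illustrative sketch (not the paper's code): casting extractive term
# recognition as sequence-to-sequence generation. Instead of predicting a
# BIO tag per token, a generative model is asked to emit the list of terms,
# so a few-shot prompt plus supervised fine-tuning on (prompt, target)
# pairs can suffice in low-resource settings.

from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    sentence: str      # input sentence from the corpus
    terms: List[str]   # gold information science terms it contains


# A handful of labeled sentences serves as the few-shot demonstrations.
FEW_SHOT: List[Example] = [
    Example("Citation analysis is a core method of bibliometrics.",
            ["citation analysis", "bibliometrics"]),
    Example("Knowledge graphs support semantic retrieval in digital libraries.",
            ["knowledge graph", "semantic retrieval", "digital library"]),
]

INSTRUCTION = ("Extract all information science terms from the sentence. "
               "Answer with the terms separated by '; '.")


def build_prompt(sentence: str, shots: List[Example]) -> str:
    """Assemble an instruction-style prompt with few-shot demonstrations."""
    parts = [INSTRUCTION, ""]
    for ex in shots:
        parts.append(f"Sentence: {ex.sentence}")
        parts.append(f"Terms: {'; '.join(ex.terms)}")
        parts.append("")
    parts.append(f"Sentence: {sentence}")
    parts.append("Terms:")
    return "\n".join(parts)


def build_target(terms: List[str]) -> str:
    """The supervision signal the generative model is fine-tuned to emit."""
    return " " + "; ".join(terms)


def parse_generation(generated: str) -> List[str]:
    """Turn the model's free-text answer back into a clean term list."""
    return [t.strip() for t in generated.split(";") if t.strip()]


if __name__ == "__main__":
    train_ex = Example("Altmetrics complement traditional citation analysis.",
                       ["altmetrics", "citation analysis"])
    prompt = build_prompt(train_ex.sentence, FEW_SHOT)
    target = build_target(train_ex.terms)
    # (prompt + target) pairs like this would be fed to supervised fine-tuning
    # of a generative language model; at inference time only the prompt is
    # given and the completion is parsed back into a term list.
    print(prompt + target)
    print(parse_generation(target))
```

The key design point of such a reformulation is that the supervision target is free text rather than per-token tags, so the same pipeline transfers across domains by changing only the instruction and the few-shot examples.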
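The abstract also mentions statistics over term frequency, life cycle, and co-occurrence, with the co-occurrence relations feeding a social network analysis. Below is a minimal, hypothetical sketch of such tallies over a toy record set; the data layout and the first/last-year approximation of a term's life cycle are assumptions for illustration only, not the paper's analysis code.

```python
# Illustrative sketch (assumed data layout, not the paper's code): once terms
# have been extracted per article, frequency, life cycle, and co-occurrence
# statistics can be computed; the co-occurrence pairs form the weighted edge
# list of a co-term network for social network analysis.

from collections import Counter, defaultdict
from itertools import combinations
from typing import Dict, List, Tuple

# Each record: (publication year, terms extracted from one article).
records: List[Tuple[int, List[str]]] = [
    (2019, ["knowledge graph", "citation analysis"]),
    (2021, ["knowledge graph", "deep learning", "term extraction"]),
    (2023, ["term extraction", "large language model"]),
]

term_freq: Counter = Counter()                           # overall occurrence frequency
yearly_freq: Dict[int, Counter] = defaultdict(Counter)   # frequency per year
cooccur: Counter = Counter()                             # undirected co-occurrence edges

for year, terms in records:
    unique_terms = sorted(set(terms))
    term_freq.update(unique_terms)
    yearly_freq[year].update(unique_terms)
    # every unordered pair of terms appearing in the same article is an edge
    cooccur.update(combinations(unique_terms, 2))

# A term's "life cycle" can be approximated by its first and last year of use.
life_cycle = {
    term: (min(y for y, c in yearly_freq.items() if term in c),
           max(y for y, c in yearly_freq.items() if term in c))
    for term in term_freq
}

print(term_freq.most_common(3))
print(cooccur.most_common(3))
print(life_cycle["knowledge graph"])   # e.g. (2019, 2021)
```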
Hu Haotian, Deng Sanhong, Kong Ling, Yan Xiaohui, Yang Wenxia, Wang Dongbo, Shen Si. Automatic Generative Information Science Term Extraction and Multidimensional Linked Knowledge Mining. 情报学报, 2024, 43(5): 588-600.