Automatic Generative Information Science Term Extraction and Multidimensional Linked Knowledge Mining
Hu Haotian1,2,3, Deng Sanhong2,3, Kong Ling4,5, Yan Xiaohui2,3, Yang Wenxia2,3, Wang Dongbo3,5, Shen Si3,6 |
1. Jiangsu Academy of Agricultural Sciences, Nanjing 210014; 2. School of Information Management, Nanjing University, Nanjing 210023; 3. Key Laboratory of Data Engineering and Knowledge Services in Provincial Universities (Nanjing University), Nanjing 210023; 4. School of Information Management, Shandong University of Technology, Zibo 255049; 5. College of Information Management, Nanjing Agricultural University, Nanjing 210095; 6. School of Economics & Management, Nanjing University of Science & Technology, Nanjing 210094
|
|
Abstract Information science terminology conveys the basic knowledge and core concepts of the discipline. Organizing and analyzing these terms, starting from basic concepts, is therefore of great significance for promoting the development of the discipline and for supporting downstream knowledge mining tasks. With the rapid growth of scientific and technological literature, automatic term extraction has replaced manual screening, but existing methods rely heavily on large-scale labeled datasets and thus transfer poorly to low-resource scenarios. This study proposes Generative Term eXtraction for Information Science (GTX-IS), a method that reformulates the traditional sequence-labeling extraction task as a sequence-to-sequence generation task. Combining few-shot learning strategies with supervised fine-tuning, GTX-IS improves task-specific text generation and extracts information science terms more accurately in low-resource scenarios. Building on the extraction results, this study further conducts term discovery and multi-dimensional knowledge mining in the field of information science, applying full-text informetric and scientometric methods to statistically analyze term frequency, life cycle, and co-occurrence information along three dimensions: the terms themselves, the relationships between terms, and time. Using social network analysis together with the time dimension, the study builds dynamic profiles of journals, facilitating the exploration of research hotspots, evolutionary processes, and future development trends in information science. The proposed method outperforms all 13 generative and extractive baseline models, demonstrates strong few-shot learning ability, and offers a new approach to domain-specific information extraction.
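The abstract describes two technical components that a brief sketch can make concrete. First, the generative reformulation: instead of labeling each token as inside or outside a term (sequence labeling), a causal language model is given a few worked examples and asked to generate the term list directly. The following is a minimal sketch of that idea, assuming a Hugging Face causal LM; the backbone checkpoint, prompt wording, few-shot examples, and semicolon delimiter are illustrative assumptions, not the paper's exact GTX-IS configuration.

```python
# Minimal sketch of few-shot generative term extraction (assumed setup,
# not the paper's exact GTX-IS configuration).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # hypothetical backbone; any causal LM works

FEW_SHOT_PROMPT = (
    "Extract the information science terms from each sentence.\n"
    "Sentence: Citation analysis is a core method of bibliometrics.\n"
    "Terms: citation analysis; bibliometrics\n"
    "Sentence: Knowledge graphs support semantic retrieval in digital libraries.\n"
    "Terms: knowledge graph; semantic retrieval; digital library\n"
    "Sentence: {sentence}\n"
    "Terms:"
)

def extract_terms(sentence, model, tokenizer):
    """Generate a semicolon-separated term list for one input sentence."""
    prompt = FEW_SHOT_PROMPT.format(sentence=sentence)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    # Keep only the newly generated continuation, dropping the echoed prompt.
    completion = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    first_line = completion.strip().splitlines()[0] if completion.strip() else ""
    return [t.strip() for t in first_line.split(";") if t.strip()]

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    print(extract_terms(
        "Topic models reveal latent themes in scholarly corpora.",
        model, tokenizer))
```

Second, the co-occurrence dimension of the knowledge mining: terms appearing in the same document are linked, edge weights count how often each pair co-occurs, and network centrality flags hub terms. A hedged illustration with networkx, using invented sample data:

```python
# Toy co-occurrence network analysis; corpus and scoring choice are
# invented for demonstration, not taken from the paper.
from collections import Counter
from itertools import combinations
import networkx as nx

# Each inner list holds the terms extracted from one paper.
docs = [
    ["citation analysis", "bibliometrics", "h-index"],
    ["bibliometrics", "altmetrics"],
    ["citation analysis", "bibliometrics", "altmetrics"],
]

# Count how often each unordered term pair co-occurs within a document.
pair_counts = Counter()
for terms in docs:
    for a, b in combinations(sorted(set(terms)), 2):
        pair_counts[(a, b)] += 1

# Build a weighted co-occurrence network and rank terms by degree centrality.
G = nx.Graph()
for (a, b), weight in pair_counts.items():
    G.add_edge(a, b, weight=weight)

for term, score in sorted(nx.degree_centrality(G).items(), key=lambda x: -x[1]):
    print(f"{term}\t{score:.2f}")
```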
|
Received: 14 August 2023
|
|
|