An Exploration of the Novelty Measurement Task of Scientific Literature Driven by a Large Language Model
Zhang Lin1,2,3, Li Sijia1,2, Shi Shunshun1,2, Gou Zhenyu1,2, Huang Ying1,2,3
1. School of Information Management, Wuhan University, Wuhan 430072
2. Center for Science, Technology & Education Assessment (CSTEA), Wuhan University, Wuhan 430072
3. Centre for R&D Monitoring (ECOOM) and Department of MSI, KU Leuven, Leuven B-3000
Abstract To assess the usability of large language models (LLMs) for measuring the novelty of scientific literature, this paper proposes an LLM-driven novelty measurement method based on knowledge units of scientific literature such as research questions, methods, and conclusions. A prompt template is designed for the knowledge unit extraction task, and the open-source Qwen2-72B-Instruct model is fine-tuned with supervised fine-tuning (SFT) and direct preference optimization (DPO) to extract question, method, and conclusion knowledge units from the literature. Each knowledge unit is semantically embedded, and the embeddings are averaged to obtain the semantic embedding of a knowledge unit combination. The novelty of a “new” paper is then measured by comparing its semantic embedding vectors with those of a collection of older reference papers. The experimental results show that the fine-tuned model outperforms the baseline model in extracting knowledge units from scientific literature. Compared with existing methods for calculating paper novelty, the proposed knowledge-unit-based model captures finer-grained novelty differences at the semantic level of knowledge unit combinations. Overall, the LLM-driven method can better accomplish the novelty measurement task and enriches the set of novelty measurement methods available for scientific papers. The experiments are limited to a collection of abstracts of Chinese papers in computer science and technology, so usability in other fields remains to be examined. Moreover, human assistance is still needed to improve the interpretability and reliability of results when using large language models.
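The aggregation and comparison steps summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract states only that knowledge unit embeddings are averaged and that new papers are compared with older reference papers. The scoring rule used here (one minus the highest cosine similarity to any reference paper), the function names, and the toy vectors are all assumptions made for illustration.

```python
import math


def mean_pool(vectors):
    """Average equal-length vectors element-wise (the 'average aggregation' step)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def novelty_score(new_units, reference_papers):
    """ASSUMED rule: 1 - max cosine similarity to any reference paper.

    new_units: dict mapping a knowledge unit name (e.g. "question",
    "method", "conclusion") to its embedding vector.
    reference_papers: non-empty list of dicts with the same layout.
    """
    if not reference_papers:
        raise ValueError("at least one reference paper is required")
    new_vec = mean_pool(list(new_units.values()))
    similarities = [
        cosine(new_vec, mean_pool(list(ref.values())))
        for ref in reference_papers
    ]
    return 1.0 - max(similarities)


# Toy 2-dimensional embeddings (hypothetical values, not real model output).
new_paper = {"question": [1.0, 0.0], "method": [1.0, 0.0], "conclusion": [1.0, 0.0]}
old_papers = [
    {"question": [0.0, 1.0], "method": [0.0, 1.0], "conclusion": [0.0, 1.0]},
    {"question": [1.0, 1.0], "method": [1.0, 1.0], "conclusion": [1.0, 1.0]},
]
score = novelty_score(new_paper, old_papers)
print(round(score, 3))
```

In practice the vectors would come from a sentence embedding model applied to the extracted question, method, and conclusion texts. Taking the maximum similarity is one possible design choice; averaging similarities over the reference collection would be an equally plausible reading of the abstract.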
Received: 02 December 2024