大模型驱动的学术文本挖掘

doi:10.3772/j.issn.1000-0135.2024.08.006

情报学报

2024, Vol. 43

Issue (8): 946-959 DOI: 10.3772/j.issn.1000-0135.2024.08.006

情报技术与应用

本期目录 | 过刊浏览 | 高级检索

大模型驱动的学术文本挖掘

陆伟^1,2, 刘寅鹏^1,2, 石湘^1,2, 刘家伟^1,2, 程齐凯^1,2, 黄永^1,2, 汪磊^1,2

1.武汉大学信息管理学院，武汉 430072
2.武汉大学信息检索与知识挖掘研究所，武汉 430072

Large Language Model-Driven Academic Text Mining: Construction and Evaluation of Inference-End Prompting Strategy

Lu Wei^1,2, Liu Yinpeng^1,2, Shi Xiang^1,2, Liu Jiawei^1,2, Cheng Qikai^1,2, Huang Yong^1,2, Wang Lei^1,2

1.School of Information Management, Wuhan University, Wuhan 430072
2.Information Retrieval and Knowledge Mining Laboratory, Wuhan University, Wuhan 430072

摘要
图/表
参考文献
相关文章 (1)

全文: PDF (3212 KB) HTML (1 KB)
输出: BibTeX | EndNote (RIS)

摘要大型语言模型突出的任务理解和指令遵循能力，使用户可以通过简单的指令交互完成复杂的信息处理任务。科技文献分析领域正在积极探索大模型的应用，但尚未形成对指令工程技术和模型能力边界的系统性研究。本文以学术文本挖掘任务为切入点，从上下文学习、思维链推理等角度设计推理端指令策略，构建了涵盖文本分类、信息抽取、文本推理和文本生成4个能力维度共6项任务的大模型学术文本挖掘专业能力评测框架，并选取了7个国内外主流的指令调优模型进行实验，对比了不同指令策略的适用范围和不同参数模型的专业能力。实验结果表明，少样本、思维链等复杂指令策略在分类任务上的应用效果并不显著，而在抽取、生成等难度较高的任务上表现良好。千亿级参数规模的大模型经过指令引导，能够取得与充分训练的深度学习模型相近的效果，但对于十亿级或百亿级规模大模型，推理端的指令策略存在明显上限。为了实现大模型向科技情报领域的深层嵌入，现阶段仍需在调优端对模型参数进行领域化适配。

	服务

	把本文推荐给朋友
	加入我的书架
	加入引用管理器
	E-mail Alert
	RSS
	作者相关文章
	陆伟
	刘寅鹏
	石湘
	刘家伟
	程齐凯
	黄永
	汪磊

关键词 ：大模型, 学术文本挖掘, 指令工程, 能力评测

收稿日期: 2024-01-08

基金资助:国家自然科学基金重点项目“数智赋能的科技信息资源与知识管理理论变革”（72234005）；国家自然科学基金面上项目“基于机器阅读理解的科学命题文本论证逻辑识别”（72174157）。

作者简介: 陆伟，男，1974年生，博士，教授，博士生导师，研究方向为信息检索、数据智能、人机协同，E-mail： weilu@whu.edu.cn；刘寅鹏，男，1998年生，博士研究生，研究方向为文本挖掘、文档智能；石湘，男，1998年生，博士研究生，研究方向为文本挖掘、文档智能；刘家伟，男，1994年生，博士，研究方向为信息检索、信息安全；程齐凯，男，1989年生，博士，副教授，研究方向为文本挖掘、信息检索；黄永，男，1991年生，博士，副教授，研究方向为文本挖掘、科学计量；汪磊，男，2000年生，硕士研究生，研究方向为信息抽取、文本挖掘；

引用本文:

陆伟, 刘寅鹏, 石湘, 刘家伟, 程齐凯, 黄永, 汪磊. 大模型驱动的学术文本挖掘[J]. 情报学报, 2024, 43(8): 946-959.
Lu Wei, Liu Yinpeng, Shi Xiang, Liu Jiawei, Cheng Qikai, Huang Yong, Wang Lei. Large Language Model-Driven Academic Text Mining: Construction and Evaluation of Inference-End Prompting Strategy. 情报学报, 2024, 43(8): 946-959.

链接本文:

https://qbxb.istic.ac.cn/CN/10.3772/j.issn.1000-0135.2024.08.006 或 https://qbxb.istic.ac.cn/CN/Y2024/V43/I8/946

1 张智雄, 于改红, 刘熠, 等. ChatGPT对文献情报工作的影响[J]. 数据分析与知识发现, 2023, 7(3): 36-42.
2 Huang S Z, Qian J J, Huang Y, et al. Disclosing the relationship between citation structure and future impact of a publication[J]. Journal of the Association for Information Science and Technology, 2022, 73(7): 1025-1042.
3 王鑫, 程齐凯, 马永强, 等. 基于层次注意力网络的论证区间识别研究[J]. 情报工程, 2020, 6(3): 52-62.
4 程齐凯, 李信, 陆伟. 领域无关学术文献词汇功能标准化数据集构建及分析[J]. 情报科学, 2019, 37(7): 41-47.
5 Ma Y Q, Liu J W, Lu W, et al. From “what” to “how”: extracting the procedural scientific information toward the metric-optimization in AI[J]. Information Processing & Management, 2023, 60(3): 103315.
6 陆伟, 马永强, 刘家伟, 等. 数智赋能的科研创新——基于数智技术的创新辅助框架探析[J]. 情报学报, 2023, 42(9): 1009-1017.
7 陆伟, 汪磊, 程齐凯, 等. 数智赋能信息资源管理新路径: 指令工程的概念、内涵和发展[J]. 图书情报知识, 2024, 41(1): 6-11.
8 Ding N, Qin Y J, Yang G, et al. Parameter-efficient fine-tuning of large-scale pre-trained language models[J]. Nature Machine Intelligence, 2023, 5(3): 220-235.
9 Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates, 2017: 6000-6010.
10 Devlin J, Chang M W, Lee K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: Association for Computational Linguistics, 2019: 4171-4186.
11 陆伟, 李鹏程, 张国标, 等. 学术文本词汇功能识别——基于BERT向量化表示的关键词自动分类研究[J]. 情报学报, 2020, 39(12): 1320-1329.
12 Jiang Y, Meng R, Huang Y, et al. Generating keyphrases for readers: a controllable keyphrase generation framework[J]. Journal of the Association for Information Science and Technology, 2023, 74(7): 759-774.
13 Brown T, Mann B, Ryder N, et al. Language models are few-shot learners[J]. Advances in Neural Information Processing Systems, 2020, 33: 1877-1901.
14 Liu J W, Xiong Z, Jiang Y, et al. Low-resource multi-granularity academic function recognition based on multiple prompt knowledge[OL]. (2023-05-05) [2023-12-25]. https://arxiv.org/pdf/2305.03287.
15 陆伟, 刘家伟, 马永强, 等. ChatGPT为代表的大模型对信息资源管理的影响[J]. 图书情报知识, 2023, 40(2): 6-9, 70.
16 张恒, 赵毅, 章成志. 基于SciBERT与ChatGPT数据增强的研究流程段落识别[J]. 情报理论与实践, 2024, 47(1): 164-172, 153.
17 Wang Q Y, Downey D, Ji H, et al. Learning to generate novel scientific directions with contextualized literature-based discovery[OL]. (2023-10-12) [2023-12-25]. https://arxiv.org/pdf/2305.14259v3.
18 Zhao W X, Zhou K, Li J Y, et al. A survey of large language models[OL]. (2023-11-24) [2023-12-25]. https://arxiv.org/pdf/2303.18223.
19 Kojima T, Gu S S, Reid M, et al. Large language models are zero-shot reasoners[J]. Advances in Neural Information Processing Systems, 2022, 35: 22199-22213.
20 Wei J, Wang X Z, Schuurmans D, et al. Chain-of-thought prompting elicits reasoning in large language models[C]// Proceedings of the 36th Conference on Neural Information Processing Systems. Cambridge: MIT Press, 2022: 24824-24837.
21 Qin C W, Zhang A, Zhang Z S, et al. Is ChatGPT a general-purpose natural language processing task solver?[C]// Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2023: 1339-1384.
22 Nori H, Lee Y T, Zhang S, et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine[OL]. (2023-11-28) [2023-12-25]. https://arxiv.org/pdf/2311.16452.
23 张颖怡, 章成志, 周毅, 等. 基于ChatGPT的多视角学术论文实体识别: 性能测评与可用性研究[J]. 数据分析与知识发现, 2023, 7(9): 12-24.
24 Suzgun M, Scales N, Sch?rli N, et al. Challenging BIG-bench tasks and whether chain-of-thought can solve them[C]// Proceedings of the 61st Annual Conference of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2023: 13003-13051.
25 Dernoncourt F, Lee J Y. PubMed 200k RCT: a dataset for sequential sentence classification in medical abstracts[C]// Proceedings of the 8th International Joint Conference on Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2017: 308-313.
26 王佳敏, 陆伟, 刘家伟, 等. 多层次融合的学术文本结构功能识别研究[J]. 图书情报工作, 2019, 63(13): 95-104.
27 Bird S, Dale R, Dorr B J, et al. The ACL anthology reference corpus: a reference dataset for bibliographic research in computational linguistics[C]// Proceedings of the Sixth International Conference on Language Resources and Evaluation. Stroudsburg: Association for Computational Linguistics, 2008: 1755-1759.
28 Luan Y, He L H, Ostendorf M, et al. Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction[C]// Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2018: 3219-3232.
29 Sadat M, Caragea C. SciNLI: a corpus for natural language inference on scientific text[C]// Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2022: 7399-7409.
30 Taori R, Gulrajani I, Zhang T Y, et al. Alpaca: a strong, replicable instruction-following model[EB/OL]. [2023-07-18]. https://crfm.stanford.edu/2023/03/13/alpaca.html.
31 Ouyang L, Wu J, Jiang X, et al. Training language models to follow instructions with human feedback[C]// Proceedings of the 36th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates, 2022: 27730-27744.
32 Du Z X, Qian Y J, Liu X, et al. GLM: general language model pretraining with autoregressive blank infilling[C]// Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2022: 320-335.
33 Sun Y, Wang S H, Feng S K, et al. ERNIE 3.0: large-scale knowledge enhanced pre-training for language understanding and generation[OL]. (2021-07-05) [2023-07-18]. https://arxiv.org/pdf/2107.02137.
34 Cohan A, Ammar W, van Zuylen M, et al. Structural scaffolds for citation intent classification in scientific publications[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: Association for Computational Linguistics, 2019: 3586-3596.