Research on a Pseudo-Feedback Training Data Generation Method for Attributable Text Generation in Scientific Literature
Ma Yongqiang¹˒², Liu Jiawei¹˒², Gao Yingfan³
1. School of Information Management, Wuhan University, Wuhan 430072
2. Institute of Intelligence and Innovation Governance, Wuhan University, Wuhan 430072
3. Institute of Scientific and Technical Information of China, Beijing 100038
Ma Yongqiang, Liu Jiawei, Gao Yingfan. Research on a Pseudo-Feedback Training Data Generation Method for Attributable Text Generation in Scientific Literature[J]. 情报学报 (Journal of the China Society for Scientific and Technical Information), 2025, 44(7): 830-845.