Research on a Pseudo-Feedback Training Data Generation Method for Attributable Text Generation in Scientific Literature

Ma Yongqiang1,2, Liu Jiawei1,2, Gao Yingfan3

1. School of Information Management, Wuhan University, Wuhan 430072
2. Institute of Intelligence and Innovation Governance, Wuhan University, Wuhan 430072
3. Institute of Scientific and Technical Information of China, Beijing 100038
|
|
Abstract Inserting appropriate citation identifiers into academic text is a fundamental norm of academic writing: it enables readers to verify and trace the origins of information, serves as a critical tool for attribution, and substantially enhances content verifiability. Contemporary large language models (LLMs), however, generally lack built-in attribution mechanisms when generating scientific text, which limits the transparency and accountability of the produced content. Optimizing these models on human-annotated datasets is the standard remedy, but it faces significant challenges where attribution is concerned: training sets derived from human-authored academic texts suffer from limited internal consistency and considerable variation in citation practices across authors and disciplines, while data synthesis methods that rely solely on LLMs are constrained in data diversity. To address these issues, this paper introduces a citation identifier framework and an evaluation method for highly attributable scientific text, designed to analyze the attribution of LLM-generated scientific content. For training data construction, it proposes a two-stage pseudo-feedback data synthesis approach that balances the characteristics of LLM-annotated and human-annotated texts, yielding high-quality, diverse training data for attributable scientific text generation. Experimental results show that small models trained on the synthesized data significantly improve the attribution metrics of LLM-generated scientific texts, and that optimizing data distribution and task diversity through the second pseudo-feedback stage further improves model generalization.
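The attribution evaluation sketched in the abstract, comparing citation identifiers in generated text against the set of sources actually relevant to a passage, can be illustrated as follows. This is a minimal sketch under assumed conventions (bracketed numeric markers such as `[1]` or `[2,3]`, and a gold identifier set per passage); the function names and the precision/recall formulation are illustrative, not the paper's exact metric.

```python
import re

def extract_citations(text):
    """Collect numeric citation identifiers such as [1] or [2, 3] from a passage."""
    ids = set()
    for group in re.findall(r"\[([\d,\s]+)\]", text):
        for part in group.split(","):
            part = part.strip()
            if part.isdigit():
                ids.add(int(part))
    return ids

def citation_precision_recall(generated, gold_ids):
    """Score attribution: precision = cited sources that are relevant,
    recall = relevant sources that were cited."""
    pred = extract_citations(generated)
    if not pred:
        return 0.0, 0.0
    true_positives = len(pred & gold_ids)
    precision = true_positives / len(pred)
    recall = true_positives / len(gold_ids) if gold_ids else 0.0
    return precision, recall
```

For example, a generated sentence citing sources {1, 2, 3} when the gold set is {1, 2, 4} would score a precision and recall of 2/3 each under this formulation.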
|
Received: 24 November 2024
|
|
|
|
|
|
|