Research on a Pseudo-Feedback Training Data Generation Method for Attributable Text Generation in Scientific Literature

Ma Yongqiang1,2, Liu Jiawei1,2, Gao Yingfan3

1. School of Information Management, Wuhan University, Wuhan 430072
2. Institute of Intelligence and Innovation Governance, Wuhan University, Wuhan 430072
3. Institute of Scientific and Technical Information of China, Beijing 100038
|
|
Abstract Inserting appropriate citation identifiers into academic text is a fundamental norm of academic writing: it enables readers to verify and trace the origins of information, serves as a critical tool for attribution, and substantially enhances content verifiability. Contemporary large language models (LLMs), however, generally lack built-in attribution mechanisms when generating scientific text, which limits the transparency and accountability of the produced content. Optimizing these models on human-annotated datasets is the standard remedy, but it faces significant challenges where attribution is concerned: training sets derived from human-authored academic texts suffer from limited internal consistency and considerable variation in citation practices across authors and disciplines, while data synthesis methods that rely solely on LLMs are constrained in data diversity. To address these issues, this paper introduces a citation identifier framework and an evaluation method for highly attributable scientific text, designed to analyze the attribution of LLM-generated scientific content. For training data construction, it proposes a two-stage pseudo-feedback data synthesis approach that balances the characteristics of LLM-annotated and human-annotated texts, yielding high-quality, diverse training data for attributable scientific text generation. Experimental results show that small models trained on the synthesized data significantly improve the attribution metrics of LLM-generated scientific texts, and that optimizing data distribution and task diversity through the second pseudo-feedback stage further improves model generalization.
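The attribution evaluation sketched in the abstract, comparing citation identifiers in generated text against the set of sources actually relevant to a passage, can be illustrated as follows. This is a minimal sketch under assumed conventions (bracketed numeric markers such as `[1]` or `[2,3]`, and a gold identifier set per passage); the function names and the precision/recall formulation are illustrative, not the paper's exact metric.

```python
import re

def extract_citations(text):
    """Collect numeric citation identifiers such as [1] or [2, 3] from a passage."""
    ids = set()
    for group in re.findall(r"\[([\d,\s]+)\]", text):
        for part in group.split(","):
            part = part.strip()
            if part.isdigit():
                ids.add(int(part))
    return ids

def citation_precision_recall(generated, gold_ids):
    """Score attribution: precision = cited sources that are relevant,
    recall = relevant sources that were cited."""
    pred = extract_citations(generated)
    if not pred:
        return 0.0, 0.0
    true_positives = len(pred & gold_ids)
    precision = true_positives / len(pred)
    recall = true_positives / len(gold_ids) if gold_ids else 0.0
    return precision, recall
```

For example, a generated sentence citing sources {1, 2, 3} when the gold set is {1, 2, 4} would score a precision and recall of 2/3 each under this formulation.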
|
Received: 24 November 2024
|
|
|
|
|
|
|