Research on a Pseudo-Feedback Training Data Generation Method for Attributable Text Generation in Scientific Literature
Ma Yongqiang¹˒², Liu Jiawei¹˒², Gao Yingfan³
1. School of Information Management, Wuhan University, Wuhan 430072
2. Institute of Intelligence and Innovation Governance, Wuhan University, Wuhan 430072
3. Institute of Scientific and Technical Information of China, Beijing 100038
Ma Yongqiang, Liu Jiawei, Gao Yingfan. Research on a Pseudo-Feedback Training Data Generation Method for Attributable Text Generation in Scientific Literature[J]. 情报学报 (Journal of the China Society for Scientific and Technical Information), 2025, 44(7): 830-845.