Optimization of LLM Generation Strategies Based on Rich Semantic Tokens
Cheng Qikai¹,², Shi Xiang¹,², Yu Fengchang¹,², Huang Shengzhi¹,²
1. School of Information Management, Wuhan University, Wuhan 430072
2. Institute of Intelligence and Innovation Governance, Wuhan University, Wuhan 430072