Optimization of LLMs' Generation Strategies Based on Rich Semantic Tokens

Cheng Qikai¹,², Shi Xiang¹,², Yu Fengchang¹,², Huang Shengzhi¹,²

1. School of Information Management, Wuhan University, Wuhan 430072
2. Institute of Intelligence and Innovation Governance, Wuhan University, Wuhan 430072
Abstract: In recent years, general-purpose large language model (LLM) technologies have made significant progress, yet their application in information science still faces challenges such as low inference efficiency and insufficient task adaptability. To address these issues, this paper systematically analyzes the generation mechanism of LLMs and introduces the concept of “Rich Semantic Tokens”: tokens or token sequences that an LLM is predisposed to generate during decoding and that exhibit semantic aggregation, contextual dependence, or task relevance. Building on this concept, we propose a generation-preference-driven collaborative strategy between large and small models. By mining Rich Semantic Tokens and coupling a copying mechanism with a dynamic validation strategy, the small model and the large model cooperate to move from token-by-token generation to generating multiple tokens at once, improving both generation efficiency and task adaptability. We evaluate the proposed strategy along three dimensions: generation performance, generalizability, and generation efficiency. Experimental results show that it outperforms conventional generation optimization methods on domain-specific tasks covering law, medicine, and news/encyclopedia content. This study provides a new theoretical foundation and a practical pathway for optimizing LLM generation, improving task adaptability, and building trustworthy and reliable LLMs.
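The collaboration summarized above follows a draft-and-verify pattern: a small model proposes a multi-token span (for example, a mined Rich Semantic Token sequence produced by the copying mechanism), and the large model validates the span, keeping the longest prefix it agrees with. The sketch below illustrates this control flow only; the function names, the fixed draft length, the exact-match acceptance rule, and the toy stand-in models are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, List

Token = str
# A greedy next-token predictor: maps a token context to the next token.
NextToken = Callable[[List[Token]], Token]


def collaborative_generate(
    prompt: List[Token],
    small_model: NextToken,   # cheap drafter (hypothetical stand-in)
    large_model: NextToken,   # expensive verifier (hypothetical stand-in)
    draft_len: int = 4,       # tokens drafted per round (assumed value)
    max_new_tokens: int = 16,
) -> List[Token]:
    """Draft-and-verify loop: the small model proposes a span, the large
    model keeps the longest agreeing prefix and supplies its own token at
    the first disagreement, so every round adds at least one token."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        # 1) The small model drafts several tokens in a row.
        draft: List[Token] = []
        ctx = list(out)
        for _ in range(draft_len):
            tok = small_model(ctx)
            draft.append(tok)
            ctx.append(tok)

        # 2) The large model verifies the draft. (A real implementation
        #    scores the whole draft in one forward pass; it is called per
        #    token here only for readability.)
        accepted: List[Token] = []
        for tok in draft:
            expected = large_model(out + accepted)
            if expected == tok:
                accepted.append(tok)       # draft token confirmed
            else:
                accepted.append(expected)  # replace and end this round
                break
        out.extend(accepted)
    return out


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end: both "models" continue a
    # fixed phrase deterministically, so every drafted span is accepted.
    PHRASE = "the court holds that the defendant bears civil liability".split()

    def toy_model(ctx: List[Token]) -> Token:
        return PHRASE[(len(ctx) - 1) % len(PHRASE)]

    print(" ".join(collaborative_generate(["Verdict:"], toy_model, toy_model)))
```

Under this acceptance rule, every emitted token equals what the verifier would have produced by greedy decoding on its own, so the sketch trades no output quality for the speedup; the gain comes from verifying a whole drafted span per large-model pass rather than generating one token at a time.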
Received: 24 November 2024