Review of Large Language Model Evaluation Studies: Current Status, Applications, Challenges, and Trends
Zhao Xue 1,2, Zhang Hai 1,2, Wang Dongbo 1,2
1. College of Information Management, Nanjing Agricultural University, Nanjing 210095
2. Research Center for Humanities and Social Computing, Nanjing Agricultural University, Nanjing 210095
Abstract: The evaluation of large language models (LLMs) should be incorporated into the scientific evaluation system. Exploring the conceptual connotations of LLM evaluation and clarifying its research status, applications, limitations, and trends can help advance both LLM evaluation research and its application. This paper examines the concepts underlying LLM evaluation, comprehensively traces existing research progress on LLM evaluation, classifies existing studies using an inductive approach, and analyzes the current status, applications, limitations, and development trends of LLM evaluation research. The study finds that over one hundred evaluation benchmarks have been developed, covering LLM capabilities such as language understanding and generation, knowledge, ethics and safety, and multimodality. Existing research focuses on evaluating the general capabilities of LLMs and is steadily expanding into vertical domains, but it remains limited by the lack of an established evaluation system, insufficiently rich datasets, and a narrow range of evaluation methods. Establishing a scientific and unified evaluation system, conducting multimodal evaluation research, extending evaluation to vertical-domain applications, and integrating evaluation with user studies will be frontier topics in future LLM evaluation.
Zhao Xue, Zhang Hai, Wang Dongbo. Review of large language model evaluation studies: current status, applications, challenges, and trends[J]. 情报学报, 2025, 44(8): 1058-1074.