Review of Large Language Model Evaluation Studies: Current Status, Applications, Challenges, and Trends
Zhao Xue 1,2, Zhang Hai 1,2, Wang Dongbo 1,2
1. College of Information Management, Nanjing Agricultural University, Nanjing 210095
2. Research Center for Humanities and Social Computing, Nanjing Agricultural University, Nanjing 210095
Abstract: The evaluation of large language models (LLMs) should be incorporated into the scientific evaluation system. Exploring the conceptual connotations of LLM evaluation and clarifying its research status, applications, limitations, and trends can help advance both LLM evaluation research and its application. This paper examines the concepts underlying LLM evaluation, comprehensively traces existing research progress on LLM evaluation, classifies existing studies using an inductive approach, and analyzes the current status, applications, limitations, and development trends of LLM evaluation research. The study finds that over one hundred evaluation benchmarks have been developed, covering LLM capabilities such as language understanding and generation, knowledge, ethics and safety, and multimodality. Existing research focuses on evaluating the general capabilities of LLMs and is steadily expanding into vertical domains, but it remains limited by the lack of an established evaluation system, insufficiently rich datasets, and a narrow range of evaluation methods. Establishing a scientific and unified evaluation system, conducting multimodal evaluation research, extending evaluation to vertical-domain applications, and integrating evaluation with user studies will be frontier topics in future LLM evaluation.
Zhao Xue, Zhang Hai, Wang Dongbo. Review of large language model evaluation studies: current status, applications, challenges, and trends[J]. 情报学报, 2025, 44(8): 1058-1074.