Key Attributes and Indicators of GAI Training Data Quality from a Compliance Perspective

doi:10.3772/j.issn.1000-0135.2026.05.005

情报学报 (Journal of the China Society for Scientific and Technical Information)

2026, Vol. 45, Issue 5: 678-688

Information Theory and Methods

Kuang Miaomiao (邝苗苗)¹, An Xiaomi (安小米)¹,²,³, Lei Ming (雷鸣)¹, Liu Hongyan (刘红岩)⁴

1. School of Information Resource Management, Renmin University of China, Beijing 100872
2. Key Laboratory of Data Engineering and Knowledge Engineering, Ministry of Education, Beijing 100872
3. Smart City Research Centre, Renmin University of China, Beijing 100872
4. Department of Management Science and Engineering, School of Economics and Management, Tsinghua University, Beijing 100084

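Abstract: To build responsible artificial intelligence (AI), countries have issued laws, regulations, policies, and standards that set explicit requirements for the quality governance and supervision of training data for generative artificial intelligence (GAI). However, technical specifications and guidelines for meeting these requirements are still lacking, and few existing studies examine the key attributes and measurement indicators of GAI training data quality in depth from a compliance perspective. This paper therefore analyzes the compliance requirements for training data quality found in normative documents such as laws, policies, and standards, identifies four key attributes of GAI training data quality (accuracy, diversity, representativeness, and authenticity), and characterizes each along two dimensions, process and outcome. After three rounds of screening by four researchers, a preliminary framework of 25 indicators for measuring these key attributes was constructed. The indicators were then validated over three iterative rounds using mixed methods (field research, expert workshops, and a questionnaire survey). In the end, 20 indicators were validated, 5 were removed, and 6 were added, yielding a measurement indicator system for GAI training data quality comprising 26 indicators. The proposed key attributes and indicators give compliance checks of GAI training data quality a conformity basis that is consistent with existing laws, regulations, policies, and standards, lowering the threshold and cost of compliance determination. They also offer a feasible path toward compatible, mutually recognized certification of GAI training data quality and toward standardization of the governance system.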

Keywords: generative artificial intelligence, training data quality, compliance requirements, measurement indicators, responsible AI

Received: 2025-07-30

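Funding: Major Project of the National Social Science Fund of China, "Research on China's Government Data Governance and Utilization Capability" (20&ZD161).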

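About the authors: Kuang Miaomiao, born in 1997, is a doctoral candidate whose main research interests are data governance and standardization. An Xiaomi (corresponding author), born in 1965, Ph.D., is a professor whose main research interests are big data governance and standardized collaborative governance; E-mail: anxiaomi@ruc.edu.cn. Lei Ming, born in 1989, is a doctoral candidate whose main research interest is data storytelling. Liu Hongyan, born in 1968, Ph.D., is a professor whose main research interests include business intelligence, recommender systems, financial technology, and computer vision.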

Cite this article:

Kuang Miaomiao, An Xiaomi, Lei Ming, Liu Hongyan. Key Attributes and Indicators of GAI Training Data Quality from a Compliance Perspective[J]. 情报学报, 2026, 45(5): 678-688.

Link to this article:

https://qbxb.istic.ac.cn/CN/10.3772/j.issn.1000-0135.2026.05.005 or https://qbxb.istic.ac.cn/CN/Y2026/V45/I5/678
