Key Attributes and Indicators of GAI Training Data Quality from a Compliance Perspective
Kuang Miaomiao¹, An Xiaomi¹,²,³, Lei Ming¹, Liu Hongyan⁴
1. School of Information Resource Management, Renmin University of China, Beijing 100872
2. Key Laboratory of Data Engineering and Knowledge Engineering, Beijing 100872
3. Smart City Research Centre, Renmin University of China, Beijing 100872
4. Department of Management Science and Engineering, School of Economics and Management, Tsinghua University, Beijing 100084
Kuang Miaomiao, An Xiaomi, Lei Ming, Liu Hongyan. Key Attributes and Indicators of GAI Training Data Quality from a Compliance Perspective[J]. Journal of the China Society for Scientific and Technical Information (情报学报), 2026, 45(5): 678-688.