Review of Domestic and International Research on Big Data Quality
Liu Bing, Pang Lin
Management School, Tianjin Normal University, Tianjin 300387
Abstract Big data quality is a frontier field and a core topic within big data research, and it draws attention from many sectors of society. Drawing on the literature on big data quality, this paper uses a synthesis approach to review the progress of domestic and international research in four areas: basic connotations, quality management, quality evaluation, and application practice. The review shows that research on big data quality starts from the characteristics of big data, takes the basic attributes of big data quality as its core, and takes application goals and applicable scenarios into account, forming a complex, multidimensional theoretical system distinct from conventional data quality theory. It also indicates that future research is likely to focus on three directions: the essence of big data quality, the integration of the technical and human environments, and macro-level studies at the national and strategic levels.
Received: 31 July 2017