Information Extraction and Integration of Large-scale Heterogeneous Socio-economic Statistical Statements
Zhao Hong, Wang Fang
Department of Information Resources Management, Business School, Nankai University, Tianjin 300071

Abstract To better serve the government and the public, fully mining government statistics, a national strategic "gold mine", has become an inevitable requirement for building big data systems for smart e-government and new-type think tanks. However, statistical statements are semi-structured, large in scale, and heterogeneous, so their figures cannot be directly correlated or aggregated, which makes standardized management, deep mining, and broad use of statistical resources very difficult. To address the deficiencies of existing research, this study defines the processing tasks by analyzing the semantic elements of government statistical statements and the application objectives of information extraction and integration. The tasks are divided into five logical processes: (1) table semantic structure analysis, (2) header semantic relationship recognition, (3) numerical information extraction and representation, (4) redundancy conversion of indicator terminology, and (5) disambiguation and loading of inconsistent statistical data. The role and main tasks of each process are described. Finally, the study constructs the overall technical framework and processing flow. Processing and application results on large-scale real data sets show that the method can effectively solve the research problem and has practical value. It can also serve as a reference for other big data construction and application research based on semi-structured tables.
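
To make the five processes listed in the abstract concrete, the following is a minimal Python sketch of how they could fit together on a toy table. It is not the authors' implementation. Every name in it (`SYNONYMS`, `header_paths`, `extract_facts`, `load`, the source labels) is hypothetical, the table values are illustrative rather than real statistics, and the conflict rule (the higher-priority source wins) is an assumption, since the abstract does not state how inconsistent data are resolved.

```python
import re

# Hypothetical synonym map standing in for indicator-terminology conversion.
SYNONYMS = {"gross regional product": "GDP", "gdp": "GDP"}

def header_paths(header_rows):
    """Combine multi-level header rows into one (label, year) path per data column."""
    return list(zip(*header_rows))

def extract_facts(header_rows, body_rows, source):
    """Turn every numeric cell into a self-describing record."""
    facts = []
    for row in body_rows:
        region, cells = row[0], row[1:]
        for (label, year), cell in zip(header_paths(header_rows), cells):
            # Split "Name (unit)" into the indicator name and its unit.
            m = re.match(r"(.+?)\s*\((.+)\)$", label)
            name, unit = (m.group(1), m.group(2)) if m else (label, None)
            facts.append({
                "indicator": SYNONYMS.get(name.lower(), name),
                "unit": unit,
                "region": region,
                "year": int(year),
                "value": float(cell),
                "source": source,
            })
    return facts

def load(store, facts, priority):
    """Insert facts; when a key already exists, keep the higher-priority source (assumed rule)."""
    for fact in facts:
        key = (fact["indicator"], fact["region"], fact["year"])
        old = store.get(key)
        if old is None or priority[fact["source"]] > priority[old["source"]]:
            store[key] = fact
    return store

# Two toy statements that name the same indicator differently and disagree on 2007.
header_a = [["Gross Regional Product (100 million yuan)"] * 2, ["2006", "2007"]]
body_a = [["Region A", "100.0", "110.0"]]
header_b = [["GDP (100 million yuan)"], ["2007"]]
body_b = [["Region A", "112.0"]]

priority = {"yearbook_2008": 1, "yearbook_2016": 2}  # hypothetical source ranking
store = {}
load(store, extract_facts(header_a, body_a, "yearbook_2008"), priority)
load(store, extract_facts(header_b, body_b, "yearbook_2016"), priority)
```

In this sketch the two differently named indicators collapse to one key, and the 2007 figure from the higher-ranked source replaces the earlier one. A real system would need far richer header analysis and term matching than a single regular expression and a lookup table.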
Received: 08 October 2019