Recapturing the Flow of Knowledge: Tracing Structured Knowledge Back to Historical Records
Zhang Qi1,2, Kong Jia1,2, Hu Haotian1,2, Wang Dongbo3,4, Wang Hao1,2, Deng Sanhong1,2
1. School of Information Management, Nanjing University, Nanjing 210023
2. Key Laboratory of Data Engineering and Knowledge Services in Provincial Universities (Nanjing University), Nanjing 210023
3. College of Information Management, Nanjing Agricultural University, Nanjing 210095
4. Research Center for Humanities and Social Computing, Nanjing Agricultural University, Nanjing 210095
Abstract: Tracing structured historical knowledge back to historical records can enhance the verifiability and reliability of that knowledge. To address the inadequate knowledge tracing mechanisms in existing knowledge bases of ancient books and the absence of trigger words in several Archaic Chinese texts, this study introduces a method for tracing structured historical knowledge back to historical records. First, a structured historical knowledge tracing framework is proposed that leverages techniques such as co-reference resolution and textual entailment. Subsequently, a dataset is constructed to compare the effects of different pre-trained language models, including BERT, SikuBERT, GPT-3, and GPT-4, and of different input strategies on knowledge tracing. On this basis, the structured historical knowledge tracing model SHK-Tracer is built and used to trace the historical subject matter knowledge base (Shiji Multi-dimensional Knowledge Base, SMKB) back to different ancient historical books, with a precision of 80.19%. We found that the knowledge overlap between Shiji and each historical fragment in historical books such as Zuozhuan and Guoyu was not proportional to the inherent information content of the fragment. The results serve two purposes: first, they help scholars and readers verify the authenticity of knowledge by providing cross-references between historical sources and identifying the original source; second, they facilitate digital humanities research, including historical knowmetrics, relation extraction, and linguistic style calculations of ancient historical records.
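
As a rough illustration of the entailment-based tracing idea summarized above, the following minimal sketch verbalizes a knowledge triple into a hypothesis, asks a judge whether each candidate passage supports it, and computes precision over the traced passages. It is not the SHK-Tracer implementation: the names `Triple`, `verbalize`, `toy_judge`, `trace`, and `precision` are hypothetical, the passages are invented English placeholders, and `toy_judge` is a simple token-containment stand-in for a trained entailment model such as a fine-tuned BERT or SikuBERT classifier.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass(frozen=True)
class Triple:
    """A structured knowledge item (hypothetical schema for illustration)."""
    subject: str
    relation: str
    obj: str


def verbalize(triple: Triple) -> str:
    """Turn a triple into a plain-text hypothesis sentence."""
    return f"{triple.subject} {triple.relation} {triple.obj}."


def toy_judge(premise: str, hypothesis: str) -> bool:
    """Stand-in for an entailment model (assumption, not the paper's model).

    Returns True when every word of the hypothesis also occurs in the premise.
    """
    premise_tokens = set(re.findall(r"\w+", premise.lower()))
    hypothesis_tokens = set(re.findall(r"\w+", hypothesis.lower()))
    return hypothesis_tokens <= premise_tokens


def trace(
    triple: Triple,
    passages: Dict[str, str],
    judge: Callable[[str, str], bool],
) -> List[str]:
    """Return the ids of passages that the judge says support the triple."""
    hypothesis = verbalize(triple)
    return [pid for pid, text in passages.items() if judge(text, hypothesis)]


def precision(predicted: Set[str], gold: Set[str]) -> float:
    """Share of traced passages that are correct; 0.0 if nothing was traced."""
    if not predicted:
        return 0.0
    return len(predicted & gold) / len(predicted)


# Invented placeholder data, not taken from Shiji, Zuozhuan, or Guoyu.
PASSAGES = {
    "p1": "Person A is the father of Person B and ruled for ten years.",
    "p2": "Person B met Person C at the border.",
}
EXAMPLE = Triple("Person A", "is the father of", "Person B")

traced = trace(EXAMPLE, PASSAGES, toy_judge)
print(traced, precision(set(traced), {"p1"}))
```

Passing the judge in as a parameter keeps the tracing loop independent of the model, so the placeholder can be swapped for any classifier that maps a (premise, hypothesis) pair to a yes/no decision.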
Received: 31 March 2023