Recapturing the Flow of Knowledge: Tracing Structured Knowledge Back to Historical Records

doi:10.3772/j.issn.1000-0135.2024.04.003

Journal of the China Society for Scientific and Technical Information (情报学报)

2024, Vol. 43, Issue 4: 405-415

Section: Information Theory and Methods

Recapturing the Flow of Knowledge: Tracing Structured Knowledge Back to Historical Records

Zhang Qi^1,2, Kong Jia^1,2, Hu Haotian^1,2, Wang Dongbo^3,4, Wang Hao^1,2, Deng Sanhong^1,2

1.School of Information Management, Nanjing University, Nanjing 210023
2.Key Laboratory of Data Engineering and Knowledge Services in Provincial Universities (Nanjing University), Nanjing 210023
3.College of Information Management, Nanjing Agricultural University, Nanjing 210095
4.Research Center for Humanities and Social Computing, Nanjing Agricultural University, Nanjing 210095

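Abstract: Tracing structured historical knowledge back to the original text of historical records improves the verifiability and reliability of that knowledge. This study addresses two problems: knowledge bases of ancient texts lack a sound provenance mechanism, and some Classical Chinese passages contain no trigger words. It proposes a method for tracing structured historical knowledge back to the original historical text. First, a tracing framework for structured historical knowledge is proposed that combines techniques such as coreference resolution and textual entailment. Second, on the basis of a purpose-built dataset, experiments compare how different pre-trained models, namely BERT (bidirectional encoder representations from transformers), SikuBERT, GPT-3 (generative pre-trained transformer 3), and GPT-4, and different input strategies affect tracing performance. The resulting model, SHK-Tracer (structured historical knowledge tracing model), achieves a precision of 80.19%. Finally, SHK-Tracer is used to trace the Shiji Multi-dimensional Knowledge Base (SMKB) back to several different histories. The results show that the degree of knowledge overlap between the Shiji (Records of the Grand Historian) and each source passage in the Zuozhuan and the Guoyu is not proportional to the amount of information the passage itself contains. The findings can help readers verify the authenticity of knowledge, cross-reference different historical sources, and determine the origin of a piece of knowledge by drawing on information such as the date of each source. They also provide a base corpus for digital humanities research such as knowledge metrics for historical texts, relation extraction, and the computation of language style.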
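The idea of pairing a knowledge triple with candidate passages and asking whether a passage supports it can be sketched in a few lines. This is only an illustration under stated assumptions, not SHK-Tracer itself: the names `verbalize`, `overlap_score`, `trace`, and `precision`, the concatenation template, and the 0.8 threshold are all hypothetical, and the character-overlap scorer is a stand-in for a fine-tuned entailment model. The passages are toy strings, not quotations from any history. The `precision` helper uses the standard definition, true positives divided by all predicted positives.

```python
from typing import Callable, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def verbalize(triple: Triple) -> str:
    """Turn a triple into a hypothesis string (hypothetical template)."""
    head, relation, tail = triple
    return f"{head}{relation}{tail}"


def overlap_score(premise: str, hypothesis: str) -> float:
    """Placeholder scorer: share of hypothesis characters found in the premise.

    Assumption: a real system would call an entailment classifier here.
    This stand-in only keeps the sketch runnable.
    """
    if not hypothesis:
        return 0.0
    hits = sum(1 for ch in hypothesis if ch in premise)
    return hits / len(hypothesis)


def trace(
    triple: Triple,
    passages: Iterable[str],
    scorer: Callable[[str, str], float] = overlap_score,
    threshold: float = 0.8,  # hypothetical cut-off
) -> List[str]:
    """Return the passages whose score for the verbalized triple meets the threshold."""
    hypothesis = verbalize(triple)
    return [p for p in passages if scorer(p, hypothesis) >= threshold]


def precision(predicted: List[str], gold: Set[str]) -> float:
    """Precision = correct predictions / all predictions (0.0 if none were made)."""
    if not predicted:
        return 0.0
    true_positives = sum(1 for p in predicted if p in gold)
    return true_positives / len(predicted)


# Toy data, invented for illustration only.
toy_passages = ["甲者，生乙。", "丙伐丁。"]
matches = trace(("甲", "生", "乙"), toy_passages)
print(matches)
print(precision(matches, {"甲者，生乙。"}))
```

Because `scorer` is a parameter, the overlap heuristic can be swapped for a model-based entailment score without changing `trace`. That matters for passages without trigger words, where surface overlap would miss a relation that is only implied.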

Keywords: knowledge service, knowledge tracing, knowledge metrics, digital humanities, knowledge triple

Received: 2023-03-31

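Funding: Major Project of the National Social Science Fund of China, "Construction and Application of a Cross-lingual Knowledge Base of Ancient Chinese Classics" (21&ZD331).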

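About the authors: Zhang Qi, female, born in 1995, PhD candidate; research interests: digital humanities, knowledge graphs, and natural language processing. Kong Jia, male, born in 1996, PhD candidate; research interest: informetric analysis. Hu Haotian, male, born in 1997, PhD candidate; research interests: digital humanities and natural language processing. Wang Dongbo, male, born in 1981, professor and doctoral supervisor; research interests: natural language processing and text mining, and informetrics. Wang Hao, male, born in 1981, PhD, professor and doctoral supervisor; research interests: intelligent information processing and retrieval, and data mining techniques and their applications. Deng Sanhong, male, born in 1975, professor and doctoral supervisor; research interests: knowledge graphs, scientometrics, and academic evaluation; E-mail: sanhong@nju.edu.cn.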

Cite this article:

Zhang Qi, Kong Jia, Hu Haotian, Wang Dongbo, Wang Hao, Deng Sanhong. Recapturing the Flow of Knowledge: Tracing Structured Knowledge Back to Historical Records[J]. Journal of the China Society for Scientific and Technical Information, 2024, 43(4): 405-415.

Link to this article:

https://qbxb.istic.ac.cn/CN/10.3772/j.issn.1000-0135.2024.04.003 or https://qbxb.istic.ac.cn/CN/Y2024/V43/I4/405
