基于机器阅读理解的古文事件抽取研究

doi:10.3772/j.issn.1000-0135.2023.03.006

情报学报

2023, Vol. 42

Issue (3): 316-326 DOI: 10.3772/j.issn.1000-0135.2023.03.006

情报技术与应用

本期目录 | 过刊浏览 | 高级检索

基于机器阅读理解的古文事件抽取研究

喻雪寒^1,2, 何琳^1,2, 王献琪^1,2

1.南京农业大学信息管理学院，南京 210095
2.南京农业大学人文与社会计算研究中心，南京 210095

Research on Event Extraction from Ancient Books Based on Machine Reading Comprehension

Yu Xuehan^1,2, He Lin^1,2, Wang Xianqi^1,2

1.College of Information Management, Nanjing Agricultural University, Nanjing 210095
2.Research Center for Humanities and Social Computing, Nanjing Agricultural University, Nanjing 210095

摘要
图/表
参考文献
相关文章 (2)

全文: PDF (1966 KB) HTML (1 KB)
输出: BibTeX | EndNote (RIS)

摘要准确地梳理古文典籍脉络，抽取典籍中蕴含的事件和事件论元，对古籍从文本数据向智能化数据转化具有重要意义。针对古文事件的抽取研究主要有基于模式匹配、机器学习和神经网络三种方式，本文在现有的基于神经网络的方法中融入机器阅读理解模式，将事件抽取中出现的“事件类型”和“论元角色”糅合为问题形式，由此输出的答案即为事件论元。分别选取编年体史书《左传》和纪传体史书《史记》作为训练和泛化的数据，在具体的泛化过程中引入混淆句以验证模型效果，为古文事件抽取提供了可参照的思路。

	服务

	把本文推荐给朋友
	加入我的书架
	加入引用管理器
	E-mail Alert
	RSS
	作者相关文章
	喻雪寒
	何琳
	王献琪

关键词 ：古籍文本, 机器阅读理解, 事件抽取, RoBERTa, 混淆句

收稿日期: 2022-04-21

基金资助:国家社会科学基金一般项目“基于典籍的中华传统文化知识表达体系自动构建方法研究”（18BTQ063）。

作者简介: 喻雪寒，女，1992年生，博士研究生，主要研究领域为自然语言处理；何琳，女，1980年生，博士，教授，博士生导师，主要研究领域为信息检索与文本挖掘，E-mail：helin@njau.edu.cn；王献琪，女，1998年生，硕士研究生，主要研究领域为文本挖掘；

引用本文:

喻雪寒, 何琳, 王献琪. 基于机器阅读理解的古文事件抽取研究[J]. 情报学报, 2023, 42(3): 316-326.
Yu Xuehan, He Lin, Wang Xianqi. Research on Event Extraction from Ancient Books Based on Machine Reading Comprehension. 情报学报, 2023, 42(3): 316-326.

链接本文:

https://qbxb.istic.ac.cn/CN/10.3772/j.issn.1000-0135.2023.03.006 或 https://qbxb.istic.ac.cn/CN/Y2023/V42/I3/316

1 王大盈. 《中国基本古籍库》和《瀚堂典藏》两大古籍数据库比较研究[J]. 情报杂志, 2011, 30(S1): 157-158, 161.
2 季培培. 常见10种古籍全文数据库的比较研究[J]. 图书馆学研究, 2020(20): 71-80.
3 赵文友, 林世田. “中华古籍保护计划”成果——以“中华古籍资源库”建设为中心的古籍数字化工作[J]. 新世纪图书馆, 2018(3): 12-15.
4 王嘉灵. 以《汉书》为例的中古汉语自动分词[D]. 南京: 南京师范大学, 2014.
5 王菁薇, 肖莉, 骆嘉伟, 等. 基于《伤寒论》的命名实体识别研究[J]. 计算机与数字工程, 2021, 49(8): 1584-1587.
6 肖怀志, 李明杰. 基于本体的历史年代知识元在古籍数字化中的应用——以《三国志》历史年代知识元的抽取、存储和表示为例[J]. 图书情报知识, 2005(3): 28-33.
7 李娜. 面向方志类古籍的多类型命名实体联合自动识别模型构建[J]. 图书馆论坛, 2021, 41(12): 113-123.
8 林睿凡. 基于本体方法构建唐本《伤寒论》知识图谱[D]. 北京: 中国中医科学院, 2021.
9 刘静. 基于古籍库中典型描述语检索的人口大量死亡事件时空特征与原因分析[D]. 西安: 陕西师范大学, 2018.
10 程结晶, 王璞钰. 古籍中人物史料的关联组织研究——以《汉书·艺文志》中西汉经学家群体为例[J]. 图书馆论坛, 2023, 43(3): 64-74.
11 李娜. 社会网络分析视角下方志古籍知识组织研究——以《方志物产》山西分卷为例[D]. 南京: 南京农业大学, 2017.
12 Riloff E. Automatically constructing a dictionary for information extraction tasks[C]// Proceedings of the Eleventh National Conference on Artificial Intelligence. Palo Alto: AAAI Press, 1993: 811-816.
13 姜吉发. 一种事件信息抽取模式获取方法[J]. 计算机工程, 2005, 31(15): 96-98.
14 Chen Y B, Xu L H, Liu K, et al. Event extraction via dynamic multi-pooling convolutional neural networks[C]// Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2015: 167-176.
15 Cao Y W, Peng H, Wu J, et al. Knowledge-preserving incremental social event detection via heterogeneous GNNs[C]// Proceedings of the Web Conference 2021. New York: ACM Press, 2021: 3383-3395.
16 Zheng S C, Hao Y X, Lu D Y, et al. Joint entity and relation extraction based on a hybrid neural network[J]. Neurocomputing, 2017, 257: 59-66.
17 李旭晖, 程威, 唐小雅, 等. 基于多层卷积神经网络的金融事件联合抽取方法[J]. 图书情报工作, 2021, 65(24): 89-99.
18 Li X Y, Feng J R, Meng Y X, et al. A unified MRC framework for named entity recognition[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2020: 5849-5859.
19 Hermann K M, Ko?isky T, Grefenstette E, et al. Teaching machines to read and comprehend[C]// Proceedings of the 28th International Conference on Neural Information Processing Systems. Cambridge: The MIT Press, 2015: 1693-1701.
20 Hill F, Bordes A, Chopra S, et al. The goldilocks principle: reading children’s books with explicit memory representations[C]// Proceedings of the Forth International Conference on Learning Representations, San Juan, Puerto Rico, 2016.
21 Cui Y M, Liu T, Chen Z P, et al. Consensus attention-based neural networks for Chinese reading comprehension[C]// Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. The COLING 2016 Organizing Committee, 2016: 1777-1786.
22 Richardson M, Burges C J C, Renshaw E. MCTest: a challenge dataset for the open-domain machine comprehension of text[C]// Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2013: 193-203.
23 Lai G K, Xie Q Z, Liu H X, et al. RACE: large-scale reading comprehension dataset from examinations[C]// Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2017: 785-794.
24 Rajpurkar P, Zhang J, Lopyrev K, et al. SQuAD: 100,000+ questions for machine comprehension of text[C]// Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2016: 2383-2392.
25 Rajpurkar P, Jia R, Liang P. Know what you don’t know: unanswerable questions for SQuAD[C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2018: 784-789.
26 Reddy S, Chen D Q, Manning C D. CoQA: a conversational question answering challenge[J]. Transactions of the Association for Computational Linguistics, 2019, 7: 249-266.
27 Bajaj P, Campos D, Craswell N, et al. MS MARCO: a human generated machine reading comprehension dataset[OL]. (2018-10-31). https://arxiv.org/pdf/1611.09268.pdf.
28 He W, Liu K, Liu J, et al. DuReader: a Chinese machine reading comprehension dataset from real-world applications[C]// Proceedings of the Workshop on Machine Reading for Question Answering. Stroudsburg: Association for Computational Linguistics, 2018: 37-46.
29 马建忠. 马氏文通[M]. 北京: 商务印书馆, 1983: 71.
30 Cui Y M, Che W X, Liu T, et al. Pre-training with whole word masking for Chinese BERT[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3504-3514.
31 Devlin J, Chang M W, Lee K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: Association for Computational Linguistics, 2019: 4171-4186.
32 Zhang Z S, Yang J J, Zhao H. Retrospective reader for machine reading comprehension[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(16): 14506-14514.
33 左传[M]. 郭丹, 译. 北京: 中华书局, 2014.
34 刘勋. 春秋左传精读[M]. 北京: 新世界出版社, 2014.
35 刘坤鹏. 杜预《春秋释例》“诸例”研究[D]. 开封: 河南大学, 2018.
36 Liu Y H, Ott M, Goyal N, et al. RoBERTa: a robustly optimized BERT pretraining approach[OL]. (2019-07-26). https://arxiv.org/pdf/1907.11692.pdf.
37 Iter D, Guu K, Lansing L, et al. Pretraining with contrastive sentence objectives improves discourse performance of language models[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2020: 4859-4870.
38 蔡镜浩. 精心剪裁字字斟酌——《史记》、《左传》对比评议[J]. 当代修辞学, 1985(4): 55-57, 60.
39 牙彩练. 论《左传》《史记》叙史风格之差异——以对“吴越争霸”史实的叙写为例[M]// 中华优秀传统文化研究. 北京：中国社会科学出版社, 2019: 81-96.
40 袁喜竹. 《左传》与《史记》史实相同部分的比较研究[D]. 长沙: 湖南师范大学, 2011: 7-39.