<bold>HanNER</bold>：一个面向汉语古籍语料命名实体自动抽取的通用框架

doi:10.3772/j.issn.1000-0135.2023.02.007

情报学报

2023, Vol. 42

Issue (2): 203-216 DOI: 10.3772/j.issn.1000-0135.2023.02.007

情报技术与应用

本期目录 | 过刊浏览 | 高级检索

HanNER：一个面向汉语古籍语料命名实体自动抽取的通用框架

严承希^1,2, 唐雪梅^3,4, 杨浩⁴, 苏祺^4,5, 王军^3,4

1.中国人民大学信息资源管理学院，北京 100872
2.中国人民大学数字人文研究中心，北京 100872
3.北京大学信息管理系，北京 100871
4.北京大学数字人文研究中心，北京 100871
5.北京大学外国语学院，北京 100871

HanNER: A General Framework for the Automatic Extraction of Named Entities in Ancient Chinese Corpora

Yan Chengxi^1,2, Tang Xuemei^3,4, Yang Hao⁴, Su Qi^4,5, Wang Jun^3,4

1.School of Information Resource Management, Renmin University of China, Beijing 100872
2.Research Center for Digital Humanities, Renmin University of China, Beijing 100872
3.Department of Information Management, Peking University, Beijing 100871
4.Research Center for Digital Humanities, Peking University, Beijing 100871
5.School of Foreign Languages, Peking University, Beijing 100871

摘要
图/表
参考文献
相关文章 (0)

全文: PDF (3146 KB) HTML (1 KB)
输出: BibTeX | EndNote (RIS)

摘要古籍数字化整理是推动我国汉语古籍数据库建设及相关资源整合和利用的基础性工作。作为关键的技术环节之一，面向古籍命名实体的自动化抽取备受国内外学界和业界的关注。但是一些制约汉语古籍实体抽取方法的“卡脖子”问题仍未得到有效解决，包括少样本学习问题、标注成本管理问题和数据质量控制问题。本研究提出了一个面向古籍资源命名实体自动化抽取的通用框架——HanNER，包括“基于规则的实体预标注”“基于深度主动学习的迭代实体抽取”以及“人机交互模式下的标注决策”三个主要部分。多组实验比较证明了HanNER的可行性和优势，包括基于深度主动学习模型CNN-BiLSTM-CRF+margin的优势、多功能标注模块“标注查询”与“自动推荐”的积极作用以及ZenCrowd-II算法的优势。最后，本研究基于优化后的BERT-CNN-BiLSTM-CRF模型开发了在线的汉语古籍的实体自动抽取系统。HanNER的提出有利于推进汉语古籍实体抽取工作及相关任务在方法与技术上的发展，而且从工程化角度为古籍实体抽取产品的落地提供了借鉴和启发。

	服务

	把本文推荐给朋友
	加入我的书架
	加入引用管理器
	E-mail Alert
	RSS
	作者相关文章
	严承希
	唐雪梅
	杨浩
	苏祺
	王军

关键词 ：汉语古籍, 实体抽取, 深度主动学习, 交互式标注, 标签自动汇聚

收稿日期: 2021-12-14

基金资助:国家自然科学基金项目“中国儒家学术史知识图谱构建研究”（72010107003）；中国博士后科学基金第70批面上资助项目“融合深度学习和知识图谱技术的清史语料数字化整理研究”（2021M703564）。

作者简介: 严承希，男，1988年生，博士，讲师，主要研究领域为数字人文和自然语言处理；唐雪梅，女，1995年生，博士研究生，主要研究领域为数字人文和自然语言处理；杨浩，男，1981年生，博士，副教授，主要研究领域为中国哲学与数字人文；苏祺，女，1979年生，博士，副教授，博士生导师，主要研究领域为计算语言与数字人文；王军，男，1968年生，博士，教授，博士生导师，主要研究领域为数字分析、知识组织与数字人文，E-mail：junwang@pku.edu.cn；

引用本文:

严承希, 唐雪梅, 杨浩, 苏祺, 王军. HanNER：一个面向汉语古籍语料命名实体自动抽取的通用框架[J]. 情报学报, 2023, 42(2): 203-216.
Yan Chengxi, Tang Xuemei, Yang Hao, Su Qi, Wang Jun. HanNER: A General Framework for the Automatic Extraction of Named Entities in Ancient Chinese Corpora. 情报学报, 2023, 42(2): 203-216.

链接本文:

https://qbxb.istic.ac.cn/CN/10.3772/j.issn.1000-0135.2023.02.007 或 https://qbxb.istic.ac.cn/CN/Y2023/V42/I2/203

1 Mikheev A, Moens M, Grover C. Named entity recognition without gazetteers[C]// Proceedings of the Ninth Conference on European Chapter of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 1999: 1-8.
2 夏翠娟, 贺晨芝, 刘倩倩, 等. 数字人文环境下历史文献资源共建共享模式新探[J]. 图书与情报, 2021(1): 53-61.
3 Ji Z J, Shen Y X, Sun Y N, et al. C-CLUE: a benchmark of classical Chinese based on a crowdsourcing system for knowledge graph construction[C]// Proceedings of the China Conference on Knowledge Graph and Semantic Computing. Singapore: Springer, 2021: 295-301.
4 Hou-Ieong H, Hilde De Weerdt. MARKUS - text analysis and reading platform[EB/OL]. [2021-09-21]. http://dh.chinese-empires.eu/beta/.
5 彭维谦, 程卉, 陈诗沛. 从全文到表格: 地方志职官志中职官资料之半自动撷取[J]. 数位典藏与数位人文, 2018, 1: 79-125.
6 朱锁玲, 包平. 方志类古籍地名识别及系统构建[J]. 中国图书馆学报, 2011, 37(3): 118-124.
7 朱锁玲, 包平. 方志类古籍地名识别及分析研究——以《方志物产》(广东分卷)为例[J]. 图书馆论坛, 2012, 32(4): 171-176.
8 Liu C L, Huang C K, Wang H S, et al. Mining local gazetteers of literary Chinese with CRF and pattern based methods for biographical information in Chinese history[C]// Proceedings of the 2015 IEEE International Conference on Big Data. Washington D.C.: IEEE Computer Society, 2015: 1629-1638.
9 Long Y F, Xiong D, Lu Q, et al. Named entity recognition for Chinese novels in the Ming-Qing dynasties[C]// Proceedings of the 17th Workshop on Chinese Lexical Semantics. Cham: Springer, 2016: 362-375.
10 皇甫晶, 王凌云. 基于规则的纪传体古代汉语文献姓名识别[J]. 图书情报工作, 2013, 57(3): 120-124.
11 黄水清, 王东波, 何琳. 基于先秦语料库的古汉语地名自动识别模型构建研究[J]. 图书情报工作, 2015, 59(12): 135-140.
12 李娜. 基于条件随机场的方志古籍别名自动抽取模型构建[J]. 中文信息学报, 2018, 32(11): 41-48, 61.
13 Wu X T, Zhao H Y, Che C. Term translation extraction from historical classics using modern Chinese explanation[C]// Proceedings of the 17th China National Conference on Chinese Computational Linguistics and 6th International Symposiumon Natural Language Processing Based on Naturally Annotated Big Data. Cham: Springer, 2018: 88-98.
14 崔竞烽, 郑德俊, 王东波, 等. 基于深度学习模型的菊花古典诗词命名实体识别[J]. 情报理论与实践, 2020, 43(11): 150-155.
15 徐晨飞, 叶海影, 包平. 基于深度学习的方志物产资料实体自动识别模型构建研究[J]. 数据分析与知识发现, 2020, 4(8): 86-97.
16 崔丹丹, 刘秀磊, 陈若愚, 等. 基于Lattice LSTM的古汉语命名实体识别[J]. 计算机科学, 2020, 47(S2): 18-22.
17 Kim S, Song Y, Kim K, et al. MMR-based active machine learning for bio named entity recognition[C]// Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL. Stroudsburg: Association for Computational Linguistics, 2006: 69-72.
18 Settles B, Craven M. An analysis of active learning strategies for sequence labeling tasks[C]// Proceedings of the Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2008: 1070-1079.
19 Hachey B, Alex B, Becker M. Investigating the effects of selective sampling on the annotation task[C]// Proceedings of the Ninth Conference on Computational Natural Language Learning. Stroudsburg: Association for Computational Linguistics, 2005: 144-151.
20 Yao L, Sun C J, Wang X L, et al. Combining self learning and active learning for Chinese named entity recognition[J]. Journal of Software, 2010, 5(5): 530-537.
21 Shen D, Zhang J, Su J, et al. Multi-criteria-based active learning for named entity recognition[C]// Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2004: 589-596.
22 Wei Q, Chen Y K, Salimi M, et al. Cost-aware active learning for named entity recognition in clinical text[J]. Journal of the American Medical Informatics Association, 2019, 26(11): 1314-1322.
23 Erdmann A, Wrisley D J, Allen B, et al. Practical, efficient, and customizable active learning for named entity recognition in the digital humanities[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: Association for Computational Linguistics, 2019: 2223-2234.
24 Siddhant A, Lipton Z C. Deep Bayesian active learning for natural language processing: results of a large-scale empirical study[C]// Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2018: 2904-2909.
25 Shen Y Y, Yun H, Lipton Z, et al. Deep active learning for named entity recognition[C]// Proceedings of the 2nd Workshop on Representation Learning for NLP. Stroudsburg: Association for Computational Linguistics, 2017: 252-256.
26 Cunningham H, Maynard D, Bontcheva K, et al. GATE: an architecture for development of robust HLT applications[C]// Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2002: 168-175.
27 Stenetorp P, Pyysalo S, Topi? G, et al. BRAT: a web-based tool for NLP-assisted text annotation[C]// Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2012: 102-107.
28 Yimam S M, Gurevych I, de Castilho R E, et al. WebAnno: a flexible, web-based and visually supported system for distributed annotations[C]// Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2013: 1-6.
29 Yang J, Zhang Y, Li L W, et al. YEDDA: a lightweight collaborative text span annotation tool[C]// Proceedings the 56th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Stroudsburg: Association for Computational Linguistics, 2018: 31-36.
30 Lin B Y, Lee D H, Xu F F, et al. AlpacaTag: an active learning-based crowd annotation framework for sequence tagging[C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Stroudsburg: Association for Computational Linguistics, 2019: 58-63.
31 Haas D, Greenstein M, Kamalov K, et al. Reducing error in context-sensitive crowdsourced tasks[J]. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, 2013, 1(1): 28-29.
32 Karger D R, Oh S, Shah D. Iterative learning for reliable crowdsourcing systems[C]// Proceedings of the 24th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates, 2011: 1953-1961.
33 Lee K, Caverlee J, Webb S. The social honeypot project: protecting online communities from spammers[C]// Proceedings of the 19th international conference on World Wide Web. New York: ACM Press, 2010: 1139-1140.
34 Hosseini M, Cox I J, Mili?-Frayling N, et al. On aggregating labels from multiple crowd workers to infer relevance of documents[C]// Proceedings of the 2012 European Conference on Information Retrieval. Heidelberg: Springer, 2012: 182-194.
35 Baba Y, Kashima H, Kinoshita K, et al. Leveraging crowdsourcing to detect improper tasks in crowdsourcing marketplaces[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 27(2): 1487-1492.
36 Sinha V B, Rao S, Balasubramanian V N. Fast Dawid-Skene: a fast vote aggregation scheme for sentiment classification[C]// Proceedings of the Workshop on Issues of Sentiment Discovery and Opinion Mining at the ACM SIGKDD Conference on Knowledge Discovery and Data Mining. New York: ACM Press, 2018: 1-8.
37 Demartini G, Difallah D E, Cudré-Mauroux P. ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking[C]// Proceedings of the 21st international conference on World Wide Web. New York: ACM Press, 2012: 469-478.
38 Yan C X, Su Q, Wang J. MoGCN: mixture of gated convolutional neural network for named entity recognition of Chinese historical texts[J]. IEEE Access, 2020, 8: 181629-181639.
39 Fritzler A, Logacheva V, Kretov M. Few-shot classification in named entity recognition task[C]//Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing. New York: ACM Press, 2019: 993-1000.
40 Li J, Sun A X, Han J L, et al. A survey on deep learning for named entity recognition[J]. IEEE Transactions on Knowledge and Data Engineering, 2022, 34(1): 50-70.
41 Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality[C]// Proceedings of the 26th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates, 2013: 3111-3119.
42 Pennington J, Socher R, Manning C D. GloVe: global vectors for word representation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2014: 1532-1543.
43 Bebis G, Georgiopoulos M. Feed-forward neural networks[J]. IEEE Potentials, 1994, 13(4): 27-31.
44 Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
45 LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
46 Lafferty J D, McCallum A, Pereira F C N. Conditional random fields: probabilistic models for segmenting and labeling sequence data[C]// Proceedings of 18th International Conference on Machine Learning. San Franciso: Morgan Kaufmann Publishers, 2001: 282-289.
47 Hutchinson M F. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines[J]. Communications in Statistics - Simulation and Computation, 1989, 18(3): 1059-1076.
48 Fleiss J L, Cohen J, Everitt B S. Large sample standard errors of kappa and weighted kappa[J]. Psychological Bulletin, 1969, 72(5): 323-327.
49 Kingma D P, Ba J. Adam: a method for stochastic optimization[OL]. (2017-01-30) [2020-09-21]. https://arxiv.org/pdf/1412.6980.pdf.
50 Goossens N A M C, Camp G, Verkoeijen P P J L, et al. The effect of retrieval practice in primary school vocabulary learning[J]. Applied Cognitive Psychology, 2014, 28(1): 135-142.
51 Devlin J, Chang M W, Lee K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: Association for Computational Linguistics, 2019: 4171-4186.