HanNER: A General Framework for the Automatic Extraction of Named Entities in Ancient Chinese Corpora

Yan Chengxi1,2, Tang Xuemei3,4, Yang Hao4, Su Qi4,5, Wang Jun3,4

1. School of Information Resource Management, Renmin University of China, Beijing 100872
2. Research Center for Digital Humanities, Renmin University of China, Beijing 100872
3. Department of Information Management, Peking University, Beijing 100871
4. Research Center for Digital Humanities, Peking University, Beijing 100871
5. School of Foreign Languages, Peking University, Beijing 100871
Abstract The digitization of ancient Chinese texts is fundamental to building databases of ancient Chinese books and exploiting the related resources. As a critical technical component, the automatic extraction of named entities from ancient books has gained considerable attention from academia and industry worldwide. However, several bottleneck problems restricting methodological progress in such extraction remain unsolved, chiefly few-shot learning, annotation cost management, and data quality control. This study presents a general framework, "HanNER", for the automatic extraction of named entities from ancient book resources. The approach is a systematic solution comprising three steps: rule-based automatic entity annotation, iterative entity extraction based on deep active learning, and human-computer-interaction-based annotation decision. Experimental comparisons across multiple groups demonstrate the feasibility and advantages of HanNER, including the advantages of the deep active learning algorithm "CNN-BiLSTM-CRF+margin", the positive functional effects of the proposed modules (entity query and entity recommendation), and the efficiency of the proposed "ZenCrowd-II". Finally, an automatic entity extraction system for ancient Chinese texts is developed based on an optimized "BERT-CNN-BiLSTM-CRF" model. The proposed HanNER method can not only further advance the techniques and methodology of automatic entity extraction and related tasks on ancient Chinese texts but also serve as a useful reference for product implementation from an engineering perspective.
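The "margin" in "CNN-BiLSTM-CRF+margin" refers to margin-based uncertainty sampling, the standard active learning query strategy in which the samples whose top two predicted labels are closest in probability are sent to annotators first. A minimal sketch of that selection rule, assuming a generic probability matrix rather than the paper's actual CNN-BiLSTM-CRF outputs (the function name and data are illustrative):

```python
import numpy as np

def margin_select(prob_matrix, k):
    """Pick the k most uncertain samples by margin sampling.

    prob_matrix: (n_samples, n_classes) array of model-predicted
    class probabilities. The margin is the gap between the two
    highest probabilities; a small margin means the model is unsure,
    so those samples are the most informative to annotate next.
    """
    sorted_probs = np.sort(prob_matrix, axis=1)       # ascending per row
    margins = sorted_probs[:, -1] - sorted_probs[:, -2]
    return np.argsort(margins)[:k]                    # smallest margins first

probs = np.array([
    [0.90, 0.05, 0.05],   # confident prediction
    [0.40, 0.35, 0.25],   # very uncertain (margin 0.05)
    [0.55, 0.30, 0.15],   # moderately uncertain
])
print(margin_select(probs, 1))  # -> [1]
```

In the iterative loop the framework describes, the selected samples would be labeled (with human verification), added to the training set, and the model retrained before the next query round.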
Received: 14 December 2021