Named Entity Recognition of Ancient Books Based on MU Sequence Labeling

doi:10.3772/j.issn.1000-0135.2025.06.007

情报学报 (Journal of the China Society for Scientific and Technical Information)

2025, Vol. 44, Issue 6: 736-747

Section: Information Technology and Applications

Named Entity Recognition of Ancient Books Based on MU Sequence Labeling

Xu Qiankun¹, Wang Dongbo¹,², Liu Yutong¹, Huang Shuiqing¹,²

1. College of Information Management, Nanjing Agricultural University, Nanjing 210095
2. Research Center for Humanities and Social Computing, Nanjing Agricultural University, Nanjing 210095

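Abstract: Named entity recognition (NER) is a foundational step for many downstream tasks in natural language processing. Ancient books, as carriers of Chinese civilization, hold a rich cultural heritage and are an important source of historical insight. Improving the accuracy of entity recognition in ancient texts helps structure those texts, systematize the knowledge they contain, and support the intelligent use and development of ancient-book resources. First, we take the Twenty-Four Histories corpus, finely annotated by our research group, as the original dataset and fine-tune the GujiBERT_FAN pre-trained model with three methods (Sequence Labeling, Sequence Labeling_CRF, and Span-level Prediction) in order to capture entity boundaries and types more accurately when recognizing and predicting entities in ancient texts. Second, we introduce two combiners, the Majority Voting Combiner (MVC) and the Union Combiner (UC), which merge the predictions of the NER models over the already recognized entities to generate new datasets. Finally, the model learns to judge whether the entity predictions made by Sequence Labeling, Sequence Labeling_CRF, and Span-level Prediction are wrong, and is fine-tuned with a prompt-based approach. The proposed method is validated with evaluation metrics such as recall and F1. The experimental results show that adding UC significantly improves entity recall, while MVC improves the model's F1 score.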
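The two combiners named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the span representation `(start, end, label)`, the function names, the label strings, and the toy predictions are all assumptions made for illustration, and the sketch does not resolve overlapping spans that a union may produce.

```python
from collections import Counter

# Hypothetical span representation: (start, end, label), end exclusive.
Span = tuple[int, int, str]


def union_combine(predictions: list[set[Span]]) -> set[Span]:
    """UC sketch: keep every span predicted by at least one model."""
    combined: set[Span] = set()
    for spans in predictions:
        combined |= spans
    return combined


def majority_vote_combine(predictions: list[set[Span]]) -> set[Span]:
    """MVC sketch: keep spans predicted by more than half of the models."""
    votes = Counter(span for spans in predictions for span in spans)
    threshold = len(predictions) / 2
    return {span for span, count in votes.items() if count > threshold}


def precision_recall_f1(predicted: set[Span], gold: set[Span]) -> tuple[float, float, float]:
    """Exact-match span precision, recall, and F1."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data (invented for illustration, not taken from the paper).
gold = {(0, 2, "PER"), (5, 7, "LOC"), (9, 11, "OFI")}
sequence_labeling = {(0, 2, "PER"), (5, 7, "LOC")}
sequence_labeling_crf = {(0, 2, "PER"), (5, 8, "LOC")}
span_level = {(0, 2, "PER"), (5, 7, "LOC"), (9, 11, "OFI")}
model_outputs = [sequence_labeling, sequence_labeling_crf, span_level]

uc_spans = union_combine(model_outputs)
mvc_spans = majority_vote_combine(model_outputs)

uc_scores = precision_recall_f1(uc_spans, gold)
mvc_scores = precision_recall_f1(mvc_spans, gold)
```

On this toy input the union keeps all four distinct spans, so it recovers every gold entity at the cost of one spurious span, whereas majority voting keeps only the two spans that at least two models agree on. That mirrors the general trade-off the abstract reports (UC favors recall), though the numbers here say nothing about the paper's actual results.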

Keywords: sequence labeling, named entity recognition, Twenty-Four Histories, MU method, span prediction

Received: 2024-05-09

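Funding: Major Program of the National Social Science Fund of China, "Construction and Application of a Cross-Language Knowledge Base of Ancient Chinese Classics" (21&ZD331).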

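About the authors: Xu Qiankun, male, born in 1996, PhD candidate; research interests: humanities computing, natural language processing, and knowledge mining. Wang Dongbo, male, born in 1981, PhD, professor, doctoral supervisor; research interests: natural language processing and knowledge mining, digital humanities, and informetrics. Liu Yutong, female, born in 1996, PhD candidate; research interests: scientific data management and informetrics. Huang Shuiqing, corresponding author, male, born in 1964, PhD, professor, doctoral supervisor; research interests: humanities computing, text information processing and retrieval, and text mining; E-mail: sqhuang@njau.edu.cn.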

Cite this article:

Xu Qiankun, Wang Dongbo, Liu Yutong, Huang Shuiqing. Named Entity Recognition of Ancient Books Based on MU Sequence Labeling[J]. 情报学报, 2025, 44(6): 736-747.

Link to this article:

https://qbxb.istic.ac.cn/CN/10.3772/j.issn.1000-0135.2025.06.007 or https://qbxb.istic.ac.cn/CN/Y2025/V44/I6/736
