Named Entity Recognition of Ancient Books Based on MU Sequence Labeling
Xu Qiankun1, Wang Dongbo1,2, Liu Yutong1, Huang Shuiqing1,2
1. College of Information Management, Nanjing Agricultural University, Nanjing 210095; 2. Research Center for Humanities and Social Computing, Nanjing Agricultural University, Nanjing 210095
Abstract Named entity recognition (NER) is a basic step for many downstream tasks in natural language processing. Ancient books, as carriers of Chinese civilization, contain a rich cultural heritage and are an important source of historical wisdom and enlightenment for the future. Improving entity recognition accuracy in ancient texts promotes their structuring and the systematization of the knowledge they contain, as well as the intelligent use and development of ancient-book resources. First, we took the Twenty-Four Histories dataset refined by our group as the original dataset and fine-tuned the GujiBERT_FAN pre-trained model with the Sequence Labeling, Sequence Labeling_CRF, and Span-level Prediction methods to capture entity boundaries and types more accurately; the fine-tuned models were then used to recognize and predict entities in ancient texts. Second, we developed merging and majority voting mechanisms that combine the recognized entity dataset with the prediction results of the NER models to create a new dataset. Finally, the Sequence Labeling, Sequence Labeling_CRF, and Span-level Prediction methods were trained to determine whether the predicted entities were incorrect, and the model was fine-tuned using hinted concepts. Evaluation showed that adding the merging method significantly increased the recall of entity recognition, and that the majority voting method improved the model’s F1 value.
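The merging and majority voting steps named in the abstract can be illustrated with a minimal sketch. The abstract does not define either operation precisely, so this assumes that merging means taking the union of the entities predicted by the three models and that majority voting keeps an entity only when more than half of the models predict it. The `(start, end, type)` entity representation, the function names, and the sample predictions are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import List, Set, Tuple

# Hypothetical entity representation: (start offset, end offset, entity type).
Entity = Tuple[int, int, str]


def merge_predictions(predictions: List[Set[Entity]]) -> Set[Entity]:
    """Assumed 'merging': keep every entity predicted by at least one model."""
    merged: Set[Entity] = set()
    for pred in predictions:
        merged |= pred
    return merged


def majority_vote(predictions: List[Set[Entity]]) -> Set[Entity]:
    """Assumed 'majority voting': keep entities predicted by more than half of the models."""
    counts = Counter(entity for pred in predictions for entity in pred)
    threshold = len(predictions) / 2
    return {entity for entity, n in counts.items() if n > threshold}


# Made-up predictions from three models on the same sentence.
seq_labeling = {(0, 2, "PER"), (5, 7, "LOC")}
seq_labeling_crf = {(0, 2, "PER"), (5, 8, "LOC")}
span_level = {(0, 2, "PER"), (5, 7, "LOC"), (10, 12, "OFF")}

all_preds = [seq_labeling, seq_labeling_crf, span_level]
merged = merge_predictions(all_preds)
voted = majority_vote(all_preds)
```

Under these assumptions, the union can only add candidate entities, which is consistent with the reported gain in recall, while voting discards entities that only one model proposes and so trades some recall for precision.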
Received: 09 May 2024