Ancient Chinese Word Segmentation Based on Graph Convolutional Neural Network
Tang Xuemei1,2, Su Qi3,2,4, Wang Jun1,2,4, Yang Hao2,4
1. Department of Information Management, Peking University, Beijing 100871
2. Digital Humanities Center of Peking University, Beijing 100871
3. School of Foreign Languages, Peking University, Beijing 100871
4. Institute for Artificial Intelligence, Peking University, Beijing 100871
Abstract Ancient Chinese syntax is characterized by omission and word-order inversion, and its morphology by word-class shift and an abundance of pronouns and nouns. These features make ancient Chinese word segmentation (CWS) more difficult and lead to a serious out-of-vocabulary (OOV) problem. Recently, deep learning methods have been widely applied to ancient CWS and have achieved significant success. However, these works focus on improving overall segmentation performance and ignore the OOV issue, a major challenge in CWS. We therefore propose an ancient CWS framework that combines a pre-trained language model with a graph convolutional neural network, integrating external knowledge into the neural model to alleviate the OOV problem. Experimental results on three ancient Chinese datasets (Zuo Zhuan, Stratagems of the Warring States, and The Scholars) show that our model improves word segmentation performance on all three. Further analysis shows that the model effectively integrates lexicon and N-gram information; in particular, N-gram information helps to alleviate the OOV problem.
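As a rough illustration of how lexicon or N-gram matches might be fed into a graph convolution, the sketch below builds a small character/word graph and applies one generic graph convolution step. It is not the authors' implementation: the function names `build_graph` and `gcn_layer`, the edge scheme (adjacent characters linked, each matched word linked to the characters it covers), the one-hot node features, and the one-word lexicon are all assumptions made for this example, and the pre-trained language model that would normally supply character features is left out.

```python
import math


def build_graph(sentence, lexicon, max_len=4):
    """Return (nodes, adjacency) for a character/word graph.

    Nodes 0..n-1 are the characters of `sentence`. Every lexicon word or
    N-gram found in the sentence becomes an extra node linked to the
    characters it covers. Adjacent characters are linked as well.
    (Hypothetical edge scheme, chosen only for illustration.)
    """
    n = len(sentence)
    nodes = list(sentence)
    edges = {(i, i + 1) for i in range(n - 1)}
    for start in range(n):
        for length in range(2, max_len + 1):
            span = sentence[start:start + length]
            if len(span) == length and span in lexicon:
                word_id = len(nodes)
                nodes.append(span)
                edges.update((word_id, start + k) for k in range(length))
    size = len(nodes)
    adj = [[0.0] * size for _ in range(size)]
    for a, b in edges:
        adj[a][b] = adj[b][a] = 1.0
    for i in range(size):
        adj[i][i] = 1.0  # self-loops keep each node's own features
    return nodes, adj


def gcn_layer(adj, features, weight):
    """One graph convolution: ReLU(D^-1/2 A D^-1/2 X W)."""
    size = len(adj)
    in_dim = len(weight)
    out_dim = len(weight[0])
    deg = [sum(row) for row in adj]
    # Symmetrically normalized adjacency matrix.
    norm = [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(size)]
            for i in range(size)]
    # Aggregate neighbour features, then apply the linear map and ReLU.
    agg = [[sum(norm[i][k] * features[k][d] for k in range(size))
            for d in range(in_dim)] for i in range(size)]
    return [[max(0.0, sum(agg[i][d] * weight[d][o] for d in range(in_dim)))
             for o in range(out_dim)] for i in range(size)]


# Toy usage: one sentence and a one-word lexicon (assumed for the example).
nodes, adj = build_graph("鄭伯克段于鄢", {"鄭伯"})
identity = [[1.0 if i == j else 0.0 for j in range(len(nodes))]
            for i in range(len(nodes))]
hidden = gcn_layer(adj, identity, identity)  # one-hot features, identity weights
```

In a full model, the character rows of `hidden` would typically be combined with contextual character representations before a tagging layer; how the paper performs that combination is not stated in the abstract.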
Received: 14 July 2022
(Responsible editor: Wang Keping)