Research on the Automatic Word Segmentation of The Book of Songs under Multi-dimensional Domain Knowledge
Wang Shanshan, Wang Dongbo, Huang Shuiqing, He Lin
Nanjing Agricultural University, Nanjing 210095
Abstract The Book of Songs is the earliest anthology of poetry in China and one of the thirteen classics of the Confucian tradition. It is ranked first among the ancient canonical Five Classics, which comprise the Yijing (“Classic of Changes”), the Shujing (“Classic of History”), The Book of Songs, the Collection of Rituals, and the Chunqiu (“Spring and Autumn Annals”). The Book of Songs is rich in content, reflecting all aspects of social life in the Zhou Dynasty, such as labor and love, war and corvée, oppression and rebellion, customs and marriage, ancestor worship and banquets, and even astronomy, geomorphology, animals, and plants. It is a mirror of Zhou Dynasty society, known as The Life Encyclopedia of Ancient Society. Moreover, The Book of Songs served as a textbook of ancient Chinese political ethics, aesthetic education, and naturalism. Against the background of the extensive application of humanities computing, this paper combines the Sinological Index Series with the domain knowledge of the Mao Shi Index and studies the automatic word segmentation of The Book of Songs using a machine learning method. Based on a manually segmented corpus of The Book of Songs, knowledge from the Guang Yun was combined with statistical analysis to obtain 23 sets of feature templates that fuse different kinds of feature knowledge, and a machine learning segmentation model was trained for each template set. The performance of each word segmentation model is analyzed; lexical features are found to have the greatest influence on the segmentation of The Book of Songs, and the harmonic mean F value of the word segmentation model reaches up to 97.42%. Finally, the paper uses the domain glossary of the Mao Shi Index to post-process the output of the best-performing segmentation model with long-word correction, and obtains a word corpus of The Book of Songs that fuses the expert vocabulary knowledge of the Mao Shi Index.
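The harmonic mean F value mentioned above is conventionally computed from precision and recall over word boundaries. The following is a minimal sketch of that computation, not the paper's evaluation script; the function names `spans` and `prf` and the example segmentations are illustrative assumptions and are not taken from the paper's corpus.

```python
def spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out


def prf(gold, pred):
    """Return (precision, recall, F) for a predicted segmentation.

    A predicted word counts as correct only if both of its boundaries
    match a word in the gold segmentation.
    """
    g, p = spans(gold), spans(pred)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f


# Hypothetical gold and predicted segmentations of one verse line.
gold = ["关关", "雎鸠", "在", "河", "之", "洲"]
pred = ["关关", "雎", "鸠", "在", "河", "之", "洲"]
precision, recall, f = prf(gold, pred)
print(f"P={precision:.4f} R={recall:.4f} F={f:.4f}")
```

Comparing spans rather than word strings keeps repeated words at different positions from being conflated.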
This article integrates multi-dimensional domain knowledge to realize the automatic segmentation of The Book of Songs, which provides a reference for related research on Pre-Qin poetry and may inspire further study of the automatic word segmentation of the Pre-Qin classics. The word corpus of The Book of Songs, as part of the Pre-Qin classics word corpus, supports further knowledge mining of the Pre-Qin classics.
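The glossary-based long-word correction described in the abstract could be sketched as follows. This is only one plausible reading, assuming that adjacent tokens produced by the segmentation model are merged whenever their concatenation appears in the domain glossary, with longer matches tried first. The function name `merge_long_words`, the `max_span` parameter, and the sample glossary entries are hypothetical and do not come from the paper or the Mao Shi Index.

```python
def merge_long_words(words, glossary, max_span=4):
    """Merge runs of adjacent tokens whose concatenation is a glossary entry.

    words:    tokens output by a segmentation model
    glossary: set of domain words (stand-in for an expert word list)
    max_span: maximum number of adjacent tokens considered for one merge
    """
    result, i = [], 0
    while i < len(words):
        merged = False
        # Try the longest run of at least two tokens first.
        for j in range(min(len(words), i + max_span), i + 1, -1):
            candidate = "".join(words[i:j])
            if candidate in glossary:
                result.append(candidate)
                i = j
                merged = True
                break
        if not merged:
            result.append(words[i])
            i += 1
    return result


# Hypothetical over-segmented output and a toy glossary.
tokens = ["窈", "窕", "淑", "女"]
toy_glossary = {"窈窕", "淑女"}
print(merge_long_words(tokens, toy_glossary))
```

Trying longer runs first means a multi-character glossary entry takes priority over any shorter entry that begins at the same position.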
Received: 19 May 2017