融合结构特性的语义增强式古籍句读识别方法研究

doi:10.3772/j.issn.1000-0135.2023.02.003

情报学报 (Journal of the China Society for Scientific and Technical Information)

2023, Vol. 42, Issue 2: 150-163

Intelligence Theories and Methods


Study of Antiquarian Punctuation Recognition Methods Incorporating Semantic Enhancement with Structural Properties

Li Peiqi^1,2, Wang Hao^1,2, Ren Qiutong^1,2, Fan Tao^1,2

1. School of Information Management, Nanjing University, Nanjing 210023
2. Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023


Abstract: Digital humanities has broadened both the meaning and the scope of automated ancient-text processing, and achieving a deeper understanding of the semantics of ancient texts has become a priority. This article therefore explores semantic enhancement models for punctuation recognition in ancient texts, with the aim of improving the ability of the mainstream BBiC model (BERT-BiLSTM-CRF) to represent their semantics. It fuses structural features to obtain a deeper semantic representation along two dimensions, the text and the model. It proposes a BBiC-EK (BBiC-external knowledge) model that introduces fine-grained textual knowledge, and a BBiCC-EK (BBiC-CNN-EK) model that additionally fuses the structural features of the text. From the perspective of model structure, it investigates the optimal way to connect the CNN with the BiLSTM and the optimal position at which to encode the external knowledge. The experimental results show that the best external-knowledge combination in the BBiC-EK model improves punctuation recognition accuracy by 0.83 percentage points over the baseline BBiC model, and that the BBiCC-EK (Se) model, which further fuses the CNN and uses the optimal model structure, improves on the BBiC model by 1.36 percentage points. By fusing semantic enhancement techniques with structural features, this article improves the accuracy of punctuation recognition in ancient texts and offers new ideas for their automated semantic understanding.

Key words: digital humanities; ancient text; punctuation recognition; BERT
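
Models of the BERT-BiLSTM-CRF family are sequence labellers, so punctuation recognition is commonly framed as assigning one label to each character. The abstract does not state the tag set used, so the following is only a minimal sketch under an assumed scheme: each character is tagged with the punctuation mark that follows it, or "O" when none does. The names `PUNCTUATION`, `to_sequence_labels`, and `restore_punctuation` are hypothetical and do not come from the article.

```python
from typing import List, Tuple

# Assumed set of marks to recognise; the article's actual tag set is not given.
PUNCTUATION = set("，。、；：？！")


def to_sequence_labels(punctuated: str) -> Tuple[List[str], List[str]]:
    """Split punctuated text into characters and per-character labels.

    A character's label is the mark that directly follows it, or "O".
    Marks at the very start of the text are dropped, and of several
    consecutive marks only the last is kept (simplifications of this sketch).
    """
    chars: List[str] = []
    labels: List[str] = []
    for ch in punctuated:
        if ch in PUNCTUATION:
            if labels:
                labels[-1] = ch  # attach the mark to the preceding character
            continue
        chars.append(ch)
        labels.append("O")
    return chars, labels


def restore_punctuation(chars: List[str], labels: List[str]) -> str:
    """Rebuild punctuated text from characters and (predicted) labels."""
    return "".join(c if tag == "O" else c + tag for c, tag in zip(chars, labels))


if __name__ == "__main__":
    chars, labels = to_sequence_labels("學而時習之，不亦說乎？")
    print(list(zip(chars, labels)))
    print(restore_punctuation(chars, labels))
```

Under this assumed scheme, the unpunctuated characters would be the model input and the labels the training target; a trained tagger's predictions could then be passed to `restore_punctuation` to produce readable output.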

Received: 26 January 2022


Cite this article:

Li Peiqi, Wang Hao, Ren Qiutong, et al. Study of Antiquarian Punctuation Recognition Methods Incorporating Semantic Enhancement with Structural Properties[J]. 情报学报, 2023, 42(2): 150-163.

URL:

https://qbxb.istic.ac.cn/EN/10.3772/j.issn.1000-0135.2023.02.003 OR https://qbxb.istic.ac.cn/EN/Y2023/V42/I2/150


Editorial Office: JCSSTI Editorial Office, No. 15 Fuxing Road, Haidian, Beijing 100038
Tel: +86(010)68598273; Fax: +86(010)68598285; E-mail: qbxb@istic.ac.cn
Copyright © 2015 by the Journal of The China Society for Scientific and Technical Information
ISSN: 1000-0135 CN: 11-2257 / G3