Topic Classification of Ancient Texts Based on SWPF2vec and DJ-TextRCNN
Wu Shuai1, Yang Xiuzhang2,3, He Lin1, Gong Zuoquan4
1. College of Information Management, Nanjing Agricultural University, Nanjing 211800
2. Guizhou Big Data Academy, Guizhou University, Guiyang 550025
3. School of Cyber Science and Engineering, Wuhan University, Wuhan 430030
4. School of Information, Guizhou University of Finance and Economics, Guiyang 550025
Abstract Topic classification of ancient texts has relied mainly on cataloging and rule matching, an approach that is inefficient, depends heavily on expert knowledge, rests on a single classification basis, and is difficult to automate. To address these issues, this study classifies ancient texts into topics that meet researchers’ needs on the basis of the texts’ content and characteristics, with the aim of promoting a shift in digital humanities research paradigms. First, following the character analysis method of the Eastern Han Dynasty dictionary Shuowen Jiezi (Analytical Dictionary of Characters), a new four-dimensional feature dataset of “pronunciation (speaking) - original text (text) - structure (pattern) - glyph (font)” is constructed from a corpus of ancient books. Second, a four-dimensional feature vector extraction model (speaking, word, pattern, and font to vector; SWPF2vec) is designed and combined with a pre-trained model to obtain a fine-grained feature representation of ancient texts. Third, an ancient text topic classification model (dianji - recurrent convolutional neural networks for text classification; DJ-TextRCNN) is constructed by fusing convolutional neural networks, recurrent neural networks, and a multi-head attention mechanism. Finally, the four-dimensional semantic features are integrated to achieve multidimensional, deep, and fine-grained semantic mining of ancient texts. DJ-TextRCNN achieves the best topic classification accuracy across the different feature dimensions, reaching 76.23% with the four-dimensional “shuo, wen, jie, zi” features, a preliminary step toward accurate topic classification of ancient texts.
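The four-dimensional feature idea described in the abstract can be illustrated with a minimal sketch. It assumes that each character has one vector per dimension (pronunciation, text, structure, glyph) and that the four vectors are fused by simple concatenation; the abstract does not specify the fusion method, so concatenation is an assumption. The lookup tables, the vector values, and the function names `fuse_character` and `encode_sentence` are hypothetical and are not taken from SWPF2vec.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical toy lookup tables, one per feature dimension.
# In practice these vectors would be learned, not hand-written.
PRONUNCIATION: Dict[str, Vector] = {"王": [0.1, 0.2], "伐": [0.3, 0.4]}
TEXT: Dict[str, Vector] = {"王": [0.5, 0.6], "伐": [0.7, 0.8]}
STRUCTURE: Dict[str, Vector] = {"王": [1.0, 0.0], "伐": [0.0, 1.0]}
GLYPH: Dict[str, Vector] = {"王": [0.9, 0.1], "伐": [0.2, 0.8]}

TABLES = [PRONUNCIATION, TEXT, STRUCTURE, GLYPH]
DIM = 2  # size of each per-dimension vector in this toy example


def fuse_character(ch: str) -> Vector:
    """Concatenate the four per-dimension vectors of one character.

    Characters missing from a table fall back to a zero vector
    (an assumption made only for this sketch).
    """
    fused: Vector = []
    for table in TABLES:
        fused.extend(table.get(ch, [0.0] * DIM))
    return fused


def encode_sentence(sentence: str) -> List[Vector]:
    """Return one fused vector per character, in order."""
    return [fuse_character(ch) for ch in sentence]


if __name__ == "__main__":
    for ch, vec in zip("王伐", encode_sentence("王伐")):
        print(ch, vec)
```

A sequence of fused vectors of this kind is what a downstream classifier combining convolutional, recurrent, and attention layers would take as input; that classifier is not sketched here.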
Received: 22 August 2023