ChpoBERT: A Pre-trained Model for Chinese Policy Texts

Shen Si1, Chen Meng1, Feng Shuyang1, Xu Qiankun2, Liu Jiangfeng2, Wang Fei3, Wang Dongbo2

1. School of Economics & Management, Nanjing University of Science & Technology, Nanjing 210094
2. College of Information Management, Nanjing Agricultural University, Nanjing 210095
3. Jiangsu Institute of Science and Technology Information, Nanjing 210042
Abstract With the rapid development of deep learning and the accumulation of domain data, domain-specific pre-trained models play an increasingly important supporting role in knowledge organization and mining. For the massive body of Chinese policy texts, a dedicated pre-trained model, paired with suitable pre-training strategies, not only raises the level of intelligent processing of these texts but also lays a solid foundation for data-driven, fine-grained, multi-dimensional analysis and exploration of policy documents. From national, provincial, and municipal sources, 131,390 policy texts totaling 305,648,206 Chinese words were collected through a combination of automatic crawling and manual screening that removed non-policy documents. On this corpus, this study develops ChpoBERT, a set of pre-trained models for Chinese policy texts based on Chinese-RoBERTa-wwm-ext and BERT-base-Chinese. The models are open source and available on GitHub. Evaluated by perplexity and on the downstream tasks of automatic word segmentation, part-of-speech tagging, and named entity recognition for policy texts, the constructed ChpoBERT models showed superior performance and can provide basic computing resources for intelligent knowledge mining of policy texts.
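As a usage illustration, the sketch below loads a ChpoBERT checkpoint with the Hugging Face transformers library and scores a policy sentence by masked-language-model pseudo-perplexity, one common way to compute perplexity for BERT-style models. It is a minimal sketch under two assumptions: the released checkpoint is in transformers format, and the model path "chpobert-roberta-wwm-ext" is hypothetical and must be replaced with the actual identifier from the authors' GitHub repository; the scoring loop is not necessarily the exact evaluation protocol used in the paper.

import math

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Hypothetical path; substitute the checkpoint actually released on GitHub.
MODEL_PATH = "chpobert-roberta-wwm-ext"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForMaskedLM.from_pretrained(MODEL_PATH)
model.eval()

def pseudo_perplexity(sentence: str) -> float:
    """Mask each token in turn and exponentiate the mean negative log-likelihood."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    nlls = []
    # Skip the special tokens [CLS] (first position) and [SEP] (last position).
    for i in range(1, input_ids.size(0) - 1):
        masked = input_ids.clone()
        target = masked[i].item()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(input_ids=masked.unsqueeze(0)).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        nlls.append(-log_probs[target].item())
    return math.exp(sum(nlls) / len(nlls))

# Lower pseudo-perplexity suggests the model fits policy-domain language better.
print(pseudo_perplexity("加快推进人工智能与实体经济深度融合。"))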
Received: 05 December 2022