ChpoBERT: A Pre-trained Model for Chinese Policy Texts
Shen Si¹, Chen Meng¹, Feng Shuyang¹, Xu Qiankun², Liu Jiangfeng², Wang Fei³, Wang Dongbo²
1. School of Economics & Management, Nanjing University of Science & Technology, Nanjing 210094, China; 2. College of Information Management, Nanjing Agricultural University, Nanjing 210095, China; 3. Jiangsu Institute of Science and Technology Information, Nanjing 210042, China
Abstract: With the rapid development of deep learning and the fast accumulation of domain data, domain-specific pre-trained models play an increasingly important supporting role in knowledge organization and mining. For the massive body of Chinese policy texts, building a pre-trained model for Chinese policy texts with appropriate pre-training strategies not only helps raise the level of intelligent processing of such texts, but also lays a solid foundation for fine-grained, multi-dimensional, data-driven analysis of policy texts. Targeting policy texts published on national, provincial, and municipal platforms, and combining automatic crawling with manual assistance, this study identified 131,390 policy documents totaling 305,648,206 characters after removing non-policy texts. On the resulting Chinese policy text corpus, and starting from BERT-base-Chinese and Chinese-RoBERTa-wwm-ext, this study built the Chinese policy text pre-trained model (ChpoBERT) using the MLM (masked language model) and WWM (whole word masking) objectives, and released the model as open source on GitHub. On the perplexity metric and on the downstream tasks of automatic word segmentation, part-of-speech tagging, and named entity recognition for policy texts, the ChpoBERT models show superior performance and can provide domain-specific foundational computing resources for intelligent knowledge mining of policy texts.
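The continued pre-training described above can be sketched with the Hugging Face transformers library. The following is a minimal illustration, not the authors' released training script: the corpus file name, data handling, and hyperparameters are assumptions, and genuine Chinese whole-word masking additionally requires word-boundary annotations, as noted in the comments.

```python
# Minimal sketch of domain-adaptive MLM/WWM pre-training with Hugging Face
# transformers. The checkpoint names are the two general-domain starting
# points named in the abstract; the corpus path and hyperparameters are
# illustrative assumptions, not the authors' released configuration.
import math

from datasets import load_dataset
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForWholeWordMask,
    Trainer,
    TrainingArguments,
)

# Start from a general-domain Chinese checkpoint and continue pre-training
# on policy texts ("bert-base-chinese" or "hfl/chinese-roberta-wwm-ext").
model_name = "hfl/chinese-roberta-wwm-ext"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

# Hypothetical corpus layout: one policy document (or paragraph) per line.
dataset = load_dataset("text", data_files={"train": "policy_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
split = tokenized["train"].train_test_split(test_size=0.01)

# Whole-word masking collator. Note: for Chinese text, true WWM needs word
# boundaries supplied through a "chinese_ref" field (see the
# run_chinese_ref.py example in transformers); without it, masking falls
# back to individual characters, i.e. plain character-level MLM.
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer,
                                        mlm_probability=0.15)

args = TrainingArguments(
    output_dir="chpobert-sketch",    # assumed output path
    per_device_train_batch_size=16,  # assumed hyperparameters
    num_train_epochs=3,
    learning_rate=5e-5,
)

trainer = Trainer(model=model, args=args, train_dataset=split["train"],
                  eval_dataset=split["test"], data_collator=collator)
trainer.train()

# Perplexity, the intrinsic metric cited in the abstract, is exp of the
# mean masked-LM cross-entropy loss on held-out text.
ppl = math.exp(trainer.evaluate()["eval_loss"])
print(f"held-out masked-LM perplexity: {ppl:.2f}")
```

Under a setup like this, a lower held-out perplexity for the domain-adapted model than for its general-domain starting checkpoint indicates a better fit to policy-domain language, which is the kind of comparison the abstract reports.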