Automatic Extraction of Chinese Terminology Based on BERT Embedding and BiLSTM-CRF Model
Wu Jun¹, Cheng Yao¹, Hao Han¹, Ailiyaer·Aizezi², Liu Feixue¹, Su Yipo¹
1. School of Economics and Management, Beijing University of Posts and Telecommunications, Beijing 100876
2. Shenzhen Storm Intelligent Technology Co., Ltd., Beijing 100191
Abstract: High-quality recognition and extraction of professional terms play an important role in domain-specific information retrieval and knowledge graph construction. To improve the precision and recall of terminology recognition, we propose a Chinese terminology recognition and extraction approach that relies on neither domain-specific knowledge nor hand-crafted features. Drawing on recent advances in representation learning, this study introduces BERT embeddings as a character-level pre-trained representation and combines them with a bi-directional long short-term memory (BiLSTM) network and a conditional random field (CRF) to extract deep-learning terminology from 1,278 annotated samples collected from domain e-books. Experimental results show that the proposed model achieves an F-score of 92.96% and outperforms competing algorithms, including left-right entropy, mutual information, a word2vec-based similar-terminology recognition algorithm, and a plain BiLSTM-CRF model. Best practices and procedures for implementing the proposed model are also provided to guide users in improving it further.
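The final CRF layer described above decodes the per-character scores produced by the BiLSTM into a globally consistent tag sequence. As a minimal illustration (not the paper's implementation), the sketch below runs Viterbi decoding over BIO tags with hypothetical emission scores standing in for BiLSTM outputs and hypothetical learned transition scores; the strong penalty on the O→I transition shows how the CRF enforces well-formed term spans.

```python
# Minimal Viterbi decoder illustrating the CRF layer's role in
# BIO-style terminology tagging. The emission and transition scores
# below are hypothetical stand-ins for BiLSTM outputs and learned
# CRF parameters, not values from the paper.

TAGS = ["B", "I", "O"]  # begin-term, inside-term, outside

def viterbi(emissions, transitions):
    """emissions: one dict tag->score per character;
    transitions: dict (prev_tag, tag)->score.
    Returns the highest-scoring tag sequence."""
    # First character: emission score only.
    best = {t: (emissions[0][t], [t]) for t in TAGS}
    for em in emissions[1:]:
        new_best = {}
        for t in TAGS:
            # Extend the best path ending in each previous tag.
            score, path = max(
                (best[p][0] + transitions[(p, t)] + em[t], best[p][1] + [t])
                for p in TAGS
            )
            new_best[t] = (score, path)
        best = new_best
    return max(best.values())[1]

# Hypothetical transition scores: I may only follow B or I.
transitions = {
    ("B", "B"): -1.0, ("B", "I"): 1.0, ("B", "O"): 0.0,
    ("I", "B"): -1.0, ("I", "I"): 1.0, ("I", "O"): 0.0,
    ("O", "B"): 0.4,  ("O", "I"): -5.0, ("O", "O"): 0.5,
}

# Hypothetical emission scores for a four-character sentence
# containing a three-character term.
emissions = [
    {"B": 2.0, "I": 0.0, "O": 0.5},
    {"B": 0.0, "I": 1.5, "O": 0.5},
    {"B": 0.2, "I": 1.2, "O": 1.0},
    {"B": 0.0, "I": 0.3, "O": 2.0},
]

print(viterbi(emissions, transitions))  # → ['B', 'I', 'I', 'O']
```

In the full model, the same decoding runs over BERT-initialized character representations passed through the BiLSTM, and the transition scores are trained jointly with the network rather than fixed by hand.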
|
Received: 10 October 2019
|
|
|
|
|
|
|