Chinese Disease Name Normalization Based on Multi-task Learning and Polymorphic Semantic Features

doi:10.3772/j.issn.1000-0135.2021.11.009

Journal of the China Society for Scientific and Technical Information

2021, Vol. 40, Issue 11: 1234-1244

Section: Information Analysis Methods and Techniques


Chinese Disease Name Normalization Based on Multi-task Learning and Polymorphic Semantic Features

Han Pu^1,2, Zhang Zhanpeng^1, Zhang Wei^1

1. School of Management, Nanjing University of Posts & Telecommunications, Nanjing 210003
2. Jiangsu Provincial Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023


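Abstract: To address the large number of disease mentions in online text, this paper proposes a Chinese disease name normalization model based on multi-task learning and polymorphic semantic features (multi-task attention-dictionary BERT GRU-CNN, MTAD-BERT-GCNN). First, word2vec and GloVe are used to generate external semantic feature vectors that fuse local and global information. Second, CNN (convolutional neural networks) and BERT (bidirectional encoder representations from transformers) serve as baseline models for comparative experiments. Third, GRU (gated recurrent unit), LSTM (long short-term memory), BiGRU (bi-directional gated recurrent unit), and BiLSTM (bi-directional long short-term memory) are each introduced on top of the CNN to extract semantic relations across the text. Fourth, from a multi-task learning perspective, these models are combined with BERT to capture both static and dynamic semantic information. Finally, a medical dictionary is used to generate an attention-weight dictionary as an auxiliary task that adjusts the static vectors, further improving model performance. Experiments are conducted on a self-built Chinese disease name normalization dataset, ChDND (Chinese disease normalization data). The results show that MTAD-BERT-GCNN reaches 89.60% on the Accuracy@10 metric, 12.96% and 5.12% higher than the basic word-level CNN and character-level CNN, respectively. This study introduces multi-task learning to the Chinese disease name normalization task and optimizes both the semantic vectors and the model framework; it has good application value for Chinese medical knowledge graph construction, information extraction, and natural language understanding.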
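The abstract reports results as Accuracy@10 but does not define the metric here. The sketch below assumes the common reading: a mention counts as correct when its gold standard name appears among the model's top k ranked candidates. The function name `accuracy_at_k` and the toy disease names are hypothetical illustrations, not the authors' code or data.

```python
from typing import Sequence


def accuracy_at_k(ranked_candidates: Sequence[Sequence[str]],
                  gold_labels: Sequence[str],
                  k: int = 10) -> float:
    """Fraction of mentions whose gold label is among the top-k candidates.

    Assumed definition; the source abstract does not spell it out.
    """
    if len(ranked_candidates) != len(gold_labels):
        raise ValueError("ranked_candidates and gold_labels must have equal length")
    if not gold_labels:
        return 0.0
    hits = sum(
        1
        for candidates, gold in zip(ranked_candidates, gold_labels)
        if gold in candidates[:k]  # only the k best-ranked candidates count
    )
    return hits / len(gold_labels)


# Hypothetical toy data: each inner list is a ranking (best first) for one mention.
predictions = [
    ["type 2 diabetes mellitus", "type 1 diabetes mellitus", "gestational diabetes"],
    ["hypertension", "hypotension", "pulmonary hypertension"],
    ["gastritis", "gastric ulcer", "enteritis"],
    ["migraine", "tension headache", "cluster headache"],
]
gold = [
    "type 2 diabetes mellitus",
    "pulmonary hypertension",
    "duodenal ulcer",
    "tension headache",
]

acc_at_1 = accuracy_at_k(predictions, gold, k=1)
acc_at_2 = accuracy_at_k(predictions, gold, k=2)
acc_at_3 = accuracy_at_k(predictions, gold, k=3)
print(acc_at_1, acc_at_2, acc_at_3)
```

Under this reading, the score can only stay the same or rise as k grows, so an Accuracy@10 figure is an upper bound on the corresponding Accuracy@1.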


Keywords: disease name normalization, supervised learning, multi-task learning, convolutional neural network, BERT

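Received: 2020-11-23

Funding: National Social Science Fund of China project "Research on Semantic Mining of Health-Domain Entities in the Big Data Environment" (17CTQ022).

About the authors: Han Pu, male, born in 1983, PhD, associate professor, master's supervisor; main research area: medical and health semantic analysis; E-mail: hanpu@njupt.edu.cn. Zhang Zhanpeng, male, born in 1996, master's student; main research area: entity normalization. Zhang Wei, male, born in 2000, undergraduate student; main research area: natural language processing.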

Cite this article:

Han Pu, Zhang Zhanpeng, Zhang Wei. Chinese Disease Name Normalization Based on Multi-task Learning and Polymorphic Semantic Features[J]. Journal of the China Society for Scientific and Technical Information, 2021, 40(11): 1234-1244.

Link to this article:

https://qbxb.istic.ac.cn/CN/10.3772/j.issn.1000-0135.2021.11.009 or https://qbxb.istic.ac.cn/CN/Y2021/V40/I11/1234
