Recognition of Lexical Functions in Academic Texts: Automatic Classification of Keywords Based on BERT Vectorization

Lu Wei 1,2, Li Pengcheng 1,2, Zhang Guobiao 1,2, Cheng Qikai 1,2

1. School of Information Management, Wuhan University, Wuhan 430072; 2. Institute for Information Retrieval and Knowledge Mining, Wuhan University, Wuhan 430072
|
|
Abstract: Keywords, as the vocabulary or terminology that maps the subject content of an academic full text, can serve as important underlying semantic labels for knowledge retrieval and large-scale text computation. At present, however, keywords in academic texts suffer from unclear usage intent, ambiguous semantic function, and a lack of contextual information. This paper therefore proposes a supervised neural network method for classifying the semantic functions carried by keywords, so as to support the identification of research problems and research methods in academic texts. Journal papers published over a ten-year period in the field of computer science were used as the training corpus, and the classification model was constructed with BERT and LSTM. The results show that the proposed method outperforms traditional methods, with overall precision, recall, and F1 reaching 0.83, 0.87, and 0.85, respectively.
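The abstract describes feeding BERT contextual vectors into an LSTM-based classifier to label the semantic function of a keyword. A minimal sketch of such an architecture is given below; the class name, hidden size, and number of classes are illustrative assumptions, not the authors' exact configuration, and random tensors stand in for BERT embeddings of the keyword's context sentence (shape: batch, sequence length, 768).

```python
import torch
import torch.nn as nn

class KeywordFunctionClassifier(nn.Module):
    """Bi-LSTM classifier over BERT-style contextual token vectors.

    In the paper's setting the input would be BERT embeddings of the
    sentence containing a keyword; here stand-in tensors of the same
    shape are used.
    """
    def __init__(self, bert_dim=768, hidden_dim=128, num_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(bert_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):
        # h: (num_directions=2, batch, hidden_dim) final hidden states
        _, (h, _) = self.lstm(x)
        # concatenate forward and backward final states -> (batch, 2*hidden)
        h = torch.cat([h[0], h[1]], dim=1)
        return self.fc(h)  # logits: (batch, num_classes)

model = KeywordFunctionClassifier()
dummy = torch.randn(4, 16, 768)  # 4 sentences, 16 tokens, BERT dim 768
logits = model(dummy)
print(logits.shape)  # torch.Size([4, 2])
```

In a full pipeline, the stand-in tensor would be replaced by the last-layer hidden states of a pretrained BERT encoder, and the logits trained against the annotated function labels (e.g., research problem vs. research method) with cross-entropy loss.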
|
Received: 16 May 2020
|
|
|
|
|
|
|