Recognition of Lexical Functions in Academic Texts: Application in Automatic Keyword Extraction

DOI: 10.3772/j.issn.1000-0135.2021.02.005

情报学报 (Journal of the China Society for Scientific and Technical Information), 2021, Vol. 40, Issue 2: 152-162

Jiang Yi^1,2, Huang Yong^1,2, Xia Yikun^3, Li Pengcheng^1,2, Lu Wei^1,2

1. School of Information Management, Wuhan University, Wuhan 430072
2. Institute for Information Retrieval and Knowledge Mining, Wuhan University, Wuhan 430072
3. Center for Studies of Information Resources, Wuhan University, Wuhan 430072

Abstract: Traditional automatic keyword extraction typically builds features from non-semantic information, such as the frequency and position of candidate keywords, and ignores the semantic role that a keyword plays in an academic text, that is, its lexical function. A statistical analysis of our dataset found that 67.99% of the keywords represent research questions or research methods. We therefore classified lexical functions into three categories: Research Questions, Research Methods, and Others. Building on word frequency and position features, we propose a method that brings lexical functions into keyword extraction for computer science papers, implemented through both a classification model and a ranking model. The results show that the method outperforms a baseline that uses only the base features. For the classification model, accuracy (Acc) and F-score (F) improved to 0.840 and 0.666, relative improvements of 24.63% and 25.19%, respectively. For the ranking model, MAP, NDCG@5, and P@5 improved by 168.32%, 189.50%, and 148.30%, reaching 0.813, 0.828, and 0.447, respectively. These improvements indicate that lexical functions play an important role in automatic keyword extraction.
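The ranking results in the abstract are reported as MAP, NDCG@5, and P@5. As a rough illustration of what these metrics measure, the following sketch computes them for a single document with binary relevance (a candidate is relevant if it is an author-assigned keyword). The function names, the toy ranking, and the binary-relevance assumption are illustrative choices, not details taken from the paper; MAP would be the mean of the per-document average precision over the test set.

```python
import math


def precision_at_k(ranked, gold, k):
    """Fraction of the top-k ranked candidates that are gold keywords."""
    return sum(1 for c in ranked[:k] if c in gold) / k


def average_precision(ranked, gold):
    """Average of the precision values at each rank where a gold keyword appears.

    Assumption: normalised by the number of gold keywords, one common convention.
    """
    hits, total = 0, 0.0
    for rank, cand in enumerate(ranked, start=1):
        if cand in gold:
            hits += 1
            total += hits / rank
    return total / len(gold) if gold else 0.0


def ndcg_at_k(ranked, gold, k):
    """NDCG@k with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, c in enumerate(ranked[:k], start=1)
        if c in gold
    )
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(k, len(gold)) + 1))
    return dcg / ideal if ideal else 0.0


# Hypothetical ranked candidates and gold keywords for one paper.
ranked = ["keyword extraction", "experiment", "svm", "paper", "lexical function"]
gold = {"keyword extraction", "svm", "lexical function"}

print(precision_at_k(ranked, gold, 5))
print(average_precision(ranked, gold))
print(ndcg_at_k(ranked, gold, 5))
```

Because P@5 divides by 5 regardless of how many gold keywords a paper has, its ceiling is below 1 for papers with fewer than five keywords, which is one reason P@5 values tend to sit well below MAP and NDCG@5.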

Keywords: lexical function; keyword extraction; SVM; learning to rank; academic text
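To make the feature idea concrete, here is a minimal sketch of how a candidate keyword might be turned into a feature vector that combines the base features (frequency and position) with a lexical function label. Everything in it is an assumption for illustration: the abstract does not specify the exact feature definitions, the helper `candidate_features` and the label names are hypothetical, and the lexical function is assumed to be supplied by an earlier recognition step. Such vectors could then be passed to a classifier (for example an SVM) or a learning-to-rank model.

```python
# Hypothetical label set mirroring the three categories named in the abstract.
LEXICAL_FUNCTIONS = ("question", "method", "other")


def candidate_features(tokens, candidate, lexical_function):
    """Return [relative frequency, relative first position, one-hot lexical function].

    tokens: tokenised document text (list of lowercase strings).
    candidate: candidate keyword, possibly multi-word, space-separated.
    lexical_function: one of LEXICAL_FUNCTIONS, assumed to be given.
    """
    cand = candidate.split()
    n = len(cand)
    # Start offsets of every occurrence of the candidate in the document.
    positions = [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == cand]
    freq = len(positions) / len(tokens)
    # Candidates that never occur get the latest possible position (1.0).
    first_pos = positions[0] / len(tokens) if positions else 1.0
    one_hot = [1.0 if lexical_function == f else 0.0 for f in LEXICAL_FUNCTIONS]
    return [freq, first_pos] + one_hot


# Toy document, not taken from the paper's dataset.
tokens = (
    "we propose an svm model for keyword extraction "
    "and evaluate keyword extraction quality"
).split()

print(candidate_features(tokens, "keyword extraction", "question"))
print(candidate_features(tokens, "svm", "method"))
```

The one-hot encoding keeps the three categories unordered, which avoids implying that, say, Research Methods ranks "between" the other two.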

Received: 16 May 2020

Cite this article:

Jiang Yi, Huang Yong, Xia Yikun, et al. Recognition of Lexical Functions in Academic Texts: Application in Automatic Keyword Extraction[J]. 情报学报, 2021, 40(2): 152-162.

URL:

https://qbxb.istic.ac.cn/EN/10.3772/j.issn.1000-0135.2021.02.005 OR https://qbxb.istic.ac.cn/EN/Y2021/V40/I2/152


ISSN: 1000-0135 CN: 11-2257 / G3