基于语义文本图的论文摘要关键词抽取算法

doi:10.3772/j.issn.1000-0135.2021.08.006

情报学报

2021, Vol. 40, Issue (8): 854-868

Keyword Extraction from a Paper's Abstract Based on Semantic Text Graph

Wang Xiaoyu¹, Wang Fang²

1. Department of Information Management, School of Management Science and Engineering, Dongbei University of Finance and Economics, Dalian 116025
2. Department of Information Resources Management, Business School, Nankai University, Tianjin 300071

Abstract: Keywords play a basic role in large-scale document retrieval and text content analysis. This paper proposes an unsupervised keyword extraction algorithm based on a semantic graph, focusing on improving graph construction and the word-weighting index. To retain more semantic and structural information in the text graph, the algorithm uses the dependency relations between words in a sentence to generate a semantic text graph consisting of four features: conceptual connection, equivalent membership, functional attributes, and modification. This removes the sliding-window parameter required by the traditional method and improves the usability of the algorithm. On this basis, a word-weighting method that combines word position information, concept hierarchy, concept connection preference, and connection strength is proposed and used to rank the importance of each word. Finally, the highest-scoring nodes are selected to form the keyword set for a research paper's abstract. Experimental results on four open corpora show that this method performs better than three baseline algorithms, with the F1 value rising to 0.570.

Key words: text graph; keyword extraction; word weighting; syntactic parsing
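
The pipeline outlined in the abstract (build a graph from dependency relations, weight the words, rank the nodes) can be illustrated with a rough sketch. This is not the authors' formula, which the abstract does not give. It swaps in a position-biased weighted PageRank as a stand-in, treats the four named features as edge types, and takes hand-written edges instead of real parser output. The numeric weights in `RELATION_WEIGHT`, the function names, and the sample data are all hypothetical.

```python
from collections import defaultdict

# The four relation names come from the abstract; the numeric
# weights are illustrative assumptions, not values from the paper.
RELATION_WEIGHT = {
    "conceptual_connection": 1.0,
    "equivalent_membership": 0.8,
    "functional_attribute": 0.6,
    "modification": 0.4,
}


def build_graph(edges):
    """Build an undirected weighted graph from (word_a, word_b, relation) triples."""
    graph = defaultdict(dict)
    for a, b, relation in edges:
        weight = RELATION_WEIGHT[relation]
        graph[a][b] = graph[a].get(b, 0.0) + weight
        graph[b][a] = graph[b].get(a, 0.0) + weight
    return graph


def position_prior(first_positions):
    """Give earlier words a larger prior (1 / position), normalized to sum to 1."""
    raw = {word: 1.0 / pos for word, pos in first_positions.items()}
    total = sum(raw.values())
    return {word: value / total for word, value in raw.items()}


def rank_words(graph, prior, damping=0.85, iterations=50):
    """Position-biased weighted PageRank (a stand-in, not the paper's weighting)."""
    scores = dict(prior)
    strength = {node: sum(nbrs.values()) for node, nbrs in graph.items()}
    for _ in range(iterations):
        new_scores = {}
        for node in graph:
            # Each neighbor passes on its score in proportion to edge weight.
            incoming = sum(
                scores[nbr] * weight / strength[nbr]
                for nbr, weight in graph[node].items()
            )
            new_scores[node] = (1 - damping) * prior[node] + damping * incoming
        scores = new_scores
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical dependency-derived edges for one sentence of an abstract.
edges = [
    ("algorithm", "keyword", "conceptual_connection"),
    ("algorithm", "graph", "conceptual_connection"),
    ("graph", "semantic", "modification"),
    ("graph", "text", "modification"),
    ("keyword", "extraction", "equivalent_membership"),
    ("algorithm", "unsupervised", "functional_attribute"),
]
# Position of each word's first occurrence (1-based), also hypothetical.
first_positions = {
    "unsupervised": 1, "keyword": 2, "extraction": 3, "algorithm": 4,
    "semantic": 5, "graph": 6, "text": 7,
}

ranking = rank_words(build_graph(edges), position_prior(first_positions))
print(ranking[:3])  # the highest-scoring nodes become candidate keywords
```

Because the edges come from sentence-level dependencies, the sketch needs no sliding-window size, which is the usability point the abstract makes about the graph construction.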

Received: 03 August 2020

Cite this article:

Wang Xiaoyu, Wang Fang. Keyword Extraction from a Paper's Abstract Based on Semantic Text Graph[J]. 情报学报, 2021, 40(8): 854-868.

URL:

https://qbxb.istic.ac.cn/EN/10.3772/j.issn.1000-0135.2021.08.006 OR https://qbxb.istic.ac.cn/EN/Y2021/V40/I8/854

Editorial Office: JCSSTI Editorial Office, No. 15 Fuxing Road, Haidian, Beijing 100038
Tel: +86(010)68598273; Fax: +86(010)68598285; E-mail: qbxb@istic.ac.cn
Copyright © 2015 by the Journal of The China Society for Scientific and Technical Information
ISSN: 1000-0135 CN: 11-2257 / G3