Extraction and Evolution Analysis of Fine-grained Method Entities from Full Text of Academic Articles
Zhang Chengzhi, Xie Yuxin, Zhang Heng
Department of Information Management, School of Economics and Management, Nanjing University of Science & Technology, Nanjing 210094
Abstract In scientific research, researchers must choose appropriate solutions for different research problems and refine their methods to solve those problems better. Research methods are therefore often the key to solving research problems and constitute important knowledge in academic literature. Helping researchers quickly discover the method entities contained in the full text of academic articles provides a practical reference for recommending key solutions to their research problems and can improve their problem-solving efficiency. Currently, research on the relationships between method entities and on the rich knowledge contained in academic literature is lacking. To this end, this study takes the field of natural language processing (NLP) as an example, subdivides method entities into four types (algorithms, datasets, metrics, and tools), and annotates 50 papers as a training corpus. Four types of models were used to extract entities, and the experimental results show that the SciBERT+CRF model performs best. Based on the full text of papers from the ACL conference from 2001 to 2020, this study further analyzed the usage of the extracted method entities. An entity association dataset was constructed by combining the classical association rule mining algorithm Apriori with the chi-square value, and the evolution of the entities was analyzed. The results reveal the relationships between method entities and their overall evolution, which can help researchers in specific fields find suitable research methods.
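As a rough illustration of how support-based association mining can be combined with a chi-square value, the sketch below scores co-occurring entity pairs across papers. It is not the authors' implementation: it covers only the pair (2-itemset) level of Apriori-style support counting, and the function name `pair_statistics`, the `min_support` threshold, and the toy entity sets are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_statistics(papers, min_support=0.3):
    """Score entity pairs by support and chi-square.

    papers: list of sets, each holding the entity names found in one paper
            (hypothetical input format).
    Returns {(entity_a, entity_b): (support, chi_square)} for pairs whose
    support is at least min_support.
    """
    n = len(papers)
    # Number of papers mentioning each entity.
    single = Counter(e for p in papers for e in p)
    # Number of papers mentioning both entities of a pair.
    pairs = Counter(c for p in papers for c in combinations(sorted(p), 2))

    results = {}
    for (a, b), n_ab in pairs.items():
        support = n_ab / n
        if support < min_support:
            continue  # prune infrequent pairs, as in Apriori's support filter
        n_a, n_b = single[a], single[b]
        # 2x2 contingency table: rows = a present/absent, cols = b present/absent.
        o11 = n_ab
        o12 = n_a - n_ab
        o21 = n_b - n_ab
        o22 = n - n_a - n_b + n_ab
        denom = n_a * n_b * (n - n_a) * (n - n_b)
        chi2 = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0
        results[(a, b)] = (support, chi2)
    return results


# Toy data (invented): entities extracted from five papers.
toy_papers = [
    {"SVM", "accuracy"},
    {"SVM", "accuracy"},
    {"CRF", "F1"},
    {"CRF", "F1"},
    {"SVM", "F1"},
]
toy_scores = pair_statistics(toy_papers, min_support=0.3)
for pair, (support, chi2) in sorted(toy_scores.items()):
    print(pair, round(support, 2), round(chi2, 3))
```

Filtering on support first keeps only pairs that co-occur often enough to matter. The chi-square value then indicates how far the co-occurrence departs from what independence would predict.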
Received: 28 July 2022