Abstract: In the course of scientific research, researchers must choose appropriate research methods for different research problems, and sometimes refine those methods to solve the problems more effectively. Research methods are therefore often the key to solving research problems and constitute important knowledge in the academic literature. Helping researchers quickly discover the method entities embedded in the full text of academic literature, and providing practical references that recommend key methods suited to their own research problems, can improve the efficiency with which researchers solve problems. Existing studies, however, lack analysis of the co-occurrence relationships among method entities and do not fully exploit the rich knowledge contained in academic literature. To address this, this study takes the field of natural language processing as an example, subdivides method entities into four types (algorithms, datasets, metrics, and tools), and annotates 50 papers as a training corpus. Four entity extraction models are built, including CRF (conditional random field) and BiLSTM (bi-directional long short-term memory)+CRF; the results show that the SciBERT (scientific bidirectional encoder representations from transformers)+CRF model performs best. Based on the full texts of papers published at the ACL conference (Annual Meeting of the Association for Computational Linguistics) over the 20 years from 2001 to 2020, the usage of the extracted method entities is further analyzed. The study combines the classic association rule mining algorithm Apriori with chi-square values to construct a method-entity co-occurrence dataset and to analyze the evolution of method entities. The results reveal the co-occurrence relationships among method entities and their overall evolution, which can assist researchers in a given field in finding suitable research methods.
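As a rough illustration of the co-occurrence analysis step summarized above, the sketch below mines method-entity pairs from per-paper entity sets using Apriori-style support pruning followed by a 2x2 chi-square test. It is a minimal sketch only: the function name, the thresholds, and the toy entity sets are illustrative assumptions, not the study's actual implementation or data.

from itertools import combinations
from collections import Counter

def cooccurrence_pairs(papers, min_support=0.01, min_chi2=3.84):
    """Mine method-entity co-occurrence pairs from per-paper entity sets.

    papers: list of sets, each holding the method entities found in one paper.
    min_support: minimum fraction of papers a pair must appear in
                 (Apriori-style support pruning at the pair level).
    min_chi2: chi-square threshold; 3.84 corresponds to p < 0.05 at 1 degree of freedom.
    """
    n_papers = len(papers)
    entity_count = Counter(e for p in papers for e in p)
    pair_count = Counter(frozenset(c) for p in papers
                         for c in combinations(sorted(p), 2))

    results = []
    for pair, n11 in pair_count.items():
        if n11 / n_papers < min_support:
            continue  # discard infrequent pairs, as Apriori support pruning would
        a, b = tuple(pair)
        n10 = entity_count[a] - n11           # papers with a but not b
        n01 = entity_count[b] - n11           # papers with b but not a
        n00 = n_papers - n11 - n10 - n01      # papers with neither
        denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
        if denom == 0:
            continue  # degenerate contingency table
        chi2 = n_papers * (n11 * n00 - n10 * n01) ** 2 / denom
        if chi2 >= min_chi2:
            results.append((a, b, n11, chi2))
    return sorted(results, key=lambda r: r[3], reverse=True)

# Hypothetical toy input: entity sets extracted from three papers.
papers = [
    {"CRF", "BLEU", "Penn Treebank"},
    {"BiLSTM", "CRF", "CoNLL-2003", "F1"},
    {"SciBERT", "CRF", "F1"},
]
for a, b, n, chi2 in cooccurrence_pairs(papers, min_support=0.3, min_chi2=0.0):
    print(f"{a} + {b}: co-occurs in {n} papers, chi2 = {chi2:.2f}")

In practice the input sets would come from the SciBERT+CRF extractor described in the abstract, and the support and chi-square thresholds would be tuned on the full 2001-2020 ACL corpus rather than the permissive toy values used here.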