Abstract: In the course of scientific research, researchers must choose appropriate research methods for different research problems, and sometimes refine those methods to solve the problems more effectively. Research methods are therefore often the key to solving research problems and constitute important knowledge in the academic literature. Helping researchers quickly discover the method entities embedded in the full text of academic literature, and providing practical references that recommend key methods suited to their own research problems, can improve the efficiency with which researchers solve problems. Existing studies, however, lack analysis of the co-occurrence relationships among method entities and do not fully exploit the rich knowledge contained in academic literature. To address this, this study takes the field of natural language processing as an example, subdivides method entities into four types (algorithms, datasets, metrics, and tools), and annotates 50 papers as a training corpus. Four entity extraction models are built, including CRF (conditional random field) and BiLSTM (bi-directional long short-term memory)+CRF; the results show that the SciBERT (scientific bidirectional encoder representations from transformers)+CRF model performs best. Based on the full texts of papers published at the ACL conference (Annual Meeting of the Association for Computational Linguistics) over the 20 years from 2001 to 2020, the usage of the extracted method entities is further analyzed. The study combines the classic association rule mining algorithm Apriori with chi-square values to construct a method-entity co-occurrence dataset and to analyze the evolution of method entities. The results reveal the co-occurrence relationships among method entities and their overall evolution, which can assist researchers in a given field in finding suitable research methods.
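As a rough illustration of the co-occurrence analysis step summarized above, the sketch below mines method-entity pairs from per-paper entity sets using Apriori-style support pruning followed by a 2x2 chi-square test. It is a minimal sketch only: the function name, the thresholds, and the toy entity sets are illustrative assumptions, not the study's actual implementation or data.

from itertools import combinations
from collections import Counter

def cooccurrence_pairs(papers, min_support=0.01, min_chi2=3.84):
    """Mine method-entity co-occurrence pairs from per-paper entity sets.

    papers: list of sets, each holding the method entities found in one paper.
    min_support: minimum fraction of papers a pair must appear in
                 (Apriori-style support pruning at the pair level).
    min_chi2: chi-square threshold; 3.84 corresponds to p < 0.05 at 1 degree of freedom.
    """
    n_papers = len(papers)
    entity_count = Counter(e for p in papers for e in p)
    pair_count = Counter(frozenset(c) for p in papers
                         for c in combinations(sorted(p), 2))

    results = []
    for pair, n11 in pair_count.items():
        if n11 / n_papers < min_support:
            continue  # discard infrequent pairs, as Apriori support pruning would
        a, b = tuple(pair)
        n10 = entity_count[a] - n11           # papers with a but not b
        n01 = entity_count[b] - n11           # papers with b but not a
        n00 = n_papers - n11 - n10 - n01      # papers with neither
        denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
        if denom == 0:
            continue  # degenerate contingency table
        chi2 = n_papers * (n11 * n00 - n10 * n01) ** 2 / denom
        if chi2 >= min_chi2:
            results.append((a, b, n11, chi2))
    return sorted(results, key=lambda r: r[3], reverse=True)

# Hypothetical toy input: entity sets extracted from three papers.
papers = [
    {"CRF", "BLEU", "Penn Treebank"},
    {"BiLSTM", "CRF", "CoNLL-2003", "F1"},
    {"SciBERT", "CRF", "F1"},
]
for a, b, n, chi2 in cooccurrence_pairs(papers, min_support=0.3, min_chi2=0.0):
    print(f"{a} + {b}: co-occurs in {n} papers, chi2 = {chi2:.2f}")

In practice the input sets would come from the SciBERT+CRF extractor described in the abstract, and the support and chi-square thresholds would be tuned on the full 2001-2020 ACL corpus rather than the permissive toy values used here.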