Identification of Problem and Method in Scientific Papers Based on Formulaic Expression Desensitization and Enhanced Boundary Recognition
Zhang Yingyi¹, Zhang Chengzhi²
1. Department of Archives and E-government, School of Social Science, Soochow University, Suzhou 215123
2. Department of Information Management, School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094
Abstract Problems and methods are core components of scientific papers and play a significant role in their organization, management, retrieval, and evaluation. To reduce the dependency on formulaic expressions and the boundary recognition errors that affect existing approaches to problem and method recognition, we propose a model that combines formulaic expression desensitization with enhanced boundary recognition. Specifically, formulaic expression desensitization is achieved through data augmentation, whereas boundary enhancement combines pointer networks with sequence labeling models. As more scientific papers become openly accessible, researchers increasingly use full texts for entity recognition tasks. To demonstrate the value of full texts, we manually annotate both abstracts and full texts of papers in the field of natural language processing, and design numerical and content-based metrics to compare the problems, methods, and problem-method relationships extracted from the two datasets. Ten-fold cross-validation experiments show that the proposed model significantly outperforms baseline models such as SciBERT-BiLSTM-CRF, improving the macro-average F1 score by 3.69 percentage points. A comparison of the entity recognition and relation extraction results between abstracts and full texts shows that problem and method entities in abstracts are semantically broader, whereas full texts contain more detailed entities and relationships that describe model design and training procedures.
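The macro-average F1 score reported above can be illustrated with a minimal sketch. It assumes exact-match scoring over entity spans and two labels, "problem" and "method"; the abstract does not state the matching criterion, so the scoring scheme, the tuple layout, and the toy data are all assumptions made for illustration rather than the authors' evaluation code.

```python
from typing import List, Set, Tuple

# Hypothetical entity layout: (sentence_id, start_token, end_token, label).
# Exact span matching is an assumption, not taken from the paper.
Entity = Tuple[int, int, int, str]


def f1_for_label(gold: Set[Entity], pred: Set[Entity], label: str) -> float:
    """Exact-match F1 for a single entity type."""
    gold_l = {e for e in gold if e[3] == label}
    pred_l = {e for e in pred if e[3] == label}
    true_pos = len(gold_l & pred_l)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_l)
    recall = true_pos / len(gold_l)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: Set[Entity], pred: Set[Entity], labels: List[str]) -> float:
    """Unweighted mean of the per-label F1 scores."""
    return sum(f1_for_label(gold, pred, label) for label in labels) / len(labels)


# Toy data invented for illustration: one method span has a wrong right boundary.
gold = {(0, 2, 4, "problem"), (0, 7, 9, "method"), (1, 0, 1, "method")}
pred = {(0, 2, 4, "problem"), (0, 7, 8, "method"), (1, 0, 1, "method")}
score = macro_f1(gold, pred, ["problem", "method"])
print(f"macro-F1 = {score:.2f}")
```

In this toy case the single boundary error halves the F1 for the "method" label, which is why boundary mistakes weigh heavily on exact-match, macro-averaged scores.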
Received: 24 July 2023