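Identification of Problem and Method in Scientific Papers Based on Formulaic Expression Desensitization and Enhanced Boundary Recognition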

doi:10.3772/j.issn.1000-0135.2024.06.007

情报学报 (Journal of the China Society for Scientific and Technical Information)

2024, Vol. 43, Issue 6: 712-732

Section: Information Technology and Applications

Zhang Yingyi (张颖怡)¹, Zhang Chengzhi (章成志)²

1. Department of Archives and E-government, School of Social Science, Soochow University, Suzhou 215123
2. Department of Information Management, School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094

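Abstract: Research problems and methods are core components of academic papers, and they matter for organizing, managing, and retrieving papers as well as for evaluating research output. To mitigate two issues in problem and method identification, namely dependence on formulaic expressions and errors in recognizing word boundaries, this paper proposes a model that combines formulaic expression desensitization with enhanced boundary recognition. Specifically, formulaic expression desensitization is implemented through data augmentation, and enhanced boundary recognition is implemented with a pointer network and a sequence labeling model. With the growth of open access, researchers have begun to use the full text of academic papers for entity recognition. To demonstrate the necessity of using full text, this paper manually constructs annotated abstract and full-text datasets in the natural language processing domain, and designs numerical and content indicators to analyze how the two datasets differ in problem and method identification results and in problem-method relation pair extraction results. Ten-fold cross-validation shows that the macro-averaged F₁ score of the proposed model exceeds that of the SciBERT-BiLSTM-CRF baseline by 3.69 percentage points, and the difference is statistically significant. A comparison of entity recognition and relation pair extraction results between abstracts and full texts shows that problem and method entities in abstracts are broader in meaning, whereas full texts contain more entities and relation pairs that describe model design and training details.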
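The abstract reports results as a macro-averaged F₁ score over problem and method entities. The following is a minimal sketch of how such a score can be computed, assuming exact-match span scoring and the two labels `PROBLEM` and `METHOD`; the span representation, the label names, and the function names are illustrative assumptions, not the authors' evaluation code. Under exact matching, a predicted span with a wrong boundary counts as a miss, which is why boundary errors lower the score.

```python
from typing import Iterable, Set, Tuple

# Hypothetical span format: (sentence_id, start_token, end_token, label).
Span = Tuple[int, int, int, str]


def f1_for_label(gold: Set[Span], pred: Set[Span], label: str) -> float:
    """Exact-match F1 for one entity type."""
    gold_l = {s for s in gold if s[3] == label}
    pred_l = {s for s in pred if s[3] == label}
    true_pos = len(gold_l & pred_l)  # spans whose boundaries and label both match
    precision = true_pos / len(pred_l) if pred_l else 0.0
    recall = true_pos / len(gold_l) if gold_l else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: Set[Span], pred: Set[Span],
             labels: Iterable[str] = ("PROBLEM", "METHOD")) -> float:
    """Unweighted mean of the per-label F1 scores."""
    labels = list(labels)
    return sum(f1_for_label(gold, pred, lab) for lab in labels) / len(labels)


# Toy data (made up for illustration): one METHOD span has a boundary error.
gold_spans = {(0, 0, 2, "PROBLEM"), (0, 5, 7, "METHOD"), (1, 1, 3, "METHOD")}
pred_spans = {(0, 0, 2, "PROBLEM"), (0, 5, 6, "METHOD"), (1, 1, 3, "METHOD")}

score = macro_f1(gold_spans, pred_spans)
print(score)
```

In this toy case the `PROBLEM` F₁ is 1.0 and the `METHOD` F₁ is 0.5, because the span ending at token 6 instead of 7 matches nothing in the gold set, so the macro average is 0.75.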

Keywords: knowledge entity recognition, research problem and method identification, pointer network, data augmentation

Received: 2023-07-24

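Funding: National Natural Science Foundation of China project "Research on Fine-Grained Algorithm Entity Extraction and Evaluation Based on the Full-Text Content of Academic Literature" (72074113).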

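About the authors: Zhang Yingyi, female, born in 1992, PhD, lecturer; her main research areas are academic text mining and natural language processing. Zhang Chengzhi, corresponding author, male, born in 1977, PhD, professor and doctoral supervisor; his main research areas are information organization, information retrieval, data mining, and natural language processing. E-mail: zhangcz@njust.edu.cn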

Cite this article:

Zhang Yingyi, Zhang Chengzhi. Identification of Problem and Method in Scientific Papers Based on Formulaic Expression Desensitization and Enhanced Boundary Recognition[J]. 情报学报 (Journal of the China Society for Scientific and Technical Information), 2024, 43(6): 712-732.

Link to this article:

https://qbxb.istic.ac.cn/CN/10.3772/j.issn.1000-0135.2024.06.007 or https://qbxb.istic.ac.cn/CN/Y2024/V43/I6/712
