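0 Introduction
Sample No. | Sentence from academic paper | Manually annotated method term | Method term identified by the sequence labeling model |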
1 | We propose that learning the contexts for the application of these linguistic operations can be viewed as per-operation classification problems. | Null | learning the contexts |
2 | In this paper, we explore a new theory of discourse structure that stresses the role of purpose and processing in discourse. | theory of discourse structure | structure |
3 | combining speech recognition and natural language processing to achieve speech understanding. | speech recognition | speech recognition and |
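Note: Underlining marks manually annotated method terms, and bold marks formulaic expressions in the sentence; Null indicates an empty result.
1 Overview of Related Research
1.1 Overview of Research on Entity Recognition Models for Academic Papers
1.1.1 Overview of data augmentation methods for entity recognition
1.1.2 Overview of boundary-enhanced models for entity recognition
1.2 Overview of Comparative Research on Entity Recognition Results from Abstracts and Full Texts
1.2.1 Biological relation pair discovery task
1.2.2 Keyword extraction task
1.2.3 Topic identification task
1.3 Summary of Existing Related Work
2 Research Methods
Figure 1 Research framework
2.1 Problem and Method Entity Recognition Model for Academic Papers
2.1.1 Data augmentation model based on formulaic expression desensitization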
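The data augmentation method consists of four steps. First, dependency parsing of sentences: the dependency parsing module in Stanford CoreNLP is used to obtain the dependency tree of each sentence in the training set. Second, dependency subtree extraction for problem and method entities: this step builds the dependency subtrees of the problems and methods in each sentence. Third, construction of the formulaic expression lexicon: this step takes the parent node of each problem or method dependency subtree, together with the preposition that links the parent node to the subtree, as formulaic expression words and uses them to build the lexicon. Fourth, word replacement: this step randomly selects, at a given ratio, words in the sentence that appear in the formulaic expression lexicon and replaces them; in this paper they are uniformly replaced with the token "_". Word replacement must avoid the problem and method entities in the sentence. The selection of formulaic expressions in this paper is inspired by Dang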
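Figure 2 Formulaic expression selection strategy in the data augmentation method based on formulaic expression desensitization
Figure 3 Word quality discriminator in the data augmentation method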
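The word replacement step (the fourth step above) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `mask_formulaic_words`, the `(start, end)` span format, and the example lexicon are assumptions introduced here, and the lexicon is taken as given rather than built from dependency trees.

```python
import random

MASK = "_"  # replacement token named in the text


def mask_formulaic_words(tokens, entity_spans, lexicon, ratio, rng):
    """Mask a share of formulaic words, never touching entity tokens.

    tokens: list of words in one sentence.
    entity_spans: (start, end) index pairs, end exclusive (assumed format).
    lexicon: set of lowercase formulaic expression words (assumed given).
    ratio: share of eligible words to replace, between 0 and 1.
    rng: random.Random instance, passed in for reproducibility.
    """
    # Token positions inside problem/method entities are protected.
    protected = {i for start, end in entity_spans for i in range(start, end)}
    candidates = [
        i for i, tok in enumerate(tokens)
        if i not in protected and tok.lower() in lexicon
    ]
    n_replace = int(len(candidates) * ratio)
    chosen = set(rng.sample(candidates, n_replace))
    return [MASK if i in chosen else tok for i, tok in enumerate(tokens)]


tokens = "we propose a parsing method for machine translation".split()
augmented = mask_formulaic_words(
    tokens,
    entity_spans=[(3, 5), (6, 8)],  # "parsing method", "machine translation"
    lexicon={"propose", "for", "method"},  # hypothetical lexicon
    ratio=1.0,
    rng=random.Random(0),
)
print(augmented)
```

With `ratio=1.0` the call prints `['we', '_', 'a', 'parsing', 'method', '_', 'machine', 'translation']`: "method" is in the hypothetical lexicon but is left intact because it lies inside an entity span.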
2.1.2 Sequence labeling model with enhanced boundary recognition
Figure 4 Sequence labeling model with enhanced boundary recognition
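First, prediction of the start position of problem entities. Suppose there is a word sequence in which each element denotes a word in a given sentence of the document. To begin with, the word sequence is passed through the pretrained language model and converted into a vector representation for each word in the sentence. Next, these vectors are fed into a BiLSTM encoder, which yields the output at each time step of the encoder's last layer and the pair of outputs at the last time step of each layer. Then the encoder outputs are fed into an LSTM (long short-term memory) decoder to obtain the decoder hidden states. Finally, the decoder hidden states and the BiLSTM layer outputs are fed together into the attention module, namely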
(1) |
(2) |
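In this module, the maximum value in each dimension across the multiple weight vectors is first selected to form a new weight vector. Second, the new weight vector is multiplied by the BiLSTM outputs to obtain a new vector representation for each word in the sentence. Next, two add & norm layers and one feed-forward layer are used to refine this representation. Finally, the refined representation is fed into a conditional random field (CRF) for label prediction, which yields the label of each word. The CRF imposes constraints on the predicted labels so that the order of consecutive labels is plausible.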
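The weight pooling and reweighting described above, and the kind of label-order constraint a CRF is meant to encourage, can be sketched in plain Python. This is an illustration under stated assumptions: the function names, the toy numbers, and the BIO tag names (`B-PROB`, `I-METH`, and so on) are hypothetical, and the add & norm and feed-forward layers are omitted.

```python
def max_pool_weights(weight_vectors):
    """Element-wise maximum over several attention weight vectors.

    Each vector holds one weight per token in the sentence.
    """
    return [max(column) for column in zip(*weight_vectors)]


def reweight(outputs, weights):
    """Scale each token's BiLSTM output vector by its pooled weight."""
    return [[w * x for x in vec] for vec, w in zip(outputs, weights)]


def violates_bio(tags):
    """Return True if an I- tag does not continue an entity of the same type.

    A CRF learns transition scores that discourage such sequences;
    this hard check only illustrates the constraint.
    """
    prev = "O"
    for tag in tags:
        if tag.startswith("I-") and prev[2:] != tag[2:]:
            return True
        prev = tag
    return False


# Two hypothetical weight vectors over a two-token sentence.
pooled = max_pool_weights([[0.25, 1.0], [0.5, 0.5]])
new_repr = reweight([[2.0, 4.0], [1.0, 3.0]], pooled)
print(pooled, new_repr)
print(violates_bio(["B-PROB", "I-PROB", "O"]), violates_bio(["O", "I-METH"]))
```

The pooled vector keeps, for every token, the strongest weight assigned by any of the input vectors, so a token that matters to at least one boundary prediction retains its emphasis in the new representation.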
2.2 Construction of the Manually Annotated Dataset of Problem and Method Entities
2.2.1 Dataset selection
2.2.2 Definitions of problem and method
Type | Subtype | Explanation and example | Literature basis |
Problem | Difficulty | A phenomenon that science has not yet explained, or a difficulty that remains unsolved. Example: The performance of middle-paused punctuation prediction is fairly low between all methods, which shows predicting middle-paused punctuations is a difficult task. | |
Problem | Research task | Exploratory work that the authors aim to carry out. Example: This paper presents a new approach to statistical sentence generation in which alternative phrases are represented as packed sets of trees, or forests, and then ranked statistically to choose the best one. | |
Problem | Obstacle/gap | The gap between existing or proposed methods and the ideal. Example: The conceptual retrieval systems, though quite effective, are not yet mature enough to be considered in serious information retrieval applications, the major problems being their extreme inefficiency and the need for manual encoding of domain knowledge (Mauldin, 1991). | |
Method | Model/system/framework | Models, systems, frameworks, and the like. Research method terms of this kind often carry a fixed suffix, such as algorithm, system, approach, model, or framework. Example: In this paper, we describe the pronominal anaphora resolution module of Lucy, a portable English understanding system. | |
Method | Tool/library | Names of implemented technologies and libraries, such as PyTorch. Example: STTK, a statistical machine translation toolkit, will be introduced and used to build a working translation system. | |
Method | Dataset/corpus | Data or data products, such as the Yelp dataset. Example: Our training data of transition-based dependency trees are converted from phrasal structure trees in English Web Treebank (LDC2012T13) and the English portion of OntoNotes 4.0 (LDC2011T03) by the Stanford Conversion toolkit (Marneffe et al., 2006). | |
Method | Evaluation metric | Evaluation metrics and tools, such as accuracy. Example: BLEU is based on n-gram precision, and since each synchronous constituent in the tree adds a new 4-gram to the translation at the point where its children are concatenated, the additional pass approximately maximizes BLEU. | |
Method | Operation | A concrete action performed to solve a problem; this type often appears as a gerund phrase. Example: In order to understand the described world, the authors try to reconstruct the geometric model of the global scene from the scenic descriptions drawing a space. | |
Method | Other | Methods not covered by the categories above, such as model features like position information. Example: For example, our extraposition model presented above depends upon the value of the verb-position feature, which is predicted upstream in the pipeline. | |
Method | Research method | General, broad methods, such as experimentation, modeling, and questionnaire surveys. Example: The machine learning approach also facilitates adaptation of the system to a new domain or language. | |
Note: Underlining marks the research problem and research method entities in the sentences.
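2.2.3 Data annotation procedure
The annotation guidelines for the experimental dataset were designed according to the definitions of problem and method in Section 2.2.2. Beyond the constraints imposed by the definitions, other cases must also be considered, for example, whether definite articles should be annotated and whether content in parentheses should be annotated. The ACL RD-TEC data annotation guidelines
Terms were annotated on the basis of the problem and method definitions and the ACL RD-TEC data annotation guidelines. Before formal annotation, two annotators whose research focuses on text mining and natural language processing were recruited: one master's student and one doctoral student. The two annotators independently annotated 30 abstracts randomly selected from the abstract dataset. After the annotation was completed, F1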
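Dataset | Problem: annotation errors (count) | Problem: annotation errors (%) | Problem: prediction errors (count) | Problem: prediction errors (%) | Method: annotation errors (count) | Method: annotation errors (%) | Method: prediction errors (count) | Method: prediction errors (%) |
Annotated abstract dataset | 52 | 12.01 | 381 | 87.99 | 26 | 4.97 | 497 | 95.03 |
Annotated full-text dataset | 185 | 11.46 | 1430 | 88.54 | 214 | 8.87 | 2198 | 91.13 |
Dataset | Problem: total | Problem: average per paper | Method: total | Method: average per paper |
Annotated abstract dataset | 1214 | 3.98 | 1725 | 5.66 |
Annotated full-text dataset | 4284 | 57.89 | 6914 | 93.43 |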
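3 Experiments and Result Analysis
3.1 Selection of Baseline Models for Problem and Method Entity Recognition
3.1.1 Baseline models for the data augmentation method based on formulaic expression desensitization
(4) DAGA (data augmentation with a generation approach)-training data generation
3.1.2 Baseline models for the sequence labeling model with enhanced boundary recognition
3.2 Experimental Parameter Settings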
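(2) Parameter settings for the word2vec-based models. The number of training epochs is 20 for the abstract dataset and 10 for the full-text dataset, the batch size is 32, the learning rate is 0.005, and the BiLSTM has 200 hidden units.
(3) Parameter settings for the SciBERT-based models. The epochs and batch size are the same as for the word2vec-based models. The learning rate is 3e-5, the maximum sentence length is limited to 512, and the BiLSTM has 150 hidden units. In the sequence labeling model with enhanced boundary recognition, the three loss-function weights are set to 0.4, 0.3, and 0.3, respectively. Ten-fold cross-validation is used in the experiments.
(5) ChatGPT version. The version used is GPT-3.5-turbo, the large language model behind ChatGPT. The OpenAI API is used to call the model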
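The ten-fold cross-validation mentioned in the parameter settings can be sketched as below. This is a generic illustration, not the authors' splitting code: the function name `k_fold_splits`, the shuffling seed, and the choice of splitting by item index are assumptions, since the text does not say whether folds are formed over sentences or over papers.

```python
import random


def k_fold_splits(n_items, k=10, seed=0):
    """Return k (train_indices, test_indices) pairs over n_items items.

    Items are shuffled once with a fixed seed (an assumption), then dealt
    round-robin into k folds whose sizes differ by at most one.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for held_out in range(k):
        test = folds[held_out]
        train = [
            idx
            for fold_id, fold in enumerate(folds)
            if fold_id != held_out
            for idx in fold
        ]
        splits.append((train, test))
    return splits


splits = k_fold_splits(n_items=25, k=10)
for train, test in splits[:2]:
    print(len(train), len(test))
```

Each item appears in exactly one test fold, so averaging a metric over the ten runs uses every annotated item once for evaluation.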
3.3 Evaluation Method
(3) |
(4) |
(5) |
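The result tables in Section 3.4 report precision (P), recall (R), and F1 for problem and method entities, together with macro values. The sketch below shows one common way to compute such scores; it assumes exact-match entities encoded as `(sentence_id, start, end, type)` tuples and a macro score taken as the unweighted mean over entity types. Both choices, and all names in the code, are assumptions rather than the paper's stated definitions.

```python
def prf(gold, pred):
    """Precision, recall, and F1 over two sets of exact-match entities."""
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def evaluate(gold, pred, types=("problem", "method")):
    """Per-type scores plus their unweighted (macro) average."""
    scores = {}
    for t in types:
        gold_t = {e for e in gold if e[-1] == t}
        pred_t = {e for e in pred if e[-1] == t}
        scores[t] = prf(gold_t, pred_t)
    scores["macro"] = tuple(
        sum(scores[t][i] for t in types) / len(types) for i in range(3)
    )
    return scores


# Hypothetical entities: (sentence_id, start, end, type)
gold = {(0, 0, 2, "problem"), (0, 4, 6, "method"), (1, 1, 3, "method")}
pred = {(0, 0, 2, "problem"), (0, 4, 6, "method"),
        (1, 2, 3, "method"), (1, 5, 6, "problem")}
scores = evaluate(gold, pred)
print(scores)
```

In this toy case the predicted span `(1, 2, 3, "method")` overlaps a gold entity but does not match it exactly, so it counts as both a false positive and a missed entity, which is the behavior boundary errors would have under exact matching.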
3.4 Experimental Results
3.4.1 Performance analysis of the data augmentation model based on formulaic expression desensitization
Model | Problem P | Problem R | Problem F1 | Method P | Method R | Method F1 | Macro P | Macro R | Macro F1 |
SciBERT-BiLSTM-CRF | 65.14 | 73.00 | 68.69 | 61.92 | 72.38 | 66.67 | 63.11 | 72.65 | 67.68 |
Dictionary-entity replacement | 68.43 | 72.99 | 70.56 | 70.68 | 69.26 | 69.84 | 69.66 | 70.82 | 70.19 |
MASS-context replacement | 65.39 | 75.16 | 69.82 | 65.70 | 72.92 | 69.03 | 65.54 | 74.04 | 69.43 |
MASS-entity replacement | 65.36 | 72.95 | 68.85 | 66.68 | 70.15 | 68.23 | 66.02 | 71.55 | 68.54 |
DAGA-training data generation | 66.59 | 74.51 | 70.22 | 61.47 | 73.66 | 66.87 | 64.03 | 74.08 | 68.54 |
ChatGPT-training data generation | 64.88 | 73.41 | 68.78 | 62.16 | 73.31 | 67.25 | 63.52 | 73.36 | 68.02 |
Formulaic expression desensitization | 68.17 | 74.66 | 71.12 | 67.25 | 73.45 | 70.15 | 67.71 | 74.06 | 70.64 |
Note: Bold indicates the best result among the models for the corresponding metric.
Model | Problem P | Problem R | Problem F1 | Method P | Method R | Method F1 | Macro P | Macro R | Macro F1 |
SciBERT-BiLSTM-CRF | 63.14 | 72.26 | 67.32 | 67.23 | 71.98 | 69.47 | 65.18 | 72.12 | 68.39 |
Dictionary-entity replacement | 70.90 | 68.10 | 69.37 | 73.64 | 66.99 | 70.08 | 72.27 | 67.54 | 69.72 |
MASS-context replacement | 67.80 | 68.02 | 67.86 | 72.78 | 66.22 | 69.31 | 70.29 | 67.12 | 68.58 |
MASS-entity replacement | 68.41 | 67.23 | 67.74 | 71.84 | 67.21 | 69.36 | 70.13 | 67.22 | 68.55 |
DAGA-training data generation | 62.52 | 74.68 | 67.99 | 63.52 | 74.91 | 68.65 | 61.52 | 74.46 | 67.33 |
ChatGPT-training data generation | 64.18 | 71.50 | 67.55 | 67.88 | 71.42 | 69.55 | 66.03 | 71.46 | 68.55 |
Formulaic expression desensitization | 64.77 | 73.69 | 68.83 | 67.22 | 73.98 | 70.43 | 66.66 | 73.83 | 69.99 |
Note: Bold indicates the best result among the models for the corresponding metric.
3.4.2 Performance analysis of the sequence labeling model with enhanced boundary recognition
Model | Problem P | Problem R | Problem F1 | Method P | Method R | Method F1 | Macro P | Macro R | Macro F1 |
word2vec-BiLSTM | 49.73 | 49.22 | 49.31 | 53.99 | 55.56 | 54.61 | 51.94 | 52.87 | 51.96 |
word2vec-boundary-aware | 49.37 | 47.94 | 48.57 | 48.16 | 56.13 | 51.69 | 48.77 | 52.03 | 50.13 |
char-boundary-aware | 51.73 | 49.55 | 50.49 | 49.31 | 57.84 | 52.93 | 50.52 | 53.70 | 51.71 |
SciBERT-BiLSTM-CRF | 65.14 | 73.00 | 68.69 | 61.92 | 72.38 | 66.67 | 63.11 | 72.65 | 67.68 |
BART | 67.72 | 70.25 | 68.91 | 64.99 | 70.00 | 67.26 | 66.35 | 70.12 | 68.08 |
seq2seq | 54.33 | 62.57 | 58.14 | 48.26 | 54.09 | 50.94 | 51.30 | 58.33 | 54.54 |
ChatGPT-prompt | 21.24 | 11.13 | 14.58 | 26.88 | 25.85 | 26.25 | 24.06 | 18.49 | 20.41 |
Enhanced boundary recognition | 64.08 | 76.11 | 69.47 | 63.92 | 72.38 | 67.60 | 64.00 | 74.24 | 68.54 |
Note: Bold indicates the best result among the models for the corresponding metric.
Model | Problem P | Problem R | Problem F1 | Method P | Method R | Method F1 | Macro P | Macro R | Macro F1 |
word2vec-BiLSTM | 53.24 | 54.43 | 53.74 | 56.88 | 55.57 | 56.18 | 55.40 | 55.12 | 54.96 |
word2vec-boundary-aware | 53.93 | 52.30 | 53.00 | 59.12 | 56.41 | 57.66 | 56.52 | 54.36 | 55.33 |
char-boundary-aware | 53.47 | 55.58 | 54.37 | 56.71 | 57.89 | 57.27 | 55.09 | 56.74 | 55.82 |
SciBERT-BiLSTM-CRF | 63.14 | 72.26 | 67.32 | 67.23 | 71.98 | 69.47 | 65.18 | 72.12 | 68.39 |
BART | 63.56 | 74.37 | 68.49 | 63.65 | 74.30 | 68.51 | 63.61 | 74.34 | 68.50 |
seq2seq | 57.35 | 78.69 | 66.31 | 45.98 | 65.93 | 54.16 | 51.66 | 72.31 | 60.24 |
ChatGPT-prompt | 13.92 | 12.92 | 13.39 | 18.11 | 32.81 | 23.33 | 16.02 | 22.87 | 18.36 |
Enhanced boundary recognition | 63.63 | 73.72 | 68.21 | 67.96 | 71.79 | 69.73 | 65.79 | 72.75 | 68.97 |
Note: Bold indicates the best result among the models for the corresponding metric.
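3.4.3 Performance analysis of the joint model
Model | Problem P | Problem R | Problem F1 | Method P | Method R | Method F1 | Macro P | Macro R | Macro F1 |
Dictionary-entity replacement | 68.43 | 72.99 | 70.56 | 70.68 | 69.26 | 69.84 | 69.66 | 70.82 | 70.19 |
Formulaic expression desensitization | 68.17 | 74.66 | 71.12 | 67.25 | 73.45 | 70.15 | 67.71 | 74.06 | 70.64 |
Enhanced boundary recognition | 64.08 | 76.11 | 69.47 | 63.92 | 72.38 | 67.60 | 64.00 | 74.24 | 68.54 |
Enhanced boundary recognition + dictionary-entity replacement | 68.68 | 74.74 | 71.50 | 69.77 | 71.89 | 70.75 | 69.22 | 73.32 | 71.1 |
Enhanced boundary recognition + formulaic expression desensitization | 68.48 | 77.06 | 72.48 | 67.64 | 73.19 | 70.26 | 68.06 | 75.12 | 71.3 |
Note: Bold indicates the best result among the models for the corresponding metric. ** indicates a significant difference from the SciBERT-BiLSTM-CRF model, with P<0.01.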
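Model | Problem P | Problem R | Problem F1 | Method P | Method R | Method F1 | Macro P | Macro R | Macro F1 |
Dictionary-entity replacement | 70.90 | 68.10 | 69.37 | 73.64 | 66.99 | 70.08 | 72.27 | 67.54 | 69.72 |
Formulaic expression desensitization | 64.77 | 73.69 | 68.83 | 67.22 | 73.98 | 70.43 | 66.66 | 73.83 | 69.99 |
Enhanced boundary recognition | 63.63 | 73.72 | 68.21 | 67.96 | 71.79 | 69.73 | 65.79 | 72.75 | 68.97 |
Enhanced boundary recognition + dictionary-entity replacement | 71.83 | 67.80 | 69.69 | 74.11 | 67.50 | 70.60 | 72.97 | 67.65 | 70.1 |
Enhanced boundary recognition + formulaic expression desensitization | 68.42 | 71.81 | 70.01 | 70.97 | 71.83 | 71.37 | 69.70 | 71.82 | 70.6 |
Note: Bold indicates the best result among the models for the corresponding metric. * and ** indicate a significant difference from the SciBERT-BiLSTM-CRF model, where * denotes P<0.05 and ** denotes P<0.01.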
3.4.4 Performance analysis of the proposed models on datasets from other domains
Entity type | Agriculture | Earth science | Mathematics |
Process | research; model; in; induce; focus | release; analyse; play; dominate; review | construction; consider; extension; calculation; approach |
Data | increase; concentration; at; level; measure | show; illustrate; evaluate; explain; from | prove; construction; show; use; result |
Material | use; establish; input; system; of | activity; record; correlation; set; associate | sequence; group; surface; set; closure |
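Domain | Model | Process F1 | Data F1 | Material F1 | Macro P | Macro R | Macro F1 |
Astronomy | SciBERT-BiLSTM-CRF | 8.58 | 44.14 | 42.90 | 28.07 | 46.44 | 31.87 |
Astronomy | Formulaic expression desensitization | 7.71 | 49.27 | 49.48 | 32.02 | 46.40 | 35.48 |
Astronomy | Enhanced boundary recognition | 7.44 | 54.96 | 61.79 | 38.72 | 50.27 | 41.40 |
Astronomy | Enhanced boundary recognition + formulaic expression desensitization | 8.81 | 54.30 | 62.91 | 39.55 | 53.51 | 42.01 |
Agriculture | SciBERT-BiLSTM-CRF | 12.63 | 28.84 | 32.69 | 20.63 | 41.90 | 24.72 |
Agriculture | Formulaic expression desensitization | 12.73 | 27.48 | 41.49 | 23.77 | 45.66 | 27.24 |
Agriculture | Enhanced boundary recognition | 18.71 | 33.97 | 32.62 | 24.07 | 41.81 | 28.44 |
Agriculture | Enhanced boundary recognition + formulaic expression desensitization | 17.58 | 34.27 | 42.01 | 27.18 | 42.38 | 31.29 |
Biology | SciBERT-BiLSTM-CRF | 18.74 | 30.31 | 47.00 | 27.61 | 45.97 | 32.02 |
Biology | Formulaic expression desensitization | 16.60 | 29.28 | 56.88 | 29.76 | 47.11 | 34.25 |
Biology | Enhanced boundary recognition | 15.45 | 33.00 | 64.19 | 33.52 | 48.81 | 37.55 |
Biology | Enhanced boundary recognition + formulaic expression desensitization | 15.38 | 37.57 | 64.32 | 35.38 | 50.43 | 39.09 |
Chemistry | SciBERT-BiLSTM-CRF | 9.84 | 27.08 | 42.51 | 22.81 | 40.30 | 26.47 |
Chemistry | Formulaic expression desensitization | 8.53 | 29.70 | 45.56 | 24.81 | 41.04 | 27.93 |
Chemistry | Enhanced boundary recognition | 10.02 | 29.30 | 42.09 | 23.32 | 41.64 | 27.14 |
Chemistry | Enhanced boundary recognition + formulaic expression desensitization | 10.25 | 34.24 | 48.99 | 27.83 | 44.58 | 31.16 |
Computer science | SciBERT-BiLSTM-CRF | 12.85 | 38.61 | 26.15 | 22.36 | 42.27 | 25.87 |
Computer science | Formulaic expression desensitization | 12.99 | 44.14 | 26.78 | 24.73 | 44.83 | 27.97 |
Computer science | Enhanced boundary recognition | 13.22 | 46.00 | 27.98 | 24.95 | 45.82 | 29.07 |
Computer science | Enhanced boundary recognition + formulaic expression desensitization | 13.66 | 47.51 | 34.10 | 27.90 | 48.11 | 31.76 |
Earth science | SciBERT-BiLSTM-CRF | 7.89 | 37.08 | 47.31 | 27.97 | 39.97 | 30.76 |
Earth science | Formulaic expression desensitization | 7.05 | 36.90 | 56.71 | 31.55 | 39.72 | 33.55 |
Earth science | Enhanced boundary recognition | 6.00 | 37.01 | 56.54 | 30.66 | 40.46 | 33.19 |
Earth science | Enhanced boundary recognition + formulaic expression desensitization | 8.01 | 39.45 | 56.44 | 32.09 | 45.31 | 34.63 |
Engineering | SciBERT-BiLSTM-CRF | 15.41 | 50.08 | 64.37 | 40.10 | 55.60 | 43.29 |
Engineering | Formulaic expression desensitization | 12.67 | 55.56 | 64.27 | 42.02 | 54.46 | 44.16 |
Engineering | Enhanced boundary recognition | 14.97 | 56.49 | 64.35 | 42.84 | 60.11 | 45.27 |
Engineering | Enhanced boundary recognition + formulaic expression desensitization | 14.35 | 57.41 | 66.19 | 43.21 | 62.13 | 45.98 |
Materials science | SciBERT-BiLSTM-CRF | 7.72 | 42.91 | 56.77 | 32.04 | 50.42 | 35.80 |
Materials science | Formulaic expression desensitization | 8.49 | 48.44 | 62.35 | 35.99 | 54.91 | 39.76 |
Materials science | Enhanced boundary recognition | 10.37 | 51.14 | 60.86 | 37.17 | 56.31 | 40.79 |
Materials science | Enhanced boundary recognition + formulaic expression desensitization | 9.98 | 52.25 | 69.45 | 40.64 | 58.47 | 43.89 |
Mathematics | SciBERT-BiLSTM-CRF | 1.25 | 28.25 | 15.46 | 12.81 | 21.25 | 14.99 |
Mathematics | Formulaic expression desensitization | 4.28 | 25.78 | 17.16 | 13.34 | 28.71 | 15.74 |
Mathematics | Enhanced boundary recognition | 2.31 | 40.45 | 15.07 | 16.45 | 33.65 | 19.28 |
Mathematics | Enhanced boundary recognition + formulaic expression desensitization | 2.62 | 38.01 | 17.32 | 18.20 | 24.39 | 19.32 |
Medicine | SciBERT-BiLSTM-CRF | 16.29 | 32.42 | 15.81 | 19.72 | 25.50 | 21.51 |
Medicine | Formulaic expression desensitization | 28.99 | 26.47 | 12.46 | 21.58 | 25.18 | 22.64 |
Medicine | Enhanced boundary recognition | 5.07 | 37.92 | 40.41 | 26.46 | 31.16 | 27.80 |
Medicine | Enhanced boundary recognition + formulaic expression desensitization | 14.84 | 38.27 | 38.17 | 28.71 | 33.96 | 30.43 |
Note: Bold indicates the best result among the models for the corresponding metric.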
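4 Comparative Analysis of Problem and Method Entity Recognition Results from Abstracts and Full Texts
4.1 Data Preparation and Entity Recognition
The abstracts and full texts of ACL conference papers from 1979 to 2020 are used as the dataset. Papers from 1979 to 2015 are openly available from the ACL Anthology, and papers from 2016 to 2020 were collected by our research group, for a total of 7347 papers. Because PDF parsing errors left some papers without an abstract, those papers were removed, leaving 6749 papers after filtering. Each paper is divided into abstract data and full-text data: the abstract data comprise the title and abstract, and the full-text data comprise the title, abstract, and body.
4.2 Identification of Problem-Method Relation Pairs
4.3 Design of Analysis Indicators
This section designs two kinds of analysis indicators: numerical indicators and content indicators. The numerical indicators comprise an entity count indicator and a relation pair count indicator, and the content indicators comprise a high-frequency entity indicator and a high-frequency relation pair indicator. The entity count and relation pair count indicators are the average numbers of disambiguated entities and relation pairs contained in each paper. The high-frequency entity and high-frequency relation pair indicators are the entities and relation pairs whose mention frequency ranks in the Top N. Mention frequency is computed over disambiguated entities and counts only the number of papers that mention an entity; multiple mentions within one paper are not counted.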
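The two kinds of indicators defined above can be sketched as follows: a per-paper average of distinct entities, and a Top N ranking that counts each entity at most once per paper. The function names and the toy data are hypothetical, and the input is assumed to be already disambiguated.

```python
from collections import Counter


def average_entities_per_paper(paper_entities):
    """Mean number of distinct (disambiguated) entities per paper."""
    if not paper_entities:
        return 0.0
    return sum(len(set(ents)) for ents in paper_entities) / len(paper_entities)


def top_n_by_paper_count(paper_entities, n):
    """Rank entities by the number of papers that mention them.

    Repeated mentions inside one paper count once, as the text specifies.
    """
    counts = Counter()
    for ents in paper_entities:
        counts.update(set(ents))
    return counts.most_common(n)


# Hypothetical disambiguated entities for three papers.
papers = [
    ["parsing task", "parsing task", "machine translation task"],
    ["parsing task", "natural language processing"],
    ["parsing task", "machine translation task"],
]
print(average_entities_per_paper(papers))
print(top_n_by_paper_count(papers, 2))
```

The same functions apply to relation pairs if each pair is represented as a hashable value, for example a `(problem, method)` tuple.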
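4.4 Comparative Analysis of Problem and Method Entity Recognition Results
4.4.1 Numerical indicator analysis
Term type | Indicator | Semantic period: abstract | Semantic period: full text | Traditional machine learning period: abstract | Traditional machine learning period: full text | Deep learning period: abstract | Deep learning period: full text |
Problem terms | Type indicator | 1.53 | 6.12 | 1.94 | 11.36 | 2.12 | 14.96 |
Method terms | Type indicator | 3.18 | 24.83 | 3.85 | 39.39 | 4.61 | 54.47 |
Figure 5 Distribution of the number of entities per paper in the abstract and full-text datasets across the three periods
Relation type | Semantic period: abstract | Semantic period: full text | Traditional machine learning period: abstract | Traditional machine learning period: full text | Deep learning period: abstract | Deep learning period: full text |
Average number of relation pairs | 0.24 | 2.56 | 0.46 | 6.93 | 0.54 | 10.41 |
Figure 6 Distribution of the number of relation pairs per paper in the abstract and full-text datasets across the three periods
4.4.2 Content indicator analysis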
Dataset | Semantic period: term | Count | Traditional machine learning period: term | Count | Deep learning period: term | Count |
Abstract | parsing task | 54 | natural language processing | 178 | natural language processing | 351 |
natural language processing | 33 | parsing task | 138 | machine translation task | 150 | |
machine translation task | 17 | machine translation task | 129 | named entity recognition | 115 | |
semantic understanding | 16 | statistical machine translation task | 117 | question answering | 103 | |
semantic representation | 9 | classification task | 72 | classification task | 102 | |
computational linguistics task | 8 | part of speech tagging | 57 | sentiment classification | 97 | |
natural language generation | 7 | information retrieval task | 50 | parsing task | 77 | |
resolve syntactic ambiguity | 6 | speech recognition task | 48 | neural machine translation task | 74 | |
natural language understanding | 6 | information extraction task | 46 | learning word representation | 66 | |
speech recognition task | 5 | word alignment task | 43 | text classification task | 57 | |
Full text | parsing task | 207 | natural language processing | 1313 | natural language processing | 2180 |
natural language processing | 147 | machine translation task | 845 | machine translation task | 1128 | |
machine translation task | 75 | parsing task | 821 | classification task | 1097 | |
computational linguistics task | 72 | classification task | 788 | named entity recognition | 843 | |
classification task | 71 | computational linguistics task | 480 | question answering | 683 | |
semantic representation | 58 | named entity recognition | 463 | learning word representation | 662 | |
information retrieval task | 54 | information retrieval task | 451 | parsing task | 576 | |
semantic understanding | 47 | statistical machine translation task | 391 | sentiment classification | 527 | |
syntactic representation | 47 | part of speech tagging | 390 | learning model parameter | 487 | |
knowledge representation | 46 | data sparsity | 361 | feature learning | 454 |
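Note: Bold marks high-frequency terms that appear only in the abstracts or only in the full texts of each period.
Dataset | Semantic period: term | Count | Traditional machine learning period: term | Count | Deep learning period: term | Count |
Abstract | parsing method | 67 | machine translation approach | 280 | neural network model | 432 |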
computation method | 35 | parsing method | 184 | embedding technique | 262 | |
machine translation approach | 22 | classification method | 129 | machine translation approach | 222 | |
grammar based approach | 18 | clustering method | 109 | classification method | 123 | |
natural language processing method | 17 | bilingual parallel corpus | 103 | parsing method | 122 | |
parse tree | 14 | part of speech tagging | 92 | BERT | 115 | |
complexity metric | 12 | accuracy measure | 88 | word embedding method | 114 | |
statistical approach | 12 | machine learning model | 74 | attention mechanism | 111 | |
computational linguistic approach | 12 | F1 score metric | 71 | generation approach | 105 | |
part of speech tagging | 11 | generation approach | 66 | accuracy measure | 102 | |
Full text | parsing method | 213 | part of speech tagging | 1175 | embedding technique | 1966 |
computation method | 128 | classification method | 919 | neural network model | 1947 | |
machine translation approach | 124 | accuracy measure | 899 | loss function | 1276 | |
complexity metric | 122 | parsing method | 872 | classification method | 1146 | |
computational linguistic approach | 101 | machine translation approach | 868 | word embedding method | 1110 | |
part of speech tagging | 99 | clustering method | 656 | accuracy measure | 1100 | |
grammar based approach | 95 | machine learning model | 592 | learning rate strategy | 996 | |
parse tree | 87 | F1 score metric | 571 | LSTM-based system | 994 | |
accuracy measure | 84 | complexity metric | 567 | F1 score metric | 978 | |
lexical entry | 83 | supervised learning method | 505 | encoding method | 964 |
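Note: Bold marks high-frequency terms that appear only in the abstracts or only in the full texts of each period.
Dataset | Semantic period: relation pair | Count | Traditional machine learning period: relation pair | Count | Deep learning period: relation pair | Count |
Abstract | natural language processing & computation method | 2 | statistical machine translation & machine translation approach | 7 | natural language processing & neural network model | 8 |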
parsing task & parsing method | 2 | statistical machine translation & decoder | 5 | natural language processing & deep learning approach | 7 | |
natural language processing & semantic lexicon | 1 | statistical machine translation & open source toolkit | 3 | natural language processing & word embedding method | 6 | |
parsing task & distributional method | 1 | machine translation task & discriminative algorithm | 3 | Chinese word segmentation & neural network model | 5 | |
classification task & logical analysis | 1 | parsing task & parsing method | 3 | question answering & knowledge base | 5 | |
Full text | parsing task & parsing method | 20 | machine translation task & machine translation approach | 57 | natural language processing & neural network model | 125 |
parsing task & lr parsing | 6 | word alignment problem & giza++ toolkit | 54 | avoid over-fitting & dropout rate | 50 | |
parsing task & generalized lr parsing | 5 | parsing task & parsing method | 51 | natural language processing & deep learning approach | 49 | |
natural language processing & parsing method | 5 | classification task & classification method | 43 | natural language processing & word embedding method | 43 | |
machine translation task & machine translation approach | 4 | classification task & SVM algorithm | 34 | learning word representation & embedding technique | 43 |
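Note: Bold marks high-frequency relation pairs that appear only in the abstracts or only in the full texts of each period; the problem entity precedes the & and the method entity follows it.
5 Conclusions and Outlook
References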
李丹. 科学研究活动中的知识管理研究[D]. 武汉: 武汉大学, 2005.
Luo Z R, Lu W, He J G, et al. Combination of research questions and methods: a new measurement of scientific novelty[J]. Journal of Informetrics, 2022, 16(2): 101282.
Heffernan K, Teufel S. Identifying problems and solutions in scientific text[J]. Scientometrics, 2018, 116(2): 1367-1382.
Kovačević A, Konjović Z, Milosavljević B, et al. Mining methodologies from NLP publications: a case study in automatic terminology recognition[J]. Computer Speech & Language, 2012, 26(2): 105-126.
伊惠芳. 基于问题-解决方案(P-S)的技术机会发现研究[D]. 北京: 中国科学院大学(中国科学院文献情报中心), 2022.
马费成, 张帅. 我国图书情报领域新兴交叉学科发展探析[J]. 中国图书馆学报, 2023, 49(2): 4-14.
章成志, 张颖怡. 基于学术论文全文的研究方法实体自动识别研究[J]. 情报学报, 2020, 39(6): 589-600.
张颖怡, 章成志. 基于学术论文全文的研究方法句自动抽取研究[J]. 情报学报, 2020, 39(6): 640-650.
王玉琢, 章成志. 考虑全文本内容的算法学术影响力分析研究[J]. 图书情报工作, 2017, 61(23): 6-14.
章成志, 丁睿祎, 王玉琢. 基于学术论文全文内容的算法使用行为及其影响力研究[J]. 情报学报, 2018, 37(12): 1175-1187.
Westergaard D, Stærfeldt H H, Tønsberg C, et al. A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts[J]. PLoS Computational Biology, 2018, 14(2): e1005962.
Lin J. Is searching full text more effective than searching abstracts?[J]. BMC Bioinformatics, 2009, 10(1): Article No.46.
Yang H C, Aguirre C, Hsu W. PIEKM: ML-based procedural information extraction and knowledge management system for materials science literature[C]// Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: System Demonstrations. Stroudsburg: Association for Computational Linguistics, 2022: 57-62.
Yang H C, Hsu W. Named entity recognition from synthesis procedural text in materials science domain with attention-based approach[C]// Proceedings of the Workshop on Scientific Document Understanding. CEUR-WS.org, 2021: paper15.
Zhang H H, Ren F L. BERTatDE at SemEval-2020 task 6: extracting term-definition Pairs in free text using pre-trained model[C]// Proceedings of the Fourteenth Workshop on Semantic Evaluation. Stroudsburg: International Committee for Computational Linguistics, 2020: 690-696.
Wray A. Formulaic sequences in second language teaching: principle and practice[J]. Applied Linguistics, 2000, 21(4): 463-489.
Liakata M, Teufel S, Siddharthan A, et al. Corpora for the conceptualisation and zoning of scientific papers[C]// Proceedings of the 7th International Conference on Language Resources and Evaluation. Paris: European Language Resources Association, 2010: 2054-2061.
Shorten C, Khoshgoftaar T M, Furht B. Text data augmentation for deep learning[J]. Journal of Big Data, 2021, 8(1): Article No.101.
Shakeel M H, Karim A, Khan I. A multi-cascaded model with data augmentation for enhanced paraphrase detection in short texts[J]. Information Processing & Management, 2020, 57(3): 102204.
Shah P K, Perez-Iratxeta C, Bork P, et al. Information extraction from full text scientific articles: Where are the keywords?[J]. BMC Bioinformatics, 2003, 4(1): Article No.20.
Zadeh B Q, Handschuh S. Investigating context parameters in technology term recognition[C]// Proceedings of the COLING Workshop on Synchronic and Diachronic Approaches to Analyzing Technical Language. Stroudsburg & Dublin: Association for Computational Linguistics and Dublin City University, 2014: 1-10.
Augenstein I, Das M, Riedel S, et al. SemEval 2017 task 10: ScienceIE - extracting keyphrases and relations from scientific publications[C]// Proceedings of the 11th International Workshop on Semantic Evaluation. Stroudsburg: Association for Computational Linguistics, 2017: 546-555.
Zhang C Z, Mayr P, Lu W, et al. Guest editorial: Extraction and evaluation of knowledge entities in the age of artificial intelligence[J]. Aslib Journal of Information Management, 2023, 75(3): 433-437.
Hong Z, Tchoua R, Chard K, et al. SciNER: extracting named entities from scientific literature[C]// Proceedings of the 20th International Conference on Computational Science. Cham: Springer, 2020: 308-321.
Hou L L, Zhang J, Wu O, et al. Method and dataset entity mining in scientific literature: a CNN + BiLSTM model with self-attention[J]. Knowledge-Based Systems, 2022, 235: 107621.
Kumar A, Starly B. “FabNER”: information extraction from manufacturing process science domain literature using named entity recognition[J]. Journal of Intelligent Manufacturing, 2022, 33(8): 2393-2407.
Brack A, D’Souza J, Hoppe A, et al. Domain-independent extraction of scientific concepts from research articles[C]// Proceedings of the European Conference on Advances in Information Retrieval. Cham: Springer, 2020: 251-266.
Beltagy I, Lo K, Cohan A. SciBERT: a pretrained language model for scientific text[C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2019: 3615-3620.
Färber M, Albers A, Schüber F. Identifying used methods and datasets in scientific publications[C]// Proceedings of the Workshop on Scientific Document Understanding. CEUR-WS.org, 2021: paper19.
Shen S, Liu J F, Lin L T, et al. SsciBERT: a pre-trained language model for social science texts[J]. Scientometrics, 2023, 128(2): 1241-1263.
Puccetti G, Giordano V, Spada I, et al. Technology identification from patent texts: a novel named entity recognition method[J]. Technological Forecasting and Social Change, 2023, 186: 122160.
Li R, Li D, Yang J X, et al. Joint extraction of entities and relations via an entity correlated attention neural model[J]. Information Sciences, 2021, 581: 179-193.
Wu H Y, Huang J. Joint entity and relation extraction network with enhanced explicit and implicit semantic information[J]. Applied Sciences, 2022, 12(12): 6231.
Luan Y, He L H, Ostendorf M, et al. Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction[C]// Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2018: 3219-3232.
Ma Y Q, Liu J W, Lu W, et al. From “what” to “how”: extracting the procedural scientific information toward the metric-optimization in AI[J]. Information Processing & Management, 2023, 60(3): 103315.
Ding B S, Qin C W, Liu L L, et al. Is GPT-3 a good data annotator?[C]// Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2023: 11173-11195.
张颖怡, 章成志, 周毅, 等. 基于ChatGPT的多视角学术论文实体识别: 性能测评与可用性研究[J]. 数据分析与知识发现, 2023, 7(9): 12-24.
Dai X, Adel H. An analysis of simple data augmentation for named entity recognition[C]// Proceedings of the 28th International Conference on Computational Linguistics. International Committee on Computational Linguistics, 2020: 3861-3867.
Li K, Chen C B, Quan X J, et al. Conditional augmentation for aspect term extraction via masked sequence-to-sequence generation[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2020: 7056-7066.
Ding B S, Liu L L, Bing L D, et al. DAGA: data augmentation with a generation approach for low-resource tagging tasks[C]// Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2020: 6045-6057.
Zheng C M, Cai Y, Xu J Y, et al. A boundary-aware neural model for nested named entity recognition[C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2019: 357-366.
Vinyals O, Fortunato M, Jaitly N. Pointer networks[C]// Proceedings of the 28th International Conference on Neural Information Processing Systems. Cambridge: MIT Press, 2015: 2692-2700.
Li J, Ye D H, Shang S. Adversarial transfer for named entity boundary detection with pointer networks[C]// Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence. California: International Joint Conferences on Artificial Intelligence Organization, 2019: 5053-5059.
Yan H, Gui T, Dai J Q, et al. A unified generative framework for various NER subtasks[C]// Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2021: 5808-5822.
Samuel J, Yuan X H, Yuan X J, et al. Mining online full-text literature for novel protein interaction discovery[C]// Proceedings of the 2010 IEEE International Conference on Bioinformatics and Biomedicine Workshops. Piscataway: IEEE, 2010: 277-282.
Syed S, Spruit M. Full-text or abstract? Examining topic coherence scores using latent dirichlet allocation[C]// Proceedings of the 2017 IEEE International Conference on Data Science and Advanced Analytics. Piscataway: IEEE, 2017: 165-174.
Dang V B, Aizawa A. Multi-class named entity recognition via bootstrapping with dependency tree-based patterns[C]// Proceedings of the 12th Pacific-Asia Conference on Knowledge Discovery and Data Mining. Heidelberg: Springer, 2008: 76-87.
Zeng X J, Li Y L, Zhai Y C, et al. Counterfactual generator: a weakly-supervised method for named entity recognition[C]//Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2020: 7270-7280.
Toulmin S. Human understanding[M]. Princeton: Princeton University Press, 1977.
Houngbo H, Mercer R E. Method mention extraction from scientific research papers[C]// Proceedings of COLING 2012. The COLING 2012 Organizing Committee, 2012: 1211-1222.
Gupta S, Manning C D. Analyzing the dynamics of research by extracting key aspects of scientific papers[C]// Proceedings of the 5th International Joint Conference on Natural Language Processing. Asian Federation of Natural Language Processing, 2011: 1-9.
Chu H T, Ke Q. Research methods: What’s in the name?[J]. Library & Information Science Research, 2017, 39(4): 284-294.
Qasemizadeh B, Schumann A K. The ACL RD-TEC 2.0: a language resource for evaluating term extraction and entity recognition methods[C]// Proceedings of the Tenth International Conference on Language Resources and Evaluation. Paris: European Language Resources Association, 2016:1862-1868.
Wang Z H, Shang J B, Liu L Y, et al. CrossWeigh: training named entity tagger from imperfect annotations[C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2019: 5154-5163.
张颖怡. 学术论文中“问题-方法”关系抽取研究[D]. 南京: 南京理工大学, 2022.
孙向东, 刘拥军, 陈雯雯, 等. 箱线图法在动物卫生数据异常值检验中的运用[J]. 中国动物检疫, 2010, 27(7): 66-68.
Wang Y Z, Zhang C Z. Using the full-text content of academic articles to identify and evaluate algorithm entities in the domain of natural language processing[J]. Journal of Informetrics, 2020, 14(4): 101091.