1 Harper C A, Tillett B B. Library of Congress controlled vocabularies and their application to the semantic web[J]. Cataloging & Classification Quarterly, 2007, 43(3/4): 47-68.
2 Hjørland B. Fundamentals of knowledge organization[J]. Knowledge Organization, 2003, 30(2): 87-111.
3 Schatz B R. Information retrieval in digital libraries: bringing search to the net[J]. Science, 1997, 275(5298): 327-334.
4 刘知远, 孙茂松, 林衍凯, 等. 知识表示学习研究进展[J]. 计算机研究与发展, 2016, 53(2): 247-261.
5 Peters M E, Neumann M, Iyyer M, et al. Deep contextualized word representations[C]// Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: Association for Computational Linguistics, 2018: 2227-2237.
6 Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]// Proceedings of the 31st Annual Conference on Neural Information Processing Systems. Cambridge: MIT Press, 2017: 5998-6008.
7 Devlin J, Chang M W, Lee K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: Association for Computational Linguistics, 2019: 4171-4186.
8 Joshi M, Chen D Q, Liu Y H, et al. SpanBERT: improving pre-training by representing and predicting spans[J]. Transactions of the Association for Computational Linguistics, 2020, 8: 64-77.
9 陆伟, 李鹏程, 张国标, 等. 学术文本词汇功能识别——基于BERT向量化表示的关键词自动分类研究[J]. 情报学报, 2020, 39(12): 1320-1329.
10 谢靖, 刘江峰, 王东波. 古代中国医学文献的命名实体识别研究——以Flat-lattice增强的SikuBERT预训练模型为例[J]. 图书馆论坛, 2022, 42(10): 51-60.
11 张颖怡, 章成志. 基于学术论文全文的研究方法句自动抽取研究[J]. 情报学报, 2020, 39(6): 640-650.
12 王倩, 曾金, 刘家伟, 等. 基于深度学习的学术文本段落结构功能识别研究[J]. 情报科学, 2020, 38(3): 64-69.
13 Beltagy I, Lo K, Cohan A. SciBERT: a pretrained language model for scientific text[C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2019: 3615-3620.
14 Cohan A, Feldman S, Beltagy I, et al. SPECTER: document-level representation learning using citation-informed transformers[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2020: 2270-2282.
15 Eto M. Evaluations of context-based co-citation searching[J]. Scientometrics, 2013, 94(2): 651-673.
16 Elkiss A, Shen S W, Fader A, et al. Blind men and elephants: what do citation summaries tell us about a research article?[J]. Journal of the American Society for Information Science and Technology, 2008, 59(1): 51-62.
17 Turney P D, Pantel P. From frequency to meaning: vector space models of semantics[J]. Journal of Artificial Intelligence Research, 2010, 37: 141-188.
18 徐戈, 王厚峰. 自然语言处理中主题模型的发展[J]. 计算机学报, 2011, 34(8): 1423-1436.
19 Dourado Í C, Galante R, Gonçalves M A, et al. Bag of textual graphs (BoTG): a general graph-based text representation model[J]. Journal of the Association for Information Science and Technology, 2019, 70(8): 817-829.
20 Balinsky H, Balinsky A, Simske S J. Automatic text summarization and small-world networks[C]// Proceedings of the 11th ACM Symposium on Document Engineering. New York: ACM Press, 2011: 175-184.
21 Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality[J]. Advances in Neural Information Processing Systems, 2013, 26: 3111-3119.
22 Harrag F, El-Qawasmah E, Al-Salman A M S. Stemming as a feature reduction technique for Arabic text categorization[C]// Proceedings of the 2011 10th International Symposium on Programming and Systems. Piscataway: IEEE, 2011: 128-133.
23 Peng X Y, Ke D F, Chen Z B, et al. Automated Chinese essay scoring using vector space models[C]// Proceedings of the 2010 4th International Universal Communication Symposium. Piscataway: IEEE, 2010: 149-153.
24 Mao W L, Chu W W. The phrase-based vector space model for automatic retrieval of free-text medical documents[J]. Data & Knowledge Engineering, 2007, 61(1): 76-92.
25 Aleahmad A, Hakimian P, Mahdikhani F, et al. n-gram and local context analysis for Persian text retrieval[C]// Proceedings of the 2007 9th International Symposium on Signal Processing and Its Applications. Piscataway: IEEE, 2007: 1-4.
26 Harrag F, Hamdi-Cherif A, Al-Salman A, et al. Experiments in improvement of Arabic information retrieval[C]// Proceedings of the 3rd International Conference on Arabic Language Processing, Rabat, Morocco, 2009: 71-81.
27 Sang J G, Pang S C, Zha Y, et al. Design and analysis of a general vector space model for data classification in Internet of Things[J]. EURASIP Journal on Wireless Communications and Networking, 2019, 2019: Article No. 263.
28 Radovanović M, Nanopoulos A, Ivanović M. On the existence of obstinate results in vector space models[C]// Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM Press, 2010: 186-193.
29 Ramage D, Hall D, Nallapati R, et al. Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora[C]// Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2009: 248-256.
30 Blei D M, Lafferty J D. A correlated topic model of Science[J]. The Annals of Applied Statistics, 2007, 1(1): 17-35.
31 Xun G X, Li Y L, Zhao W X, et al. A correlated topic model using word embeddings[C]// Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence. International Joint Conferences on Artificial Intelligence, 2017: 4207-4213.
32 Hai Z, Cong G, Chang K Y, et al. Analyzing sentiments in one go: a supervised joint topic modeling approach[J]. IEEE Transactions on Knowledge and Data Engineering, 2017, 29(6): 1172-1185.
33 Erosheva E, Fienberg S, Lafferty J. Mixed-membership models of scientific publications[J]. Proceedings of the National Academy of Sciences of the United States of America, 2004, 101(Suppl 1): 5220-5227.
34 Steyvers M, Smyth P, Rosen-Zvi M, et al. Probabilistic author-topic models for information discovery[C]// Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM Press, 2004: 306-315.
35 Mihalcea R, Tarau P. TextRank: bringing order into text[C]// Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2004: 404-411.
36 Gottron T, Anderka M, Stein B. Insights into explicit semantic analysis[C]// Proceedings of the 20th ACM International Conference on Information and Knowledge Management. New York: ACM Press, 2011: 1961-1964.
37 Yamada I, Shindo H, Takeda H, et al. Learning distributed representations of texts and entities from knowledge base[J]. Transactions of the Association for Computational Linguistics, 2017, 5: 397-411.
38 Muhammad P F, Kusumaningrum R, Wibowo A. Sentiment analysis using word2vec and long short-term memory (LSTM) for Indonesian hotel reviews[J]. Procedia Computer Science, 2021, 179: 728-735.
39 Alammary A S. BERT models for Arabic text classification: a systematic review[J]. Applied Sciences, 2022, 12(11): 5720.
40 Lan Z Z, Chen M D, Goodman S, et al. ALBERT: a lite BERT for self-supervised learning of language representations[C]// Proceedings of the 8th International Conference on Learning Representations. Appleton: ICLR, 2020: 1-14.
41 Lewis M, Liu Y H, Goyal N, et al. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2020: 7871-7880.
42 Zhang J Q, Zhao Y, Saleh M, et al. PEGASUS: pre-training with extracted gap-sentences for abstractive summarization[C]// Proceedings of the 37th International Conference on Machine Learning. JMLR.org, 2020: 11328-11339.
43 Brown T B, Mann B, Ryder N, et al. Language models are few-shot learners[C]// Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates, 2020: 1877-1901.
44 丁恒, 任卫强, 曹高辉. 基于无监督图神经网络的学术文献表示学习研究[J]. 情报学报, 2022, 41(1): 62-72.
45 章成志, 张颖怡. 基于学术论文全文的研究方法实体自动识别研究[J]. 情报学报, 2020, 39(6): 589-600.
46 Gonçalves S, Cortez P, Moro S. A deep learning classifier for sentence classification in biomedical and computer science abstracts[J]. Neural Computing and Applications, 2020, 32(11): 6793-6807.
47 Lu W, Huang Y, Bu Y, et al. Functional structure identification of scientific documents in computer science[J]. Scientometrics, 2018, 115(1): 463-486.
48 Liu Y, Lapata M. Learning structured text representations[J]. Transactions of the Association for Computational Linguistics, 2018, 6: 63-75.
49 Li P Z, Gu J X, Kuen J, et al. SelfDoc: self-supervised document representation learning[C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 5648-5656.
50 Lu Y H, Luo J Y, Xiao Y, et al. Text representation model of scientific papers based on fusing multi-viewpoint information and its quality assessment[J]. Scientometrics, 2021, 126(8): 6937-6963.
51 Ostendorff M, Rethmeier N, Augenstein I, et al. Neighborhood contrastive learning for scientific document representations with citation embeddings[C]// Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2022: 11670-11688.
52 Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates, 2017: 6000-6010.
53 Lo K, Wang L L, Neumann M, et al. S2ORC: the semantic scholar open research corpus[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2020: 4969-4983.
54 Lipscomb C E. Medical subject headings (MeSH)[J]. Bulletin of the Medical Library Association, 2000, 88(3): 265-266.
55 Le Q, Mikolov T. Distributed representations of sentences and documents[C]// Proceedings of the 31st International Conference on Machine Learning. JMLR.org, 2014: II-1188-II-1196.
56 Bojanowski P, Grave E, Joulin A, et al. Enriching word vectors with subword information[J]. Transactions of the Association for Computational Linguistics, 2017, 5: 135-146.
57 Arora S, Liang Y Y, Ma T Y. A simple but tough-to-beat baseline for sentence embeddings[C]// Proceedings of the International Conference on Learning Representations. Appleton: ICLR, 2017.
58 Bhagavatula C, Feldman S, Power R, et al. Content-based citation recommendation[C]// Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: Association for Computational Linguistics, 2018: 238-251.
59 Wu F, Souza A, Zhang T, et al. Simplifying graph convolutional networks[C]// Proceedings of the 36th International Conference on Machine Learning. PMLR, 2019: 6861-6871.
60 Reimers N, Gurevych I. Sentence-BERT: sentence embeddings using Siamese BERT-networks[C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2019: 3982-3992.
61 Gao T Y, Yao X C, Chen D Q. SimCSE: simple contrastive learning of sentence embeddings[C]// Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2021: 6894-6910.
62 Izacard G, Caron M, Hosseini L, et al. Unsupervised dense information retrieval with contrastive learning[J/OL]. Transactions on Machine Learning Research, (2022-08-29). https://openreview.net/pdf?id=jKN1pXi7b0.
63 Chuang Y S, Dangovski R, Luo H Y, et al. DiffCSE: difference-based contrastive learning for sentence embeddings[C]// Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: Association for Computational Linguistics, 2022: 4207-4218.
64 丁恒, 阮靖龙. 基于算法归因框架的LIS领域学者施引影响因素实证研究[J]. 图书情报知识, 2022, 39(2): 83-97.