Citation Co-occurrence Hierarchical Sampling for Academic Document Representation Learning
Ding Heng, Zhang Jing, Chen Jiazhuo, Cao Gaohui
School of Information Management, Central China Normal University, Wuhan 430079
Abstract: Effective feature representations of academic papers support document classification and ranking, improving retrieval efficiency and enabling more intelligent literature recommendation and personalized services. Inspired by citation proximity analysis (CPA) in information science, we propose, within a self-supervised contrastive learning framework, a citation co-occurrence hierarchical sampling algorithm that mines latent associations among documents from structured full-text data. The algorithm is used to construct a self-supervised pre-training task for the citation co-occurrence hierarchical transformer (CCHT), a document-level representation model for academic text. Training triplets were built from the S2ORC corpus and the SPECTER training set using citation pairs that co-occur within the same sentence, paragraph, or section, and the resulting models were evaluated on the four major SciDocs benchmark tasks: document classification, user behavior prediction, citation prediction, and paper recommendation. Task-appropriate metrics were adopted: F1 for document classification; normalized discounted cumulative gain (nDCG) and mean average precision (MAP) for user behavior prediction and citation prediction; and P@1 and nDCG for paper recommendation. The results demonstrate that (1) CCHT outperformed the baseline models on the SciDocs benchmark test set, performing best when, at a fixed sampling level, positive samples were citation pairs co-occurring in the same sentence; and (2) hard negative sampling based on hierarchical citation co-occurrence may introduce noisy data during training, which degrades performance.
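To make the sampling procedure concrete, the following minimal sketch illustrates triplet construction from citation co-occurrence at a chosen structural level, paired with a SPECTER-style triplet margin loss. The data schema (CitationContexts), function names, and the exact loss form are illustrative assumptions for this sketch, not the authors' implementation.

import random
from dataclasses import dataclass, field
from typing import List, Set

import torch
import torch.nn.functional as F

@dataclass
class CitationContexts:
    # Co-cited paper IDs grouped by the structural unit of the citing paper's
    # full text in which the citations appear (hypothetical schema).
    sentence: List[Set[str]] = field(default_factory=list)
    paragraph: List[Set[str]] = field(default_factory=list)
    section: List[Set[str]] = field(default_factory=list)

def sample_triplets(corpus, all_ids, level="sentence", seed=0):
    # Two papers cited together in the same structural unit form an
    # (anchor, positive) pair; the negative is a random corpus paper
    # that is not cited in that unit.
    rng = random.Random(seed)
    triplets = []
    for ctx in corpus:
        for group in getattr(ctx, level):
            ids = list(group)
            if len(ids) < 2:
                continue
            anchor, positive = rng.sample(ids, 2)
            negative = rng.choice(all_ids)
            while negative in group:  # avoid sampling a false negative
                negative = rng.choice(all_ids)
            triplets.append((anchor, positive, negative))
    return triplets

def triplet_margin_loss(h_a, h_p, h_n, margin=1.0):
    # Contrastive objective over document embeddings: pull the anchor toward
    # the co-cited positive and push it from the negative by at least margin.
    d_pos = F.pairwise_distance(h_a, h_p)
    d_neg = F.pairwise_distance(h_a, h_n)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

At level="sentence", two papers cited in the same sentence form a positive pair, the setting under which finding (1) reports the best performance; restricting negatives to papers co-cited only at coarser levels would correspond to the hierarchical hard negative strategy that, per finding (2), risks introducing noise.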
Received: 19 April 2023