A Paper Semantic Representation Method Incorporating Academic Networks with Content Information
Shi Bin¹,², Wang Hao¹,², Li Xiaomin¹,², Zhou Shu¹,²
1. School of Information Management, Nanjing University, Nanjing 210023
2. Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023
Abstract: As the number of researchers continues to grow, the volume of published scientific papers is rising rapidly, making the archiving, cataloging, and analysis of this literature increasingly burdensome. Current classification models for scientific literature focus mainly on a paper's content while neglecting its associated relational information. To address this, this paper proposes PAITKG (paper analysis by incorporating text and knowledge graph), a paper representation model that fuses content information with the academic network. The model introduces knowledge graph embedding to represent the multiple relations surrounding a paper, extracts content features with a SciBERT encoder fine-tuned via Adapters, and fuses the two representations. During training, an improved dynamic adversarial loss function guides the model to focus on misclassified examples. The method is evaluated on literature datasets from two fields, digital humanities and multimodal learning. On the multi-label subject classification task for scientific papers, PAITKG significantly outperforms the baselines and markedly improves classification accuracy. Moreover, through this upstream training, PAITKG's representations generalize to broader applications: without any additional training, the feature vectors extracted by the model can be applied effectively to analysis tasks such as topic clustering and scholar recommendation. The results show that by constructing and representing a paper's academic network, PAITKG effectively integrates the relational information of the literature, improves the understanding of literature data, and learns representations with strong generalization potential that can support a wide range of literature analysis tasks.
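To make the architecture described above concrete, the following is a minimal sketch, not the authors' implementation, of how content features from a SciBERT encoder could be fused with pre-computed knowledge graph embeddings for multi-label classification. The dimensions (kg_dim, hidden_dim, num_labels) are placeholders, and the focal-style loss is an illustrative stand-in for the paper's improved dynamic loss, chosen because it likewise up-weights misclassified examples.

```python
# Illustrative PAITKG-style fusion classifier (assumptions noted below).
# Assumes: KG embeddings are pre-computed per paper; the [CLS] vector is
# the content feature; a focal-style BCE loss stands in for the paper's
# improved dynamic loss that emphasizes misclassified results.
import torch
import torch.nn as nn
from transformers import AutoModel

class FusionClassifier(nn.Module):
    def __init__(self, kg_dim=200, hidden_dim=256, num_labels=10):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
        text_dim = self.encoder.config.hidden_size  # 768 for SciBERT
        self.head = nn.Sequential(
            nn.Linear(text_dim + kg_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_labels),
        )

    def forward(self, input_ids, attention_mask, kg_emb):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]         # [CLS] content feature
        fused = torch.cat([cls, kg_emb], dim=-1)  # fuse content + network
        return self.head(fused)                   # raw logits per label

def focal_bce_loss(logits, targets, gamma=2.0):
    """Focal-style BCE: down-weights easy examples so training
    concentrates on misclassified ones (stand-in for the paper's loss)."""
    bce = nn.functional.binary_cross_entropy_with_logits(
        logits, targets, reduction="none")
    p_t = torch.exp(-bce)  # probability assigned to the true label
    return ((1 - p_t) ** gamma * bce).mean()
```

In the full model, the encoder would be adapted through lightweight Adapter modules inserted into each Transformer layer rather than fine-tuned end to end; it is left plain here for brevity.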