A Paper Semantic Representation Method Incorporating Academic Network with Content Information
Shi Bin 1,2, Wang Hao 1,2, Li Xiaomin 1,2, Zhou Shu 1,2
1. School of Information Management, Nanjing University, Nanjing 210023
2. Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023
Abstract With the growing number of scientific researchers, the volume of published scientific and technological papers has increased rapidly, making the work of archiving, entering, and analyzing documents increasingly burdensome. Most classification models focus on a paper's content and ignore its associated relational information. To address this problem, this study proposes PAITKG, a paper representation model that integrates content information with academic networks. Knowledge graph embedding is introduced to characterize the multiple relation patterns among papers; SciBERT, fine-tuned with Adapters, is used to extract content features; and the two representations are then fused. During training, this study improves the dynamic counter loss function to guide the model to pay more attention to erroneous results. The method is applied to literature classification and analysis in the field of digital humanities. In multi-label classification of scientific and technological literature, PAITKG significantly outperformed the baselines, greatly improving classification accuracy. Moreover, the representations learned by PAITKG extend beyond the upstream task: without any additional training, the feature vectors extracted by the model can be applied to analysis tasks such as topic clustering and scholar recommendation. The experiments show that, by constructing and characterizing the academic networks of papers, PAITKG effectively integrates associated literature information and improves the understanding of literature data; its learned representations also show strong generalization potential and can support a variety of literature analysis tasks.
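The abstract describes the architecture only at a high level. As a rough illustration of the fusion it outlines, the sketch below (a minimal PyTorch example; all dimensions, names, and the simple concatenation-plus-projection fusion are illustrative assumptions, not the authors' released implementation) combines a precomputed knowledge-graph node embedding of a paper with a SciBERT [CLS] content vector for multi-label classification.

```python
# Minimal sketch: fuse a paper's knowledge-graph node embedding (from the
# academic network) with its SciBERT [CLS] content vector, then classify.
# Dimensions and the concatenation fusion are assumptions for illustration.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, kg_dim=200, text_dim=768, num_labels=10):
        super().__init__()
        self.kg_proj = nn.Linear(kg_dim, text_dim)  # align KG and text spaces
        self.head = nn.Sequential(
            nn.Linear(text_dim * 2, text_dim),
            nn.ReLU(),
            nn.Linear(text_dim, num_labels),
        )

    def forward(self, kg_vec, text_vec):
        # kg_vec:   (batch, kg_dim)   node embedding from the academic network
        # text_vec: (batch, text_dim) SciBERT [CLS] output (Adapter-tuned)
        fused = torch.cat([self.kg_proj(kg_vec), text_vec], dim=-1)
        return self.head(fused)  # raw logits, one per label

model = FusionClassifier()
kg_vec, text_vec = torch.randn(4, 200), torch.randn(4, 768)
labels = torch.randint(0, 2, (4, 10)).float()  # multi-label targets
# Multi-label classification scores each label with an independent sigmoid.
loss = nn.BCEWithLogitsLoss()(model(kg_vec, text_vec), labels)
```

In a multi-label setting each label gets an independent sigmoid (here via BCEWithLogitsLoss) rather than a shared softmax; the improved dynamic counter loss mentioned in the abstract, which up-weights erroneous results, would replace this plain BCE term, but its exact form is not given here.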
Received: 06 January 2024
|
|
|