|
|
Interdisciplinary Topic Identification Method Based on Semantic Similarity Relationship
Wang Weijun 1,2,3, Ning Zhiyuan 2,3, Dong Hao 2,3, Qiao Ziyue 2,3, Du Yi 2,3, Zhou Yuanchun 2,3
1. Library of Henan University of Economics and Law, Zhengzhou 450046
2. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190
3. University of Chinese Academy of Sciences, Beijing 100049
|
|
Abstract Identifying the research content shared among different disciplines is the core idea of interdisciplinary knowledge discovery, and research content with similar semantics best reflects the integration and exchange of knowledge between disciplines. To obtain semantically similar interdisciplinary research topics from scientific and technical literature, this study proposes an unsupervised contrastive learning method that learns semantic similarity representations of scientific and technical documents and their keywords, and builds on these representations a model for identifying semantically similar interdisciplinary topics. Because current research lacks labeled interdisciplinary datasets, the model adopts the Spearman correlation coefficient as its index for evaluating interdisciplinary topics. Experiments show that the model correctly captures the semantic similarity between scientific and technical documents and their keywords, and that its results reflect the tendency toward intersection between the two disciplines studied.
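To make the two components named in the abstract concrete, the following minimal Python sketch pairs a SimCSE-style unsupervised contrastive objective (each text encoded twice under independent dropout, with in-batch negatives) with a Spearman-based evaluation index. This is an illustration under assumptions, not the authors' implementation: the function names, the temperature value of 0.05, and the use of in-batch negatives are all illustrative choices.

    import torch
    import torch.nn.functional as F
    from scipy.stats import spearmanr

    def info_nce_loss(emb_a, emb_b, temperature=0.05):
        # emb_a, emb_b: (batch, dim) embeddings of the same batch of texts
        # encoded twice; independent dropout masks make the two views differ,
        # so each text's second view is its positive and every other text in
        # the batch serves as a negative.
        emb_a = F.normalize(emb_a, dim=-1)
        emb_b = F.normalize(emb_b, dim=-1)
        sim = emb_a @ emb_b.T / temperature                     # cosine similarity matrix
        labels = torch.arange(sim.size(0), device=sim.device)   # positives on the diagonal
        return F.cross_entropy(sim, labels)

    def spearman_index(predicted_similarities, reference_scores):
        # Spearman rank correlation between model similarities and reference
        # scores, used here as the evaluation index in place of a labeled
        # interdisciplinary dataset.
        return spearmanr(predicted_similarities, reference_scores).correlation

Minimizing the cross-entropy pulls each text toward its dropout-perturbed second view while pushing it away from unrelated texts in the batch, which is the alignment-and-uniformity behavior that contrastive sentence-embedding methods target.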
|
Received: 15 February 2023
|
|
|
|