A Study on the Stability of Semantic Representation of Entities in the Technology Domain: A Comparison of Multiple Word Embedding Models
Chen Guo1, Xu Zan1, Hong Siqi1, Wu Jiahuan1, Xiao Lu2
1. School of Economics & Management, Nanjing University of Science & Technology, Nanjing 210094, China
2. School of Journalism, Nanjing University of Finance & Economics, Nanjing 210023, China
Abstract: Lexical semantic analysis is crucial in the field of science and technology literature intelligence analysis. Distributed word embedding techniques (e.g., fastText, GloVe, and Word2Vec), which effectively represent lexical semantics and conveniently characterize the semantic similarity between words, have recently become the mainstream technology for lexical semantic analysis of technological terms. Such analysis relies heavily on computing the nearest semantic neighbors of words from their word vectors. However, because word embedding models are randomly initialized, repeated training on exactly the same data does not yield identical nearest semantic neighbors, and these randomly perturbed neighbors introduce spurious information. To minimize the impact of random initialization, enhance reproducibility, and obtain more reliable and effective semantic analysis results, this study comprehensively examined the influence of dataset size, model type, training algorithm, keyword frequency, vector dimension, and context window size, and designed a quantitative stability assessment index together with a corresponding experimental scheme. The study investigated Microsoft Academic Graph (MAG) paper corpora in four distinct fields: artificial intelligence, immunology, monetary policy, and quantum entanglement. Specifically, we trained word embedding models on the MAG paper corpora, generated word vector semantic representations for the papers' keywords, and calculated the evaluation metrics to assess the stability of the semantic representations in light of the quantitative results. The results in the four domains demonstrate that the larger the dataset, the more stable the semantic representation, although this does not hold for GloVe. When structural and grammatical information such as lexical composition, character similarity, and keyword frequency must be taken into account, the model and training algorithm should be chosen accordingly. Furthermore, setting the vector dimension to 300 and the context window to 5 is a more appropriate choice. This empirical study offers a point of reference for intelligence workers engaged in the semantic analysis of scientific and technological vocabulary.
Received: 09 May 2024
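
The stability assessment described in the abstract hinges on comparing the nearest semantic neighbors produced by repeated training runs on the same data. As an illustration only (the paper's exact stability index is not reproduced here), the following Python sketch, assuming gensim's Word2Vec implementation and a hypothetical toy corpus, trains two models with different random seeds and reports the Jaccard overlap of a keyword's top-k neighbor sets; values near 1 indicate a stable semantic representation.

    # Minimal sketch of a nearest-neighbor stability check, assuming gensim.
    # The paper's actual metric design is not reproduced; Jaccard overlap of
    # top-k neighbor sets is used here as one common stability measure.
    from gensim.models import Word2Vec

    # Toy stand-in for a MAG paper corpus (hypothetical data).
    corpus = [
        ["quantum", "entanglement", "qubit", "decoherence"],
        ["monetary", "policy", "inflation", "interest", "rate"],
        ["artificial", "intelligence", "neural", "network", "training"],
    ] * 200

    def top_k_neighbors(model, word, k=10):
        """Return the set of the k nearest semantic neighbors of `word`."""
        return {w for w, _ in model.wv.most_similar(word, topn=k)}

    def neighbor_stability(corpus, word, k=10, seeds=(1, 2)):
        """Jaccard overlap of top-k neighbor sets across two training runs."""
        models = [
            # vector_size=300 and window=5 follow the settings the abstract
            # recommends; workers=1 limits nondeterminism from threading.
            Word2Vec(corpus, vector_size=300, window=5, min_count=1,
                     seed=s, workers=1)
            for s in seeds
        ]
        a, b = (top_k_neighbors(m, word, k) for m in models)
        return len(a & b) / len(a | b)

    print(neighbor_stability(corpus, "quantum"))  # 1.0 = fully stable neighbors

In practice, this overlap would be averaged over all paper keywords and over many seed pairs, and the same procedure would be repeated for fastText and GloVe to compare the stability of different models and training algorithms.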
|
|
|
|
|
|
|