Research on Text Representation Based on Improved Word Mover's Embedding
Cen Yonghua 1,2, Li Wenjing 3, Liu Xianzu 3,2
1. Management School, Tianjin Normal University, Tianjin 300387
2. Institute for Big Data Science, Tianjin Normal University, Tianjin 300387
3. School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094
|
|
Abstract High-quality text representation is the foundation and guarantee of downstream text-processing tasks such as sentiment analysis and text classification. In response to the insufficient semantic accuracy and limited context window of traditional models, recent models based on the word mover's/rotator's distance (WMD/WRD) or word mover's embedding (WME) have drawn special attention. To further this line of work, this study introduces an improved word mover's embedding method based on latent Dirichlet allocation (LDA), namely LDA-WFR-WME. The approach overcomes the semantic bias arising from the uniform topic-distribution assumption of general WME by initializing the text-embedding dimensions through LDA topic modeling, and it rectifies the distance distortion caused by excessive semantic differences between documents by adopting the Wasserstein-Fisher-Rao (WFR) text distance. Experiments on multiple groups of short-text sentiment analysis, long-text classification, and text clustering tasks demonstrate the superiority of the proposed method in text embedding over competitive models such as Doc2Vec, attention-based bidirectional long short-term memory (BiLSTM), bidirectional encoder representations from transformers (BERT), attention-based bidirectional gated recurrent unit with convolutional neural network (Attention-BiGRU-CNN), and bidirectional graph attention network (BiGAT). The results indicate that topic-supervised proxy-document generation, combined with the WFR document distance, enhances the semantic embedding of text.
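To make the embedding pipeline concrete, the following is a minimal sketch of a WME-style document embedding in Python, using the POT (Python Optimal Transport) library for the transport step. It is an illustration under stated assumptions rather than the paper's implementation: the proxy documents are passed in as plain token lists (in the full method they would be generated under the guidance of LDA topics), the ordinary WMD computed with ot.emd2 stands in for the WFR distance, and the names word_vecs, nbow, and wme_embedding are hypothetical helpers introduced for the example.

```python
import numpy as np
import ot  # POT: Python Optimal Transport


def nbow(tokens, index):
    """Normalized bag-of-words weights over the vocabulary in `index`.
    Assumes at least one token appears in the vocabulary."""
    w = np.zeros(len(index))
    for t in tokens:
        if t in index:
            w[index[t]] += 1.0
    return w / w.sum()


def word_movers_distance(doc_a, doc_b, word_vecs):
    """Ordinary WMD between two token lists; a stand-in for the WFR distance."""
    va = sorted({t for t in doc_a if t in word_vecs})
    vb = sorted({t for t in doc_b if t in word_vecs})
    Xa = np.array([word_vecs[t] for t in va])
    Xb = np.array([word_vecs[t] for t in vb])
    a = nbow(doc_a, {t: i for i, t in enumerate(va)})
    b = nbow(doc_b, {t: i for i, t in enumerate(vb)})
    M = ot.dist(Xa, Xb, metric="euclidean")  # ground cost between word vectors
    return ot.emd2(a, b, M)                  # optimal-transport (earth mover's) cost


def wme_embedding(docs, proxy_docs, word_vecs, gamma=1.0):
    """WME feature map: one kernel value per proxy ('topic') document."""
    R = len(proxy_docs)
    emb = np.zeros((len(docs), R))
    for i, d in enumerate(docs):
        for j, p in enumerate(proxy_docs):
            emb[i, j] = np.exp(-gamma * word_movers_distance(d, p, word_vecs))
    return emb / np.sqrt(R)  # R proxy documents define R embedding dimensions
```

Replacing the ot.emd2 step with an unbalanced-transport solver (for example, one from POT's ot.unbalanced module) would approximate the WFR behavior of tolerating mass differences between semantically distant documents, which is the distance distortion the proposed method targets.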
|
Received: 10 May 2024
|
|
|
|
|
|
|