基于改进词移嵌入的文本表示方法研究

doi:10.3772/j.issn.1000-0135.2025.09.008

情报学报

2025, Vol. 44

Issue (9): 1173-1191 DOI: 10.3772/j.issn.1000-0135.2025.09.008

情报技术与应用

本期目录 | 过刊浏览 | 高级检索

基于改进词移嵌入的文本表示方法研究

岑咏华^1,2, 李文敬³, 刘贤祖^3,2

1.天津师范大学管理学院，天津 300387
2.天津师范大学大数据科学研究院，天津 300387
3.南京理工大学经济管理学院，南京 210094

Research on Text Representation Based on Improved Word Mover's Embedding

Cen Yonghua^1,2, Li Wenjing³, Liu Xianzu^3,2

1.Management School, Tianjin Normal University, Tianjin 300387
2.Institute for Big Data Science, Tianjin Normal University, Tianjin 300387
3.School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094

摘要
图/表
参考文献
相关文章 (0)

全文: PDF (3028 KB) HTML (1 KB)
输出: BibTeX | EndNote (RIS)

摘要文本表示是文本处理和分析的基础工作，在情感分析、文本分类等下游任务中发挥关键作用。鉴于传统模型存在语义精度不足、上下文窗口受限等局限性，基于词移距离（word mover's distance，WMD）和词移嵌入（word mover's embedding，WME）的文本表示模型近年来受到特别关注。本文提出一种基于潜在狄利克雷分配（latent Dirichlet allocation，LDA）初始化和WFR（Wasserstein-Fisher-Rao）文本距离的改进词移嵌入表示方法LDA-WFR-WME。该方法通过LDA建模初始化嵌入维度，弥补一般词移嵌入模型通过随机文档表征嵌入维度时由均匀分布采样导致语义偏差的缺陷；引入WFR文本距离，解决文档间语义细节因差异过大而引起的距离失真问题。本文以多组短文本情感分析、长文本分类以及文本聚类任务为例，以Doc2Vec（document to vector）、Attention-BiLSTM（bidirectional long short-term memory）、BERT（bidirectional encoder representations from transformers）、Attention-BiGRU-CNN（attention-bidirectional gated recurrent unit - convolutional neural network）、BiGAT（bidirectional graph attention network）等为竞争模型，进行实验对比分析。结果表明，LDA-WFR-WME方法在文本篇章的嵌入式表示方面体现出更优的性能。

	服务

	把本文推荐给朋友
	加入我的书架
	加入引用管理器
	E-mail Alert
	RSS
	作者相关文章
	岑咏华
	李文敬
	刘贤祖

关键词 ：文本嵌入表示, 词移嵌入, WFR文档距离, LDA-WFR-WME

收稿日期: 2024-05-10

基金资助:国家社会科学基金一般项目“基于凸显性效应视角和财经媒体信息深度处理的投资者有限关注行为研究”（21BTQ060）。

作者简介: 岑咏华，男，1979年生，博士，教授，硕士生导师，主要研究方向为大数据分析、行为金融分析、科技创新计量与管理等，E-mail：cen@tjnu.edu.cn；李文敬，女，1997年生，硕士研究生，主要研究方向为文本挖掘与金融数据分析；刘贤祖，男，1998年生，博士研究生，主要研究方向为文本分析、科学计量等；

引用本文:

岑咏华, 李文敬, 刘贤祖. 基于改进词移嵌入的文本表示方法研究[J]. 情报学报, 2025, 44(9): 1173-1191.
Cen Yonghua, Li Wenjing, Liu Xianzu. Research on Text Representation Based on Improved Word Mover's Embedding. 情报学报, 2025, 44(9): 1173-1191.

链接本文:

https://qbxb.istic.ac.cn/CN/10.3772/j.issn.1000-0135.2025.09.008 或 https://qbxb.istic.ac.cn/CN/Y2025/V44/I9/1173

1 张爽, 刘非凡, 罗双玲, 等. Smap: 基于文献语义的学科知识图景可视化[J]. 情报学报, 2023, 42(1): 74-89.
2 王平, 侯景瑞, 吴任力. 基于递归张量神经网络的微信公众号文章的新颖度评估方法[J]. 情报学报, 2019, 38(2): 159-169.
3 Lu Y H, Luo J Y, Xiao Y, et al. Text representation model of scientific papers based on fusing multi-viewpoint information and its quality assessment[J]. Scientometrics, 2021, 126(8): 6937-6963.
4 Patil R, Boit S, Gudivada V, et al. A survey of text representation and embedding techniques in NLP[J]. IEEE Access, 2023, 11: 36120-36146.
5 Shade B, Altmann E G. Quantifying the dissimilarity of texts[J]. Information, 2023, 14(5): 271.
6 Zhao W X, Liu J, Ren R Y, et al. Dense text retrieval based on pretrained language models: a survey[J]. ACM Transactions on Information Systems, 2024, 42(4): Article No.89.
7 Wang Y H, Zhang B H, Liu W K, et al. STMAP: a novel semantic text matching model augmented with embedding perturbations[J]. Information Processing & Management, 2024, 61(1): 103576.
8 Wang Z H, Liu Y F. SEA-PS: semantic embedding with attention to measuring patent similarity by leveraging various text fields[J]. Journal of Information Science, 2024, 50(4): 831-850.
9 Kusner M J, Sun Y, Kolkin N I, et al. From word embeddings to document distances[C]// Proceedings of the 32nd International Conference on International Conference on Machine Learning. JMLR.org, 2015: 957-966.
10 Yokoi S, Takahashi R, Akama R, et al. Word rotator’s distance[C]// Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2020: 2944-2960.
11 Wu L F, Yen I E, Xu K, et al. Word mover’s embedding: from Word2Vec to document embedding[C]// Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2018: 4524-4534.
12 Sato R, Yamada M, Kashima H. Re-evaluating word mover’s distance[J]. Proceedings of Machine Learning Research, 2022, 162: 19231-19249.
13 Blei D M, Ng A Y, Jordan M I. Latent Dirichlet allocation[J]. Journal of Machine Learning Research, 2003, 3(4-5): 993-1022.
14 Wang Z H, Zhou D T, Yang M, et al. Robust document distance with Wasserstein-Fisher-Rao metric[J]. Proceedings of Machine Learning Research, 2020, 129: 721-736.
15 靳嘉林, 王曰芬, 巴志超, 等. 基金项目研究的主题挖掘与动态演化分析——以美国NSF数据中AI领域为例[J]. 情报学报, 2022, 41(9): 967-979.
16 Salton G. Automatic text analysis: automatic document indexing and classification methods are examined and their effectiveness is assessed[J]. Science, 1970, 168(3929): 335-343.
17 Sp?rck Jones K. A statistical interpretation of term specificity and its application in retrieval[J]. Journal of Documentation, 2004, 60(5): 493-502.
18 Deerwester S, Dumais S T, Furnas G W, et al. Indexing by latent semantic analysis[J]. Journal of the American Society for Information Science, 1990, 41(6): 391-407.
19 Lee D D, Seung H S. Learning the parts of objects by non-negative matrix factorization[J]. Nature, 1999, 401(6755): 788-791.
20 Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality[C]// Proceedings of the 27th Advances in Neural Information Processing Systems. Red Hook: Curran Associates, 2013: 3111-3119.
21 Pennington J, Socher R, Manning C. GloVe: global vectors for word representation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2014: 1532-1543.
22 Bojanowski P, Grave E, Joulin A, et al. Enriching word vectors with subword information[J]. Transactions of the Association for Computational Linguistics, 2017, 5: 135-146.
23 Peters M E, Neumann M, Iyyer M, et al. Deep contextualized word representations[C]// Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: Association for Computational Linguistics, 2018: 2227-2237.
24 Radford A, Narasimhan K, Salimans T, et al. Improving language understanding by generative pre-training[OL]. (2018-06-11). https://www.mikecaptain.com/resources/pdf/GPT-1.pdf.
25 Devlin J, Chang M W, Lee K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: Association for Computational Linguistics, 2019: 4171-4186.
26 Asudani D S, Nagwani N K, Singh P. Impact of word embedding models on text analytics in deep learning environment: a review[J]. Artificial Intelligence Review, 2023, 56(9): 10345-10425.
27 Gupta V, Saw A, Nokhiz P, et al. P-SIF: document embeddings using partition averaging[C]//Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2020: 7863-7870.
28 Le Q, Mikolov T. Distributed representations of sentences and documents[J]. Proceedings of Machine Learning Research, 2014, 32(2): 1188-1196.
29 柳美君, 石静, 杨斯杰, 等. 科学家研究主题演化速度与科研绩效的关系研究: 基于计算机科学领域的分析[J]. 图书情报工作, 2024, 68(6): 72-82.
30 Kim D, Seo D, Cho S, et al. Multi-co-training for document classification using various document representations: TF-IDF, LDA, and Doc2Vec[J]. Information Sciences, 2019, 477: 15-29.
31 Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
32 Chung J, Gulcehre C, Cho Ket al. Empirical evaluation of gated recurrent neural networks on sequence modeling[OL]. (2014-12-11). https://arxiv.org/pdf/1412.3555.
33 Liu Y H, Ott M, Goyal N, et al. RoBERTa: a robustly optimized BERT pretraining approach[OL]. (2019-07-26). https://arxiv.org/pdf/1907.11692.
34 Briskilal J, Subalalitha C N. An ensemble model for classifying idioms and literal texts using BERT and RoBERTa[J]. Information Processing & Management, 2022, 59(1): 102756.
35 Yang Z L, Dai Z H, Yang Y M, et al. XLNet: generalized autoregressive pretraining for language understanding[C]// Proceedings of the 33rd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates, 2019: 5753-5763.
36 Dong Z C, Tang T Y, Li L Y, et al. A survey on long text modeling with transformers[OL]. (2025-06-10). https://arxiv.org/pdf/2302.14502.
37 He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]// Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778.
38 Niu Z Y, Zhong G Q, Yu H. A review on the attention mechanism of deep learning[J]. Neurocomputing, 2021, 452: 48-62.
39 Minaee S, Kalchbrenner N, Cambria E, et al. Deep learning—based text classification: a comprehensive review[J]. ACM Computing Surveys, 2021, 54(3): Article No.62.
40 Incitti F, Urli F, Snidaro L. Beyond word embeddings: a survey[J]. Information Fusion, 2023, 89: 418-436.
41 Yang Y, Zhang K P, Fan Y Y. sDTM: a supervised Bayesian deep topic model for text analytics[J]. Information Systems Research, 2023, 34(1): 137-156.
42 Amriza R N S, Ngafidin K N M. BiGRU-CNN-AT: classifiying emotion on social media[J]. Data Technologies and Applications, 2025, 59(2): 250-275.
43 Bhatti U A, Tang H, Wu G L, et al. Deep learning with graph convolutional networks: an overview and latest applications in computational intelligence[J]. International Journal of Intelligent Systems, 2023, 2023(1): 8342104.
44 Wang K Z, Ding Y H, Han S C. Graph neural networks for text classification: a survey[J]. Artificial Intelligence Review, 2024, 57(8): Article No.190.
45 Bijari K, Zare H, Kebriaei E, et al. Leveraging deep graph-based text representation for sentiment polarity applications[J]. Expert Systems with Applications, 2020, 144: 113090.
46 Yao L, Mao C S, Luo Y. Graph convolutional networks for text classification[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33(1): 7370-7377.
47 Liu X E, You X X, Zhang X, et al. Tensor graph convolutional networks for text classification[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(5): 8409-8416.
48 Veli?kovi? P, Cucurull G, Casanova A, et al. Graph attention networks[C]// Proceedings of the 6th International Conference on Learning Representations. Vancouver: OpenReview.net, 2018: 164.
49 Yang T C, Hu L M, Shi C, et al. HGAT: heterogeneous graph attention networks for semi-supervised short text classification[J]. ACM Transactions on Information Systems, 2021, 39(3): Article No.32.
50 Lin M, Wang T, Zhu Y F, et al. A heterogeneous directed graph attention network for inductive text classification using multilevel semantic embeddings[J]. Knowledge-Based Systems, 2024, 295: 111797.
51 Shan Y X, Che C, Wei X P, et al. Bi-graph attention network for aspect category sentiment classification[J]. Knowledge-Based Systems, 2022, 258: 109972.
52 Flamary R, Courty N, Gramfort A, et al. POT: python optimal transport[J]. Journal of Machine Learning Research, 2021, 22: Article No.78.
53 李文敬. 基于词移嵌入的文本表示方法及其在典型文本处理任务中的应用研究[D]. 南京: 南京理工大学, 2023.
54 Grootendorst M. BERTopic: neural topic modeling with a class-based TF-IDF procedure[OL]. (2022-03-11). https://arxiv.org/pdf/2203.05794.
55 Angelov D. Top2Vec: distributed representations of topics[OL]. (2020-08-19). https://arxiv.org/pdf/2008.09470.
56 Huang S Z, Lu W, Cheng Q K, et al. Evolutions of semantic consistency in research topic via contextualized word embedding[J]. Information Processing & Management, 2024, 61(6): 103859.
57 Yu D J, Xiang B. An ESTs detection research based on paper entity mapping: combining scientific text modeling and neural prophet[J]. Journal of Informetrics, 2024, 18(4): 101551.
58 Xie Q, Zhang X Y, Ding Y, et al. Monolingual and multilingual topic analysis using LDA and BERT embeddings[J]. Journal of Informetrics, 2020, 14(3): 101055.
59 Li Y T, Chen Y, Wang Q Y. Evolution and diffusion of information literacy topics[J]. Scientometrics, 2021, 126(5): 4195-4224.
60 Nie Z J, Feng Z C, Li M X, et al. When text embedding meets large language model: a comprehensive survey[OL]. (2025-03-20). https://arxiv.org/pdf/2412.09165.
61 Wang L, Yang N, Huang X L, et al. Improving text embeddings with large language models[C]// Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2024: 11897-11916.