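Abstract: Because short texts have sparse features, traditional topic models such as LDA or PLSA perform poorly on them. Drawing on community detection techniques from social network analysis, this paper proposes the CA-LDA model (Latent Dirichlet Allocation Model with Co-word network Analysis). The model adds co-word network analysis to traditional LDA: it uses the co-occurrence of words across documents to build a word social network. Applying latent-space dimensionality reduction to this network, it merges words that have identical structural features in the network under the automorphic equivalence rule, which reduces the sparsity of the word matrix without losing information. It accounts for word collocation relations (adjacency of network nodes) by using the eigenvector centrality of the co-word network to adjust word weights in the topic model; through recursive accumulation, words that collocate with important words gain importance. During Gibbs sampling in the traditional LDA topic model, a community detection algorithm based on the latent position cluster model is run at the same time, which raises the probability that words with the same collocation relations are assigned to the same topic. Experiments show that the model performs well in short-text analysis.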
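The co-word network and eigenvector centrality steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy documents, the function names `build_coword_network` and `eigenvector_centrality`, and the choice of document-level co-occurrence counts as edge weights are all assumptions made for illustration. The abstract does not specify how the centrality scores enter the LDA word weights, so that step is left out.

```python
from collections import defaultdict
from itertools import combinations


def build_coword_network(docs):
    """Build a weighted co-word graph as {word: {neighbor: weight}}.

    Assumption: the edge weight is the number of documents in which
    the two words appear together.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def eigenvector_centrality(graph, iterations=200, tol=1e-9):
    """Approximate eigenvector centrality by power iteration.

    Each pass adds to a word's score the weighted scores of its
    neighbors (recursive accumulation), then rescales so the largest
    score is 1. Keeping the previous score in the sum iterates A + I,
    which has the same eigenvectors as A and avoids oscillation.
    """
    nodes = sorted(graph)
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        new = {
            n: score[n] + sum(w * score[m] for m, w in graph[n].items())
            for n in nodes
        }
        norm = max(new.values()) or 1.0
        new = {n: v / norm for n, v in new.items()}
        converged = max(abs(new[n] - score[n]) for n in nodes) < tol
        score = new
        if converged:
            break
    return score


# Hypothetical tokenized short texts, used only to exercise the functions.
docs = [
    ["topic", "model", "text"],
    ["topic", "model", "network"],
    ["network", "community"],
]

graph = build_coword_network(docs)
centrality = eigenvector_centrality(graph)
for word, value in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {value:.3f}")
```

In this toy network, "topic" and "model" have the same neighbors with the same weights, so they receive the same centrality. Words in that situation are the kind of structurally equivalent pair that the abstract describes merging to reduce sparsity.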
蔡永明, 长青. 共词网络LDA模型的中文短文本主题分析[J]. 情报学报, 2018, 37(3): 305-317.
Cai Yongming, Chang Qing. Chinese Short Text Topic Analysis by Latent Dirichlet Allocation Model with Co-word Network Analysis[J]. Journal of the China Society for Scientific and Technical Information, 2018, 37(3): 305-317.