Chinese Short Text Topic Analysis by Latent Dirichlet Allocation Model with Co-word Network Analysis
Cai Yongming1, Chang Qing2
1. Business School, University of Jinan, Jinan 250002; 2. School of Economics and Management, Inner Mongolia University of Technology, Huhhot 010051
Abstract Because short texts yield sparse features, traditional topic models such as LDA and PLSA are poorly suited to analyzing them. Building on the traditional LDA model, this paper proposes a Latent Dirichlet Allocation model with Co-word Network Analysis (CA-LDA), which takes the word co-occurrence network into account. Following the automorphic equivalence principle, a latent space model is used to reduce dimensionality with minimal information loss. Eigenvector centrality is used to revise the LDA model, raising the weights of important words through recursive accumulation. During Gibbs sampling, the latent position cluster model for social networks is used to raise the probability that words with similar lexical collocations are assigned to the same topic. Experimental results indicate that the proposed model performs well.
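As a rough illustration of the eigenvector-centrality idea mentioned in the abstract, the Python sketch below builds a co-word network from a few tokenized short texts and scores each word by power iteration, so that a word's score accumulates recursively from the scores of the words it co-occurs with. The function names (`coword_matrix`, `eigenvector_centrality`), the toy documents, and the identity shift used to stabilize the iteration are assumptions made for this example; the abstract does not specify how CA-LDA feeds these weights into the LDA model, so that step is not shown.

```python
from itertools import combinations

import numpy as np


def coword_matrix(docs):
    """Build a symmetric co-occurrence matrix from tokenized short texts.

    Two words are linked once for every document in which both appear.
    (Hypothetical helper, not taken from the paper.)
    """
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(vocab)))
    for doc in docs:
        for u, v in combinations(sorted(set(doc)), 2):
            A[index[u], index[v]] += 1
            A[index[v], index[u]] += 1
    return vocab, A


def eigenvector_centrality(A, iters=200, tol=1e-10):
    """Approximate the leading eigenvector of A by power iteration.

    Iterating with A + I (same eigenvectors as A) avoids oscillation on
    bipartite networks. Assumes the network is connected.
    """
    x = np.ones(A.shape[0]) / A.shape[0]
    for _ in range(iters):
        x_new = A @ x + x          # accumulate neighbours' scores
        x_new /= x_new.sum()       # keep the scores summing to 1
        converged = np.abs(x_new - x).sum() < tol
        x = x_new
        if converged:
            break
    return x


# Toy corpus (invented for illustration): "hub" co-occurs with every other word.
docs = [["hub", "a"], ["hub", "b"], ["hub", "c"]]
vocab, A = coword_matrix(docs)
scores = eigenvector_centrality(A)
for word, score in zip(vocab, scores):
    print(f"{word}: {score:.3f}")
```

In this toy network the word that co-occurs with all the others receives the largest score, which is the sense in which centrality can single out important words for up-weighting.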
Received: 24 May 2017