An Improved Similarity Measurement Model for Chinese Short Texts Based on Weighted Network
Niu Fenggao, Gao Xuxia
School of Mathematical Sciences, Shanxi University, Taiyuan 030006
Abstract With the explosive growth of text information, data mining has become a principal method of knowledge discovery, and text similarity measurement is one of its key techniques. To improve the accuracy of similarity calculation for short texts, we propose a new similarity measurement model based on a weighted network. First, the semantic network is weighted by the co-occurrence frequency of words, and each short text is represented as a weighted complex network. Second, taking into account both the weak discriminative power of weights in the weighted complex network of a short text and the position of each word node, a weighted complex network characteristic value is calculated for each word in the short text. Finally, the similarity between short texts is computed with the new model, and the model is evaluated through short-text clustering. Experimental results indicate that the new method outperforms the STSim model.
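
The first step described in the abstract, weighting a word network by co-occurrence frequency, can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the sliding-window definition of co-occurrence, and the window size are assumptions made for illustration. Node strength (the sum of incident edge weights) is used only as a simple stand-in, since the abstract does not define the weighted characteristic value. The input is assumed to be already word-segmented.

```python
from collections import Counter, defaultdict


def build_cooccurrence_network(tokens, window=2):
    """Build an undirected weighted network from a token list.

    Assumption: two words co-occur when they fall inside the same
    sliding window of `window` consecutive tokens. The edge weight
    is the number of such co-occurrences.
    """
    edges = Counter()
    for i, u in enumerate(tokens):
        # Pair the current token with the tokens that follow it in the window.
        for v in tokens[i + 1 : i + window]:
            if u != v:  # skip self-loops
                edges[tuple(sorted((u, v)))] += 1
    return edges


def node_strength(edges):
    """Sum the weights of all edges incident to each node.

    Hypothetical stand-in for the per-word characteristic value.
    """
    strength = defaultdict(int)
    for (u, v), weight in edges.items():
        strength[u] += weight
        strength[v] += weight
    return dict(strength)


# Toy input standing in for a segmented short text.
tokens = ["text", "similarity", "measure", "text", "similarity"]
edges = build_cooccurrence_network(tokens, window=2)
strength = node_strength(edges)
```

With `window=2`, only adjacent words are linked. A larger window yields a denser network, which may matter for very short texts where few words are adjacent.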
Received: 24 February 2020