Research on Optimization of Scientific Literature Similarity Calculation Based on the Co-citation Feature
Han Qing, Zhou Xiaoying
School of Information Resources Management, Renmin University of China, Beijing 100872
Abstract Calculating the similarity of scientific literature underlies applications such as literature search and literature analysis, and its results directly affect the final effectiveness of those applications. Co-citation information is an important feature that distinguishes scientific literature from ordinary text. It can effectively represent the correlation between two documents and can therefore be used to improve the validity and reliability of literature similarity calculation. Building on the vector space model, this study introduces semantic features and co-citation features into the literature similarity calculation and proposes a hybrid model to optimize the similarity calculation of scientific literature. Verification in seven research fields, including university libraries, online public opinion, and information quality, shows that the proposed model makes full use of the co-citation features of scientific literature, thereby compensating for the insufficient features of the vector space model and improving the overall performance of scientific literature similarity calculation.
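As a rough illustration of the idea in the abstract, the sketch below combines a vector-space text similarity with a co-citation similarity. It is not the authors' model: the TF-IDF weighting, the cosine normalization of co-citation counts, the linear blend, and the weight `alpha` are all assumptions made for illustration, and the semantic-feature component is left out. The function names (`tfidf_vectors`, `cocitation_similarity`, `hybrid_similarity`) are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) from tokenized documents.

    Assumption: smoothed IDF; the paper's exact weighting is not given here.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cocitation_similarity(citers_a, citers_b):
    """Co-citation count normalized by the geometric mean of citation counts.

    citers_a / citers_b are the sets of papers citing each document.
    Assumption: this normalization is one common choice, not the paper's.
    """
    if not citers_a or not citers_b:
        return 0.0
    return len(citers_a & citers_b) / math.sqrt(len(citers_a) * len(citers_b))


def hybrid_similarity(vec_a, vec_b, citers_a, citers_b, alpha=0.5):
    """Linear blend of text and co-citation similarity (alpha is hypothetical)."""
    text_sim = cosine(vec_a, vec_b)
    cocit_sim = cocitation_similarity(citers_a, citers_b)
    return alpha * text_sim + (1 - alpha) * cocit_sim


# Toy data: three tokenized documents and the sets of papers citing them.
docs = [
    ["university", "library", "service"],
    ["library", "service", "reader"],
    ["online", "public", "opinion"],
]
citers = [{"p1", "p2", "p3"}, {"p2", "p3", "p4"}, {"p9"}]
vectors = tfidf_vectors(docs)

print(hybrid_similarity(vectors[0], vectors[1], citers[0], citers[1]))
print(hybrid_similarity(vectors[0], vectors[2], citers[0], citers[2]))
```

In this toy data the first pair shares both terms and citing papers, so it scores higher than the second pair, which shares neither. In practice the blend weight would need to be tuned on labeled data.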
Received: 07 April 2018