Keyword-Based Clustering Ensembles in Academic Documents
Zhang Yingyi 1,2, Zhang Chengzhi 1,2, Chen Guo 1
1. Department of Information Management, Nanjing University of Science & Technology, Nanjing 210094
2. Institute of Scientific and Technical Information of China, Beijing 100038
Abstract Clustering is an efficient, unsupervised method of grouping articles. Keyphrases capture the main content of an article, so clustering on keyphrases is an effective approach to text clustering. However, existing studies of publication clustering rely on clustering algorithms alone. Several methods have been proposed to improve clustering performance, and ensemble clustering is one of them. Based on the concept of ensemble clustering, this paper presents a study of keyphrase-based publication clustering. To determine whether ensemble clustering is effective for publication clustering, the paper compares its performance with that of clustering without ensemble learning. To analyze the impact of keyphrases on ensemble clustering, the paper also compares publication clustering results across keyphrase extraction methods and across numbers of keyphrases. The experimental results show that ensemble clustering improves the performance of publication clustering. Publication clustering yields better results when keyphrases are extracted with the TextRank algorithm, and performance improves further as more keyphrases are provided.
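The abstract does not say how the base clusterings are combined, so the following is only a minimal sketch of one common consensus scheme, a co-association (evidence accumulation) approach. The function names, the 0.5 threshold, and the toy partitions are all assumptions made for illustration and are not taken from the paper. Each base partition could come, for example, from one clustering run over keyphrase vectors; documents that share a cluster in most runs are linked in the consensus result.

```python
def co_association(partitions):
    """Fraction of base partitions in which each pair of documents shares a cluster."""
    n = len(partitions[0])
    m = len(partitions)
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / m for c in row] for row in counts]


def consensus_clusters(partitions, threshold=0.5):
    """Link documents co-clustered in more than `threshold` of the partitions,
    then label each connected component as one consensus cluster."""
    co = co_association(partitions)
    n = len(co)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue  # already assigned to an earlier component
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


# Hypothetical base partitions of six documents (illustrative data only).
# The second run places document 2 in a different group than the other two runs.
base_partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
]

if __name__ == "__main__":
    print(consensus_clusters(base_partitions))
```

In this toy input, document 2 is grouped with documents 0 and 1 in two of the three runs, so the majority vote keeps it with them even though one run disagrees. This is the intuition behind using an ensemble rather than a single clustering run.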
Received: 17 July 2018