Misinformation Identification Method by Automatic Iterative Clustering Data Set for Training

Zhang Junsheng¹, Sun Xiaoping², Liu Zhihui¹

1. Institute of Scientific and Technical Information of China, Beijing 100038
2. KL-IIP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190

Abstract: As misinformation proliferates on the Internet, its automatic identification has become an urgent need for information governance. New misinformation emerges continually as new events occur, so the machine learning models used to identify it must be iteratively updated. Each update requires a newly constructed training data set so that the latest misinformation is reflected in training. This study therefore proposes a misinformation identification method that dynamically and iteratively updates the training set used to build the machine learning model. The misinformation data set is iteratively clustered based on kernel density estimation, and within each cluster, samples are selected for the training and test data sets of the corresponding classifier, so that samples from new events are represented in the training set. The experimental results show that a misinformation classifier trained with the iterative clustering method based on kernel density estimation achieves significantly higher classification accuracy than one trained with a random data set division strategy.
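The abstract does not give the algorithmic details, so the following is only a minimal sketch of the general idea: cluster the samples around the modes of a kernel density estimate, then draw training and test samples from inside every cluster. Gaussian-kernel mean shift is used here as a stand-in for the paper's clustering step, the iterative re-clustering is omitted, and toy 2-D points stand in for text embeddings. All function names, the bandwidth, and the test ratio are assumptions made for illustration.

```python
import math
import random


def mean_shift_mode(x, points, bandwidth, iters=100, tol=1e-6):
    """Climb the Gaussian kernel density estimate from x toward a local mode."""
    for _ in range(iters):
        weights = [
            math.exp(-sum((a - b) ** 2 for a, b in zip(x, p)) / (2 * bandwidth ** 2))
            for p in points
        ]
        total = sum(weights)
        new_x = tuple(
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(len(x))
        )
        if math.dist(new_x, x) < tol:
            return new_x
        x = new_x
    return x


def kde_cluster(points, bandwidth):
    """Label each point by the density mode it converges to."""
    modes, labels = [], []
    for p in points:
        m = mean_shift_mode(p, points, bandwidth)
        for i, known in enumerate(modes):
            if math.dist(m, known) < bandwidth / 2:  # same mode, same cluster
                labels.append(i)
                break
        else:
            modes.append(m)
            labels.append(len(modes) - 1)
    return labels


def split_per_cluster(labels, test_ratio=0.2, seed=0):
    """Within every cluster, send a share of the samples to the test set."""
    rng = random.Random(seed)
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    train, test = [], []
    for members in clusters.values():
        rng.shuffle(members)
        # Singleton clusters stay entirely in the training set.
        n_test = max(1, round(len(members) * test_ratio)) if len(members) > 1 else 0
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


# Toy data: two well-separated "events", 10 samples each (hypothetical embeddings).
rng = random.Random(42)
event_a = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(10)]
event_b = [(rng.gauss(5, 0.3), rng.gauss(5, 0.3)) for _ in range(10)]
points = event_a + event_b

labels = kde_cluster(points, bandwidth=1.0)
train_idx, test_idx = split_per_cluster(labels)
print(len(set(labels)), len(train_idx), len(test_idx))
```

Because the split is made per cluster rather than over the whole data set at random, every discovered cluster contributes samples to the training set, which is the property the abstract attributes to the proposed strategy.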

Received: 28 October 2021
(Responsible editor: Wang Keping)