Study on Identification of Potential “Treasures” in Massive Papers Based on Machine Learning Models
Hu Zewen, Ren Ping, Cui Jingjing
School of Management Science and Engineering, Nanjing University of Information Science & Technology, Nanjing 210044
Abstract Constructing a feature vector space for a massive body of literature and using machine learning models to identify potential “treasures” in it accurately and automatically can enhance the scientific influence of those papers and facilitate advances in science and technology. This study designs and implements machine learning models, and a model framework, for identifying potential “treasures” among massive scientific and technological papers. As samples, we collected papers (and their citation data) published in international high-impact journals indexed in Web of Science and in domestic journals in the field of Library, Information and Archives Management. We then measured the bibliometric characteristics of these papers and constructed a feature vector space of the literature. Traditional machine learning models, such as the support vector machine and naive Bayes models, and deep learning models, such as deep belief networks and the multilayer perceptron, were then used to identify potential “high-quality” papers. A receiver operating characteristic (ROC) curve and a confusion matrix were used to evaluate the recognition performance of the machine learning algorithms. The results show that the deep learning models cannot efficiently identify potential “treasures” among massive papers and exhibit low recognition performance. In contrast, the traditional machine learning models can efficiently identify potential “treasures” in both the international high-impact journals and the domestic journals in Library, Information and Archives Management. Among them, the random forest and support vector machine models show the best recognition performance, whereas the decision tree and naive Bayes models perform relatively poorly. Moreover, the more influential a journal is, the better the recognition performance.
For both the international high-impact journals from the natural sciences and the domestic journals from the social sciences, the identified excellent papers have higher citation frequencies, and very few of them are review papers. Furthermore, a comparison of the bibliometric features of all papers analyzed shows that most identified excellent papers are multi-author articles supported by science foundations, with a shorter time to first citation, more references and keywords, a higher citation frequency, and longer abstracts. The empirical results show that machine learning models can accurately identify potential “high-quality” articles in massive scientific and technological literature and increase the degree of automation of this identification. The findings also provide a theoretical reference and methodological support for the automatic recognition, dissemination, and utilization of potential “high-quality” papers in massive literature.
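The two evaluation tools named in the abstract, the confusion matrix and the ROC curve, can be illustrated with a minimal sketch. This is not the study's implementation: the function names, the labels, the classifier scores, and the 0.5 decision threshold are all hypothetical, and the area under the ROC curve (AUC) is used here as one common single-number summary of an ROC curve.

```python
from typing import Sequence, Tuple


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tn, fp, fn, tp) for binary labels, where 1 marks a "treasure"."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1:
            fn += 1
        elif p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive paper receives
    the higher score; tied pairs count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = "treasure", 0 = ordinary paper (not the study's data).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
# Assumed decision threshold of 0.5 for turning scores into predictions.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(confusion_matrix(y_true, y_pred))  # (4, 1, 1, 2)
print(roc_auc(y_true, scores))           # 14/15, about 0.933
```

The confusion matrix depends on the chosen threshold, whereas the AUC summarizes ranking quality across all thresholds, which is one reason the two are often reported together when comparing classifiers.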
Received: 16 December 2021
(Responsible editor: Wang Keping)