Imbalanced Classification of Emerging Technologies Identification: Based on Cost-sensitive Random Forest
Lu Xiaobin, Zhang Yangyi, Yang Guancan, Xing Jiaxin
School of Information Resource Management, Renmin University of China, Beijing 100872
Abstract Automated, forward-looking forecasting based on large-scale patent data and patent characteristics has gradually become a research focus in emerging technology identification. The introduction of machine learning has also drawn attention to the fact that emerging technologies make up only a small fraction of the massive body of technological inventions represented by patents, which makes their identification a typical imbalanced classification problem. This study aims to reduce the bias toward the majority class that imbalanced datasets cause in emerging technology identification. It proposes a comprehensive imbalanced classification optimization framework that integrates three levels (data, algorithm, and evaluation) and verifies it on a binary classification task: whether patents in the cancer drug field go on to be approved by the Food and Drug Administration as new drugs, which are treated here as emerging technologies. The specific improvements are as follows: at the data level, progressive resampling is verified; at the algorithm level, cost-sensitive learning is introduced to construct a cost-sensitive random forest; and at the evaluation level, three methods for setting the cost matrix in the absence of expert experience are studied. The results show that a cost-sensitive random forest based on 1:2 undersampling and a cost matrix derived from the ROC (receiver operating characteristic) Youden index threshold correctly predicts 82.8% of the emerging technologies and 81.6% of the common technologies, which is significantly better than the control group and existing related results. The study offers a reference for future exploration of the nature of the imbalanced classification problem in emerging technology identification.
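The two ingredients named in the results, undersampling and a Youden index threshold, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that "1:2" means one minority sample for every two majority samples, that the Youden index is J = sensitivity + specificity - 1 evaluated at each candidate score threshold, and that a threshold t is turned into a cost ratio through the standard cost-sensitive decision rule cost(FN)/cost(FP) = (1 - t)/t. The function names `undersample`, `youden_threshold`, and `cost_ratio_from_threshold` are hypothetical.

```python
import random


def undersample(labels, ratio=2, seed=0):
    """Keep every minority (label 1) index and at most `ratio` times as many
    randomly chosen majority (label 0) indices. ratio=2 is the assumed
    reading of "1:2 undersampling"."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    keep_neg = rng.sample(neg, min(len(neg), ratio * len(pos)))
    return sorted(pos + keep_neg)


def youden_threshold(labels, scores):
    """Return (threshold, J) maximising J = TPR + TNR - 1, where a sample is
    predicted positive when its score is >= threshold. Ties keep the
    smallest threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < t)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


def cost_ratio_from_threshold(t):
    """Assumed mapping from a probability threshold t to cost(FN)/cost(FP):
    with zero cost for correct decisions, predicting positive when p >= t is
    cost-optimal exactly when t = cost(FP) / (cost(FP) + cost(FN))."""
    return (1 - t) / t


if __name__ == "__main__":
    y = [0, 0, 0, 0, 1, 1]
    p = [0.1, 0.2, 0.3, 0.6, 0.5, 0.9]
    t, j = youden_threshold(y, p)
    print(t, j, cost_ratio_from_threshold(t))
```

In a library setting, a ratio obtained this way could be passed to a class-weighting option of a random forest implementation; whether that matches the cost-sensitive random forest built in the paper is not stated in the abstract.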
Received: 16 August 2021