A review on data preprocessing methods for class imbalance problem

  • Authors

    • Haseeb Ali, Universiti Tun Hussein Onn Malaysia, Batu Pahat, Johor, Malaysia
    • Mohd Najib Mohd Salleh, Universiti Tun Hussein Onn Malaysia, Batu Pahat, Johor, Malaysia
    • Kashif Hussain, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China
    • Arshad Ahmad, Department of Computer Science, University of Swabi, Swabi, KPK, Pakistan
    • Ayaz Ullah, Department of Computer Science, University of Swabi, Swabi, KPK, Pakistan
    • Arshad Muhammad, Faculty of Computing and Information Technology, Sohar University, Oman
    • Rashid Naseem, Department of Computer Science, City University of Science and Information Technology, Peshawar, KPK, Pakistan
    • Muzammil Khan, Department of Computer and Software Technology, University of Swat, KPK, Pakistan
    Published: 2019-10-02
    DOI: https://doi.org/10.14419/ijet.v8i3.29508
  • Keywords: Imbalanced Data, Re-Sampling, Majority Class, Minority Class, Oversampling
  • Abstract: Data mining methods are often impaired by datasets of a disparate nature. Such real-world datasets contain imbalanced data distributions among classes, which affects the learning process negatively. In this scenario, the number of samples belonging to one class (the majority class) far exceeds the number of samples of the other class (the minority class), causing classification methods to ignore the minority class. To address this, various data preprocessing approaches are considered essential for developing an effective model with contemporary data mining algorithms. Oversampling and undersampling are two fundamental approaches for preprocessing data to balance the class distribution within a dataset. In this study, we thoroughly discuss these preprocessing techniques and approaches, as well as the challenges researchers face in overcoming the weaknesses of resampling techniques. This paper highlights the basic issue of classifiers that are biased toward the majority class and ignore the minority class. Additionally, we synthesize viable solutions and suggestions on how to handle data preprocessing problems effectively, and present open issues that call for further research.
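
  The abstract contrasts oversampling and undersampling as the two fundamental resampling strategies for class imbalance. As a minimal illustrative sketch only (not code from the paper), the snippet below balances a synthetic two-class dataset with SMOTE, a common oversampling method, and with random undersampling, using the third-party imbalanced-learn package; the synthetic dataset, the 90/10 class ratio, and the parameter values are assumptions chosen for the example.

  ```python
  # Illustrative sketch: balancing an imbalanced two-class dataset with
  # oversampling (SMOTE) and random undersampling. Requires scikit-learn
  # and the third-party imbalanced-learn package; all parameter values
  # below are arbitrary choices for demonstration.
  from collections import Counter

  from sklearn.datasets import make_classification
  from imblearn.over_sampling import SMOTE
  from imblearn.under_sampling import RandomUnderSampler

  # Synthetic dataset: roughly 90% majority class, 10% minority class.
  X, y = make_classification(n_samples=1000, n_features=10,
                             weights=[0.9, 0.1], random_state=42)
  print("original distribution:", Counter(y))

  # Oversampling: SMOTE synthesizes new minority samples by interpolating
  # between a minority sample and one of its k nearest minority neighbours.
  X_over, y_over = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
  print("after SMOTE:", Counter(y_over))

  # Undersampling: randomly discard majority samples until the classes match.
  X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
  print("after random undersampling:", Counter(y_under))
  ```

  Resampling of this kind is applied to the training split only, so that the evaluation data keeps its original class distribution.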




  • How to Cite

    Ali, H., Najib Mohd Salleh, M., Hussain, K., Ahmad, A., Ullah, A., Muhammad, A., Naseem, R., & Khan, M. (2019). A review on data preprocessing methods for class imbalance problem. International Journal of Engineering & Technology, 8(3), 390-397. https://doi.org/10.14419/ijet.v8i3.29508