A review on data preprocessing methods for class imbalance problem

  • Authors

    • Haseeb Ali, Universiti Tun Hussein Onn Malaysia, Batu Pahat, Johor, Malaysia
    • Mohd Najib Mohd Salleh, Universiti Tun Hussein Onn Malaysia, Batu Pahat, Johor, Malaysia
    • Kashif Hussain, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China
    • Arshad Ahmad, Department of Computer Science, University of Swabi, Swabi, KPK, Pakistan
    • Ayaz Ullah, Department of Computer Science, University of Swabi, Swabi, KPK, Pakistan
    • Arshad Muhammad, Faculty of Computing and Information Technology, Sohar University, Oman
    • Rashid Naseem, Department of Computer Science, City University of Science and Information Technology, Peshawar, KPK, Pakistan
    • Muzammil Khan, Department of Computer and Software Technology, University of Swat, KPK, Pakistan
    Published: 2019-10-02
    DOI: https://doi.org/10.14419/ijet.v8i3.29508
  • Keywords: Imbalanced Data, Re-Sampling, Majority Class, Minority Class, Oversampling
  • Abstract: Data mining methods are often impaired by datasets of a disparate nature. Such real-world datasets contain imbalanced data distributions among classes, which affects the learning process negatively. In this scenario, the number of samples belonging to one class (the majority class) far exceeds the number of samples of the other class (the minority class), causing classification methods to ignore the minority class. To address this, various data preprocessing approaches are considered essential for developing an effective model with contemporary data mining algorithms. Oversampling and undersampling are two fundamental approaches for preprocessing data to balance the class distribution within a dataset. In this study, we thoroughly discuss these preprocessing techniques and approaches, as well as the challenges researchers face in overcoming the weaknesses of resampling techniques. This paper highlights the basic issue of classifiers that are biased toward the majority class and ignore the minority class. Additionally, we synthesize viable solutions and suggestions on how to handle data preprocessing problems effectively, and present open issues that call for further research.
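
  The abstract contrasts oversampling and undersampling as the two fundamental resampling strategies for class imbalance. As a minimal illustrative sketch only (not code from the paper), the snippet below balances a synthetic two-class dataset with SMOTE, a common oversampling method, and with random undersampling, using the third-party imbalanced-learn package; the synthetic dataset, the 90/10 class ratio, and the parameter values are assumptions chosen for the example.

  ```python
  # Illustrative sketch: balancing an imbalanced two-class dataset with
  # oversampling (SMOTE) and random undersampling. Requires scikit-learn
  # and the third-party imbalanced-learn package; all parameter values
  # below are arbitrary choices for demonstration.
  from collections import Counter

  from sklearn.datasets import make_classification
  from imblearn.over_sampling import SMOTE
  from imblearn.under_sampling import RandomUnderSampler

  # Synthetic dataset: roughly 90% majority class, 10% minority class.
  X, y = make_classification(n_samples=1000, n_features=10,
                             weights=[0.9, 0.1], random_state=42)
  print("original distribution:", Counter(y))

  # Oversampling: SMOTE synthesizes new minority samples by interpolating
  # between a minority sample and one of its k nearest minority neighbours.
  X_over, y_over = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
  print("after SMOTE:", Counter(y_over))

  # Undersampling: randomly discard majority samples until the classes match.
  X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
  print("after random undersampling:", Counter(y_under))
  ```

  Resampling of this kind is applied to the training split only, so that the evaluation data keeps its original class distribution.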




  • How to Cite

    Ali, H., Najib Mohd Salleh, M., Hussain, K., Ahmad, A., Ullah, A., Muhammad, A., Naseem, R., & Khan, M. (2019). A review on data preprocessing methods for class imbalance problem. International Journal of Engineering & Technology, 8(3), 390-397. https://doi.org/10.14419/ijet.v8i3.29508