A Novel Approach for Handling Outliers in Imbalanced Data

  • Abstract

    Most traditional classification algorithms assume that their training data are well balanced in terms of class distribution. Real-world datasets, however, are imbalanced by nature, which degrades the performance of these classifiers. To address this problem, many strategies balance the class distribution at the data level, using either oversampling or undersampling to even out the majority and minority classes. The main concern of this paper is removing the outliers that may be generated by oversampling techniques. In this study, we propose a novel data-level approach to the class imbalance problem: a modified SMOTE that removes the outliers that may exist after synthetic data generation with the SMOTE oversampling technique. We extensively compare our approach with SMOTE, SMOTE+ENN, and SMOTE+Tomek-Link on 9 datasets from the KEEL repository using multiple classification algorithms. The results show that our approach improves prediction performance for most of the classification algorithms and outperforms the existing approaches.
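
The abstract does not spell out how the modified SMOTE detects outliers, so the following is only a minimal sketch of the general idea, assuming (based on the keywords) that synthetic minority samples are screened by their Mahalanobis distance to the original minority class. The function names `smote_like_oversample` and `mahalanobis_filter`, the neighbour count `k`, and the `quantile` threshold are hypothetical choices made for illustration and are not taken from the paper.

```python
import numpy as np


def smote_like_oversample(X_min, n_new, k=5, rng=None):
    """Create n_new synthetic minority samples by SMOTE-style interpolation.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances; a sample is never its own neighbour.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


def mahalanobis_filter(X_ref, X_new, quantile=0.95):
    """Keep rows of X_new that are not outliers relative to X_ref.

    Assumption (not from the paper): a point is an outlier when its
    Mahalanobis distance to the mean of X_ref exceeds the given quantile
    of the distances of the X_ref samples themselves.
    """
    X_ref = np.asarray(X_ref, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    mu = X_ref.mean(axis=0)
    cov = np.atleast_2d(np.cov(X_ref, rowvar=False))
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse tolerates singular covariance

    def dist(X):
        diff = X - mu
        sq = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
        return np.sqrt(np.maximum(sq, 0.0))

    threshold = np.quantile(dist(X_ref), quantile)
    keep = dist(X_new) <= threshold
    return X_new[keep], keep


# Usage: oversample a toy minority class, then drop far-out synthetic points.
rng = np.random.default_rng(0)
X_minority = rng.normal(size=(30, 2))
X_synthetic = smote_like_oversample(X_minority, n_new=50, k=5, rng=1)
X_kept, kept_mask = mahalanobis_filter(X_minority, X_synthetic, quantile=0.95)
```

In this sketch the filter is applied only to the synthetic samples, so the original minority data are never discarded; whether the paper's method works this way is not stated in the abstract.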




  • Keywords

    Classification Algorithms, Class Imbalance Learning, SMOTE, Resampling, Mahalanobis Distance.





Article ID: 16783
DOI: 10.14419/ijet.v7i3.1.16783

Copyright © 2012-2015 Science Publishing Corporation Inc. All rights reserved.