An under sampled k-means approach for handling imbalanced data using diversified distribution


  • G Shobana
  • Bhanu Prakash Battula





Data Mining, Classification, Class Imbalance Data, Under Sampling, USDD.


Some real-world applications reveal difficulties in learning classifiers from imbalanced data. Although several methods for improving classifiers have been introduced, identifying the conditions under which a particular method can be used effectively remains an open research problem. It is also worth studying the nature of imbalanced data, the characteristics of the minority class distribution, and their influence on classification performance. However, current studies of the difficulty factors in imbalanced data have mostly been carried out on synthetic datasets, and their conclusions do not transfer easily to real-world problems, partly because the methods for identifying these factors are not sufficiently developed. In this paper, we propose a novel approach, Under Sampling Using Diversified Distribution (USDD), for addressing class imbalance in real-world datasets. It identifies and removes borderline, rare, and outlier subgroups using k-means. USDD uses a dedicated technique for identifying these types of examples, based on analyzing the class distribution in the local neighborhood of each example with a k-nearest-neighbor approach. The experimental results suggest that the proposed USDD approach performs better than the compared approach in terms of AUC, accuracy, recall, and F-measure.
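The neighborhood analysis described in the abstract can be illustrated with a minimal sketch. It is not the authors' USDD implementation: the function names, the share thresholds, the toy data, and the choice to drop only non-safe examples of one chosen class are all assumptions made for illustration, and the k-means step is left out because the abstract does not say how it is applied.

```python
import math

def neighborhood_type(points, labels, index, k=5):
    """Label one example by the share of its k nearest neighbours in its own class."""
    # Euclidean distance from the example to every other example, nearest first.
    distances = sorted(
        (math.dist(points[index], points[j]), j)
        for j in range(len(points)) if j != index
    )
    nearest = distances[:k]
    if not nearest:
        return "outlier"  # a lone example has no neighbourhood to support it
    same = sum(1 for _, j in nearest if labels[j] == labels[index])
    share = same / len(nearest)
    # Thresholds are illustrative assumptions, not values taken from the paper.
    if share >= 0.8:
        return "safe"
    if share >= 0.4:
        return "borderline"
    if share > 0.0:
        return "rare"
    return "outlier"

def undersample(points, labels, target_class, k=5):
    """Drop examples of target_class that are not 'safe'; keep everything else."""
    keep = [
        i for i in range(len(points))
        if labels[i] != target_class
        or neighborhood_type(points, labels, i, k) == "safe"
    ]
    return [points[i] for i in keep], [labels[i] for i in keep]

# Toy data (hypothetical): a tight "neg" cluster, one stray "neg" point
# sitting inside the "pos" region, and three "pos" points.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0, 0.5),
          (10, 10.5),
          (10, 10), (10, 11), (11, 10)]
labels = ["neg"] * 7 + ["pos"] * 3

kept_points, kept_labels = undersample(points, labels, "neg", k=3)
```

In this toy run the stray "neg" point has only "pos" neighbours, so it is typed as an outlier and removed, while the clustered "neg" points are typed as safe and kept. Typing by the share of same-class neighbours keeps the rule independent of the chosen k.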


