Feature Selection using Genetic Algorithm for Clustering high Dimensional Data

  • Abstract

    Clustering high-dimensional data is one of the open problems of modern data mining. This paper proposes a new technique, GA-HDClustering, which works in two steps. First, a GA-based feature selection algorithm determines the optimal feature subset, that is, the subset consisting of the important features of the entire data set. Next, the K-means algorithm is applied to the optimal feature subset to find the clusters. As a baseline, the traditional K-means algorithm is applied to the full-dimensional feature space. Finally, the results of GA-HDClustering are compared with those of the traditional clustering algorithm using several validity metrics: sum of squared error (SSE), within-group average distance (WGAD), between-group distance (BGD), and the Davies-Bouldin index (DBI). GA-HDClustering uses a genetic algorithm to search for an effective feature subspace within the large feature space formed by all dimensions of the data set. Experiments on a standard data set showed that GA-HDClustering is superior to the traditional clustering algorithm.
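
    The two-step idea described in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not state the GA's fitness function or operators, so the code assumes the Davies-Bouldin index as fitness (lower is better) and uses common defaults (tournament selection, one-point crossover, bit-flip mutation, elitism). All function names (project, kmeans, davies_bouldin, make_fitness, ga_select) and the synthetic data are hypothetical.

```python
import math
import random

def project(data, mask):
    """Keep only the columns whose mask bit is 1."""
    return [[x for x, bit in zip(row, mask) if bit] for row in data]

def kmeans(points, k, rng, iters=15):
    """Plain Lloyd's K-means; returns (labels, centroids)."""
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def davies_bouldin(points, labels, centroids):
    """Davies-Bouldin index: lower means compact, well-separated clusters."""
    k = len(centroids)
    scatter = []
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            return math.inf  # treat a degenerate clustering as worst case
        scatter.append(sum(math.dist(p, centroids[c]) for p in members) / len(members))
    total = 0.0
    for i in range(k):
        ratios = []
        for j in range(k):
            if j != i:
                sep = math.dist(centroids[i], centroids[j])
                ratios.append((scatter[i] + scatter[j]) / sep if sep > 0 else math.inf)
        total += max(ratios)
    return total / k

def make_fitness(data, k, seed):
    """Assumed fitness: DBI of K-means run on the selected columns (cached)."""
    cache = {}
    def fitness(mask):
        key = tuple(mask)
        if key not in cache:
            if not any(mask):
                cache[key] = math.inf  # selecting no feature is invalid
            else:
                sub = project(data, mask)
                labels, centroids = kmeans(sub, k, random.Random(seed))
                cache[key] = davies_bouldin(sub, labels, centroids)
        return cache[key]
    return fitness

def ga_select(data, k, pop_size=12, generations=15, p_mut=0.1, seed=0):
    """Search feature bitmasks with a simple GA; returns (best_mask, fitness)."""
    rng = random.Random(seed)
    n = len(data[0])
    fitness = make_fitness(data, k, seed)
    # The full feature space is one individual, so the result is never worse than it.
    pop = [[1] * n] + [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=fitness)
        nxt = [pop[0][:], pop[1][:]]                   # elitism: keep the two best
        while len(nxt) < pop_size:
            a = min(rng.sample(pop, 3), key=fitness)   # tournament selection
            b = min(rng.sample(pop, 3), key=fitness)
            cut = rng.randrange(1, n)                  # one-point crossover
            child = a[:cut] + b[cut:]
            nxt.append([bit ^ (rng.random() < p_mut) for bit in child])  # bit-flip mutation
        pop = nxt
    return min(pop, key=fitness), fitness

# Synthetic data (assumption): 2 informative features, 4 uniform-noise features.
demo_rng = random.Random(42)
data = []
for i in range(40):
    centre = 0.0 if i < 20 else 5.0
    row = [demo_rng.gauss(centre, 0.5), demo_rng.gauss(centre, 0.5)]
    row += [demo_rng.uniform(0, 10) for _ in range(4)]
    data.append(row)

best_mask, fitness = ga_select(data, k=2)
full_score = fitness([1] * len(data[0]))   # baseline: K-means on all features
best_score = fitness(best_mask)            # K-means on the selected subspace
print("selected mask:", best_mask)
print("DBI full space: %.3f, DBI selected subspace: %.3f" % (full_score, best_score))
```

    A ratio-based index such as DBI is used here instead of SSE because raw SSE shrinks whenever features are dropped, which would bias the search toward tiny subsets. Seeding the population with the all-ones mask and keeping the best individuals each generation means the selected subspace scores at least as well as the full space under this fitness.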


  • Keywords

    feature selection; clustering; high-dimensional data; genetic algorithm.

Article ID: 11001
DOI: 10.14419/ijet.v7i2.11.11001

Copyright © 2012-2015 Science Publishing Corporation Inc. All rights reserved.