A Cluster Analysis for Binary Data Using Genetic Algorithms


  • Sabariah Saharan
  • Wong Yu Xian
  • Roberto Baragona






Binary Data, Clustering, Genetic Algorithms.


This research was motivated by the lack of clustering algorithms designed for binary data. Genetic Clustering for Unknown K (GCUK), a promising technique for analyzing this type of data, is the main subject of this study. GCUK was applied to cluster four binary data sets, one of which is imbalanced. The results show that GCUK is an efficient and effective clustering algorithm compared to K-means. A further contribution is the demonstrated capability of GCUK to cluster imbalanced data, to which standard clustering algorithms cannot simply be applied because they can produce misclassified results.



