AK-means: an automatic clustering algorithm based on K-means

  • Abstract
  • Keywords
  • References
  • PDF
  • Abstract

    In data mining, K-means is a simple and fast algorithm for solving clustering problems, but it requires that the user provides in advance the exact number of clusters (k), which is often not obvious. Thus, this paper intends to overcome this problem by proposing a parameter-free algorithm for automatic clustering. It is based on successive adequate restarting of K-means algorithm. Experiments conducted on several standard data sets demonstrate that the proposed approach is effective and outperforms the related well known algorithm G-means, in terms of clustering accuracy and estimation of the correct number of clusters.

  • Keywords

    Automatic Clustering; G-Means; K-Means; Parameter-Free Clustering.

  • References

      [1] Aloise, D.; Deshpande, A.; Hansen, P.; Popat, P. (2009). "NP-hardness of Euclidean sum-of-squares clustering". Machine Learning 75: 245–249. http://dx.doi.org/10.1007/s10994-009-5103-0.

      [2] Lloyd. S. P. (1982). "Least squares quantization in PCM". IEEE Transactions on Information Theory 28 (2): 129–137. http://dx.doi.org/10.1109/TIT.1982.1056489.

      [3] Greg Hamerly and Charles Elkan. Learning the k in k-means. In Proceedings of the seventeenth annual conference on neural information processing systems (NIPS), pages 281–288, 2003

      [4] Asuncion, A. and Newman, D.J. (2007). UCI Machine Learning Repository [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, School of Information and Computer Science.

      [5] H. Spath, Clustering Analysis Algorithms for Data Reduction and Classification of Objects, Ellis Horwood, Chichester, 1980.

      [6] Dan Pelleg and Andrew Moore. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conf. on Machine Learning, pages 727–734. Morgan Kaufmann, 2000.

      [7] Robert Tibshirani, Guenther Walther, and Trevor Hastie. Estimating the number of clusters in a dataset via the Gap statistic. Journal of the Royal Statistical Society B, 63:411–423, 2001. http://dx.doi.org/10.1111/1467-9868.00293.

      [8] Pal, N.R. and Bezdek, J.C. (1995) On Cluster Validity for the Fuzzy c-Means Model. IEEE Transactions on Fuzzy Systems, 3, 370-379. http://dx.doi.org/10.1109/91.413225.

      [9] Kettani, O.; Tadili, B. and Ramdani, F. - A deterministic k-means algorithm based on nearest neighbor search. International Journal of Computer Applications (0975 – 8887), Vol. 63, No.15, February 2013. http://dx.doi.org/10.5120/10544-5541.

      [10] T. Calinski and J. Harabasz. A dendrite method for cluster analysis. Communications in Statistics, 3:1–27, 1974.

      [11] G. W. Milligan and M. C. Cooper. An examination of procedures for determining the number of clusters in a data set. Psychometrica, 50:159–179, 1985. http://dx.doi.org/10.1007/BF02294245.

      [12] L. Kaufman and P. J. Rousseeuw. Finding groups in Data: "an Introduction to Cluster Analysis". Wiley, 1990. http://dx.doi.org/10.1002/9780470316801.

      [13] C. Elkan, "Using the triangle inequality to accelerate k-means", ICML 2003 Conference Proceedings, p. 147#153, 2003.




Article ID: 4749
DOI: 10.14419/jacst.v4i2.4749

Copyright © 2012-2015 Science Publishing Corporation Inc. All rights reserved.