MCDAStream: a real-time data stream clustering based on micro-cluster density and attraction

  • Abstract

    Real-time data stream clustering is widely used in many fields to extract useful information from massive data sets. Most existing density-based algorithms cluster a data stream using only the density within the micro-clusters. They ignore the data density in the area between micro-clusters and recluster the micro-clusters under erroneous assumptions about how the data are distributed within and between them, which leads to poor clustering results. This paper describes MCDAStream, a novel density-based clustering algorithm for evolving data streams that clusters the stream based on both micro-cluster density and the attraction between micro-clusters. The attraction of micro-clusters characterizes the positional information of the data points in each micro-cluster. By considering both micro-cluster density and attraction, we generate better clustering results. The quality of the proposed algorithm is evaluated on various synthetic and real-time datasets with distinct characteristics, using several quality metrics.
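
    The abstract does not state the update rules, so the following is only an illustrative sketch of the general idea of combining micro-cluster density with attraction; it is not the MCDAStream algorithm itself. It assumes fixed-radius micro-clusters, a hypothetical linear attraction formula, optional exponential fading, and a simple union-find reclustering step. The names `AttractionSketch`, `radius`, `decay`, `min_weight`, and `min_attraction` are assumptions introduced here for illustration.

```python
import math
from collections import defaultdict


class AttractionSketch:
    """Toy density-plus-attraction stream clusterer (illustrative assumption only)."""

    def __init__(self, radius, decay=1.0):
        self.radius = radius      # assumed fixed micro-cluster radius
        self.decay = decay        # assumed fading factor applied per arriving point
        self.centers = []         # micro-cluster centers (never moved in this sketch)
        self.weights = []         # faded point counts, used as a density proxy
        # attraction[(i, j)]: pull of points absorbed by micro-cluster i toward j
        self.attraction = defaultdict(float)

    def insert(self, point):
        # Fade all stored statistics so that older points count less.
        self.weights = [w * self.decay for w in self.weights]
        for key in self.attraction:
            self.attraction[key] *= self.decay

        dists = [math.dist(point, c) for c in self.centers]
        near = [i for i, d in enumerate(dists) if d <= self.radius]
        if near:
            home = min(near, key=lambda i: dists[i])
            self.weights[home] += 1.0
        else:
            # No micro-cluster covers the point: open a new one centered on it.
            self.centers.append(tuple(point))
            self.weights.append(1.0)
            dists.append(0.0)
            home = len(self.centers) - 1

        # Hypothetical attraction rule: a point pulls its home micro-cluster
        # toward every other micro-cluster closer than 2 * radius, and the
        # pull grows linearly as the point gets closer to that neighbor.
        reach = 2.0 * self.radius
        for j, d in enumerate(dists):
            if j != home and d < reach:
                self.attraction[(home, j)] += 1.0 - d / reach

    def clusters(self, min_weight, min_attraction):
        # Keep only dense micro-clusters, then merge pairs whose combined
        # attraction (both directions summed) reaches the threshold.
        dense = [i for i, w in enumerate(self.weights) if w >= min_weight]
        parent = {i: i for i in dense}

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        for a in dense:
            for b in dense:
                if a < b:
                    pull = (self.attraction.get((a, b), 0.0)
                            + self.attraction.get((b, a), 0.0))
                    if pull >= min_attraction:
                        parent[find(a)] = find(b)

        groups = defaultdict(list)
        for i in dense:
            groups[find(i)].append(i)
        return sorted(groups.values())


# Small made-up stream: two nearby groups with points between them, one far group.
model = AttractionSketch(radius=1.0)
for p in [(0, 0), (1.5, 0), (10, 0), (0.7, 0), (0.8, 0), (10.1, 0), (0.1, 0)]:
    model.insert(p)
print(model.clusters(min_weight=2, min_attraction=1.0))
```

    In this toy stream, the points lying between the first two micro-clusters accumulate attraction in both directions, so those two micro-clusters are merged while the distant one stays separate. A density-only rule would see just three equally plausible micro-clusters, which is the kind of gap the abstract attributes to existing approaches.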

  • Keywords

    Data Stream; Data Mining; Density-Based Clustering; Grid-Based Clustering; Micro-Clusters.

Article ID: 9051
DOI: 10.14419/ijet.v7i2.9051

Copyright © 2012-2015 Science Publishing Corporation Inc. All rights reserved.