Evaluating K-means multidimensional big data clusters through MapReduce paradigm


  • Agnivesh
  • Rajiv Pandey
  • Amarjeet Singh




Big Data, Cloud Computing, Clustering, Hadoop, K-Means.


In the era of big data, the growing use of large-scale data-driven applications has made clustering and extracting useful information from big datasets challenging. Prevailing clustering algorithms need globally optimized solutions for big datasets. The K-means clustering algorithm is of great interest because of its simplicity. However, K-means has limitations when analyzing big data, which leave scope for improvement. This research work presents a new K-means clustering algorithm that improves K-means in the MapReduce paradigm. The proposed method finds the initial seeds of the clusters instead of selecting them randomly; random selection is a major drawback of standard K-means when clustering big data. The method also reduces the dependence on MapReduce iterations. Moreover, the algorithm takes both between-cluster separation and within-cluster compactness into account to achieve high performance. For efficiency, the algorithm runs in the cloud on Amazon Elastic MapReduce 5.x, which distributes the clustering job across multiple nodes in parallel using low-cost machines. The proposed work is evaluated on real datasets from the UC Irvine Machine Learning Repository. The results confirm that the proposed algorithm is effective for clustering big data.
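As a rough illustration of the ideas named in the abstract, the following minimal sketch runs standard K-means as a map step (assign each point to its nearest centroid) and a reduce step (average each group), then reports one possible measure of within-cluster compactness and between-cluster separation. It is not the authors' algorithm: the abstract does not describe the seed-selection method or how the two measures are combined, so the seeds are simply passed in by the caller, the function names are hypothetical, and everything runs locally in plain Python rather than on Hadoop or Amazon Elastic MapReduce.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, point) for every point."""
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        yield nearest, p


def reduce_phase(pairs, centroids):
    """Reduce: average the points assigned to each centroid."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    updated = list(centroids)  # an empty cluster keeps its old centroid
    for idx, members in groups.items():
        updated[idx] = tuple(sum(c) / len(members) for c in zip(*members))
    return updated, groups


def kmeans(points, seeds, max_iter=20):
    """Repeat map and reduce until the centroids stop moving.

    `seeds` are supplied by the caller (assumption: the paper's own
    seeding method is not given in the abstract).
    """
    centroids = [tuple(s) for s in seeds]
    for _ in range(max_iter):
        updated, _ = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    # Final assignment against the final centroids.
    _, groups = reduce_phase(map_phase(points, centroids), centroids)
    return centroids, groups


def compactness(centroids, groups):
    """Mean distance from each point to its own centroid (lower is tighter)."""
    total = sum(dist(p, centroids[i]) for i, ms in groups.items() for p in ms)
    count = sum(len(ms) for ms in groups.values())
    return total / count


def separation(centroids):
    """Smallest distance between any two centroids (higher is better separated)."""
    return min(dist(a, b) for i, a in enumerate(centroids) for b in centroids[i + 1:])


if __name__ == "__main__":
    points = [(0, 0), (0, 2), (10, 0), (10, 2)]
    centroids, groups = kmeans(points, seeds=[(0, 0), (10, 0)])
    print(centroids, compactness(centroids, groups), separation(centroids))
```

In a real MapReduce job each iteration is a separate pass over the data, which is why the choice of initial seeds and the number of iterations matter for cost at scale.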





