Evaluating K-means multidimensional big data clusters through MapReduce paradigm

  • Authors

    • Agnivesh .
    • Rajiv Pandey
    • Amarjeet Singh
    https://doi.org/10.14419/ijet.v7i4.28766

    Received date: April 7, 2019

    Accepted date: April 7, 2019

    Published date: April 19, 2026

  • Keywords: Big Data, Cloud Computing, Clustering, Hadoop, K-Means.
  • Abstract

    In the era of big data, the growing use of large-scale, data-driven applications has made clustering and extracting useful information from big datasets challenging. Prevailing clustering algorithms need globally optimized solutions for big datasets. The K-means clustering algorithm is of great interest because of its simplicity, but it has limitations when analyzing big data that leave scope for improvement. This research presents a new K-means clustering algorithm that improves K-means within the MapReduce paradigm. It proposes a method for finding the initial cluster seeds instead of selecting them at random, which is a major drawback of standard K-means when clustering big data. It also reduces the dependence on MapReduce iterations. Moreover, the algorithm takes both between-cluster separation and within-cluster compactness into account to achieve high performance. For efficiency, the work uses cloud computing through Amazon Elastic MapReduce 5.x, which distributes the clustering job across multiple nodes in parallel on low-cost machines. The proposed work is run on real datasets from the UC Irvine Machine Learning Repository. The results confirm that the proposed algorithm is effective for clustering big data.
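
The ideas named in the abstract can be illustrated with a minimal, single-machine sketch. It is not the authors' algorithm, which this page does not specify. It assumes farthest-first seeding as a stand-in for the non-random seed selection, plain Python functions in place of Hadoop mappers and reducers, and mean point-to-centroid distance and minimum centroid-to-centroid distance as simple proxies for compactness and separation. All function names are hypothetical.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def farthest_first_seeds(points, k):
    """Deterministic seeding (an assumed stand-in, not the paper's method)."""
    seeds = [points[0]]
    while len(seeds) < k:
        # Pick the point whose nearest existing seed is farthest away.
        seeds.append(max(points, key=lambda p: min(dist(p, s) for s in seeds)))
    return seeds


def map_phase(points, centroids):
    """Mapper: emit (index of nearest centroid, point) for every point."""
    for p in points:
        yield min(range(len(centroids)), key=lambda i: dist(p, centroids[i])), p


def reduce_phase(pairs, centroids):
    """Reducer: average the points under each key to get the new centroids."""
    groups = defaultdict(list)
    for key, p in pairs:
        groups[key].append(p)
    new_centroids = list(centroids)  # a centroid with no points stays put
    for key, members in groups.items():
        new_centroids[key] = tuple(sum(c) / len(members) for c in zip(*members))
    return new_centroids


def kmeans(points, k, max_iter=20):
    """Repeat map and reduce until the centroids stop changing."""
    centroids = farthest_first_seeds(points, k)
    for _ in range(max_iter):
        new_centroids = reduce_phase(map_phase(points, centroids), centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


def compactness_and_separation(points, centroids):
    """Return (mean point-to-centroid distance, minimum centroid-to-centroid distance)."""
    within = sum(dist(p, centroids[i]) for i, p in map_phase(points, centroids)) / len(points)
    between = min(dist(a, b) for i, a in enumerate(centroids) for b in centroids[i + 1:])
    return within, between


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 0), (10, 2)]
    centers = kmeans(data, k=2)
    print(centers, compactness_and_separation(data, centers))
```

A lower first value (tighter clusters) together with a higher second value (centroids farther apart) indicates a better clustering under these proxies. On a real cluster, each pass through `map_phase` and `reduce_phase` would correspond to one MapReduce job, which is why reducing the number of iterations matters.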

  • How to Cite

    Agnivesh, Pandey, R., & Singh, A. (2026). Evaluating K-means multidimensional big data clusters through MapReduce paradigm. International Journal of Engineering and Technology, 7(4), 5601-5606. https://doi.org/10.14419/ijet.v7i4.28766