Evaluating K-means multidimensional big data clusters through MapReduce paradigm

Agnivesh .; Rajiv Pandey; Amarjeet Singh

doi:10.14419/ijet.v7i4.28766

Authors

Agnivesh .
Rajiv Pandey
Amarjeet Singh

Received date: April 7, 2019

Accepted date: April 7, 2019

DOI:

https://doi.org/10.14419/ijet.v7i4.28766

Keywords:

Big Data, Cloud Computing, Clustering, Hadoop, K-Means.

Abstract

In the era of big data, with the increasing use of large-scale data-driven applications, clustering big datasets and extracting useful information from them pose challenges. Prevailing clustering algorithms need globally optimized solutions for big datasets. The K-means clustering algorithm is of great interest because of its simplicity. However, K-means has certain limitations for analyzing big data, which leave scope for improvement. This research work presents a new K-means clustering algorithm that improves K-means in the MapReduce paradigm. The proposed work presents a method for finding the initial seeds of clusters instead of selecting them randomly, which is a major drawback of standard K-means for clustering big data. The research also minimizes dependence on MapReduce iterations. Moreover, the presented algorithm takes into consideration both between-cluster separation and within-cluster compactness to achieve high performance. For efficiency, cloud computing is applied using Amazon Elastic MapReduce 5.x, which distributes the clustering job among various nodes in parallel on low-cost machines. The proposed work is simulated on some real datasets from the UC Irvine Machine Learning Repository. The results confirm that the research work models an effective algorithm for clustering big data.
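The abstract does not spell out the proposed algorithm, so the following is only a minimal sketch of one iteration of standard K-means written as a map step and a reduce step, together with simple measures of within-cluster compactness and between-cluster separation. It assumes small in-memory lists rather than a Hadoop or Amazon Elastic MapReduce job, and every name in it (map_phase, reduce_phase, compactness, separation, the sample points and seeds) is hypothetical and not taken from the paper. The seeds are fixed by hand here; the paper's seed-selection method is not reproduced.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, available from Python 3.8


def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, point) for every point."""
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        yield nearest, p


def reduce_phase(pairs, centroids):
    """Reduce: average the points assigned to each centroid index."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centroids = list(centroids)  # an empty cluster keeps its old centroid
    for idx, members in groups.items():
        new_centroids[idx] = tuple(sum(c) / len(members) for c in zip(*members))
    return new_centroids, groups


def compactness(groups, centroids):
    """Within-cluster compactness: mean distance of points to their centroid."""
    total = sum(dist(p, centroids[i]) for i, ms in groups.items() for p in ms)
    count = sum(len(ms) for ms in groups.values())
    return total / count


def separation(centroids):
    """Between-cluster separation: smallest distance between two centroids."""
    return min(dist(a, b) for i, a in enumerate(centroids) for b in centroids[i + 1:])


# Hypothetical toy data and hand-picked seeds (assumptions for illustration).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
seeds = [(0.0, 0.0), (10.0, 0.0)]

# One map/reduce round; a full run would repeat until the centroids stop moving.
centroids, groups = reduce_phase(map_phase(points, seeds), seeds)
```

In a distributed setting the map step would run on each data split and the reduce step would combine the per-cluster sums, so each iteration costs one full MapReduce job. That cost is why good initial seeds, which reduce the number of iterations, matter for big data.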

