(N,α)-means algorithm for clustering big data

Md Tabrez Nafis; Ranjit Biswas

doi:10.14419/ijet.v7i2.27.12238

Authors

Md Tabrez Nafis
JAMIA HAMDARD UNIVERSITY, INDIA
Ranjit Biswas
JAMIA HAMDARD UNIVERSITY

Received date: April 27, 2018

Accepted date: June 1, 2018

Published date: June 12, 2018

DOI:

https://doi.org/10.14419/ijet.v7i2.27.12238

Keywords:

Big Data, (N,α)-Means, Multiset, Bag, Multiset Space, Leader-Set.

Abstract

The k-means algorithm is a popular algorithm for clustering data, but it is not appropriate for clustering big data. In this paper the authors modify the existing k-means algorithm to develop a new algorithm called the (N,α)-means algorithm. The proposed (N,α)-means algorithm is designed to cluster N big-data items into α clusters. In this approach the result is reached in n sequential steps, each of which executes the k-means algorithm twice. The method gives many data points the opportunity to stand as leaders and to justify their leadership over time. If incorporated into existing popular data mining tools (viz. RapidMiner, Orange, Weka, KNIME, Oracle Data Mining, etc.), the new algorithm is expected to play a better role in mining big data.
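The abstract gives only the outline of the procedure (n sequential steps, two k-means runs per step, data points competing as leaders), so the following Python sketch is one possible reading and not the authors' algorithm. It assumes, hypothetically, that each step takes one batch of the data, that the first k-means run picks α candidate leaders from that batch, and that the second run lets those candidates compete with the leaders kept from earlier steps. The names `kmeans`, `stepwise_kmeans`, `alpha`, and `n_steps` are illustrative and do not come from the paper.

```python
import random


def dist2(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            groups[nearest].append(p)
        # An emptied cluster keeps its previous centroid.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


def stepwise_kmeans(points, alpha, n_steps, seed=0):
    """ASSUMED reading of the abstract, not the published method:
    process the data in n_steps batches, running k-means twice per step."""
    size = -(-len(points) // n_steps)  # ceiling division
    leaders = []
    for step in range(n_steps):
        batch = points[step * size:(step + 1) * size]
        if not batch:
            continue
        # Run 1: the batch proposes up to alpha candidate leaders.
        if len(batch) >= alpha:
            candidates = kmeans(batch, alpha, seed=seed + step)
        else:
            candidates = list(batch)
        # Run 2: candidates compete with the leaders kept so far.
        pool = leaders + candidates
        if len(pool) >= alpha:
            leaders = kmeans(pool, alpha, seed=seed + step)
        else:
            leaders = pool
    return leaders
```

In this reading only one batch plus at most 2α leader points is held in the working set at any step, which is the usual motivation for batchwise variants of k-means on large inputs. Whether the published algorithm partitions the data this way is not stated in the abstract.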

