An Improved Initialization Method for k-Means Clustering of Noisy Datasets Based on Rough Set Neighbourhood Model
https://doi.org/10.14419/69dvcw11
Received date: January 19, 2026
Accepted date: March 20, 2026
Published date: March 29, 2026
Keywords
Initialization Methods; K-Means Clustering; Neighbourhood Model; Noisy Dataset; Rough Set Theory
Abstract
This study improves an initialization method for the k-means clustering algorithm, based on a rough set neighbourhood model, to enhance performance on noisy datasets. The method involves normalizing the data, obtaining a neighbourhood threshold as the 0.25 trimmed mean of the pairwise Minkowski distances, calculating the cohesion degree of each neighbourhood and the coupling degree between neighbourhoods, and taking as the initial cluster centres the k points with maximum cohesion degrees and minimum coupling degrees among themselves. The approach was evaluated on six datasets against an existing method using the Silhouette, Davies–Bouldin, Calinski–Harabasz, and Dunn–Hubert indices. Results showed that the improved method outperformed the existing one on noisy datasets, achieving higher Silhouette and Dunn–Hubert scores and lower Davies–Bouldin values, with a slight reduction in the Calinski–Harabasz index on one dataset. On the non-noisy datasets, the two methods were on par across all four indices. Since the improved method enhances the stability and robustness of k-means clustering in the presence of noise, it can be recommended for clustering noisy datasets such as gene expression, image, and signal data.
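The initialization steps described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the min-max normalization, the trimming convention (proportion 0.25 from each tail), the cohesion definition (neighbourhood size), the coupling definition (neighbourhood overlap), and the 50%-overlap selection cutoff are all assumptions, since the abstract does not specify them.

```python
import numpy as np

def initial_centres(X, k, p=2, trim=0.25):
    """Sketch of the neighbourhood-based k-means initialization.

    The cohesion/coupling definitions below are assumptions made for
    illustration; the abstract does not give their exact formulas.
    """
    # 1. Min-max normalize each feature to [0, 1].
    rng = X.max(axis=0) - X.min(axis=0)
    Xn = (X - X.min(axis=0)) / np.where(rng == 0, 1, rng)

    # 2. Pairwise Minkowski distances (p = 2 gives Euclidean).
    diff = np.abs(Xn[:, None, :] - Xn[None, :, :])
    D = (diff ** p).sum(axis=2) ** (1.0 / p)

    # 3. Neighbourhood threshold: trimmed mean of the off-diagonal
    #    distances, trimming proportion `trim` from each tail
    #    (an assumption about the paper's 0.25-trimming convention).
    d = np.sort(D[np.triu_indices(len(Xn), k=1)])
    cut = int(trim * len(d))
    eps = d[cut:len(d) - cut].mean() if len(d) > 2 * cut else d.mean()

    # 4. Neighbourhoods and their cohesion (here: neighbourhood size).
    N = D <= eps                      # boolean membership matrix
    cohesion = N.sum(axis=1)

    # 5. Greedily pick k points with maximal cohesion whose neighbourhoods
    #    overlap little with those already chosen (coupling = shared members);
    #    fewer than k centres may be returned if no candidate qualifies.
    order = np.argsort(-cohesion)
    centres = [order[0]]
    for i in order[1:]:
        if len(centres) == k:
            break
        if all(np.logical_and(N[i], N[c]).sum() < 0.5 * cohesion[i]
               for c in centres):
            centres.append(i)
    return Xn[centres]
```

On two well-separated blobs this picks one densely-neighboured point from each blob, which is the behaviour the cohesion/coupling criterion is meant to achieve; the returned centres are in the normalized space and would seed a standard k-means run.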
