An Improved Initialization Method for k-Means Clustering of Noisy Datasets Based on Rough Set Neighbourhood Model

  • Authors

    • Abeng J. Abeng Department of Statistics, University of Nigeria, Nsukka, Nigeria
    • Mbanefo S. Madukaife Department of Statistics, University of Nigeria, Nsukka, Nigeria
    DOI: https://doi.org/10.14419/69dvcw11

    Received date: January 19, 2026

    Accepted date: March 20, 2026

    Published date: March 29, 2026

  • Keywords: Initialization Methods; K-Means Clustering; Neighbourhood Model; Noisy Dataset; Rough Set Theory
  • Abstract

    This study improves an initialization method for the k-means clustering algorithm, based on a rough set neighbourhood model, to enhance performance on noisy datasets. The method normalizes the data, derives a neighbourhood threshold from the 0.25 (25%) trimmed mean of the pairwise Minkowski distances, computes the cohesion degree within each neighbourhood and the coupling degree between neighbourhoods, and selects as initial cluster centres the k points having maximum cohesion degrees with minimum coupling degrees among themselves. The approach was evaluated on six datasets against an existing method using the Silhouette, Davies–Bouldin, Calinski–Harabasz, and Dunn–Hubert indices. On the noisy datasets, the improved method outperformed the existing one, achieving higher Silhouette and Dunn–Hubert scores and lower Davies–Bouldin values, with a slight reduction in the Calinski–Harabasz index on one dataset. On the non-noisy datasets, the two methods were at par on all four indices. Since the improved method enhances the stability and robustness of k-means clustering in the presence of noise, it can be recommended for clustering noisy data such as gene expression, image, and signal datasets.
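    The pipeline summarized in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the exact cohesion and coupling formulas are not given in the abstract, so here cohesion is taken as neighbourhood size, coupling as neighbourhood (Jaccard) overlap, and the 0.5 coupling cut-off is an illustrative assumption.

    ```python
    def minkowski(a, b, p=2.0):
        """Minkowski distance of order p between two points."""
        return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

    def init_centres(data, k, p=2.0, trim=0.25):
        # Step 1: min-max normalize each feature to [0, 1]
        dims = len(data[0])
        lo = [min(row[j] for row in data) for j in range(dims)]
        hi = [max(row[j] for row in data) for j in range(dims)]
        X = [[(row[j] - lo[j]) / ((hi[j] - lo[j]) or 1.0) for j in range(dims)]
             for row in data]

        n = len(X)
        d = [[minkowski(X[i], X[j], p) for j in range(n)] for i in range(n)]

        # Step 2: neighbourhood threshold = trimmed mean of pairwise distances
        flat = sorted(d[i][j] for i in range(n) for j in range(i + 1, n))
        cut = int(len(flat) * trim)
        core = flat[cut:len(flat) - cut] or flat
        eps = sum(core) / len(core)

        # Step 3: neighbourhoods; cohesion taken here as neighbourhood size
        nbh = [{j for j in range(n) if d[i][j] <= eps} for i in range(n)]
        cohesion = [len(s) for s in nbh]

        # Step 4: greedily pick k high-cohesion points whose neighbourhoods
        # overlap little (low coupling) with already chosen centres
        order = sorted(range(n), key=lambda i: -cohesion[i])
        centres = []
        for i in order:
            if len(centres) == k:
                break
            if all(len(nbh[i] & nbh[c]) / len(nbh[i] | nbh[c]) < 0.5
                   for c in centres):
                centres.append(i)
        # fall back to the next highest-cohesion points if the filter ran dry
        for i in order:
            if len(centres) == k:
                break
            if i not in centres:
                centres.append(i)
        return [X[i] for i in centres]
    ```

    On two well-separated point clouds, the greedy coupling filter forces the k centres into distinct neighbourhoods, which is the stated goal of the selection step.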
