Data labeling method based on Cluster similarity using Rough Entropy for Categorical Data Clustering

  • Abstract

    Data mining is a growing research area. Clustering is recognized as an efficient methodology for grouping data, and many researchers have used data labeling to improve its efficiency. Data labeling assigns each unlabeled data point to the cluster it most resembles. In the categorical domain, data labeling is harder than in the numerical domain: the difference between two numerical data points is easy to compute, whereas for categorical values it is not. Data labeling on categorical data therefore remains a challenging problem that is complex to implement. The proposed methodology addresses this problem. A sample of the data is taken and divided into sliding windows; a conventional clustering algorithm is then applied to one sliding window to partition it into clusters. A rough membership entropy function is used to measure the similarity between unlabeled data points and labeled data points. The proposed methodology has two important features: 1) data points are moved into their proper clusters, yielding good-quality clusters, and 2) it executes with high efficiency. In this paper, the proposed methodology is applied to the KDD Cup99 data sets, and the results show it to be appreciably more efficient than earlier works.
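    The pipeline described above (sliding windows, then labeling unlabeled points against already clustered ones) can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the rough membership entropy formula, so the sketch substitutes the plain Pawlak rough membership value, averaged over attributes, as the similarity score. The function names, the `threshold` parameter, and the toy records are all hypothetical.

```python
def sliding_windows(records, size):
    """Split records into consecutive, non-overlapping windows of `size`."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def rough_membership(point, cluster_id, labeled, attr):
    """Pawlak rough membership of `point` in a cluster for one attribute.

    Among labeled records that share the point's value on `attr`,
    return the fraction that belong to `cluster_id`.
    """
    same = [c for rec, c in labeled if rec[attr] == point[attr]]
    if not same:
        return 0.0
    return same.count(cluster_id) / len(same)


def label_point(point, labeled, threshold=0.0):
    """Assign `point` to the cluster with the highest averaged membership.

    Assumption: a plain average over attributes stands in for the
    paper's rough membership entropy function. Returns (label, scores);
    label is None when no cluster scores above `threshold`, which this
    sketch treats as an outlier.
    """
    clusters = sorted({c for _, c in labeled})
    n_attrs = len(point)
    scores = {
        c: sum(rough_membership(point, c, labeled, a)
               for a in range(n_attrs)) / n_attrs
        for c in clusters
    }
    best = max(clusters, key=lambda c: scores[c])
    return (best if scores[best] > threshold else None), scores


# Toy categorical records (protocol, service) with cluster labels,
# standing in for the result of clustering the first sliding window.
labeled = [
    (("tcp", "http"), 0),
    (("tcp", "http"), 0),
    (("udp", "dns"), 1),
    (("udp", "dns"), 1),
    (("tcp", "dns"), 1),
]

label, scores = label_point(("tcp", "http"), labeled)
outlier_label, _ = label_point(("icmp", "echo"), labeled)
windows = sliding_windows(list(range(5)), 2)
```

    In this toy run the record `("tcp", "http")` is assigned to cluster 0, while `("icmp", "echo")` shares no attribute value with any labeled record and is left unlabeled. How the actual method turns rough membership into an entropy-based similarity, and how it handles outliers, is defined in the full paper rather than in this sketch.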


  • Keywords

    Categorical Data; Clustering; Data Labeling; Outlier; Entropy; Rough set.

Article ID: 20239
DOI: 10.14419/ijet.v7i4.6.20239

Copyright © 2012-2015 Science Publishing Corporation Inc. All rights reserved.