Critical evaluation of classifiers in data stream mining

 
 
 
  • Abstract


    Over the past decade, there has been a significant increase in the volume of online data. Extracting meaningful knowledge from this high-volume data is considered an important aspect of research. Because the data arrives perpetually, it is very difficult to store it in full. Therefore, analysis is needed while the “data is moving”. This moving data is known as a data stream, and analyzing it without storing it completely is termed data stream mining. In recent years, many new techniques have been proposed to overcome the challenges of data stream mining. In this paper, we review the operation of popular streaming algorithms, highlighting their strengths and weaknesses. We also evaluate the classifiers used in these algorithms against two popular benchmark datasets, namely (a) forest cover (forest) and (b) German credit, both available at the UCI repository. Finally, we present our critical observations and draw conclusions on the basis of our analysis.

     

     


  • Keywords


    Classification; Clustering; Data Stream; Random Forest; Stream Mining.



 


Article ID: 10819
 
DOI: 10.14419/ijet.v7i2.18.10819




Copyright © 2012-2015 Science Publishing Corporation Inc. All rights reserved.