Critical evaluation of classifiers in data stream mining

  • Authors

    • Lalit Agrawal, Shri Ramdeobaba College of Engineering and Management, Nagpur
    • Dattatraya Adane, Shri Ramdeobaba College of Engineering and Management, Nagpur
    2018-09-16
    https://doi.org/10.14419/ijet.v7i2.18.10819
  • Classification, Clustering, Data Stream, Random Forest, Stream Mining.
  • Over the past decade, the volume of online data has increased significantly, and extracting meaningful knowledge from this high-volume data is considered an important area of research. Because the data arrives perpetually, storing all of it is very difficult, so it must be analyzed while the “data is moving”. Such moving data is known as a data stream, and analyzing it without storing it completely is termed data stream mining. In recent years, many new techniques have been proposed to overcome the challenges of data stream mining. In this paper, we review the operation of popular streaming algorithms, highlighting their strengths and weaknesses. We also evaluate the classifiers used in these algorithms on two popular benchmark datasets available at the UCI repository: (a) forest cover (forest) and (b) German credit. Finally, we present our critical observations and draw conclusions on the basis of our analysis.

     

     

  • How to Cite

    Agrawal, L., & Adane, D. (2018). Critical evaluation of classifiers in data stream mining. International Journal of Engineering & Technology, 7(4), 2166-2171. https://doi.org/10.14419/ijet.v7i2.18.10819