A novel k-nearest neighbor distance based under sampling for improved opinion mining on skewed data using random forest

  • Abstract

    In recent years, consumers have been investigating products through online resources before making any purchase decision. Twitter is one of the most popular social blogging platforms, and the opinions collected from it at any point in time tend to be class-imbalanced in nature. Existing opinion mining algorithms perform better on class-balanced data, where positive and negative opinions occur in roughly equal numbers. In this paper, we propose a novel approach, Improved Opinion Mining using Under Sampling (IOMUS), to efficiently summarize the reviews of a class-imbalanced opinion mining corpus. The experimental setup uses a class-imbalanced opinion mining dataset consisting of 1155 instances. The experimental results show that the proposed IOMUS algorithm achieves improved performance over the traditional approach.
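    The abstract does not spell out the IOMUS procedure itself, but the title names its core ingredient: k-nearest-neighbor-distance-based under-sampling of the majority class before training a random forest. The sketch below illustrates that general idea only; the scoring rule (mean distance to the k nearest same-class neighbors, keeping the most isolated majority points) is an assumption for illustration, not the authors' exact algorithm.

    ```python
    import math

    def knn_distance_undersample(majority, minority, k=3):
        """Illustrative k-NN-distance-based under-sampling (a generic
        sketch, not the paper's exact IOMUS procedure).

        For each majority-class point, compute the mean Euclidean
        distance to its k nearest majority-class neighbors, then keep
        the points with the largest mean distance (dropping redundant
        points in dense regions) until the class sizes match.
        """
        keep = len(minority)  # target: balance the two classes

        def dist(a, b):
            return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

        scores = []
        for i, p in enumerate(majority):
            # Sorted distances from p to every other majority point.
            d = sorted(dist(p, q) for j, q in enumerate(majority) if j != i)
            scores.append((sum(d[:k]) / k, i))

        # Retain the most "isolated" majority points; dense clusters shrink.
        kept = [majority[i] for _, i in sorted(scores, reverse=True)[:keep]]
        return kept, minority
    ```

    The balanced sample returned here would then be handed to a standard random forest classifier; the under-sampling step is what compensates for the skew between positive and negative opinions.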

  • Keywords

    Classification; Opinion Mining; Imbalanced Data; Under Sampling; IOMUS.





Article ID: 9970
DOI: 10.14419/ijet.v7i1.8.9970

Copyright © 2012-2015 Science Publishing Corporation Inc. All rights reserved.