A Study on the Performance of Feature Extraction Methods According to the Size of N-Gram

  • Authors

    • Young Man Kwon
    • Min Gu Son
    • Dong Keun Chung
    • Myung Jae Lim
    2018-08-29
    https://doi.org/10.14419/ijet.v7i3.33.18516
  • malware detection, machine learning, classifier, N-gram, Opcode, API
  • In this paper, we study how the size of N affects the performance of N-gram feature extraction methods for malware detection. Features are extracted from PE files by three methods: Opcode Only, Opcode and API combined, and API Only. We measure their performance indirectly, through the AUC score and accuracy of the classifiers trained on them. We ran experiments with different values of N using four classifiers: DT, SVM, KNN, and BNB. Our conclusions are as follows. When the N-gram technique is used, we recommend the Opcode Only method. In addition, the instance-based classifier KNN and, among the model-based classifiers, DT perform better than SVM and BNB.
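
As an illustration of the N-gram idea in the abstract, the following is a minimal sketch of turning an opcode sequence into N-gram count features for several values of N. It is not the authors' implementation: the function names `opcode_ngrams` and `ngram_features` and the sample opcode list are hypothetical, and the sketch assumes the opcodes have already been disassembled from a PE file.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def opcode_ngrams(opcodes: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous run of n opcodes, in order of appearance."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence shorter than n yields no N-grams (empty range).
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def ngram_features(opcodes: Sequence[str], n: int) -> Dict[Tuple[str, ...], int]:
    """Count how often each N-gram occurs; the counts act as a feature vector."""
    return Counter(opcode_ngrams(opcodes, n))


# Hypothetical opcode sequence, assumed to come from a disassembler.
sample = ["push", "mov", "call", "push", "mov", "call", "ret"]

for n in (1, 2, 3):
    features = ngram_features(sample, n)
    print(f"N={n}: {len(features)} distinct N-grams")
```

Count dictionaries like these, built per file and aligned to a shared vocabulary, could then be passed to classifiers such as those named in the abstract; how the paper itself builds and selects its features is not specified here.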


  • How to Cite

    Kwon, Y. M., Son, M. G., Chung, D. K., & Lim, M. J. (2018). A Study on the Performance of Feature Extraction Methods According to the Size of N-Gram. International Journal of Engineering & Technology, 7(3.33), 23-27. https://doi.org/10.14419/ijet.v7i3.33.18516