A Study on the Performance of Feature Extraction Methods According to the Size of N-Gram

Young Man Kwon; Min Gu Son; Dong Keun Chung; Myung jae Lim

doi:10.14419/ijet.v7i3.33.18516


Received date: August 28, 2018

Accepted date: August 28, 2018

Published date: August 29, 2018

DOI:

https://doi.org/10.14419/ijet.v7i3.33.18516

Keywords:

malware detection, machine learning, classifier, N-gram, Opcode, API

Abstract

In this paper, we study how the size of the N-gram affects the performance of feature extraction methods for malware detection. Features are extracted from PE files by three methods: Opcode Only, Opcode and API combined, and API Only. We measure the performance of each method indirectly, through the AUC score and accuracy of the classifiers trained on its features. We ran experiments with different values of N using several classifiers: DT, SVM, KNN and BNB. From these experiments we draw the following conclusions. When the N-gram technique is used, we recommend the Opcode Only method. Also, the instance-based classifier KNN and, among the model-based classifiers, DT perform better than SVM and BNB.
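As an illustration of what changing the N-gram size means for opcode features, the following is a minimal sketch, not the authors' implementation. The opcode sequence and the helper names `ngrams` and `ngram_counts` are assumptions made up for this example. The sketch slides a window of width N over the sequence and counts each distinct N-gram; counts of this kind could then serve as a feature vector for a classifier.

```python
from collections import Counter
from typing import Iterable, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Iterable[tuple]:
    """Yield every run of n consecutive tokens (a sliding window of width n)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count how often each N-gram occurs in the token sequence."""
    return Counter(ngrams(tokens, n))


# Hypothetical opcode sequence, invented for illustration only.
opcodes = ["push", "mov", "call", "push", "mov", "call", "ret"]

for n in (1, 2, 3):
    counts = ngram_counts(opcodes, n)
    # Print N, the number of distinct N-grams, and the most frequent one.
    print(n, len(counts), counts.most_common(1))
```

A larger N captures longer instruction patterns but yields fewer windows per file and, on real data, a much larger and sparser feature space, which is one reason the choice of N can affect classifier performance.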

