The Performance Comparison of the Classifiers According to Binary BOW, Count BOW and TF-IDF Feature Vectors for Malware Detection


  • Young Man Kwon
  • So Hee Jun
  • Won Mo Gal
  • Myung Jae Lim





Malware Detection, Feature Selection, Machine Learning, BOW (Bag of words), TF-IDF


In this paper, we compare the performance of classifiers trained on Binary BOW, Count BOW and TF-IDF feature vectors for malware detection. The features are opcodes extracted from PE files. For the comparison, we measured the AUC score of five classifiers: DT, KNN, MLP, MNB and SVM. Based on the results, we recommend the neural network (MLP) and the instance-based model (KNN), because they achieve high AUC scores and accuracy regardless of dataset imbalance and of the feature vector used. Among classical classifiers, we recommend DT, because it also maintains high AUC and accuracy under the same conditions. SVM requires robust scaling to handle outliers and the unbalanced dataset. MNB requires the N-gram technique to improve its AUC score.
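The three feature vectors compared above can be illustrated with a minimal sketch. It is not the authors' pipeline: the opcode sequences are made-up toy data, the helper names (`ngrams`, `build_vocab`, `count_bow`, `binary_bow`, `tfidf`) are hypothetical, and the TF-IDF weighting assumes the smoothed form idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which may differ from the settings used in the paper.

```python
import math
from collections import Counter

# Hypothetical toy opcode sequences; real ones would be disassembled from PE files.
samples = [
    ["push", "mov", "call", "mov", "ret"],
    ["mov", "xor", "xor", "jmp"],
    ["push", "call", "ret"],
]

def ngrams(seq, n=1):
    """Join every run of n consecutive opcodes into one token."""
    return [" ".join(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def build_vocab(docs):
    """Sorted list of all distinct tokens across the samples."""
    return sorted({tok for doc in docs for tok in doc})

def count_bow(docs, vocab):
    """Count BOW: how many times each vocabulary token occurs in a sample."""
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[t] for t in vocab])
    return rows

def binary_bow(docs, vocab):
    """Binary BOW: 1 if the token occurs in the sample at all, else 0."""
    return [[1 if c > 0 else 0 for c in row] for row in count_bow(docs, vocab)]

def tfidf(docs, vocab):
    """TF-IDF with smoothed idf and L2-normalized rows (assumed variant)."""
    counts = count_bow(docs, vocab)
    n_docs = len(docs)
    # Document frequency: number of samples containing each token.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    rows = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        rows.append([v / norm for v in weighted])
    return rows

vocab = build_vocab(samples)
X_count = count_bow(samples, vocab)
X_binary = binary_bow(samples, vocab)
X_tfidf = tfidf(samples, vocab)
```

Passing `[ngrams(s, 2) for s in samples]` instead of `samples` yields opcode bigram features, which is one way to apply the N-gram technique mentioned for MNB.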




