Designing A Multi-Agent System for Malicious PDF Detection Using Machine Learning and Generative Artificial Intelligence
https://doi.org/10.14419/s6dere11
Received date: September 26, 2025
Accepted date: October 11, 2025
Published date: November 10, 2025
Keywords: Malicious PDF Detection; Artificial Intelligence; CTGAN; Machine Learning; Multi-Agent System; Cybersecurity; Explainable AI.
Abstract
The growing sophistication of cyber threats that exploit PDF files poses a significant challenge for modern cybersecurity systems. To address this issue, this paper introduces a modular Multi-Agent System (MAS) for detecting malicious PDF files. Within this architecture, one agent employs a Conditional Tabular GAN (CTGAN) to expand the training dataset and reduce class imbalance, while another agent integrates supervised machine learning models for classification. Six supervised learning models (Decision Tree, Random Forest, XGBoost, Support Vector Machine, Naïve Bayes, and Neural Network) are evaluated on the enriched dataset. Among them, XGBoost achieves the best performance.
The MAS coordinates autonomous agents dedicated to dataset management, data augmentation, learning, decision making, and user interaction, which keeps the system flexible and scalable under both standard and adversarial conditions. To support interpretability, SHAP (SHapley Additive exPlanations) analysis is applied during the evaluation phase to identify the features that most strongly influence model decisions. Taken together, the proposed system provides an explainable and adaptable framework that contributes to strengthening PDF malware detection in sensitive digital infrastructures.
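The division of labour among the agents can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the function names, the two-feature toy data, and the class sizes are hypothetical. Jittered oversampling stands in for CTGAN sampling, and a nearest-centroid rule stands in for the supervised classifiers such as XGBoost. The sketch only shows how dataset, augmentation, learning, decision, and user-interaction roles might be chained.

```python
import random
from typing import Callable, Dict, List, Tuple

Sample = Tuple[List[float], int]  # (feature vector, label); label 1 = malicious


def dataset_agent() -> List[Sample]:
    """Return a toy, imbalanced dataset (hypothetical two-feature PDF records)."""
    rng = random.Random(0)
    benign = [([rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)], 0) for _ in range(40)]
    malicious = [([rng.gauss(3.0, 0.5), rng.gauss(3.0, 0.5)], 1) for _ in range(8)]
    return benign + malicious


def augmentation_agent(data: List[Sample]) -> List[Sample]:
    """Balance the classes by jittered oversampling.

    Assumption: this is only a placeholder for sampling from a trained CTGAN.
    """
    rng = random.Random(1)
    by_label: Dict[int, List[List[float]]] = {0: [], 1: []}
    for x, y in data:
        by_label[y].append(x)
    target = max(len(rows) for rows in by_label.values())
    balanced = list(data)
    for y, rows in by_label.items():
        for _ in range(target - len(rows)):
            base = rng.choice(rows)
            balanced.append(([v + rng.gauss(0.0, 0.1) for v in base], y))
    return balanced


def learning_agent(data: List[Sample]) -> Callable[[List[float]], int]:
    """Fit a nearest-centroid classifier (placeholder for XGBoost and the others)."""
    centroids: Dict[int, List[float]] = {}
    for label in (0, 1):
        rows = [x for x, y in data if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(x: List[float]) -> int:
        # Squared Euclidean distance to each class centroid.
        dist = {
            label: sum((a - b) ** 2 for a, b in zip(x, c))
            for label, c in centroids.items()
        }
        return min(dist, key=dist.get)

    return predict


def decision_agent(predict: Callable[[List[float]], int], x: List[float]) -> str:
    """Turn the model output into a verdict."""
    return "malicious" if predict(x) == 1 else "benign"


def user_interaction_agent(verdict: str) -> str:
    """Format the verdict for the end user."""
    return f"Verdict: {verdict}"


def run_pipeline(x: List[float]) -> str:
    """Chain the agents: dataset -> augmentation -> learning -> decision -> user."""
    balanced = augmentation_agent(dataset_agent())
    model = learning_agent(balanced)
    return user_interaction_agent(decision_agent(model, x))
```

Calling `run_pipeline([3.0, 3.0])` sends one hypothetical feature vector through all five roles. In a fuller system each function would presumably become an autonomous agent exchanging messages, and the two placeholders would be replaced by an actual CTGAN synthesizer and the evaluated classifiers.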
How to Cite
Diabagate, D. A., Yazid, D. H. Y., Azmani, P. A., & Coulibaly, P. A. (2025). Designing A Multi-Agent System for Malicious PDF Detection Using Machine Learning and Generative Artificial Intelligence. International Journal of Basic and Applied Sciences, 14(7), 282-295. https://doi.org/10.14419/s6dere11
