Designing A Multi-Agent System for Malicious PDF Detection Using Machine Learning and Generative Artificial Intelligence

  • Authors

    • Dr. Amadou Diabagate Faculty of Mathematics and Computer Science, University Felix Houphouët-Boigny, Abidjan, Côte d’Ivoire
    • Dr. Hambally Yacouba Yazid University of Bondoukou, Bondoukou, Côte d’Ivoire
    • Prof. Abdellah Azmani Faculty of Sciences and Technologies, University of Abdelmalek Essaadi, Tangier, Morocco
    • Prof. Adama Coulibaly Faculty of Mathematics and Computer Science, University Felix Houphouët-Boigny, Abidjan, Côte d’Ivoire
    https://doi.org/10.14419/s6dere11

    Received date: September 26, 2025

    Accepted date: October 11, 2025

    Published date: November 10, 2025

  • Keywords

    Malicious PDF Detection; Artificial Intelligence; CTGAN; Machine Learning; Multi-Agent System; Cybersecurity; Explainable AI.

  • Abstract

    The growing sophistication of cyber threats exploiting PDF files represents a significant challenge for modern cybersecurity systems. To address this issue, this paper introduces a modular Multi-Agent System (MAS) designed for the detection of malicious PDF files. Within this architecture, one agent employs a Conditional Tabular GAN (CTGAN) to expand the training dataset and reduce class imbalance, while another agent integrates supervised machine learning models for classification. Six supervised learning models, namely Decision Tree, Random Forest, XGBoost, Support Vector Machine, Naïve Bayes, and Neural Network, are evaluated on the enriched dataset. Among them, XGBoost achieves the best performance.

    The MAS coordinates autonomous agents dedicated to dataset management, data augmentation, learning, decision making, and user interaction, ensuring flexibility and scalability under both standard and adversarial conditions. To support interpretability, SHAP analysis is applied during the evaluation phase to identify the features that most strongly influence model decisions. Taken together, the proposed system demonstrates a comprehensive, explainable, and adaptable framework that contributes to strengthening PDF malware detection in sensitive digital infrastructures.
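
The division of labour among the agents can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: every class name, the two numeric features, and the toy samples are hypothetical. The augmentation step uses jittered copies as a stand-in for CTGAN, the learner is a nearest-centroid model standing in for the six evaluated classifiers, and the user-interaction agent and the SHAP analysis are omitted.

```python
import random
from collections import Counter

# A sample is (features, label); label 1 = malicious, 0 = benign.
# The two features are hypothetical (e.g. JavaScript object count, page count).
Sample = tuple[list[float], int]


class DatasetAgent:
    """Dataset management: supplies the labelled training samples."""

    def __init__(self, samples: list[Sample]):
        self.samples = samples

    def run(self, state: dict) -> dict:
        state["train"] = list(self.samples)
        return state


class AugmentationAgent:
    """Data augmentation: tops up smaller classes with jittered copies.

    Stand-in only: the paper uses CTGAN for this step, not random jitter.
    """

    def __init__(self, seed: int = 0, noise: float = 0.1):
        self.rng = random.Random(seed)
        self.noise = noise

    def run(self, state: dict) -> dict:
        train = state["train"]
        counts = Counter(label for _, label in train)
        target = max(counts.values())  # size of the largest class
        for label, count in counts.items():
            pool = [x for x, y in train if y == label]
            for _ in range(target - count):
                base = self.rng.choice(pool)
                synthetic = [v + self.rng.gauss(0, self.noise) for v in base]
                train.append((synthetic, label))
        return state


class LearningAgent:
    """Learning: fits one centroid per class (stand-in for XGBoost etc.)."""

    def run(self, state: dict) -> dict:
        sums: dict[int, list[float]] = {}
        counts: Counter = Counter()
        for x, y in state["train"]:
            acc = sums.setdefault(y, [0.0] * len(x))
            for i, v in enumerate(x):
                acc[i] += v
            counts[y] += 1
        state["centroids"] = {
            y: [s / counts[y] for s in acc] for y, acc in sums.items()
        }
        return state


class DecisionAgent:
    """Decision making: assigns the query to the nearest class centroid."""

    def run(self, state: dict) -> dict:
        query = state["query"]
        centroids = state["centroids"]

        def sq_dist(center: list[float]) -> float:
            return sum((a - b) ** 2 for a, b in zip(query, center))

        label = min(centroids, key=lambda y: sq_dist(centroids[y]))
        state["verdict"] = "malicious" if label == 1 else "benign"
        return state


def coordinate(agents: list, state: dict) -> dict:
    """Coordinator: passes a shared state through the agents in order."""
    for agent in agents:
        state = agent.run(state)
    return state


# Toy, imbalanced data: four benign samples and one malicious sample.
samples = [([0.0, 5.0], 0), ([1.0, 8.0], 0), ([0.0, 3.0], 0),
           ([0.0, 12.0], 0), ([9.0, 1.0], 1)]
agents = [DatasetAgent(samples), AugmentationAgent(seed=0),
          LearningAgent(), DecisionAgent()]
result = coordinate(agents, {"query": [8.0, 1.0]})
print(result["verdict"])
```

Because each agent only reads and writes a shared state dictionary, replacing the jitter step with a real CTGAN sampler or the centroid model with a trained classifier would not require changes to the other agents. This is the kind of modularity the abstract attributes to the MAS.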

  • How to Cite

    Diabagate, A., Yazid, H. Y., Azmani, A., & Coulibaly, A. (2025). Designing A Multi-Agent System for Malicious PDF Detection Using Machine Learning and Generative Artificial Intelligence. International Journal of Basic and Applied Sciences, 14(7), 282-295. https://doi.org/10.14419/s6dere11