Taxonomic Classification of Bacteria Using Machine Learning Models on DNA Sequences

  • Authors

    • Sudhir Anakal Department of Master of Computer Applications, Sharnbasva University, Kalaburagi, Karnataka, India
    • Sarange Shreepad Marotrao Department of Mechanical Engineering, Ajeenkya D Y Patil School of Engineering, Lohgaon, Pune, Maharashtra, India
    • Vedavathi G R Department of Computer Science & Engineering (AI & ML), East Point College of Engineering and Technology, Bidarahalli, Bengaluru, Karnataka, India
    • Gargi Mishra Department of Computer Science and Engineering, Bharati Vidyapeeth's College of Engineering, Paschim Vihar, New Delhi, Delhi, India
    • Anurag Vijay Agrawal Department of Electronics and Communication Engineering, Bhagwant Institute of Technology, Uttar Pradesh, India
    • Archana Bhaskar School of Computer Science and Applications, Reva University, Bengaluru, Karnataka, India
    • Surya D Department of Zoology, Madras Christian College, Tambaram, Chennai, Tamil Nadu, India
    • Anu Swedha Ananthan Department of Microbiology, Justice Basheer Ahmed Sayeed College for Women, Chennai, Tamil Nadu, India
    https://doi.org/10.14419/cse17v46

    Received date: June 10, 2025

    Accepted date: July 18, 2025

    Published date: July 25, 2025

  • Keywords: Machine Learning; DNA Sequences; K-Mer Vector; Deep Learning
  • Abstract

    This study investigates deep learning architectures, in particular 1D and 2D convolutional neural networks (CNNs), for DNA sequence classification using k-mer vector representations. The results show that k-mer vectors, especially those with k = 5, effectively capture relevant features in DNA sequences and achieve high accuracy, precision, and recall across all taxonomic levels. Among the tested models, the 1D CNN outperformed the 2D CNN in both accuracy and training efficiency. However, the 2D CNN achieved slightly higher accuracy without the nested layers, suggesting that critical information may be discarded due to limitations of the input representation. As expected, model performance declined at lower taxonomic levels due to sequence feature limitations and class imbalance, but still reached 88% accuracy at the genus level. Notably, simple multi-layer neural networks outperformed the CNNs, indicating the potential of low-complexity models for genomic data classification. These findings suggest that while CNNs are efficient, simpler architectures can provide competitive performance in terms of information representation and task complexity.
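
    The k-mer vector representation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `kmer_vector`, the lexicographic ordering of k-mers, the normalization to relative frequencies, and the choice to skip windows containing ambiguous bases are all assumptions made for illustration. With k = 5 over the alphabet A, C, G, T, the vector has 4^5 = 1024 entries.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: only the four unambiguous bases are counted


def kmer_vector(sequence, k=5):
    """Return relative k-mer frequencies in lexicographic k-mer order.

    Hypothetical helper for illustration; the paper's exact encoding
    (ordering, normalization, handling of 'N') is not stated in the abstract.
    """
    # Map every possible k-mer to a fixed position in the output vector.
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    counts = [0] * len(index)
    seq = sequence.upper()
    # Slide a window of width k over the sequence, one base at a time.
    for start in range(len(seq) - k + 1):
        pos = index.get(seq[start:start + k])
        if pos is not None:  # skip windows with ambiguous bases such as N
            counts[pos] += 1
    total = sum(counts)
    # Normalize so that sequences of different lengths are comparable.
    return [c / total for c in counts] if total else [0.0] * len(counts)


vec = kmer_vector("ACGTACGTAC", k=5)
print(len(vec))  # 4**5 entries
```

    A fixed-length vector of this kind can be fed to a 1D CNN or a plain multi-layer network regardless of the original sequence length, which is one reason such encodings are convenient for the comparison the abstract describes.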

  • How to Cite

    Anakal, S., Marotrao, S. S., G R, V., Mishra, G., Agrawal, A. V., Bhaskar, A., D, S., & Ananthan, A. S. (2025). Taxonomic Classification of Bacteria Using Machine Learning Models on DNA Sequences. International Journal of Basic and Applied Sciences, 14(3), 298-310. https://doi.org/10.14419/cse17v46