Taxonomic Classification of Bacteria Using Machine Learning Models on DNA Sequences
https://doi.org/10.14419/cse17v46
Received date: June 10, 2025
Accepted date: July 18, 2025
Published date: July 25, 2025
Keywords: Machine Learning; DNA Sequences; K-Mer Vector; Deep Learning
Abstract
This study investigates different deep learning architectures, especially 1D and 2D convolutional neural networks (CNNs), for DNA sequence classification using k-mer vector representations. The results show that k-mer vectors, especially those with k = 5, effectively capture relevant features in DNA sequences and yield high accuracy, precision, and recall across all taxonomic levels. Among the tested models, the 1D CNN outperformed the 2D CNN in both accuracy and training efficiency. However, the 2D CNN achieved slightly higher accuracy without the nested layers, suggesting that these layers may discard critical information given the limited input representation. As expected, model performance declined at lower taxonomic levels because of sequence feature limitations and class imbalance, but the models still achieved 88% accuracy at the genus level. Notably, simple multi-layer neural networks outperformed the CNNs, indicating the potential of low-complexity models for genomic data classification. These findings suggest that while CNNs are efficient, simpler architectures can provide competitive performance, depending on how the information is represented and on the complexity of the task.
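To illustrate the k-mer vector representation mentioned in the abstract, the following is a minimal sketch of how a DNA sequence could be turned into a fixed-length frequency vector for k = 5. It is not the authors' pipeline: the function name `kmer_vector`, the normalization to relative frequencies, and the choice to skip k-mers containing ambiguous bases (such as N) are all assumptions made for illustration.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed alphabet; ambiguous bases are not indexed


def kmer_vector(sequence, k=5):
    """Return a relative-frequency k-mer vector of length 4**k (hypothetical helper)."""
    # Map every possible k-mer to a fixed position, in lexicographic order.
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    counts = [0] * len(index)
    seq = sequence.upper()
    # Slide a window of width k over the sequence, one base at a time.
    for i in range(len(seq) - k + 1):
        pos = index.get(seq[i:i + k])
        if pos is not None:  # assumption: skip k-mers with non-ACGT characters
            counts[pos] += 1
    total = sum(counts)
    # Normalize so sequences of different lengths are comparable.
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


vec = kmer_vector("ACGTACGTAC", k=5)
print(len(vec))  # 4**5 = 1024 features per sequence
```

With k = 5 every sequence maps to 1,024 features regardless of its length, which is what allows a fixed-input model such as a 1D CNN or a multi-layer network to consume it.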
How to Cite
Anakal, S., Marotrao, S. S., G R, V., Mishra, G., Agrawal, A. V., Bhaskar, A., D, S., & Ananthan, A. S. (2025). Taxonomic Classification of Bacteria Using Machine Learning Models on DNA Sequences. International Journal of Basic and Applied Sciences, 14(3), 298-310. https://doi.org/10.14419/cse17v46
