Evaluating Machine Learning and Deep Learning Techniques for Part-Of-Speech Tagging in Tamil

  • Authors

    • Dr. Kannaiya Raja, Post Doctoral Researcher, Lincoln University College, Malaysia
    • Dr. Pawan Kumar Chaurasia, Babasaheb Bhimrao Ambedkar Central University, Lucknow, Uttar Pradesh, India
    • Prof Dr. Midhunchakkaravarthy, Lincoln University College, Malaysia
    https://doi.org/10.14419/rs62f777

    Received date: July 25, 2025

    Accepted date: August 30, 2025

    Published date: September 6, 2025

  • Keywords: Tamil; Part-of-Speech Tagging; POS; CRF; LSTM-RNN; Machine Learning; Deep Learning; NLP; Agglutinative Language; Accuracy
  • Abstract

    Part-of-speech (POS) tagging for Tamil is an important task because of the language's highly inflectional and agglutinative morphology. This study systematically evaluates machine learning and deep learning models for Tamil POS tagging, namely Conditional Random Fields (CRF), Support Vector Machine (SVM), Hidden Markov Model (HMM), Long Short-Term Memory Recurrent Neural Network (LSTM-RNN), and LSTM-RNN with a CRF output layer (LSTM-RNN-CRF), using a well-annotated CLE-style benchmark dataset. We employed a comprehensive, language-independent feature set and performed 10-fold cross-validation to ensure robust results. Experimental findings show that, on the CLE dataset, the CRF model achieves the highest average accuracy at 86.32%, outperforming SVM (81.13%), LSTM-RNN (78.64%), LSTM-RNN-CRF (78.03%), and HMM (78.03%). In contrast, on the more challenging BJ dataset, the LSTM-RNN deep learning model attains the highest accuracy of 92.70%, followed closely by CRF (91.2%), LSTM-RNN-CRF (91.02%), HMM (90.11%), and SVM (86.25%). These results highlight the importance of model selection in morphologically rich languages: while CRF is optimal for structured and moderately sized datasets, LSTM-RNN deep learning approaches excel on larger datasets. This work establishes new empirical benchmarks for Tamil POS tagging and demonstrates that advanced neural models provide a clear advantage in handling Tamil's linguistic complexity.
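
    The 10-fold cross-validation protocol mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the fold assignment rule, the most-frequent-tag baseline (a stand-in for CRF, SVM, HMM, or LSTM-RNN), the function names, and the placeholder corpus `toy` are all hypothetical, and token-level accuracy is assumed as the metric.

```python
from collections import Counter, defaultdict


def k_fold_indices(n_items, k=10):
    """Yield (train_idx, test_idx); fold f tests every item i with i % k == f."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test


def train_baseline(sentences):
    """Most-frequent-tag baseline: a hypothetical stand-in, not a model from the paper."""
    per_word = defaultdict(Counter)
    all_tags = Counter()
    for sentence in sentences:
        for word, tag in sentence:
            per_word[word][tag] += 1
            all_tags[tag] += 1
    default_tag = all_tags.most_common(1)[0][0]  # fallback for unseen words
    lexicon = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    return lexicon, default_tag


def token_accuracy(model, sentences):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    lexicon, default_tag = model
    correct = total = 0
    for sentence in sentences:
        for word, gold in sentence:
            correct += lexicon.get(word, default_tag) == gold
            total += 1
    return correct / total


def cross_validate(sentences, k=10):
    """Train on k-1 folds, test on the held-out fold, and average the k scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(sentences), k):
        model = train_baseline([sentences[i] for i in train_idx])
        scores.append(token_accuracy(model, [sentences[i] for i in test_idx]))
    return sum(scores) / len(scores), scores


# Placeholder corpus (made-up tokens and tags, not Tamil data from the study).
toy = [
    [("w%d" % (i % 4), "NOUN" if i % 4 < 2 else "VERB"), ("x", "PUNCT")]
    for i in range(20)
]

mean_acc, per_fold = cross_validate(toy, k=10)
print(f"mean accuracy over {len(per_fold)} folds: {mean_acc:.4f}")
```

    Averaging over folds, as assumed here, is one common way to arrive at a single "average accuracy" figure per model; swapping `train_baseline` for a real tagger leaves the evaluation loop unchanged.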

  • How to Cite

    Raja, D. Kannaiya, Chaurasia, D. P. K., & Midhunchakkaravarthy, P. D. (2025). Evaluating Machine Learning and Deep Learning Techniques for Part-Of-Speech Tagging in Tamil. International Journal of Basic and Applied Sciences, 14(5), 187-200. https://doi.org/10.14419/rs62f777