A comparison of features for POS tagging in Kannada

  • Abstract

    This paper proposes a part-of-speech (POS) tagging system for the South Indian language Kannada using supervised machine learning. POS tagging is an important step in Natural Language Processing, with applications such as word sense disambiguation and natural language understanding. Based on an extensive review of POS tagging methods, Conditional Random Fields (CRFs) were chosen as the algorithm. CRFs are used for sequence modeling tasks such as POS tagging and named entity recognition, and serve as an alternative to Hidden Markov Models. Three very large corpora are used and their results are compared, with the feature set varied for each corpus. The best method for the task is determined from these results.
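
    To make the idea of a "feature set" for CRF-based tagging concrete, the following is a minimal sketch of per-token feature extraction in Python. It is not the feature set used in the paper: the feature names, the affix lengths, and the transliterated example tokens are all assumptions chosen for illustration. The output format (one dictionary per token) is the kind of input that common CRF toolkits accept.

```python
def token_features(sentence, i):
    """Return a feature dict for the token at position i.

    All features here are hypothetical examples, not the paper's features.
    Affixes are cut by Python characters (code points); for text in Kannada
    script a real system would likely cut by grapheme cluster instead.
    """
    word = sentence[i]
    feats = {
        "bias": 1.0,
        "word": word,
        "prefix2": word[:2],       # first two characters
        "suffix2": word[-2:],      # last two characters
        "suffix3": word[-3:],      # last three characters
        "is_digit": word.isdigit(),
        "length": len(word),
    }
    # Context features: neighbouring words, or sentence-boundary markers.
    if i > 0:
        feats["prev_word"] = sentence[i - 1]
    else:
        feats["BOS"] = True
    if i < len(sentence) - 1:
        feats["next_word"] = sentence[i + 1]
    else:
        feats["EOS"] = True
    return feats


def sentence_features(sentence):
    """Return one feature dict per token in the sentence."""
    return [token_features(sentence, i) for i in range(len(sentence))]


# Illustrative, transliterated tokens (an assumption, not corpus data).
example = ["naanu", "shaalege", "hoguttene"]
features = sentence_features(example)
```

    Varying the feature set, as the abstract describes, would then amount to adding or removing keys in this dictionary (for example, dropping the affix features or widening the context window) and retraining the CRF on each corpus.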

  • Keywords

    Conditional Random Field; Indian languages; Kannada; Natural Language Processing; POS tagging

Article ID: 14900
DOI: 10.14419/ijet.v7i4.14900

Copyright © 2012-2015 Science Publishing Corporation Inc. All rights reserved.