The Effectiveness of Using Malay Affixes for Handling Unknown Words In Unsupervised HMM POS Tagger

  • Authors

    • Hassan Mohamed
    • Nazlia Omar
    • Mohd Juzaiddin Ab Aziz
    2018-11-26
    https://doi.org/10.14419/ijet.v7i4.29.21834
  • Keywords: Malay, POS tagger, unsupervised HMM.
  • The challenge in unsupervised Hidden Markov Model (HMM) training for a part-of-speech (POS) tagger is that training depends on an untagged corpus; the only supervised data constraining the possible tags of a word is a dictionary. A morpheme-based POS guessing algorithm is introduced to assign probable tags to unknown words based on linguistically meaningful affixes. To this end, the exact morphemes of prefixes, suffixes and circumfixes in the agglutinative Malay language are examined before tags are given to unknown words. The algorithm is integrated into an HMM tagger, which uses the trained HMM parameters to tag new sentences. For unknown words, however, these parameters are absent. The algorithm therefore applies two methods for assigning unknown-word emissions to the HMM tagger: the first is based on a uniform distribution over all possible tags, and the second is based on a distribution proportionate to the marginal distribution of tags. The more effective method is shown to be morpheme-based POS guessing with unknown-word emissions substituted by a value proportionate to the marginal distribution of tags.

  • How to Cite

    Mohamed, H., Omar, N., & Aziz, M. J. A. (2018). The Effectiveness of Using Malay Affixes for Handling Unknown Words In Unsupervised HMM POS Tagger. International Journal of Engineering & Technology, 7(4.29), 9-12. https://doi.org/10.14419/ijet.v7i4.29.21834
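
As a rough illustration of the two emission-assignment methods named in the abstract, the sketch below guesses candidate tags for an unknown word from its affixes and then builds either a uniform or a marginal-proportionate emission over those candidates. The affix table, the tag names, the function names (`guess_tags`, `uniform_emission`, `marginal_emission`) and the normalisation over the candidate set are all assumptions made for illustration; the paper's actual affix rules and formulas are not reproduced on this page.

```python
from typing import Dict, List

# Hypothetical, heavily simplified affix-to-tag table (NOT the paper's rules).
CIRCUMFIXES = [("ke", "an", ["NOUN"]), ("pe", "an", ["NOUN"])]
PREFIXES = [("ber", ["VERB"]), ("me", ["VERB"]), ("ter", ["VERB", "ADJ"])]
SUFFIXES = [("kan", ["VERB"]), ("an", ["NOUN"])]
OPEN_CLASS = ["NOUN", "VERB", "ADJ"]  # fallback when no affix matches


def guess_tags(word: str) -> List[str]:
    """Return candidate tags for an unknown word, trying circumfixes first."""
    w = word.lower()
    for pre, suf, tags in CIRCUMFIXES:
        # Require a stem of at least two characters between the two parts.
        if w.startswith(pre) and w.endswith(suf) and len(w) > len(pre) + len(suf) + 1:
            return tags
    for pre, tags in PREFIXES:
        if w.startswith(pre) and len(w) > len(pre) + 2:
            return tags
    for suf, tags in SUFFIXES:
        if w.endswith(suf) and len(w) > len(suf) + 2:
            return tags
    return OPEN_CLASS


def uniform_emission(tags: List[str]) -> Dict[str, float]:
    """Method 1 (assumed form): equal weight for every candidate tag."""
    return {t: 1.0 / len(tags) for t in tags}


def marginal_emission(tags: List[str], marginal: Dict[str, float]) -> Dict[str, float]:
    """Method 2 (assumed form): weight proportional to each tag's marginal probability."""
    total = sum(marginal[t] for t in tags)
    return {t: marginal[t] / total for t in tags}


if __name__ == "__main__":
    # Made-up marginal tag distribution, for demonstration only.
    tag_marginal = {"NOUN": 0.5, "VERB": 0.3, "ADJ": 0.2}
    candidates = guess_tags("terbesar")
    print(candidates)
    print(uniform_emission(candidates))
    print(marginal_emission(candidates, tag_marginal))
```

In a tagger along these lines, the resulting values would stand in for the missing emission parameters of the unknown word during decoding, while known words keep their trained emissions.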