POS-Tagging Malay Corpus: A Novel Approach Based on Maximum Entropy


  • Juhaida Abu Bakar
  • Khairuddin Khairuddin
  • Mohammad Faidzul Nasrudin
  • Mohd Zamri Murah






NLP pipeline task, POS-tags, tagging approach, Malay language, Jawi.


The Malay language is written in both the Jawi and Roman scripts. In the past, Jawi writing was widely used by the Malay community and by foreigners, as can be seen in old documents. These documents are at risk of background damage, so to preserve the valuable information they contain there is a significant need to automate the processing of Jawi materials. According to previous literature, part-of-speech (POS) tagging is the first phase of automated text analysis, and language technologies can barely be developed without it. We review the existing POS tagging approaches and propose a Malay Jawi POS tagger built with an extended maximum entropy (ME) based approach on the NUWT Corpus. The results show that the proposed model yields higher accuracy than the state-of-the-art model.




