BnVec: Towards the Development of Word Embedding for Bangla Language Processing

  • Authors

    • Md. Kowsher, Stevens Institute of Technology
    • Md. Jashim Uddin
    • Anik Tahabilder
    • Nusrat Jahan Prottasha
    • Mahid Ahmed
    • K. M. Rashedul Alam
    • Tamanna Sultana
  • Keywords: Word Embedding, Word2vec, fastText, Hash Vectorizer, TF-IDF, Bangla NLP
  • Progress in machine learning and statistical inference is driving advances in domains such as computer vision, natural language processing (NLP), and automation and robotics. Among the many influential developments in NLP, word embedding is one of the most widely used and transformative techniques. In this paper, we present BnVec, an open-source library for Bangla word extraction systems that aims to equip the Bangla NLP research community with several powerful word embedding techniques. BnVec is split into two parts. The first is a class designed for Bangla that embeds words and gives access to the six most popular word embedding schemes (CountVectorizer, TF-IDF, Hash Vectorizer, Word2vec, fastText, and GloVe). The second is a set of pre-trained distributed word embedding models based on Word2vec, fastText, and GloVe. The pre-trained models were built from content collected from newspapers, social media, and Bangla wiki articles. More than 395,289,960 tokens were used to build the models. The paper also reports the performance of these models under various hyper-parameter settings and analyzes the results.
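
The three frequency-based schemes named in the abstract (count vectors, TF-IDF, and hash vectors) can be illustrated with a minimal pure-Python sketch. This is not BnVec's API: the toy corpus, the whitespace tokenization, the smoothed IDF formula, the CRC32 hash, and every function name below are assumptions chosen for illustration only.

```python
import math
import zlib
from collections import Counter

# Toy corpus (hypothetical). Whitespace splitting is an assumption;
# real Bangla text usually needs a dedicated tokenizer.
docs = [
    "আমি গান গাই",
    "আমি কথা বলি",
    "তুমি গান গাও",
]
tokenized = [d.split() for d in docs]

# Fixed vocabulary: one vector position per distinct token.
vocab = sorted({tok for doc in tokenized for tok in doc})
index = {tok: i for i, tok in enumerate(vocab)}


def count_vector(tokens):
    """Bag-of-words counts over the fixed vocabulary."""
    vec = [0] * len(vocab)
    for tok, n in Counter(tokens).items():
        if tok in index:
            vec[index[tok]] = n
    return vec


def idf(token):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    df = sum(token in doc for doc in tokenized)
    return math.log((1 + len(tokenized)) / (1 + df)) + 1.0


def tfidf_vector(tokens):
    """Raw term count times smoothed IDF, without length normalization."""
    return [c * idf(tok) for c, tok in zip(count_vector(tokens), vocab)]


def hash_vector(tokens, n_buckets=8):
    """Hashing trick: each token maps to a bucket, so no vocabulary is stored."""
    vec = [0] * n_buckets
    for tok in tokens:
        vec[zlib.crc32(tok.encode("utf-8")) % n_buckets] += 1
    return vec


counts = count_vector(tokenized[0])
weights = tfidf_vector(tokenized[0])
hashed = hash_vector(tokenized[0])
```

In this sketch a word that appears in fewer documents receives a larger IDF weight, while the hash vector has a fixed length regardless of vocabulary size, at the cost of possible bucket collisions. Word2vec, fastText, and GloVe differ in kind: they learn dense vectors from context and are not shown here.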

    Author Biography

    • Md. Kowsher, Stevens Institute of Technology
      ML, NLP, Computer Vision
  • How to Cite

    Kowsher, M., Uddin, M. J., Tahabilder, A., Prottasha, N. J., Ahmed, M., Alam, K. M. R., & Sultana, T. (2021). BnVec: Towards the Development of Word Embedding for Bangla Language Processing. International Journal of Engineering & Technology, 10(2), 95-102.