Review of techniques for extraction of bilingual lexicon from comparable corpora

  • Authors

    • Manpreet Singh Lehal
    • Dr Ajit Kumar
    • Dr Vishal Goyal
  • Keywords: Bilingual Lexicon, Comparable Corpora, Machine Translation, Extraction.
  • Bilingual lexicons are important resources for a number of bilingual tasks in machine translation (MT) and cross-language information retrieval (CLIR). Since building bilingual lexicons manually is tedious, researchers have focused on extracting them automatically from corpora. A further question is whether to extract from parallel or comparable corpora. Parallel corpora have yielded good results, but they are available only for a few language pairs and for limited domains. Comparable corpora therefore offer an alternative, but much remains to be done in this field. The paper reviews the techniques and methods that have been used for automatic extraction of bilingual lexicons and suggests that an integrated approach can give better results than individual approaches. The paper also proposes a method for bilingual lexicon extraction based on a combined approach.



  • How to Cite

    Lehal, M. S., Kumar, A., & Goyal, V. (2018). Review of techniques for extraction of bilingual lexicon from comparable corpora. International Journal of Engineering & Technology, 7(2.30), 15–20.