Review of techniques for extraction of bilingual lexicon from comparable corpora

Authors and Affiliations

  • Manpreet Singh Lehal
  • Dr Ajit Kumar
  • Dr Vishal Goyal

About this article

DOI:

https://doi.org/10.14419/ijet.v7i2.30.13456

Keywords:

Bilingual Lexicon, Comparable Corpora, Machine Translation, Extraction.

Abstract

Bilingual lexicons are important resources for a number of bilingual tasks in machine translation (MT) and cross-language information retrieval (CLIR). Since building bilingual lexicons manually is tedious, researchers have focused on extracting them automatically from corpora. A further question is whether to use parallel or comparable corpora for extraction. Much success has been achieved with parallel corpora, but they are available only for a few language pairs and for limited domains. Comparable corpora therefore offer an alternative, but much remains to be done in this field. This paper reviews the techniques and methods that have been used for automatic extraction of bilingual lexicons and suggests that an integrated approach can give better results than individual approaches. The paper also proposes a method for bilingual lexicon extraction using a combined approach.


How to Cite

Lehal, M. S., Kumar, A., & Goyal, V. (2018). Review of techniques for extraction of bilingual lexicon from comparable corpora. International Journal of Engineering and Technology, 7(2.30), 15–20. https://doi.org/10.14419/ijet.v7i2.30.13456
