Hybrid method for automatic extraction of multiword expressions

  • Abstract

    A three-phase hybrid method for the automatic extraction of English multiword expressions (MWEs) is proposed. The method combines linguistic patterns with two statistical cues computed between the constituent words of an MWE: association and context similarity. First, candidate expressions are extracted from the raw text as N-grams and filtered using well-defined linguistic patterns. Next, the remaining candidates are filtered again using the association score and the context similarity score between their constituent words. Two association measures, Dice’s coefficient and pointwise mutual information (PMI), are used to calculate the association score. The context similarity between words is calculated using Latent Semantic Analysis (LSA). Choosing the best cut-off threshold is a common problem in statistical methods; a two-phase method that determines the threshold from a training dataset is proposed and employed in this work. A detailed performance analysis was carried out on a manually annotated dataset, and a significant gain in performance was observed for various types of multiword expressions.
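
    The two association measures named in the abstract can be illustrated with a minimal sketch over bigram candidates. This is not the authors' implementation: the function name `association_scores`, the whitespace tokenization, the use of raw relative frequencies as probability estimates, and the toy sentence are all assumptions made for illustration, and the abstract does not state how the paper counts, smooths, or thresholds these scores.

```python
import math
from collections import Counter


def association_scores(tokens):
    """Return {bigram: (dice, pmi)} for every adjacent word pair.

    Hypothetical helper: probabilities are plain relative frequencies,
    which is an assumption, not the paper's stated estimator.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), f12 in bigrams.items():
        # Dice's coefficient: 2 * f(w1, w2) / (f(w1) + f(w2))
        dice = 2 * f12 / (unigrams[w1] + unigrams[w2])
        # PMI: log2( P(w1, w2) / (P(w1) * P(w2)) )
        p12 = f12 / n_bi
        p1 = unigrams[w1] / n_uni
        p2 = unigrams[w2] / n_uni
        pmi = math.log2(p12 / (p1 * p2))
        scores[(w1, w2)] = (dice, pmi)
    return scores


# Toy input; a real run would use N-grams that survived the pattern filter.
tokens = "hot dog stand sells hot dog and cold dog".split()
scores = association_scores(tokens)
for bigram, (dice, pmi) in sorted(scores.items(), key=lambda kv: -kv[1][0]):
    print(bigram, round(dice, 3), round(pmi, 3))
```

    In a pipeline of the kind the abstract describes, candidates whose score falls below a cut-off learned from training data would be discarded; the cut-off itself is left out here because the abstract gives no values.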

  • Keywords

    Collocation Extraction; Information Retrieval; Latent Semantic Analysis; Multiword Expressions; Natural Language Processing.

Article ID: 10063
DOI: 10.14419/ijet.v7i2.6.10063

Copyright © 2012-2015 Science Publishing Corporation Inc. All rights reserved.