Hybrid method for automatic extraction of multiword expressions

    A three-phase hybrid method for the automatic extraction of English multiword expressions (MWEs) is proposed. The method is based on linguistic patterns and on the association and context similarity between the constituent words of an MWE. First, candidate expressions are extracted from raw text as N-grams and filtered using well-defined linguistic patterns. Next, the candidates are filtered again using the association score and the context similarity score between their constituent words. Two association measures, Dice’s coefficient and pointwise mutual information (PMI), are used to calculate the association score. Context similarity between words is calculated using Latent Semantic Analysis (LSA). Choosing good cut-off thresholds is a common problem in statistical methods; a two-phase method for deciding the boundary threshold using a training dataset is proposed and employed in the current work. A detailed performance analysis has been carried out on a manually annotated dataset. A significant gain in performance has been observed for various types of multiword expressions.
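
As an illustration of the two association measures named above, the following minimal sketch computes Dice's coefficient and PMI for adjacent word pairs using their textbook definitions. It is not the authors' implementation: the function name `bigram_association`, the whitespace tokenization, the restriction to bigrams, and the maximum-likelihood probability estimates are all assumptions made for this example.

```python
import math
from collections import Counter


def bigram_association(tokens):
    """Return {(w1, w2): (dice, pmi)} for every adjacent word pair.

    Hypothetical helper for illustration only; probabilities are plain
    relative frequencies with no smoothing.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)               # number of word positions
    n_bi = max(len(tokens) - 1, 1)    # number of adjacent pairs

    scores = {}
    for (w1, w2), f12 in bigrams.items():
        # Dice's coefficient: 2 * f(w1, w2) / (f(w1) + f(w2))
        dice = 2 * f12 / (unigrams[w1] + unigrams[w2])
        # PMI: log2( P(w1, w2) / (P(w1) * P(w2)) )
        p12 = f12 / n_bi
        p1 = unigrams[w1] / n_uni
        p2 = unigrams[w2] / n_uni
        pmi = math.log2(p12 / (p1 * p2))
        scores[(w1, w2)] = (dice, pmi)
    return scores


tokens = "new york is big and new york is old".split()
scores = bigram_association(tokens)
for pair, (dice, pmi) in sorted(scores.items(), key=lambda kv: -kv[1][0]):
    print(pair, round(dice, 3), round(pmi, 3))
```

In a pipeline of the kind the abstract describes, candidates whose scores fall below a cut-off threshold would be discarded; how that threshold is chosen is the subject of the two-phase method mentioned above and is not reproduced here.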

    Collocation Extraction; Information Retrieval; Latent Semantic Analysis; Multiword Expressions; Natural Language Processing.

Article ID: 10063
DOI: 10.14419/ijet.v7i2.6.10063

