Abstract: This paper presents an approach for extracting multiword expressions (MWEs) from monolingual corpora. The approach both generates and validates MWE candidates. It produces a list of candidates that are extracted and filtered according to a number of criteria and a set of standard statistical association measures. Candidate generation is based on surface forms, while validation applies a series of criteria for removing noise, using language-independent association measures. To obtain corpus counts, the approach provides a corpus indexation facility. It also allows easy integration with a machine learning tool for creating and applying supervised MWE extraction models when annotated data is available. The authors demonstrate the approach in a standard configuration by extracting MWEs from a general-purpose English corpus.
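The generate-then-validate pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `extract_candidates`, the restriction to bigrams, the choice of pointwise mutual information (PMI) as the association measure, and the thresholds `min_count` and `min_pmi` are all assumptions made for illustration.

```python
import math
from collections import Counter


def extract_candidates(tokens, min_count=2, min_pmi=1.0):
    """Hypothetical sketch: generate bigram candidates from surface forms,
    then keep those that pass a frequency filter and a PMI threshold."""
    # Corpus counts for single tokens and for adjacent token pairs.
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scored = {}
    for (w1, w2), count in bigrams.items():
        # Noise filter (assumed criterion): drop rare candidates.
        if count < min_count:
            continue
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        pmi = math.log2(p_xy / (p_x * p_y))
        if pmi >= min_pmi:
            scored[(w1, w2)] = pmi

    # Highest-scoring candidates first.
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `extract_candidates` on a tokenized corpus returns `(bigram, score)` pairs ranked by association strength. PMI uses only corpus counts, so it is language independent in the sense the abstract describes. Other standard measures could replace it without changing the surrounding structure.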