Réseau universitaire international de Genève
Geneva International Academic Network


Linguistic Analysis and Collocation Extraction

Annual Call for Projects 2001

Summary

Cross-cultural communication frequently raises the problem, particularly in the context of international organizations, of properly understanding idiomatic expressions, i.e. multi-word expressions whose meaning differs from the composition of the individual meanings of their parts. The importance of multi-word expressions is widely recognized in the domains of translation and terminology. For one thing, these expressions usually cannot be translated literally, and one must find an adequate correspondence (idiomatic or not) in the target language.

This problem is particularly acute when dealing with texts of a diplomatic or legal nature. It is then indispensable, both for the participants in a negotiation and for the translators, to be able to: i) recognize a group of words as an (idiomatic) expression; ii) understand its meaning in the source language; and iii) quickly find an appropriate translation in the various target languages.

The problem of identifying and extracting multi-word expressions, and in particular collocations (by collocation, we mean a conventional combination of words such as oil slick, market share, to practice a profession, to entertain the hope, etc.), is a much-debated topic in computational linguistics. On top of the translation problem mentioned above, the proliferation of textual databases (for instance on the Internet) exacerbates the need for indexing techniques and sophisticated search engines. Proper identification of multi-word expressions would be an important improvement to such tools.

Note that although one can find numerous dictionaries of lexical compounds and fixed expressions, there is hardly any dictionary dealing specifically with collocations (with the notable exception of the BBI Dictionary of English Word Combinations).

Computational tools capable of identifying expressions and of suggesting translations can offer partial solutions to this problem. Commercial products for collocation extraction already exist, based on stochastic approaches. However, to be effective and fully satisfactory, collocation extraction systems should be based not only on stochastic (or statistical) methods, but also on a genuine linguistic analysis. Thus, the collocation problem belongs to the field of computational linguistics.

The main objective of this project is the design and development of a computer system for terminology extraction capable of handling multi-word expressions, based on detailed linguistic analysis. The originality of this approach comes from the fact that collocations are not extracted from raw texts, but rather from syntactically parsed texts. The linguistic analysis filters potential pairs of words: only those words that occur in a specific syntactic configuration are taken into account for further statistical processing. Such a chain of processes significantly increases the quality and the relevance of the extracted collocations.
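The two-stage chain described above (syntactic filtering, then statistical scoring) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration: the source does not name the parser, the syntactic configurations retained, or the statistical measure used. Here the parser output is assumed to be a list of (head, dependent, relation) triples, the relation labels are hypothetical, and pointwise mutual information stands in for whatever association measure the project actually applied.

```python
import math
from collections import Counter

# Hypothetical relation labels; the project's real tag set is not given in the source.
ALLOWED_RELATIONS = {"verb-object", "adjective-noun", "noun-noun"}


def extract_collocations(triples, min_count=2):
    """Rank syntactically filtered word pairs by pointwise mutual information.

    `triples` is an iterable of (head, dependent, relation) tuples, assumed
    to come from a syntactic parser.
    """
    # Stage 1: linguistic filter. Keep only pairs in an allowed configuration.
    pairs = [(head, dep) for head, dep, rel in triples if rel in ALLOWED_RELATIONS]
    total = len(pairs)
    if total == 0:
        return []

    # Stage 2: statistical processing on the filtered pairs only.
    pair_counts = Counter(pairs)
    head_counts = Counter(head for head, _ in pairs)
    dep_counts = Counter(dep for _, dep in pairs)

    scored = []
    for (head, dep), n in pair_counts.items():
        if n < min_count:
            continue  # too rare to score reliably
        # PMI = log2( P(head, dep) / (P(head) * P(dep)) )
        pmi = math.log2(n * total / (head_counts[head] * dep_counts[dep]))
        scored.append(((head, dep), pmi))

    # Highest association score first.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored


# Toy input standing in for parser output (invented for this example).
SAMPLE_TRIPLES = [
    ("entertain", "hope", "verb-object"),
    ("entertain", "hope", "verb-object"),
    ("practice", "profession", "verb-object"),
    ("practice", "profession", "verb-object"),
    ("market", "share", "noun-noun"),
    ("market", "share", "noun-noun"),
    ("market", "price", "noun-noun"),
    ("the", "market", "determiner-noun"),
    ("the", "market", "determiner-noun"),
    ("the", "hope", "determiner-noun"),
]

if __name__ == "__main__":
    for pair, score in extract_collocations(SAMPLE_TRIPLES):
        print(pair, round(score, 2))
```

In this toy input, the frequent pair ("the", "market") never reaches the scoring stage because its relation is not in the allowed set, which is the point of filtering on syntactic configuration before counting.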

The system developed in this project will be applied to a large number of WTO textual corpora and will enrich the organization's workbench for translators and terminologists.

The grant provided by the GIAN for this project totals SFr 234,000.


Project Team

Mr Fermin Alcoba Enciso, Principal Member, Division of Linguistic Services and Documentation, World Trade Organisation (WTO).

Mr Jean-Philippe Goldman, Principal Member, Linguistics Department, Faculty of Arts, University of Geneva (Unige).

Mr Juan Mesa, Principal Member, Division of Linguistic Services and Documentation, World Trade Organisation (WTO).

Mr Olivier Pasteur, Principal Member, Division of Linguistic Services and Documentation, World Trade Organisation (WTO).

Research Output

Linguistic Analysis and Collocation Extraction
(available in English and French)
Linguistic Analysis and Collocation Extraction - Final Report
(available in English only)