Linguistic resources on the Internet
1998/12/01 Artola, Xabier Source: Elhuyar aldizkaria
Programs that process language by computer are increasingly numerous, and communication with computers in natural language (in Basque, in our case) will become increasingly common. At the same time, the computer is becoming a valuable aid for easing the movement between languages that this multilingual society demands.
In addition, the enormous advances in telecommunications (especially the Internet phenomenon) have increased the need for automatic language processing. A great deal of information can be obtained through the network, but finding the specific data we need is not easy. In that task, linguistic processing serves as an auxiliary tool.
The field of research on automatic language processing is called Natural Language Processing (NLP). A whole new industry is growing up around language, with the aim of processing it by computer, and people already speak of language technology and language engineering. It has four main fields of application: i) text editing or text management (spelling and style checkers, aids for creating and using multilingual texts, dictionary lookup, ...); ii) processing and management of large bodies of text (concept search, document classification, information extraction and automatic text generation); iii) machine translation or computer-assisted translation; and iv) recognition and generation of language.
In the IXA group we have worked on this subject for ten years, always from the point of view of the Basque language. Counting members from the Faculty of Computer Science of the UPV-EHU in Donostia and from UZEI, we are 21 people in total. Our strategy has never been to build one very complex system, such as a translation system. We have preferred to start with simple but fundamental objectives, such as morphology, which is regarded as too simple a problem for other languages, and to build broad and solid linguistic foundations along the way.
Later we undertook more complex projects such as lemmatization, syntax and the use of dictionaries, but working on a broad, previously built base saves us time and gives consistency to new products. Since our linguistic resources can also be useful to other groups, we decided to make them available through an “electronic exhibition”, which is the objective of the project presented in this article. The project was approved in the 1997 University-Company call for research projects of the Basque Government (reference UE97/8) and will be carried out during 1998-99.
The resources we want to make available on the Internet in the medium term are the lexical database, the spell checker, the morphological analyzer, the lemmatizer and the syntactic analyzer. In this first step, however, only the first three will appear. The project is under way and you can already try out the spell checker at http://ixa.si.ehu.es/tresna (see the screenshots that appear in this article, or view them directly on your computer).
Try adding words that the checker does not recognize to your personal vocabulary, and check that from then on it also recognizes other declined forms of those words.
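The behaviour described above, where adding one word to the personal vocabulary makes its declined forms acceptable too, can be illustrated with a very small sketch. This is not how the IXA spell checker is implemented: the class name, the method names and the handful of endings are assumptions made for illustration, and the sketch ignores the phonological alternations that a real morphological analyzer has to handle.

```python
# Toy subset of Basque case endings, chosen for illustration only.
# A real analyzer relies on a complete morphological description.
TOY_SUFFIXES = ("", "a", "ak", "aren", "ari")


class ToySpellChecker:
    """Hypothetical checker: a word is known if stem + toy ending matches."""

    def __init__(self, lexicon):
        self.lemmas = set(lexicon)   # stems shipped with the checker
        self.personal = set()        # stems added by the user

    def add_personal_word(self, lemma):
        """Store a new stem in the user's personal vocabulary."""
        self.personal.add(lemma)

    def is_known(self, word):
        """Strip each toy ending in turn and look the remaining stem up."""
        for suffix in TOY_SUFFIXES:
            if suffix and not word.endswith(suffix):
                continue
            stem = word[: len(word) - len(suffix)] if suffix else word
            if stem in self.lemmas or stem in self.personal:
                return True
        return False


checker = ToySpellChecker({"mendi"})
print(checker.is_known("modemak"))    # stem "modem" is not in any vocabulary yet
checker.add_personal_word("modem")
print(checker.is_known("modemaren"))  # recognized through the personal stem
```

The point of the sketch is only that the personal vocabulary stores stems rather than surface forms, so the same endings that apply to the built-in lexicon apply to the user's words as well.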
To finish, we will explain what the Lexical Database of Basque (EDBL), mentioned in the name of the project, actually is. The lexical database is a large lexical repository. It is a kind of electronic dictionary, conceived for the automatic processing of the language and therefore organized according to the demands of that objective. This requires, of course, that the lexicon be organized with its later use in mind, and that the lexical description be systematized: a unified and homogeneous system of entry categories, a definition of the features needed to describe the elements of each category correctly, and so on.
In the case of Basque, the need for this type of lexical repository arose when we in the IXA group began preparing the Xuxen spell checker. For us this checker was essentially a by-product of the morphological analyzer, and we did not want to organize the lexical database as a dictionary or a simple word list for that checker alone, but as a solid lexical base for any other future tool or application in the automatic processing of Basque. That is how EDBL, the Lexical Database of Basque, came about. Since then it has been the lexical basis of our work, it has been constantly updated, and sooner or later it will open its doors to a wider community so that others can also make use of it.
When the database was designed, great importance was therefore given to making it flexible enough to accept possible future extensions and, in particular, to describing the linguistic information it contains as neutrally as possible, that is, as independently as possible of particular formalisms or linguistic theories.
EDBL currently comprises nearly 70,000 entries, classified into three major sections: dictionary entries (nouns, adjectives, verbs, etc.), inflected verb forms and non-independent morphemes (suffixes, prefixes, etc.).
For each entry, the predefined characteristics or attributes of its category are recorded. In every case the morphology of the entry (its morphological information) is described using the two-level formalism that is widely used in computational morphology.
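To make this organization more concrete, the following sketch models entries that belong to one of the three sections and carry category-specific attributes. It is only an illustration under stated assumptions: the type names, field names, attribute keys and sample entries are hypothetical and do not reproduce the real EDBL schema, and the `two_level_form` field is merely a placeholder for the two-level description of each entry.

```python
from dataclasses import dataclass, field
from enum import Enum


class Section(Enum):
    """The three major sections named in the text."""
    DICTIONARY_ENTRY = "dictionary entry"
    INFLECTED_VERB_FORM = "inflected verb form"
    BOUND_MORPHEME = "non-independent morpheme"


@dataclass
class LexicalEntry:
    form: str            # citation form of the entry
    section: Section     # which of the three sections it belongs to
    category: str        # e.g. "noun", "verb", "suffix" (illustrative labels)
    two_level_form: str  # placeholder for the two-level description
    attributes: dict = field(default_factory=dict)  # category-specific features


def filter_entries(entries, section):
    """Return only the entries of one section, e.g. to feed one application."""
    return [entry for entry in entries if entry.section is section]


# Sample entries, invented for illustration.
entries = [
    LexicalEntry("etxe", Section.DICTIONARY_ENTRY, "noun", "etxe"),
    LexicalEntry("da", Section.INFLECTED_VERB_FORM, "verb", "da",
                 {"person": "3sg"}),
    LexicalEntry("-ak", Section.BOUND_MORPHEME, "suffix", "ak",
                 {"number": "plural"}),
]

for entry in filter_entries(entries, Section.BOUND_MORPHEME):
    print(entry.form, entry.attributes)
```

Keeping the attributes in an open dictionary, rather than in fixed fields, is one simple way to leave room for later extensions without committing to a particular linguistic theory.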
EDBL currently runs on a commercial database management system that offers linguists, who are its main users, the usual facilities of this kind of system: a pleasant working interface, facilities for keeping the information up to date and ensuring its consistency, ways of filtering the information appropriately for each application, and so on. The database has also become an essential tool for keeping up with the latest developments in the unification process of the Basque language, especially the decisions of Euskaltzaindia, and one of the important roles EDBL may play in the future is to be the tool that reflects those latest decisions.
- Title of the project: Public use environment of the Lexical Database of Basque (EDBL).
- Objective of the project: Promoting on the Internet the use of some of the IXA group's products for the Basque language.
- Director: Xabier Artola Zubillaga.
- Working Team: IXA Group: E. Agirre, I. Aldezabal, I. Alegria, O. Ansa, X. Arregi, J.M. Arriola, X. Artola, A. Díaz de Ilarraza, N. Ezeiza, K. Gojenola, J.M. Intxausti, M. Lersundi, A. Maritxal, M. Maritxalar, M. Oronoz, K. Sarasola, A. Soroa, R. Urizar and M. Urkia.
- Department: Languages and Computer Systems
- Center: UPV-EHU Faculty of Computer Science (Donostia)