XUXEN, a spell checker for the Basque language

Last September, Xuxen, the spell checker for Basque, was presented at the Koldo Mitxelena cultural center in Donostia. We had the opportunity to talk at leisure with Iñaki Alegria, one of its authors, about what was shown there. The following interview is the result.

Elhuyar.- What is the Xuxen program?

Figure 1. Main screen.

I. Alegria.- Xuxen is a spell checker for texts written in Basque: it aims to detect and correct typographical and spelling errors. The program takes unified Basque as its standard. One or more documents can be analyzed in each run. The documents are processed word by word: as long as the program recognizes the words it keeps going, but when it meets a word it does not know, it stops and warns the user. Faced with the possible error, the user can accept the word, correct it directly, ask the program for its correction proposals, or add the word to the personal dictionary so that all the forms derived from that lemma are recognized (see figure 1).
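The checking loop described in this answer can be sketched roughly as follows. This is a minimal illustration in Python, not Xuxen's actual code: the names `check_text` and `is_known`, and the tiny set of known words, are assumptions made for the example, and the personal dictionary is simplified to a plain set of word forms rather than lemmas.

```python
import re

def check_text(text, is_known, personal_dictionary):
    """Yield, in order, each word that is not recognized.

    is_known is a hypothetical callback standing in for the morphological
    analysis; personal_dictionary is a set of forms accepted by the user.
    """
    for match in re.finditer(r"[^\W\d_]+", text):  # runs of letters
        word = match.group(0)
        lowered = word.lower()
        if is_known(lowered) or lowered in personal_dictionary:
            continue  # known word: keep going
        yield word    # unknown word: stop here and ask the user


# Toy stand-in for the real analysis (assumption: three known forms).
KNOWN = {"etxea", "handia", "da"}
personal = set()

unknown = list(check_text("Etxea haundia da", KNOWN.__contains__, personal))
```

Because the function is a generator, the caller regains control at every unknown word, which mirrors the "warn and stop" behaviour; adding the word to `personal` makes later runs skip it.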

Elh.- How long has it taken to complete the program, and who has taken part?

I.A.- Developing the program has been a major undertaking, mainly for two reasons: the morphological complexity of Basque and the lack of a systematic description of Basque morphology.

Building on an earlier automatic morphological analyzer, the program has been developed over the last three years through the collaboration of the Faculty of Informatics of the UPV/EHU, UZEI and the company Hizkia of Baiona. The team at the Faculty of Informatics developed the prototype and coordinated the work, UZEI provided the linguistic guarantee, and Hizkia took responsibility for the commercial product. Versions for Macintosh and PC have been prepared. The project has also had the support of IVAP, the Department of Economics of the Provincial Council of Gipuzkoa and the Euskadi/Aquitaine Cooperation Programme.

Elh.- What are the design features of the Xuxen program?

I.A.- Because of the morphological complexity of Basque, it is not enough to look a word up in a list, as is done for other languages, to decide whether it is correct: so many legitimate forms can be created from a single lemma that the list would be enormous. For example, starting from a noun and adding a single declension suffix, we can obtain 135 legal forms (and the number rises enormously if ellipsis is taken into account). Moreover, with a list-based approach the user would have to enter in the personal dictionary, one by one, all the forms corresponding to a lemma, whereas in Xuxen entering the lemma alone is enough for its whole declension to be recognized. For all these reasons, it has been necessary to carry out a morphological analysis that can correctly identify the legal words.

Elh.- You mention that morphological analysis has been fundamental. What steps have you taken to address it?

I.A.- The morphological analysis is based on the two-level formalism proposed in 1983 by Professor Koskenniemi of the University of Helsinki. Although the formalism was first proposed for Finnish, it has proved successful for other languages as well, including languages like Basque. Its main characteristics are the clear separation between the words as they appear in texts (the surface level) and the lexicon (the lexical level), its usefulness for both analysis and synthesis, and the separation between the program and the linguistic knowledge. The linguistic information consists of morpheme lexicons and morphophonological rules.

The lexicon has more than 60,000 entries, stored in a database and distributed across 120 sublexicons. Each entry is assigned a continuation class that defines the set of suffixes that may follow it. The surface changes that occur when morphemes are joined are expressed in twenty-four morphophonological rules. Each of these rules states when a character is inserted, deleted or modified. For example, rule eight describes the following modification: a lexical k becomes a surface g if the k belongs to the suffix “ko” and the preceding morpheme is either a lemma ending in n or a place name ending in l, m or n. For example, when the morphemes are joined, the form is created.
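To make the idea of continuation classes and a k-to-g rule concrete, here is a small sketch in Python. It is only an illustration under stated assumptions: the toy lexicon, the class names `LOCAL` and `NONE`, the function names, and the example words (irun, eibar, hemen) are chosen for this example and do not come from Xuxen; real two-level systems compile rules into finite-state transducers rather than generating and comparing forms as done here.

```python
# Hypothetical toy lexicon: entry -> (category, continuation class).
LEXICON = {
    "irun": ("place_name", "LOCAL"),
    "eibar": ("place_name", "LOCAL"),
    "hemen": ("lemma", "LOCAL"),
    "eta": ("lemma", "NONE"),
}

# A continuation class names the set of suffixes allowed after an entry.
CONTINUATION_CLASSES = {
    "LOCAL": {"ko"},
    "NONE": set(),
}

def apply_k_to_g(stem, category, suffix):
    """Lexical k of "ko" surfaces as g after a lemma ending in n
    or a place name ending in l, m or n."""
    if suffix != "ko":
        return suffix
    triggers = "lmn" if category == "place_name" else "n"
    return "go" if stem[-1] in triggers else suffix

def generate(stem, suffix):
    """Lexical level -> surface level (synthesis)."""
    category, cont_class = LEXICON[stem]
    if suffix not in CONTINUATION_CLASSES[cont_class]:
        raise ValueError(f"{suffix!r} cannot follow {stem!r}")
    return stem + apply_k_to_g(stem, category, suffix)

def analyze(surface):
    """Surface level -> lexical level (analysis), by generate-and-compare."""
    return [
        (stem, suffix)
        for stem, (_, cont_class) in LEXICON.items()
        for suffix in CONTINUATION_CLASSES[cont_class]
        if generate(stem, suffix) == surface
    ]
```

With this toy data, `generate("irun", "ko")` gives `irungo` while `generate("eibar", "ko")` keeps the k, and `analyze("irunko")` returns no analysis, so that form would be flagged as an error. The same rule description serves both directions, which is the point of the two-level separation.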

To increase speed, a list of the most frequently used words has been prepared so that they do not have to go through morphological analysis.
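A fast path of this kind might look like the sketch below. The word list, the function name `is_correct` and the stub analyzer are assumptions made for illustration; the only idea taken from the text is that frequent words are checked against a list before the slower morphological analysis is tried.

```python
# Illustrative list of very frequent word forms (assumption).
FREQUENT_WORDS = frozenset({"eta", "da", "ez", "bat"})

def is_correct(word, analyze):
    """Accept frequent words immediately; analyze everything else."""
    if word in FREQUENT_WORDS:
        return True            # fast path: no morphological analysis
    return bool(analyze(word)) # slow path: at least one analysis needed


# Stub analyzer that records how often the slow path is taken.
calls = []

def slow_analyze(word):
    calls.append(word)
    return [("etxe", "a")] if word == "etxea" else []
```

Checking `eta` never reaches `slow_analyze`, whereas `etxea` does, so the cost of analysis is paid only for words outside the list.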

Elh.- Proposing corrections is one of the options the program offers. What does it consist of?

I. Alegria. Professor at the Faculty of Informatics of the University of the Basque Country. One of the authors of Xuxen, the spell checker for the Basque language.

I.A.- When an error is found, the user can ask the program for proposals. Typographical and spelling errors are treated differently. For typographical errors, the sources of error considered are the loss of a character, the insertion or substitution of a character, and the transposition of two adjacent characters; the program works backwards from these operations to find suitable words to propose.
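The four error sources listed here correspond to the familiar single-edit operations, so working backwards can be sketched as generating every string one edit away and keeping those the checker accepts. This is a generic generate-and-filter sketch, not a description of how Xuxen implements it; the alphabet, the function names and the toy set of known words are assumptions.

```python
import string

ALPHABET = string.ascii_lowercase  # assumption: plain Latin letters only

def typo_candidates(word):
    """All strings obtained by undoing one typographical error."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # A character was lost: put one back.
    restored = {a + c + b for a, b in splits for c in ALPHABET}
    # A character was inserted: take one out.
    removed = {a + b[1:] for a, b in splits if b}
    # A character was replaced: try every other letter in its place.
    replaced = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    # Two adjacent characters were exchanged: swap them back.
    swapped = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    return (restored | removed | replaced | swapped) - {word}

def propose(word, is_correct):
    """Keep only the candidates that the checker accepts."""
    return sorted(c for c in typo_candidates(word) if is_correct(c))


# Toy stand-in for the morphological check (assumption).
KNOWN = {"etxe", "etxea", "etxean"}
proposals = propose("etxa", KNOWN.__contains__)
```

In a checker based on morphological analysis, `is_correct` would be the analyzer itself, so proposals are validated against the same lexicon and rules used for detection.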

Errors caused by poor knowledge of Basque, by unfamiliarity with the latest standardization decisions or by dialectal usage are called orthographic or competence errors. To detect and correct them, Xuxen has a special sublexicon and special rules. For example, in the special sublexicon haundi is linked to the preferred form handi ('big'): analysing haundiarena yields haundi + arena, but since haundi is marked as an error it is replaced by handi, and generation from handi + arena produces handiarena as the proposal. The special rules include those describing the loss of h and the x/s variation. In this way, analysing zuaitxeko automatically yields zuhaitz ('tree') + ko, and generation produces the proposal zuhaitzeko.
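The analyse-then-regenerate idea for nonstandard forms such as haundi can be sketched as follows. The data and names here are hypothetical: the mapping holds a single pair (haundi and its standard counterpart handi, 'big'), the suffix set is a toy one that treats "arena" as a single unit, and the special rules for h loss and letter variation are not modelled.

```python
# Hypothetical special sublexicon: nonstandard stem -> preferred stem.
NONSTANDARD_STEMS = {"haundi": "handi"}

# Toy set of endings that may follow these stems (assumption).
SUFFIXES = {"", "a", "arena"}

def split(word, stems):
    """Analysis: every (stem, suffix) split with a listed stem and ending."""
    return [
        (word[:i], word[i:])
        for i in range(1, len(word) + 1)
        if word[:i] in stems and word[i:] in SUFFIXES
    ]

def propose_standard(word):
    """Swap a nonstandard stem for its preferred form and regenerate."""
    return [
        NONSTANDARD_STEMS[stem] + suffix  # generation with the standard stem
        for stem, suffix in split(word, NONSTANDARD_STEMS)
    ]
```

Here `propose_standard("haundiarena")` returns `["handiarena"]`, while a word with no nonstandard stem yields an empty list. Only the stem has to be listed; the suffix is carried over unchanged, so every inflected form of a known nonstandard stem gets a proposal.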

Elh.- What language model have you used?

I.A.- Given how Basque inflects, we have had to build a declension system suitable for computer processing. We started from the table proposed by Euskaltzaindia and adapted it to our system: we took the table and grouped the cases that apply to each category of the lexicon. Thus, each base is assigned a single suffix group, made up of the suffixes it can take.

In derivation, some prefixes and suffixes have been handled, but the most common derived words are included as dictionary entries. In any case, the user can enter new derived words into the personal dictionary. In compounding, the most common and most systematizable patterns have been handled for the moment, following the criteria set by the LEF Commission of Euskaltzaindia. The factitive verb is also treated systematically.

1992 Euskaltzaindia Recommendation

As for the verb, Xuxen knows the forms of both the auxiliary verb and the synthetic verbs, as established by Euskaltzaindia. It recognizes both neutral, or unmarked, forms and hitano (familiar) forms.

While in grammar the only normative source has been Euskaltzaindia (the Royal Academy of the Basque Language), the same cannot be said once we turn to the lexicon. Euskaltzaindia has issued recommendations and decisions on only some points: the letter h, word-final -a that belongs to the stem, compounds, the writing of numbers, etc. These are the ones we have followed when completing the lexicon, although in the case of numbers we accept both options for the moment (both hogeita bost and hogeitabost, 'twenty-five'). The same has been done with the names of people and places, and with the spelling of loanwords.

To create the basic vocabulary, that is, the list of the most frequent lemmas that any lexicon needs, we have had to turn to other current sources: Ibon Sarasola's Hauta-lanerako Euskal Hiztegia, UZEI's Euskalterm database and EEBS lexicographical database, Hiztegia 2000 by Xabier Kintana and others, J.M. Etxebarria's frequency and availability dictionary, etc. Where entries did not conform to Euskaltzaindia's criteria they have been “adapted”, and on points not yet settled by Euskaltzaindia, Ibon Sarasola's dictionary has been the reference.

To complete the basic vocabulary, compound expressions, phrases and forms have been taken from UZEI's SEE. Acronyms and abbreviations have also been handled according to UZEI criteria. Although the starting point is common vocabulary, terminology has sometimes been necessary, and Euskalterm has been essential in those cases.

To complete the list of proper names (since proper names are not found in ordinary dictionaries), two sources have been used: the first is the list of Basque personal and place names proposed by Euskaltzaindia; for the place names of the rest of the world, we have turned to Elhuyar.

From all these sources we have built a large vocabulary that covers at least the lexicon of ordinary texts. The terminology of specific fields, however, is left for users to add freely to their personal dictionaries.

Elh.- What are your plans for the future?

I.A.- In our group we want to tackle automatic syntactic analysis in the coming years. That way, the XUXEN of the future will be able to perform more advanced correction. Our group is also working on dictionaries, with the aim of getting more out of computer resources when they are applied to dictionaries. Before that, building on the morphological analysis, we intend to release the automatic lemmatizer EUSLEM within a year.
