
Basque Language and Linguistic Engineering

2002/11/22 Sarasola, Kepa

In Science and Technology Week, Wednesday was devoted to information and communication technologies. That day Dr. Kepa Sarasola, a professor at the UPV, spoke about Basque and information and communication technologies (ICT). He has sent us a summary of his talk, and we would like to thank him here.

Steps to organize the language industry

In the medium term, communication between people and machines may take place in our language rather than in that of the machines. There is no doubt that natural language is central to our everyday life, nor that its computational treatment is becoming increasingly important. Document databases grow every day, the ways we interact with computers are changing, and multimedia systems are all going digital. As a result, we need to explore ways of processing natural language. Language technologies are undoubtedly fundamental to what we call the information and communication society.

These tools will be limited and will always work with some degree of error, but they will nevertheless help us a great deal. On the one hand, they will be economically profitable: it is cheaper to correct a draft translation that contains errors than to translate the whole text from scratch. On the other hand, they will improve communication between people (for example, talking on the phone with someone who speaks another language while the words are translated one by one).

Several linguistic applications are currently available: spelling and style checkers, online dictionary lookup, translation aids, Internet search engines, systems that convert speech into text, text readers, second-language learning systems, etc.

However, most such systems work only for English. Other languages have to make a great effort not to be left behind, and Basque and the other minority languages even more so.

The website of the Natural Language Software Registry lists 167 programs currently available for working with language (see Figure 1). Of these, 75% are available for English and only 30% can be used with any language. Most of the applications on the market are aimed at "large" languages, mainly English, and to a lesser extent French, German and Spanish.

Figure 1.

Application of linguistic engineering

In the almost 50-year history of natural language processing (NLP) there have been great ups and downs. Euphoric moments, when fascinating goals seemed within reach, have repeatedly been followed by pragmatic ones, when ambitions had to be scaled back to more modest but attainable goals. The day when computers understand language as people do is still far away, but that does not mean that interesting and very useful applications cannot be built.

However, for the development of these applications it is necessary to start from a solid foundation. In general, we can represent the structure of language technologies with a kind of pyramid.

At the base of this pyramid are the basic resources needed to work in linguistic engineering. On top of these resources we can build tools, and the tools in turn allow us to launch commercial products in the different areas of linguistic engineering. We must bear in mind, however, that the reverse route is not possible unless we want to start building the house from the roof.

What infrastructure is needed to develop applications?

We want applications, of course. We live in a multilingual society and dream of tools that help us cope with multilingualism: machine translation into Basque, speech recognition, style checkers. But if we are to create them, we first need a solid foundation. For example, to build a semi-automatic tool that can help translators, we must first develop a number of resources and tools.

In the case of Basque, the main tools and basic resources we have developed so far are:

Tools

  • A tool that turns speech into written text. In the Basque Country two or three research groups are working on this topic: one at the Bilbao School of Engineering, the Council, and another at the Leioa Faculty of Sciences.
  • Morphological analyzer. It is necessary in every language and essential in Basque, since Basque is an inflected, agglutinative language. The function of the morphological analyzer (and synthesizer) is to identify (and compose) the morphemes that make up a word form and to provide the morphological and lexical information corresponding to each morpheme. This tool underlies applications such as spell checking and optical character recognition (OCR), as well as more sophisticated ones such as machine translation. The general morphological analyzer/synthesizer for Basque has been built, and it is the core of Xuxen, the Basque spell checker.
  • Lemmatizer/tagger. The lemmatizer/tagger builds on the morphological analyzer and provides the lemma and category of a word form, using the context to remove or reduce ambiguity. Although its main task is disambiguation, another of its tasks is the identification of multiword lexical units (idioms, compounds, personal names, etc.). Lemmatizers have very interesting applications: indexing (in Internet search engines, for example), terminology, lexicography, etc. The general lemmatizer for Basque is called EusLem and is already used in several Internet search engines.
  • Syntactic analyzer. The function of a syntactic analyzer is to identify the syntactic components of a text: sentences, noun phrases, noun modifiers, etc. The analysis is based on a lexicon and a grammar, which define the features of words and the ways syntactic structures can be combined. It is also an indispensable tool in many language applications, such as machine translation. For Basque we have developed a general shallow syntactic analyzer, EusMG, and work on an analyzer that yields the full syntactic tree is quite advanced.
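
As a rough illustration of what morphological analysis and synthesis involve, the sketch below splits a word form into a stem and a suffix using a tiny hand-made lexicon. The lexicon contents, the names STEMS, SUFFIXES, analyze, lemmas and synthesize, and the simple "try every split" strategy are all assumptions made for this example. It is not how the analyzer behind Xuxen is implemented, and it ignores the sound changes and longer suffix sequences that a real analyzer for Basque has to handle.

```python
from typing import NamedTuple


class Morpheme(NamedTuple):
    form: str  # surface string of the morpheme
    info: str  # morphological-lexical information attached to it


# Hypothetical toy lexicon: two noun stems and four case endings.
STEMS = {"etxe": "noun, 'house'", "mendi": "noun, 'mountain'"}
SUFFIXES = {
    "a": ["absolutive singular"],
    "ak": ["absolutive plural", "ergative singular"],  # ambiguous ending
    "an": ["inessive singular ('in the')"],
    "ra": ["allative singular ('to the')"],
}


def analyze(word: str) -> list[list[Morpheme]]:
    """Return every stem (+ suffix) reading of `word` that the lexicon allows."""
    analyses = []
    for cut in range(1, len(word) + 1):
        stem, ending = word[:cut], word[cut:]
        if stem not in STEMS:
            continue
        stem_morpheme = Morpheme(stem, STEMS[stem])
        if not ending:
            analyses.append([stem_morpheme])  # bare stem, no suffix
        for reading in SUFFIXES.get(ending, []):
            analyses.append([stem_morpheme, Morpheme(ending, reading)])
    return analyses


def lemmas(word: str) -> set[str]:
    """Collect the candidate lemmas (here simply the stems) of a word form."""
    return {analysis[0].form for analysis in analyze(word)}


def synthesize(stem: str, suffix: str) -> str:
    """Compose a word form from a known stem and a known suffix."""
    if stem not in STEMS or suffix not in SUFFIXES:
        raise ValueError("unknown stem or suffix")
    return stem + suffix


if __name__ == "__main__":
    for form in ("etxean", "etxeak", "mendira"):
        print(form, analyze(form))
```

In this toy lexicon the form "etxeak" receives two readings (absolutive plural and ergative singular). That is the kind of ambiguity a lemmatizer/tagger is meant to resolve from the context, and a spell checker built this way would simply reject any form for which analyze returns no reading.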

Linguistic resources and foundations

To develop applications we first need tools, but the tools in turn rest on resources. The main ones are:

  • Lexical database and description of morphology. The lexical database of the Basque language EDBL currently contains about 75,000 entries.
  • Electronic dictionaries. Around a general lexical database, other lexical resources can be built, such as definition dictionaries, specialized terminology dictionaries, bilingual dictionaries, etc.
  • Computational grammars: descriptions of syntax. In the case of Basque, we must also take into account the close relationship between morphology and syntax. This has led us to integrate morphosyntactic treatment into the morphological analyzer; the result is a general morphosyntactic analyzer called Morfeus.
  • Semantic taxonomies. Morphology and syntax are not enough to understand language; the program also has to know about semantics. Lexical-semantic relationships are expressed explicitly in a kind of semantic network. The best-known semantic network for English is WordNet, and its adaptation to Basque is called Euskal WordNet.
  • Text corpora. Text corpora are large bodies of text. They are the main source of linguistic information and the essential test bed for the applications, tools and resources mentioned above.
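
To give an idea of what "lexical-semantic relationships expressed in a semantic network" can look like in a program, here is a minimal sketch of a hypernym ("is a kind of") network. The handful of Basque nouns, the links between them and the function names are invented for this example; they are not taken from WordNet or Euskal WordNet, whose structure is far richer.

```python
from typing import Optional

# Hypothetical "is a kind of" links between a few Basque nouns.
HYPERNYM = {
    "txakur": "animalia",  # dog -> animal
    "katu": "animalia",    # cat -> animal
    "animalia": "izaki",   # animal -> being
}


def hypernym_chain(word: str) -> list[str]:
    """Follow the hypernym links upward from `word` until they run out."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        parent = HYPERNYM[chain[-1]]
        if parent in chain:  # guard against accidental cycles in the data
            break
        chain.append(parent)
    return chain


def common_hypernym(a: str, b: str) -> Optional[str]:
    """Return the most specific concept found in both chains, if any."""
    ancestors_of_a = hypernym_chain(a)
    for concept in hypernym_chain(b):
        if concept in ancestors_of_a:
            return concept
    return None


if __name__ == "__main__":
    print(hypernym_chain("txakur"))
    print(common_hypernym("txakur", "katu"))
```

Even a network this small lets a program work out that "txakur" and "katu" are related through "animalia", which is information that morphology and syntax alone do not provide.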

As mentioned above, without these basic resources and tools, we will not be able to develop the applications we pursue.
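
The layering of resources, tools and applications can be pictured as a dependency graph in which nothing can be built before the things it rests on. The sketch below uses Python's standard graphlib module to derive one valid build order. The component names come from this article, but the exact edges between them are an illustrative assumption, not a description of how the systems are actually assembled.

```python
from graphlib import TopologicalSorter

# Each component maps to the components it is built on (assumed edges).
DEPENDS_ON = {
    "EDBL lexical database": [],                          # resource
    "morphological analyzer": ["EDBL lexical database"],  # tool
    "EusLem lemmatizer": ["morphological analyzer"],      # tool
    "Xuxen spell checker": ["morphological analyzer"],    # application
    "indexing application": ["EusLem lemmatizer"],        # application
}


def build_order(depends_on: dict[str, list[str]]) -> list[str]:
    """Return the components in an order that respects every dependency."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step, component in enumerate(build_order(DEPENDS_ON), start=1):
        print(step, component)
```

Whatever order is produced, the lexical database comes before the analyzer and the analyzer before the applications, which is the point of the pyramid: the house cannot be started from the roof.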

For Basque we do have tools and resources, but if we want language technologies on a par with those available for English, we still have a long way to go.

Conclusions

There are products that combine Basque and software. The Euskera Software Catalogue lists 105 of them, 26 of which are related to the language industry. That is not nothing, but it is very little; we have to make a great effort so that Basque is not left behind in the information society.

Each of the linguistic resources we create along the way, and each of the tools and applications, must be well designed so that it can be reused in later products.

With the aim of carrying out research and development in linguistic engineering and of building an industry that is solid at the international level, we have designed a medium-term strategy based on the IXA Group's 15 years of experience.

Research teams, industry and official bodies must coordinate to achieve this goal.
