Digital auzolan in support of Basque

2019/09/01 Leturia Azkarate, Igor - Computer scientist and researcher, Elhuyar Hizkuntza eta Teknologia. Source: Elhuyar aldizkaria

Tools and services that use language and speech technologies are increasingly part of our lives: virtual assistants, smart speakers, automatic translators... Developing these technologies requires resources, and not only financial ones. In particular, the linguistic resources used to train the systems are absolutely essential: audio recordings, dialogue examples, translations... These are scarcer in Basque than in more widely spoken languages, so, as some recent initiatives show, crowdsourcing has lately been used to create them. Behind this grand-sounding anglicism there is, after all, nothing more than the auzolan (communal neighborhood work) so deeply rooted among us, in this case digital auzolan.

Who has never used a virtual assistant or dialogue agent? Siri, Google Assistant, Cortana and others come installed by default on our mobile phones and computers, and although I, for one, have only used them to try them out, it is very common for younger generations to use them. Text dialogue systems, also known as chatbots, are increasingly common on websites, in apps and in messaging programs like WhatsApp. Machine translation has become an almost everyday resource, either to understand a text written in a language we do not master or, when we need to write a text in another language, to get a first draft that we can then correct. There are many services and websites for this, and automatic translators are integrated into apps and websites. Audio and video are also transcribed or subtitled automatically.

What do all these examples have in common? At least two things: first, they are all based on language and speech technologies; second, they either do not exist in Basque or, in general, work worse in Basque than in other languages.

One of the causes of the latter is, logically, economic. Because of their size, power and reach, far more human and financial resources are devoted to researching and developing these technologies for major languages than for Basque. But there is another reason: there is a big difference in the availability of recordings, translations, dialogue examples, etc. Hegemonic languages have far more such resources available than Basque.

At present, the most widely used methods for developing language and speech technologies, and the ones that give the best results, are based on examples. In particular, the approach now used is deep neural networks, which have been shown to achieve the best quality. These systems need many examples in order to learn from them and work reasonably well. A neural machine translation system needs many example translations to train on and work properly; a dialogue system, many example conversations; and a transcription system, many examples of transcribed audio. That is why the linguistic resources mentioned above are so necessary, and why systems for languages with fewer such resources work worse.

We Basque speakers, for our part, also want the tools and services that exist in other languages to be available in ours, and for that it is necessary to create linguistic resources. For this reason, several initiatives have recently been launched to create them through crowdsourcing. Crowdsourcing means drawing on the collaboration of many people to achieve something, and it has grown especially with the development of the Internet, which makes it easier for groups of people to communicate and coordinate. But behind this name there is, after all, nothing more than the auzolan we have been practicing for a long time, in this case digital auzolan (the term the Librezale association uses for the Common Voice initiative described below).

Common Voice Initiative in Basque

Common Voice is one of the latest projects to create resources for the Basque language. It did not originate in the Basque Country itself, but is an initiative launched by the Mozilla Foundation. The Mozilla Foundation, which is behind the free browser Firefox, aims to achieve an open and free web that is accessible to everyone, through the Firefox browser itself and through other devices and tools. To that end, it wants to create free speech recognition technology for as many languages as possible. Through the Common Voice project, people donate voice recordings for developing speech recognition systems. These recordings are freely available, so not only Mozilla but anyone can use them to develop speech recognition technology. Many people around the world are recording in many languages in the Common Voice project: about 2,000 hours have been recorded in 28 languages, and more languages are on the way.

Librezale aims to promote Basque in the world of ICT and gives priority to free software. In February it launched the initiative to make recordings in Basque within the Common Voice project. Librezale did the initial work (translating the website, compiling sentences to record...) and, once the project was under way, has worked on promoting the initiative, organizing recording marathons, etc., with the collaboration of different partners: the groups Argia, iAmetza, IXA and Aholkularitza of the UPV, Garabide, Elhuyar Fundazioa... A lot of work has been done, and it is paying off: four months after the project started, 508 users had recorded 83 hours, of which 45 had been validated. That is not bad, considering that languages that started at the same time or earlier had, for example, 32 hours in Spanish, 35 hours in Italian, 21 hours in Dutch... We are still far from the 1,200 hours we want to reach, but we are certainly on the right track. If you want to contribute to the initiative, go to https://voice.mozilla.org/eu and record sentences or validate existing recordings.
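To put those figures in perspective, the short Python sketch below works out what share of the recorded audio had been validated and how far the campaign was from its goal. The numbers are the ones quoted above; measuring progress against recorded hours rather than validated hours is an assumption made here for illustration, as are the variable and function names.

```python
# Figures quoted in the article, four months into the Basque campaign.
RECORDED_HOURS = 83
VALIDATED_HOURS = 45
CONTRIBUTORS = 508
TARGET_HOURS = 1200


def share(part: float, whole: float) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Share of the recorded audio that other volunteers had already validated.
validated_share = share(VALIDATED_HOURS, RECORDED_HOURS)

# Progress toward the goal (assumption: counted on recorded hours).
target_progress = share(RECORDED_HOURS, TARGET_HOURS)

# Average amount of audio contributed per person, in minutes.
minutes_per_contributor = round(60 * RECORDED_HOURS / CONTRIBUTORS, 1)

print(f"Validated so far: {validated_share}% of recorded audio")
print(f"Progress toward target: {target_progress}% of {TARGET_HOURS} hours")
print(f"Average contribution: {minutes_per_contributor} minutes per person")
```

In other words, a little over half of the audio had been validated, and each contributor had given roughly ten minutes on average, which suggests how much a modest individual effort adds up to when many people take part.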

The IXA Group's collection of conversations

The IXA group of the University of the Basque Country has also taken the path of digital auzolan, in order to develop a chatbot or dialogue system for Basque. Specifically, the aim is to develop a chatbot that answers the user's requests for information by searching the Internet, while keeping the conversation as natural as possible. The initiative is being carried out within a research project led by professors Eneko Agirre and Aitor Soroa, with the participation of researchers Jon Ander Campos and Arantxa Otegi and master's student Aitor Agirre. The project has also received one of the research awards that Google grants every year (Google Faculty Research Awards). It is based on conversations in English, but will also be used for development in other languages.

As mentioned, developing a system of this kind requires many examples of real conversations, and the group wanted to gather them with the help of Basque-speaking volunteers. To do this, they set up a website where users were paired up: one asked questions about a Wikipedia article and the other answered, in sessions of about 10 minutes. An example of such a conversation, based on the Wikipedia article on Korrika, would be:

- What is Korrika?

- Korrika is a march that runs through the Basque Country.

- How long is it?

- The route changes, but it is always around 2,300 kilometers.

- How long does it last?

- About two weeks.

- Without stopping?

- Yes, the march never stops, neither at night nor because of bad weather.

The examples were collected in June, with the aim of gathering 400 conversations, and 356 were received. Not bad at all! The intention is to release the conversations received so that anyone can use them in any other project.
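As a rough illustration of what one collected example might look like as data, the sketch below stores the Korrika conversation above as a list of question and answer turns tied to a Wikipedia article, and serializes it to JSON. The article does not describe the format actually used by the IXA group, so the class names, field names and role labels here are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Turn:
    role: str  # hypothetical labels: "asker" or "answerer"
    text: str


@dataclass
class Conversation:
    article: str  # Wikipedia article the two volunteers talked about
    turns: List[Turn] = field(default_factory=list)

    def add_exchange(self, question: str, answer: str) -> None:
        """Append one question from the asker and the reply it received."""
        self.turns.append(Turn("asker", question))
        self.turns.append(Turn("answerer", answer))


korrika = Conversation(article="Korrika")
korrika.add_exchange(
    "What is Korrika?",
    "Korrika is a march that runs through the Basque Country.",
)
korrika.add_exchange(
    "How long is it?",
    "The route changes, but it is always around 2,300 kilometers.",
)
korrika.add_exchange("How long does it last?", "About two weeks.")
korrika.add_exchange(
    "Without stopping?",
    "Yes, the march never stops, neither at night nor because of bad weather.",
)

# One JSON object per conversation is easy to share and to load elsewhere.
record = json.dumps(asdict(korrika), ensure_ascii=False)
print(record)
```

Keeping the article title next to the turns matters because the answerer's replies are grounded in that article; a system trained on such records can learn to answer from a source text rather than from the conversation alone.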

It is clear that initiatives like these are very interesting and necessary for the future. If we Basques manage to make the auzolan that is so much our own bear fruit in the digital world as well, we will surely get machines to speak Basque.