Orai presents a new neural model for artificial intelligence in Basque

2024/09/12 Elhuyar Zientzia Source: Elhuyar aldizkaria

Image: Orai

Orai, Elhuyar's artificial intelligence research center, has developed its latest freely available neural model for artificial intelligence systems that need to understand and generate written Basque. Named Llama-eus-8B, it can be used to build chatbots, machine translators, grammar checkers, search engines, content creation systems, and more.

According to Orai's researchers, it is the most advanced model for Basque in the category of lightweight foundation models, those with fewer than 10 billion parameters. In addition, to facilitate the development of and research into Basque language technologies in both academia and industry, the model and the information on its development and evaluation have been made freely available to the public.

According to them, Llama-eus-8B was developed using Meta's most recent model, Llama 3.1-8B, as the base model (an open model with 8 billion parameters). This neural language model was built with machine learning algorithms from a very large collection of texts (15 trillion words), most of them in English. It is very effective in English (and in other major languages) at automating tasks that require linguistic skills: machine translation, automatic summarization, content generation, dialogue systems, and so on. Its performance in Basque, however, is very limited.

Since no collection of Basque texts of that giant size exists, and the computational requirements for training a similar model for Basque from scratch are very large, they decided to start from Llama 3.1-8B, which provides a solid base. The objective was to transfer to Basque the skills the model had acquired from millions of English texts, using machine learning algorithms and a collection of Basque texts.

For this purpose, they used ZelaiHandi, the corpus Orai compiled a few months ago, which is the largest high-quality, freely licensed corpus in Basque. To improve the transfer of skills between English and Basque, the ZelaiHandi texts were combined with English texts. In this way, the model has managed to retain its knowledge of English while improving its understanding of Basque, effectively reusing what it learned for English during the original training. The model was trained on the Hyperion system at the supercomputing center of the Donostia International Physics Center (DIPC).
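The mixing of Basque and English texts described above can be illustrated with a minimal sketch. The article does not say how the two sources were combined or in what proportion, so the function name `mix_corpora`, the `english_fraction` parameter, and its default value are all hypothetical, chosen only to show the general idea of keeping some English data in the training stream.

```python
import random
from typing import List, Sequence, Tuple


def mix_corpora(
    basque_docs: Sequence[str],
    english_docs: Sequence[str],
    english_fraction: float = 0.2,  # assumed value, not taken from the article
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Return a shuffled list of (language, text) pairs.

    Every Basque document is kept. English documents are sampled so that
    they make up roughly `english_fraction` of the mixed result, as long
    as enough English documents are available.
    """
    if not 0.0 <= english_fraction < 1.0:
        raise ValueError("english_fraction must be in [0, 1)")

    rng = random.Random(seed)

    # English documents needed so that en / (en + eu) is close to the target.
    n_english = round(len(basque_docs) * english_fraction / (1.0 - english_fraction))
    n_english = min(n_english, len(english_docs))

    sampled_english = rng.sample(list(english_docs), n_english)

    mixed = [("eu", doc) for doc in basque_docs]
    mixed += [("en", doc) for doc in sampled_english]
    rng.shuffle(mixed)  # interleave the two languages
    return mixed
```

Keeping a share of English text in the stream is one common way to reduce the risk that a model forgets its original language while it is being adapted to a new one. The actual recipe used by Orai may differ.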

The model was evaluated on a comprehensive benchmark suite of 11 tasks in Basque. These tasks measure both formal linguistic competence (correct use of grammar and vocabulary) and functional competence (the ability to understand and use language in real contexts): school exams, problem solving, quizzes on various topics, opinion analysis, and so on.

The evaluation results show that it delivers the best results among the lightweight foundation models (fewer than 10 billion parameters) available today for Basque, making it a valuable resource for developing artificial intelligence systems that require Basque language skills. On some tasks its results are competitive with those of much larger models. Even so, although the results are getting closer to those obtained for English, performance in Basque remains well below performance in English.
