The HiTZ Center creates a language model for Basque called Latxa
2024/02/01 Elhuyar Zientzia. Source: Elhuyar aldizkaria
The HiTZ Center has presented a large language model for Basque. Named Latxa, it is based on Meta's LLaMA models and comprises a family of models with between 7 and 70 billion parameters. Today's LLMs perform remarkably well in high-resource languages, as ChatGPT or Bard do for English. For Basque and other minority languages, however, their performance is much lower. Latxa has been developed to bridge this gap.
These are, in principle, three base models: pretrained, but not fine-tuned on user instructions or preferences. They are therefore not meant to be used directly by the general public, but they are fundamental building blocks for useful tools based on language technology for Basque. To develop them, the team used GPU servers and trained the latest models on CINECA's Leonardo supercomputer.
As for the training text, they used EusCrawl. This corpus, drawn from 33 websites with quality content, offers higher quality than corpora built with other web-based collection techniques. In total, it contains 1.72 million documents and 288 million words.
To evaluate the quality of the models, they measured their ability across a range of linguistic competencies: reading comprehension, common sense and reasoning, sentiment analysis, stance detection, topic classification, coreference, inference and word senses.
The Latxa models are released under the LLaMA-2 license, which permits both research and commercial use, and are available on Hugging Face.