Sing the bertso and we will analyze it
2013/09/01 Hulden, Mans - Researcher in the IXA group, UPV/EHU | Agirrezabal Zabaleta, Manex - Researcher in the IXA group, UPV/EHU | Arrieta Kortajarena, Bertol - Researcher in the IXA group, UPV/EHU | Astigarraga Pagoaga, Aitzol - Researcher in the IXA group, UPV/EHU Source: Elhuyar aldizkaria
At the IXA group of the Faculty of Computer Science of the UPV/EHU, where we combine language and computer science, we have in recent years also been working on bertsolarism. Recently, in collaboration with the Bertsozale Association, we presented a digital whiteboard (with rhyme and synonym search, metre checkers, etc.) to help with composing bertsos (soon also available for mobile). In the field of linguistic creativity, we are also working on the automatic generation of bertsos. Although we have taken the first steps, before going further we have tried to analyze bertsos in detail, since a thorough analysis can lead to better generation.
These studies are based on the corpus compiled and classified by the Xenpelar Documentation Center. The corpus we used covers the bertsos of the main championships held between 1986 and 2009. It consists of 6,887 bertsos grouped into 2,600 performances. As can be seen in Figure 1, more and more performances (and therefore bertsos) are stored in the database.
The analysis has been carried out at different levels, taking into account the main characteristics of the bertso: rhymes, metres, melodies, words, morphosyntactic categories and use of unified Basque.
To analyze which rhymes and rhyme words are most used, we considered only metres that rhyme on the even lines. Bertsos of this type account for 94% of the corpus, and extracting rhymes from more irregular metres would have added a complexity that was not worthwhile for this study.
As can be seen in the table in Figure 2, the most used rhymes are not always the same from championship to championship, although some tend to be used more than others (for example, the rhyme eBGD appears in the first position).
Taking the corpus as a whole (all the bertsos of the seven tournaments), we have also studied the most used rhymes and rhyme words. The data can be seen in Figure 3; the number to the left of each rhyme word indicates the share of uses of that rhyme in which that word was chosen (for example, in 13.27% of the cases in which the rhyme "ela" was used, the rhyme word shown was the one selected). Keep in mind that most of the bertsos in the corpus belong to the last two tournaments, so the data from those two tournaments carry more weight in these figures.
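The percentage shown next to each rhyme word can be read as a conditional share: of all uses of a given rhyme, how often a given word was chosen. A minimal sketch of that calculation follows, assuming the rhymes and rhyme words have already been extracted as pairs. The function name `rhyme_word_shares` and the toy pairs are hypothetical and are not taken from the corpus.

```python
from collections import Counter, defaultdict


def rhyme_word_shares(pairs):
    """Return {rhyme: [(rhyme_word, percent), ...]}, most frequent word first.

    `pairs` is an iterable of (rhyme, rhyme_word) tuples. How the pairs are
    extracted from the bertsos is outside the scope of this sketch.
    """
    counts = defaultdict(Counter)
    for rhyme, word in pairs:
        counts[rhyme][word] += 1

    shares = {}
    for rhyme, words in counts.items():
        total = sum(words.values())
        shares[rhyme] = [
            (word, 100.0 * n / total) for word, n in words.most_common()
        ]
    return shares


# Invented toy data, for illustration only (not corpus figures).
toy_pairs = [
    ("ela", "bezela"),
    ("ela", "bezela"),
    ("ela", "horrela"),
    ("ela", "kartzela"),
]

for word, percent in rhyme_word_shares(toy_pairs)["ela"]:
    print(f"{percent:6.2f}%  {word}")
```

Because each share is computed within a single rhyme, the percentages for one rhyme add up to 100 regardless of how common that rhyme is in the corpus.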
On the other hand, the three rhyme words repeated most often throughout the corpus, and therefore the most used, are the words meaning "eve", "without" and "looking".
Regarding metres, we have analyzed which are the most used in the kartzela ("prison") exercise, the only scored exercise in which the metre is chosen freely.
As can be seen in the graph in Figure 4, the trend towards long and special metres is increasing, as expected. It should also be noted that, according to the corpus data, nothing has been sung in the major zortziko since the 2001 championship, and that in 2009 the ten-line metre (hamarreko) accounted for only 3%. With this data, it seems that in the kartzela exercises of the future there will be no room for the major zortziko or the ten-line metre.
For the melodies, only the bertsos sung with a freely chosen melody have been taken into account, leaving out the melodies used in the point-response exercises.
Figure 5 shows the evolution, in percentage terms, of the use of ten frequent melodies. Noteworthy are the low use of the well-known melody "Triste bizi naiz eta" and the remarkable rise of the melodies "Haizea dator ifarralde" and "Baserrian jaio naiz". (Note: We have not considered the 1989 championship because almost a quarter of its bertsos in the corpus do not have the melody documented.)
Most used words
As for the words used in bertsos, the graph in Figure 6 shows what proportion of a bertso can be composed with a given number of lemmas. It shows that the 500 most used lemmas of the corpus are enough to cover 70% of a bertso, and the 1,000 most used lemmas cover 80%. To put it more plainly, a student of Basque who knew the 500 most used lemmas in this corpus would understand 70% of a bertso (leaving aside the obstacles posed by orality or the limits of syntactic intelligibility).
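The coverage figure described here is the fraction of all word occurrences in the corpus that belong to the N most frequent base forms. A minimal sketch of that measure follows, assuming the corpus has already been reduced to a flat list of base forms. The function name `coverage_of_top_lemmas` and the toy list are hypothetical.

```python
from collections import Counter


def coverage_of_top_lemmas(lemmas, n):
    """Fraction of all occurrences covered by the n most frequent items."""
    counts = Counter(lemmas)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(n))
    return covered / total


# Invented toy data, for illustration only (not corpus figures).
toy_lemmas = ["izan"] * 5 + ["eta"] * 3 + ["bertso"] + ["plaza"]

for n in (1, 2, 4):
    print(n, coverage_of_top_lemmas(toy_lemmas, n))
```

Plotting this value for increasing N gives a curve of the kind described for Figure 6: it rises steeply at first and then flattens, because a small number of very frequent items account for most occurrences.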
It should also be said that this championship corpus follows Zipf's law. From the point of view of language processing, Zipf's law states that, in any natural-language corpus, a word's frequency is roughly inversely proportional to its frequency rank: if the most frequent word appears X times, the next most frequent will appear about X/2 times, the next X/3 times, the next X/4 times, and so on.
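In its usual formulation, Zipf's law relates a word's frequency to its rank in the frequency list. Written out, with the symbols f, r and s introduced here for illustration (they do not appear in the original study):

```latex
% f(r): number of occurrences of the word with frequency rank r
% s: exponent, close to 1 for natural-language corpora
f(r) \approx \frac{f(1)}{r^{s}}, \qquad s \approx 1
% With X = f(1) and s = 1, the expected counts for ranks 1, 2, 3, 4 are
% X, X/2, X/3, X/4.
```

On a log-log plot of frequency against rank, a corpus that follows the law appears as an approximately straight line with slope close to -1.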
The morphosyntactic categories of words have also been analyzed to find out which are the most used and see if significant changes have occurred year after year.
As can be seen in Figure 7, nouns and verbs (counting main, auxiliary and synthetic verbs together) are by far the most used categories. We also find the evolution of adjective use noteworthy: it has dropped from championship to championship, although the difference is not very significant.
Use of unified Basque
Finally, to measure the use of unified Basque in the corpus of bertsos, we have analyzed the corpus with the IXA group's lemmatizer, tracking how the proportion of words known to the lemmatizer has evolved.
As can be seen in Figure 8, the proportion of words known to the lemmatizer has increased from championship to championship. It reached 89% in the 2005 championship, and although it drops slightly in 2009, it remains similar. The reasons why the IXA group's lemmatizer fails to recognize a word can be very diverse; our estimates indicate that the use of dialectal forms is the most common (80%). The rest are unknown (13%), carnivals (6%) or transcription errors (1%). According to these data, we cannot be sure that the increase in known words is due to a greater use of unified Basque (and not, for example, a lower use of Spanish), but our intuition, together with a sample we analyzed by hand, supports the impression that this is the trend.
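The measure behind Figure 8 is a simple proportion: for each championship, the share of word tokens that the analyzer recognizes. The sketch below illustrates it under stated assumptions. The predicate `is_known` stands in for the IXA group's tool, whose actual interface is not described in this article, and the toy lexicon and tokens are invented.

```python
def known_word_rate(tokens, is_known):
    """Fraction of tokens for which the predicate is_known returns True."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if is_known(token))
    return known / len(tokens)


def rate_by_championship(tokens_by_year, is_known):
    """Map each championship year to its known-word rate, in year order."""
    return {
        year: known_word_rate(tokens, is_known)
        for year, tokens in sorted(tokens_by_year.items())
    }


# Hypothetical stand-in for the analyzer: a tiny set of "known" words.
toy_lexicon = {"etxe", "mendi", "bertso"}
# Invented toy tokens, for illustration only (not corpus data).
toy_tokens = {
    2005: ["etxe", "mendi", "xxx", "bertso"],
    2009: ["etxe", "yyy"],
}

rates = rate_by_championship(toy_tokens, lambda word: word in toy_lexicon)
print(rates)
```

A rate computed this way says only how many words were recognized, not why the others were not, which is why the breakdown of causes had to be estimated separately.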
In our opinion, the figures from the last tournaments suggest two possible forecasts, although our data are not precise enough and it seems too early to draw conclusions: either the trend will reverse and bertsolaris will go back to using dialectal forms, or the use of unified Basque will stay around its apparent upper limit (90%). In any case, we believe the least likely outcome is that the use of unified Basque increases much further in an oral activity such as bertsolarism.
The statistical analysis of the bertsos of the last seven main tournaments has allowed us to show some trends. Although a calmer and more thorough analysis of this data will be worthwhile, this first one has already given us some significant results. In the choice of metre and in the use of unified Basque, for example, it has confirmed earlier intuitions: there is an increasing preference for special and long metres, and the use of unified Basque appears to have grown almost steadily. As for the melodies, there seems to be a tendency to use an increasingly narrow set of them, but the data show a peculiarity that keeps us from drawing conclusions.
Will these trends be maintained in this year's competition, or will they be reversed? And in the following ones? What other interesting interpretations can be drawn from the corpus of bertsos? What conclusions would come from analyzing non-competition bertsos? And from comparing one competition with another?
There is still much to be done in this field. We believe it is undeniably important to keep documenting bertsos properly so that their production can be analyzed thoroughly, if we want to see how the trends mentioned in this article, and others that deserve a more leisurely examination, evolve in the coming years.