Fere Eus: Combining the Internet, corpora and Basque

2007/07/01 Kortabitarte Egiguren, Irati - Elhuyar Zientzia

The Internet is a great source of information; few would doubt it. Besides searching for information, people increasingly use it for linguistic queries and as a corpus. The Internet is thus gradually becoming a good source of linguistic resources and corpora. One example is Fere Eus, a tool that makes it possible to use the Internet as a giant Basque corpus.
This is what Fere Eus results look like: the example shows the results of a search for the word anorexia.

Today, every language needs corpora. Corpora are collections of texts in electronic format that have been linguistically annotated (that is, each word is labeled with its lemma, category, and so on), and they are used in linguistic research and in the development of language technologies. They are also very important resources for compiling dictionaries. Building a corpus is expensive and laborious, and it is difficult to keep one up to date. As a result, Basque corpora are scarce and small compared with those of other languages.

Through the Internet

But there is also the Internet, or the web: a huge collection of texts, available to everyone, containing far more text than any Basque corpus. It too is a corpus, although it is not linguistically annotated. It would be useful to be able to consult and exploit it as a corpus. That is what Fere Eus does.

Tools such as WebConc and WebCorp already exist on the web for this purpose, but these tools, like Internet search engines in general, have two problems with Basque. First, they can search only for one specific form, not for all the forms of a word or lemma at once (for example, we may want to find every inflected form of the word for land in a single search). Second, they do not restrict the results to Basque.

Search for the word banner in Fere Eus and in WebCorp. Fere Eus shows only results in Basque.

Fere Eus was created to overcome these limits. The tool, developed by the R&D group of the Elhuyar Foundation in collaboration with the IXA Group of the Faculty of Computer Science of the UPV/EHU, makes it possible to use the Internet as a Basque corpus. The Internet is a giant corpus, much larger than any existing Basque corpus. Moreover, it is constantly updated with new content, so even the most recent words can be looked up.

Fere Eus uses the APIs of Internet search engines (it can work with Google, Yahoo or Microsoft) to find out which pages a word appears on. An API (Application Programming Interface) is the set of functions a service offers so that it can be used from another program. The tool then shows every occurrence of the word on those pages in its context, together with the number of occurrences.
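Showing each occurrence of a word in its context is what concordance tools usually call a keyword-in-context (KWIC) listing. The following is only a minimal sketch of that idea, not the code of Fere Eus: the function name `kwic`, the `width` parameter and the sample sentence are assumptions made for illustration, and the text is assumed to have been downloaded already.

```python
import re

def kwic(text, forms, width=30):
    """Return (left, match, right) triples for every occurrence of any form.

    Hypothetical helper: 'text' stands for the plain text of a page that
    has already been fetched; 'forms' is the list of word forms to find.
    """
    # Match whole words only, ignoring case.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(f) for f in forms) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append((left, m.group(0), right))
    return hits

# Illustrative sample text, not taken from the article.
sample = "Etxea handia da. Nire etxe zaharra."
occurrences = kwic(sample, ["etxe", "etxea"], width=5)
for left, word, right in occurrences:
    print(f"{left:>5} [{word}] {right}")
print("occurrences:", len(occurrences))
```

The count printed at the end corresponds to the number of occurrences that the tool reports alongside the contexts.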

It can sort the results by various criteria and show the linguistic analysis of the results. It works with many document types (HTML, XML, RSS, RDF, TXT, DBF, DOC, RTF, PDF, PPT, PPS, XLS). In addition, the search addresses both problems with Basque: it searches by lemma and returns only pages in Basque, as explained by Igor Leturia, head of the Fere Eus project and researcher in the R&D group of the Elhuyar Foundation.

To obtain all the forms derived from a lemma, they use a tool developed by the IXA Group of the University of the Basque Country/Euskal Herriko Unibertsitatea. All the forms are then sent to the API joined with the OR operator. For example, if the user searches for the word house, the query sent to the search engine is: etxe OR etxea OR ... This solves the first problem. Of course, search engines do not accept an unlimited number of terms, so not every inflected form is sent, but enough are sent to obtain meaningful results.
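The OR expansion can be sketched roughly as follows. This is an illustration under assumptions, not the project's implementation: the IXA morphological generator is not reproduced, so the list of forms is written by hand, and the function name `build_or_query` and the `max_terms` limit are hypothetical.

```python
def build_or_query(forms, max_terms=10):
    """Join word forms with OR, keeping at most max_terms distinct forms.

    'max_terms' is an assumed stand-in for whatever limit a given search
    engine imposes on the number of terms in a query.
    """
    # Remove duplicates while keeping the original order of the forms.
    distinct = list(dict.fromkeys(forms))
    return " OR ".join(distinct[:max_terms])

# Hand-written sample of forms of "etxe" (house); a real system would get
# them from a morphological generator and would choose which ones to keep.
forms = ["etxe", "etxea", "etxeak", "etxean"]
print(build_or_query(forms, max_terms=3))
```

Because only the first `max_terms` forms are kept, the order of the input list decides which forms survive; putting the most frequent forms first is one plausible choice, though the article does not say how the forms are selected.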

The Internet is today a great source of information that, with the right search tools, can also be used as a giant corpus.
From file

Results in Basque language

As mentioned above, no search engine returns results only in Basque. This is a problem when the word we are looking for is written the same way in other languages. This happens with technical words such as anorexia, sulphurous and byte, with short words (cat and milk, for example) and with proper names (Fiji and Newton, among others). In fact, searches for technical words are very common and useful in Basque corpora, since Basque terminology is not yet sufficiently standardized.

To obtain only results in Basque, Fere Eus uses filters. The researchers of the R&D group of the Elhuyar Foundation add the most frequently used Basque words to the query as filters, all joined with AND. A corpus was used to determine which words are the most frequent.

Unfortunately, the most used words in Basque (eta 'and', da 'is', ez 'no', among others) are short, are also frequent in other languages and can sometimes be abbreviations or acronyms. Therefore, there are no magic words, that is, words that appear only in Basque texts and can serve as a perfect filter. Eta ('and') is the most used word in Basque, but ETA is also an acronym that appears frequently in the media in many languages. Another of the most used words is the verb form da ('is'), but in Russian da means yes.

Corpora are collections of texts in electronic format that have been linguistically annotated.
From file

So how many of these words should be used as a filter to find only pages in Basque? According to Igor Leturia, "the more words you use, the stricter the search will be and, therefore, the fewer results there will be that are not in Basque. However, it will also miss some results that are in Basque, because one or more of these words do not appear in them."
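The trade-off Leturia describes can be illustrated with a small sketch. Everything here is an assumption made for illustration: the filter words, the query syntax with parentheses and an explicit AND (real search engines differ), and the function names `build_filtered_query` and `passes_filter`.

```python
import re

# Assumed examples of very frequent Basque words; the actual list used by
# the project was derived from a corpus and is not given in the article.
FILTER_WORDS = ["eta", "da", "ez"]

def build_filtered_query(or_query, filter_words, n_filters):
    """AND the first n_filters filter words onto an OR query."""
    return " AND ".join([f"({or_query})"] + filter_words[:n_filters])

def passes_filter(text, filter_words):
    """Simulate the AND filter: every filter word must occur in the text."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return all(word in tokens for word in filter_words)

print(build_filtered_query("etxe OR etxea", FILTER_WORDS, 2))

# A short Basque sentence that contains "eta" and "da" but not "ez":
page = "Etxea handia da eta polita"
print(passes_filter(page, FILTER_WORDS[:2]))  # kept with two filter words
print(passes_filter(page, FILTER_WORDS))      # lost with three filter words
```

With two filter words the Basque sentence is kept; with three it is discarded even though it is in Basque, which is the loss of valid results that the quotation refers to.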

Some limits

Fere Eus complements the existing corpora. However, alongside its advantages it has some drawbacks. First, as mentioned above, because the Internet is not linguistically annotated, there will always be some ambiguity with forms that can belong to more than one lemma. A search for the word pelotari is one example, since the form can be either an inflected form of the word for ball or a word for a person who plays ball. Another drawback is that much of the text on the Internet has not been edited (especially blogs, forums, personal content and so on). This can be seen as an advantage, for example because it provides a model close to spoken language, but it is also a disadvantage, since the text may be of poorer quality and contain errors.

Furthermore, it is never possible to see everything there is, since search engines normally limit results to a thousand pages, so only the results from those pages can be shown. Finally, Fere Eus depends on the search engines: its results depend on the order in which the engines rank theirs, and it is also affected by any changes they make to their APIs and any limits they impose on them.

Members of the Elhuyar Foundation R&D group: from the left, Antton Gurrutxaga, Nerea Areta, Xabier Saralegi and Igor Leturia.
R. R. Carton Carton

In any case, Fere Eus is the first attempt to bring together the Internet, corpora and Basque. It will surely not be the last. Indeed, other languages also need ever larger corpora for language technologies, and the tendency to use the Internet for this purpose is growing markedly.

Web page of the Fere Eus project: http://www.corpeus.org

