CorpEus and Elebila technology for web searches in Basque
2007/11/26 Leturia Azkarate, Igor - Computer scientist and researcher, Elhuyar Hizkuntza eta Teknologia
CorpEus is a tool for consulting the Internet as a Basque corpus, and Elebila is a Basque-language search engine.
Although the two tools present different results and serve different purposes, both perform web searches in Basque and both use the same technology, developed in the R&D department of the Elhuyar Foundation.
Search problems in the Basque language
When we search for Basque content on the Internet with the usual search engines (Google, Yahoo!, Windows Live Search...), we run into two main problems. The first is that none of them lets us restrict the search to pages in Basque. So when we look for words that are spelled the same way in other languages, such as energia, anorexia or software, we are shown hardly any results in Basque. The same happens with many proper names, like Egypt, Newton or Guggenheim, and with many short words, such as the Basque words for donkey, cat or milk, because they are very likely to exist in other languages too, even if only as acronyms.
The second is that Basque is an inflected language in which words are declined, a characteristic that search engines do not take into account. When searching for a Basque word, its inflected forms should be searched for as well. Otherwise, a search for the word energia would miss, for example, a page saying that energy consumption has increased, because the word appears there in a declined form.
Using search engine APIs
Since the usual Internet search engines do not give good results for Basque, there are two options: develop a search engine entirely of our own, or use the APIs that other search engines offer. The first is very complex. On the one hand, there are the technical difficulties, which the major search engines are still researching and will probably have to keep researching constantly: ranking, personalization, web spam... On the other hand, there is all the hardware and infrastructure it demands: many computers doing the crawling, machines to host giant indexes, search servers...
Using APIs (interfaces or sets of functions that search engines offer so that others can build their own applications on top of them) is much cheaper and simpler. It does have some drawbacks: there is a dependency on the search engines, and there is no control over the ranking and other parameters. Even so, CorpEus and Elebila were built on APIs, because that approach seems to offer more advantages.
Only results in Basque
To obtain only results in Basque from the search engines, the words that appear most often in Basque are added to the word the user wants to search for. Pages in other languages will not normally contain these filter words, whereas most texts in Basque will.
Four filter words are added to the query sent to the API: eta, da, ez and ere. The first alone is not enough, since the name ETA appears many times in languages other than Basque. Nor are the first two together, because da is a word that means yes in several Slavic languages. The third and fourth are no better on their own, whether because they are so short, because they mean something in other languages or because they are acronyms for something. With all four words added, however, practically all the results are in Basque. Now and then a page that is not in Basque still slips through, and to filter it out the LangId language identifier developed by the IXA Group is used. It is applied to the text snippet that the search engine returns for display, and if it finds that a page is in a language other than Basque, both tools remove that result.
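The filtering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the filter word list, the result format (a dictionary with a `snippet` key) and the `is_basque` callable that stands in for LangId are all hypothetical, and nothing here reflects the actual interfaces of the search API or of the IXA Group's tool.

```python
# Assumed filter word list, used here only for illustration.
FILTER_WORDS = ["eta", "da", "ez", "ere"]


def build_filtered_query(user_query, filter_words=FILTER_WORDS):
    """Append the filter words so that every one of them must match."""
    return " ".join([user_query, *filter_words])


def keep_basque(results, is_basque):
    """Drop the results whose snippet the language identifier rejects.

    `is_basque` is a hypothetical stand-in for a language identifier:
    it takes the snippet text and returns True or False.
    """
    return [r for r in results if is_basque(r["snippet"])]


# Usage with a toy language check (not a real identifier).
query = build_filtered_query("energia")
sample = [{"snippet": "energia eta ura"}, {"snippet": "energy and water"}]
basque_only = keep_basque(sample, lambda text: "eta" in text.split())
```

Running the language check on the snippet rather than on the full page keeps the extra cost low, at the price of judging a page from a short piece of text.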
Basque has a rich morphology: the lemma of a word (for example, the word for equation) has many forms (the equation, the equations, and so on through the case endings). When searching for a word on the Internet, it is useful to find any form of that word. A search engine built specifically for Basque should therefore index not the exact forms of words but their lemmas. Internet search engines do not do this: they look only for the exact form that was typed in, so pages containing any other form of the same word are missed.
CorpEus and Elebila solve this by expanding the query through morphological generation. Morphological generation tools built by the IXA Group are used to obtain the forms of a lemma, and the API is then asked for pages containing any of those forms by means of an OR operator. In this way a lemmatized search is achieved.
Strictly speaking, the search is not completely lemma-based, because Basque words can take a great many inflected forms (technically, infinitely many) and the search engine APIs limit the number of words that can be sent to them. For this reason, the inflected forms are ordered by frequency of use and as many are sent as the API accepts, which covers most cases and achieves an almost fully lemmatized search.
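As a rough sketch of the query expansion described above, the following Python function takes the inflected forms of a lemma with their usage frequencies, keeps the most frequent ones up to a limit, and joins them with OR. The function name, the frequency figures and the `OR` syntax with parentheses are assumptions made for the example. In the real tools the forms come from the IXA Group's morphological generator, whose interface is not shown here.

```python
def expand_query(form_frequencies, max_terms):
    """Build an OR query from the most frequent inflected forms.

    form_frequencies maps each inflected form to its frequency of use;
    max_terms is the number of terms the search API is assumed to accept.
    """
    if max_terms < 1:
        raise ValueError("max_terms must be at least 1")
    # Most frequent forms first, so the cut-off drops the rarest ones.
    ranked = sorted(form_frequencies.items(), key=lambda item: item[1], reverse=True)
    chosen = [form for form, _ in ranked[:max_terms]]
    return "(" + " OR ".join(chosen) + ")"


# Toy data: the frequency values are made up for the example.
forms = {
    "energia": 900,
    "energiaren": 400,
    "energiak": 250,
    "energian": 120,
    "energiari": 60,
}
expanded = expand_query(forms, max_terms=3)
```

Ordering by frequency before truncating means that each term slot allowed by the API is spent on the form most likely to appear in real pages.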
Navigational search vs. informational search
Because four filter words are used to obtain only results in Basque, some Basque pages are occasionally left out of the results, since they lack one or more of those words. This can be a problem, especially in navigational searches.
What does that mean? Theorists in the field of Internet search distinguish two types of searches: navigational searches (when the user is looking for the address of a specific website, such as Euskaltube or Caja Laboral) and informational searches (when the user wants information about something, such as cancer or nuclear energy). CorpEus and Elebila are designed mainly to search for content in Basque; that is, they are designed for informational searches, which is where the usual search engines fail. Texts with good information are usually long enough to contain the filter words, so they do appear in this type of search.
Members of the Elhuyar Foundation R&D group: from the left, Antton Gurrutxaga, Nerea Areta, Xabier Saralegi and Igor Leturia. (Photo: R. R. Carton)
For navigational searches, however, Elebila will sometimes not work so well. The entry pages or home pages of websites, which are the ones we want to see in this type of search, often contain only a short, sparse text, and the filter words may not appear in it. But there is a solution. When Elebila fails on a navigational search we have two options: going to the advanced search and choosing a weaker filter (this reduces the number of filter words and increases the probability that the page we are looking for appears), or choosing to search in any language (in this case Elebila performs the same search a conventional search engine would; and for navigational searches on Basque pages the usual Internet search engines work quite well, since a ranking based on the number of linking pages is sufficient).
CorpEus is used mainly for informational searches. In some cases, however, the filter words may leave very few results. We then have the option "Try expanding the coverage", which performs the search with fewer filter words. This option can give good results if the word searched for exists only in Basque, but if it is spelled the same way as a word in another, larger language, the API will return many results that are not in Basque and nothing will be shown, because the LangId language identifier will remove them.
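One way to picture the weaker-filter option is a loop that retries the search with fewer filter words when too few results come back. The sketch below is a design assumption, not the behaviour of either tool, where the user chooses the weaker filter by hand. The `search` callable, the thresholds and the order in which the filter words are dropped are all hypothetical.

```python
def search_with_fallback(user_query, search, filter_words,
                         min_results=10, min_filters=2):
    """Retry with fewer filter words until enough results come back.

    `search` is a hypothetical callable that takes a query string and
    returns a list of results. Filter words are dropped from the end of
    the list, so the list is assumed to be ordered from strongest to
    weakest filter.
    """
    results, used = [], list(filter_words)
    for n in range(len(filter_words), min_filters - 1, -1):
        used = list(filter_words[:n])
        results = search(" ".join([user_query, *used]))
        if len(results) >= min_results:
            break
    return results, used


# Toy search function: the full four-word filter yields too few hits.
def fake_search(query):
    return ["hit"] * (3 if len(query.split()) == 5 else 12)


hits, words_used = search_with_fallback(
    "elhuyar", fake_search, ["eta", "da", "ez", "ere"]
)
```

Every relaxation step lets in more pages in other languages, so the results would still need to pass the language identification check afterwards.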
The Windows Live Search API
CorpEus and Elebila are based on Microsoft's Windows Live Search API. To make this choice, we analyzed the limits that the main search engines place on the use of their APIs. The Google API supports only 1,000 calls a day and no longer accepts new registrations, since Google is abandoning it in favour of the new AJAX Search API (which returns only 8 results). The Yahoo! API allows 10,000 calls per day for each IP. Microsoft's API is free, with its limit counted per IP and per application.
But CorpEus and Elebila are by no means tied to Windows Live Search forever. They can also use other APIs (Google, Google AJAX, Yahoo! and Alexa). We decided to offer the public service with Windows Live Search because of its conditions, but if those conditions change at any time, we can switch the tools to another API almost immediately.
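The ability to switch APIs quickly suggests a thin layer between the tools and the search engine. The following Python sketch shows one possible shape for such a layer. The class, the backend names and the result format are all hypothetical, and the toy backends only return fixed data where real ones would call the corresponding search API.

```python
from typing import Callable, Dict, List

# A backend takes a query and a result count and returns result dicts.
SearchBackend = Callable[[str, int], List[dict]]


class SearchService:
    """Routes queries to whichever registered backend is active."""

    def __init__(self, backends: Dict[str, SearchBackend], active: str):
        self._backends = dict(backends)
        self.use(active)

    def use(self, name: str) -> None:
        """Switch to another registered backend."""
        if name not in self._backends:
            raise KeyError(f"unknown backend: {name}")
        self._active = name

    def search(self, query: str, count: int = 10) -> List[dict]:
        return self._backends[self._active](query, count)


# Toy backends standing in for real API clients.
def live_backend(query, count):
    return [{"source": "live", "query": query}][:count]


def yahoo_backend(query, count):
    return [{"source": "yahoo", "query": query}][:count]


service = SearchService({"live": live_backend, "yahoo": yahoo_backend}, "live")
service.use("yahoo")  # the rest of the application is unchanged
```

Because query building and language filtering sit above this layer, changing providers only means registering a different backend.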