OCR in Basque
2003/05/01 Martinez Iraola, Edurne Iturria: Elhuyar aldizkaria
Access routes, analysis and information collection are changing. At one time the best way to receive information was the printed book, but today, on the contrary, we demand alternatives such as the search, copy and movement of information, classification, modification and manipulation of it. All of them are options that did not give us the traditional texts known so far, but in today's digital society things are very different.
The use of OCR is widespread in the Basque market, although this is an important work of subsequent correction. In Euskal Herria we have many newspapers, magazines and publishers, and in most cases their documentary background is not saved in digital format. The dissemination of the Internet, however, has made it necessary that all these documentary funds are duly digitized and collected to organize faster cataloging and search systems.
OCR (Optical Character Recognition) is the computer knowledge of written or printed text characters. This means that when we use OCR software we scan each character as a photo, and then analyze that scanned image and return it to a normal character code (for example, ASCII).
The accuracy of the OCR system is limited by three factors: the quality of the original document, the quality of the image created by the scanner and the interpretation made by the OCR software. Here we will talk about the last.
What the OCR does, in short, is to convert the scanned image into text. To do this, it analyzes the different points that make up the image and distinguishes the gaps between them. This process is called segmentation and is done in three steps: the first lines are separated (online segmentation), the isolation of the words is done (word segmentation) and finally the characters are distinguished (character segmentation). This last phase is simpler if all characters are of the same width, and it gets very complicated if they touch each other, if they are mixed with other punctuation marks or if the width depends on the shape of the character.
To realize the character knowledge it is necessary that the OCR system knows all the characters of the language of the scanned text. If doubts arise with the characters, I would wait for the word to be completed, process in which it will be useful to have a dictionary of that language to be able to equate it. Thus, through a game of probabilities and evaluating whether it is a dictionary word, the system will select one or another character.
Apparently, the existence of an alphabet and a dictionary in that language would be sufficient to correctly apply the OCR, but in the case of Basque it is not so. In this case you cannot provide a complete list of possible words, that is, you cannot create a dictionary, since being a declined language, from each of the word roots too many forms of word are extracted. Language tools are going to be a great help in this step, that is, working the main characteristics of Basque we can achieve great improvements in the development of an OCR system. For example, groups of characters or words (ts, tz, tx, or stripes) that are made in Basque are less common in other European languages.
With most of the OCR software currently used, when we want to analyze a text in Basque, we must use the vocabulary of a language in Spanish. However, in these cases it is preferable not to use vocabulary than that of another language so as not to make more mistakes in the text. For example, if we are using an English dictionary, it will almost certainly replace most six-word appearances with the set. If you are using Spanish, the appearance of the word energy will replace it with the word energ (tilde).
The result of the project developed at ELEKA is that the most commonly used OCR software, the Omnipage program, has been added a correction in Basque along with the morphological information of Basque. This program, for the case of Basque, is prepared to take the step of turning the scanned image into a character. To date, however, it was not prepared for the later phase of verification and correction of words (although it is intended for the majority languages: English, German). The following intentions will consist of adding an OCR corrector such as Xuxen for the word processors Microsoft Word and OpenOffice, to make available to users who do not use Omnipage the OCR system in Basque.
Therefore, through the incorporation of language tools in Basque, the tool that best digitizes the texts in Basque has been developed. That is, ELEKA has developed a tool that understands and directs Basque automatically when it comes to digitizing texts. For the development of this project it has had the collaboration of the Basque Government's Deputy Ministry of Language Policy, which will be responsible for the distribution of this application.
Gai honi buruzko eduki gehiago
Elhuyarrek garatutako teknologia