Size matters: large text collections are necessary in language processing
2009/11/01 Leturia Azkarate, Igor - Computer scientist and researcher, Elhuyar Hizkuntza eta Teknologia. Source: Elhuyar aldizkaria
Language processing is almost as old as computers themselves. The first programmable electronic machines, built in the 1940s as a result of World War II, were mainly used to decipher messages and break codes; after the war, however, a great deal of work turned to language processing, especially machine translation.
In those early days, researchers, mostly mathematicians, used very simple techniques influenced by the habits of cryptography: essentially, they tried to achieve machine translation through dictionaries and changes in word order. But they soon realized that languages were more than that, and that more complex language models were needed. Linguists therefore joined the teams and applied the theories of Saussure and Chomsky. Since then, and for decades, one approach has predominated in all areas of language processing (morphology, spelling correction, syntax, word-sense disambiguation...): adapting knowledge based on linguists' intuition into simple structures that computers can handle (rules, trees, graphs, formal grammars...).
But these methods have their limitations. On the one hand, even the best linguists cannot account for all the cases a language presents; on the other, languages are too complex and rich to be captured by simple structures. These limitations are even greater with conversational language. However, there was no alternative: given the capacity of the machines of the time, this was the only way to work with language. And with these techniques, progress was relatively slow for many years.
The arrival of corpora and statistics
In the last two decades, however, a more empirical approach has come to dominate language processing, based on the exploitation of large text collections and statistical methods. Rather than relying on intuition-based knowledge, large samples of real language, i.e. corpora, are used so that as many cases of the language as possible are taken into account. Statistical and machine-learning methods are applied to them, with little linguistic technique involved. Even when language is modeled through computable structures, the models are extracted automatically from the corpora. With statistical methods, therefore, for a machine to handle language, it must have access to a huge collection of texts and the resources to process it.
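As a rough illustration (not taken from the article) of what "extracting a model automatically from a corpus" can mean in practice, here is a minimal sketch of a bigram language model built purely from word counts in a text collection. The toy corpus and the add-one smoothing choice are assumptions made only for the example; a real system would be trained on millions of sentences.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count unigrams and bigrams from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        tokens = ["<s>"] + tokens + ["</s>"]  # sentence boundary markers
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_probability(unigrams, bigrams, prev, word):
    """P(word | prev) with add-one smoothing over the observed vocabulary."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

# Toy corpus: every probability below comes from these counts alone.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
unigrams, bigrams = train_bigram_model(corpus)
print(bigram_probability(unigrams, bigrams, "the", "cat"))
```

The point of the sketch is that no linguistic rule is written by hand: the "model" is nothing more than frequencies observed in the corpus, so the larger the corpus, the more of the language's cases the counts cover.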
This methodological shift is mainly due to two factors. On the one hand, today's computers, unlike earlier ones, can handle huge amounts of data. On the other, more text is available in electronic format than ever before, especially since the advent of the Internet.
Thus, corpora and statistical techniques are used in spell checkers (searching the corpus for contexts similar to that of the incorrect word), in machine translation (using translation memories or texts from multilingual websites to statistically derive translations of words, phrases or segments as long as possible), in word-sense disambiguation, in automatic terminology extraction, etc. In general, it can be said that the larger the corpus, the better the results the systems obtain. For example, Google's Franz Josef Och presented his statistical machine translation system, trained on a corpus of 200 billion words, at the 2005 ACL (Association for Computational Linguistics) conference. Since then, his system has been the main reference in machine translation and the one that wins all the competitions. Something similar happens in other areas.
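As a loose illustration of corpus-driven spelling correction (a simplified frequency-based variant, not the article's own context-matching method), the sketch below proposes, for a misspelled word, the corpus word within one edit that is seen most often. The sample corpus and the single-edit limit are assumptions made for the example.

```python
import re
from collections import Counter

def word_counts(text):
    """Word frequencies observed in a raw text corpus."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts):
    """Pick the candidate correction seen most often in the corpus."""
    candidates = [w for w in edits1(word) if w in counts] or [word]
    return max(candidates, key=counts.get)

corpus_counts = word_counts("the quick brown fox jumps over the lazy dog the fox")
print(correct("foxx", corpus_counts))  # -> "fox"
```

Here, too, the corpus does all the work: the bigger and more representative the text collection, the more likely the most frequent candidate is the right correction.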
The future: hybridization
However, this methodology also has its limitations. For some languages and tasks, truly gigantic corpora are already in use, and it can be said that a ceiling has been reached, since it is very difficult to improve the results much further. For other languages and areas, no such large corpora exist, and purely statistical methods cannot achieve such good results.
Therefore, the recent trend for improving statistical methods is to combine them with linguistic techniques and create hybrid methods. That will be the way forward in language processing. If we want machines to understand and handle language in the near future, and if we want machines to speak, mathematicians, computer scientists and linguists will have to work hand in hand.