
Listening instead of reading

2014/04/02 Leturia Azkarate, Igor - Computer scientist and researcher, Elhuyar Hizkuntza eta Teknologia. Source: Elhuyar magazine

Although speech is the oldest and most common form of communication among human beings, interaction with computers has traditionally been written or visual. Recently, however, voice communication with machines has been spreading, and machines are getting better at processing speech automatically. We are working on this here too, and the Elhuyar magazine and Zientzia.net can now be listened to instead of read.

If voice technologies were not used until recently, it is not because there was no need for them, but because the technology was not yet mature enough. The needs and possible applications have always been numerous.

The first of these uses that comes to mind is interaction with digital devices. Instead of giving commands to a computer, phone or tablet by typing on the keyboard or clicking with the mouse, it is often faster and more comfortable to give them by voice. And instead of reading the machine's response on the screen, it is often more comfortable to hear it. Examples of voice interaction are the Siri-style dialogue agents that are increasingly common on mobile devices (which we wrote about in January 2012).

Voice technologies can also help in communication between people. Combined with machine translation, they make it possible to build voice translators.

Another application is information management. Computers handle written information quickly and easily, which makes very useful tools such as search engines possible. Audio recordings are a different matter: machines cannot understand them, so they must first be transcribed. If machines could understand speech through voice technologies, they could turn the voice into text themselves, and audio files could then easily be indexed for search (for example, the BBC is cataloguing the radio audio from its entire history for search) or films could be subtitled automatically.
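As a rough illustration of why transcripts make audio searchable, the following Python sketch builds a tiny inverted index over segments that have already been transcribed. It is only a sketch under stated assumptions: the file names, timestamps and texts are invented for the example, and the speech recognition step that would produce them is assumed to happen elsewhere.

```python
from collections import defaultdict
import re

# Hypothetical transcripts: (audio file, start time in seconds, recognized text).
# In a real system these would come from a speech recognizer; here they are made up.
SEGMENTS = [
    ("news_1.mp3", 0.0, "The weather will be sunny tomorrow"),
    ("news_1.mp3", 12.5, "Scientists present a new voice synthesizer"),
    ("talk_7.mp3", 3.0, "Voice technologies are spreading quickly"),
]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(segments):
    """Map each word to the list of (file, start time) pairs where it is spoken."""
    index = defaultdict(list)
    for filename, start, text in segments:
        for word in set(tokenize(text)):
            index[word].append((filename, start))
    return index


def search(index, query):
    """Return the segments that contain every word of the query, sorted."""
    words = tokenize(query)
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits)


INDEX = build_index(SEGMENTS)
print(search(INDEX, "voice"))
```

Each hit points to a file and a start time, so a search result can take the listener directly to the moment in the recording where the words are spoken.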

Voice synthesis

Within voice technologies, one important area is speech recognition, but in this article we will focus on the technology that works in the opposite direction: voice synthesis, also known as TTS (Text To Speech). This technology generates spoken audio from a text, using synthetic or artificial voices, as naturally as possible. And that has largely been achieved, at least for neutral intonation. Curiously, the robots in old science fiction films were very intelligent machines that had no trouble understanding what they were told, but they spoke in a very artificial and robotic way (of course). In reality the opposite has happened: today machines can speak quite well, understand speech less well, and still have a long way to go before they are intelligent...

Researchers are also working on emotional voice synthesis, that is, on making the synthetic voice express emotions such as anger, joy, surprise or sorrow. In many cases neutral intonation is not enough, for example if films are to be dubbed automatically.

For the synthetic voice to sound natural, many recordings of a real person have to be made; the resulting speech has the same voice as that person and sounds like something a real person is saying. But this approach has a drawback: it does not work when many different voices are needed (for example, for dubbing the films mentioned above). For that reason there is also voice transformation technology, that is, technology that makes a synthetic voice built from one person's recordings sound as if it belonged to someone else. It is used, for example, to build voice synthesizers that sound like their own voice for people who have lost the ability to speak.

Basque voice synthesis for listening to the Elhuyar magazine and Zientzia.net!

We said earlier that voice technologies are quite advanced today and increasingly used. However, these technologies are language-dependent (perhaps with the exception of speaker detection) and are not equally developed for all languages. As usual, they are highly developed for a few languages (the usual ones: English, Spanish, German, Chinese...) and much less so for most of the others.

Although Basque is not at the level of those better-developed languages, it is fortunately not one of the languages travelling in the last carriage either. We have been working on voice technologies for Basque for years, and the reference and pioneer in this work is the Aholab research group of the UPV/EHU. The group has worked, and is still working, on all the technologies mentioned above.

Of course, the most advanced of the group's technologies for Basque is voice synthesis. They obtain a neutral synthetic voice of very good quality that can be used in applications. For this reason, in collaboration with the group, the Elhuyar Language and Technology unit has developed a technology for listening to web pages through voice synthesis instead of reading them.

After all, we no longer browse the web only from desktop computers. We go online more and more from our smartphones and tablets. And on those devices the conditions for reading web pages are not ideal: the screen is small (especially on phones), we are often on the move (on foot, on the train, on the bus...), and so on. On the other hand, we are very used to listening to content (music, podcasts...) with headphones on this kind of device. That is why we found it very interesting to develop this technology for listening to websites. Instead of reading the content on the computer or mobile device, the user can listen to it while doing something else.

For the moment, we have deployed this technology on the website of the Elhuyar magazine and on Zientzia.net. On any content page (an article, a report...) a bar appears with a typical “play” button. Clicking it starts the reading of the article. As we listen, the sentence being read is highlighted. There are also buttons for navigating within the reading (to go to the previous or next sentence, to the previous or next paragraph, or to any point we want). We can also change the voice (female or male), the volume and the speed. In addition, if we are on the page of an issue of the magazine, pressing the listen button plays all the articles of that issue one after another, which can be very interesting on a relatively long car journey, since instead of listening to the radio we can listen to the whole magazine. Finally, in interviews, a voice different from the chosen one is also used, so that questions and answers can be told apart. And all this with standard HTML5 technology (we wrote about HTML5 in February 2010).
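The navigation described above can be pictured with a small model of the reading position. The Python sketch below is not Elhuyar's implementation (which, according to the article, runs on HTML5 in the browser); the class name, method names and sample sentences are hypothetical, it assumes a non-empty article, and speech synthesis and audio playback are left out entirely.

```python
class ReadingCursor:
    """Tracks which sentence of an article is being read aloud.

    Hypothetical model: the article is a list of paragraphs, each of them a
    list of sentences. Synthesis and playback are out of scope here.
    """

    def __init__(self, paragraphs):
        # Flatten to (paragraph index, sentence text) pairs for easy stepping.
        self.items = [(p, s) for p, para in enumerate(paragraphs) for s in para]
        self.pos = 0

    def current(self):
        """Sentence to read (and highlight) right now."""
        return self.items[self.pos][1]

    def next_sentence(self):
        # Stay on the last sentence when there is nothing further to read.
        self.pos = min(self.pos + 1, len(self.items) - 1)
        return self.current()

    def previous_sentence(self):
        self.pos = max(self.pos - 1, 0)
        return self.current()

    def _jump_to_paragraph(self, paragraph):
        # Move to the first sentence of the given paragraph, if it exists;
        # otherwise the position is left unchanged.
        for i, (p, _) in enumerate(self.items):
            if p == paragraph:
                self.pos = i
                break
        return self.current()

    def next_paragraph(self):
        return self._jump_to_paragraph(self.items[self.pos][0] + 1)

    def previous_paragraph(self):
        return self._jump_to_paragraph(max(self.items[self.pos][0] - 1, 0))


def voice_for(turn_kind, chosen_voice, other_voice):
    """Pick the voice for an interview turn.

    Assumption: answers use the voice the listener chose and questions use
    the other one; the article only says that a different voice is used.
    """
    return other_voice if turn_kind == "question" else chosen_voice


article = [
    ["First sentence.", "Second sentence."],
    ["Third sentence.", "Fourth sentence."],
]
cursor = ReadingCursor(article)
print(cursor.current())
print(cursor.next_paragraph())
print(cursor.previous_sentence())
```

Keeping the position as a single index over a flattened sentence list makes sentence steps trivial, while paragraph jumps only need the paragraph number stored alongside each sentence.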

A good opportunity to get to know and enjoy voice technologies in Basque. Try it out and see for yourself!
