Talking to machines
Communicating with machines by voice alone is a reality today. On a smartphone, for example, it is possible to perform an internet search or send a message without touching a single button, using only the voice. To develop language technologies, several research centres and institutions, including the Elhuyar Foundation, launched the Berbatek research project. One of its main lines of research has been voice processing: how to use the human voice, our words, to communicate with machines. The first results of the project were announced a few days ago; let us take a look at them.
IÑAKI LETURIA, Elhuyar Foundation: Inma, your work here is to create technology that allows humans and machines to communicate and understand each other, right?
INMA HERNAEZ, Aholab Laboratory (UPV/EHU): Yes, so that they can talk to each other.
IÑAKI LETURIA, Elhuyar Foundation: And how do you do that?
INMA HERNAEZ, Aholab Laboratory (UPV/EHU): On the one hand, we generate voice from a text, using certain algorithms; and on the other hand, we take the voice as input, analyze it and generate the corresponding text.
The voice you are listening to right now is a machine talking; it is not a recording. The speech is created by a voice synthesizer, a technology developed by Aholab, a laboratory at the University of the Basque Country that develops voice technologies. Until now, humans and machines have communicated in writing, through buttons; but today we are beginning to talk to smartphones and tablets, which answer back and carry out what we ask of them.
IÑAKI LETURIA, Elhuyar Foundation: Let's play a little now with the voice technology that comes with the Android operating system. Let's see if we understand each other.
You can go straight to a website, open a particular map, program the GPS... but all of this can only be done in English, Spanish and other major languages. Aholab is working to make machines speak Basque too. It works above all on speech synthesis: TTS, Text To Speech. These are the steps.
INMA HERNAEZ, Aholab Laboratory (UPV/EHU): Yes, on the one hand you have linguistic modules, and on the other hand you need acoustic models; and then you add certain algorithms.
IÑAKI LETURIA, Elhuyar Foundation: We'll call it software, so that people understand.
INMA HERNAEZ, Aholab Laboratory (UPV/EHU): Yes, the software links the sounds together and extracts the voice from there, as well or as naturally as it can.
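The pipeline Hernaez describes, a linguistic front-end that turns text into phonemes plus an acoustic back-end that links stored sound units together, can be sketched very roughly in a few lines. Everything below (the lexicon, the waveform fragments, the function names) is invented for illustration; a real TTS system uses far richer linguistic modules and acoustic models.

```python
# Toy sketch of concatenative text-to-speech:
# linguistic module (text -> phonemes) + acoustic module (phonemes -> samples).

# Linguistic module: a hypothetical grapheme-to-phoneme lexicon.
LEXICON = {"kaixo": ["k", "a", "i", "x", "o"]}

# Acoustic module: each phoneme maps to a (fake) waveform fragment.
UNITS = {"k": [0.1, 0.2], "a": [0.3], "i": [0.4], "x": [0.5, 0.6], "o": [0.7]}

def synthesize(text):
    """Concatenative synthesis: look up phonemes, then join their units."""
    samples = []
    for word in text.lower().split():
        for phoneme in LEXICON[word]:
            samples.extend(UNITS[phoneme])
    return samples

print(synthesize("kaixo"))  # one fragment per phoneme, concatenated in order
```

The naturalness challenge mentioned next comes largely from this joining step: real systems must smooth the seams between units rather than simply concatenating them.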
IÑAKI LETURIA, Elhuyar Foundation: Naturalness, is that perhaps the challenge?
INMA HERNAEZ, Aholab Laboratory (UPV/EHU): Many years ago, at the beginning, the challenge was intelligibility. Then the systems became quite intelligible, and the challenge was naturalness. And today, beyond naturalness, the challenge is putting emotion into the synthetic voice.
The basic elements, the synthesis units or phonemes that make up the speech, are obtained from recordings like this one, in audio or video. Here the speaker's face was also recorded, to capture the gestures corresponding to each emotion.
One of the branches of the Berbatek project is this demo of automatic dubbing. The researcher Igor Leturia of the Elhuyar Foundation took part in its development.
IGOR LETURIA, Elhuyar Foundation: This is a demo we've made as part of the Berbatek project to show how language technologies can help the translation sector, in this case with dubbing.
IGOR LETURIA, Elhuyar Foundation: This is the original version we started from, in Spanish. The starting point is a video in Spanish: the video and its transcription. From there, a subtitle file is created automatically, that is, it marks when each sentence starts and ends. Then this Spanish subtitle file is automatically translated into Basque, and finally audio, a voice in Basque, is automatically generated from it.
IGOR LETURIA, Elhuyar Foundation: The technology for extracting subtitles from transcription is from Vicomtech, machine translation from IXA and voice generation or synthesis from Aholab.
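The three-stage chain Leturia describes (transcript to timed subtitles, subtitles to machine translation, translation to synthetic voice) can be sketched as below. The toy functions stand in for the Vicomtech, IXA and Aholab components; their internal logic, the sample dictionary and the timings are all invented for illustration.

```python
# Sketch of the automatic-dubbing pipeline: align -> translate -> synthesize.

def align_subtitles(sentences, seconds_per_word=0.5):
    """Assign start/end times to each sentence from its word count."""
    subtitles, t = [], 0.0
    for sentence in sentences:
        end = t + seconds_per_word * len(sentence.split())
        subtitles.append((t, end, sentence))
        t = end
    return subtitles

def translate(sentence, dictionary):
    """Word-by-word toy translation (real MT is far more sophisticated)."""
    return " ".join(dictionary.get(word, word) for word in sentence.split())

def synthesize(sentence):
    """Stand-in for TTS: returns a label instead of an actual waveform."""
    return f"<audio:{sentence}>"

ES_EU = {"hola": "kaixo", "adios": "agur"}  # invented two-word dictionary
subs = align_subtitles(["hola", "adios"])
dubbed = [(start, end, synthesize(translate(text, ES_EU)))
          for start, end, text in subs]
print(dubbed)
```

The point of the sketch is that each stage consumes the previous stage's output unchanged, which is why errors in the transcription or translation propagate directly into the dubbed voice.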
IGOR LETURIA, Elhuyar Foundation: Here we have selected audio and subtitles. What we are listening to and reading now are those subtitles automatically translated into Basque, and the voice is the one generated automatically with Aholab's synthesis technology.
IÑAKI LETURIA, Elhuyar Foundation: I'd call it magic; I don't know if it is.
IGOR LETURIA, Elhuyar Foundation: Yeah, it's a little magical.
It is just an example, a demo of what can be done with these technologies; but imagine: it could be the first step towards listening to a film made in Russian directly in Basque. Speech synthesis is advancing at full speed.
IÑAKI LETURIA, Elhuyar Foundation: Hey, Inma, let's turn my voice into a girl's voice... or a grandmother's voice, for example; we've given it a hard task.
Advanced speech synthesis can have a wide variety of uses: for example, giving voice to cartoon characters. Someone who is going to lose their voice forever because of an operation or illness can make recordings beforehand and then use that audio so that the voice synthesizer produces exactly the same voice as before. Ways of making television cheaply are also being explored: the news would be read by a machine, as the weather forecast could be.
In this other Berbatek demo, speech recognition comes into play as well as speech synthesis. The aim here was to demonstrate the potential of language and voice technologies in teaching.
Speech recognition follows more or less the opposite path to synthesis. It is first supplied with its raw material for converting speech into text: the acoustic units.
INMA HERNAEZ, Aholab Laboratory (UPV/EHU): 43, forty-three. You've got it.
And then you have to tell the machine how those elements are organized, that is, give it linguistic models.
IÑAKI LETURIA, Elhuyar Foundation: You tell it what we are going to talk about, more or less: "I'm only going to say numbers to you," you warn the machine.
INMA HERNAEZ, Aholab Laboratory (UPV/EHU): You set the lexicon, yes, you limit the domain. Then you can make it a little more complicated and include grammar, for example, but a limited grammar, so that it can handle questions like "Where was Newton born?"
If you speak to it using vocabulary or language it has not been loaded with, the system will not be able to understand you.
IÑAKI LETURIA, Elhuyar Foundation: What if I spoke to it as I would to a person, in ordinary running speech? "Shall we go out for dinner?" Would it understand?
INMA HERNAEZ, Aholab Laboratory (UPV/EHU): Yes, it would understand, if it had language models. Language models are linguistic, statistical models for calculating the probabilities of word sequences, and they are built from large amounts of text. The statistical relationships between words and the probabilities of word sequences are calculated, and with all this information the language models are constructed.
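The statistical models Hernaez describes can be illustrated with the simplest case, a bigram model: estimate the probability of each word given the previous one by counting word pairs in a corpus. The tiny corpus below is invented for illustration; real systems train on millions of words.

```python
# Minimal bigram language model: P(word | previous word) from corpus counts.
from collections import Counter

corpus = "we go to dinner . we go to work . we stay at home .".split()

unigrams = Counter(corpus)                    # how often each word occurs
bigrams = Counter(zip(corpus, corpus[1:]))    # how often each word pair occurs

def bigram_prob(prev, word):
    """Estimated probability that `word` follows `prev` in this corpus."""
    return bigrams[(prev, word)] / unigrams[prev]

# "go" follows "we" in 2 of the 3 occurrences of "we" above.
print(bigram_prob("we", "go"))
```

A recognizer uses such probabilities to prefer likely word sequences among the candidates its acoustic models propose, which is why the amount of available text, mentioned next, matters so much.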
The amount of text available, on the Internet for example, to carry out this task determines, among other things, a language's level of development in this field. Here too, the gap between large and small languages is obvious.
IGOR LETURIA, Elhuyar Foundation: If a language is not prepared, people will have to keep doing these things manually, and the language will fall behind technologically. That's why it is so important to develop these kinds of resources.
We are only just starting to use voice technologies. How far can we go? For example, how do you know whether the voice that narrated this report belongs to a human being or a machine?