Interview: Nicholas Ruiz, PhD. Speech translation and Natural Language Processing researcher and Speech Translation Advisor for Waverly Labs.
I give insights into the current trends in speech recognition and machine translation and recommend strategies on how to best translate from one language to another.
In my undergrad, around 2001, I was taking courses in computer science and foreign languages at the same time. It made me start thinking, “If people can be taught rules to learn a foreign language, can a computer be taught to do the same?” And if so, how can automatic translation help bridge the communication barriers we face when speaking with people of different languages and cultures?
As I matured in the field of machine translation during my master’s and doctorate studies in Europe, I challenged myself to dive into the problem of speech translation, not just in the lab, but also in my everyday life. I attended speeches and services in Italian and observed how unofficial interpreters would work hard to translate the speaker’s words into English to help groups of international students follow along. Half the time, I wasn’t listening to the speaker; instead, my mind would get lost brainstorming about how speech translation could help others understand and participate in everyday conversations before they learned enough of the language to communicate without help.
We’re moving into a time when speech recognition and machine translation can cover many conversation scenarios where an interpreter used to be necessary. In many languages, speech recognition systems are able to recognize over 90% of the words people say, and language pairs like English to Spanish are reaching record highs in accuracy. Although machine translations may occasionally sound funny, the technology has developed to the point where two conversation partners can understand reasonably well what each one is saying. Machine translation technology can’t quite replace human translators in high-risk scenarios where precise translations are critical, but it covers a lot of the need where a professional translation or human interpretation isn’t the preferred choice. Also, research has shown that today’s machine translation can help professional translators work faster, which has opened up new, and perhaps unexpected, possibilities in the industry where professional translators and machine translation technology work together.
Speech translation is made up of three parts: automatic speech recognition (or what some people call “voice recognition”), machine translation, and speech synthesis. These are usually done in three separate steps. Automatic speech recognition takes sound from the microphone and transcribes it into words. Those words are then translated into another language, using either statistical machine translation or the newly popular neural machine translation techniques. Finally, the speech synthesizer converts the translated words into sounds that mimic the way native speakers would speak.
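To make the three-step flow concrete, here is a minimal Python sketch of such a cascade. The function names (`recognize`, `translate`, `synthesize`, `speech_to_speech`) and their toy bodies are hypothetical placeholders for illustration only; a real system would call trained speech recognition, translation, and synthesis models at each stage.

```python
# Hypothetical stand-ins for the three stages of speech translation.
# Nothing here is a real recognizer, translator, or synthesizer.

def recognize(audio: bytes) -> str:
    """Toy 'speech recognizer': pretend the audio bytes are UTF-8 text."""
    return audio.decode("utf-8")


def translate(text: str) -> str:
    """Toy 'translator': look the whole sentence up in a tiny table."""
    table = {"my blue car": "mi coche azul"}
    # Unknown sentences are passed through unchanged in this sketch.
    return table.get(text.lower(), text)


def synthesize(text: str) -> bytes:
    """Toy 'speech synthesizer': return bytes in place of a waveform."""
    return text.encode("utf-8")


def speech_to_speech(audio: bytes) -> bytes:
    """Run the three stages one after another."""
    words = recognize(audio)        # step 1: sound -> words
    translated = translate(words)   # step 2: words -> translated words
    return synthesize(translated)   # step 3: translated words -> sound


if __name__ == "__main__":
    print(speech_to_speech(b"my blue car").decode("utf-8"))
```

Because each stage only sees the output of the previous one, a mistake made in the first step is carried through to the later steps in this kind of design.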
In a nutshell, statistical machine translation tries to learn patterns for how phrases or groups of words are translated. Translation rules are automatically learned from lots of sentences that have been translated into another language. For example, a rule could be “my blue car” => “mi coche azul”, or “blue car” => “coche azul”. Each rule gets several scores that predict how likely the translation is to be used. The translation system tries to combine multiple rules to produce a translation in a target language by arranging (or “reordering”) the groups of words to maximize how fluent the translation sounds. These rules can be similar to phrasebooks that people use when visiting another country, but a typical translation system has hundreds of millions of translation rules that are learned automatically.
Neural machine translation is a bit more of a black box. Most of these translation systems use an “encoder-decoder” model. If we consider English to Spanish translation, the “encoder” converts each of the English words into a sequence of numeric vectors and the “decoder” generates one Spanish word after another by picking information from each vector. An “attention model” weights each vector to decide which encoded parts of the English sentence are useful to produce the next translated word. Unlike in statistical machine translation, it is hard to understand how a neural machine translation system makes translation decisions; however, in many cases, neural machine translation produces more fluent translations.
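The weighting step of the attention model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-dimensional vectors are made up, and scoring each encoded word by a dot product with the decoder's current state is one common choice, not necessarily what any particular system uses. The names `softmax` and `attend` are hypothetical helpers.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(encoder_vectors, decoder_state):
    """Weight each encoded source word and blend them into one vector.

    Assumption: relevance is scored with a plain dot product between
    each encoder vector and the decoder state.
    """
    scores = [
        sum(e * d for e, d in zip(vec, decoder_state))
        for vec in encoder_vectors
    ]
    weights = softmax(scores)
    dim = len(encoder_vectors[0])
    # The context vector is the weighted sum of the encoder vectors.
    context = [
        sum(w * vec[i] for w, vec in zip(weights, encoder_vectors))
        for i in range(dim)
    ]
    return weights, context


# Made-up encodings for the English words "my", "blue", "car".
encoded = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# Made-up decoder state while producing the next Spanish word.
state = [0.0, 2.0]

weights, context = attend(encoded, state)
for word, weight in zip(["my", "blue", "car"], weights):
    print(f"{word}: {weight:.3f}")
```

In this toy setup the second vector lines up best with the decoder state, so “blue” receives the largest weight. In a trained system both the vectors and the scoring are learned from data, which is part of why the decisions are hard to inspect.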
As I mentioned earlier, the first step of speech translation is speech recognition. One of the challenges of automatic speech recognition is getting a high-quality recording with as little noise in the audio as possible. Noisy audio confuses speech recognition systems. If the speech recognizer is unable to accurately recognize the words you say, then the translation will most likely come out as nonsense. While there are far-field recognition devices that allow you to speak from across the room, the distance between the microphone(s) and the speaker allows other noises to interfere with the signal, making speech recognition more difficult. But as the microphone gets closer to the speaker, the recorded audio has higher quality and less noise. Bluetooth headsets were originally created to allow people to talk on the phone without wires, while maintaining high quality. As a step above most Bluetooth headsets, Pilot uses ambient noise cancellation and has a microphone array configuration that is set up to maximize audio quality. These are tuned to the speech translation problem to ensure higher speech recognition quality, which helps the machine translator do a better job.
The goal of Pilot is to provide a natural, hands-free conversation experience, backed by speech translation technology, to minimize the frustrations of cross-lingual communication. By sharing an earpiece with a friend, you can engage in multilingual conversation using only one translation kit. We specifically designed Pilot as a translating earpiece, not only to increase speech recognition accuracy through the microphone position, but also to keep human communication fluid and natural.