„Natural Language Processing? Hey, that’s what I do for a living!“ That’s what I thought when I heard about the live talk „A Glimpse at the Future of NLP“ (big thank you to Julia Böhm for pointing this out to me). As I am always curious about what happens in AI and language processing, I registered right away. And I was not disappointed.
In this talk, Marco Turchi, Head of the Machine Translation group at Fondazione Bruno Kessler, presented recent developments in automatic speech translation. And just to make this clear: This was not about machine interpretation, but about spoken language translation (SLT), where spoken language is translated into written language. This text can then be used, e.g., for subtitling. Theoretically, it could then also be passed through TTS (text-to-speech) in order to deliver spoken interpretation, although this is not the purpose of SLT.
The classic approach to SLT, which has been in use for decades, is cascading. It consists of two phases: First, the source speech is converted into written text by means of automatic speech recognition (ASR). This text is then passed through a machine translation (MT) system. The downside of this approach is that once the spoken language has been converted into written text, the MT system is ignorant of, e.g., the tone of voice, background sounds (i.e. context information), or the age or gender of the speaker.
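To make this information loss tangible, here is a minimal toy sketch of a cascade in Python. Everything in it is an illustrative assumption: `SpeechSegment`, `recognise_speech`, `translate_text` and the one-entry lookup table are invented stand-ins, not real ASR or MT components. The example uses English into Spanish, where "I am tired" becomes "estoy cansado" or "estoy cansada" depending on the speaker's gender.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechSegment:
    """Toy stand-in for audio: the words plus what the voice carries beyond them."""
    words: str
    tone: str
    speaker_gender: str


def recognise_speech(segment: SpeechSegment) -> str:
    """Phase 1 (ASR stand-in): only the words survive; tone and gender are dropped."""
    return segment.words.lower()


# Hypothetical one-entry "MT system"; it has to pick one gendered form.
TOY_MT_TABLE = {"i am tired": "estoy cansado"}


def translate_text(text: str) -> str:
    """Phase 2 (MT stand-in): sees nothing but the transcript."""
    return TOY_MT_TABLE.get(text, f"<untranslated: {text}>")


def cascade_translate(segment: SpeechSegment) -> str:
    """Cascade: speech -> transcript -> translation."""
    transcript = recognise_speech(segment)
    return translate_text(transcript)


female = SpeechSegment("I am tired", tone="neutral", speaker_gender="female")
male = SpeechSegment("I am tired", tone="neutral", speaker_gender="male")

# Both speakers get the same output, because the MT phase never sees the voice.
print(cascade_translate(female))
print(cascade_translate(male))
```

Both calls print the same masculine form. The second phase simply has no way of knowing who was speaking, which is exactly the weakness described above.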
Another, rather recent approach uses a single neural network to translate the input audio signal in one language directly into text in a different language, without first transcribing it, i.e. converting it into written text. This end-to-end SLT translates directly from the spoken source and thus has more contextual information available than a transcript provides. The source speech is neither „normalised“ while being converted into written text, nor divided into segments that are treated separately from each other. Despite being very new, end-to-end SLT has this year already reached parity in quality with the 30-year-old cascade approach. But it also has its peculiarities:
As the text is not segmented automatically (or naturally by punctuation, like in written text), the system must learn how to organise the text into meaningful units (similar to sentences, but not necessarily identical to them). I was intrigued to hear that efforts are being made to find the right „ear-voice-span“ or décalage, as we human interpreters call it. While a computer does not have this human problem of limited working memory, it still has to decide when to start producing its output, which is a trade-off between lag and performance. This was the point when I decided I wanted to know more about this whole SLT subject, so I had a video chat with Marco Turchi (thank you, Marco!) to ask him some questions that maybe only interpreters find interesting:
Question: Could an end-to-end NLP system learn from human interpreters what a good ear-voice-span is? Are there other strategies from conference interpreting that machine interpreting systems are taught in order to deal with difficult situations, like for example chunking, summarising, explaining/commenting, inferencing, changes of sentence order, or complete reshaping of longer passages (and guessing, haha)? But then I guess a machine won’t necessarily struggle with the same problems humans have, like excessive speed …
Marco Turchi: Human interpreting data could indeed be very helpful as a training base. But you need to bear in mind that neural systems can’t be taught rules. You don’t just tell them „wait until there is a meaningful chunk of information you can process before you start speaking“ like you do with students of conference interpreting. Neural networks, similar to human brains, learn by pattern recognition. This means that they need to be fed with human interpreting data so that they can listen to the work of enough human interpreters in order to „intuitively“ figure out what the right ear-voice-span is. These patterns, or strategies, are only implicit and difficult to interpret. So neural networks need to observe a huge amount of examples in order to recognise a pattern, much more than the human brain needs to learn the same thing.
Question: If human training data were used, could you give me an idea of whether and how the learning system would deal with all those human imperfections, like omissions, hesitations, and also mistakes?
Marco Turchi: Of course, human training data would include pauses, hesitations, and errors. But researchers are studying ways of weighing these „errors“ in a smart way, so it is a good way forward.
Question: And what happens if the machine is translating a conference on mechanical engineering and someone makes a side remark about yesterday’s football match?
Marco Turchi: Machine translation tends to be literal, not creative. It produces different options, and the problem is to select from them. To a certain extent, machines can be forced to comply with rules: They can be fed preferred terminology or names of persons, or they can be told that a speech is about a certain subject matter, let’s say car engines. Automatic domain adaptation, however, is a topic still being worked on. So it might be a challenge for a computer to recognise an unforeseen change of subject. Of course, a machine does not forget its knowledge about football just because it is translating a speech about mechanical engineering. However, it lacks the situational awareness of a human interpreter to distinguish between the purposes of different elements of a spoken contribution.
Question: One problem that was mentioned in your online talk: real-life, human training data is simply not available, mainly due to permission and confidentiality issues. How do you go about this problem at the moment?
Marco Turchi: The current approach is to create datasets automatically. For our MuST-C corpus, we have TED talks transcribed and translated by humans. These translations with their spoken source texts are then fed into our neural network for it to learn from. There are other such initiatives, like Facebook’s CoVoST or Europarl-ST.
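As a side note, the shape of such training data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TalkSegment`, the file names and the sentences are invented for this example and are not taken from MuST-C or any other corpus, and real corpora use their own file formats.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TalkSegment:
    """One aligned unit: audio, human transcript, human translation."""
    audio_path: str   # hypothetical file name
    transcript: str   # source-language text
    translation: str  # target-language text


# Invented toy segments, for illustration only.
SEGMENTS = [
    TalkSegment("talk_001_seg_01.wav", "Thank you very much.", "Vielen Dank."),
    TalkSegment("talk_001_seg_02.wav", "Let me start with a story.",
                "Ich möchte mit einer Geschichte beginnen."),
]


def end_to_end_pairs(segments: List[TalkSegment]) -> List[Tuple[str, str]]:
    """End-to-end SLT learns from audio paired directly with the translation."""
    return [(s.audio_path, s.translation) for s in segments]


def cascade_pairs(segments: List[TalkSegment]):
    """A cascade needs two data sets: audio/transcript and transcript/translation."""
    asr_pairs = [(s.audio_path, s.transcript) for s in segments]
    mt_pairs = [(s.transcript, s.translation) for s in segments]
    return asr_pairs, mt_pairs


for audio, target in end_to_end_pairs(SEGMENTS):
    print(audio, "->", target)
```

The same aligned triples can thus serve both approaches. The end-to-end system skips the transcript as an intermediate step, while a cascade trains its two phases on separate pairs.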
Question: So when will computers outperform humans? What’s the way forward?
Marco Turchi: Bringing machine interpreting to the same level as humans is not a goal that is practically relevant. It is just not realistic. Machine learning has its limitations. There is a steep learning curve at the beginning, which then flattens at a certain level with increasing specificity. Dialects or accents, for example, will always be difficult to learn for a neural network, as it is difficult to feed it with enough of such data for the system to recognise it as something „worth learning“ and not just noise, i.e. irrelevant deviations from normal speech.
The idea of all this research is always to help humans where computers are better. Computers, unlike humans, have no creativity, which is an essential element of interpreting. But they can be better at many other things. The most obvious are recognising numbers and named entities or finding a missing word more quickly. But there will certainly be more tasks computers can fulfill to support interpreters, which we are still to discover while the technology improves.
Thank you very much, Marco!
After all, I think I would rather be supported by a machine than the other way around. The other day, in the booth, I had to read out pre-translated questions and answers provided by the customer. It was only halfway through the first round of questions that my colleague and I realised that we were reading out machine translations that had not been post-edited. While some parts were definitely not recognisable as machine translations, others were complete nonsense content-wise (although they still sounded good). So what we did was a new kind of simultaneous on-the-fly post-editing … Well, at least we won’t get bored too soon!
Further reading and testing:
beta.matesub.com (generates subtitles)
http://voicedocs.com/transcriber (transcribes audio and video files)
https://elitr.eu/technologies (European live translator – a current project to provide a solution to transcribe audio input for hearing-impaired listeners in multiple languages)
About the author
Anja Rütten is a freelance conference interpreter for German (A), Spanish (B), English (C) and French (C) based in Düsseldorf, Germany. She has specialised in knowledge management since the mid-1990s.