Fast but awkward vs. well thought-out but slow – Comparing machine translation to human translation | guest article by Luca Etzold

AI, and with it machine translation (MT), has been gaining more and more traction in the language industry. However, human translators have not gone extinct just yet, and there are good reasons for that, some of which I explored as part of my Master’s thesis. In it, I translated a podcast interview from English into German and then tasked DeepL’s (then relatively new) LLM version with the same job. Let’s see how it went:

Act I: The source material

The fictitious client in my thesis asked for the translation of an interview titled “Russian Creativity and Risk-Taking”. In said interview, the researchers Peter Roberts and Katarzyna Zysk discussed Russia’s military doctrine and the Western perspective on it. The interview had taken place in 2021 as part of the podcast Western Way of War, and the fictitious client wanted to publish a German version of the interview on the website of the German Institute for Defence and Strategic Studies (GIDS). Simple enough. However, the original interview was an oral conversation and the source text for the translation was an unedited transcript of the interview. Therefore, the source text was full of typical characteristics of spontaneous oral communication: redundancies, overly long sentence structures and ambiguity. Here’s an example from one of Zysk’s statements:

“So, I think in my mind, the Western way of war, when I think about it, it seems to me it is, and it’s going to be increasingly influenced, and challenged by the counter-responses and possible novel solutions developed by competitors such as Russia.”

Now, the target text was supposed to be a written text, so the target recipients would be reading it instead of listening to it. In oral communication, the characteristics described above are generally compensated for by the speaker’s intonation. In written communication they are not, which results in a less-than-optimal experience for the reader. The same applies to the ambiguity contained in the source text. Long story short, the translation needed to compensate for the orality of the source text to make it a comprehensible and enjoyable read for the target recipients.

Act II: Human at work

For my translation, I carried out an in-depth analysis of the source text using Christiane Nord’s model for translation-oriented text analysis. That way, I gained a better understanding of the source text’s message and function and its relation to the target text. I used this as a basis to produce a target text that was as well-researched and as tailored to the target readers’ needs as possible. Due to the discrepancy between the source and the target medium (oral vs. written), I had to step away from the source text’s wording quite a bit to improve the target text’s readability and make it easier to understand. In many instances, this meant shortening sentences, clarifying references between sentences and turning elliptical structures into independent clauses. Here’s an example of a source sentence undergoing heavy editing as part of the translation process:

Source text: “But also the defence innovation capability, which has been quite fascinating to observe, how they’ve managed to do it in a relatively short period of time.”

Target text: „Das gilt aber auch für Innovation im Bereich Verteidigung. Wie schnell Russland das geschafft hat, ist wirklich faszinierend.“

As you can imagine, this rather old-school approach to a translation (extensive analysis, lots of research, no use of MT) takes a lot of time. Guess what’s quite a bit faster? MT. Provided it is good and produces usable translations. Let’s see.

Act III: DeepL ex machina?

Probably everyone in the language industry knows DeepL. Still, here’s the rundown:

DeepL is a neural machine translation system developed by the Cologne-based company of the same name and first introduced in 2017. Since then, it has become one of the most-used MT systems on the market. At the time I wrote my Master’s thesis (2025), DeepL had recently introduced a new LLM (large language model) version that supposedly produces even better, more natural results than the company’s previous products. This is the system I chose for my thesis.

For the evaluation of DeepL’s translation, I chose the Multidimensional Quality Metrics (MQM) by Lommel et al. This framework lets the evaluator categorize issues found in a translation by issue dimension and subordinate issue type, and assign each issue a severity level. The levels range from “None” (number of issues multiplied by 0) for preferential or inconsequential issues to “Critical” (number of issues multiplied by 3) for issues that interfere with the intended function of the target text and/or render it incomprehensible or unusable.
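
The weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used in the thesis. The text only states the multipliers for “None” (0) and “Critical” (3), so the values 1 for “Minor” and 2 for “Major” are an assumption, chosen because they reproduce the Weighted column of the evaluation table. The names `SEVERITY_WEIGHTS` and `weighted_score` are hypothetical.

```python
# Severity multipliers. 0 and 3 are stated in the text; 1 and 2 are assumed
# (they are consistent with the Weighted column of the evaluation table).
SEVERITY_WEIGHTS = {"None": 0, "Minor": 1, "Major": 2, "Critical": 3}


def weighted_score(counts):
    """Sum the issue counts, each multiplied by its severity weight."""
    return sum(SEVERITY_WEIGHTS[severity] * n for severity, n in counts.items())


# Accuracy row of the evaluation: 2 None, 16 Minor, 10 Major, 3 Critical
accuracy = {"None": 2, "Minor": 16, "Major": 10, "Critical": 3}
print(weighted_score(accuracy))  # 45
```

Under these assumed weights, an issue rated “None” is still counted in the Absolute column but adds nothing to the weighted score.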

Here’s my evaluation of DeepL’s translation of “Russian Creativity and Risk-Taking” from English to German:

| Issue dimension | Absolute | None | Minor | Major | Critical | Weighted |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | 31 | 2 | 16 | 10 | 3 | 45 |
| of which source text deficiencies/interferences | 20 | 2 | 11 | 7 | 0 | 25 |
| Fluency | 21 | 6 | 12 | 3 | 0 | 18 |
| of which source text deficiencies/interferences | 19 | 6 | 11 | 2 | 0 | 15 |
| Style | 56 | 16 | 33 | 7 | 0 | 47 |
| of which source text deficiencies/interferences | 45 | 10 | 29 | 6 | 0 | 41 |
| Terminology | 8 | 0 | 8 | 0 | 0 | 8 |
| of which source text deficiencies/interferences | 5 | 0 | 5 | 0 | 0 | 5 |
| Total | 116 | 24 | 69 | 20 | 3 | 118 |
| of which source text deficiencies/interferences | 89 | 18 | 56 | 15 | 0 | 86 |

The largest share of issues (40% of total weighted issues) found in the machine-produced target text belongs to the issue dimension of Style. Most of these are issues of the Awkward issue type, which is due to the MT’s tendency to stick too closely to the overly long and complicated sentences of the source text. In fact, 87% of all weighted Style issues (41 of 47 in the table above) can be attributed to a lack of distance between source and target text. Here’s an example:

Source text: “So, I do agree with the general assumption that there is a common characteristic to how the Western militaries fight, how they try to succeed against the rivalries, the adversaries.”

Target text: “Ich stimme der allgemeinen Annahme zu, dass es eine gemeinsame Eigenschaft gibt, wie westliche Streitkräfte kämpfen, wie sie versuchen, gegen ihre Rivalen und Gegner erfolgreich zu sein.”

To be honest, this lack of distance between source and target text was a little disappointing to see, since other LLMs are often praised for their “freer” way of conveying the source message. Here, however, DeepL’s LLM version produced a text that was frequently hard to read due to the interference between source and target.

The second most frequent issue dimension in DeepL’s target text is Accuracy. These issues ‘only’ make up 27% of all non-weighted issues, trailing well behind Style, but they are considerably more severe, which is why Accuracy accounts for 38% of all weighted issues. Most of these issues are Mistranslations. Here’s an example of a mistranslation so severe that it had to be rated Critical:

Source text: “I think it’s strongly influenced by the, you know, long lines in the Russian strategic thinking, going back in particular in the 19th-century thinkers, which are still referred to in the Russian military education also.”

Target text: “Ich denke, sie ist stark beeinflusst von der langen Tradition des russischen strategischen Denkens, die insbesondere auf die Denker des 19. Jahrhunderts zurückgeht und auch in der russischen Militärausbildung noch immer Bezug nimmt.”

In the example above, DeepL uses the wrong subject for the German relative clause, making it incomprehensible. While this example is not necessarily due to a lack of distance from the source material, a whopping 56% of all weighted Accuracy issues do stem from DeepL’s inability to step away from the source.

Fluency (15% of all weighted issues) and Terminology (7% of all weighted issues) each make up a relatively small portion of the total. Most weighted Fluency issues (83%) result from source text ambiguities that DeepL was unable to resolve in the target text.
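
The per-dimension percentages quoted in this section follow directly from the table. As a rough cross-check, each dimension’s share can be recomputed from the Absolute and Weighted columns. This is only a sketch with hypothetical variable names, and it rounds to whole percentages as the text does:

```python
# (absolute count, weighted score) per issue dimension, copied from the table
dimensions = {
    "Accuracy": (31, 45),
    "Fluency": (21, 18),
    "Style": (56, 47),
    "Terminology": (8, 8),
}

total_absolute = sum(absolute for absolute, _ in dimensions.values())
total_weighted = sum(weighted for _, weighted in dimensions.values())

# Share of each dimension in the non-weighted and weighted totals, in percent
shares = {
    name: (
        round(100 * absolute / total_absolute),
        round(100 * weighted / total_weighted),
    )
    for name, (absolute, weighted) in dimensions.items()
}

for name, (share_absolute, share_weighted) in shares.items():
    print(f"{name}: {share_absolute}% of issues, {share_weighted}% of weighted issues")
```

Comparing the two columns makes the effect of the severity weighting visible: Style dominates the raw issue count, while Accuracy gains ground once severity is taken into account.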

I know that evaluating a translation using the model by Lommel et al. carries the risk of focusing only on the negative. On a positive note, there are examples of good ‘decisions’ DeepL made during the translation process. For instance, in the following sentence, DeepL compressed and thereby improved Zysk’s statement quite a bit:

Source text: “So, I think in my mind, the Western Way of War, when I think about it, it seems to me it is (…)”

Target text: “Wenn ich darüber nachdenke, scheint mir, dass die westliche Art der Kriegsführung (…)”

However, the 92 issues rated Minor or higher (69 Minor, 20 Major and 3 Critical), many of them caused by a lack of distance from the source material, outweigh the positive examples by far. Without correction by an experienced human translator, DeepL’s target text would not be usable for its intended purpose.

Act IV: What to take away from this miniature study

By no means do I claim to have reached any type of breakthrough or developed any kind of innovative study design. However, I do think that the type of study I conducted as part of my Master’s thesis nicely shows the added value human translators continue to bring to the table: their ability to tailor their work to the specific assignment at hand and their expertise in detecting and correcting source text deficiencies. At least in 2025, DeepL’s new and improved MT model was not able to fully grasp the message of the source text, step away from its exact wording and produce a target text that is comprehensible, pleasant to read and, well, usable. However, we as language professionals also have to make our clients understand that it is worthwhile to spend more time and more money on human work rather than simply relying on relatively cheap machines. This might be a use case for studies like the one I conducted (at a greater scale, of course, to produce genuinely reliable results): to pinpoint exactly where MT struggles and where our human expertise proves critical.

Thanks for reading!

About the author

Luca Etzold is based in Cologne and holds Master’s degrees in both specialized translation and conference interpreting. Her working languages are German and English.
