NLP Machine Translation vs. LLM Translation
With the arrival of ChatGPT in late 2022, large language models entered mainstream consciousness for people working inside and outside the world of technology. In the post below, I’ll compare the strengths and weaknesses of “classic” machine translation and LLM-based translation, and explain why I conclude that LLMs do not yet perform dramatically better in real-world translation scenarios.
Brief Overview of NLP Machine Translation & LLMs
So what is NLP, anyway? NLP stands for “natural language processing”: a broad domain inside the “AI” sphere that enables computers to understand, interpret, and generate human language.
Large language models (LLMs) are a subset of NLP models, utilizing extensive training data to perform a wide range of language tasks with minimal task-specific training.
Before LLMs entered mainstream awareness, many translation and localization companies were already using modern neural machine translation (NMT) solutions like DeepL. These NLP-type systems were trained directly on massive datasets of bilingual sentence pairs. Such pairs were usually sentence- or segment-oriented, which is exactly how any CAT (computer-assisted translation) platform configures a translation project.
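To make the segment-oriented nature of this setup concrete, here’s a minimal sketch. The sample pairs and the naive segmentation rule are purely illustrative (real CAT tools use far more sophisticated segmentation rules, e.g. SRX):

```python
import re

# A bilingual corpus for NMT training is essentially a list of aligned
# segment pairs -- the same unit a CAT tool presents to a translator.
sentence_pairs = [
    ("他看了我一眼。", "He glanced at me."),
    ("谢谢你的帮助。", "Thank you for your help."),
]

def segment(text: str) -> list[str]:
    """Naive sentence segmentation on terminal punctuation,
    roughly how a CAT platform splits source text into segments."""
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    return [p for p in parts if p]

print(segment("他看了我一眼。谢谢你的帮助。"))
```

Each segment from the source document is matched against (or translated alongside) its counterpart, which is why classic MT is so naturally at home inside CAT workflows.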
I’ve allowed myself two years to familiarize myself with LLMs before drawing any conclusions. So, let me break down the core differences and similarities I’ve observed in day-to-day usage.
Machine Translation Was Already Robust
By 2018 or so, transformer-based MT had already made machine output sound more humanlike. If you fed large segments of text into DeepL, it translated them surprisingly well. The difference, in my opinion, between “traditional MT” and LLMs is mostly about awareness and accessibility. To really leverage MT’s full capability, you usually needed to buy API access and plug it into a CAT platform. LLMs came wrapped in a chat interface, allowing basically anyone on Earth to sample their capabilities.
The Gap Between LLMs and MT is Smaller Than You Think
I think localization companies had to hop on the hype train, regardless of real-world outcomes, because the general population was already “sold” on how useful this technology could become. Perhaps I haven’t looked hard enough, but I haven’t seen any prominent “thought leader” or localization company executive say something like “LLMs are promising, but in terms of raw translation capability, this isn’t really the ‘game changer’ (at least yet) that people say it is.”
LLMs Excel at Human to Machine Collaboration
Traditional MT systems simply output translations, but LLMs can be consulted, challenged, or questioned in real time. A translator can ask the model to explain a translation in detail, upload a style guide to customize tone and usage, or request alternative translations and on-the-fly rewrites. These are real perks that any translator would love to add to their toolkit. And they have.
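A minimal sketch of what that conversational review loop might look like in code. The message format follows the common chat-completions style used by most LLM APIs, but the function name, model choice, and all the strings here are my own illustrative assumptions, not any vendor’s actual interface:

```python
def build_review_messages(source: str, draft: str, style_guide: str) -> list[dict]:
    """Assemble a chat-style request asking an LLM to critique a draft
    translation against a style guide and propose alternatives."""
    return [
        {"role": "system",
         "content": "You are a translation reviewer. Follow this style guide:\n"
                    + style_guide},
        {"role": "user",
         "content": (f"Source: {source}\n"
                     f"Draft translation: {draft}\n"
                     "Explain any issues and suggest two alternatives.")},
    ]

messages = build_review_messages(
    source="他看了我一眼。",
    draft="He shot me a suspicious look.",
    style_guide="Prefer a neutral register; never add emotion absent from the source.",
)
# `messages` can now be sent to any chat-completions-style endpoint.
```

The point is the shape of the interaction: the style guide rides along as standing instructions, and the translator can keep the conversation going — something a one-shot MT API simply has no slot for.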
Context Handling: A Modest Improvement
DeepL and other MT solutions already handled some context. They could accept termbases (terminology databases), metadata like keys or tags, and could also modify output based on neighboring segments. LLMs offer some improvement because they can ingest larger context windows to “reason” across text in a more adaptive way. But context handling itself isn’t new… it’s just a change in scale.
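To illustrate that “change in scale,” here’s a small sketch of how neighboring segments might be packed into an LLM prompt. The prompt wording and marker convention are my own assumptions, not a standard:

```python
def pack_context(segments: list[str], index: int, window: int = 2) -> str:
    """Build a prompt that gives the LLM neighboring segments as context,
    marking which single segment should actually be translated."""
    lo = max(0, index - window)
    hi = min(len(segments), index + window + 1)
    lines = []
    for i in range(lo, hi):
        marker = ">>> " if i == index else "    "
        lines.append(marker + segments[i])
    return ("Translate only the segment marked '>>>', "
            "using the surrounding segments for context:\n" + "\n".join(lines))

segs = ["句子一。", "句子二。", "句子三。", "句子四。"]
prompt = pack_context(segs, index=2, window=1)
```

Traditional MT did a constrained version of this with neighboring segments and metadata; an LLM just lets you widen `window` until the whole chapter fits. Same idea, bigger bucket.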
LLM Translation Issues
Large language models excel at producing plausible output, but they struggle with some key aspects of quality translation and localization.
1. Hallucinated Nuance
Sometimes the models add emotions or implications that aren’t present in the source text.
Source Text: 他看了我一眼。
Literal Translation: He glanced at me.
LLM Translation: He shot me a suspicious look.
There’s absolutely nothing in the source text that mentions suspicion. This may or may not matter… but strictly speaking, this is a mistranslation and an example of an LLM “trying to do too much”.
2. Over-Localization
If you tell the LLM you’re adapting a Chinese game for an American audience, it might “try too hard” to localize the text.
Source Text: 高考
LLM Translation: SAT or College Entrance Finals
Human Translation: No single right answer.
This kind of term needs to be handled with a lot more nuance. The 高考 (gaokao), in the Chinese cultural context, carries much more significance than the SAT does in the USA. Translating it as “SAT” really understates the sheer importance that students, parents, and Chinese society place on this exam. Students spend their entire high school career preparing for it. This one exam determines which college a student will attend, and whether they go to college at all. The stakes of taking the gaokao are far higher than signing up for and retaking the SAT multiple times over a student’s high school years.
In the two cases above, the LLM provides fluent, plausible translations. But in reality, it lacks the precision and nuance in understanding needed to effectively translate these terms or segments. That is not to say that an “old school MT” solution would do better—I’m just trying to emphasize that LLMs aren’t much (or any) better than traditional machine translation in this regard.
So What Would I Do?
I would actually keep the old-school MT for now. I pay something like $35 USD a month for DeepL API access that I’ve used and used without ever worrying about tokens or usage. That’s not to say that DeepL won’t change their technology, raise their rates, or make what I just said look foolish in the future. It’s just that, for now, a $35-a-month API provides solid performance at a price point that LLMs simply can’t currently match. This may change very soon. Who knows.
So are LLMs an amazing tool for companies or individuals doing localization and translation work? Absolutely. But I think the hype has reached such a level of hyperbole that decision makers who are unfamiliar with traditional MT may be duped into overpaying for a solution that has already existed for at least five years now.
The TL;DR: LLMs are genuinely useful and powerful collaboration tools, but not drop-in MT replacements. And unless you plan on spending hours configuring your own privately hosted model, it’s really difficult to beat the value traditional machine translation provides.