Science
24 March 2025

Study Shows AI Language Models Struggle With Translationese

Despite advancements, translation issues persist in large language models and impact language learning effectiveness.

In a groundbreaking study published on March 6, 2025, researchers from the Shanghai AI Laboratory, together with experts from Westlake University in Hangzhou and Northeastern University in Shenyang, revealed that large language models (LLMs) continue to struggle with an issue known as "translationese." This phenomenon refers to overly literal and unnatural translations that deviate from native linguistic norms. The research is particularly significant because it represents the first systematic study of translationese in LLMs, building on insights from traditional machine translation (MT) systems.

The researchers assert, "To our knowledge, this is the first systematic study addressing translationese in LLMs." Given that LLMs are trained on extensive datasets of native-language text, one might expect them to be less prone to this issue. Yet the study demonstrates that the problem of producing "unexpected" unnatural translations persists, highlighting translationese as a "persistent challenge" in AI translation.

The researchers evaluated a variety of LLMs, including GPT-4, GPT-3.5, ALMA-7B/13B, and Mistral-7B, focusing on translations between English and Chinese as well as German and English. The findings were striking: over 40% of the translations produced by GPT-4 exhibited translationese errors. More alarmingly, the Mistral-7B model had the highest error rate, reaching a staggering 76% on English-to-Chinese translations. This suggests that while larger models tend to produce more natural translations than their smaller counterparts, significant issues remain across the board.

The researchers delved further into whether specific prompting strategies could alleviate the translationese problem. They experimented with three types of prompts: a standard translation prompt, a specified prompt detailing naturalness requirements, and a polishing prompt that instructed models to refine their translations through a two-step process. However, they found that merely specifying naturalness requirements did not reliably reduce translationese; in some cases, it even worsened the translations. For example, under specified prompts, the rate of translationese errors in GPT-4 increased.
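
For readers curious what these prompting strategies might look like in practice, here is a minimal sketch in Python. The prompt wording and the llm_complete helper are illustrative assumptions, not the templates or code used in the study.

```python
# Illustrative sketch only: the prompt wording and the llm_complete() helper
# are assumptions for demonstration, not the study's actual templates or code.

def llm_complete(prompt: str) -> str:
    """Placeholder for a call to an LLM API (e.g., a chat-completion endpoint)."""
    raise NotImplementedError("Wire this up to the model of your choice.")

def standard_prompt(src: str, src_lang: str, tgt_lang: str) -> str:
    # Plain translation instruction with no mention of naturalness.
    return f"Translate the following {src_lang} text into {tgt_lang}:\n{src}"

def specified_prompt(src: str, src_lang: str, tgt_lang: str) -> str:
    # Adds an explicit naturalness requirement to the instruction.
    return (
        f"Translate the following {src_lang} text into natural, fluent {tgt_lang}, "
        f"as a native speaker would write it:\n{src}"
    )

def polishing_prompt(draft: str, tgt_lang: str) -> str:
    # Second step of the two-step strategy: refine an existing draft translation.
    return (
        f"Polish the following {tgt_lang} translation so it reads naturally, "
        f"without changing its meaning:\n{draft}"
    )
```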

Conversely, the polishing prompt yielded positive results. When instructed to polish its outputs, GPT-4 reduced its translationese errors from 43% to 25%. The researchers noted that their findings suggest LLMs are not inherently prone to producing translationese; rather, the supervised fine-tuning applied during training contributes to these biases by favoring faithfulness over fluency. Notably, over 34% of the fine-tuning data they reviewed showed evidence of translationese, perpetuating unnatural patterns in model outputs.
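
A minimal sketch of that two-step "translate, then polish" flow, reusing the illustrative helpers from the sketch above; again, this is an assumption about how such a pipeline could be wired up, not the authors' implementation.

```python
# Sketch of the two-step flow: draft a translation, then ask the model to
# polish it for naturalness. Reuses the illustrative helpers defined above.

def translate_with_polishing(src: str, src_lang: str, tgt_lang: str) -> str:
    draft = llm_complete(standard_prompt(src, src_lang, tgt_lang))  # step 1: draft translation
    polished = llm_complete(polishing_prompt(draft, tgt_lang))      # step 2: refine for naturalness
    return polished
```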

To mitigate these biases, the researchers proposed two effective strategies: first, employing LLMs to refine the gold reference translations before fine-tuning; second, using LLMs to filter unnatural translations out of the training data. Experiments with the Llama-3.1-8B and Qwen-2.5-7B models demonstrated that these approaches can significantly improve translation quality. As the researchers concluded, "These findings underscored the importance of addressing data quality and training methodologies in developing robust and natural translation systems."
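
To make the two data-curation ideas concrete, here is a hedged sketch of how refining or filtering references before fine-tuning could be organized. The judge_naturalness and refine_reference helpers stand in for LLM calls and are assumptions for illustration, not the authors' code.

```python
# Illustrative sketch of the two curation strategies described above:
# refine unnatural gold references with an LLM, or filter them out entirely.
# judge_naturalness() and refine_reference() are placeholder LLM calls.

from dataclasses import dataclass

@dataclass
class Pair:
    source: str
    reference: str

def judge_naturalness(text: str, lang: str) -> float:
    """Placeholder: ask an LLM to rate how natural `text` reads in `lang` (0 to 1)."""
    raise NotImplementedError

def refine_reference(source: str, reference: str, tgt_lang: str) -> str:
    """Placeholder: ask an LLM to polish the reference while preserving its meaning."""
    raise NotImplementedError

def curate(pairs: list[Pair], tgt_lang: str, threshold: float = 0.5,
           strategy: str = "refine") -> list[Pair]:
    curated = []
    for p in pairs:
        if judge_naturalness(p.reference, tgt_lang) >= threshold:
            curated.append(p)  # already natural: keep as-is
        elif strategy == "refine":
            # Strategy 1: replace the unnatural reference with a polished version.
            curated.append(Pair(p.source, refine_reference(p.source, p.reference, tgt_lang)))
        # Strategy 2 ("filter"): drop the unnatural pair entirely.
    return curated
```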

Meanwhile, on the language-learning front, individuals are exploring AI chatbots as an innovative tool for improving language proficiency. One user reflects on their experience with the AI chatbot Langua, which offers a unique blend of role play and personal interaction, allowing users to practice conversational Spanish in real-world contexts.

The author of this anecdote has been studying Spanish for two decades and shares their struggles and triumphs while preparing to move to Spain. By engaging in practical scenarios through Langua, they noticed immediate improvements in their conversational skills, citing the chatbot's ability to remove "disfluencies" and provide coherent alternative phrases as major benefits.

For instance, during a role-playing session where the author pretended to be a personal trainer, the chatbot understood and responded appropriately to various requests, providing a conversational experience that traditional methods lack. “I’ve grabbed the bar,” the chatbot responded when prompted to integrate specific vocabulary into the dialogue.

Speaking to the chatbot's value, the author notes that Langua's $25 monthly subscription costs roughly the same as a single tutoring session, making it a cost-effective alternative for daily practice.

However, the author also acknowledges that no AI can replace human interaction entirely. Their tutor, Maria, offers not just linguistic expertise but also emotional connection, humor, and cultural insights that enrich the learning experience. The stark difference between conversations with AI and real people highlights the importance of interpersonal communication skills, which AI cannot fully replicate.

Despite the notable advancements made by chatbots, the author emphasizes that language learning is an ongoing journey that benefits from combining diverse methods in pursuit of fluency. By balancing AI-assisted conversation practice with traditional tutoring and real-world interactions, learners may find the best path toward language mastery.

As both the chatbot experience and the ongoing research into LLMs show, the technological and linguistic landscapes are evolving rapidly. The dual challenges of building effective AI translation systems and fostering real human connection in language learning continue to inspire innovation and exploration in the field, promising a future in which fluent communication becomes increasingly attainable for learners worldwide.