
RWS study finds multilingual AI gaps are narrowing

RWS has published results from a study by its TrainAI division on how large language models generate multilingual synthetic data. The research found that performance gaps between major and underrepresented languages have narrowed.

The study also found that results can shift markedly between model releases, with changes in quality and cost-related measures that may alter which system is best suited to a given task.

The study examines how leading large language models perform when generating synthetic data across multiple languages, an area drawing increased attention as companies deploy AI tools in global content and customer-facing workflows.

One of the clearest trends is improved output in languages that have historically received less support from mainstream models. According to the research, the gap between English and lower-resource languages has narrowed significantly in the latest generation of systems tested.

Among the examples cited, Google’s Gemini Pro recorded quality scores above 4.5 out of 5 in Kinyarwanda, a language in which earlier model versions had struggled to generate coherent text. The study also pointed to broader gains across the market, including improvements in GPT and Claude Sonnet models.

Benchmark Drift

Alongside those gains, researchers highlighted what they called benchmark drift. In practice, that means a newer version of a model can perform worse than an earlier one on some tasks, even as the broader market trend points to improvement.

The latest version of GPT trailed smaller models on several content generation tasks where its predecessor had remained competitive. The study also found that tokenizer efficiency, which affects how much text can be processed for a given cost, varied sharply between successive releases.

That matters for businesses comparing systems for translation, content creation and other multilingual applications, because a model upgrade may change both output quality and economics. A model family that performs well in one release may lose ground in the next, while a rival system may improve in specific languages or workloads.

The researchers conclude that public rankings alone are not enough for model selection, particularly for companies operating across multiple markets and language groups. Instead, the study recommends testing models repeatedly against specific use cases as new versions become available.
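To make that recommendation concrete, the sketch below shows one way a team might structure repeat testing against a fixed set of cases. It is a minimal illustration, not the study's methodology: the languages, prompts, thresholds and the generate() / score_output() placeholders are all hypothetical.

```python
# A minimal sketch of release-over-release testing, assuming a 1-5 quality
# scale like the one cited in the study. The cases, thresholds, and the
# generate() / score_output() placeholders are all hypothetical.

from dataclasses import dataclass


@dataclass
class EvalCase:
    language: str
    prompt: str
    min_score: float  # minimum acceptable quality on a 1-5 scale


CASES = [
    EvalCase("Kinyarwanda", "Summarise this support ticket: ...", 4.0),
    EvalCase("English", "Draft a two-sentence product notice: ...", 4.5),
]


def generate(model_id: str, prompt: str) -> str:
    """Placeholder: call whichever model API is under evaluation."""
    raise NotImplementedError


def score_output(text: str) -> float:
    """Placeholder: human review or a calibrated judge model, 1-5 scale."""
    raise NotImplementedError


def evaluate_release(model_id: str) -> dict[str, bool]:
    """Run the same fixed cases against a new release and flag any
    language that falls below its threshold, i.e. a regression."""
    return {
        case.language: score_output(generate(model_id, case.prompt)) >= case.min_score
        for case in CASES
    }
```

Running the same fixed cases against each new release is what makes benchmark drift visible: a language that passed under one version and fails under the next shows up immediately.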

Enterprise Use

The study’s conclusions are aimed at companies building AI into production workflows rather than conducting one-off experiments. For those users, differences in cultural nuance, text coherence and token efficiency can shape both customer experience and operating cost.

RWS, which provides language, content and AI-related services, framed the findings as a case for closer oversight of model choice and for retaining human review in multilingual deployments. It said high-quality, culturally nuanced data remains central to assessing whether a model is suitable for a specific market or business function.

Vasagi Kothandapani, Chief Executive Officer of TrainAI by RWS, set out that position in a statement accompanying the findings. “This study signals a transformative moment that’s not about replacing human expertise, but about elevating it with the right technology,” Kothandapani said.

“As AI becomes more capable across languages, the need for deep cultural intelligence and human validation is more critical than ever. This is why RWS is guiding enterprises into this new reality by integrating these powerful technologies into content workflows with experts in the loop to ensure accuracy, cultural resonance, and brand consistency on a global scale.”

The research also examined less visible technical measures that can influence adoption decisions. One is tokenizer efficiency: how many tokens a model needs to represent a given text, which bears directly on usage costs.
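To show the mechanics, the sketch below counts tokens for the same sentence in two languages using OpenAI's open-source tiktoken library. The language pair, the encoding choice and the per-token price are illustrative assumptions, not figures from the study; other model families ship their own tokenizers with different behaviour.

```python
# Illustration of tokenizer efficiency using OpenAI's open-source tiktoken
# library (pip install tiktoken). The cl100k_base encoding applies to
# OpenAI-style models only; other families use their own tokenizers.
# The per-token price is a made-up round number, not any vendor's rate.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "The parcel will arrive within three business days.",
    "Japanese": "荷物は3営業日以内に届きます。",  # sample non-Latin-script text
}

PRICE_PER_1K_TOKENS = 0.01  # hypothetical flat rate in dollars

for language, text in samples.items():
    n_tokens = len(enc.encode(text))
    cost = n_tokens / 1000 * PRICE_PER_1K_TOKENS
    print(f"{language}: {n_tokens} tokens -> ${cost:.5f} per request")
```

Because the same sentence can encode to very different token counts depending on language and script, two models quoting identical per-token prices can still differ sharply in what a given workload actually costs.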

Tomáš Burkert, Head of Innovation at TrainAI by RWS, said that issue is often overlooked when businesses compare systems. “A model’s real-world value often comes down to specific, frequently overlooked metrics,” Burkert said.

“Factors like tokenizer efficiency, which can make one model 3.5 times more cost-effective than another in certain languages, are critical. The foundation of a successful AI strategy is a continuous validation process, rooted in high-quality, culturally nuanced AI data, to ensure you’re not just adopting a model, but the optimal model to address your unique enterprise requirements.”
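As a rough worked example of what a 3.5x tokenizer gap implies at scale (all volumes and prices below are hypothetical, not taken from the study):

```python
# Hypothetical volumes showing how a 3.5x token gap compounds at scale.
tokens_model_a = 1_000        # tokens model A needs for one document
tokens_model_b = 3_500        # same document via a less efficient tokenizer
price_per_token = 0.00001     # assume identical per-token pricing (dollars)
docs_per_month = 100_000

monthly_a = tokens_model_a * price_per_token * docs_per_month
monthly_b = tokens_model_b * price_per_token * docs_per_month
print(f"Model A: ${monthly_a:,.0f}/mo, Model B: ${monthly_b:,.0f}/mo")
# -> Model A: $1,000/mo, Model B: $3,500/mo at identical per-token rates
```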

RWS said the overall direction of travel in multilingual AI remains positive, with stronger performance across a wider range of languages than in earlier generations. But the central argument is that improvement is uneven and that model selection should be treated as an ongoing process rather than a one-time decision.

The report recommends independent evaluation with each new release to check whether a model still matches the intended use case, particularly when language coverage, quality thresholds and cost control are all under review.


