Search for...

“AI Shouldn’t Be English-Centric”: Meet Latvian Foundation Model Optimized for 34 European Languages

“AI Shouldn’t Be English-Centric”: Meet Latvian Foundation Model Optimized for 34 European Languages, TheRecursive.com
~

If you used ChatGPT, Gemini, or Perplexity in any other language than English, I bet you didn’t get the best results. There is this uncanny feel: like the model is thinking in its native language — English, but gives us a rigid translation into a language it’s not fully fluent in. The effect scales proportionally to the size of the language (i.e., how many people use it). So, with smaller countries and less popular languages, the results seem less optimal.

Unfortunately, we live in a region that is like a black box to most AI models… One company decided to change that – Tilde, an AI and language tech company based in Latvia. Last year, Tilde won Large AI Grand Challenge funded by the European Union, with a plan to develop a non-English-centric AI model, optimized for all European languages.

A year later they launched TildeOpen — a 30B open-source multilingual foundation model. For this feat, they were given access to the powerful LUMI supercomputer, and granted 2 million GPU hours. Why they decided to take on this challenge, what the TildeOpen model is capable of, how it compares to much bigger models, and why this example is more important than the sum of its parts… We sought answers with Tilde’s CEO and Chairman of the Board, Artūrs Vasiļevskis.

From Latvian diacritic signs to AI Grand Challenge

The Large AI Grand Challenge was born out of a strategic push from the European Commission, motivated by reports like the one from Mario Draghi, seeking to prevent Europe from losing the AI development race entirely. The core condition was to deliver a state-of-the-art open foundational AI model that could serve as a bedrock for European society — universities, public administration, startups, and businesses — to build next-generation applications.

Tilde’s motivation to apply was twofold: technological ambition and a deeply held belief that every language deserves equal respect in the digital age. This mission stems from the company’s 34-year history, starting in 1991 when they created a basic application just to enable typing Latvian letters with diacritical signs on computers.So the founders built a tiny application to fix that,Arturs recalled. That little tool becameTilde”—named after the ~ symbol that used to sit unused on their keyboards.

Read more:  CEE Lags in AI Adoption, Front-Office Assistants Could Change That

The company slowly expanded its ambitions: digital dictionaries, then machine translation for Baltic languages, and finally neural machine translation. Tilde launched its production-ready neural MT system just days after Google released its own, and later beat Google in several rounds of the global WMT translation competition. Five years in a row, in fact.

And that background shaped how they approached the challenge.

“We saw this AI Grand Challenge as a way to accomplish that mission: to provide equal high-quality support in large language models for all European languages.”

Beating giants with smarter training

Most multilingual models today, whether they come from California or Shenzhen, look multilingual only on the box. Their training data is overwhelmingly English. Ninety to ninety-five percent English, Arturs said. Everything else gets shoved into the remaining few percent. For Latvian, Estonian, Polish, Croatian, or Serbian, that might mean 0.01%.

What you get? A model that can answer a Baltic history question but stumbles on basic sentence structure. Or mixes registers. Or sounds like a machine trying to imitate Wikipedia. TildeOpen was built to address this quality gap with three fundamental data training elements, Arturs explains:

1) Equal representation for all languages

In Tilde’s model, all 34 official European languages are equally represented in the training data. This ensures the model fundamentally understands the structure and context of languages like Latvian, Estonian, and Slovenian, rather than treating them as an afterthought based on English linguistic rules.

“AI Shouldn’t Be English-Centric”: Meet Latvian Foundation Model Optimized for 34 European Languages, TheRecursive.com
Credits: Tilde

The hardest part though? Data quality for smaller languages. Not (just) the volume, the balance. German and Spanish have endless datasets; Latvian or Slovenian do not. That meant extra filtering layers, domain diversity checks, and manual validation to make sure the model wouldn’t lean too heavily on a single industry or genre.

“For ‘small languages,’ it was a challenge to filter out data originating from Russian propaganda, as well as to proportionally balance the remaining data across sectors — for example, to avoid an excessive emphasis on the EU legislative domain. As a result, a large amount of data had to be filtered out from an already small dataset, especially when compared to, for instance, German or Spanish.”

2) Language-specific tokenizers

Read more:  Croatia's SplxAI Raises €1.8M to Enhance AI Chatbot Security Amid Growing Cyber Threats

Unlike US models, which Arturs says apply the same linguistic rules to all languages, Tilde used a dedicated tokenizer for each language. This means the data was linguistically prepared in aright wayaccording to the specific morphology of that language.

“What does this mean in practice? It means that when you are using, for example, ChatGPT or Copilot in Latvian or German languageit impacts the speed of generation, and power consumed. The models are simply not efficient as with English.”

By treating each language uniquely, Tilde Open has proven to be 30 to 40% more efficient than comparable models in terms of speed and electricity consumption when processing non-English European languages.

3) Fighting disinformation

A major concern addressed by Tilde was the threat of corrupted, untrustworthy AI. With the help of Ukrainian and European intelligence, the Tilde team developed algorithms to filter their dataset of over two trillion tokens with Russian propaganda, particularly elements associated with the infamousPravda network.”

If those language models and those solutions are infected with a Russian propaganda, a Russian view on the history, Russian view on the political systems, then this is the technology through [which] you can impact the decisions,Arturs noted, underscoring the vital importance of data sanitization for societal safety.

The power of being lean

The result of that approach is Tilde Open, a 30-billion-parameter model tuned for European languages. So far they have only released a base model, no instruction model. That means the model is released so developers and researchers can fine-tune it themselves for specific, non-chat tasks (like summarization, code generation, or custom internal systems), it’s not for general public use.

Tilde tested its final model on benchmarks against three other models: Gemma 2, EuroLLM, ALIA, as well as ChatGPT’s own translation capabilities. Despite being smaller by 60 times, the TildeOpen matched GPT-4-level translation quality and even outperformed it in context-heavy, culture-specific cases, when translating across non-English EU languages, Vasiļevskis claims.

Read more:  We Need To Rethink AI to Save Our Democracy

The model’s efficiency and relatively smaller size (for a foundation model) have two massive practical implications for Europe, besides their immediate effect.

    • Foundation for national models: It provides an excellent base for building smaller, specialized national LLMs for individual European countries.
    • On-premise deployment: It’ssmall enough to be able to deploy it on European physical infrastructure,including government, non-profit or enterprise premises, which is crucial for organizations with strict security and data privacy requirements.

Arturs emphasized the two competing philosophies in AI development:One strategy… is building larger, bigger models, generic ones.The other, which Europe can excel at, isBuilding smaller but very focused, dedicated models towards to a specific task accomplishing.”

It doable!”

Tilde Open was developed by a relatively small team of only 10 PhDs and data scientists, and it serves as a great example.It’s doable. We have to work on our own European solutions and applications.”

The way forward, according to Arturs, involves a continued push for shared infrastructure, such as supercomputers, and for programs that encourage data collection and sharing. These initiatives, launched decades ago, are nowreally paying off”.

Tilde is now looking ahead, with plans to expand Tilde Open’s reach by extending its context window to 64k tokens and incorporating other major regional languages, such as Arabic and Hindi. They also plan to use the foundational model to develop even smaller (around 10-billion-parameter) models to make AI more broadly accessible on less demanding infrastructure.

Help us grow the emerging innovation hubs in Central and Eastern Europe

Every single contribution of yours helps us guarantee our independence and sustainable future. With your financial support, we can keep on providing constructive reporting on the developments in the region, give even more global visibility to our ecosystem, and educate the next generation of innovation journalists and content creators.

Find out more about how your donation could help us shape the story of the CEE entrepreneurial ecosystem!

One-time donation

You can also support The Recursive’s mission with a pick-any-amount, one-time donation. 👍

Ana Marija is the Editor-in-Chief of The Recursive. Even though her beginnings go back to mainstream media, her passion for technology prevailed. She polished her journalistic and editorial craft at Croatia's Netokracija, where she covered topics from startups life to software development. She oversaw the production of various video and content projects, as well as community events - but most of all she enjoys sharing valuable experiences of the founders, developers, and technology experts.