Search for...

How Machines Absorb Cultural Heritage (And What Gets Lost in Translation)

How Machines Absorb Cultural Heritage (And What Gets Lost in Translation), TheRecursive.com
https://therecursive.com/author/karolyboczka/

Károly Boczka is a multilingual AI evaluator and former diplomat with 25 years in international relations, including postings at the Embassy of Hungary in Zagreb and Belgrade. He specializes in AI bias research and multilingual model evaluation, with a focus on Central and Eastern European languages.
~

An AI confidently declares that Nikola Tesla was Serbian. Prompt the same model in Croatian, and Tesla becomes Croatian. Both answers arrive with perfect grammar. Neither comes with any doubt.

This is not a bug. It is a feature of how large language models work: they absorb cultural heritage the way a sponge absorbs water. They take in what surrounds them, and they reproduce it in the shape of the container you hand them. The problem is that Central and Eastern Europe has never had just one container.

The Grammar Illusion

Over 90% of AI training data comes from English-language sources. This is not a secret, but the consequences are rarely discussed outside technical circles. When a model learns Hungarian, Croatian, or Serbian, it learns the syntax and vocabulary reasonably well.

The way these systems are trained, scale is everything and quality is an afterthought: a doctoral thesis and a nationalist rant on an obscure forum corner contribute equally to the pile, as long as the token count keeps climbing.

What gets lost in that indiscriminate accumulation is the humanity and normality encoded in those languages: the everyday rhythms of how real people think, argue, joke, and grieve. And alongside that, the folk tales, the historical grievances, the songs, the specific irony, and the collective memory that make a language more than a communication tool.

To test this gap, I built a benchmark of 100 Hungarian cultural riddles. Not translation exercises, but probes of cultural memory: questions that require knowing what Hungarians learn in history class, what they sing at village festivals, which poets they can recite from memory. The leading AI models were run through both their consumer interfaces and production APIs.

The results were stark. Models scored between 80 and 100% on Hungarian linguistic quality. Grammar, style, fluency: near perfect. Factual accuracy on cultural content: 48% on average through APIs. A coin flip. A middle-aged Hungarian solved the same riddles with 95% accuracy. The models can speak the language. They do not understand what they are saying.

The second finding was arguably worse: across 600 outputs, the models produced 193 confident fabrications but admitted uncertainty only 12 times. A 16-to-1 ratio of confident hallucination over honest ignorance. The system did not know it did not know.

Read more:  13 Founders Predict Europe’s Tech Trends for 2023

When history gets three versions

The grammar illusion becomes a cultural liability when you move into shared but disputed history. Central and Eastern Europe has a lot of that.

An other benchmark put this under pressure: a trilingual red-teaming evaluation across Croatian, Serbian, and Hungarian, built around the historical tensions the three communities actually argue about: the figure of Zrínyi, the MOL-INA dispute, Tesla’s nationality, the 1990s conflicts, Tito’s legacy, NATO’s 1999 bombing campaign. Each prompt asked a model to respond as a proud national voice, defending its side with conviction but without crossing into hostility.

Bias did not disappear in any model, it only changed shape.

Some models retreated into diplomatic vagueness, repeating “both sides share responsibility” even when the task explicitly asked them to take a position. Others went further, quietly mirroring the nationalist framing embedded in the language they were prompted in. And the patterns were not random:

    • Serbian-language answers consistently leaned toward collective framing and shared victimhood.
    • Croatian ones favored individual accountability and legal argumentation.
    • Hungarian prompts produced a cooler, more detached factual register.

Each model followed that rhythm like a mirror, not of the truth, but of the cultural weight carried in each language’s training data.

One model reproduced a phrase from historical propaganda, translated smoothly into the target language, apparently without recognizing what it was carrying. The sentence described the killing of ethnic minorities as an act of justice rather than a crime. The model did not generate this with hostility. It generated it because the training data, somewhere, contained that framing, and the model had no mechanism to flag it as something that required pause.

Seven out of thirty duels triggered hallucinated historical “facts,” most often when a model tried to be diplomatically balanced. Faced with genuinely contested history, it invented compromises that never existed, constructing a peaceful fiction because the real record was too uncomfortable to navigate.

The system preferred a coherent lie to an honest “I don’t know.”

This is what it looks like when a machine absorbs cultural heritage without understanding it: confident, fluent, and occasionally dangerous.

Read more:  What can stoicism teach you about entrepreneurship with Ciprian Borodescu from Algolia

Why this matters beyond the lab

For large Western European languages, the training data archive is massive and relatively diverse. For Hungarian, Croatian, Serbian, and most CEE languages, it is thin, uneven, and often mediated through English-language summaries that flatten regional complexity into digestible approximations. These systems were not trained to understand the region. They were trained to simulate understanding of it, which is a different thing entirely.

Cultural misrepresentation in a riddle game is harmless. The same dynamics operating in content moderation, legal interpretation, historical education tools, or mental health applications are not harmless.

When a model trained on English-dominant data processes a hate speech complaint in Croatian, it applies categories built primarily around English-language slurs, English-language political contexts, and English-language cultural associations. The local specificity, the in-group references, the historically charged terminology that any educated Croatian speaker would recognize immediately, may be invisible to the model entirely.

The same applies to positive cultural transmission. AI tutoring tools, translation services, and digital archives increasingly mediate how younger generations access their own cultural heritage. If those tools were not trained on that heritage, what they transmit is a reconstruction, not the original.

What can be done about it?

Some of the most consequential work on this problem is already happening, quietly and without much fanfare.

Poland offers one of the strongest proofs of concept in our region. Dr. Marek Kozłowski and his team at the AI Lab in the National Information Processing Institute took part in building PLLuM, a family of open-source models trained on around 200 billion tokens of organically sourced Polish text: web content, books, and academic papers, grown from Polish sources deliberately rather than translated from English. Funded by Poland’s Ministry of Digital Affairs, a dedicated government body whose very existence signals how seriously the country takes these issues, PLLuM is now being integrated into public administration, education, and citizen-facing services. In December 2025, it was launched in the mObywatel application, a citizen services platform with almost 11 million users. It is not trying to beat GPT-4. It is trying to actually understand Poland, which is a different and arguably more important goal.

Read more:  Can AI Help Us Stop World Conflicts? This Slovakian Startup Makes A Bold Bet

Cohere’s Global-MMLU project approached the problem from the evaluation side, translating one of AI’s most widely used academic benchmarks into dozens of languages. If you only evaluate AI in English, you only know how smart it is in English. Cohere’s Aya initiative pushed this further, building a genuinely multilingual instruction dataset by recruiting contributors from underrepresented language communities worldwide. Both projects operate on the premise that quality multilingual AI requires people who actually speak those languages to be involved in building and testing the systems, not just consuming them.

Humane Intelligence is doing similar work from a governance and safety angle, commissioning bias research specifically in low-resource languages and publishing findings that corporate AI labs rarely make public about their own models. Also worth mentioning is the work of Latvia-based Tilde AI, they are tackling the problem from the language technology side, developing models and tools specifically optimized for European languages, including several that larger global systems routinely underserve. Their open model TildeOpen covers 34 European languages, with a deliberate focus on linguistic quality over raw scale.

Next step: treat AI training data as a cultural infrastructure question

These efforts represent a different model of how AI gets built: not a single dominant culture generating training data and then exporting it globally, but communities contributing their own languages, their own knowledge, and their own standards for what counts as accurate and fair.

It requires scale and coordination that independent researchers and small NGOs cannot provide alone. Institutions, tech companies, and policymakers in the CEE region need to treat AI training data as a cultural infrastructure question, not only a technical one. The EU AI Act creates some obligations in this space. But regulation alone does not produce datasets. That requires deliberate investment in the kind of cultural content that makes a language more than grammar: the stories, the ambiguities, the history that three neighbors remember differently, and the wisdom to know which differences matter.

Machines absorb what we feed them. The question for our region is whether we are willing to do the work of feeding them something worth passing on.

Help us grow the emerging innovation hubs in Central and Eastern Europe

Every single contribution of yours helps us guarantee our independence and sustainable future. With your financial support, we can keep on providing constructive reporting on the developments in the region, give even more global visibility to our ecosystem, and educate the next generation of innovation journalists and content creators.

Find out more about how your donation could help us shape the story of the CEE entrepreneurial ecosystem!

One-time donation

You can also support The Recursive’s mission with a pick-any-amount, one-time donation. 👍