In a nutshell:
- The Institute for Language and Speech Processing of the Athena Research & Innovation Center introduced Meltemi, the first Greek LLM.
- The model is built on top of Mistral-7B.
Get the details:
The main motivation for developing a Greek LLM stems from the limitations of existing open LLMs, which are predominantly trained on English-only datasets.
In March, Bulgaria's Institute for Computer Science, Artificial Intelligence and Technology (INSAIT) launched BgGPT, the first free, open LLM trained for Bulgarian. Across Europe there are other examples, such as the German LeoLM and the Spanish Aguila.
The Athena Research Center is a Greek R&D organization working on language, creative technologies, and digital humanities. Its focus is advancing scientific and technological research in information, communication, and knowledge technologies.
The Institute explains it also constructed a standardized LLM evaluation suite for the Greek language. Version 1 of Meltemi comes in two variants, both released under the Apache 2.0 License: Meltemi-7B-v1 is the foundation model, and Meltemi-7B-Instruct-v1 is an instruction-tuned variant designed for chat applications.
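For readers who want to try the chat variant, here is a minimal sketch using the Hugging Face transformers library. The repository id ilsp/Meltemi-7B-Instruct-v1 is an assumption about where the weights are hosted, and the generation settings are illustrative, not the Institute's recommended configuration.

```python
# Minimal sketch: load the instruction-tuned variant and ask it a question in Greek.
# The repo id below is an assumed Hugging Face Hub location for the weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ilsp/Meltemi-7B-Instruct-v1"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Build a chat prompt via the tokenizer's chat template, the standard
# mechanism for chat-tuned models in transformers.
messages = [{"role": "user", "content": "Ποια είναι η πρωτεύουσα της Ελλάδας;"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate and print only the newly produced tokens.
outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```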
The models were trained on AWS infrastructure with support from a grant by GRNET, which provides networking and cloud computing services to academic and research institutions.
The project is led by Vassilis Katsouros, who has been a Research Director at the Institute for Language and Speech Processing for more than 20 years.
Meltemi is named after the dry north wind that blows across the Aegean Sea during the summer.
In their own words:
“Meltemi is developed as a bilingual model, maintaining its capabilities for the English language, while being extended to understand and generate fluent text in Modern Greek using state-of-the-art techniques,” shared the Greek Institute. “The original version of Mistral-7B is trained on a large corpus of English text. We extend the pretraining of Mistral-7B with added proficiency for the Greek language, by utilizing a large corpus consisting of approximately 40 billion tokens.”
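The quote describes continued pretraining: resuming causal language-model training of Mistral-7B on a Greek corpus. The sketch below shows the general shape of that recipe with the Hugging Face Trainer; the corpus file, sequence length, and hyperparameters are illustrative assumptions, not the Institute's actual setup.

```python
# Sketch of continued pretraining of Mistral-7B on Greek text (assumed recipe).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # Mistral's tokenizer has no pad token
model = AutoModelForCausalLM.from_pretrained(base)

# Tokenize a hypothetical Greek text corpus (one document per line).
corpus = load_dataset("text", data_files={"train": "greek_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train_ds = corpus.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="meltemi-continued",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,   # illustrative value
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=train_ds,
    # mlm=False yields next-token (causal LM) labels rather than masked-LM ones.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

In practice, extending a model to a new language at this scale also involves choices the quote does not detail, such as whether to expand the tokenizer's vocabulary for Greek and how to mix English data back in to preserve the original capabilities.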