In the spirit of the Olympics and Greece!
So essentially,
Meltemi is the first open Large Language Model for Greek!
Paper:
Meltemi: The first open Large Language Model for Greek (11 Pages)
Researchers from Athens, Greece, are interested in developing open Large Language Models (LLMs) for the Greek language.
Hmm.. What's the background?
The researchers introduce Meltemi 7B and Meltemi 7B Instruct, the first open Large Language Models (LLMs) for the Greek language. The paper frames continuous pre-training, i.e. adapting an existing model to a new language rather than training one from scratch, as a practical and cost-effective way to bring LLM capabilities to lower-resourced languages.
Ok, so what is proposed in the research paper?
The researchers acknowledge the scarcity of Greek text compared to languages like English. To overcome this, the developers adopted a multi-faceted approach:
They compiled a vast Greek monolingual corpus from diverse sources such as Wikipedia, academic repositories, parliamentary proceedings, and pre-filtered web resources
Instead of building a new model from scratch, they leveraged the existing Mistral 7B model and performed continuous pre-training with a mixture of Greek, English, and parallel (English-Greek) data
They implemented a rigorous filtering pipeline involving rule-based filters, fluency scores, and specialized tools like Monocleaner and LASER to eliminate noisy and low-quality data (a simplified filtering-and-mixing sketch follows this list)
The original Mistral 7B tokenizer, designed primarily for English, proved inefficient for Greek text, leading to higher computational costs. To address this, they expanded the tokenizer's vocabulary from 32,000 to 61,362 tokens, incorporating Greek-specific characters and sub-word units (see the tokenizer sketch below)
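To make the data filtering and mixing steps above a bit more concrete, here is a minimal, purely illustrative sketch using the Hugging Face datasets library. The file names, the Greek-script ratio rule, and the mixture weights are assumptions for illustration, not the paper's actual pipeline (which relies on Monocleaner, LASER, and fluency scores):

```python
# Simplified, illustrative sketch only: file names, the Greek-script ratio rule,
# and the mixing weights are hypothetical, not the paper's actual pipeline.
import re
from datasets import load_dataset, interleave_datasets

greek = load_dataset("text", data_files="greek_corpus.txt", split="train")       # monolingual Greek
english = load_dataset("text", data_files="english_corpus.txt", split="train")   # English data
parallel = load_dataset("text", data_files="el_en_parallel.txt", split="train")  # EL-EN parallel text

GREEK_CHARS = re.compile(r"[\u0370-\u03FF\u1F00-\u1FFF]")

def looks_greek(example, min_ratio=0.5):
    """Crude rule-based filter: keep lines whose letters are mostly Greek script."""
    letters = [c for c in example["text"] if c.isalpha()]
    if not letters:
        return False
    greek_count = sum(1 for c in letters if GREEK_CHARS.match(c))
    return greek_count / len(letters) >= min_ratio

greek = greek.filter(looks_greek)

# Interleave the three sources with fixed sampling probabilities so the
# continued pre-training stream mixes Greek, English, and parallel text.
mixed = interleave_datasets(
    [greek, english, parallel],
    probabilities=[0.8, 0.1, 0.1],  # hypothetical mixture weights
    seed=42,
    stopping_strategy="all_exhausted",
)
```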
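And for the tokenizer expansion in the last item, a minimal sketch of how a vocabulary extension is typically wired up with the transformers library, assuming a hypothetical greek_subwords.txt file of new sub-words; this is a generic recipe, not the authors' exact procedure:

```python
# Rough sketch of vocabulary extension, assuming a plain-text file with one new
# Greek sub-word per line (e.g. learned by a BPE/SentencePiece model trained on
# the Greek corpus). The file name is a placeholder, not the authors' artifact.
from transformers import AutoTokenizer, AutoModelForCausalLM

base = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

print(len(tokenizer))  # 32000: the original Mistral 7B vocabulary size

with open("greek_subwords.txt", encoding="utf-8") as f:
    new_tokens = [line.strip() for line in f if line.strip()]

num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix (and output head) to match the enlarged vocabulary;
# the new rows are then trained during continued pre-training on Greek data.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocabulary size: {len(tokenizer)}")
```

A Greek-aware vocabulary means far fewer tokens per Greek sentence, which is exactly the efficiency gain the expansion is after.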
A new benchmark suite was created specifically for evaluating LLM performance in Greek. This suite includes translated versions of established English benchmarks, publicly available Greek question-answering datasets, and a novel dataset focused on medical question answering. The results indicate that Meltemi 7B Instruct significantly outperforms the model it was adapted from (Mistral 7B) on Greek-focused tasks.
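As a rough illustration of how such multiple-choice benchmarks are commonly scored, here is a toy log-likelihood comparison: the question, the answer options, and the Hugging Face model id are assumptions, and this does not reproduce the official evaluation suite:

```python
# Toy multiple-choice scoring sketch: pick the answer option whose full
# question+answer text gets the lowest average loss (highest likelihood).
# The question, options, and model id are invented/assumed examples.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "ilsp/Meltemi-7B-Instruct-v1"  # assumed Hugging Face model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

question = "Ποια είναι η πρωτεύουσα της Ελλάδας;"  # "What is the capital of Greece?"
options = ["Αθήνα", "Θεσσαλονίκη", "Πάτρα"]

def option_loss(question, option):
    """Mean cross-entropy of the concatenated question + option (simplified)."""
    inputs = tokenizer(f"{question} {option}", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return out.loss.item()

best = min(options, key=lambda o: option_loss(question, o))
print("Predicted answer:", best)
```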
What's next?
The researchers identify several directions for future research:
While continuous pre-training offers a more efficient way to adapt existing models to new languages compared to training from scratch, the authors express interest in exploring even more efficient methods for model adaptation
The creation of Meltemi 7B, while groundbreaking for the Greek language, consumed a significant amount of energy (2300 kWh). Recognizing the environmental impact, future work aims to adapt larger language models for Greek while prioritizing sustainability
Going beyond text-only models, future work aims to explore multimodality for Greek language models
So essentially,
Meltemi is the first open Large Language Model for Greek!
Learned something new? Consider sharing with your friends!