LLaMAX for maximal language translation
So essentially,
The LLaMAX model maxes out language translation for 100+ languages!
Paper:
LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages (24 Pages)
GitHub:
https://github.com/CONE-MT/LLaMAX/
Researchers from Shanghai AI Laboratory, Nanjing University and Carnegie Mellon University introduce LLaMAX, a new series of LLaMA models enhanced through multilingual continual pre-training.
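Want to poke at the models yourself? Here's a minimal sketch using the Hugging Face transformers library; the model id LLaMAX/LLaMAX2-7B-Alpaca and the Alpaca-style prompt template are assumptions based on the repo's naming, so double-check the GitHub README for the exact ids and template.

```python
# Minimal sketch: prompting a LLaMAX checkpoint via Hugging Face transformers.
# The model id and the Alpaca-style prompt are assumptions based on the repo's
# naming conventions; see the GitHub README for the exact ids and template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LLaMAX/LLaMAX2-7B-Alpaca"  # assumed id, verify on the Hub

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\nTranslate the following sentence from English to Swahili.\n\n"
    "### Input:\nThe weather is nice today.\n\n"
    "### Response:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
# Print only the newly generated continuation, not the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```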
Hmm..What’s the background?
Large Language Models (LLMs) excel at high-resource language translation but struggle with low-resource languages due to insufficient multilingual training data. The gap is particularly evident when comparing English-centric and Arabic-centric translation performance.
Existing efforts to improve low-resource translation focus on fine-tuning or on training language-specific models, but they are limited in language coverage and performance. For instance, simply adding language-specific tokens to expand the vocabulary can actually hurt multilingual performance.
This research introduces LLaMAX, a series of LLaMA models enhanced through extensive multilingual continual pre-training on 102 languages.
Ok, so what is proposed in the research paper?
The main ideas in the paper center around addressing the challenges of low-resource language translation for Large Language Models (LLMs):
Rather than training new LLMs from scratch or focusing solely on fine-tuning, continual pre-training on a large and diverse multilingual dataset offers a more effective way to enhance translation capabilities, especially for low-resource languages.
Through extensive analysis, the authors show that vocabulary extension, the seemingly straightforward fix of adding language-specific tokens, can actually hinder multilingual performance by increasing training complexity and weakening the model's ability to capture linguistic patterns across languages. Instead, the paper advocates sticking with the LLM's original vocabulary, in this case the LLaMA model's Byte-level Byte Pair Encoding (BBPE) tokenizer.
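To make the tokenizer point concrete, here's a tiny sketch (the checkpoint id is just a placeholder for any LLaMA-family tokenizer you have access to): thanks to byte-level fallback, text in a script the model has barely seen still tokenizes without any new vocabulary entries; it just gets split into more, smaller pieces.

```python
# Sketch: LLaMA's tokenizer has byte-level fallback, so text in low-resource
# scripts remains representable without extending the vocabulary; it simply
# splits into more, smaller pieces. The checkpoint id is a placeholder; any
# LLaMA-family tokenizer you have access to will do.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder

samples = {
    "English": "The weather is nice today.",
    "Amharic": "ዛሬ የአየር ሁኔታው ጥሩ ነው።",  # a lower-resource script
}

for lang, text in samples.items():
    pieces = tokenizer.tokenize(text)
    # Expect far more pieces per word for the unseen script, but no <unk>:
    print(f"{lang}: {len(pieces)} pieces -> {pieces[:10]} ...")
```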
The paper also highlights the importance of selecting an appropriate multilingual dictionary, with a preference for dictionaries that have more entries for the target language.
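The paper's actual data recipe is in the paper itself; purely as a toy illustration of why dictionary coverage matters, here's one common augmentation pattern, word-level substitution using a bilingual dictionary, where more target-language entries means more words can be swapped. The miniature dictionary and function below are made up, not the authors' method.

```python
# Toy illustration (not the paper's exact recipe): dictionary-based
# augmentation that swaps source words for target-language translations.
# A dictionary with more target-language entries covers more words, which is
# why dictionary size matters for this kind of augmentation.
import random

# Made-up miniature English->Swahili dictionary.
en_sw = {"water": "maji", "food": "chakula", "good": "nzuri", "today": "leo"}

def augment(sentence: str, dictionary: dict, p: float = 0.5, seed: int = 0) -> str:
    """Replace each word found in the dictionary with its translation, with prob p."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        key = word.lower().strip(".,!?")
        if key in dictionary and rng.random() < p:
            out.append(dictionary[key])
        else:
            out.append(word)
    return " ".join(out)

print(augment("The food is good today .", en_sw, p=1.0))
# -> "The chakula is nzuri leo ."
```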
LLaMAX significantly outperforms other open-source LLMs in multilingual translation, achieving comparable results to the specialized translation model M2M-100-12B on the Flores-101 benchmark. LLaMAX also demonstrates significant performance enhancements even for languages not included in the training set when evaluated on Flores-200.
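Scores on the Flores benchmarks are usually reported as spBLEU; here's a minimal evaluation sketch with sacrebleu (assuming sacrebleu >= 2.2, which ships the Flores-200 SentencePiece tokenizer; the hypothesis and reference strings are placeholders, not real Flores data).

```python
# Minimal spBLEU sketch with sacrebleu (assumes sacrebleu >= 2.2, which
# provides the Flores-200 SentencePiece tokenizer used for spBLEU; the
# tokenizer model is downloaded automatically on first use).
# The hypothesis and reference below are placeholders, not real Flores data.
import sacrebleu

hypotheses = ["Leo hali ya hewa ni nzuri."]
references = [["Hali ya hewa ni nzuri leo."]]  # one reference stream

spbleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="flores200")
print(f"spBLEU: {spbleu.score:.2f}")
```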
What’s next?
The researchers point to several promising directions for future work:
Optimizing the language extension framework to match or exceed the performance of advanced translation systems like MADLAD-400
Exploring the potential for significant translation performance improvements by conducting pivot translation experiments based on LLaMAX2-Alpaca (see the pivot-translation sketch after this list)
Further investigation into improving data preprocessing methods for better model training outcomes, including a thorough evaluation of open-source data quality
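For the pivot-translation idea, here's a minimal sketch of the two-hop setup (source -> English -> target); the translate helper is hypothetical, e.g. a wrapper around the prompt-and-generate code from the loading sketch earlier in this post, not an API of the repo.

```python
# Sketch of pivot translation: translate source -> pivot, then pivot -> target.
# `translate` is a hypothetical helper (e.g. a wrapper around the
# prompt-and-generate code shown earlier); it is not an API of the repo.
from typing import Callable

def pivot_translate(
    translate: Callable[[str, str, str], str],  # translate(text, src_lang, tgt_lang)
    text: str,
    src: str,
    tgt: str,
    pivot: str = "English",
) -> str:
    """Translate via a pivot language in two hops instead of one direct hop."""
    intermediate = translate(text, src, pivot)  # src -> pivot
    return translate(intermediate, pivot, tgt)  # pivot -> tgt

# Usage, given some translate(text, src_lang, tgt_lang) implementation:
# print(pivot_translate(translate, "Bonjour le monde", "French", "Swahili"))
```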
So essentially,
The LLaMAX model maxes out language translation for 100+ languages!