OLMoE: A Fully Open Model
Paper: OLMoE: Open Mixture-of-Experts Language Models (61 Pages)
Code: https://github.com/allenai/OLMoE
Researchers from the Allen Institute for AI, Contextual AI, the University of Washington, and Princeton University propose a new MoE model. While there has been significant progress in refining the sparsely gated Mixture-of-Experts layer since its inception, evidenced by advances in routing techniques, expert segmentation, stability, and efficiency, widespread adoption of Mixture of Experts in LLMs is still limited.
Hmm..What’s the background?
OLMoE-1B-7B distinguishes itself as a fully open-source MoE language model: its weights, training data, code, and logs are all publicly available. This transparency aims to facilitate further research and development within the community. The model was trained on a massive dataset of 5 trillion tokens, making it the most extensively trained MoE model to date.
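Because the weights are public, the model can be loaded like any other Hugging Face checkpoint. The snippet below is a minimal sketch, assuming the released checkpoint is hosted on the Hub under an id such as `allenai/OLMoE-1B-7B-0924` (check the repo for the exact name) and that your `transformers` version includes OLMoE support.

```python
# Minimal sketch: loading the released OLMoE weights with Hugging Face transformers.
# The model id below is an assumption; verify it against the OLMoE repo / model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "allenai/OLMoE-1B-7B-0924"  # assumed Hub id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # memory footprint follows the ~7B total parameters
    device_map="auto",
)

inputs = tokenizer("Mixture-of-Experts models are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```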
Ok, So what is proposed in the research paper?
Unlike conventional dense language models, OLMoE-1B-7B employs a MoE architecture, activating only a subset of its parameters for each input token: roughly 1 billion of its ~7 billion total parameters are active per token, which makes both training and inference more compute-efficient.
The model incorporates 64 small experts in each MoE layer, with 8 activated per token. This fine-grained approach with numerous small experts, as opposed to a few large ones, is highlighted as a key design choice for achieving strong performance. In addition, OLMoE-1B-7B uses a dropless token-based routing algorithm, so no tokens are discarded when an expert would otherwise be over capacity.
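To make the routing concrete, here is a simplified, self-contained PyTorch sketch of a fine-grained MoE feed-forward layer with top-8 routing over 64 small experts. It illustrates the general technique rather than OLMoE's actual implementation: the dimensions and class names are invented, and the released code handles dropless routing, load balancing, and efficiency far more carefully.

```python
# Illustrative sketch of a fine-grained MoE layer (not the OLMoE implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    def __init__(self, d_model=1024, d_expert=512, n_experts=64, top_k=8):
        super().__init__()
        self.top_k = top_k
        # Linear router scores every token against every expert.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Fine-grained design: many small experts instead of a few large ones.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.GELU(), nn.Linear(d_expert, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (n_tokens, d_model)
        scores = self.router(x)                # (n_tokens, n_experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)  # mix only the chosen experts
        out = torch.zeros_like(x)
        # Dropless token-choice routing: every token is sent to all of its
        # top-k experts; no token is dropped because an expert is "full".
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Only 8 of the 64 experts run per token, so only a fraction of the total
# parameters are active for any given input (the "1B active / 7B total" idea).
layer = TinyMoELayer()
tokens = torch.randn(4, 1024)
print(layer(tokens).shape)  # torch.Size([4, 1024])
```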
This combination of architectural choices and training regime results in a model that achieves state-of-the-art performance among models with a similar active-parameter budget, even surpassing larger models on certain benchmarks.
What’s next?
Future work could explore increasing the model's size, enabling it to utilize a greater number of parameters per input token and potentially bridge the performance gap with significantly larger models.
So essentially,
Fully open-source OLMoE offers the best performance per cost ⚡
Learned something new? Consider sharing with your friends!