Apple: The Pound-for-Pound Champ
So essentially,
Apple's DCLM-7B squeezes the most capability out of every unit of training compute in its class
Paper:
DataComp-LM: In search of the next generation of training sets for language models (88 pages)
GitHub:
https://github.com/mlfoundations/dclm
Researchers from Apple and ML Foundations want to bring down the cost of training language models. As the demand for efficient training grows, the research focus is shifting from the models themselves to the training data: curating datasets that deliver better efficiency and stronger generalization across tasks.
Hmm.. What's the background?
Large training datasets are fundamental to the advancements observed in language modeling. The quality and characteristics of these datasets directly impact the capabilities and performance of LLMs. As LLMs grow in scale and complexity, the computational costs associated with their training are also rising. This necessitates a shift in research focus towards optimizing training datasets to achieve better efficiency and model generalization.
Ok, so what is proposed in the research paper?
The paper introduces DCLM, a benchmark designed to facilitate controlled experiments with training datasets for language models. Here are some key points:
It provides a standardized environment for researchers to assess the effectiveness of their data curation strategies
The benchmark covers different text extraction techniques, deduplication strategies, and model-based quality filtering approaches (see the sketch after this list)
A key element of DCLM is DCLM-POOL, a publicly available corpus containing 240 trillion tokens extracted from Common Crawl
The paper also proposes DCLM-BASELINE, a high-quality training dataset derived from DCLM-POOL
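To make the curation step concrete, here is a minimal Python sketch of model-based quality filtering in the spirit of DCLM-Baseline. The paper does report scoring documents with a fastText classifier, but the model file name, label names, and threshold below are illustrative placeholders of mine, not artifacts from the paper or the repo.

```python
# Minimal sketch of model-based quality filtering over a web-text pool.
# Assumptions (not from the paper): the file "quality.bin", the label
# "__label__hq", and the fixed 0.9 threshold are placeholders. DCLM-Baseline
# keeps roughly the top-scoring slice of the pool; a fixed threshold is a
# simplification of that percentile-based selection.
import fasttext

model = fasttext.load_model("quality.bin")  # hypothetical pretrained quality classifier


def quality_score(doc: str) -> float:
    """Return the classifier's probability that a document is 'high quality'."""
    # fastText predict() expects single-line input, so strip newlines first.
    labels, probs = model.predict(doc.replace("\n", " "), k=2)
    scores = dict(zip(labels, probs))
    return float(scores.get("__label__hq", 0.0))


def filter_pool(docs, threshold=0.9):
    """Keep only documents whose quality score clears the threshold."""
    return [d for d in docs if quality_score(d) >= threshold]


if __name__ == "__main__":
    pool = [
        "A clear explanation of how transformers compute attention ...",
        "click here buy now free free free !!!",
    ]
    kept = filter_pool(pool)
    print(f"kept {len(kept)} of {len(pool)} documents")
```

In the actual pipeline this scoring step sits after text extraction and deduplication, and runs over the full DCLM-POOL rather than a toy list.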
The resulting DCLM-Baseline-7B model is comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (roughly 63% and 66%, respectively), and performs similarly across an average of 53 natural-language-understanding tasks, while being trained with 6.6× less compute than Llama 3 8B.
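Where does the 6.6× figure come from? A quick back-of-the-envelope check with the common training-FLOPs approximation C ≈ 6·N·D, plugging in the publicly reported token counts (~15T for Llama 3 8B, ~2.6T for DCLM-Baseline-7B), lands right around the paper's number. Treat this as an estimate, not an exact accounting of either training run.

```python
# Back-of-the-envelope check on the "6.6x less compute" claim, using the
# common approximation C ~= 6 * N * D (N = parameters, D = training tokens).
# Token counts are the publicly reported figures; the result is an estimate.
def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens


llama3_8b = train_flops(8e9, 15e12)   # ~7.2e23 FLOPs
dclm_7b = train_flops(7e9, 2.6e12)    # ~1.1e23 FLOPs

print(f"Llama 3 8B : {llama3_8b:.2e} FLOPs")
print(f"DCLM-7B    : {dclm_7b:.2e} FLOPs")
print(f"ratio      : {llama3_8b / dclm_7b:.1f}x")  # ~6.6x
```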
What's next?
The researchers highlight the following future research directions:
The current DCLM evaluation primarily focuses on language understanding tasks. Future work could expand the benchmark to encompass code and math domains
The researchers emphasize the need for a deeper understanding of how data quality, filtering ratios, and multi-epoch training interact
State-of-the-art language models continue to grow in size (up to 400B+ parameters), so it remains open whether DCLM's data-curation findings carry over to much larger scales
So essentially,
Apple's DCLM-7B is the pound-for-pound champ: near-frontier 7B-class performance at a fraction of the training compute