What's going on with the Platypus on HuggingFace?

And is it related to Perry?

Dhruv Diddi
Aug 20, 2023

So essentially,

"Platypus LLMs, trained on newly released OpenPlatypus dataset, are leading the HuggingFace OpenLLM LeaderBoard"

Paper: Platypus: Quick, Cheap, and Powerful Refinement of LLMs [17 Pages]

Researchers from Boston University have established themselves at the top of HuggingFace's Open LLM Leaderboard (August 2023).

The main question they tackled was:

How can we fully optimize the dataset and training process to produce superior LLMs?

In the paper, they describe the following as their proposed advantages:

  • Their curated dataset Open-Platypus (now released) for STEM tasks

  • Their process of fine-tuning and merging LoRA modules to conserve the strong prior of pre-trained LLMs while bringing specific domain knowledge to the surface (see the sketch after this list)

  • Their methods of checking for test data leaks and contamination in the training data
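For a concrete sense of what the second point looks like in practice, here is a minimal sketch of fine-tuning a LoRA adapter and folding it back into the base model with Hugging Face's peft library. The base model id, target modules, and hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# A rough sketch of LoRA fine-tuning and merging with the peft library.
# The base model id, target modules, and hyperparameters are illustrative,
# not the paper's exact configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_name = "meta-llama/Llama-2-13b-hf"  # assumed base checkpoint
base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base_name)

# Attach small low-rank adapter matrices; only these are trained, so the
# frozen pre-trained weights (the "strong prior") stay untouched.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["gate_proj", "up_proj", "down_proj"],  # illustrative choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of all weights

# ... fine-tune `model` on the instruction dataset here ...

# Fold the trained adapter back into the base weights so the merged model
# can be saved and served like an ordinary checkpoint.
merged = model.merge_and_unload()
merged.save_pretrained("platypus-style-13b-merged")
```

Because only the low-rank adapter matrices are updated, the frozen base weights keep their general-purpose behavior while the adapter carries the new domain knowledge.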

The Platypus family of LLMs achieves strong performance on quantitative LLM benchmarks across model sizes, using just a fraction of the fine-tuning data and overall compute required by other state-of-the-art fine-tuned LLMs.

A 13B Platypus model can be trained on a single A100 GPU using 25k questions in 5 hours. More datasets and code are available at https://platypus-llm.github.io
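As a quick usage example, a released checkpoint can be loaded for inference like any other Hugging Face model. The Hub id and the Alpaca-style prompt format below are assumptions; the project page above lists the actual model names.

```python
# A quick, hedged example of loading a released checkpoint for inference.
# The Hub id below is an assumption; check https://platypus-llm.github.io
# for the exact model names.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "garage-bAInd/Platypus2-13B"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = (
    "### Instruction:\nExplain why the derivative of sin(x) is cos(x).\n\n"
    "### Response:\n"
)  # Alpaca-style prompt format, assumed here
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```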

They curated the dataset with specific goals. By focusing on depth in specific areas, diversity of input prompts, and keeping the training set small, they aimed to maximize the precision and relevance of their models' outputs.

To achieve this, they generated Open-Platypus, a content-filtered STEM instruction-tuning dataset drawn from a variety of open-source datasets. It is focused on improving LLMs' STEM and logic knowledge, is made up of 11 open-source datasets, and consists mainly of human-designed questions, with only about 10% generated by an LLM. They reduced data redundancy, checked for contamination between the training data and important LLM test sets, and describe their filtering process so others can avoid the same pitfalls.
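Here is a minimal sketch of the kind of similarity-based check that can catch near-duplicate questions or overlap with benchmark test sets. The encoder choice and the 0.8 cosine-similarity threshold are assumptions for illustration, not necessarily the paper's exact settings.

```python
# A minimal sketch of a similarity check that flags near-duplicate training
# questions or overlap with benchmark test sets. The encoder and the 0.8
# threshold are assumptions for illustration.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

train_questions = [
    "What is the integral of 2x dx?",
    "Solve for x: 3x + 5 = 20.",
]
test_questions = [
    "Compute the integral of 2x with respect to x.",
]

train_emb = encoder.encode(train_questions, convert_to_tensor=True)
test_emb = encoder.encode(test_questions, convert_to_tensor=True)

# Cosine similarity of every training question against every test question.
sims = util.cos_sim(train_emb, test_emb)

# Keep only training questions that are not too close to anything in the test set.
keep = [q for q, row in zip(train_questions, sims) if row.max() < 0.8]
print(keep)
```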

One notable limitation of this approach, especially for domain-specific models derived from large, pre-trained ones, is that fine-tuning can still be time-consuming and costly, which constrains retraining and further fine-tuning on additional datasets.

Here are the results from the paper:

[Benchmark results figure omitted. Source: paper]

Their future work might delve deeper into the nuances of model merging, especially among models with similar baseline scores. Other directions include integrating alpaca- and orca-style datasets, examining the potential of QLoRA in the pipeline, exploring the LIMA strategy within the LoRA and PEFT landscape, and potentially leveraging models like Lazarus, a successful LoRA merge of six models.
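As a rough illustration of what merging LoRA modules on a shared base model can look like, here is a hedged sketch using peft's add_weighted_adapter. The adapter paths, names, and blend weights are placeholders, not anything released by the authors.

```python
# A hedged sketch of combining several LoRA adapters on one base model with
# peft's add_weighted_adapter. Adapter paths, names, and weights are
# placeholders, not the authors' released artifacts.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf", torch_dtype=torch.bfloat16
)

# Load a first adapter, then add a second one onto the same base model.
model = PeftModel.from_pretrained(base, "path/to/stem-adapter", adapter_name="stem")
model.load_adapter("path/to/instruct-adapter", adapter_name="instruct")

# Linearly blend the two adapters into a new, combined adapter and activate it.
model.add_weighted_adapter(
    adapters=["stem", "instruct"],
    weights=[0.5, 0.5],
    adapter_name="blend",
    combination_type="linear",
)
model.set_adapter("blend")
```

Whether such a blend actually preserves each adapter's strengths, especially when the underlying models score similarly, is exactly the kind of open question the authors point to.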

