RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
And can we scale human feedback with AI?
So essentially,
"AI Feedback seems as good as Human Feedback for Reinforcement Learning"
Paper: RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
Researchers from Google Research directly compare Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF).
From previous research we know that Large Language Models (LLMs) exhibit a high degree of alignment with human judgment, even outperforming humans on some tasks. This paper focuses on the summarization task.
The main questions they tackled were:
How do AI feedback and human feedback differ on the summarization task?
Can we maximize the alignment of AI-generated preferences with human preferences?
How can RLAIF enhance RLHF in the future?
In the paper, they describe the following as their approach:
They set up an RLHF pipeline with three phases: supervised fine-tuning, reward model training, and reinforcement learning-based fine-tuning.
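As a rough structural sketch (not the authors' code), the three phases can be laid out as below; all function names, types, and placeholder bodies are hypothetical:

```python
# Minimal structural sketch of the three-phase RLHF/RLAIF pipeline described above.
# Every name and body here is a hypothetical placeholder, not the paper's implementation.

from typing import Callable, List, Tuple

Prompt = str
Summary = str

def supervised_fine_tune(base_model: Callable[[Prompt], Summary],
                         demos: List[Tuple[Prompt, Summary]]) -> Callable[[Prompt], Summary]:
    """Phase 1: fine-tune the base LLM on (post, human-written summary) pairs."""
    # ... gradient updates on the demonstration data would go here ...
    return base_model  # placeholder: return the model unchanged

def train_reward_model(preferences: List[Tuple[Prompt, Summary, Summary, int]]
                       ) -> Callable[[Prompt, Summary], float]:
    """Phase 2: fit a reward model on preference labels.

    Each record is (prompt, summary_1, summary_2, preferred_index); the label
    comes from human annotators (RLHF) or from an off-the-shelf LLM (RLAIF).
    """
    def reward(prompt: Prompt, summary: Summary) -> float:
        return 0.0  # placeholder score
    return reward

def rl_fine_tune(policy: Callable[[Prompt], Summary],
                 reward: Callable[[Prompt, Summary], float],
                 prompts: List[Prompt]) -> Callable[[Prompt], Summary]:
    """Phase 3: optimize the SFT policy against the reward model with RL,
    typically with a penalty that keeps generations close to the SFT policy."""
    for p in prompts:
        _score = reward(p, policy(p))  # would feed the policy-gradient update
    return policy
```

The only difference between the RLHF and RLAIF arms of the experiment is the source of the preference labels fed into phase 2.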
To derive an AI preference label, the LLM is first prompted to explain in words which of the two candidate summaries is better. The LLM's response is then appended to the original prompt and fed to the LLM a second time to generate a preference distribution over "1" vs. "2" based on their log probabilities.
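As a small illustration (not the paper's implementation), turning the log-probabilities of the tokens "1" and "2" into a soft preference label amounts to a two-way softmax; the function name and the numbers below are made up for the example:

```python
import math
from typing import Tuple

def preference_from_logprobs(logprob_1: float, logprob_2: float) -> Tuple[float, float]:
    """Convert the LLM's log-probabilities for the tokens "1" and "2"
    into a soft preference distribution via a softmax over those two tokens."""
    m = max(logprob_1, logprob_2)          # subtract the max for numerical stability
    p1 = math.exp(logprob_1 - m)
    p2 = math.exp(logprob_2 - m)
    z = p1 + p2
    return p1 / z, p2 / z

# Hypothetical example: after seeing the prompt plus its own rationale,
# the model assigns log-prob -0.4 to "1" and -1.6 to "2".
soft_label = preference_from_logprobs(-0.4, -1.6)
print(soft_label)  # ~(0.77, 0.23): summary 1 is preferred with roughly 77% weight
```

These soft labels can then be used as training targets for the reward model in place of hard human preference labels.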
For their evaluations they used the filtered Reddit TL;DR dataset curated by OpenAI. TL;DR contains ~3 million posts from Reddit across a variety of topics, alongside summaries of the posts written by the original authors.
In conclusion, the researchers showed that RLAIF can produce improvements comparable to RLHF without depending on human annotators, and it greatly improves upon an SFT baseline. Additionally, in head-to-head comparisons, the experiments show that RLAIF and RLHF summaries are preferred by humans at similar rates.
The researchers also experimented with different techniques for generating the AI labels and conducted scaling studies to find the settings that produce the most human-aligned preferences.
Their future work might delve deeper into other tasks and into generalizability. Further analysis is still needed on resource optimization and on how a combination of RLHF and RLAIF could be used for better results through mechanisms of "self-improvement".