How to DETOX a language model? ☣️⚠️🤖
So essentially,
A language model tuned to avoid toxic language in English will also avoid it in other languages!
Paper: Preference Tuning For Toxicity Mitigation Generalizes Across Languages (19 pages)
Researchers from Brown University set out to make language models less toxic. They discovered that preference tuning with Direct Preference Optimization (DPO) on English-only training data significantly reduces the toxicity of LLMs' generations across 17 different languages, including Spanish, Russian, Chinese, and Korean.
Hmm..What’s the background?
Multilingual Large Language Models (LLMs) are being deployed globally, making it essential to ensure their safety across all languages. Previous research suggested that preference tuning an LLM to refuse dangerous requests might not generalize well across languages. Past efforts to reduce toxicity beyond English have therefore relied on translating toxic and non-toxic examples from English, which is resource-intensive, especially for languages with limited data. This study investigates whether training LLMs on English-only data can effectively reduce toxicity in their outputs across multiple languages, without the need for translation.
Ok, so what is proposed in the research paper?
The researchers used a technique called Direct Preference Optimization (DPO) to fine-tune several pre-trained multilingual LLMs.
They trained these models on a dataset of English prompts, each paired with a toxic (undesirable) and a non-toxic (desirable) response.
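Under the hood, DPO trains on these pairs with a simple contrastive loss: it rewards the model for assigning relatively more probability to the non-toxic response than a frozen reference copy of the model does. Here is a minimal PyTorch sketch of that objective, assuming the summed log-probabilities have already been computed from forward passes; `beta=0.1` is illustrative, and in practice libraries like Hugging Face TRL wrap all of this in a `DPOTrainer`.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss over a batch of preference pairs.

    Each argument is a tensor of summed log-probabilities that the policy
    (or the frozen reference) model assigns to the non-toxic ("chosen")
    or toxic ("rejected") response for each prompt.
    """
    # Implicit rewards: how far the policy has shifted away from the
    # reference model for each response type
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between non-toxic and toxic continuations
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```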
To test the effectiveness of this approach, they used RTP-LX, a benchmark of prompts designed to elicit potentially harmful responses in 17 different languages.
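To make the evaluation concrete, here is a hedged sketch of such a loop using Hugging Face transformers. `score_toxicity` is a hypothetical stand-in for an external toxicity classifier, and the generation settings are illustrative rather than the authors' exact setup.

```python
import torch

def mean_toxicity(model, tokenizer, prompts, score_toxicity, max_new_tokens=30):
    """Generate a continuation per RTP-LX prompt and average toxicity scores."""
    scores = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                    do_sample=True)
        # Keep only the newly generated tokens, not the prompt itself
        continuation = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                                        skip_special_tokens=True)
        scores.append(score_toxicity(continuation))  # hypothetical classifier
    return sum(scores) / len(scores)
```

Running this once on the base model and once on the DPO-tuned model, for each language's prompts, gives the before/after comparison described next.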
The researchers evaluated the toxicity, fluency, and diversity of the models' generations before and after fine-tuning. They also analyzed the inner workings of the LLMs, focusing on individual neurons and their activation patterns, to understand why English-only training reduces toxicity in other languages.
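Activation analyses like this are typically implemented with forward hooks that record what each MLP's nonlinearity outputs on a given input. A minimal sketch follows; the module path `model.model.layers[i].mlp.act_fn` assumes a LLaMA-style Hugging Face model and will differ for other architectures.

```python
import torch

def capture_mlp_activations(model, tokenizer, text):
    """Record per-layer MLP neuron activations for one piece of text."""
    activations = {}

    def make_hook(name):
        def hook(module, inputs, output):
            activations[name] = output.detach().cpu()
        return hook

    # Hook the MLP nonlinearity in every transformer block
    handles = [layer.mlp.act_fn.register_forward_hook(make_hook(f"layer_{i}"))
               for i, layer in enumerate(model.model.layers)]
    with torch.no_grad():
        model(**tokenizer(text, return_tensors="pt").to(model.device))
    for handle in handles:
        handle.remove()
    return activations  # {"layer_0": tensor, ...}
```

Comparing these patterns on toxic prompts before and after DPO reveals how the tuning changes neuron behavior.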
They found that within the models' MLP layers, which act as a kind of internal key-value memory, similar concepts across different languages are grouped together: categories like sexual content or political issues have associated words from various languages clustered in the same place. This discovery, which the researchers call the "dual multilinguality" of MLPs, demonstrates how LLMs can learn and apply knowledge across multiple languages, even without explicit training in each specific language.
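One standard way to surface this clustering is to project a neuron's value vector (the column of the MLP down-projection that the neuron writes into the residual stream) onto the vocabulary and read off the top-scoring tokens; for a multilingual neuron, those tokens span several languages at once. A sketch, again under the simplifying assumption of LLaMA-style attribute names:

```python
import torch

def top_tokens_for_neuron(model, tokenizer, layer_idx, neuron_idx, k=10):
    """List the vocabulary tokens a given MLP neuron promotes most strongly."""
    # Value vector: what this neuron adds to the residual stream when it fires
    value_vector = model.model.layers[layer_idx].mlp.down_proj.weight[:, neuron_idx]
    # Score every vocabulary token against the value vector (logit-lens style)
    logits = model.lm_head.weight @ value_vector
    top_ids = torch.topk(logits, k).indices
    return [tokenizer.decode(int(i)) for i in top_ids]
```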
What’s next?
Further investigation is needed to determine how well this method handles culturally specific toxicity, and future work should explore alternative methods for mitigating bias and promoting fairness in multilingual LLMs.