Over 1,000,000,000 Personas 🫨🤖
So essentially,
By simulating interactions with 1 billion potential users, Persona Hub helps AI anticipate needs and develop proactive solutions.
Paper:
Scaling Synthetic Data Creation with 1,000,000,000 Personas (20 Pages)
GitHub:
https://github.com/tencent-ailab/persona-hub
Researchers from Tencent AI Lab Seattle are addressing the limitations of existing methods for creating diverse synthetic data at scale, motivated by the need for a more versatile and scalable approach.
Hmm... What’s the background?
The paper discusses two main existing paradigms, instance-driven and key-point-driven. Both struggle to achieve scalability and diversity at the same time in synthetic data generation.
Instance-driven approaches rely on a seed corpus for diversification: every generated sample starts from a seed instance, so the diversity of the output is tied to the seed corpus.
Key-point-driven approaches diversify the synthetic data with a curated list of key points or concepts. This can be limiting, because coverage is only as broad as the curated list.
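To make the contrast concrete, here is a minimal sketch of how the two paradigms might build prompts. The function names, prompt wording, and sample data are all hypothetical and are not taken from the paper; the point is only that the number of distinct prompts is bounded by the size of the seed corpus or of the key-point list.

```python
# Hypothetical prompt builders; the wording is illustrative, not the paper's.

def instance_driven_prompt(seed_instance: str) -> str:
    """Ask for a new sample derived from one instance of a seed corpus."""
    return (
        "Here is an example problem:\n"
        f"{seed_instance}\n\n"
        "Write a new problem inspired by it."
    )


def key_point_driven_prompt(key_point: str) -> str:
    """Ask for a new sample built around one curated key point."""
    return f"Write a new problem that tests this concept: {key_point}"


# Toy data (made up for this example).
seed_corpus = ["What is 12 * 7?", "Solve x + 3 = 10 for x."]
key_points = ["fractions", "linear equations", "prime numbers"]

# One prompt per seed instance or key point: diversity cannot exceed the lists.
instance_prompts = [instance_driven_prompt(s) for s in seed_corpus]
key_point_prompts = [key_point_driven_prompt(k) for k in key_points]

print(len(instance_prompts), len(key_point_prompts))
```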
OK, so what is proposed in the research paper?
The paper introduces a novel persona-driven methodology for synthetic data creation, leveraging a massive collection of 1 billion diverse personas called Persona Hub. The collection, equivalent in size to approximately 13% of the world’s population (1 billion out of roughly 8 billion people), is automatically curated from massive web data using two scalable approaches: Text-to-Persona and Persona-to-Persona. Text-to-Persona derives personas from web texts by inferring the likely characteristics of someone who would read, write, like, or dislike that text. Persona-to-Persona expands the collection by deriving new personas through interpersonal relationships with personas already in Persona Hub.
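The two curation steps can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt templates, the function names, the `llm` callable (any function that maps a prompt string to a completion string), and the exact-match deduplication are all hypothetical simplifications.

```python
from typing import Callable, List

# Hypothetical templates; the paper's exact prompt wording is not reproduced here.
TEXT_TO_PERSONA = (
    "Who is likely to read, write, like, or dislike the following text? "
    "Describe that person in one sentence.\n\nText: {text}"
)
PERSONA_TO_PERSONA = (
    "Who is in a close relationship with the following person? "
    "Describe that person in one sentence.\n\nPersona: {persona}"
)


def text_to_persona(text: str, llm: Callable[[str], str]) -> str:
    """Infer a persona from a piece of web text."""
    return llm(TEXT_TO_PERSONA.format(text=text)).strip()


def persona_to_persona(persona: str, llm: Callable[[str], str]) -> str:
    """Derive a related persona from an existing one."""
    return llm(PERSONA_TO_PERSONA.format(persona=persona)).strip()


def expand(seeds: List[str], llm: Callable[[str], str], rounds: int = 1) -> List[str]:
    """Grow the persona list round by round, skipping exact duplicates.

    Exact-match deduplication is an assumption made to keep the sketch short.
    """
    seen = list(dict.fromkeys(seeds))  # ordered, duplicates removed
    frontier = list(seen)
    for _ in range(rounds):
        next_frontier = []
        for persona in frontier:
            new_persona = persona_to_persona(persona, llm)
            if new_persona and new_persona not in seen:
                seen.append(new_persona)
                next_frontier.append(new_persona)
        frontier = next_frontier
    return seen
```

In practice `llm` would wrap a call to whichever model you use; passing it in as a plain callable keeps the sketch independent of any particular API.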
The paper demonstrates Persona Hub's efficacy in synthesizing various data types, including math and logical reasoning problems, user instructions, knowledge-rich texts, game NPCs, and tool development prompts.
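The core idea of persona-driven synthesis is that the same task prompt yields different data once a persona is attached to it. Below is a small sketch of that idea; `build_persona_prompt`, `synthesize`, the prompt layout, and the `echo_llm` stand-in are assumptions made for illustration and do not come from the paper or its repository.

```python
from typing import Callable, Dict, List


def build_persona_prompt(task: str, persona: str) -> str:
    """Attach a persona to a task instruction (hypothetical layout)."""
    return f"{task}\n\nPersona: {persona}"


def synthesize(
    task: str, personas: List[str], llm: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Run one task once per persona and keep the persona next to each output."""
    return [
        {"persona": p, "output": llm(build_persona_prompt(task, p))}
        for p in personas
    ]


def echo_llm(prompt: str) -> str:
    """Stand-in for a real model: returns the last line of the prompt."""
    return prompt.splitlines()[-1]


rows = synthesize(
    "Create a challenging math problem.",
    ["a chemical kinetics researcher", "a high-school basketball coach"],
    echo_llm,
)
for row in rows:
    print(row["persona"], "->", row["output"])
```

Swapping `echo_llm` for a real model call is the only change needed to generate actual data; the task string can just as well ask for a user instruction, a knowledge-rich text, or an NPC description.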
Evaluation on mathematical reasoning tasks with Qwen2-7B, a 7B-parameter LLM, shows that after fine-tuning on Persona Hub-generated data the model performs comparably to larger models such as GPT-4-turbo-preview on the MATH benchmark.
The authors argue that Persona Hub facilitates a paradigm shift in data creation, potentially enabling LLMs to create diverse and high-quality data at scale, mitigating the reliance on human-generated data.
What’s next?
The authors acknowledge that the current persona descriptions in Persona Hub focus primarily on major aspects and lack fine-grained details. Future work aims to enrich these descriptions with more comprehensive information, similar to Wikipedia articles, enhancing their uniqueness and utility for personalized interactions.
Given the potential for misuse, future research should focus on establishing ethical guidelines and regulations for utilizing Persona Hub and similar technologies. This includes addressing data security concerns, ensuring fair competition in AI development, and mitigating the risks associated with replicating AI capabilities.