PersonaGym

Welcome to PersonaGym, a pipeline for generating personalized synthetic training data from diverse, persona-based user-AI interactions.

What is PersonaGym?

PersonaGym implements a 6-stage data generation pipeline that creates synthetic datasets by simulating user-AI conversations across many different personas. Unlike traditional data collection, which requires real user interactions, PersonaGym generates training data programmatically. Diversity-enforced persona sampling keeps the data varied, and layered semantic noise keeps user inputs realistically imperfect.

The pipeline enables:

  • Zero-shot transfer to new personas without fine-tuning

  • Robust model training through semantic noise injection

  • Cost-effective data generation at scale

Key Features

6-Stage Pipeline

Persona Generation → Query Adaptation → Interaction Simulation
→ Distractor Application → Training Data Export → Token Analysis
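
The stage order above can be pictured as a simple ordered plan. The sketch below is illustrative only: the stage identifiers and the `plan_stages` helper are hypothetical names, and it assumes that skipping noise injection simply drops the Distractor Application stage from the plan.

```python
# Hypothetical stage identifiers, listed in the order given above.
STAGES = [
    "persona_generation",
    "query_adaptation",
    "interaction_simulation",
    "distractor_application",
    "training_data_export",
    "token_analysis",
]


def plan_stages(skip_distractor=False):
    """Return the stages to run, in order (assumed behavior, not the real API)."""
    return [
        stage
        for stage in STAGES
        if not (skip_distractor and stage == "distractor_application")
    ]


full_plan = plan_stages()
reduced_plan = plan_stages(skip_distractor=True)
```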

Rich Persona System

  • 30+ dimensions across 5 categories (demographics, communication style, constraints, etc.)

  • Diversity-enforced sampling with configurable Hamming distance constraints

  • LLM-based system prompt generation from persona features
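
One way to picture diversity-enforced sampling is rejection sampling: draw a random persona and keep it only if it differs from every accepted persona on at least a minimum number of dimensions. This is a minimal sketch under that assumption, not PersonaGym's implementation. The dimension names, values, and the `sample_diverse` helper are hypothetical.

```python
import random

# Hypothetical persona dimensions; the real schema has 30+ dimensions.
DIMENSIONS = {
    "age_group": ["18-25", "26-40", "41-60", "60+"],
    "tone": ["formal", "casual", "terse"],
    "expertise": ["novice", "intermediate", "expert"],
    "constraint": ["none", "time-limited", "low-bandwidth"],
}


def hamming(a, b):
    """Count the dimensions on which two personas differ."""
    return sum(a[key] != b[key] for key in DIMENSIONS)


def sample_diverse(n, min_distance, rng, max_tries=10_000):
    """Rejection-sample up to n personas that are pairwise at least min_distance apart."""
    accepted = []
    for _ in range(max_tries):
        if len(accepted) == n:
            break
        candidate = {key: rng.choice(values) for key, values in DIMENSIONS.items()}
        if all(hamming(candidate, persona) >= min_distance for persona in accepted):
            accepted.append(candidate)
    return accepted


personas = sample_diverse(n=5, min_distance=2, rng=random.Random(0))
```

Raising `min_distance` yields more distinct personas but rejects more candidates, so the try budget may run out before `n` personas are found.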

Three-Layer Semantic Noise

  • Surface Noise (50%): Intent/slots preserved, surface form changes

  • Incomplete Info (30%): Intent clear, slots missing/vague

  • Semantic Ambiguity (20%): Intent uncertain or multiple intents

  • 9+ noise strategies per layer for realistic imperfect user inputs
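
If the percentages above are read as per-query sampling probabilities (an assumption, since the source does not say how they are applied), choosing a noise layer reduces to a weighted random draw. The layer keys and the `pick_noise_layer` helper below are hypothetical names.

```python
import random
from collections import Counter

# Layer weights taken from the list above; the keys are illustrative names.
NOISE_LAYERS = {
    "surface_noise": 0.5,
    "incomplete_info": 0.3,
    "semantic_ambiguity": 0.2,
}


def pick_noise_layer(rng):
    """Draw one noise layer according to its weight."""
    names = list(NOISE_LAYERS)
    weights = list(NOISE_LAYERS.values())
    return rng.choices(names, weights=weights, k=1)[0]


# Over many draws, the layer frequencies approach the configured weights.
rng = random.Random(42)
counts = Counter(pick_noise_layer(rng) for _ in range(10_000))
```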

Multi-Provider LLM Support

  • OpenAI (GPT-4o, GPT-5.2, etc.)

  • Anthropic (Claude via OpenRouter)

  • OpenRouter (Grok, Gemini, Llama, Mistral, etc.)

  • Weighted model pool for diverse assistant responses

Comprehensive Token Tracking

  • Per-module, per-model statistics

  • Cost analysis and reporting

  • JSON export for downstream analysis
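
Per-module, per-model tracking can be sketched as a counter keyed by (module, model) pairs. The `TokenTracker` class, its method names, the model names, and the prices below are all hypothetical and do not reflect PersonaGym's actual tracker or any provider's real pricing.

```python
import json
from collections import defaultdict


class TokenTracker:
    """Accumulate token usage per (module, model) pair and estimate cost."""

    def __init__(self, price_per_1k):
        # Hypothetical prices in USD per 1,000 tokens, keyed by model name.
        self.price_per_1k = price_per_1k
        self.usage = defaultdict(
            lambda: {"prompt_tokens": 0, "completion_tokens": 0, "calls": 0}
        )

    def record(self, module, model, prompt_tokens, completion_tokens):
        entry = self.usage[(module, model)]
        entry["prompt_tokens"] += prompt_tokens
        entry["completion_tokens"] += completion_tokens
        entry["calls"] += 1

    def report(self):
        """Return one row per (module, model), sorted by module then model."""
        rows = []
        for (module, model), used in sorted(self.usage.items()):
            total = used["prompt_tokens"] + used["completion_tokens"]
            # Models without a known price are reported with zero cost.
            cost = total / 1000 * self.price_per_1k.get(model, 0.0)
            rows.append({
                "module": module,
                "model": model,
                **used,
                "total_tokens": total,
                "estimated_cost_usd": round(cost, 6),
            })
        return rows

    def to_json(self):
        return json.dumps(self.report(), indent=2)


tracker = TokenTracker(price_per_1k={"model-a": 0.01})
tracker.record("persona_generation", "model-a", 1200, 300)
tracker.record("persona_generation", "model-a", 800, 200)
tracker.record("interaction_simulation", "model-b", 500, 500)
report = tracker.report()
```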

Quick Example

from src.enhanced_pipeline import EnhancedPersonaGenerationPipeline

# Initialize and run the full pipeline
pipeline = EnhancedPersonaGenerationPipeline("config.yaml")
result = pipeline.run(num_personas=5)

print(f"Generated {len(result['personas'])} personas")
print(f"Created {len(result['interactions'])} interactions")
print(f"Exported {result['training_data']['total_samples']} training samples")

Command Line Usage:

# Run full pipeline with 10 personas
python run.py --num-personas 10

# Run only persona generation
python run.py --stage persona --num-personas 100

# Skip noise injection (the Distractor Application stage)
python run.py --skip-distractor --num-personas 5

Installation

# Clone the repository
git clone https://github.com/yccm/LLM_PPOpt.git
cd LLM_PPOpt

# Install dependencies
pip install -r requirements.txt

# Set up API keys
export OPENAI_API_KEY="your-api-key"

See the Installation guide for detailed instructions.

License

PersonaGym is released under the MIT License. See License for details.

Citation

If you use this work, please cite:

@article{ma2026synthetic,
  title={Synthetic Interaction Data for Scalable Personalization in Large Language Models},
  author={Ma, Yuchen and Huang, Yue and Wang, Wenjie and Luo, Xiaonan and Zhang, Xiangliang and Feuerriegel, Stefan},
  year={2026}
}
