Quick Start
This guide will help you get started with PersonaGym in minutes.
5-Minute Introduction
Here’s a complete workflow from configuration to training data generation:
from src.enhanced_pipeline import EnhancedPersonaGenerationPipeline
# Initialize pipeline with configuration
pipeline = EnhancedPersonaGenerationPipeline("config.yaml")
# Run the complete 6-stage pipeline
result = pipeline.run(num_personas=5)
# Access results
print(f"Personas: {len(result['personas'])}")
print(f"Interactions: {len(result['interactions'])}")
print(f"Training samples: {result['training_data']['total_samples']}")
Or use the command line:
python run.py --num-personas 5
Core Workflows
1. Basic Pipeline (Persona Generation Only)
For generating personas without interactions:
python run.py --mode basic --num-personas 100
from src.pipeline import PersonaGenerationPipeline
pipeline = PersonaGenerationPipeline("config.yaml")
result = pipeline.run(
num_personas=100,
generate_prompts=True,
export_dataset=True
)
print(f"Generated {result['num_personas']} personas")
2. Enhanced Pipeline (Full Data Generation)
For complete training data with interactions and noise:
python run.py --mode enhanced --num-personas 10
from src.enhanced_pipeline import EnhancedPersonaGenerationPipeline
pipeline = EnhancedPersonaGenerationPipeline("config.yaml")
result = pipeline.run(num_personas=10)
# Results include all stages
personas = result['personas'] # List of persona specs
queries = result['queries'] # Dict: persona_id -> queries
interactions = result['interactions'] # List of interactions
training_data = result['training_data'] # Export statistics
3. Stage-by-Stage Execution
Run specific stages independently:
# Only persona generation
python run.py --stage persona --num-personas 50
# Only query generation (requires existing personas)
python run.py --stage query
# Only interaction generation
python run.py --stage interaction
# Only training data export
python run.py --stage training
4. Skip Specific Features
# Skip style transfer in query generation
python run.py --skip-query-transfer --num-personas 5
# Skip noise injection
python run.py --skip-distractor --num-personas 5
Configuration Overview
The main configuration file config.yaml controls all pipeline behavior:
# API Configuration
api:
provider: "openai"
api_key: "${OPENAI_API_KEY}" # From environment
model: "gpt-4o-mini"
inference:
temperature: 0.7
max_completion_tokens: 2048
# Persona Generation
persona_generation:
num_personas: 1000
sampling_strategy: "random"
feature_availability_rate: 0.7
diversity:
enabled: true
min_hamming_distance: 3
# Query Generation
query_generation:
selection:
queries_per_persona: 5
style_transfer:
enabled: true
transfer_probability: 0.5
# Interaction Generation
interaction_generation:
min_turns: 2
max_turns: 5
max_workers: 20
# Distractor (Noise Injection)
distractor:
enabled: true
use_semantic: true
activation_probability: 0.25
# Experiment Settings
experiment:
seed: 42
incremental: true
Understanding Output
Output Directory Structure
After running the pipeline:
output/
├── personas/
│ ├── persona_20260206_001.json
│ ├── persona_20260206_002.json
│ └── ...
├── queries/
│ ├── persona_20260206_001.json
│ └── ...
├── interactions/
│ ├── interaction_persona_001_*.json
│ ├── index.json
│ └── ...
└── training_data/
├── train_samples_20260206_123456.json
├── statistics.json
└── token_usage_20260206_123456.json
Sample Output Formats
Persona Spec (output/personas/persona_*.json):
{
"persona_id": "persona_20260206_001",
"features": {
"age_band": "25_34",
"role": "engineer",
"communication_style": "casual",
"response_length": "medium",
"technical_level": "advanced"
},
"system_prompt": "You are assisting a 25-34 year old engineer...",
"metadata": {
"created_at": "2026-02-06T10:30:00",
"num_features": 15
}
}
Training Sample (output/training_data/train_samples_*.json):
{
"sample_id": "sample_20260206_001",
"persona_id": "persona_001",
"persona_features": {...},
"original_query": "Help me write a Python script",
"initial_query": "hey can u help me write some python code",
"prompt_trajectory": [
"hey can u help me write some python code",
"make it handle errors better",
"looks good thanks"
],
"full_conversation": [
{"role": "user", "content": "hey can u help me..."},
{"role": "assistant", "content": "Sure! Here's a..."},
...
],
"num_turns": 3
}
Common Patterns
Pattern 1: End-to-End Pipeline
from src.enhanced_pipeline import EnhancedPersonaGenerationPipeline
# Full pipeline with all stages
pipeline = EnhancedPersonaGenerationPipeline("config.yaml")
result = pipeline.run(num_personas=100)
# Check token usage
if 'training_data' in result:
print(f"Total samples: {result['training_data']['total_samples']}")
Pattern 2: Incremental Generation
Resume from previous runs (skips existing outputs):
# In config.yaml
experiment:
incremental: true
# First run: generates 50 personas
python run.py --num-personas 50
# Second run: skips existing 50, generates 50 more
python run.py --num-personas 100
Pattern 3: Custom Persona Features
Modify input/persona.yaml to customize dimensions:
dimensions:
role:
name: role
is_constraint: true
values:
- student
- researcher
- developer
- manager
- custom_role # Add custom values
Pattern 4: Multi-Provider Model Pool
Use diverse models for assistant responses:
# In config.yaml
interaction_generation:
assistant_model:
model_pool:
- provider: openai
model: gpt-4o-mini
weight: 0.3
- provider: openrouter
model: anthropic/claude-3.5-haiku
weight: 0.3
- provider: openrouter
model: google/gemini-2.0-flash-exp
weight: 0.4
Next Steps
Deep Dive: Check out Tutorials for step-by-step guides
Configuration: See Configuration Guide for all options
Persona System: Learn about Persona Dimensions
API Reference: Explore the API Documentation
Troubleshooting
Empty Responses
If you see “Assistant returned empty response”:
Check your API key has sufficient credits
Verify the model name is correct
Try reducing
max_workersto avoid rate limits
Slow Generation
For faster generation:
experiment:
batch_size: 10 # Process in batches
interaction_generation:
max_workers: 10 # Parallel workers
Memory Issues
For large-scale generation:
# Generate in batches
python run.py --num-personas 100 # First batch
python run.py --num-personas 200 # Incremental adds 100 more
For more help, see the Installation guide’s troubleshooting section.