Quick Start

This guide will help you get started with PersonaGym in minutes.

5-Minute Introduction

Here’s a complete workflow from configuration to training data generation:

from src.enhanced_pipeline import EnhancedPersonaGenerationPipeline

# Initialize pipeline with configuration
pipeline = EnhancedPersonaGenerationPipeline("config.yaml")

# Run the complete 6-stage pipeline
result = pipeline.run(num_personas=5)

# Access results
print(f"Personas: {len(result['personas'])}")
print(f"Interactions: {len(result['interactions'])}")
print(f"Training samples: {result['training_data']['total_samples']}")

Or use the command line:

python run.py --num-personas 5

Core Workflows

1. Basic Pipeline (Persona Generation Only)

For generating personas without interactions:

python run.py --mode basic --num-personas 100

Or from Python:

from src.pipeline import PersonaGenerationPipeline

pipeline = PersonaGenerationPipeline("config.yaml")
result = pipeline.run(
    num_personas=100,
    generate_prompts=True,
    export_dataset=True
)

print(f"Generated {result['num_personas']} personas")

2. Enhanced Pipeline (Full Data Generation)

For complete training data with interactions and noise:

python run.py --mode enhanced --num-personas 10

Or from Python:

from src.enhanced_pipeline import EnhancedPersonaGenerationPipeline

pipeline = EnhancedPersonaGenerationPipeline("config.yaml")
result = pipeline.run(num_personas=10)

# Results include all stages
personas = result['personas']           # List of persona specs
queries = result['queries']             # Dict: persona_id -> queries
interactions = result['interactions']   # List of interactions
training_data = result['training_data'] # Export statistics

3. Stage-by-Stage Execution

Run specific stages independently:

# Only persona generation
python run.py --stage persona --num-personas 50

# Only query generation (requires existing personas)
python run.py --stage query

# Only interaction generation
python run.py --stage interaction

# Only training data export
python run.py --stage training

4. Skip Specific Features

# Skip style transfer in query generation
python run.py --skip-query-transfer --num-personas 5

# Skip noise injection
python run.py --skip-distractor --num-personas 5

Configuration Overview

The main configuration file config.yaml controls all pipeline behavior:

# API Configuration
api:
  provider: "openai"
  api_key: "${OPENAI_API_KEY}"  # From environment
  model: "gpt-4o-mini"
  inference:
    temperature: 0.7
    max_completion_tokens: 2048

# Persona Generation
persona_generation:
  num_personas: 1000
  sampling_strategy: "random"
  feature_availability_rate: 0.7
  diversity:
    enabled: true
    min_hamming_distance: 3

# Query Generation
query_generation:
  selection:
    queries_per_persona: 5
  style_transfer:
    enabled: true
    transfer_probability: 0.5

# Interaction Generation
interaction_generation:
  min_turns: 2
  max_turns: 5
  max_workers: 20

# Distractor (Noise Injection)
distractor:
  enabled: true
  use_semantic: true
  activation_probability: 0.25

# Experiment Settings
experiment:
  seed: 42
  incremental: true

Understanding Output

Output Directory Structure

After running the pipeline:

output/
├── personas/
│   ├── persona_20260206_001.json
│   ├── persona_20260206_002.json
│   └── ...
├── queries/
│   ├── persona_20260206_001.json
│   └── ...
├── interactions/
│   ├── interaction_persona_001_*.json
│   ├── index.json
│   └── ...
└── training_data/
    ├── train_samples_20260206_123456.json
    ├── statistics.json
    └── token_usage_20260206_123456.json
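To check quickly what a run produced, you can count the JSON files in each output folder. The sketch below assumes only the directory layout shown above. The function name count_outputs is made up for illustration.

```python
from pathlib import Path

# Folder names taken from the directory tree shown in this guide.
OUTPUT_FOLDERS = ("personas", "queries", "interactions", "training_data")


def count_outputs(root="output"):
    """Return {folder_name: number_of_json_files} for each output folder.

    Missing folders are reported as 0. Every *.json file is counted, so
    helper files such as interactions/index.json are included.
    """
    root = Path(root)
    counts = {}
    for name in OUTPUT_FOLDERS:
        folder = root / name
        counts[name] = len(list(folder.glob("*.json"))) if folder.is_dir() else 0
    return counts


print(count_outputs("output"))
```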

Sample Output Formats

Persona Spec (output/personas/persona_*.json):

{
  "persona_id": "persona_20260206_001",
  "features": {
    "age_band": "25_34",
    "role": "engineer",
    "communication_style": "casual",
    "response_length": "medium",
    "technical_level": "advanced"
  },
  "system_prompt": "You are assisting a 25-34 year old engineer...",
  "metadata": {
    "created_at": "2026-02-06T10:30:00",
    "num_features": 15
  }
}
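The features dictionary is also what a diversity setting such as min_hamming_distance would plausibly compare. This guide does not show the actual diversity check, so the following is only a sketch of a Hamming distance over feature dictionaries. Both function names are hypothetical.

```python
_MISSING = object()  # sentinel for features that one persona lacks


def hamming_distance(features_a, features_b):
    """Count the feature keys whose values differ between two personas.

    A key present in only one of the two dictionaries counts as a difference.
    """
    keys = set(features_a) | set(features_b)
    return sum(
        1
        for key in keys
        if features_a.get(key, _MISSING) != features_b.get(key, _MISSING)
    )


def is_diverse(candidate, accepted, min_distance=3):
    """Return True if candidate differs from every accepted persona by at
    least min_distance features (assumed meaning of min_hamming_distance)."""
    return all(
        hamming_distance(candidate, other) >= min_distance for other in accepted
    )


existing = {
    "age_band": "25_34",
    "role": "engineer",
    "communication_style": "casual",
    "response_length": "medium",
    "technical_level": "advanced",
}
# Same persona except for role and communication_style.
candidate = dict(existing, role="student", communication_style="formal")

print(hamming_distance(existing, candidate))  # 2
print(is_diverse(candidate, [existing]))      # False, since 2 < 3
```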

Training Sample (output/training_data/train_samples_*.json):

{
  "sample_id": "sample_20260206_001",
  "persona_id": "persona_001",
  "persona_features": {...},
  "original_query": "Help me write a Python script",
  "initial_query": "hey can u help me write some python code",
  "prompt_trajectory": [
    "hey can u help me write some python code",
    "make it handle errors better",
    "looks good thanks"
  ],
  "full_conversation": [
    {"role": "user", "content": "hey can u help me..."},
    {"role": "assistant", "content": "Sure! Here's a..."},
    ...
  ],
  "num_turns": 3
}
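In the example above, num_turns equals the length of prompt_trajectory, and initial_query is the first entry of the trajectory. Assuming those two relationships hold for every sample (they are inferred from this single example, not from a documented schema), a small sanity check might look like this. The function name check_sample is hypothetical.

```python
def check_sample(sample):
    """Return a list of consistency problems found in one training sample.

    The rules are assumptions inferred from the example in this guide.
    """
    problems = []
    trajectory = sample.get("prompt_trajectory", [])
    if sample.get("num_turns") != len(trajectory):
        problems.append("num_turns does not match prompt_trajectory length")
    if trajectory and trajectory[0] != sample.get("initial_query"):
        problems.append("initial_query is not the first prompt in the trajectory")
    return problems


sample = {
    "initial_query": "hey can u help me write some python code",
    "prompt_trajectory": [
        "hey can u help me write some python code",
        "make it handle errors better",
        "looks good thanks",
    ],
    "num_turns": 3,
}
print(check_sample(sample))  # []
```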

Common Patterns

Pattern 1: End-to-End Pipeline

from src.enhanced_pipeline import EnhancedPersonaGenerationPipeline

# Full pipeline with all stages
pipeline = EnhancedPersonaGenerationPipeline("config.yaml")
result = pipeline.run(num_personas=100)

# Report how many training samples were exported
if 'training_data' in result:
    print(f"Total samples: {result['training_data']['total_samples']}")

Pattern 2: Incremental Generation

Resume from previous runs (skips existing outputs):

# In config.yaml
experiment:
  incremental: true

Then run:

# First run: generates 50 personas
python run.py --num-personas 50

# Second run: skips existing 50, generates 50 more
python run.py --num-personas 100
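The arithmetic behind the two runs is simple: the target count minus the personas already on disk. This guide does not show the pipeline's own incremental logic, so the sketch below only illustrates the idea. It assumes the output/personas/persona_*.json layout shown earlier, and the function name personas_to_generate is hypothetical.

```python
from pathlib import Path


def personas_to_generate(target, personas_dir="output/personas"):
    """Return how many new personas are needed to reach the target count.

    Existing personas are counted as persona_*.json files in personas_dir.
    The result is never negative.
    """
    folder = Path(personas_dir)
    existing = len(list(folder.glob("persona_*.json"))) if folder.is_dir() else 0
    return max(target - existing, 0)


# With 50 persona files already present, a target of 100 leaves 50 to generate.
print(personas_to_generate(100))
```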

Pattern 3: Custom Persona Features

Modify input/persona.yaml to customize dimensions:

dimensions:
  role:
    name: role
    is_constraint: true
    values:
      - student
      - researcher
      - developer
      - manager
      - custom_role  # Add custom values

Pattern 4: Multi-Provider Model Pool

Use diverse models for assistant responses:

# In config.yaml
interaction_generation:
  assistant_model:
    model_pool:
      - provider: openai
        model: gpt-4o-mini
        weight: 0.3
      - provider: openrouter
        model: anthropic/claude-3.5-haiku
        weight: 0.3
      - provider: openrouter
        model: google/gemini-2.0-flash-exp
        weight: 0.4
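One plausible reading of the weight field is that each assistant response draws a model at random in proportion to its weight. The selection code is not shown in this guide, so the following is only a sketch of weighted sampling with the standard library. The function name pick_model is hypothetical.

```python
import random

# Mirrors the model_pool entries from the config example above.
MODEL_POOL = [
    {"provider": "openai", "model": "gpt-4o-mini", "weight": 0.3},
    {"provider": "openrouter", "model": "anthropic/claude-3.5-haiku", "weight": 0.3},
    {"provider": "openrouter", "model": "google/gemini-2.0-flash-exp", "weight": 0.4},
]


def pick_model(pool, rng=random):
    """Pick one pool entry with probability proportional to its weight.

    Assumed behaviour, not confirmed PersonaGym code. Pass a seeded
    random.Random instance as rng for reproducible choices.
    """
    weights = [entry["weight"] for entry in pool]
    return rng.choices(pool, weights=weights, k=1)[0]


rng = random.Random(42)  # seeded, in the spirit of experiment.seed
choice = pick_model(MODEL_POOL, rng)
print(choice["provider"], choice["model"])
```

random.choices normalizes the weights itself, so they do not need to sum to 1.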

Troubleshooting

Empty Responses

If you see “Assistant returned empty response”:

  1. Check that your API key has sufficient credits

  2. Verify the model name is correct

  3. Try reducing max_workers to avoid rate limits

Slow Generation

To speed up generation, adjust batching and parallelism in config.yaml:

experiment:
  batch_size: 10  # Process in batches

interaction_generation:
  max_workers: 10  # Parallel workers

Memory Issues

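For large-scale generation, build the dataset up over several smaller runs. This relies on incremental mode (experiment.incremental: true), so each run skips outputs that already exist: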

# Generate in batches
python run.py --num-personas 100  # First batch
python run.py --num-personas 200  # Incremental adds 100 more

For more help, see the Installation guide’s troubleshooting section.