Tutorial 1: Basics

Learn the core concepts of PersonaGym and run your first pipeline.

Learning Objectives

By the end of this tutorial, you will:

Understand the 6-stage pipeline architecture
Run a basic pipeline execution
Interpret the output files

Pipeline Overview

PersonaGym generates synthetic training data through 6 stages:

Persona Generation    → Create diverse user profiles
System Prompt         → Generate personalized prompts
Query Generation      → Assign and adapt queries
Interaction           → Simulate conversations
Distractor            → Add realistic noise
Training Data Export  → Output structured samples

Step 1: Configuration

First, review the main configuration file:

# config.yaml (key sections)

api:
  provider: "openai"
  model: "gpt-4o-mini"

persona_generation:
  num_personas: 10           # Start small

query_generation:
  selection:
    queries_per_persona: 3   # 3 queries per persona

interaction_generation:
  min_turns: 2
  max_turns: 4

distractor:
  enabled: true
  activation_probability: 0.25
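
Before running, it can help to estimate how much work these settings imply. The sketch below is a back-of-the-envelope calculation, not part of PersonaGym: `estimate_workload` is a hypothetical helper, and it assumes one interaction is generated per persona/query pair, which matches the sample log in Step 2 but is not stated by the configuration itself.

```python
def estimate_workload(num_personas, queries_per_persona, min_turns, max_turns):
    """Rough run-size estimate from config values (hypothetical helper)."""
    # Assumption: one interaction is generated per (persona, query) pair.
    interactions = num_personas * queries_per_persona
    return {
        "interactions": interactions,
        "min_total_turns": interactions * min_turns,
        "max_total_turns": interactions * max_turns,
    }


# Values taken from the config.yaml excerpt above.
estimate = estimate_workload(num_personas=10, queries_per_persona=3,
                             min_turns=2, max_turns=4)
print(estimate)
```

With 5 personas instead of 10, the same arithmetic gives 15 interactions.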

Step 2: First Run

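Run the pipeline with a small number of personas, passing --num-personas 5 on the command line instead of relying on the num_personas value in config.yaml: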

python run.py --num-personas 5

Expected output (timings, API call counts, and token totals will vary from run to run):

================================================================================
Starting Enhanced Persona Generation Pipeline
Mode: INCREMENTAL (will skip existing outputs)
================================================================================

[Stage 1] Generating Personas...
Generating personas: 100%|████████████████████| 5/5 [00:15<00:00]
[OK] Generated 5 personas

[Stage 2] Generating Queries - batch 1...
[OK] Generated queries for 5 personas

[Stage 3] Generating Interactions - batch 1...
Generating interactions: 100%|████████████████████| 15/15 [01:30<00:00]
[OK] Generated 15 interactions (batch 1)

[Stage 4] Distractor Application - SKIPPED (already applied in Stage 3)

[Stage 5] Collecting and Exporting Training Data...
[OK] Exported training data

================================================================================
Pipeline Complete
================================================================================

PIPELINE SUMMARY:
  Personas Generated: 5
  Queries Generated: 15
  Interactions Generated: 15
  Training Samples: 15

TOKEN USAGE SUMMARY:
  Total API Calls: 85
  Total Input Tokens: 42,500
  Total Output Tokens: 25,000
  Total Tokens: 67,500
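
The token summary is easier to compare across runs when it is normalized. The following sketch simply redoes the arithmetic on the sample figures above; the variable names are illustrative, and real runs will report different numbers.

```python
# Figures copied from the sample summary above; real runs will differ.
api_calls = 85
input_tokens = 42_500
output_tokens = 25_000
training_samples = 15

total_tokens = input_tokens + output_tokens
tokens_per_call = total_tokens / api_calls
tokens_per_sample = total_tokens / training_samples
calls_per_sample = api_calls / training_samples

print(f"Total tokens:      {total_tokens:,}")
print(f"Tokens per call:   {tokens_per_call:.0f}")
print(f"Tokens per sample: {tokens_per_sample:.0f}")
print(f"Calls per sample:  {calls_per_sample:.1f}")
```

Per-sample figures like these are a convenient baseline when scaling up the number of personas.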

Step 3: Explore Output

After running, check the output directory:

output/
├── personas/
│   ├── persona_20260206_001.json
│   ├── persona_20260206_002.json
│   └── ...
├── queries/
│   ├── persona_20260206_001.json
│   └── ...
├── interactions/
│   ├── interaction_persona_001_*.json
│   ├── index.json
│   └── ...
└── training_data/
    ├── train_samples_20260206_*.json
    ├── statistics.json
    └── token_usage_*.json
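
A quick way to confirm that a run produced what you expect is to count the JSON files in each subdirectory. This is a minimal sketch using only the directory names from the tree above; `count_outputs` is a hypothetical helper, not a PersonaGym function. Note that the count for interactions/ includes index.json.

```python
from pathlib import Path


def count_outputs(output_dir="output"):
    """Count *.json files in each output subdirectory shown in the tree above."""
    root = Path(output_dir)
    counts = {}
    for name in ("personas", "queries", "interactions", "training_data"):
        subdir = root / name
        # A missing directory counts as zero files rather than raising.
        counts[name] = len(list(subdir.glob("*.json"))) if subdir.is_dir() else 0
    return counts


for name, n in count_outputs().items():
    print(f"{name:<14} {n}")
```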

View a Persona

cat output/personas/persona_20260206_001.json

{
  "persona_id": "persona_20260206_001",
  "features": {
    "age_band": "25_34",
    "role": "engineer",
    "communication_style": "casual",
    "response_length": "medium"
  },
  "system_prompt": "You are helping a 25-34 year old engineer...",
  "metadata": {
    "created_at": "2026-02-06T10:30:00"
  }
}
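
If you prefer to inspect personas from Python, something like the following may be useful. It relies only on the fields visible in the example above (persona_id, features, metadata.created_at); `describe_persona` is a hypothetical helper, and real persona files may contain additional fields.

```python
import json


def describe_persona(persona):
    """Return a short, human-readable summary of one persona record."""
    header = f"{persona['persona_id']} (created {persona['metadata']['created_at']})"
    feature_lines = [f"  {key}: {value}"
                     for key, value in sorted(persona["features"].items())]
    return "\n".join([header, *feature_lines])


# Inline copy of the example above. To read a real file instead, open it
# with encoding="utf-8" and pass the file object to json.load().
sample = json.loads("""
{
  "persona_id": "persona_20260206_001",
  "features": {
    "age_band": "25_34",
    "role": "engineer",
    "communication_style": "casual",
    "response_length": "medium"
  },
  "system_prompt": "You are helping a 25-34 year old engineer...",
  "metadata": {"created_at": "2026-02-06T10:30:00"}
}
""")
print(describe_persona(sample))
```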

View an Interaction

cat output/interactions/interaction_persona_001_*.json

Key fields:

messages: The conversation turns
num_turns: Number of exchanges
metadata.distractor_applied: Whether noise was added
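
The fields listed above can be aggregated across all interaction files, for example to see how often the distractor actually fired. This is a sketch under assumptions: each interaction_*.json file is taken to hold a single JSON object, and distractor_applied is taken to be a boolean under metadata. `summarize_interactions` is a hypothetical helper.

```python
import json
from pathlib import Path


def summarize_interactions(interactions_dir="output/interactions"):
    """Aggregate num_turns and distractor usage across interaction files."""
    total = turns = distracted = 0
    # index.json is skipped because the pattern only matches interaction_*.json.
    for path in sorted(Path(interactions_dir).glob("interaction_*.json")):
        with path.open(encoding="utf-8") as f:
            record = json.load(f)
        total += 1
        turns += record.get("num_turns", 0)
        # Assumption: distractor_applied is a boolean under "metadata".
        if record.get("metadata", {}).get("distractor_applied"):
            distracted += 1
    return {
        "interactions": total,
        "avg_turns": turns / total if total else 0.0,
        "distractor_rate": distracted / total if total else 0.0,
    }


print(summarize_interactions())
```

On a large enough run, distractor_rate can be compared against the configured activation_probability, although how that probability is applied internally is not covered here.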

View Training Data

cat output/training_data/train_samples_*.json | head -50

Key fields:

prompt_trajectory: All user prompts in order
full_conversation: Complete dialogue
noisy_initial_queries: Noisy variations of the initial query

Step 4: Programmatic Usage

You can also run the pipeline from Python:

from src.enhanced_pipeline import EnhancedPersonaGenerationPipeline

# Initialize
pipeline = EnhancedPersonaGenerationPipeline("config.yaml")

# Run
result = pipeline.run(num_personas=5)

# Access results
print(f"Personas: {len(result['personas'])}")
print(f"Interactions: {len(result['interactions'])}")

# Inspect first persona
persona = result['personas'][0]
print(f"ID: {persona['persona_id']}")
print(f"Features: {persona['features']}")
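
Once the personas are in memory, it is straightforward to check how feature values are distributed. The sketch below assumes only that each persona is a dict with a features mapping, as in the snippet above; `feature_distribution` is a hypothetical helper, and the stand-in data is illustrative rather than real pipeline output.

```python
from collections import Counter


def feature_distribution(personas, feature):
    """Count how often each value of `feature` occurs across personas."""
    return Counter(p["features"].get(feature, "<missing>") for p in personas)


# Stand-in for result['personas']; the feature values are illustrative only.
personas = [
    {"persona_id": "p1", "features": {"role": "engineer", "communication_style": "casual"}},
    {"persona_id": "p2", "features": {"role": "engineer", "communication_style": "formal"}},
    {"persona_id": "p3", "features": {"role": "teacher", "communication_style": "casual"}},
]
for value, count in feature_distribution(personas, "role").most_common():
    print(f"{value}: {count}")
```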

Key Concepts

Incremental Mode

By default, the pipeline runs in incremental mode and skips outputs that already exist. The corresponding configuration setting is:

experiment:
  incremental: true

Run again with more personas:

python run.py --num-personas 10
# Only generates 5 new personas (5 already exist)
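
The arithmetic behind that comment can be modeled as follows. This is only an illustration of the behavior described above, not the pipeline's actual implementation: it assumes existing personas are detected by counting persona_*.json files, and `personas_to_generate` is a hypothetical helper.

```python
from pathlib import Path


def personas_to_generate(requested, personas_dir="output/personas"):
    """Return how many new personas an incremental run would need to create."""
    directory = Path(personas_dir)
    existing = len(list(directory.glob("persona_*.json"))) if directory.is_dir() else 0
    # Never negative: requesting fewer than already exist generates nothing.
    return max(0, requested - existing)


print(personas_to_generate(10))
```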

Stage Control

Use the --stage flag to run a single stage:

# Only persona generation
python run.py --stage persona --num-personas 20

# Only training data export
python run.py --stage training

Skip Features

Disable optional processing steps with the skip flags:

# Skip noise injection
python run.py --skip-distractor

# Skip query style transfer
python run.py --skip-query-transfer
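
When scripting several runs, the flags above can be assembled in Python. The sketch below uses only the flags shown in this tutorial; `build_command` is a hypothetical helper, and combining flags in a single invocation (as in the third example) is an assumption that has not been verified against run.py.

```python
import sys


def build_command(stage=None, num_personas=None,
                  skip_distractor=False, skip_query_transfer=False):
    """Build an argument list for run.py using only the flags shown above."""
    cmd = [sys.executable, "run.py"]
    if stage is not None:
        cmd += ["--stage", stage]
    if num_personas is not None:
        cmd += ["--num-personas", str(num_personas)]
    if skip_distractor:
        cmd.append("--skip-distractor")
    if skip_query_transfer:
        cmd.append("--skip-query-transfer")
    return cmd


# Print the commands instead of executing them; pass a list to
# subprocess.run(cmd, check=True) to actually launch a run.
for cmd in (
    build_command(stage="persona", num_personas=20),
    build_command(stage="training"),
    build_command(num_personas=5, skip_distractor=True),
):
    print(" ".join(cmd[1:]))
```

Building the argument list separately from executing it keeps the commands easy to log and review before any API calls are made.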

Exercises

Run with 10 personas and compare the token usage with the 5-persona run
Disable the distractor (--skip-distractor) and observe the difference in the outputs
Inspect output/training_data/statistics.json to understand the data distribution

Next Steps

Continue to Tutorial 2: Persona Generation to learn about customizing personas.