Tutorial 1: Basics

Learn the core concepts of PersonaGym and run your first pipeline.

Learning Objectives

By the end of this tutorial, you will:

  • Understand the 6-stage pipeline architecture

  • Run a basic pipeline execution

  • Interpret the output files

Pipeline Overview

PersonaGym generates synthetic training data through 6 stages:

1. Persona Generation    → Create diverse user profiles
2. System Prompt         → Generate personalized prompts
3. Query Generation      → Assign and adapt queries
4. Interaction           → Simulate conversations
5. Distractor            → Add realistic noise
6. Training Data Export  → Output structured samples

Step 1: Configuration

First, review the main configuration file:

# config.yaml (key sections)

api:
  provider: "openai"
  model: "gpt-4o-mini"

persona_generation:
  num_personas: 10           # Start small

query_generation:
  selection:
    queries_per_persona: 3   # 3 queries per persona

interaction_generation:
  min_turns: 2
  max_turns: 4

distractor:
  enabled: true
  activation_probability: 0.25
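Before running, it can help to estimate how much work these settings imply. The following is a minimal sketch, not part of PersonaGym: the helper name `estimate_run_size` is hypothetical, the dictionary simply mirrors the excerpt above, and it assumes one interaction per query and that `activation_probability` applies independently to each interaction.

```python
# Values copied from the config.yaml excerpt above; a real script would load the file.
config = {
    "persona_generation": {"num_personas": 10},
    "query_generation": {"selection": {"queries_per_persona": 3}},
    "interaction_generation": {"min_turns": 2, "max_turns": 4},
    "distractor": {"enabled": True, "activation_probability": 0.25},
}


def estimate_run_size(cfg):
    """Rough size estimate (hypothetical helper; assumes one interaction per query)."""
    personas = cfg["persona_generation"]["num_personas"]
    per_persona = cfg["query_generation"]["selection"]["queries_per_persona"]
    turns = cfg["interaction_generation"]
    distractor = cfg["distractor"]
    # Assumption: the probability applies independently to each interaction.
    p = distractor["activation_probability"] if distractor["enabled"] else 0.0
    interactions = personas * per_persona
    return {
        "interactions": interactions,
        "min_total_turns": interactions * turns["min_turns"],
        "max_total_turns": interactions * turns["max_turns"],
        "expected_distracted": interactions * p,
    }


estimate = estimate_run_size(config)
print(estimate)
```

With the values above this gives 30 interactions, between 60 and 120 turns in total, and about 7.5 interactions with a distractor on average.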

Step 2: First Run

Run the pipeline with a small number of personas. Here the --num-personas flag takes the place of the num_personas: 10 value set in config.yaml:

python run.py --num-personas 5

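Expected output (the log numbers its stages 1 to 5 rather than 1 to 6, because system prompt generation does not appear as a separate log stage):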

================================================================================
Starting Enhanced Persona Generation Pipeline
Mode: INCREMENTAL (will skip existing outputs)
================================================================================

[Stage 1] Generating Personas...
Generating personas: 100%|████████████████████| 5/5 [00:15<00:00]
[OK] Generated 5 personas

[Stage 2] Generating Queries - batch 1...
[OK] Generated queries for 5 personas

[Stage 3] Generating Interactions - batch 1...
Generating interactions: 100%|████████████████████| 15/15 [01:30<00:00]
[OK] Generated 15 interactions (batch 1)

[Stage 4] Distractor Application - SKIPPED (already applied in Stage 3)

[Stage 5] Collecting and Exporting Training Data...
[OK] Exported training data

================================================================================
Pipeline Complete
================================================================================

PIPELINE SUMMARY:
  Personas Generated: 5
  Queries Generated: 15
  Interactions Generated: 15
  Training Samples: 15

TOKEN USAGE SUMMARY:
  Total API Calls: 85
  Total Input Tokens: 42,500
  Total Output Tokens: 25,000
  Total Tokens: 67,500
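The summary figures can be sanity-checked with a few lines of Python. This is a small sketch that parses the two summary blocks from the sample log above (pasted in as a string) and derives per-interaction averages; the function name `parse_summary` is illustrative and not part of PersonaGym.

```python
import re

# Summary blocks copied from the sample log above.
SUMMARY = """\
PIPELINE SUMMARY:
  Personas Generated: 5
  Queries Generated: 15
  Interactions Generated: 15
  Training Samples: 15

TOKEN USAGE SUMMARY:
  Total API Calls: 85
  Total Input Tokens: 42,500
  Total Output Tokens: 25,000
  Total Tokens: 67,500
"""


def parse_summary(text):
    """Collect indented 'Label: number' lines into a dict of ints."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"\s+(.+?):\s+([\d,]+)\s*$", line)
        if match:
            values[match.group(1)] = int(match.group(2).replace(",", ""))
    return values


stats = parse_summary(SUMMARY)

# Input and output tokens should add up to the reported total.
tokens_consistent = (
    stats["Total Input Tokens"] + stats["Total Output Tokens"] == stats["Total Tokens"]
)
tokens_per_interaction = stats["Total Tokens"] / stats["Interactions Generated"]
calls_per_interaction = stats["Total API Calls"] / stats["Interactions Generated"]

print(f"Token totals consistent: {tokens_consistent}")
print(f"Tokens per interaction: {tokens_per_interaction:.0f}")
print(f"API calls per interaction: {calls_per_interaction:.2f}")
```

For the sample run this works out to 4,500 tokens per interaction, a handy baseline when scaling up the number of personas.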

Step 3: Explore Output

After running, check the output directory:

output/
├── personas/
│   ├── persona_20260206_001.json
│   ├── persona_20260206_002.json
│   └── ...
├── queries/
│   ├── persona_20260206_001.json
│   └── ...
├── interactions/
│   ├── interaction_persona_001_*.json
│   ├── index.json
│   └── ...
└── training_data/
    ├── train_samples_20260206_*.json
    ├── statistics.json
    └── token_usage_*.json
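A quick way to confirm that a run produced what you expect is to count the JSON files in each subdirectory. The sketch below assumes the directory layout shown above; `count_outputs` is a hypothetical helper, not a PersonaGym function.

```python
from pathlib import Path

# Subdirectory names taken from the tree above.
SUBDIRS = ("personas", "queries", "interactions", "training_data")


def count_outputs(root="output"):
    """Return the number of .json files in each expected subdirectory."""
    root = Path(root)
    counts = {}
    for name in SUBDIRS:
        folder = root / name
        # A missing folder counts as zero rather than raising an error.
        counts[name] = len(list(folder.glob("*.json"))) if folder.is_dir() else 0
    return counts


if Path("output").is_dir():
    for name, count in count_outputs("output").items():
        print(f"{name}: {count} JSON files")
```

Note that the counts include bookkeeping files such as index.json and statistics.json, so interactions/ will report one more file than the number of interactions.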

View a Persona

cat output/personas/persona_20260206_001.json
{
  "persona_id": "persona_20260206_001",
  "features": {
    "age_band": "25_34",
    "role": "engineer",
    "communication_style": "casual",
    "response_length": "medium"
  },
  "system_prompt": "You are helping a 25-34 year old engineer...",
  "metadata": {
    "created_at": "2026-02-06T10:30:00"
  }
}
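The same file can be checked from Python. This sketch verifies that a persona record carries the top-level keys seen in the sample above; the real schema may contain more fields, and `missing_persona_fields` and `load_persona` are hypothetical helpers written for this tutorial.

```python
import json

# Top-level keys seen in the sample persona above; the real schema may have more.
EXPECTED_KEYS = ("persona_id", "features", "system_prompt", "metadata")


def load_persona(path):
    """Read one persona JSON file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def missing_persona_fields(persona):
    """Return the expected top-level keys that are absent from the record."""
    return [key for key in EXPECTED_KEYS if key not in persona]


# Inline copy of the sample record, so the check runs without any files.
sample = json.loads("""
{
  "persona_id": "persona_20260206_001",
  "features": {"age_band": "25_34", "role": "engineer"},
  "system_prompt": "You are helping a 25-34 year old engineer...",
  "metadata": {"created_at": "2026-02-06T10:30:00"}
}
""")

print(missing_persona_fields(sample))
```

To check a real file, pass the result of `load_persona("output/personas/persona_20260206_001.json")` to `missing_persona_fields`.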

View an Interaction

cat output/interactions/interaction_persona_001_*.json

Key fields:

  • messages: The conversation turns

  • num_turns: Number of exchanges

  • metadata.distractor_applied: Whether noise was added
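To see how often the distractor actually fired, you can aggregate metadata.distractor_applied across interaction files. The sketch below is illustrative: `load_interactions` and `distractor_rate` are hypothetical helpers, and it assumes each interaction file holds a single JSON object with a `metadata` dictionary.

```python
import json
from pathlib import Path


def load_interactions(folder="output/interactions"):
    """Load every interaction_*.json file; the pattern skips index.json."""
    records = []
    for path in sorted(Path(folder).glob("interaction_*.json")):
        with open(path, encoding="utf-8") as handle:
            records.append(json.load(handle))
    return records


def distractor_rate(interactions):
    """Fraction of interactions whose metadata.distractor_applied is truthy."""
    if not interactions:
        return 0.0
    applied = sum(
        1 for item in interactions
        if item.get("metadata", {}).get("distractor_applied")
    )
    return applied / len(interactions)


# Synthetic records for illustration only; real files contain more fields.
demo = [
    {"num_turns": 2, "metadata": {"distractor_applied": True}},
    {"num_turns": 3, "metadata": {"distractor_applied": False}},
    {"num_turns": 4, "metadata": {"distractor_applied": False}},
    {"num_turns": 2, "metadata": {}},
]
print(f"Distractor rate: {distractor_rate(demo):.2f}")
```

On a real run, `distractor_rate(load_interactions())` can be compared against the configured activation_probability; with only 15 interactions, expect the observed rate to vary noticeably around 0.25.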

View Training Data

cat output/training_data/train_samples_*.json | head -50

Key fields:

  • prompt_trajectory: All user prompts in order

  • full_conversation: Complete dialogue

  • noisy_initial_queries: Noisy variants of the initial query

Step 4: Programmatic Usage

You can also run the pipeline from Python:

from src.enhanced_pipeline import EnhancedPersonaGenerationPipeline

# Initialize
pipeline = EnhancedPersonaGenerationPipeline("config.yaml")

# Run
result = pipeline.run(num_personas=5)

# Access results
print(f"Personas: {len(result['personas'])}")
print(f"Interactions: {len(result['interactions'])}")

# Inspect first persona
persona = result['personas'][0]
print(f"ID: {persona['persona_id']}")
print(f"Features: {persona['features']}")
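Once the personas are available as dictionaries, standard-library tools are enough to summarize them. The following sketch tallies one feature across a list of personas; `feature_distribution` is a hypothetical helper, and the sample records (including the "teacher" role) are made up for illustration.

```python
from collections import Counter


def feature_distribution(personas, feature):
    """Count how often each value of `feature` occurs across personas."""
    return Counter(
        persona.get("features", {}).get(feature, "<missing>")
        for persona in personas
    )


# Illustrative records shaped like result['personas']; values are invented.
sample_personas = [
    {"persona_id": "p1", "features": {"role": "engineer", "communication_style": "casual"}},
    {"persona_id": "p2", "features": {"role": "engineer", "communication_style": "formal"}},
    {"persona_id": "p3", "features": {"role": "teacher", "communication_style": "casual"}},
]

for value, count in feature_distribution(sample_personas, "role").most_common():
    print(f"{value}: {count}")
```

With a real run, pass `result['personas']` in place of `sample_personas`.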

Key Concepts

Incremental Mode

By default, the pipeline skips existing outputs:

experiment:
  incremental: true

Run again with more personas:

python run.py --num-personas 10
# Generates only 5 new personas (5 already exist from the first run)
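The arithmetic behind incremental mode can be written down in a few lines. This is a sketch of the behavior described above, not PersonaGym's actual implementation; the function name is hypothetical, and the non-incremental branch (regenerating everything requested) is an assumption.

```python
def personas_to_generate(requested, existing, incremental=True):
    """How many new personas a run would create under the stated assumptions."""
    if not incremental:
        # Assumption: without incremental mode, every requested persona is regenerated.
        return requested
    # Existing outputs are skipped; never return a negative count.
    return max(0, requested - existing)


print(personas_to_generate(requested=10, existing=5))
print(personas_to_generate(requested=5, existing=5))
```

The first call corresponds to the example above (5 new personas); the second shows that rerunning with the same count generates nothing new.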

Stage Control

Run specific stages:

# Only persona generation
python run.py --stage persona --num-personas 20

# Only training data export
python run.py --stage training

Skip Features

Disable optional processing steps with these flags:

# Skip noise injection
python run.py --skip-distractor

# Skip query style transfer
python run.py --skip-query-transfer
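When scripting several runs, it can be convenient to assemble these command lines in Python. The sketch below uses only the flags shown in this tutorial; `build_command` is a hypothetical helper, and whether the flags can be freely combined in a single invocation is an assumption.

```python
import shlex


def build_command(num_personas=None, stage=None,
                  skip_distractor=False, skip_query_transfer=False):
    """Assemble a run.py command line as an argument list."""
    cmd = ["python", "run.py"]
    if stage is not None:
        cmd += ["--stage", stage]
    if num_personas is not None:
        cmd += ["--num-personas", str(num_personas)]
    if skip_distractor:
        cmd.append("--skip-distractor")
    if skip_query_transfer:
        cmd.append("--skip-query-transfer")
    return cmd


# Print the command instead of running it, so the sketch has no side effects.
print(shlex.join(build_command(stage="persona", num_personas=20)))
print(shlex.join(build_command(num_personas=5, skip_distractor=True)))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` to execute the pipeline.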

Exercises

  1. Run with 10 personas and compare the token usage with the 5-persona run

  2. Disable the distractor (--skip-distractor) and observe the difference in outputs

  3. Check statistics.json to understand the data distribution

Next Steps

Continue to Tutorial 2: Persona Generation to learn about customizing personas.