# Tutorial 1: Basics

Learn the core concepts of PersonaGym and run your first pipeline.

## Learning Objectives

By the end of this tutorial, you will:

- Understand the 6-stage pipeline architecture
- Run a basic pipeline execution
- Interpret the output files

## Pipeline Overview

PersonaGym generates synthetic training data through 6 stages:

```
1. Persona Generation    → Create diverse user profiles
2. System Prompt         → Generate personalized prompts
3. Query Generation      → Assign and adapt queries
4. Interaction           → Simulate conversations
5. Distractor            → Add realistic noise
6. Training Data Export  → Output structured samples
```

## Step 1: Configuration

First, review the main configuration file:

```yaml
# config.yaml (key sections)

api:
  provider: "openai"
  model: "gpt-4o-mini"

persona_generation:
  num_personas: 10           # Start small

query_generation:
  selection:
    queries_per_persona: 3   # 3 queries per persona

interaction_generation:
  min_turns: 2
  max_turns: 4

distractor:
  enabled: true
  activation_probability: 0.25
```

## Step 2: First Run

Run the pipeline with a small number of personas:

```bash
python run.py --num-personas 5
```

Expected output:

```
================================================================================
Starting Enhanced Persona Generation Pipeline
Mode: INCREMENTAL (will skip existing outputs)
================================================================================

[Stage 1] Generating Personas...
Generating personas: 100%|████████████████████| 5/5 [00:15<00:00]
[OK] Generated 5 personas

[Stage 2] Generating Queries - batch 1...
[OK] Generated queries for 5 personas

[Stage 3] Generating Interactions - batch 1...
Generating interactions: 100%|████████████████████| 15/15 [01:30<00:00]
[OK] Generated 15 interactions (batch 1)

[Stage 4] Distractor Application - SKIPPED (already applied in Stage 3)

[Stage 5] Collecting and Exporting Training Data...
[OK] Exported training data

================================================================================
Pipeline Complete
================================================================================

PIPELINE SUMMARY:
  Personas Generated: 5
  Queries Generated: 15
  Interactions Generated: 15
  Training Samples: 15

TOKEN USAGE SUMMARY:
  Total API Calls: 85
  Total Input Tokens: 42,500
  Total Output Tokens: 25,000
  Total Tokens: 67,500
```

## Step 3: Explore Output

After running, check the output directory:

```
output/
├── personas/
│   ├── persona_20260206_001.json
│   ├── persona_20260206_002.json
│   └── ...
├── queries/
│   ├── persona_20260206_001.json
│   └── ...
├── interactions/
│   ├── interaction_persona_001_*.json
│   ├── index.json
│   └── ...
└── training_data/
    ├── train_samples_20260206_*.json
    ├── statistics.json
    └── token_usage_*.json
```

### View a Persona

```bash
cat output/personas/persona_20260206_001.json
```

```json
{
  "persona_id": "persona_20260206_001",
  "features": {
    "age_band": "25_34",
    "role": "engineer",
    "communication_style": "casual",
    "response_length": "medium"
  },
  "system_prompt": "You are helping a 25-34 year old engineer...",
  "metadata": {
    "created_at": "2026-02-06T10:30:00"
  }
}
```

### View an Interaction

```bash
cat output/interactions/interaction_persona_001_*.json
```

Key fields:

- `messages`: The conversation turns
- `num_turns`: Number of exchanges
- `metadata.distractor_applied`: Whether noise was added

### View Training Data

```bash
cat output/training_data/train_samples_*.json | head -50
```

Key fields:

- `prompt_trajectory`: All user prompts in order
- `full_conversation`: Complete dialogue
- `noisy_initial_queries`: Noise variations

## Step 4: Programmatic Usage

You can also run the pipeline from Python:

```python
from src.enhanced_pipeline import EnhancedPersonaGenerationPipeline

# Initialize
pipeline = EnhancedPersonaGenerationPipeline("config.yaml")

# Run
result = pipeline.run(num_personas=5)

# Access results
print(f"Personas: {len(result['personas'])}")
print(f"Interactions: {len(result['interactions'])}")

# Inspect first persona
persona = result['personas'][0]
print(f"ID: {persona['persona_id']}")
print(f"Features: {persona['features']}")
```

## Key Concepts

### Incremental Mode

By default, the pipeline skips existing outputs:

```yaml
experiment:
  incremental: true
```

Run again with more personas:

```bash
python run.py --num-personas 10
# Only generates 5 new personas (5 already exist)
```

### Stage Control

Run specific stages:

```bash
# Only persona generation
python run.py --stage persona --num-personas 20

# Only training data export
python run.py --stage training
```

### Skip Features

```bash
# Skip noise injection
python run.py --skip-distractor

# Skip query style transfer
python run.py --skip-query-transfer
```

## Exercises

1. **Run with 10 personas** and compare token usage
2. **Disable distractor** and observe the difference in outputs
3. **Check statistics.json** to understand data distribution

## Next Steps

Continue to [Tutorial 2: Persona Generation](tutorial_02_persona_generation.md) to learn about customizing personas.