# Persona System

This guide covers the persona generation system in PersonaGym.

## Overview

The persona system creates diverse user profiles that drive realistic conversation simulation. Each persona consists of:

- **Features**: A set of dimension-value pairs (e.g., `age_band: 25_34`)
- **System Prompt**: An LLM-generated prompt describing the persona
- **Metadata**: Creation timestamp, feature count, etc.

## Persona Dimensions

Personas are defined by 30+ dimensions across 5 categories in `input/persona.yaml`. The tables below list a subset of those dimensions, drawn from three of the categories:

### Basic Info (Constraint Dimensions)

| Dimension | Values | Description |
|-----------|--------|-------------|
| `age_band` | u18, 18_24, 25_34, 35_44, 45_60, 60p | Age group |
| `role` | student, engineer, data, pm, designer, ... | Professional role |
| `seniority` | intern, entry, junior, mid, senior, lead, director, c_level | Experience level |
| `education` | hs, bachelor, master, phd | Education level |
| `language` | english, chinese, spanish, french, german, ... | Primary language |

### Communication Style

| Dimension | Values | Description |
|-----------|--------|-------------|
| `communication_style` | casual, professional, technical, creative | Communication preference |
| `response_length` | very_short, short, medium, long, very_long | Preferred response length |
| `technical_level` | beginner, intermediate, advanced, expert | Technical expertise |
| `tone` | friendly, formal, humorous, serious | Conversation tone |

### Query Preferences

| Dimension | Values | Description |
|-----------|--------|-------------|
| `query_length_pref` | concise, moderate, detailed | Query length preference |
| `explanation_style` | step_by_step, overview, examples, minimal | Learning style |
| `feedback_frequency` | rarely, sometimes, often, always | Follow-up frequency |

## Persona Sampling

### Configuration

Sampling is configured in `input/sampling_config.yaml`:

```yaml
sampling:
  feature_availability_rate: 0.7    # Fraction of dimensions included per persona (0.7 = 70%)
  min_features: 10
  max_features: 20
  required_dimensions:
    - query_length_pref             # Always included

diversity:
  enabled: true
  min_hamming_distance: 3           # Minimum number of differing features between personas
  max_retries: 100                  # Max resampling attempts to find a sufficiently different persona
```
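One plausible reading of how `feature_availability_rate`, `min_features`, and `max_features` interact is sketched below: the rate scales the number of defined dimensions, and the result is clamped to the configured bounds. The helper `target_feature_count` is hypothetical and not part of PersonaGym; the real sampler may combine these settings differently.

```python
def target_feature_count(num_dimensions: int, rate: float,
                         min_features: int, max_features: int) -> int:
    """Hypothetical helper: scale by the rate, then clamp to the bounds.

    The bounds themselves are capped at num_dimensions, since a persona
    cannot have more features than there are dimensions.
    """
    raw = round(num_dimensions * rate)
    upper = min(max_features, num_dimensions)
    lower = min(min_features, upper)
    return max(min(raw, upper), lower)


# With 30 dimensions and the settings above, the rate suggests 21 features,
# which max_features caps at 20.
count = target_feature_count(30, 0.7, min_features=10, max_features=20)
```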

### Sampling Strategies

```python
from src.sampling import PersonaSampler
from src.persona_bank import PersonaBank

bank = PersonaBank("input/persona.yaml")
sampler = PersonaSampler("input/sampling_config.yaml", bank)

# Sample a single persona
features = sampler.sample_persona()
# {'age_band': '25_34', 'role': 'engineer', 'communication_style': 'casual', ...}
```

### Diversity Enforcement

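The sampler enforces diversity with Hamming distance, the number of dimensions on which two personas differ. A candidate that is closer than `min_hamming_distance` to an existing persona is rejected: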

```python
# Two personas with these features (keys match the dimension names in persona.yaml):
persona_1 = {'age_band': '25_34', 'role': 'engineer', 'communication_style': 'casual'}
persona_2 = {'age_band': '25_34', 'role': 'designer', 'communication_style': 'professional'}

# Hamming distance = 2 (role and communication_style differ)
# If min_hamming_distance = 3, persona_2 would be rejected because 2 < 3
```
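The distance check can be sketched in a few lines. This is a minimal illustration, not the sampler's actual code: `hamming_distance` and `is_diverse` are hypothetical names, and treating a dimension that is present in only one persona as a difference is an assumption.

```python
from typing import Dict, List


def hamming_distance(a: Dict[str, str], b: Dict[str, str]) -> int:
    """Count dimensions whose values differ.

    Assumption: a dimension missing from one persona counts as a difference.
    """
    return sum(1 for key in a.keys() | b.keys() if a.get(key) != b.get(key))


def is_diverse(candidate: Dict[str, str], accepted: List[Dict[str, str]],
               min_distance: int) -> bool:
    """True if the candidate is far enough from every accepted persona."""
    return all(hamming_distance(candidate, other) >= min_distance
               for other in accepted)


persona_1 = {'age_band': '25_34', 'role': 'engineer', 'communication_style': 'casual'}
persona_2 = {'age_band': '25_34', 'role': 'designer', 'communication_style': 'professional'}

distance = hamming_distance(persona_1, persona_2)
accepted = is_diverse(persona_2, [persona_1], min_distance=3)
```

Here `distance` is 2, so `accepted` is `False` under a minimum of 3.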

## System Prompt Generation

### Template

The system prompt template (`prompts/persona_to_system_prompt.txt`) converts features to natural language:

```
You are an AI assistant helping a user with the following characteristics:

{persona_features}

Adapt your responses to match the user's:
- Communication style: {communication_style}
- Technical level: {technical_level}
- Preferred response length: {response_length}

Be helpful, accurate, and appropriate for this user's needs.
```
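As a rough illustration of how the placeholders might be filled, the sketch below renders the template with Python's `str.format`. The function `render_prompt`, the bullet-list formatting of `{persona_features}`, and the `"unspecified"` fallback for missing dimensions are all assumptions; the actual formulator may prepare the text differently before it reaches the LLM.

```python
from typing import Dict

TEMPLATE = """You are an AI assistant helping a user with the following characteristics:

{persona_features}

Adapt your responses to match the user's:
- Communication style: {communication_style}
- Technical level: {technical_level}
- Preferred response length: {response_length}

Be helpful, accurate, and appropriate for this user's needs."""


def render_prompt(template: str, features: Dict[str, str],
                  default: str = "unspecified") -> str:
    """Fill the template; dimensions absent from `features` use `default`."""
    # Assumption: features are listed one per line, sorted by dimension name.
    feature_lines = "\n".join(
        f"- {name}: {value}" for name, value in sorted(features.items())
    )
    return template.format(
        persona_features=feature_lines,
        communication_style=features.get("communication_style", default),
        technical_level=features.get("technical_level", default),
        response_length=features.get("response_length", default),
    )


features = {"age_band": "25_34", "role": "engineer", "communication_style": "casual"}
prompt = render_prompt(TEMPLATE, features)
```

Because personas carry only a subset of dimensions, some fallback for missing placeholders is needed; otherwise `str.format` would raise a `KeyError`.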

### LLM Formulator

```python
from src.llm_client import create_llm_client, LLMFormulator

# `config` is the experiment configuration, assumed to be loaded earlier
client = create_llm_client(config['api'])
formulator = LLMFormulator(
    client,
    template_path="prompts/persona_to_system_prompt.txt",
    max_retries=3,
    validate_output=True,
    min_prompt_length=50
)

# Generate a system prompt from sampled features (e.g. the result of sampler.sample_persona())
system_prompt = formulator.formulate(features)
```
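The `max_retries`, `validate_output`, and `min_prompt_length` parameters suggest a validate-and-retry loop. The sketch below shows one way such a loop could work; `formulate_with_retries` is a hypothetical stand-in, the length check is the only validation modeled, and treating `max_retries` as the total number of attempts is an assumption.

```python
from typing import Callable, Optional


def formulate_with_retries(generate: Callable[[], str],
                           max_retries: int = 3,
                           min_prompt_length: int = 50) -> Optional[str]:
    """Call `generate` until the output is long enough, or give up.

    Assumption: `max_retries` is the total number of attempts.
    Returns None if every attempt produced a too-short prompt.
    """
    for _ in range(max_retries):
        text = generate().strip()
        if len(text) >= min_prompt_length:
            return text
    return None


# A fake generator standing in for the LLM call: the first output is too short.
responses = iter([
    "Too short.",
    "You are helping a mid-level engineer who prefers casual, concise answers.",
])
result = formulate_with_retries(lambda: next(responses))
```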

## PersonaSpec Data Structure

```python
from src.persona_spec import PersonaSpec, generate_persona_id

# Create a persona spec
spec = PersonaSpec(
    persona_id=generate_persona_id(features),
    features=features,
    system_prompt=system_prompt,
    metadata={
        'num_features': len(features),
        'has_prompt': True
    }
)

# Convert to dictionary
data = spec.to_dict()
```

### JSON Output Format

```json
{
  "persona_id": "persona_20260206_001",
  "features": {
    "age_band": "25_34",
    "role": "engineer",
    "seniority": "mid",
    "communication_style": "casual",
    "response_length": "medium",
    "technical_level": "advanced",
    "query_length_pref": "moderate"
  },
  "system_prompt": "You are helping a mid-level engineer in their late 20s...",
  "metadata": {
    "created_at": "2026-02-06T10:30:00",
    "num_features": 7,
    "has_prompt": true
  }
}
```
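A saved record can be sanity-checked against this format. The sketch below is an illustrative check, not part of PersonaGym: `check_record` is a hypothetical helper, and the rule that `num_features` and `has_prompt` must agree with the rest of the record is inferred from the example above.

```python
import json
from typing import Any, Dict, List

REQUIRED_KEYS = ("persona_id", "features", "system_prompt", "metadata")


def check_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one persona record (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in record]
    if problems:
        return problems
    metadata = record["metadata"]
    if metadata.get("num_features") != len(record["features"]):
        problems.append("metadata.num_features does not match len(features)")
    if metadata.get("has_prompt") != bool(record["system_prompt"]):
        problems.append("metadata.has_prompt does not match system_prompt")
    return problems


record = json.loads("""
{
  "persona_id": "persona_20260206_001",
  "features": {"age_band": "25_34", "role": "engineer"},
  "system_prompt": "You are helping a mid-level engineer...",
  "metadata": {"num_features": 2, "has_prompt": true}
}
""")
problems = check_record(record)
```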

## Persona Storage

### Saving Personas

```python
from src.persona_spec import PersonaSpecStorage

storage = PersonaSpecStorage("output/personas", format="json")

# Save a single persona
storage.save(spec)

# Load all personas
all_personas = storage.load_all()
```

### Incremental Mode

In incremental mode, existing personas are preserved and only the missing ones are generated:

```python
# Check existing personas
existing = storage.load_all()
print(f"Found {len(existing)} existing personas")

# Only generate the missing personas; `num_personas` is the target total for the run
remaining = max(num_personas - len(existing), 0)
```
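The resume logic can be tried without the PersonaGym classes. The following sketch assumes one JSON file per persona in the output directory, which may not match how `PersonaSpecStorage` lays out its files; `count_existing` and `remaining_to_generate` are hypothetical helpers.

```python
import json
import tempfile
from pathlib import Path


def count_existing(directory: Path) -> int:
    """Count persona files, assuming one `*.json` file per persona."""
    return sum(1 for _ in directory.glob("*.json"))


def remaining_to_generate(target: int, existing: int) -> int:
    """Number of personas still to generate; never negative."""
    return max(target - existing, 0)


# Simulate an output directory that already holds two personas.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    for i in range(2):
        (out_dir / f"persona_{i}.json").write_text(json.dumps({"persona_id": str(i)}))
    existing_count = count_existing(out_dir)

remaining = remaining_to_generate(target=10, existing=existing_count)
```

Clamping at zero matters when a previous run produced more personas than the current target.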

## Customizing Persona Dimensions

### Adding New Dimensions

Edit `input/persona.yaml`:

```yaml
dimensions:
  # Add a new dimension
  industry:
    name: industry
    is_constraint: false
    values:
      - technology
      - healthcare
      - finance
      - education
      - retail
      - manufacturing
```
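Before running a large experiment, a new dimension entry can be checked for common mistakes. The sketch below works on the already-parsed YAML (a plain dict) and is only an illustration: `validate_dimension` is a hypothetical helper, and the rules it applies (the `name` matches the key, `is_constraint` is a boolean, `values` is a non-empty list without duplicates) are inferred from the examples in this guide.

```python
from typing import Any, Dict, List


def validate_dimension(key: str, definition: Dict[str, Any]) -> List[str]:
    """Return a list of problems with one dimension entry (empty if none)."""
    problems = []
    if definition.get("name") != key:
        problems.append(f"name {definition.get('name')!r} does not match key {key!r}")
    if not isinstance(definition.get("is_constraint"), bool):
        problems.append("is_constraint must be true or false")
    values = definition.get("values")
    if not isinstance(values, list) or not values:
        problems.append("values must be a non-empty list")
    elif len(set(values)) != len(values):
        problems.append("values contains duplicates")
    return problems


industry = {
    "name": "industry",
    "is_constraint": False,
    "values": ["technology", "healthcare", "finance",
               "education", "retail", "manufacturing"],
}
problems = validate_dimension("industry", industry)
```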

### Modifying Existing Dimensions

```yaml
dimensions:
  role:
    name: role
    is_constraint: true
    values:
      - student
      - researcher
      - developer
      - manager
      - executive
      - consultant      # Add new value
      - freelancer      # Add new value
```

### Dimension Constraints

Constraint dimensions (`is_constraint: true`) are included in every persona; all other dimensions are sampled optionally:

```yaml
dimensions:
  age_band:
    name: age_band
    is_constraint: true   # Always sampled
    values: [...]

  hobby:
    name: hobby
    is_constraint: false  # Optionally sampled
    values: [...]
```
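The difference between the two kinds of dimension can be sketched as follows. This is an assumed selection rule, not the sampler's confirmed behavior: constraint dimensions are always chosen, and each optional dimension is chosen independently with probability `rate`. The names `choose_dimensions` and `sample_features` are hypothetical, and the `hobby` values are illustrative only.

```python
import random
from typing import Any, Dict, List


def choose_dimensions(dimensions: Dict[str, Dict[str, Any]], rate: float,
                      rng: random.Random) -> List[str]:
    """Pick every constraint dimension, and each optional one with probability `rate`."""
    return [
        name for name, definition in dimensions.items()
        if definition.get("is_constraint", False) or rng.random() < rate
    ]


def sample_features(dimensions: Dict[str, Dict[str, Any]], rate: float,
                    rng: random.Random) -> Dict[str, str]:
    """Draw one value uniformly at random for each chosen dimension."""
    return {
        name: rng.choice(dimensions[name]["values"])
        for name in choose_dimensions(dimensions, rate, rng)
    }


dimensions = {
    "age_band": {"is_constraint": True, "values": ["18_24", "25_34", "35_44"]},
    # Illustrative values; the real list lives in input/persona.yaml.
    "hobby": {"is_constraint": False, "values": ["chess", "hiking"]},
}
features = sample_features(dimensions, rate=0.7, rng=random.Random(0))
```

Whatever the random draw, `features` always contains `age_band`; `hobby` appears only in some personas.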

## API Reference

### PersonaBank

```python
from typing import Dict, List


class PersonaBank:
    """Manages persona dimension definitions."""

    def __init__(self, persona_path: str):
        """Load persona dimensions from YAML file."""

    def get_dimension(self, name: str) -> Dict:
        """Get a specific dimension definition."""

    def get_all_dimensions(self) -> Dict[str, Dict]:
        """Get all dimension definitions."""

    def get_constraint_dimensions(self) -> List[str]:
        """Get names of constraint dimensions."""
```

### PersonaSampler

```python
from typing import Dict, List

from src.persona_bank import PersonaBank


class PersonaSampler:
    """Samples persona features with diversity constraints."""

    def __init__(self, config_path: str, persona_bank: PersonaBank):
        """Initialize sampler with configuration."""

    def sample_persona(self) -> Dict[str, str]:
        """Sample a single persona's features."""

    def sample_batch(self, n: int) -> List[Dict[str, str]]:
        """Sample multiple personas with diversity."""
```
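To make "sample multiple personas with diversity" concrete, here is a minimal rejection-sampling sketch. It is not the implementation of `sample_batch`: the function `sample_diverse_batch` is hypothetical, and raising an error once `max_retries` is exhausted is an assumption (the real sampler might instead accept the last candidate or return a shorter batch).

```python
import random
from typing import Callable, Dict, List

Features = Dict[str, str]


def distance(a: Features, b: Features) -> int:
    """Hamming distance; a dimension missing on one side counts as a difference."""
    return sum(1 for key in a.keys() | b.keys() if a.get(key) != b.get(key))


def sample_diverse_batch(sample_one: Callable[[], Features], n: int,
                         min_distance: int, max_retries: int) -> List[Features]:
    """Draw n personas, resampling any candidate too close to an accepted one."""
    accepted: List[Features] = []
    for _ in range(n):
        for _attempt in range(max_retries):
            candidate = sample_one()
            if all(distance(candidate, other) >= min_distance for other in accepted):
                accepted.append(candidate)
                break
        else:
            # Assumption: give up loudly when no diverse candidate is found.
            raise RuntimeError("could not find a sufficiently different persona")
    return accepted


VALUES = {
    "tone": ["friendly", "formal", "humorous", "serious"],
    "communication_style": ["casual", "professional", "technical", "creative"],
    "technical_level": ["beginner", "intermediate", "advanced", "expert"],
    "explanation_style": ["step_by_step", "overview", "examples", "minimal"],
}
rng = random.Random(42)
batch = sample_diverse_batch(
    lambda: {name: rng.choice(options) for name, options in VALUES.items()},
    n=3, min_distance=2, max_retries=100,
)
```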

### PersonaSpec

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class PersonaSpec:
    """Represents a complete persona specification."""

    persona_id: str
    features: Dict[str, str]
    system_prompt: Optional[str]
    metadata: Dict[str, Any]

    def to_dict(self) -> Dict[str, Any]:
        """Convert to dictionary."""

    @classmethod
    def from_dict(cls, data: Dict) -> 'PersonaSpec':
        """Create from dictionary."""
```
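The `to_dict` / `from_dict` pair implies that a spec survives a round trip through JSON. The stand-in class below, `PersonaSpecSketch`, mirrors the fields listed above to illustrate that property; it is a sketch under that assumption, and the method bodies are not taken from PersonaGym.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass
class PersonaSpecSketch:
    """Illustrative stand-in with the same fields as PersonaSpec."""

    persona_id: str
    features: Dict[str, str]
    system_prompt: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        """Convert to a plain, JSON-serializable dictionary."""
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "PersonaSpecSketch":
        """Rebuild a spec; optional fields fall back to their defaults."""
        return cls(
            persona_id=data["persona_id"],
            features=dict(data["features"]),
            system_prompt=data.get("system_prompt"),
            metadata=dict(data.get("metadata", {})),
        )


spec = PersonaSpecSketch(
    persona_id="persona_demo",
    features={"role": "engineer", "tone": "friendly"},
    metadata={"num_features": 2, "has_prompt": False},
)
restored = PersonaSpecSketch.from_dict(json.loads(json.dumps(spec.to_dict())))
```

After the round trip, `restored == spec` holds because dataclasses compare field by field.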

## Best Practices

### 1. Balance Diversity and Realism

```yaml
diversity:
  min_hamming_distance: 3   # Too high forces unrealistic feature combinations;
                            # too low allows near-duplicate personas
```

### 2. Include Essential Dimensions

```yaml
sampling:
  required_dimensions:
    - query_length_pref     # Critical for query adaptation
    - technical_level       # Affects response complexity
```

### 3. Validate System Prompts

```yaml
formulation:
  validate_output: true
  min_prompt_length: 50     # Reject too-short prompts
```

### 4. Use Incremental Mode for Large Runs

```yaml
experiment:
  incremental: true         # Resume from existing personas
```

## See Also

- [Configuration](configuration.md) - Full configuration reference
- [Query Generation](query_generation.md) - How personas affect queries
- [Persona API](../api/persona.md) - Detailed API documentation