Persona System
This guide covers the persona generation system in PersonaGym.
Overview
The persona system creates diverse user profiles that drive realistic conversation simulation. Each persona consists of:
Features: A set of dimension-value pairs (e.g., age_band: 25_34)
System Prompt: An LLM-generated prompt describing the persona
Metadata: Creation timestamp, feature count, etc.
Persona Dimensions
Personas are defined by 30+ dimensions across 5 categories in input/persona.yaml:
Basic Info (Constraint Dimensions)
| Dimension | Values | Description |
|---|---|---|
| age_band | u18, 18_24, 25_34, 35_44, 45_60, 60p | Age group |
| role | student, engineer, data, pm, designer, … | Professional role |
| seniority | intern, entry, junior, mid, senior, lead, director, c_level | Experience level |
|  | hs, bachelor, master, phd | Education level |
|  | english, chinese, spanish, french, german, … | Primary language |
Communication Style
| Dimension | Values | Description |
|---|---|---|
| communication_style | casual, professional, technical, creative | Communication preference |
| response_length | very_short, short, medium, long, very_long | Preferred response length |
| technical_level | beginner, intermediate, advanced, expert | Technical expertise |
|  | friendly, formal, humorous, serious | Conversation tone |
Query Preferences
| Dimension | Values | Description |
|---|---|---|
| query_length_pref | concise, moderate, detailed | Query length preference |
|  | step_by_step, overview, examples, minimal | Learning style |
|  | rarely, sometimes, often, always | Follow-up frequency |
Persona Sampling
Configuration
Sampling is configured in input/sampling_config.yaml:
sampling:
  feature_availability_rate: 0.7  # Fraction of dimensions included per persona
  min_features: 10
  max_features: 20
  required_dimensions:
    - query_length_pref  # Always include these
diversity:
  enabled: true
  min_hamming_distance: 3  # Minimum number of differing features
  max_retries: 100  # Retries to find a sufficiently unique persona
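To make the interaction of these settings concrete, here is a minimal sketch of one way a sampler could apply them. It is not PersonaSampler's actual implementation: the function name `sample_features`, the plain dict layout of `dimensions`, and the rule that each optional dimension is kept with probability `feature_availability_rate` before the count is clamped to `min_features`..`max_features` are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Optional


def sample_features(
    dimensions: Dict[str, List[str]],
    required: List[str],
    availability_rate: float = 0.7,
    min_features: int = 10,
    max_features: int = 20,
    rng: Optional[random.Random] = None,
) -> Dict[str, str]:
    """Hypothetical sampler: choose dimensions, then one value for each."""
    rng = rng or random.Random()
    optional = [name for name in dimensions if name not in required]
    rng.shuffle(optional)

    # Assumed rule: keep each optional dimension with probability availability_rate.
    kept = [name for name in optional if rng.random() < availability_rate]
    dropped = [name for name in optional if name not in kept]

    # Clamp the feature count; required dimensions always come first.
    upper = min(max_features, len(dimensions))
    lower = min(min_features, upper)
    chosen = list(required) + kept
    while len(chosen) < lower and dropped:
        chosen.append(dropped.pop())
    chosen = chosen[:max(upper, len(required))]

    return {name: rng.choice(dimensions[name]) for name in chosen}


# Placeholder bank with 31 dimensions; the names and values are illustrative only.
dims = {f"dim_{i}": ["a", "b", "c"] for i in range(30)}
dims["query_length_pref"] = ["concise", "moderate", "detailed"]
features = sample_features(dims, required=["query_length_pref"], rng=random.Random(0))
```

Under these assumptions the result always contains `query_length_pref` and has between 10 and 20 features.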
Sampling Strategies
from src.sampling import PersonaSampler
from src.persona_bank import PersonaBank
bank = PersonaBank("input/persona.yaml")
sampler = PersonaSampler("input/sampling_config.yaml", bank)
# Sample a single persona
features = sampler.sample_persona()
# {'age_band': '25_34', 'role': 'engineer', 'communication_style': 'casual', ...}
Diversity Enforcement
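The sampler enforces diversity with Hamming distance, the number of dimensions on which two personas differ. A candidate that is closer than min_hamming_distance to an already sampled persona is rejected and resampled, up to max_retries times: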
# Two personas with these features:
persona_1 = {'age': '25_34', 'role': 'engineer', 'style': 'casual'}
persona_2 = {'age': '25_34', 'role': 'designer', 'style': 'formal'}
# Hamming distance = 2 (role and style differ)
# If min_hamming_distance = 3, persona_2 would be rejected
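The check described above can be sketched in a few lines. This is an illustration rather than the sampler's real code: the helper names `hamming_distance` and `is_diverse` are hypothetical, and treating a dimension that is present in only one persona as a difference is an assumption.

```python
from typing import Dict, List


def hamming_distance(a: Dict[str, str], b: Dict[str, str]) -> int:
    """Count dimensions whose values differ between two personas.

    Assumption: a dimension present in only one persona counts as a difference.
    """
    return sum(1 for key in a.keys() | b.keys() if a.get(key) != b.get(key))


def is_diverse(
    candidate: Dict[str, str],
    accepted: List[Dict[str, str]],
    min_distance: int = 3,
) -> bool:
    """Accept a candidate only if it is far enough from every accepted persona."""
    return all(hamming_distance(candidate, p) >= min_distance for p in accepted)


persona_1 = {'age': '25_34', 'role': 'engineer', 'style': 'casual'}
persona_2 = {'age': '25_34', 'role': 'designer', 'style': 'formal'}

distance = hamming_distance(persona_1, persona_2)  # 2: role and style differ
accepted = is_diverse(persona_2, [persona_1], min_distance=3)  # False
```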
System Prompt Generation
Template
The system prompt template (prompts/persona_to_system_prompt.txt) converts features to natural language:
You are an AI assistant helping a user with the following characteristics:
{persona_features}
Adapt your responses to match the user's:
- Communication style: {communication_style}
- Technical level: {technical_level}
- Preferred response length: {response_length}
Be helpful, accurate, and appropriate for this user's needs.
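This guide does not show how the placeholders are filled, so the following is only a sketch of one plausible approach using Python's `str.format_map`. The function name `render_template`, the "- name: value" rendering of `{persona_features}`, and the "unspecified" fallback for dimensions that were not sampled are all assumptions.

```python
from typing import Dict

TEMPLATE = """You are an AI assistant helping a user with the following characteristics:
{persona_features}
Adapt your responses to match the user's:
- Communication style: {communication_style}
- Technical level: {technical_level}
- Preferred response length: {response_length}
Be helpful, accurate, and appropriate for this user's needs."""


class _WithDefault(dict):
    """Dict that returns a fallback for placeholders with no sampled value."""

    def __missing__(self, key: str) -> str:
        return "unspecified"


def render_template(template: str, features: Dict[str, str]) -> str:
    """Fill the template from a feature dict (hypothetical helper)."""
    feature_lines = "\n".join(f"- {name}: {value}" for name, value in features.items())
    values = _WithDefault(features, persona_features=feature_lines)
    return template.format_map(values)


rendered = render_template(
    TEMPLATE, {"role": "engineer", "communication_style": "casual"}
)
```

The fallback matters because optional dimensions are not present in every persona, and a plain `str.format` call would raise `KeyError` for a missing placeholder.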
LLM Formulator
from src.llm_client import create_llm_client, LLMFormulator

# `config` is the already-loaded experiment configuration (loading not shown here)
client = create_llm_client(config['api'])
formulator = LLMFormulator(
    client,
    template_path="prompts/persona_to_system_prompt.txt",
    max_retries=3,
    validate_output=True,
    min_prompt_length=50
)

# Generate system prompt from features
system_prompt = formulator.formulate(features)
PersonaSpec Data Structure
from src.persona_spec import PersonaSpec, generate_persona_id

# Create a persona spec
spec = PersonaSpec(
    persona_id=generate_persona_id(features),
    features=features,
    system_prompt=system_prompt,
    metadata={
        'num_features': len(features),
        'has_prompt': True
    }
)

# Convert to dictionary
data = spec.to_dict()
JSON Output Format
{
  "persona_id": "persona_20260206_001",
  "features": {
    "age_band": "25_34",
    "role": "engineer",
    "seniority": "mid",
    "communication_style": "casual",
    "response_length": "medium",
    "technical_level": "advanced",
    "query_length_pref": "moderate"
  },
  "system_prompt": "You are helping a mid-level engineer in their late 20s...",
  "metadata": {
    "created_at": "2026-02-06T10:30:00",
    "num_features": 7,
    "has_prompt": true
  }
}
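When reading these files back, it can help to confirm that the metadata agrees with the rest of the record. The check below is a hypothetical helper and not part of PersonaGym; the only thing it takes from this guide is the set of keys shown in the JSON format above.

```python
import json
from typing import Any, Dict, List

REQUIRED_KEYS = ("persona_id", "features", "system_prompt", "metadata")


def check_persona_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of consistency problems found in one persona record."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in record]

    features = record.get("features", {})
    metadata = record.get("metadata", {})

    # num_features should match the number of entries in "features".
    if metadata.get("num_features") != len(features):
        problems.append(
            f"metadata.num_features={metadata.get('num_features')} "
            f"but features has {len(features)} entries"
        )

    # has_prompt should be true exactly when a non-empty system prompt exists.
    if metadata.get("has_prompt") != bool(record.get("system_prompt")):
        problems.append("metadata.has_prompt does not match system_prompt")

    return problems


# Deliberately inconsistent record: one feature, but num_features says 2.
record = json.loads(
    '{"persona_id": "persona_demo", "features": {"role": "engineer"},'
    ' "system_prompt": "You are helping an engineer.",'
    ' "metadata": {"num_features": 2, "has_prompt": true}}'
)
problems = check_persona_record(record)
```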
Persona Storage
Saving Personas
from src.persona_spec import PersonaSpecStorage
storage = PersonaSpecStorage("output/personas", format="json")
# Save a single persona
storage.save(spec)
# Load all personas
all_personas = storage.load_all()
Incremental Mode
In incremental mode, existing personas are preserved and only the missing ones are generated:
# Check existing personas
existing = storage.load_all()
print(f"Found {len(existing)} existing personas")

# Only generate missing personas (num_personas is the target total for the run)
remaining = max(0, num_personas - len(existing))
Customizing Persona Dimensions
Adding New Dimensions
Edit input/persona.yaml:
dimensions:
  # Add a new dimension
  industry:
    name: industry
    is_constraint: false
    values:
      - technology
      - healthcare
      - finance
      - education
      - retail
      - manufacturing
Modifying Existing Dimensions
dimensions:
  role:
    name: role
    is_constraint: true
    values:
      - student
      - researcher
      - developer
      - manager
      - executive
      - consultant  # Add new value
      - freelancer  # Add new value
Dimension Constraints
Constraint dimensions (is_constraint: true) are always included:
dimensions:
  age_band:
    name: age_band
    is_constraint: true  # Always sampled
    values: [...]
  hobby:
    name: hobby
    is_constraint: false  # Optionally sampled
    values: [...]
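As a rough illustration of how the flag might be consumed, the sketch below filters an already-parsed `dimensions` mapping for constraint dimensions. The helper name `constraint_dimension_names` is hypothetical, the value lists are placeholders, and treating a missing `is_constraint` key as false is an assumption; PersonaBank.get_constraint_dimensions may behave differently.

```python
from typing import Any, Dict, List


def constraint_dimension_names(dimensions: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return the names of dimensions flagged with is_constraint: true."""
    return [
        name
        for name, spec in dimensions.items()
        if spec.get("is_constraint", False)  # Assumed default: not a constraint
    ]


# The `dimensions` mapping as it might look after parsing the YAML above.
# The value lists here are placeholders, not the real contents of persona.yaml.
dimensions = {
    "age_band": {"name": "age_band", "is_constraint": True, "values": ["18_24", "25_34"]},
    "hobby": {"name": "hobby", "is_constraint": False, "values": ["placeholder"]},
}

always_sampled = constraint_dimension_names(dimensions)  # ['age_band']
```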
API Reference
PersonaBank
from typing import Dict, List


class PersonaBank:
    """Manages persona dimension definitions."""

    def __init__(self, persona_path: str):
        """Load persona dimensions from YAML file."""

    def get_dimension(self, name: str) -> Dict:
        """Get a specific dimension definition."""

    def get_all_dimensions(self) -> Dict[str, Dict]:
        """Get all dimension definitions."""

    def get_constraint_dimensions(self) -> List[str]:
        """Get names of constraint dimensions."""
PersonaSampler
class PersonaSampler:
    """Samples persona features with diversity constraints."""

    def __init__(self, config_path: str, persona_bank: PersonaBank):
        """Initialize sampler with configuration."""

    def sample_persona(self) -> Dict[str, str]:
        """Sample a single persona's features."""

    def sample_batch(self, n: int) -> List[Dict[str, str]]:
        """Sample multiple personas with diversity."""
PersonaSpec
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class PersonaSpec:
    """Represents a complete persona specification."""

    persona_id: str
    features: Dict[str, str]
    system_prompt: Optional[str]
    metadata: Dict[str, Any]

    def to_dict(self) -> Dict[str, Any]:
        """Convert to dictionary."""

    @classmethod
    def from_dict(cls, data: Dict) -> 'PersonaSpec':
        """Create from dictionary."""
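The reference above lists only signatures. As a sketch of how the to_dict/from_dict pair could round-trip a persona, the stand-in class below uses `dataclasses.asdict`. The class name `PersonaSpecSketch`, the field defaults, and the method bodies are assumptions, not the actual PersonaSpec code.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass
class PersonaSpecSketch:
    """Hypothetical stand-in with the same four fields as PersonaSpec."""

    persona_id: str
    features: Dict[str, str]
    system_prompt: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        """Convert to a plain, JSON-serializable dictionary."""
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "PersonaSpecSketch":
        """Rebuild a spec; system_prompt and metadata are treated as optional."""
        return cls(
            persona_id=data["persona_id"],
            features=dict(data["features"]),
            system_prompt=data.get("system_prompt"),
            metadata=dict(data.get("metadata", {})),
        )


spec = PersonaSpecSketch(
    persona_id="persona_demo",
    features={"role": "engineer"},
    metadata={"num_features": 1, "has_prompt": False},
)
restored = PersonaSpecSketch.from_dict(spec.to_dict())
```

Copying `features` and `metadata` in `from_dict` keeps the restored object from sharing mutable state with the input dictionary.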
Best Practices
1. Balance Diversity and Realism
diversity:
  min_hamming_distance: 3  # Not too high (unrealistic)
                           # Not too low (redundant personas)
2. Include Essential Dimensions
sampling:
  required_dimensions:
    - query_length_pref  # Critical for query adaptation
    - technical_level  # Affects response complexity
3. Validate System Prompts
formulation:
  validate_output: true
  min_prompt_length: 50  # Reject too-short prompts
4. Use Incremental Mode for Large Runs
experiment:
  incremental: true  # Resume from existing personas
See Also
Configuration - Full configuration reference
Query Generation - How personas affect queries
Persona API - Detailed API documentation