Query Generation
This guide covers the query generation and style transfer system in PersonaGym.
Overview
The query generation stage:
1. Loads seed queries from a dataset
2. Assigns queries to personas
3. Optionally applies style transfer to match persona preferences
4. Persists queries for interaction generation
Query Dataset
Input Format
Queries are stored in JSONL format (input/query.jsonl):
{"text": "Help me write an email to my boss", "metadata": {"source": "dataset1", "category": "writing"}}
{"text": "Explain how neural networks work", "metadata": {"source": "dataset2", "category": "education"}}
{"text": "Debug this Python code", "metadata": {"source": "dataset3", "category": "coding"}}
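As a rough illustration of the input format above, the following sketch parses JSONL lines into records and checks that each one carries a non-empty `text` field. The helper name `parse_query_lines` is hypothetical and is not part of PersonaGym; treating `metadata` as optional is also an assumption.

```python
import json
from typing import Any, Dict, Iterable, List


def parse_query_lines(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Parse JSONL lines into query records, skipping blank lines."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError(f"line {line_no}: expected a JSON object")
        text = record.get("text")
        if not isinstance(text, str) or not text:
            raise ValueError(f"line {line_no}: missing or empty 'text' field")
        # Assumption: 'metadata' is optional and defaults to an empty object.
        record.setdefault("metadata", {})
        records.append(record)
    return records


sample_lines = [
    '{"text": "Help me write an email to my boss", "metadata": {"source": "dataset1", "category": "writing"}}',
    "",
    '{"text": "Debug this Python code", "metadata": {"source": "dataset3", "category": "coding"}}',
]
records = parse_query_lines(sample_lines)
print(len(records), [r["metadata"]["category"] for r in records])
```

Failing fast on a malformed line, with its line number, tends to be easier to debug than silently dropping records.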
Dataset Statistics
from src.query_generator import QueryDataset
dataset = QueryDataset("input/query.jsonl")
print(f"Total queries: {len(dataset)}")
print(f"Sources: {dataset.get_sources()}")
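If more detail than the source list is useful, a summary per source and category can be computed directly from the parsed records. This is a standalone sketch over plain dictionaries; the function `summarize` is hypothetical and does not reflect how `QueryDataset` computes its statistics.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize(records: List[Dict[str, Any]]) -> Dict[str, Counter]:
    """Count records per metadata 'source' and 'category'."""
    sources: Counter = Counter()
    categories: Counter = Counter()
    for record in records:
        metadata = record.get("metadata", {})
        # Records without a given key are grouped under "unknown".
        sources[metadata.get("source", "unknown")] += 1
        categories[metadata.get("category", "unknown")] += 1
    return {"sources": sources, "categories": categories}


stats = summarize([
    {"text": "Help me write an email to my boss", "metadata": {"source": "dataset1", "category": "writing"}},
    {"text": "Draft a cover letter", "metadata": {"source": "dataset1", "category": "writing"}},
    {"text": "Debug this Python code", "metadata": {"source": "dataset3", "category": "coding"}},
    {"text": "No metadata here"},
])
print(dict(stats["sources"]))
print(dict(stats["categories"]))
```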
Query Assignment
Configuration
query_generation:
  dataset:
    path: "input/query.jsonl"
    max_queries: null  # null = use all
  selection:
    queries_per_persona: 5  # Queries assigned per persona
Selection Logic
from src.query_generator import UserQueryGenerator
generator = UserQueryGenerator(config, llm_client)
# Generate queries for a single persona
queries = generator.generate_queries_for_persona(
    persona_features={'role': 'engineer', 'technical_level': 'advanced'},
    num_queries=5
)
Each query includes:
{
    'query_id': 'query_001',
    'original_query': 'Help me debug this code',
    'adapted_query': 'yo can u help me fix this buggy code',  # After style transfer
    'inferred_domain': 'coding',
    'inferred_scenario': 'debugging',
    'target_turns': 3
}
Style Transfer
Purpose
Style transfer adapts queries to match persona communication preferences:
| Persona Style | Original Query | Adapted Query |
|---|---|---|
| Casual | “Please help me write an email” | “hey can u help me write an email” |
| Technical | “Explain machine learning” | “Provide a technical overview of ML algorithms” |
| Formal | “Fix this bug” | “I would appreciate assistance in resolving this software defect” |
Configuration
query_generation:
  style_transfer:
    enabled: true
    transfer_probability: 0.5  # 50% of queries are transferred
    template: "prompts/query_style_transfer.txt"
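One plausible reading of `transfer_probability` is an independent random draw per query. The sketch below illustrates that interpretation only; the function `should_transfer` is hypothetical, and the actual generator may make the decision differently (for example, by transferring a fixed fraction of each batch).

```python
import random


def should_transfer(rng: random.Random, transfer_probability: float, enabled: bool = True) -> bool:
    """Decide, for one query, whether to apply style transfer."""
    if not enabled:
        return False
    # rng.random() is uniform on [0.0, 1.0), so 0.0 never transfers and 1.0 always does.
    return rng.random() < transfer_probability


rng = random.Random(42)  # seeded so repeated runs make the same decisions
decisions = [should_transfer(rng, 0.5) for _ in range(1000)]
print(f"transferred {sum(decisions)} of {len(decisions)} queries")
```

Under this reading, the share of transferred queries only approaches 50% over many queries; a persona with five queries may have anywhere from zero to five adapted.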
Template
The style transfer template (prompts/query_style_transfer.txt):
You are adapting a user query to match a specific persona's communication style.
Original Query: {original_query}
Persona Features:
{persona_features}
Rewrite the query to match this persona's:
- Communication style: {communication_style}
- Query length preference: {query_length_pref}
- Tone: {tone}
Output only the adapted query, nothing else.
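The placeholders in the template can be filled with ordinary string formatting. The sketch below assumes the persona features dictionary carries `communication_style`, `query_length_pref`, and `tone` keys; those key names, the fallback value, and the helper `render_style_prompt` are illustrative assumptions rather than the generator's actual behavior.

```python
from typing import Dict

TEMPLATE = """You are adapting a user query to match a specific persona's communication style.
Original Query: {original_query}
Persona Features:
{persona_features}
Rewrite the query to match this persona's:
- Communication style: {communication_style}
- Query length preference: {query_length_pref}
- Tone: {tone}
Output only the adapted query, nothing else."""


def render_style_prompt(template: str, original_query: str, persona_features: Dict[str, str]) -> str:
    """Fill the style transfer template for one query and one persona."""
    # Render all features as a sorted bullet list so the prompt is deterministic.
    features_block = "\n".join(f"- {key}: {value}" for key, value in sorted(persona_features.items()))
    return template.format(
        original_query=original_query,
        persona_features=features_block,
        # Assumption: missing preferences fall back to "unspecified".
        communication_style=persona_features.get("communication_style", "unspecified"),
        query_length_pref=persona_features.get("query_length_pref", "unspecified"),
        tone=persona_features.get("tone", "unspecified"),
    )


prompt = render_style_prompt(
    TEMPLATE,
    "Fix this bug",
    {"role": "engineer", "communication_style": "casual", "tone": "friendly"},
)
print(prompt)
```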
Programmatic Usage
# With style transfer
queries = generator.generate_queries_for_persona(
    persona_features=features,
    num_queries=5,
    apply_style_transfer=True
)

# Without style transfer
queries = generator.generate_queries_for_persona(
    persona_features=features,
    num_queries=5,
    apply_style_transfer=False
)
Query Storage
Persistence
Queries are persisted per persona, which supports incremental mode:
from src.query_storage import QueryStorage
storage = QueryStorage("output/queries")
# Save queries for a persona
storage.save_persona(persona_id, queries)
# Load queries for a persona
stored_queries = storage.load_persona(persona_id)
# Load all queries
all_queries = storage.load_all() # Dict[persona_id, List[queries]]
Output Format
output/queries/persona_20260206_001.json:
{
  "persona_id": "persona_20260206_001",
  "queries": [
    {
      "query_id": "query_001",
      "original_query": "Help me write an email",
      "adapted_query": "hey can u help me write an email",
      "inferred_domain": "writing",
      "inferred_scenario": "email_composition",
      "target_turns": 3,
      "style_transferred": true
    },
    ...
  ],
  "metadata": {
    "created_at": "2026-02-06T10:30:00",
    "total_queries": 5
  }
}
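To make the file layout concrete, here is a minimal sketch that writes and reads one JSON file per persona in the shape shown above. It is not the `QueryStorage` implementation; the function names are hypothetical, and returning an empty list for an unknown persona is an assumption.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, List


def save_persona_queries(output_dir: Path, persona_id: str, queries: List[Dict[str, Any]]) -> Path:
    """Write one JSON file per persona and return its path."""
    output_dir.mkdir(parents=True, exist_ok=True)
    payload = {
        "persona_id": persona_id,
        "queries": queries,
        "metadata": {
            "created_at": datetime.now().isoformat(timespec="seconds"),
            "total_queries": len(queries),
        },
    }
    path = output_dir / f"{persona_id}.json"
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path


def load_persona_queries(output_dir: Path, persona_id: str) -> List[Dict[str, Any]]:
    """Read the stored queries for a persona."""
    path = output_dir / f"{persona_id}.json"
    if not path.exists():
        return []  # assumption: a missing file means no queries were stored yet
    return json.loads(path.read_text(encoding="utf-8"))["queries"]


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "queries"
    query = {"query_id": "query_001", "original_query": "Help me write an email"}
    saved_name = save_persona_queries(out, "persona_20260206_001", [query]).name
    loaded = load_persona_queries(out, "persona_20260206_001")
    missing = load_persona_queries(out, "persona_unknown")

print(saved_name, len(loaded), len(missing))
```

Keeping one file per persona means a rerun can skip personas whose file already exists, which is presumably what makes incremental mode cheap.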
Domain and Scenario Inference
Automatic Inference
The system infers domain and scenario from query content:
query = "Help me debug this Python script"
# Inferred:
# domain: "coding"
# scenario: "debugging"
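This guide does not say how inference is performed; it may well be done by the LLM. Purely as an illustration of the input and output shape, the sketch below uses a small keyword table. The table `KEYWORD_RULES`, the function name, and the `"general"` fallback are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical keyword table mapping a trigger word to (domain, scenario).
KEYWORD_RULES: Dict[str, Tuple[str, str]] = {
    "debug": ("coding", "debugging"),
    "email": ("writing", "email_composition"),
}


def infer_domain_and_scenario(query: str) -> Tuple[str, str]:
    """Return (domain, scenario) for a query using simple keyword matching."""
    lowered = query.lower()
    for keyword, labels in KEYWORD_RULES.items():
        if keyword in lowered:
            return labels
    # Assumption: queries that match nothing fall back to a generic label.
    return ("general", "general")


domain, scenario = infer_domain_and_scenario("Help me debug this Python script")
print(domain, scenario)
```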
Impact on Persona
The inferred domain and scenario update the persona’s features, and the system prompt is regenerated to reflect them:
# Original persona features
features = {'role': 'engineer', 'domain': 'general'}
# After query assignment
updated_features = {'role': 'engineer', 'domain': 'coding'}
# System prompt regenerated with coding context
Batch Generation
For Multiple Personas
# Generate queries for all personas
queries_map = generator.generate_queries_batch(personas)
# Returns: Dict[persona_id, List[query_dict]]
for persona_id, queries in queries_map.items():
    print(f"{persona_id}: {len(queries)} queries")
Tracking Used Queries
# Mark queries as used (prevents reuse)
generator.mark_used_query_ids({'query_001', 'query_002'})
# Next generation will skip these queries
new_queries = generator.generate_queries_for_persona(features, num_queries=5)
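The effect of tracking used IDs can be sketched as sampling only from queries whose IDs have not been marked. This is an assumption about the behavior, not the generator's code; `select_unused` is a hypothetical helper, and it returns fewer queries than requested when the pool runs low.

```python
import random
from typing import Dict, List, Set


def select_unused(
    pool: List[Dict[str, str]],
    used_ids: Set[str],
    num_queries: int,
    rng: random.Random,
) -> List[Dict[str, str]]:
    """Sample up to num_queries queries whose IDs are not yet used."""
    available = [q for q in pool if q["query_id"] not in used_ids]
    chosen = rng.sample(available, min(num_queries, len(available)))
    # Record the new picks so later calls skip them as well.
    used_ids.update(q["query_id"] for q in chosen)
    return chosen


pool = [{"query_id": f"query_{i:03d}", "text": f"seed query {i}"} for i in range(1, 6)]
used = {"query_001", "query_002"}
rng = random.Random(0)

first = select_unused(pool, used, 2, rng)
second = select_unused(pool, used, 5, rng)  # only one unused query remains
print([q["query_id"] for q in first], [q["query_id"] for q in second])
```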
Advanced Usage
Custom Query Selection
class CustomQueryGenerator(UserQueryGenerator):
    def select_queries(self, persona_features, num_queries):
        # Custom selection logic
        # e.g., match queries to persona's domain
        domain = persona_features.get('domain')
        matching_queries = self.dataset.filter_by_domain(domain)
        return matching_queries[:num_queries]
Partial Style Transfer
For queries that should only be partially adapted:
query_generation:
  style_transfer:
    template: "prompts/query_style_transfer_partial.txt"
The partial template preserves key information while adapting style:
Adapt only the surface form of the query.
Preserve:
- All technical terms
- Specific requirements
- Key constraints
Only change:
- Greeting/closing style
- Formality level
- Sentence structure
API Reference
UserQueryGenerator
class UserQueryGenerator:
    """Generates and adapts queries for personas."""

    def __init__(self, config: Dict, llm_client: LLMClient):
        """Initialize with configuration and LLM client."""

    def generate_queries_for_persona(
        self,
        persona_features: Dict[str, str],
        num_queries: int = 5,
        apply_style_transfer: bool = True
    ) -> List[Dict[str, Any]]:
        """Generate queries for a single persona."""

    def generate_queries_batch(
        self,
        personas: List[Dict[str, Any]]
    ) -> Dict[str, List[Dict[str, Any]]]:
        """Generate queries for multiple personas."""

    def mark_used_query_ids(self, query_ids: Set[str]) -> None:
        """Mark query IDs as used to prevent reuse."""
QueryDataset
class QueryDataset:
    """Manages the seed query dataset."""

    def __init__(self, path: str):
        """Load queries from JSONL file."""

    def __len__(self) -> int:
        """Return number of queries."""

    def sample(self, n: int) -> List[Dict]:
        """Sample n random queries."""

    def get_sources(self) -> Set[str]:
        """Get unique source identifiers."""
QueryStorage
class QueryStorage:
    """Persists queries per persona."""

    def __init__(self, output_dir: str):
        """Initialize storage directory."""

    def save_persona(self, persona_id: str, queries: List[Dict]) -> str:
        """Save queries for a persona. Returns file path."""

    def load_persona(self, persona_id: str) -> List[Dict]:
        """Load queries for a persona."""

    def load_all(self) -> Dict[str, List[Dict]]:
        """Load all stored queries."""

    def append_persona(self, persona_id: str, queries: List[Dict]) -> None:
        """Append queries to existing persona file."""
Best Practices
1. Diverse Query Dataset
Include queries from various domains and complexity levels:
{"text": "Simple greeting", "metadata": {"complexity": "low"}}
{"text": "Complex technical question with multiple parts", "metadata": {"complexity": "high"}}
2. Balanced Style Transfer
style_transfer:
  transfer_probability: 0.5  # Not 100% - keep some original queries
3. Query Deduplication
# Remove duplicate queries
queries = generator._dedupe_queries(queries)
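The guide does not document what `_dedupe_queries` compares. One reasonable approach, sketched here under that caveat, is to treat queries as duplicates when their `original_query` text matches after lowercasing and collapsing whitespace, keeping the first occurrence. The function `dedupe_by_text` is hypothetical.

```python
from typing import Any, Dict, List


def dedupe_by_text(queries: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop later queries whose normalized original text was already seen."""
    seen = set()
    unique = []
    for query in queries:
        # Normalize case and whitespace so trivial variants count as duplicates.
        key = " ".join(query["original_query"].lower().split())
        if key in seen:
            continue
        seen.add(key)
        unique.append(query)
    return unique


candidates = [
    {"query_id": "query_001", "original_query": "Fix this bug"},
    {"query_id": "query_002", "original_query": "  fix   this BUG "},
    {"query_id": "query_003", "original_query": "Explain recursion"},
]
unique_queries = dedupe_by_text(candidates)
print([q["query_id"] for q in unique_queries])
```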
4. Monitor Query Usage
# Check remaining queries
remaining = len(dataset) - len(used_query_ids)
print(f"Remaining queries: {remaining}")
See Also
Persona System - How personas affect queries
Interaction Generation - Using queries in conversations
Query API - Detailed API documentation