PersonaGym
Getting Started
Installation
Requirements
Python Dependencies
Install from Source
Step 1: Clone the Repository
Step 2: Create Virtual Environment
Step 3: Install Dependencies
API Keys Configuration
Option 1: Environment Variables (Recommended)
Option 2: Configuration File
Verify Installation
Test Basic Pipeline
Troubleshooting
Import Errors
API Authentication Errors
Rate Limiting
Encoding Issues (Windows)
Development Installation
Directory Structure After Installation
Next Steps
Quick Start
5-Minute Introduction
Core Workflows
1. Basic Pipeline (Persona Generation Only)
2. Enhanced Pipeline (Full Data Generation)
3. Stage-by-Stage Execution
4. Skip Specific Features
Configuration Overview
Understanding Output
Output Directory Structure
Sample Output Formats
Common Patterns
Pattern 1: End-to-End Pipeline
Pattern 2: Incremental Generation
Pattern 3: Custom Persona Features
Pattern 4: Multi-Provider Model Pool
Next Steps
Troubleshooting
Empty Responses
Slow Generation
Memory Issues
Tutorials
Tutorial Overview
Prerequisites
What You’ll Learn
Example Notebooks
Getting Help
Let’s Get Started!
User Guide
Configuration
Configuration File Structure
API Configuration
Single Provider
Environment Variable Reference
OpenRouter Configuration
Paths Configuration
Persona Generation Configuration
Sampling Strategies
Formulation Configuration
Query Generation Configuration
Interaction Generation Configuration
Distractor Configuration
Semantic Distractor Layers
Training Data Configuration
Experiment Configuration
Configuration Validation
Validation Checks
Environment Variables
Complete Example
See Also
Persona System
Overview
Persona Dimensions
Basic Info (Constraint Dimensions)
Communication Style
Query Preferences
Persona Sampling
Configuration
Sampling Strategies
Diversity Enforcement
System Prompt Generation
Template
LLM Formulator
PersonaSpec Data Structure
JSON Output Format
Persona Storage
Saving Personas
Incremental Mode
Customizing Persona Dimensions
Adding New Dimensions
Modifying Existing Dimensions
Dimension Constraints
API Reference
PersonaBank
PersonaSampler
PersonaSpec
Best Practices
1. Balance Diversity and Realism
2. Include Essential Dimensions
3. Validate System Prompts
4. Use Incremental Mode for Large Runs
See Also
Query Generation
Overview
Query Dataset
Input Format
Dataset Statistics
Query Assignment
Configuration
Selection Logic
Style Transfer
Purpose
Configuration
Template
Programmatic Usage
Query Storage
Persistence
Output Format
Domain and Scenario Inference
Automatic Inference
Impact on Persona
Batch Generation
For Multiple Personas
Tracking Used Queries
Advanced Usage
Custom Query Selection
Partial Style Transfer
API Reference
UserQueryGenerator
QueryDataset
QueryStorage
Best Practices
1. Diverse Query Dataset
2. Balanced Style Transfer
3. Query Deduplication
4. Monitor Query Usage
See Also
Interaction Generation
Overview
Architecture
Configuration
Conversation Flow
1. Initial Query
2. Assistant Response
3. User Feedback
Model Pool
Weighted Model Selection
Model Locking
Interaction Data Structure
Output Format
Programmatic Usage
Single Interaction
Batch Generation
Interaction Storage
Incremental Saving
Index File
Distractor Integration
Real-time Noise Application
Metadata Tracking
Error Handling
Retry Logic
Supplement Rounds
Concurrent Execution
Thread Safety
Worker Configuration
API Reference
InteractionGenerator
AssistantModel
UserFeedbackModel
Best Practices
1. Configure Appropriate Turn Limits
2. Use Model Pool for Diversity
3. Enable Incremental Storage
4. Monitor Success Rate
See Also
Distractor System
Overview
Architecture
Three-Layer Semantic Distractor
Layer Overview
Layer 1: Surface Noise
Layer 2: Incomplete Information
Layer 3: Semantic Ambiguity
Configuration
Enable Semantic Distractor
Strategy Configuration
Programmatic Usage
Create Distractor
Apply Noise
Batch Processing
NoiseResult Data Structure
Output Example
Intent/Slot Extraction
ExtractedSemantics
Preservation by Layer
Legacy Rule-Based Distractor
Configuration
Available Strategies
Usage
Integration with Interactions
Real-time Application
Metadata Structure
API Reference
SemanticDistractorModel
DistractorModel
Factory Function
Best Practices
1. Start with Low Activation Probability
2. Balance Layer Weights
3. Use Mandatory Strategies Sparingly
4. Monitor Noise Quality
See Also
Training Data
Overview
TrainingSample Structure
Output Format
Configuration
Collection Process
From Interactions
From Storage
Export Process
Basic Export
With Train/Val/Test Split
Export Statistics
Output Files
Directory Structure
Statistics File
Programmatic Usage
Complete Workflow
Custom Transformation
Data Validation
Sample Validation
Batch Validation
Format Options
Include/Exclude Fields
Minimal Format
HuggingFace Integration
Prepare for Upload
Upload to Hub
API Reference
TrainingSample
TrainingDataCollector
TrainingDataExporter
Best Practices
1. Validate Before Export
2. Use Timestamps for Versioning
3. Monitor Statistics
4. Incremental Export
See Also
Token Tracking
Overview
Architecture
Basic Usage
Automatic Tracking
Manual Tracking
TokenUsage Structure
Statistics Output
Summary
By Module
By Model
Export Format
JSON Output
Export Methods
Cost Analysis
Estimate Costs
Cost per Sample
Module Breakdown
Tracked Modules
Cost Distribution
Analysis Scripts
analyze_token_usage.py
analyze_for_paper.py
API Reference
TokenTracker
Convenience Functions
Thread Safety
Best Practices
1. Enable Tracking Early
2. Export Regularly
3. Monitor Cost During Development
4. Analyze Before Scale-Up
See Also
Examples
Basic Pipeline Example
Overview
Complete Example
Expected Output
Command Line Equivalent
Configuration
Key Takeaways
Next Steps
Enhanced Pipeline Example
Overview
Complete Example
Expected Output
Command Line Equivalent
Stage Control
Key Takeaways
Next Steps
Custom Persona Example
Overview
Custom Dimensions
Edit persona.yaml
Custom Sampling
Edit sampling_config.yaml
Programmatic Example
Key Takeaways
Next Steps
Multi-Provider LLM Example
Overview
Configuration
Programmatic Usage
Model Selection
Token Tracking
Key Takeaways
See Also
API Reference
Overview
Quick Links
Pipeline
LLM Client
Persona
Query
Interaction
Distractor
Training Data
Utils
Quick Module Reference
Core Classes
Data Classes
Factory Functions
Module Dependencies
Type Hints
Error Handling
Search
Pipeline API
EnhancedPersonaGenerationPipeline
Class Definition
Constructor
Methods
run
Attributes
Usage Example
PersonaGenerationPipeline
Class Definition
Constructor
Methods
run
reset
Usage Example
Command Line Interface
Usage
Options
Stages
Examples
See Also
LLM Client API
Overview
Factory Function
create_llm_client
LLMClient (Abstract Base)
OpenAIClient
Constructor
Methods
generate
generate_with_tokens
Attributes
LLMFormulator
Constructor
Methods
formulate
Usage Examples
Basic Usage
OpenRouter Usage
System Prompt Generation
Error Handling
See Also
Persona API
PersonaBank
Class Definition
Constructor
Methods
PersonaSampler
Class Definition
Constructor
Methods
PersonaSpec
Class Definition
Methods
Utility Functions
PersonaSpecStorage
Class Definition
Constructor
Methods
Usage Examples
Complete Workflow
See Also
Query API
UserQueryGenerator
Class Definition
Constructor
Methods
QueryDataset
Class Definition
Constructor
Methods
QueryStorage
Class Definition
Constructor
Methods
Usage Examples
Generate Queries
Batch Generation
Persist Queries
See Also
Interaction API
InteractionGenerator
Class Definition
Constructor
Methods
Interaction
Class Definition
Methods
Message
Class Definition
InteractionStorage
Class Definition
Constructor
Methods
AssistantModel
Class Definition
Methods
UserFeedbackModel
Class Definition
Methods
Usage Examples
Single Interaction
Batch Generation
See Also
Distractor API
Factory Function
SemanticDistractorModel
Class Definition
Constructor
Methods
Attributes
DistractorModel
Class Definition
Methods
NoiseResult
Class Definition
Methods
NoisyVersion
Class Definition
IntentSlotExtractor
Class Definition
Methods
LLMNoiseGenerator
Class Definition
Methods
Usage Examples
Semantic Distractor
Legacy Distractor
Intent Extraction
See Also
Training Data API
TrainingSample
Class Definition
Methods
TrainingDataCollector
Class Definition
Constructor
Methods
TrainingDataExporter
Class Definition
Constructor
Methods
Usage Examples
Collect from Interactions
Export to Files
Compute Statistics
Complete Pipeline Integration
Output Format
Sample JSON
Statistics JSON
See Also
Utils API
TokenTracker
Class Definition
Class Methods
Instance Methods
Convenience Functions
TokenUsage
Class Definition
ColoredLogger
Functions
Color Types
Config Validation
Functions
ValidationCheck
ValidationIssue
Usage Examples
Token Tracking
Colored Output
Config Validation
Thread Safety
See Also
Additional Information
License
MIT License
Third-Party Licenses
Commercial Use
Attribution
Questions
Index