# Pipeline API Main pipeline classes for orchestrating data generation. ## EnhancedPersonaGenerationPipeline Complete 6-stage pipeline for generating synthetic training data. ```python from src.enhanced_pipeline import EnhancedPersonaGenerationPipeline ``` ### Class Definition ```python class EnhancedPersonaGenerationPipeline: """ Complete pipeline for generating synthetic training data. Stages: 1. Persona Generation 2. System Prompt Formulation 3. Query Generation with Style Transfer 4. Multi-turn Interaction Generation 5. Distractor/Noise Application 6. Training Data Collection and Export """ ``` ### Constructor ```python def __init__(self, config_path: str): """ Initialize the enhanced pipeline. Args: config_path: Path to configuration YAML file Raises: FileNotFoundError: If config file not found ValueError: If config validation fails """ ``` ### Methods #### run ```python def run(self, num_personas: Optional[int] = None) -> Dict[str, Any]: """ Run the complete pipeline. Args: num_personas: Number of personas to generate. If None, uses value from config. Returns: Dictionary with pipeline results: - 'personas': List of persona dictionaries - 'queries': Dict mapping persona_id to queries - 'interactions': List of interaction dictionaries - 'enhanced_interactions': Interactions with noise (deprecated) - 'training_data': Export statistics Example: >>> pipeline = EnhancedPersonaGenerationPipeline("config.yaml") >>> result = pipeline.run(num_personas=10) >>> print(f"Generated {len(result['personas'])} personas") """ ``` ### Attributes | Attribute | Type | Description | |-----------|------|-------------| | `config` | `Dict[str, Any]` | Loaded configuration | | `stages` | `Dict[str, bool]` | Which stages to run | | `incremental` | `bool` | Skip existing outputs | | `token_tracker` | `TokenTracker` | Token usage tracker | | `llm_client` | `LLMClient` | Main LLM client | | `persona_bank` | `PersonaBank` | Persona dimensions | | `sampler` | `PersonaSampler` | Feature sampler | | `query_generator` | `UserQueryGenerator` | Query generator | | `interaction_generator` | `InteractionGenerator` | Dialogue generator | | `distractor` | `DistractorModel` | Noise injector | | `training_collector` | `TrainingDataCollector` | Sample collector | | `training_exporter` | `TrainingDataExporter` | Sample exporter | ### Usage Example ```python from src.enhanced_pipeline import EnhancedPersonaGenerationPipeline # Initialize pipeline pipeline = EnhancedPersonaGenerationPipeline("config.yaml") # Run full pipeline result = pipeline.run(num_personas=100) # Access results print(f"Personas: {len(result['personas'])}") print(f"Queries: {sum(len(q) for q in result['queries'].values())}") print(f"Interactions: {len(result['interactions'])}") print(f"Training samples: {result['training_data']['total_samples']}") ``` --- ## PersonaGenerationPipeline Basic pipeline for persona generation only (without interactions). ```python from src.pipeline import PersonaGenerationPipeline ``` ### Class Definition ```python class PersonaGenerationPipeline: """ Basic pipeline for generating personas with system prompts. Use this for: - Generating personas without interactions - Lightweight persona generation at scale - Exporting persona datasets """ ``` ### Constructor ```python def __init__(self, config_path: str): """ Initialize the basic pipeline. Args: config_path: Path to configuration YAML file """ ``` ### Methods #### run ```python def run( self, num_personas: Optional[int] = None, generate_prompts: bool = True, export_dataset: bool = False, dataset_path: Optional[str] = None ) -> Dict[str, Any]: """ Run the persona generation pipeline. Args: num_personas: Number of personas to generate generate_prompts: Whether to generate system prompts export_dataset: Whether to export to dataset file dataset_path: Custom export path Returns: Dictionary with: - 'num_personas': Count of generated personas - 'personas': List of PersonaSpec objects - 'dataset_path': Path to exported dataset (if exported) """ ``` #### reset ```python def reset(self) -> None: """ Reset pipeline state. Clears stored personas and resets sampler state. """ ``` ### Usage Example ```python from src.pipeline import PersonaGenerationPipeline # Initialize pipeline = PersonaGenerationPipeline("config.yaml") # Generate personas only result = pipeline.run( num_personas=1000, generate_prompts=True, export_dataset=True, dataset_path="personas.json" ) print(f"Generated {result['num_personas']} personas") print(f"Exported to {result['dataset_path']}") ``` --- ## Command Line Interface The `run.py` script provides CLI access to both pipelines. ### Usage ```bash python run.py [OPTIONS] ``` ### Options | Option | Description | Default | |--------|-------------|---------| | `--config PATH` | Configuration file path | `config.yaml` | | `--num-personas NUM` | Number of personas | From config | | `--mode {basic,enhanced}` | Pipeline mode | `enhanced` | | `--stage STAGE` | Run specific stage | `all` | | `--skip-query-transfer` | Skip style transfer | `False` | | `--skip-distractor` | Skip noise injection | `False` | | `--reset` | Reset pipeline state | `False` | ### Stages | Stage | Description | |-------|-------------| | `all` | Run all stages | | `persona` | Persona generation only | | `query` | Query generation only | | `interaction` | Interaction generation only | | `distractor` | Distractor application only | | `training` | Training data export only | ### Examples ```bash # Full enhanced pipeline python run.py --num-personas 100 # Basic mode only python run.py --mode basic --num-personas 1000 # Skip noise injection python run.py --skip-distractor --num-personas 50 # Run specific stage python run.py --stage interaction ``` --- ## See Also - [Configuration](../user_guide/configuration.md) - Pipeline configuration - [Quick Start](../quickstart.md) - Getting started guide