NexaCompute Pipeline Execution Guide
Overview
This guide explains how to run the complete NexaCompute pipeline for dataset generation and training using tmux sessions.
Pipeline Stages
- Data Generation (data_gen): Generate teacher outputs for the full dataset
- Filtering (filtering): Apply basic filters + SampleGate quality gates
- Packaging (packaging): Convert filtered data to SFT format
- Training (training): Train the model on the distilled dataset
Quick Start
Launch All Jobs
Manual Execution
If you prefer to run stages manually:
1. Data Generation
2. Filtering
3. Packaging
4. Training
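As a minimal sketch of running one stage by hand, the helper below builds the argument list for starting a command in a detached tmux session. It assumes the names in parentheses under Pipeline Stages are also the tmux session names and that each stage appends its output to the log file listed under Logs. The `command` argument is a placeholder for the real stage command, and `tmux_launch_argv` is a hypothetical helper, not part of NexaCompute.

```python
import shlex
from typing import List

# Session names and log paths as listed in this guide.
STAGE_LOGS = {
    "data_gen": "logs/data_gen.log",
    "filtering": "logs/filtering.log",
    "packaging": "logs/packaging.log",
    "training": "logs/training.log",
}


def tmux_launch_argv(stage: str, command: str) -> List[str]:
    """Build the argv that starts `command` in a detached tmux session.

    `command` is a placeholder for the real stage command. Redirecting
    stdout and stderr into the stage log is an assumption about logging.
    """
    log_path = STAGE_LOGS[stage]  # raises KeyError for an unknown stage
    shell_line = f"{command} >> {shlex.quote(log_path)} 2>&1"
    # `tmux new-session -d -s NAME CMD` runs CMD in a detached session.
    return ["tmux", "new-session", "-d", "-s", stage, shell_line]


print(tmux_launch_argv("filtering", "echo placeholder"))
```

The returned list can be passed to `subprocess.run(argv, check=True)` on a machine where tmux is installed. Run the stages in order, since each one consumes the previous stage's output.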
Tmux Session Management
List Sessions
Attach to Session
Detach from Session
Press Ctrl+B, then D (while inside tmux).
Kill Session
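The session operations in this section correspond to the standard tmux subcommands `list-sessions`, `attach-session -t NAME`, and `kill-session -t NAME`. The sketch below wraps them in a small hypothetical helper, `tmux_argv`, that only builds the argument list. The session name is whatever you pass in; using stage names such as `data_gen` is an assumption.

```python
from typing import List, Optional


def tmux_argv(action: str, session: Optional[str] = None) -> List[str]:
    """Return the tmux argv for 'list', 'attach', or 'kill'."""
    if action == "list":
        return ["tmux", "list-sessions"]
    if action in ("attach", "kill"):
        if not session:
            raise ValueError(f"'{action}' needs a session name")
        subcommand = "attach-session" if action == "attach" else "kill-session"
        # -t selects the target session by name.
        return ["tmux", subcommand, "-t", session]
    raise ValueError(f"unknown action: {action}")


print(tmux_argv("attach", "data_gen"))
```

Building the argv separately from executing it keeps the helper easy to inspect before anything is run. Attaching needs an interactive terminal, so call `subprocess.run(tmux_argv("attach", name))` only from a real shell session.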
File Locations
Inputs
- Teacher inputs: data/processed/distillation/teacher_inputs/teacher_inputs_v1.parquet
- System prompt: data/system_prompt_template.txt
Outputs
- Teacher outputs: data/processed/distillation/teacher_outputs/teacher_outputs_v1.parquet
- Filtered data: data/processed/distillation/filtered/filtered_v1.parquet
- Rejections: data/processed/distillation/filtered/rejections.parquet
- SFT dataset: data/processed/training/sft_dataset.jsonl
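Because each stage writes a file that the next one reads, the output paths above can serve as a rough progress check. The sketch below assumes that data generation produces the teacher outputs, filtering produces the filtered file, and packaging produces the SFT dataset, which matches the stage descriptions but is not stated explicitly. `next_stage` is a hypothetical helper.

```python
from pathlib import Path
from typing import List, Tuple

# (stage, output file) pairs in pipeline order; the pairing is an assumption.
STAGE_OUTPUTS: List[Tuple[str, str]] = [
    ("data_gen", "data/processed/distillation/teacher_outputs/teacher_outputs_v1.parquet"),
    ("filtering", "data/processed/distillation/filtered/filtered_v1.parquet"),
    ("packaging", "data/processed/training/sft_dataset.jsonl"),
]


def next_stage(root: str = ".") -> str:
    """Return the first stage whose output file is missing under `root`.

    If all three data files exist, the remaining stage is training.
    """
    for stage, relative_path in STAGE_OUTPUTS:
        if not (Path(root) / relative_path).is_file():
            return stage
    return "training"
```

A file that exists is not necessarily complete, so treat this as a quick sanity check and consult the logs before starting the next stage.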
Logs
- Data generation: logs/data_gen.log
- Filtering: logs/filtering.log
- Packaging: logs/packaging.log
- Training: logs/training.log
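To check on a stage without attaching to its tmux session, you can read the end of its log file. The following is a minimal sketch of a `tail`-like helper. The function name and the default of 20 lines are illustrative choices, and the log format itself is not assumed.

```python
from collections import deque
from typing import List


def tail(path: str, n: int = 20) -> List[str]:
    """Return the last `n` lines of a text file without trailing newlines."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        # A bounded deque keeps only the most recent n lines in memory.
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]
```

For example, `tail("logs/training.log", 5)` returns the five most recent lines written by the training stage.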
Configuration
Distillation Config
- Location: nexa_distill/configs/distill_config.yaml
- Model: gpt-4o-mini
- Batch size: 8
Filter Config
- Location: nexa_distill/configs/filters.yaml
- Min judge score: 0.80 (set in SampleGate)
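The effect of a minimum judge score can be illustrated with a short sketch. It is not the SampleGate implementation: the field name `judge_score`, the inclusive comparison, and the handling of missing scores are all assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, List, Tuple

MIN_JUDGE_SCORE = 0.80  # value from the filter config described above

Row = Dict[str, Any]


def split_by_judge_score(
    rows: Iterable[Row], threshold: float = MIN_JUDGE_SCORE
) -> Tuple[List[Row], List[Row]]:
    """Split rows into (kept, rejected) using an inclusive score threshold.

    Rows without a `judge_score` value are rejected (an assumption).
    """
    kept: List[Row] = []
    rejected: List[Row] = []
    for row in rows:
        score = row.get("judge_score")  # hypothetical field name
        if score is not None and score >= threshold:
            kept.append(row)
        else:
            rejected.append(row)
    return kept, rejected
```

In terms of the files listed under Outputs, the kept rows would correspond to the filtered data and the rejected rows to the rejections file.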
Training Config
- Location: nexa_train/configs/baseline_distill.yaml
- Distributed: 2 GPUs
- Mixed precision: Enabled
Environment Variables
Ensure the .env file contains: