# ✅ READY TO RUN - All Setup Complete

**Status:** 100% Ready for Execution

**Date:** 2025-11-04
**Setup Time:** Complete
**Sample Data:** ✅ Created (1000 samples)
**All Scripts:** ✅ Ready
**Configuration:** ✅ Complete

## 🎯 What's Been Done

### ✅ Infrastructure (Complete)

- Judge-F & Judge-R rubrics with strict validation
- SampleGate quality filtering pipeline
- Batch generation orchestrator
- Tmux launchers for all jobs
- QLoRA training configuration (2×A100)
- Complete pipeline orchestration
- Environment validation tools
- Data directory structure
- Sample scientific corpus (1000 entries)

### ✅ Sample Data Created

`data/raw/scientific_corpus_325M.jsonl`
- 1000 sample scientific questions
- Format: JSONL with `prompt_text`, `context`, `task_type`
- Size: 199 KB
- Domains: Biology, Physics, Chemistry, CS
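
If you want to verify the record shape before running anything, a quick peek at the first line (field names taken from the format description above):

```python
import json

# Read the first record and list its fields.
with open("data/raw/scientific_corpus_325M.jsonl") as f:
    record = json.loads(f.readline())

print(sorted(record))  # expect: context, prompt_text, task_type
```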

## 🚀 Run The Pipeline Now

### Option 1: Quick Test (~1 minute, ~$0.10)

Test with a tiny batch first:

```bash
cd /Users/allanmurimiwandia/.cursor/worktrees/Nexa_compute/ifMzH

# Set API key
export OPENAI_API_KEY="your_key_here"

# Run mini test (10 samples, ~1 min)
python3 scripts/run_batch_generation.py \
    --config batches/teacher_gen_v1.yaml \
    --input data/raw/scientific_corpus_325M.jsonl \
    --num-batches 1 \
    --batch-size 10

# Check output
ls -lh data/processed/distillation/
cat data/processed/distillation/generation_manifest.jsonl
```
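
Beyond `ls` and `cat`, a short pandas sketch for sanity-checking the batch (it only prints whatever columns the generator produced, so no output schema is assumed):

```python
import pandas as pd

# Shape and schema of the generated batch (needs pyarrow or fastparquet).
df = pd.read_parquet("data/processed/distillation/generated_batch_0000.parquet")
print(f"{len(df)} rows; columns: {list(df.columns)}")

# The manifest is one JSON object per line, so line count == entry count.
with open("data/processed/distillation/generation_manifest.jsonl") as f:
    print(sum(1 for _ in f), "manifest entries")
```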

### Option 2: QC Batch (10 minutes, ~$0.75)

Run the recommended QC batch: 

```bash
# Start QC batch with tmux
./scripts/tmux_data_gen.sh 1 100

# This will:
# - Create tmux session "data_gen"
# - Set up environment
# - Show command ready to run
# - Press ENTER to start
# - Ctrl+B then D to detach
```

### Option 3: Full Pipeline (18+ hours, ~$60-75)

Run the complete workflow: 

```bash
# Full generation (1000 samples from our test data)
./scripts/tmux_data_gen.sh 1 1000

# After completion, run filtering
python3 -m nexa_distill.sample_gate \
    --input data/processed/distillation/generated_batch_0000.parquet \
    --output data/processed/distillation/filtered_samples.parquet \
    --rejections data/processed/distillation/rejections.parquet \
    --report data/processed/distillation/filter_report.md

# Prepare training data
python3 -c "
import pandas as pd
from sklearn.model_selection import train_test_split
from pathlib import Path

df = pd.read_parquet('data/processed/distillation/filtered_samples.parquet')
train_df, val_df = train_test_split(df, test_size=0.05, random_state=42)

def format_sample(row):
    return f'''### Question:
{row.get('prompt_text', '')}

### Answer:
{row.get('teacher_output', '')}'''

train_df['text'] = train_df.apply(format_sample, axis=1)
val_df['text'] = val_df.apply(format_sample, axis=1)

Path('data/processed/distillation').mkdir(parents=True, exist_ok=True)
train_df.to_parquet('data/processed/distillation/sft_train.parquet', index=False)
val_df.to_parquet('data/processed/distillation/sft_val.parquet', index=False)

print(f'Train: {len(train_df):,} | Val: {len(val_df):,}')
"

# Launch training
./scripts/tmux_training.sh
```
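
Before launching training, it is worth confirming the split files the script above just wrote; a minimal check:

```python
import pandas as pd

# Confirm the 95/5 split landed where the training config expects it.
train = pd.read_parquet("data/processed/distillation/sft_train.parquet")
val = pd.read_parquet("data/processed/distillation/sft_val.parquet")
print(f"train={len(train):,}  val={len(val):,}")

# Eyeball one formatted sample before committing GPU hours.
print(train["text"].iloc[0][:300])
```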

---

## 📋 Recommended Workflow

### Step 1: Test Mini Batch (1 minute)
```bash
export OPENAI_API_KEY="sk-..."

python3 scripts/run_batch_generation.py \
    --config batches/teacher_gen_v1.yaml \
    --input data/raw/scientific_corpus_325M.jsonl \
    --num-batches 1 \
    --batch-size 10
```

**Check:** Does it generate 10 samples without errors?

### Step 2: QC Batch (10 minutes)
```bash
./scripts/tmux_data_gen.sh 1 100
```

**Check:** Are judge scores reasonable? Any JSON errors?
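
One quick way to answer the score question is to summarize the judge columns. The column names in this sketch are assumptions about the output schema; swap in whatever your batch actually contains:

```python
import pandas as pd

df = pd.read_parquet("data/processed/distillation/generated_batch_0000.parquet")

# "judge_f_score" / "judge_r_score" are guesses at how the Judge-F and
# Judge-R results are stored -- adjust to the real column names.
for col in ("judge_f_score", "judge_r_score"):
    if col in df.columns:
        print(col)
        print(df[col].describe())
```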

### Step 3: Scale Up
Once QC passes, scale to your desired size: 
- 1000 samples: `./scripts/tmux_data_gen.sh 1 1000` (~1-2 hours, ~$7)
- 10000 samples: `./scripts/tmux_data_gen.sh 1 10000` (~10-12 hours, ~$70)

### Step 4: Filter & Train
Follow Option 3 commands above.

---

## 💡 Important Notes

### About Sample Data
The current data file has 1000 samples that repeat 10 scientific questions. This is **perfect for testing** but you'll want to replace it with real data for production.

**For production:**
- Replace `data/raw/scientific_corpus_325M.jsonl` with your real corpus
- Or point `batches/teacher_gen_v1.yaml` to your data file
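
As a rough sketch of that conversion (field names follow the format described under "Sample Data Created"; the `task_type` value here is a placeholder, not a known schema value):

```python
import json

questions = ["..."]  # replace with your real question source

# Write one JSON object per line in the shape the pipeline reads.
with open("data/raw/scientific_corpus_325M.jsonl", "w") as f:
    for q in questions:
        f.write(json.dumps({
            "prompt_text": q,
            "context": "",
            "task_type": "qa",
        }) + "\n")
```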

### About Costs
Current setup with the 1000-sample test corpus:
- Generation: ~$7 for all 1000
- Training: ~$9 for 3 hours on 2×A100
- **Total: ~$16 for a complete test**

For a real 100k corpus:
- Generation: ~$60-75
- Training: ~$9
- **Total: ~$70-85**

### About Training
The training config assumes you have access to 2×A100 GPUs. If not:
- Use single GPU: `./scripts/tmux_training.sh nexa_train/configs/baseline_qlora.yaml 1`
- Or run on cloud: Prime Intellect, RunPod, LambdaLabs

---

## 🔍 Monitoring

```bash
# List tmux sessions
tmux ls

# Attach to generation
tmux attach-session -t data_gen

# Attach to training
tmux attach-session -t training

# Detach (while attached): press Ctrl+B, then D

# Check logs
tail -f logs/data_gen/run_*.log
tail -f logs/training/run_*.log

# View generation manifest
tail -f data/processed/distillation/generation_manifest.jsonl

# GPU monitoring (if available)
watch -n 1 nvidia-smi
```
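
For a quick pulse on a long run without attaching to tmux, you can tally the manifest in Python (the `status` key is an assumption about the manifest schema; adjust to what it actually logs):

```python
import json
from collections import Counter

# Count manifest entries and group them by status to spot failures early.
with open("data/processed/distillation/generation_manifest.jsonl") as f:
    entries = [json.loads(line) for line in f if line.strip()]

print(len(entries), "entries")
print(Counter(e.get("status", "unknown") for e in entries))
```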

---

## ✅ Pre-flight Checklist

- [x] Data directory structure created
- [x] Sample corpus generated (1000 samples)
- [x] All scripts executable
- [x] Configuration files ready
- [ ] **OpenAI API key set** ← YOU NEED TO DO THIS
- [ ] Python dependencies installed ← Check with `./scripts/validate_environment.sh`

---

## 🚨 Before You Start

### Required
```bash
# Set API key (required)
export OPENAI_API_KEY="sk-..."

# Verify
echo $OPENAI_API_KEY
```

### Optional but Recommended
```bash
# Install dependencies (if not already)
pip install pandas torch transformers peft bitsandbytes accelerate tqdm pyyaml openai scikit-learn

# Validate environment
./scripts/validate_environment.sh
```
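
If the validation script is unavailable, a rough Python stand-in for the dependency check (note that scikit-learn imports as `sklearn` and pyyaml as `yaml`):

```python
import importlib.util

# Report which of the packages from the pip command above are importable.
for mod in ("pandas", "torch", "transformers", "peft", "bitsandbytes",
            "accelerate", "tqdm", "yaml", "openai", "sklearn"):
    found = importlib.util.find_spec(mod) is not None
    print(("ok      " if found else "MISSING ") + mod)
```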

---

## 🎯 Your First Command

**Start with the mini test:**

```bash
cd /Users/allanmurimiwandia/.cursor/worktrees/Nexa_compute/ifMzH

export OPENAI_API_KEY="your_key_here"

python3 scripts/run_batch_generation.py \
    --config batches/teacher_gen_v1.yaml \
    --input data/raw/scientific_corpus_325M.jsonl \
    --num-batches 1 \
    --batch-size 10
```

**Expected output:**

- data/processed/distillation/generated_batch_0000.parquet
- data/processed/distillation/generation_manifest.jsonl
- 10 samples with teacher outputs and judge scores

**Time:** ~1 minute
**Cost:** ~$0.10

## 📚 Documentation

- **Quick Start:** QUICK_START.md
- **Complete Summary:** TODAY_SUMMARY.md
- **Detailed Plan:** docs/TODAY_EXECUTION_PLAN.md
- **This File:** Quick reference for immediate execution

## 🎉 You're Ready!

Everything is set up. Just:

1. Set your API key: `export OPENAI_API_KEY="..."`
2. Run the mini test command above
3. If it works, scale up to the QC batch or full generation

The pipeline is ready to go! 🚀

Questions? Check TODAY_SUMMARY.md or docs/TODAY_EXECUTION_PLAN.md