cd /Users/allanmurimiwandia/.cursor/worktrees/Nexa_compute/ifMzH
# Set API key
export OPENAI_API_KEY="your_key_here"
# Run mini test (10 samples, ~1 min)
python3 scripts/run_batch_generation.py \
--config batches/teacher_gen_v1.yaml \
--input data/raw/scientific_corpus_325M.jsonl \
--num-batches 1 \
--batch-size 10
# Check output
ls -lh data/processed/distillation/
cat data/processed/distillation/generation_manifest.jsonl
```
### Option 2: QC Batch (10 minutes, ~$0.75)
Run the recommended QC batch:
```bash
# Start QC batch with tmux
./scripts/tmux_data_gen.sh 1 100
# This will:
# - Create tmux session "data_gen"
# - Set up environment
# - Show command ready to run
# - Press ENTER to start
# - Ctrl+B then D to detach
```
### Option 3: Full Pipeline (18+ hours, ~$60-75)
Run the complete workflow:
```bash
# Full generation (1000 samples from our test data)
./scripts/tmux_data_gen.sh 1 1000
# After completion, run filtering
python3 -m nexa_distill.sample_gate \
--input data/processed/distillation/generated_batch_0000.parquet \
--output data/processed/distillation/filtered_samples.parquet \
--rejections data/processed/distillation/rejections.parquet \
--report data/processed/distillation/filter_report.md
# Prepare training data
python3 -c "
import pandas as pd
from sklearn.model_selection import train_test_split
from pathlib import Path
df = pd.read_parquet('data/processed/distillation/filtered_samples.parquet')
train_df, val_df = train_test_split(df, test_size=0.05, random_state=42)
def format_sample(row):
    return f'''### Question:
{row.get('prompt_text', '')}
### Answer:
{row.get('teacher_output', '')}'''
train_df['text'] = train_df.apply(format_sample, axis=1)
val_df['text'] = val_df.apply(format_sample, axis=1)
Path('data/processed/distillation').mkdir(parents=True, exist_ok=True)
train_df.to_parquet('data/processed/distillation/sft_train.parquet', index=False)
val_df.to_parquet('data/processed/distillation/sft_val.parquet', index=False)
print(f'Train: {len(train_df):,} | Val: {len(val_df):,}')
"
# Launch training
./scripts/tmux_training.sh
```
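Before launching training, it's worth sanity-checking the SFT files the script above just wrote. A minimal sketch (it relies only on the `text` column created by `format_sample` above):

```bash
python3 -c "
import pandas as pd
# Load the train/val splits written by the preparation script above
train = pd.read_parquet('data/processed/distillation/sft_train.parquet')
val = pd.read_parquet('data/processed/distillation/sft_val.parquet')
print(f'Train rows: {len(train):,} | Val rows: {len(val):,}')
# Show the first formatted sample to eyeball the Question/Answer template
print(train['text'].iloc[0][:500])
"
```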
---
## 🚀 Recommended Workflow
### Step 1: Test Mini Batch (1 minute)
```bash
export OPENAI_API_KEY="sk-..."
python3 scripts/run_batch_generation.py \
--config batches/teacher_gen_v1.yaml \
--input data/raw/scientific_corpus_325M.jsonl \
--num-batches 1 \
--batch-size 10
```
**Check:** Does it generate 10 samples without errors?
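A quick way to confirm: the output directory should contain a batch parquet, and the manifest should have entries for the run. This sketch assumes the manifest logs one JSON line per generated sample; verify against your run:

```bash
ls -lh data/processed/distillation/generated_batch_0000.parquet
# Expect ~10 lines if the manifest logs one entry per sample (assumption)
wc -l data/processed/distillation/generation_manifest.jsonl
```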
### Step 2: QC Batch (10 minutes)
```bash
./scripts/tmux_data_gen.sh 1 100
```
**Check:** Are judge scores reasonable? Any JSON errors?
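A sketch for eyeballing the scores from the QC parquet. The column name `judge_score` is a guess here, not confirmed by the pipeline docs; print the schema first and adjust if your run uses a different name:

```bash
python3 -c "
import pandas as pd
df = pd.read_parquet('data/processed/distillation/generated_batch_0000.parquet')
print(df.columns.tolist())  # confirm the actual column names first
if 'judge_score' in df.columns:  # hypothetical column name
    print(df['judge_score'].describe())
"
```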
### Step 3: Scale Up
Once QC passes, scale to your desired size (a cost/time projection sketch follows the list):
- 1000 samples: `./scripts/tmux_data_gen.sh 1 1000` (~1-2 hours, ~$7)
- 10000 samples: `./scripts/tmux_data_gen.sh 1 10000` (~10-12 hours, ~$70)
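These figures assume roughly linear scaling from the observed rate of ~$7 and ~1-2 hours per 1,000 samples. A throwaway sketch to project other sizes under that assumption:

```bash
python3 -c "
# Linear projection from ~\$7 / ~1.5 h per 1,000 samples (assumption)
cost_per_sample = 7 / 1000
hours_per_sample = 1.5 / 1000
for n in (1_000, 5_000, 10_000):
    print(f'{n:>6,} samples: ~\${n * cost_per_sample:,.0f}, ~{n * hours_per_sample:.1f} h')
"
```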
### Step 4: Filter & Train
Follow Option 3 commands above.
---
## 💡 Important Notes
### About Sample Data
The current data file contains 1,000 samples that cycle through 10 scientific questions. This is **perfect for testing**, but you'll want to replace it with real data for production (see the schema check below).
**For production:**
- Replace `data/raw/scientific_corpus_325M.jsonl` with your real corpus
- Or point `batches/teacher_gen_v1.yaml` to your data file
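Before swapping in a real corpus, confirm your file matches the JSONL layout the pipeline expects. Pretty-printing the first record of the sample corpus shows the reference schema:

```bash
# Inspect the first record of the sample corpus to see the expected fields
head -n 1 data/raw/scientific_corpus_325M.jsonl | python3 -m json.tool
```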
### About Costs
Current setup with 1000 sample test corpus:
- Generation: ~$7 for all 1000
- Training: ~$9 for 3 hours on 2×A100
- **Total: ~$16 for complete test**
For real 100k corpus:
- Generation: ~$60-75
- Training: ~$9
- **Total: ~$70-85**
### About Training
The training config assumes you have access to 2×A100 GPUs (a quick GPU-visibility check follows this list). If not:
- Use single GPU: `./scripts/tmux_training.sh nexa_train/configs/baseline_qlora.yaml 1`
- Or run on cloud: Prime Intellect, RunPod, LambdaLabs
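A quick check that the expected number of GPUs is actually visible before launching:

```bash
# List visible GPUs; expect two entries for the default 2xA100 config
nvidia-smi -L
# Or, via PyTorch
python3 -c "import torch; print(torch.cuda.device_count(), 'CUDA device(s) visible')"
```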
---
## 📊 Monitoring
```bash
# List tmux sessions
tmux ls
# Attach to generation
tmux attach-session -t data_gen
# Attach to training
tmux attach-session -t training
# Detach (while attached)
# Ctrl+B then D
# Check logs
tail -f logs/data_gen/run_*.log
tail -f logs/training/run_*.log
# View generation manifest
tail -f data/processed/distillation/generation_manifest.jsonl
# GPU monitoring (if available)
watch -n 1 nvidia-smi
```
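For a rough progress estimate during generation, compare the manifest line count to your target sample count. This assumes one manifest entry per generated sample; adjust if your manifest logs batches instead:

```bash
# Rough progress: manifest lines vs. target (assumes one line per sample)
echo "$(wc -l < data/processed/distillation/generation_manifest.jsonl) / 1000 samples"
```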
---
## ✅ Pre-flight Checklist
- [x] Data directory structure created
- [x] Sample corpus generated (1000 samples)
- [x] All scripts executable
- [x] Configuration files ready
- [ ] **OpenAI API key set** ← YOU NEED TO DO THIS
- [ ] Python dependencies installed ← check with `./scripts/validate_environment.sh`
---
## 🚨 Before You Start
### Required
```bash
# Set API key (required)
export OPENAI_API_KEY="sk-..."
# Verify
echo $OPENAI_API_KEY
```
### Optional but Recommended
```bash
# Install dependencies (if not already)
pip install pandas torch transformers peft bitsandbytes accelerate tqdm pyyaml openai scikit-learn
# Validate environment
./scripts/validate_environment.sh
```
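If `./scripts/validate_environment.sh` isn't available, a bare-bones import check covers the same packages (note that `pyyaml` imports as `yaml` and `scikit-learn` as `sklearn`):

```bash
python3 -c "
import pandas, torch, transformers, peft, bitsandbytes, accelerate, tqdm, yaml, openai, sklearn
print('All imports OK')
"
```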
---
## 🎯 Your First Command
**Start with the mini test:**
```bash
cd /Users/allanmurimiwandia/.cursor/worktrees/Nexa_compute/ifMzH
export OPENAI_API_KEY="your_key_here"
python3 scripts/run_batch_generation.py \
--config batches/teacher_gen_v1.yaml \
--input data/raw/scientific_corpus_325M.jsonl \
--num-batches 1 \
--batch-size 10
```