# Operations Runbook
Complete guide for operating the NexaCompute platform.

## Environment Provisioning
- Install Python 3.11+.
- Copy `.env` to `.env.local` and adjust endpoints or credentials as needed.
- Create a virtualenv (`python -m venv .venv && source .venv/bin/activate`).
- Install deps (`pip install -r requirements.txt` or `uv pip install -r requirements.txt`).
- (Optional) Build a Docker image for reproducible runs.
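Taken together, first-time setup looks like this (same commands as the checklist above):

```bash
# One-time environment setup, per the checklist above
cp .env .env.local            # then edit endpoints/credentials
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```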
## Data Lifecycle
- Raw Data: Place raw datasets under `data/raw/` or sync from remote storage (`scripts/sync_s3.py`).
- Processed Data: Preprocessing outputs live in `data/processed/`, organized by purpose (distillation, training, evaluation).
- Metadata: Dataset metadata dumps to manifests in `data/processed/{category}/manifests/`.
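The resulting layout, reconstructed from the paths above (`{category}` is one of distillation, training, or evaluation):

```
data/
├── raw/                   # raw datasets (scripts/sync_s3.py syncs here)
└── processed/
    └── {category}/        # distillation, training, or evaluation outputs
        └── manifests/     # dataset metadata dumps
```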
## Querying Data

Use the query utility for reliable data access:
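A hypothetical invocation is sketched below; the module path `nexa_data.query` and its flags are assumptions, so check the repository for the utility's actual interface:

```bash
# Hypothetical query-utility call (module path and flags are assumptions):
# list a sample of records from the processed training data.
python -m nexa_data.query --root data/processed/training --limit 10
```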
## Running Training
### Basic Training
- Override config values inline (e.g., `--override training.optimizer.lr=0.0001`).
- Run distributed via `scripts/launch_ddp.sh` or by setting `training.distributed.world_size > 1` (the pipeline auto-launches DDP workers). See the example below.
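A minimal sketch combining these options; the entry point and `--override` flag appear elsewhere in this runbook, while the launcher's argument list is an assumption:

```bash
# Single-process training with an inline LR override
python -m nexa_train.train --config configs/baseline.yaml \
  --override training.optimizer.lr=0.0001

# Distributed run via the launcher (argument list is an assumption;
# inspect scripts/launch_ddp.sh for the actual usage)
bash scripts/launch_ddp.sh configs/baseline.yaml
```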
### Distillation Training
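The student-training entry point is listed in the Distillation Pipeline below; a minimal sketch, assuming it accepts `--config` like the other trainers (the config path is a placeholder):

```bash
# Train a student model against collected teacher outputs
# (config path is a placeholder; --config is assumed)
python -m nexa_train.distill --config configs/distill.yaml
```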
## Evaluating Models
- Metrics saved to `artifacts/runs/{run_id}/metrics.json`.
- Predictions saved when `evaluation.save_predictions` is true.
- Evaluation outputs organized under `data/processed/evaluation/`.
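A typical invocation, using the entry point from Common Workflows (`{checkpoint}` is a placeholder for your checkpoint path):

```bash
# Evaluate a checkpoint; metrics land in artifacts/runs/{run_id}/metrics.json
python -m nexa_eval.judge --checkpoint {checkpoint}
```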
## Hyperparameter Search
- Uses random search around optimizer LR and dropout.
- Extend the strategy in `scripts/hyperparameter_search.py` (see the sketch below).
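A hypothetical invocation; the script path is from this runbook, but the flags are assumptions, so consult the script for its real interface:

```bash
# Random search over optimizer LR and dropout (flags are illustrative)
python scripts/hyperparameter_search.py --config configs/baseline.yaml --trials 20
```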
## Packaging
- Produces a tarball with model weights, a config snapshot, and metrics (see the command below).
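The packaging command from Common Workflows (`{checkpoint}` is a placeholder):

```bash
# Bundle model weights, a config snapshot, and metrics into a tarball
python scripts/cli.py package --checkpoint {checkpoint}
```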
## Monitoring

- TensorBoard: `tensorboard --logdir logs/`
- MLflow: Set `MLFLOW_TRACKING_URI` (or configure via `.env`); the pipeline logs params/metrics automatically when enabled.
- W&B: Configure `WANDB_API_KEY` in `.env` for automatic logging.
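Example `.env.local` entries for the trackers above (values are placeholders; `MLFLOW_TRACKING_URI` and `WANDB_API_KEY` are the standard MLflow and W&B variables):

```bash
# .env.local — experiment tracking (placeholder values)
MLFLOW_TRACKING_URI=http://localhost:5000
WANDB_API_KEY=your-wandb-api-key
```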
## Run Manifests

Each training run records a manifest with complete metadata for reproducibility and lineage tracking.

### Manifest Schema

Manifests are stored at `runs/manifests/run_manifest.json` or `data/processed/{category}/manifests/` with the following structure:
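An illustrative manifest; all values are placeholders, and the fields match the list below:

```json
{
  "run_id": "baseline_20250101T120000",
  "config": "configs/baseline.yaml",
  "run_dir": "artifacts/runs/baseline_20250101T120000",
  "metrics": {"accuracy": 0.91, "loss": 0.34},
  "checkpoint": "artifacts/runs/baseline_20250101T120000/checkpoint.pt",
  "created_at": "2025-01-01T12:00:00Z",
  "version": "v1",
  "dataset": "training_corpus@v1",
  "model": "baseline"
}
```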
### Manifest Contents
- `run_id`: Unique identifier for the run (format: `{name}_{timestamp}`).
- `config`: Path to the configuration file used.
- `run_dir`: Directory containing run artifacts.
- `metrics`: Final evaluation metrics (accuracy, loss, etc.).
- `checkpoint`: Path to the final model checkpoint.
- `created_at`: ISO 8601 timestamp of run creation.
- `version`: Dataset or model version used.
- `dataset`: Dataset identifier and version.
- `model`: Model architecture identifier.
### Using Manifests

Manifests enable:

- Reproducibility: Recreate exact training conditions.
- Lineage Tracking: Trace model outputs back to source data.
- Cost Tracking: Link runs to compute costs.
- Leaderboard: Aggregate metrics across runs.
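For example, reproducibility can be driven straight from a manifest (a sketch assuming `jq` is installed; the `config` field comes from the schema above):

```bash
# Re-run training with the exact config recorded in a manifest
CONFIG=$(jq -r '.config' runs/manifests/run_manifest.json)
python -m nexa_train.train --config "$CONFIG"
```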
## Troubleshooting

- Logs: Check logs under `logs/` and `artifacts/runs/{run_id}/`.
- GPU Visibility: Verify GPUs are visible (`nvidia-smi`).
- Determinism: For nondeterminism issues, set `training.distributed.seed`.
- Storage: Verify storage paths are correctly mounted (see `POLICY.md`).
- Distributed Issues: Check node health before joining distributed runs.
## Common Workflows

### Full Training Pipeline

- Prepare data: `python -m nexa_data.prepare --config configs/data.yaml`
- Train model: `python -m nexa_train.train --config configs/baseline.yaml`
- Evaluate: `python -m nexa_eval.judge --checkpoint {checkpoint}`
- Package: `python scripts/cli.py package --checkpoint {checkpoint}`
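Chained together as a script (commands copied from the steps above; `{checkpoint}` is a placeholder for the checkpoint produced by training):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Full pipeline: prepare -> train -> evaluate -> package
python -m nexa_data.prepare --config configs/data.yaml
python -m nexa_train.train --config configs/baseline.yaml
python -m nexa_eval.judge --checkpoint {checkpoint}
python scripts/cli.py package --checkpoint {checkpoint}
```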
### Distillation Pipeline

- Generate teacher inputs: Run `nexa_data/data_analysis/distill_data_overview.ipynb`
- Collect teacher outputs: `python -m nexa_distill.collect_teacher`
- Filter and package: `python -m nexa_distill.filter_pairs`
- Train student: `python -m nexa_train.distill`