NexaCompute Project Documentation

This directory contains comprehensive documentation for the NexaCompute platform.

Documentation Index

Core Documentation (4 Essential Docs)

  1. QUICK_START.md - Start here for essential getting started information
  2. ARCHITECTURE.md - Complete system architecture and design principles
  3. RUNBOOK.md - Operations guide with manifest schema and workflows
  4. POLICY.md - Consolidated policies (storage, safety, cost)

Reference Documentation

  1. DISTILLATION.md - Complete distillation guide (data formats, evaluation, compute planning, how to run)

Supporting Documentation

  • Nexa_compute_roadmap.md - Project roadmap and future plans

Documentation Structure

All documentation follows a consistent structure:
  • Purpose and scope clearly defined
  • Code examples and schemas where applicable
  • Cross-references to related documents
  • Version information for major changes

Where to look by topic:
  • Getting Started: See RUNBOOK.md for operational procedures
  • Data Pipeline: See DATA_FORMAT.md and the nexa_data/ module
  • Training: See the nexa_train/ module and Nexa_distill.md
  • Evaluation: See EVAL_FRAMEWORK.md and eval-and-benchmarking.md
  • Infrastructure: See the nexa_infra/ module

NexaCompute

ML Lab in a Box – A self-contained, production-grade machine learning research and development platform for rapid experimentation, model training, and knowledge distillation.

Overview

NexaCompute is a complete machine learning platform that packages everything you need to run sophisticated ML workflows, from data preparation to model deployment, in a single, reproducible system. It is designed for researchers and practitioners who need to iterate quickly on ephemeral GPU infrastructure while maintaining rigorous reproducibility and cost awareness.

Core Philosophy: Everything runs on disposable compute, with durable results and complete lineage tracking. Each experiment is fully reproducible, cost-tracked, and automatically documented.

What Makes It Different

  • Complete ML Pipeline: Data preparation, training, distillation, evaluation, and feedback loops in one platform
  • Infrastructure-Agnostic: Works seamlessly across GPU providers (Lambda Labs, CoreWeave, RunPod, AWS, etc.)
  • Reproducible by Design: Every run generates manifests with complete provenance (a sketch follows this list)
  • Cost-Aware: Built-in cost tracking and optimization
  • Production-Ready: Battle-tested infrastructure and operational best practices
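
To make the "Reproducible by Design" point concrete, here is a minimal sketch of writing a run manifest; the helper name, field names, and output path are illustrative assumptions, not NexaCompute's actual manifest schema (RUNBOOK.md documents the real one):

import json
import subprocess
import time
from pathlib import Path

def write_run_manifest(run_id: str, config_path: str, out_dir: str = "data/processed") -> Path:
    """Record enough provenance to reproduce a run (illustrative fields only)."""
    manifest = {
        "run_id": run_id,
        "config": config_path,
        "git_commit": subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True
        ).stdout.strip(),
        "started_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    path = Path(out_dir) / run_id / "manifest.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(manifest, indent=2))
    return path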

Key Features

🔬 Knowledge Distillation Pipeline

Transform raw data into high-quality training datasets via teacher-student distillation:
  • Automated teacher completion collection
  • Quality filtering and human-in-the-loop inspection
  • SFT-ready dataset packaging
  • Complete workflow from prompts to trained models

📊 Data Management

  • Organized storage hierarchy (data/raw/ → data/processed/)
  • Query interface for reliable data access
  • Dataset versioning and manifest tracking
  • Support for JSONL, Parquet, and compressed formats (loading example after this list)
  • Automated feedback loops for data improvement
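
As a quick illustration of the supported formats, JSONL and Parquet both load into the same DataFrame interface with pandas; the file paths below are hypothetical examples under the data/ hierarchy described later:

import pandas as pd

# JSONL: one JSON record per line
raw_df = pd.read_json("data/raw/example.jsonl", lines=True)

# Parquet: columnar and compressed (requires pyarrow or fastparquet)
processed_df = pd.read_parquet("data/processed/distillation/example.parquet")

# Gzip-compressed JSONL is decompressed transparently by pandas
compressed_df = pd.read_json("data/raw/example.jsonl.gz", lines=True)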

🚀 Training & Evaluation

  • Distributed training with DDP support (see the sketch after this list)
  • HuggingFace integration
  • Automatic checkpointing and resume
  • Evaluation with LLM-as-judge and rubric-based scoring
  • Real-time monitoring and telemetry
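
For orientation, this is the standard PyTorch pattern behind DDP support, sketched independently of NexaCompute's actual trainer code; the setup_ddp helper is hypothetical and assumes a torchrun launch:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model: torch.nn.Module) -> DDP:
    """Join the process group and wrap the model; torchrun sets LOCAL_RANK."""
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return DDP(model.cuda(local_rank), device_ids=[local_rank])

# Example launch: torchrun --nproc_per_node=8 train.py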

📈 Visualization & Dashboards

  • Streamlit-based UI for data exploration
  • Evaluation leaderboards
  • Training statistics visualization
  • Distillation data inspection

🔧 Infrastructure Orchestration

  • One-command cluster provisioning
  • Automated job launching and management
  • Cost tracking and reporting
  • Multi-provider support (Lambda, CoreWeave, AWS, etc.)

🧬 Lifecycle Coverage

  • Pre-training (roadmap) – large-scale corpus preparation and tokenizer support
  • Fine-tuning & SFT – supervised instruction tuning with project-scoped datasets
  • RL / RLHF (roadmap) – reward modelling and policy optimisation pipelines
  • Mid-training Telemetry – checkpointing, logging, and interactive dashboards
  • Post-training & Serving – evaluation, guardrails, and deployment controllers
  • Data Management – curated, versioned datasets with manifests and provenance

Architecture

NexaCompute is organized into focused modules, each serving a specific purpose in the ML pipeline:
nexa_compute/
├── projects/        # Project-scoped assets (configs, docs, manifests, pipelines)
├── nexa_data/       # Data preparation, analysis, and feedback
├── nexa_distill/    # Knowledge distillation pipeline
├── nexa_train/      # Model training and fine-tuning
├── nexa_eval/       # Evaluation and benchmarking
├── nexa_ui/         # Visualization and dashboards
└── nexa_infra/      # Infrastructure and orchestration
Each module is self-contained with clear boundaries, communicating via versioned data artifacts rather than direct imports. This design ensures maintainability, testability, and extensibility. See Architecture Documentation for complete details.
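
A minimal sketch of that artifact-based handoff, with hypothetical paths and helper names (NexaCompute's own registry lives at nexa_data/manifest/dataset_registry.yaml): the producer writes a versioned file plus a small manifest, and the consumer loads the file instead of importing producer code.

import json
import pandas as pd
from pathlib import Path

ARTIFACT_DIR = Path("data/processed/distillation")  # hypothetical location

def publish(df: pd.DataFrame, name: str, version: str) -> Path:
    """Producer side: write a versioned Parquet artifact plus a tiny manifest."""
    path = ARTIFACT_DIR / f"{name}_{version}.parquet"
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(path)
    path.with_suffix(".json").write_text(
        json.dumps({"name": name, "version": version, "rows": len(df)})
    )
    return path

def load(name: str, version: str) -> pd.DataFrame:
    """Consumer side: load by name and version, never by importing the producer."""
    return pd.read_parquet(ARTIFACT_DIR / f"{name}_{version}.parquet")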

Project Organization

  • Project assets live under projects/{project_slug}/
  • Guardrails and conventions documented in docs/conventions/
  • Active projects catalogued in docs/projects/README.md

Quick Start

1. Configure API Keys

cp .env.example .env
# Edit .env with your API keys (OpenAI, HuggingFace, W&B, etc.)

2. Bootstrap GPU Node

# On your GPU cluster (Prime Intellect, Lambda, etc.)
export TAILSCALE_AUTH_KEY="your-key"  # Optional
export SSH_PUBLIC_KEY="ssh-ed25519 ..."  # Your SSH key

# Run bootstrap
bash nexa_infra/Boostrap.sh

3. Deploy Code

# From local machine
rsync -avz --exclude='.git' . user@gpu-node:/workspace/nexa_compute/
scp .env user@gpu-node:/workspace/nexa_compute/.env

4. Run Complete Pipeline

# SSH to node
ssh user@gpu-node
cd /workspace/nexa_compute

# Install dependencies
pip install -r requirements.txt

# Run training
python orchestrate.py launch --config nexa_train/configs/baseline.yaml

See SETUP.md for the complete turn-key setup guide.

Local Development

# Clone repository
git clone <repository-url>
cd Nexa_compute

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Configure API keys
cp .env.example .env
# Edit .env with your keys

Basic Workflow

1. Prepare Data

python orchestrate.py prepare_data --config nexa_train/configs/baseline.yaml

2. Run Knowledge Distillation

# Generate teacher inputs
jupyter notebook nexa_data/data_analysis/distill_data_overview.ipynb

# Collect teacher completions
python -m nexa_distill.collect_teacher \
  --src data/processed/scientific_assistant/distillation/teacher_inputs/teacher_inputs_v1.parquet \
  --teacher openrouter:gpt-4o

# Filter and package
python -m nexa_distill.filter_pairs
python -m nexa_distill.to_sft

3. Train Model

python orchestrate.py launch --config nexa_train/configs/baseline.yaml

4. Evaluate

python orchestrate.py evaluate --checkpoint <path>

5. Visualize Results

python orchestrate.py leaderboard  # Launch Streamlit dashboard

6. Serve Inference

python orchestrate.py inference   # Start inference server

Complete Distillation Pipeline

For a complete example, see the Distillation Guide.

Core Modules

nexa_data/ – Data Pipeline

Data preparation, analysis, and automated feedback loops.
  • Data Analysis: Jupyter notebooks and query utilities (nexa_data/data_analysis/)
  • Feedback Loop: Improve data based on evaluation weaknesses (nexa_data/feedback/)
  • Data Loaders: PyTorch DataLoader integrations
  • Dataset Registry: Versioned dataset management

nexa_distill/ – Knowledge Distillation

Transform raw data into high-quality training datasets.
  • Teacher completion collection
  • Quality filtering and inspection
  • SFT dataset packaging
  • Human-in-the-loop review interface

nexa_train/ – Model Training

Training and fine-tuning with distributed support.
  • HuggingFace and custom training backends
  • Distributed training (DDP)
  • Automatic checkpointing (resume sketch after this list)
  • Hyperparameter sweeps
  • W&B and MLflow integration
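
As an illustration of the checkpoint-and-resume pattern (a sketch, not nexa_train's actual internals), the key is persisting optimizer state and progress alongside the weights:

import torch

def save_checkpoint(model, optimizer, step: int, path: str = "checkpoint.pt") -> None:
    """Persist everything needed to resume: weights, optimizer state, progress."""
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": step},
        path,
    )

def load_checkpoint(model, optimizer, path: str = "checkpoint.pt") -> int:
    """Restore state in place and return the step to resume from."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]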

nexa_eval/ – Evaluation

Comprehensive evaluation and benchmarking.
  • LLM-as-judge evaluation
  • Rubric-based scoring (see the sketch after this list)
  • Metric aggregation
  • Leaderboard generation
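
A hedged sketch of the rubric-based judging pattern; the rubric dimensions, prompt wording, and the judge_fn callable are illustrative assumptions rather than nexa_eval/judge.py's real interface:

import json
from typing import Callable

RUBRIC = {  # illustrative dimensions, each scored 1-5
    "correctness": "Is the answer factually accurate?",
    "completeness": "Does it address all parts of the question?",
    "clarity": "Is it well organized and easy to follow?",
}

def judge_response(question: str, answer: str, judge_fn: Callable[[str], str]) -> dict:
    """Ask a judge model to score an answer against the rubric; judge_fn wraps the LLM call."""
    prompt = (
        "Score the answer on each rubric dimension from 1 to 5. Respond with JSON only.\n"
        f"Rubric: {json.dumps(RUBRIC)}\nQuestion: {question}\nAnswer: {answer}"
    )
    scores = json.loads(judge_fn(prompt))  # assumes the judge returns one key per dimension
    scores["mean"] = sum(scores[k] for k in RUBRIC) / len(RUBRIC)
    return scores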

nexa_ui/ – Visualization

Streamlit dashboards for data and metrics.
  • Evaluation leaderboards
  • Distillation data visualization
  • Training statistics
  • Reads from organized data/processed/ structure

nexa_inference/ – Model Serving

Production-ready inference server for trained models.
  • FastAPI-based inference server (illustrative sketch after this list)
  • REST API for model predictions
  • Health checks and model info endpoints
  • Docker-ready deployment
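
A self-contained sketch of what such a server can look like; the /infer route matches the curl example later in this README, while the /health route name and the handler body are placeholders, not nexa_inference's real code:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class InferRequest(BaseModel):
    prompt: str
    max_tokens: int = 512

@app.get("/health")
def health() -> dict:
    return {"status": "ok"}

@app.post("/infer")
def infer(req: InferRequest) -> dict:
    # Placeholder: a real server would run the loaded checkpoint here
    return {"completion": f"(echo) {req.prompt[:50]}", "max_tokens": req.max_tokens}

# Example launch: uvicorn server:app --port 8000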

nexa_infra/ – Infrastructure

Cluster provisioning, job management, and orchestration.
  • Multi-provider cluster provisioning
  • Automated job launching
  • Cost tracking
  • Code synchronization
  • One-command bootstrap script

Data Organization

All data follows a clean, organized structure:
data/
├── raw/              # Raw input data (JSON, JSONL, Parquet)
└── processed/        # Organized outputs by purpose
    ├── distillation/ # Teacher inputs, outputs, filtered data, SFT datasets
    ├── training/     # Training splits and pretrain data
    ├── evaluation/   # Predictions, metrics, reports, feedback
    └── raw_summary/  # Analysis summaries
Query Interface:
from nexa_data.data_analysis.query_data import DataQuery

query = DataQuery()
teacher_df = query.get_teacher_inputs(version="v1")    # load teacher-input table
pretrain_df = query.get_pretrain_dataset(shard="001")  # load one pretrain shard

Turn-Key Solution

NexaCompute is designed as a complete turn-key solution:
  • Bring Your Own Compute: Works with any GPU provider (Prime Intellect, Lambda Labs, CoreWeave, AWS, etc.)
  • One-Command Bootstrap: bash nexa_infra/Boostrap.sh sets up the entire environment
  • API Key Management: Configure once via .env, use everywhere
  • Reproducible Docker: Consistent environments across all deployments
  • Complete Pipeline: Data → Training → Evaluation → Inference
  • Production Ready: Inference server included for model deployment
See SETUP.md for complete turn-key setup guide.

Documentation

Comprehensive documentation is available in docs/Overview_of_Project/; see the Documentation Index at the top of this document.

Requirements

  • Python: 3.11+
  • PyTorch: 2.1.0+
  • GPU: NVIDIA GPU with CUDA support (recommended)
  • Dependencies: See requirements.txt

Optional

  • Jupyter: For data analysis notebooks (pip install jupyter)
  • AWS CLI: For S3 storage syncing
  • Docker: For containerized deployment
  • Streamlit: Already included in requirements for UI dashboards
  • W&B Account: For experiment tracking (configure via API key)

Usage Examples

Complete Distillation Workflow

# 1. Generate teacher inputs from enhanced prompts
jupyter notebook nexa_data/data_analysis/distill_data_overview.ipynb

# 2. Collect teacher completions
python -m nexa_distill.collect_teacher \
  --src data/processed/scientific_assistant/distillation/teacher_inputs/teacher_inputs_v1.parquet \
  --teacher openrouter:gpt-4o \
  --max-samples 6000

# 3. Filter and package
python -m nexa_distill.filter_pairs
python -m nexa_distill.to_sft

# 4. Train student model
python -m nexa_train.distill \
  --dataset data/processed/scientific_assistant/distillation/sft_datasets/sft_scientific_v1.jsonl

# 5. Evaluate
python orchestrate.py evaluate --checkpoint <path>

# 6. View results
python orchestrate.py leaderboard

Infrastructure Provisioning

# Provision cluster (Prime Intellect, Lambda, etc.)
python orchestrate.py provision --bootstrap

# Sync code to cluster
python orchestrate.py sync user@gpu-node:/workspace/nexa_compute

# Launch training job
python orchestrate.py launch --config nexa_train/configs/baseline.yaml

# Teardown cluster
python orchestrate.py teardown

Model Inference

# Start inference server
python orchestrate.py inference \
  --checkpoint data/processed/training/checkpoints/latest/final.pt \
  --port 8000

# Or via Docker
docker-compose -f docker/docker-compose.yaml --profile inference up

# Test inference
curl -X POST http://localhost:8000/infer \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Your input here", "max_tokens": 512}'
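
The same request from Python, assuming only the requests library and the /infer endpoint shown above:

import requests

resp = requests.post(
    "http://localhost:8000/infer",
    json={"prompt": "Your input here", "max_tokens": 512},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # the server's completion payload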

Data Analysis

from nexa_data.data_analysis.query_data import DataQuery

# Query processed datasets
query = DataQuery()

# Load teacher inputs
teacher_df = query.get_teacher_inputs(version="v1")

# List available datasets
datasets = query.list_available_datasets()

Project Structure

Nexa_compute/
├── nexa_data/          # Data pipeline
├── nexa_distill/       # Knowledge distillation
├── nexa_train/         # Model training
├── nexa_eval/          # Evaluation
├── nexa_ui/            # Visualization
├── nexa_infra/         # Infrastructure
├── data/               # Data storage (raw + processed)
├── docs/               # Documentation
├── scripts/            # Utility scripts
├── orchestrate.py      # Unified CLI
└── pyproject.toml      # Project configuration

Contributing

NexaCompute follows a modular architecture where each module is self-contained. To extend:
  1. Register new datasets: Add to nexa_data/manifest/dataset_registry.yaml
  2. Register new models: Use nexa_train/models/registry.py
  3. Add evaluation metrics: Extend nexa_eval/judge.py
  4. Custom training backends: Implement in nexa_train/backends/
See Architecture Documentation for extensibility patterns.

License

[Specify license]

Support

For questions, issues, or contributions, please open an issue or pull request.

NexaCompute – ML Lab in a Box. Everything you need to run sophisticated ML workflows, from data to deployment.