# NexaCompute Project Documentation

This directory contains comprehensive documentation for the NexaCompute platform.

## Documentation Index

### Core Documentation (4 Essential Docs)

- `QUICK_START.md` - Start here for essential getting-started information
- `ARCHITECTURE.md` - Complete system architecture and design principles
- `RUNBOOK.md` - Operations guide with manifest schema and workflows
- `POLICY.md` - Consolidated policies (storage, safety, cost)
### Reference Documentation

- `DISTILLATION.md` - Complete distillation guide (data formats, evaluation, compute planning, how to run)

### Supporting Documentation

- `Nexa_compute_roadmap.md` - Project roadmap and future plans
## Documentation Structure

All documentation follows a consistent structure:

- Purpose and scope clearly defined
- Code examples and schemas where applicable
- Cross-references to related documents
- Version information for major changes
## Quick Links

- Getting Started: See `RUNBOOK.md` for operational procedures
- Data Pipeline: See `DATA_FORMAT.md` and the `nexa_data/` module
- Training: See the `nexa_train/` module and `Nexa_distill.md`
- Evaluation: See `EVAL_FRAMEWORK.md` and `eval-and-benchmarking.md`
- Infrastructure: See the `nexa_infra/` module
# NexaCompute

**ML Lab in a Box** - A self-contained, production-grade machine learning research and development platform for rapid experimentation, model training, and knowledge distillation.

## Overview

NexaCompute is a complete machine learning platform that packages everything you need to run sophisticated ML workflows, from data preparation to model deployment, in a single, reproducible system. It is designed for researchers and practitioners who need to iterate quickly on ephemeral GPU infrastructure while maintaining rigorous reproducibility and cost awareness.

**Core Philosophy:** Everything runs on disposable compute, with durable results and complete lineage tracking. Each experiment is fully reproducible, cost-tracked, and automatically documented.

### What Makes It Different
- Complete ML Pipeline: Data preparation, training, distillation, evaluation, and feedback loops in one platform
- Infrastructure-Agnostic: Works seamlessly across GPU providers (Lambda Labs, CoreWeave, RunPod, AWS, etc.)
- Reproducible by Design: Every run generates manifests with complete provenance
- Cost-Aware: Built-in cost tracking and optimization
- Production-Ready: Battle-tested infrastructure and operational best practices
## Key Features

### Knowledge Distillation Pipeline

Transform raw data into high-quality training datasets via teacher-student distillation:

- Automated teacher completion collection
- Quality filtering and human-in-the-loop inspection
- SFT-ready dataset packaging
- Complete workflow from prompts to trained models
### Data Management

- Organized storage hierarchy (`data/raw/` → `data/processed/`)
- Query interface for reliable data access
- Dataset versioning and manifest tracking
- Support for JSONL, Parquet, and compressed formats
- Automated feedback loops for data improvement
### Training & Evaluation
- Distributed training with DDP support
- HuggingFace integration
- Automatic checkpointing and resume
- Evaluation with LLM-as-judge and rubric-based scoring
- Real-time monitoring and telemetry
### Visualization & Dashboards
- Streamlit-based UI for data exploration
- Evaluation leaderboards
- Training statistics visualization
- Distillation data inspection
### Infrastructure Orchestration
- One-command cluster provisioning
- Automated job launching and management
- Cost tracking and reporting
- Multi-provider support (Lambda, CoreWeave, AWS, etc.)
### Lifecycle Coverage

- Pre-training (roadmap) - large-scale corpus preparation and tokenizer support
- Fine-tuning & SFT - supervised instruction tuning with project-scoped datasets
- RL / RLHF (roadmap) - reward modelling and policy optimisation pipelines
- Mid-training Telemetry - checkpointing, logging, and interactive dashboards
- Post-training & Serving - evaluation, guardrails, and deployment controllers
- Data Management - curated, versioned datasets with manifests and provenance
## Architecture

NexaCompute is organized into seven distinct modules, each serving a specific purpose in the ML pipeline.

### Project Organization

- Project assets live under `projects/{project_slug}/`
- Guardrails and conventions documented in `docs/conventions/`
- Active projects catalogued in `docs/projects/README.md`
## Quick Start

### Turn-Key Setup (Recommended)

1. Configure API keys

### Local Development

### Basic Workflow

1. Prepare data

### Complete Distillation Pipeline

For a complete example, see the Distillation Guide.

## Core Modules
### `nexa_data/` - Data Pipeline

Data preparation, analysis, and automated feedback loops.

- Data Analysis: Jupyter notebooks and query utilities (`nexa_data/data_analysis/`)
- Feedback Loop: Improve data based on evaluation weaknesses (`nexa_data/feedback/`)
- Data Loaders: PyTorch DataLoader integrations
- Dataset Registry: Versioned dataset management
### `nexa_distill/` - Knowledge Distillation
Transform raw data into high-quality training datasets.
- Teacher completion collection
- Quality filtering and inspection
- SFT dataset packaging
- Human-in-the-loop review interface
### `nexa_train/` - Model Training
Training and fine-tuning with distributed support.
- HuggingFace and custom training backends
- Distributed training (DDP)
- Automatic checkpointing
- Hyperparameter sweeps
- W&B and MLflow integration
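As a rough illustration of the checkpoint-and-resume behaviour listed above, here is a minimal sketch. The file layout, JSON format, and function names are all assumptions for illustration, not the actual `nexa_train` API; a real run would persist model and optimizer state with `torch.save` rather than JSON.

```python
import json
from pathlib import Path

def save_checkpoint(ckpt_dir: Path, step: int, state: dict) -> Path:
    """Write training state for `step` (a real run would use torch.save)."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    path = ckpt_dir / f"step_{step:08d}.json"
    path.write_text(json.dumps({"step": step, **state}))
    return path

def latest_checkpoint(ckpt_dir: Path):
    """Return the most recent checkpoint's state, or None on a fresh start."""
    candidates = sorted(ckpt_dir.glob("step_*.json"))
    if not candidates:
        return None
    return json.loads(candidates[-1].read_text())

def train(ckpt_dir: Path, total_steps: int, save_every: int = 100) -> int:
    """Run (or resume) a toy training loop; returns the step it started from."""
    resumed = latest_checkpoint(ckpt_dir)
    start = resumed["step"] + 1 if resumed else 0
    for step in range(start, total_steps):
        # ... forward/backward/optimizer step would go here ...
        if step % save_every == 0:
            save_checkpoint(ckpt_dir, step, {"loss": 0.0})
    return start
```

A restarted job simply calls `train` again with the same checkpoint directory and picks up after the last saved step.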
### `nexa_eval/` - Evaluation
Comprehensive evaluation and benchmarking.
- LLM-as-judge evaluation
- Rubric-based scoring
- Metric aggregation
- Leaderboard generation
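A hedged sketch of how rubric-based scoring and metric aggregation might fit together. The rubric dimensions, weights, and function names are illustrative assumptions, not the actual `nexa_eval` implementation; in the real pipeline an LLM judge would assign the per-dimension scores.

```python
# Hypothetical rubric: weights per scoring dimension, summing to 1.0.
RUBRIC = {"correctness": 0.5, "completeness": 0.3, "style": 0.2}

def score_response(judgments: dict) -> float:
    """Weighted average of per-dimension scores (each in [0, 1])."""
    return sum(RUBRIC[dim] * judgments[dim] for dim in RUBRIC)

def aggregate(per_example_scores: list) -> dict:
    """Aggregate example-level scores into leaderboard-style metrics."""
    n = len(per_example_scores)
    return {
        "mean": sum(per_example_scores) / n,
        "min": min(per_example_scores),
        "max": max(per_example_scores),
    }
```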
### `nexa_ui/` - Visualization
Streamlit dashboards for data and metrics.
- Evaluation leaderboards
- Distillation data visualization
- Training statistics
- Reads from the organized `data/processed/` structure
### `nexa_inference/` - Model Serving
Production-ready inference server for trained models.
- FastAPI-based inference server
- REST API for model predictions
- Health checks and model info endpoints
- Docker-ready deployment
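A minimal sketch of the two endpoints mentioned above (health check and prediction), written as plain request handlers so the web framework stays out of the picture. The real `nexa_inference` server is FastAPI-based, and its actual routes and payload shapes may differ; everything here is illustrative.

```python
import json

# Hypothetical model metadata served by an "info" endpoint.
MODEL_INFO = {"name": "demo-model", "version": "0.1"}

def handle_health() -> dict:
    """Liveness probe: always reports ok if the process is up."""
    return {"status": "ok"}

def handle_predict(body: str, model=lambda text: text.upper()) -> dict:
    """Parse a JSON request body and run the (stub) model on its `text` field."""
    payload = json.loads(body)
    if "text" not in payload:
        return {"error": "missing 'text' field"}
    return {"model": MODEL_INFO["name"], "output": model(payload["text"])}
```

In the FastAPI version, these handlers would be bound to routes such as a health-check GET and a prediction POST, with the stub `model` replaced by the loaded checkpoint.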
### `nexa_infra/` - Infrastructure
Cluster provisioning, job management, and orchestration.
- Multi-provider cluster provisioning
- Automated job launching
- Cost tracking
- Code synchronization
- One-command bootstrap script
## Data Organization

All data follows a clean, organized structure under `data/raw/` and `data/processed/`.

## Turn-Key Solution

NexaCompute is designed as a complete turn-key solution:

- Bring Your Own Compute: Works with any GPU provider (Prime Intellect, Lambda Labs, CoreWeave, AWS, etc.)
- One-Command Bootstrap: `bash nexa_infra/Boostrap.sh` sets up the entire environment
- API Key Management: Configure once via `.env`, use everywhere
.env, use everywhere - Reproducible Docker: Consistent environments across all deployments
- Complete Pipeline: Data β Training β Evaluation β Inference
- Production Ready: Inference server included for model deployment
## Documentation

Comprehensive documentation is available in `docs/Overview_of_Project/`:

- Setup Guide - Complete turn-key setup instructions
- Quick Start - Get started quickly
- Architecture - System design and principles
- Runbook - Operations guide
- Policy - Storage, safety, and cost policies
- Distillation Guide - Complete distillation workflow
- Docker Guide - Docker deployment instructions
## Requirements
- Python: 3.11+
- PyTorch: 2.1.0+
- GPU: NVIDIA GPU with CUDA support (recommended)
- Dependencies: See `requirements.txt`
### Optional

- Jupyter: For data analysis notebooks (`pip install jupyter`)
- AWS CLI: For S3 storage syncing
- Docker: For containerized deployment
- Streamlit: Already included in requirements for UI dashboards
- W&B Account: For experiment tracking (configure via API key)
## Usage Examples

### Complete Distillation Workflow
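As an illustrative stand-in, here is a sketch of the final packaging step of the distillation workflow: taking raw teacher completions, dropping low-quality ones, and emitting an SFT-ready JSONL file. The field names (`prompt`, `completion`) and the crude length-based quality gate are assumptions; the actual `nexa_distill` pipeline applies richer filters plus a human-in-the-loop review pass.

```python
import json
from pathlib import Path

def package_sft(teacher_rows: list, out_path: Path, min_len: int = 20) -> int:
    """Write accepted teacher completions as JSONL; return how many were kept."""
    kept = 0
    with out_path.open("w") as f:
        for row in teacher_rows:
            completion = row.get("completion", "").strip()
            if len(completion) < min_len:  # crude quality gate, illustrative only
                continue
            f.write(json.dumps({"prompt": row["prompt"],
                                "completion": completion}) + "\n")
            kept += 1
    return kept
```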
### Infrastructure Provisioning
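A hedged sketch of the cost-aware side of provisioning: given hourly GPU prices per provider, pick the cheapest option and estimate the job's cost before launching. The prices and provider keys below are made up for illustration, and the real `nexa_infra` provisioning CLI/API is not shown here.

```python
# Illustrative USD-per-GPU-hour prices; real prices vary by GPU type and region.
GPU_PRICES = {
    "lambda": 1.99,
    "coreweave": 2.20,
    "runpod": 1.74,
}

def plan_job(num_gpus: int, est_hours: float) -> dict:
    """Pick the cheapest provider and estimate total job cost."""
    provider = min(GPU_PRICES, key=GPU_PRICES.get)
    return {
        "provider": provider,
        "estimated_cost_usd": round(GPU_PRICES[provider] * num_gpus * est_hours, 2),
    }
```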
### Model Inference
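A sketch of calling the inference server's REST API from Python using only the standard library. The `/predict` path and the `{"text": ...}` payload shape are assumptions; check the running server's OpenAPI docs for the real contract.

```python
import json
import urllib.request

def build_request(base_url: str, text: str) -> urllib.request.Request:
    """Construct a JSON POST to a hypothetical /predict endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def predict(base_url: str, text: str) -> dict:
    """Send the request to a live server and decode the JSON response."""
    with urllib.request.urlopen(build_request(base_url, text)) as resp:
        return json.loads(resp.read())
```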
### Data Analysis
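A sketch of the kind of ad-hoc analysis the `nexa_data` query utilities support: stream a JSONL dataset and compute simple length statistics. The `completion` field name is an assumption about the dataset schema.

```python
import json
from io import StringIO

def completion_length_stats(jsonl_stream) -> dict:
    """Compute character-length statistics over a JSONL stream of records."""
    lengths = [len(json.loads(line)["completion"])
               for line in jsonl_stream if line.strip()]
    return {
        "count": len(lengths),
        "mean_chars": sum(lengths) / len(lengths) if lengths else 0.0,
        "max_chars": max(lengths, default=0),
    }
```

The same function works on an open file handle over a file in `data/processed/`.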
## Project Structure

## Contributing
NexaCompute follows a modular architecture where each module is self-contained. To extend:

- Register new datasets: Add to `nexa_data/manifest/dataset_registry.yaml`
- Register new models: Use `nexa_train/models/registry.py`
- Add evaluation metrics: Extend `nexa_eval/judge.py`
- Custom training backends: Implement in `nexa_train/backends/`
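For instance, registering a new dataset might look like the following `dataset_registry.yaml` entry. The field names are hypothetical, since the actual registry schema is not shown in this README:

```yaml
# Hypothetical entry - field names are illustrative, not the actual schema.
datasets:
  my_distilled_sft_v1:
    path: data/processed/my_distilled_sft_v1.jsonl
    format: jsonl
    version: "1.0"
    provenance:
      teacher_model: example-teacher   # assumption: provenance tracked per dataset
      created_by: distill_pipeline
```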
## License

[Specify license]

## Support

For questions, issues, or contributions:

- Review Documentation
- Check Runbook for operational procedures
- See Architecture for design details
**NexaCompute** - ML Lab in a Box. Everything you need to run sophisticated ML workflows, from data to deployment.