title: Policies slug: overview/policies description: Centralized storage, safety, and cost policies for NexaCompute workflows.
NexaCompute Policies
This document consolidates safety, storage, and cost policies for the NexaCompute platform.Storage Policy
Storage Hierarchy
NexaCompute uses a three-tier storage architecture for optimal performance and persistence.1. Ephemeral Storage (/workspace/tmp or /scratch)
Purpose: Fast, temporary working area for live training and data loading.
Location: On the GPU node’s local NVMe or ephemeral disk.
Lifecycle: Dies when the node is torn down — never assume persistence.
Contents:
- Always write intermediate checkpoints and temporary logs here.
- Sync only final artifacts to permanent storage.
- Use fast I/O here to maximize GPU throughput.
2. Permanent / Durable Storage (/mnt/nexa_durable)
Purpose: Long-term storage for model outputs, datasets, manifests, evals.
Location: Can be a mounted S3/Backblaze bucket, persistent volume, or local directory synced via rsync.
Lifecycle: Survives reboots and pod teardown.
Contents:
- Treat this as source of truth for all artifacts.
- Rsync this directory to local machine for archival:
- Never commit it to git — only track manifests and hashes.
3. Shared Storage (/workspace/shared or /mnt/shared)
Purpose: Collaboration and multi-node coordination.
Location: Optional — could be an NFS mount or a Tailscale-shared directory.
Lifecycle: Persistent but shared across multiple GPU nodes.
Contents:
- Use for shared datasets or coordination files between jobs.
- Don’t use for high-speed training I/O; this is for metadata and coordination.
Summary Table
| Type | Path | Persistence | Typical Contents | Notes |
|---|---|---|---|---|
| Ephemeral (Scratch) | /workspace/tmp | ❌ | Temp checkpoints, dataloader shards, logs | Fast local NVMe |
| Durable | /mnt/nexa_durable | ✅ | Checkpoints, datasets, evals, manifests | Synced back to Mac |
| Shared | /workspace/shared | ✅ | Common datasets, coordination files | Multi-node access |
| Local Control (Mac) | ~/nexa_compute/durable/ | ✅ | Mirrored durable storage | Local backup + git manifests |
Local Development Storage
For local development, use thedata/ directory structure:
DATA_ORGANIZATION.md in the project root for detailed organization.
Safety Policy
Model Safety
- Dataset Lineage: Enforce dataset lineage: manifests must include source provenance before promotion.
- Evaluation Rubrics: Maintain evaluation rubrics for bias, toxicity, and hallucination checks.
- Human Review: Require human-in-the-loop sign-off for any model destined for production surfaces.
Operational Safety
- Access Control: Access to infrastructure scripts gated via IAM & audited.
- Secrets Management: Secrets managed through environment variables and not committed to source control.
- Distributed Validation: Distributed launches validate node health before joining the training job to avoid partial failure.
Incident Response
- Automatic Halt: On evaluation failure or key metric regression, automatically halt deployment pipelines.
- Post-Mortem: Archive logs/checkpoints for post-mortem; run
nexa_feedbackworkflow to capture remediation steps.
Cost Policy
Cost Tracking
The cost tracker aggregates run-level compute spend captured per resource type. Inputs:compute_hours: GPU/CPU billing hours multiplied by instance hourly rate.storage_gb_month: Persistent volume usage normalised to monthly cost.egress_gb: Network data egress charges.
runs/manifests/cost_<run_id>.json) record:
summarize_costs aggregates totals for rapid reporting. For production usage integrate with billing APIs (AWS Cost Explorer, GCP Billing) and feed into this structure.
Cost Optimization
- Use ephemeral storage for temporary files to minimize persistent volume costs.
- Archive completed runs to cold storage after retention period.
- Monitor egress charges and prefer local storage when possible.