# Worker Registration Guide

## Overview

When you provision a GPU worker from Prime Intellect (or any provider), you'll receive SSH credentials. Use the `/workers/register` endpoint to register the worker with the VPS control plane.

## Step-by-Step Process

### 1. Provision GPU Worker

Get SSH access from your GPU provider (e.g., Prime Intellect):

```
SSH Host: 192.168.1.100
SSH User: root
SSH Key:  ~/.ssh/prime_intellect_key
GPU:      1x A100-40GB
```

### 2. Register Worker via API

```bash

curl -X POST http://localhost:8000/workers/register \
  -H "Content-Type: application/json" \
  -d '{
    "ssh_host": "192.168.1.100",
    "ssh_user": "root",
    "ssh_key_path": "~/.ssh/prime_intellect_key",
    "gpu_count": 1,
    "gpu_type": "A100-40GB",
    "provider": "prime_intellect",
    "metadata": {
      "instance_id": "pi-12345",
      "region": "us-east-1"
    }
  }'
```

**Response:**
```json
{
  "worker_id": "worker-a1b2c3d4",
  "status": "bootstrapping",
  "message": "Worker worker-a1b2c3d4 registered and bootstrapping"
}
```

### 3. Monitor Bootstrap Progress
```bash
curl http://localhost:8000/workers
```

**Response:**
```json
{
  "workers": [
    {
      "worker_id": "worker-a1b2c3d4",
      "ssh_host": "192.168.1.100",
      "gpu_count": 1,
      "gpu_type": "A100-40GB",
      "status": "idle",  // or "bootstrapping", "busy", "error"
      "current_job_id": null,
      "last_heartbeat": 1700000000.0
    }
    }
  ]
}
```

### 4. Submit Job
Once the worker status is `idle`, submit a training job:
```bash
curl -X POST http://localhost:8000/train \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_id": "my_dataset",
    "model": "gpt2",
    "epochs": 1
  }'
```

The job will automatically dispatch to the available worker.
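Conceptually, dispatch amounts to picking the first idle worker from the `/workers` listing. The sketch below illustrates that idea only; `pick_idle_worker` is a hypothetical helper, not part of the API, and the real scheduler may use different logic:

```python
def pick_idle_worker(workers_response):
    """Return the first worker whose status is "idle", or None.

    `workers_response` has the same shape as the GET /workers
    response shown above.
    """
    for worker in workers_response.get("workers", []):
        if worker["status"] == "idle":
            return worker
    return None


listing = {
    "workers": [
        {"worker_id": "worker-busy1", "status": "busy"},
        {"worker_id": "worker-a1b2c3d4", "status": "idle"},
    ]
}
print(pick_idle_worker(listing)["worker_id"])  # worker-a1b2c3d4
```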

## Python SDK Usage

```python
from src.nexa.client import NexaClient

client = NexaClient(base_url="http://localhost:8000")

# Register worker
response = client.request("POST", "/workers/register", {
    "ssh_host": "192.168.1.100",
    "ssh_user": "root",
    "gpu_count": 1,
    "gpu_type": "A100-40GB",
    "provider": "prime_intellect"
})

worker_id = response["worker_id"]
print(f"Worker registered: {worker_id}")

# Wait for bootstrap
import time
while True:
    workers = client.request("GET", "/workers")
    worker = next((w for w in workers["workers"] if w["worker_id"] == worker_id), None)
    if worker and worker["status"] == "idle":
        print("Worker ready!")
        break
    time.sleep(5)

# Submit training job
job = client.train(dataset_id="my_dataset", model="gpt2", epochs=1)
print(f"Job submitted: {job['job_id']}")
```

## Bootstrap Process

When you register a worker, the API automatically:

1. **SSH Connection**: Connects to the worker via SSH
2. **Upload Bootstrap Script**: Uploads and executes bootstrap script
3. **Install Dependencies**: Installs system packages, Python, Docker, NVIDIA drivers
4. **Clone Repository**: Clones Nexa Compute repository
5. **Install Python Deps**: Installs requirements from `requirements.in`
6. **Mark Ready**: Sets worker status to `idle`
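The sequence above can be sketched as a list of shell commands run over SSH. This is an illustration of the flow only, not the actual bootstrap script; the command details, the repository URL, and the helper names `build_bootstrap_commands` / `ssh_command` are all assumptions:

```python
import shlex


def build_bootstrap_commands(repo_url, repo_dir="nexa-compute"):
    """Illustrative bootstrap sequence; the real script may differ."""
    return [
        # 1-3: install system packages, Python, Docker
        "apt-get update && apt-get install -y python3 python3-pip docker.io",
        # 4: clone the repository
        f"git clone {shlex.quote(repo_url)} {shlex.quote(repo_dir)}",
        # 5: install Python dependencies
        f"pip3 install -r {shlex.quote(repo_dir)}/requirements.in",
    ]


def ssh_command(host, user, key_path, remote_cmd):
    """Build an argv list for running one remote command over SSH."""
    return ["ssh", "-i", key_path, f"{user}@{host}", remote_cmd]


cmds = build_bootstrap_commands("https://example.com/nexa-compute.git")
print(ssh_command("192.168.1.100", "root", "~/.ssh/prime_intellect_key", cmds[0]))
```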

## Artifact Storage

Results are automatically uploaded to the configured storage backend:

- **DigitalOcean Spaces** (recommended): Set `DO_SPACES_KEY`, `DO_SPACES_SECRET` in `.env`
- **AWS S3**: Set `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` in `.env`
- **Local**: Falls back to local `artifacts/` directory (not recommended for production)
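The precedence above can be sketched as a small check over environment variables. The function name `pick_storage_backend` and the returned labels are illustrative assumptions, not part of the SDK:

```python
import os


def pick_storage_backend(env):
    """Pick an artifact store using the precedence described above:
    DigitalOcean Spaces, then AWS S3, then the local artifacts/ fallback."""
    if env.get("DO_SPACES_KEY") and env.get("DO_SPACES_SECRET"):
        return "do_spaces"
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "s3"
    return "local"


print(pick_storage_backend(dict(os.environ)))
```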

## Troubleshooting

### Worker stuck in bootstrapping
Check the API logs for errors:
```bash
# API logs will show bootstrap progress
tail -f logs/api.log
```

### Worker in `error` status
SSH into the worker manually to debug:
```bash
ssh -i ~/.ssh/prime_intellect_key root@192.168.1.100
cat /tmp/nexa_bootstrap.sh
```

### No workers available

Register a worker first, or provision one via Prime Intellect API (coming soon).