Inference Stack#

Overview#

Both helium (RTX 5000s) and lithium (M3 Ultra) run llama.cpp servers for distributed inference.

llama.cpp Server Configuration#

Helium Server (RTX 5000s)#

Installation:

# Clone and build
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make -j8

# Or use prebuilt
cargo install llama-cpp-server

Server Configuration#

server.yml:

host: 0.0.0.0
port: 8080
model: /models/qwen3.5-122b-a10b.gguf
n_ctx: 8192
n_batch: 512
n_ubatch: 512
n_threads: 12
n_gpu_layers: 50
flash_attn: true
cache_type_k: q8_0
cache_type_v: f16

Run:

./server -c server.yml

API Endpoints#

Chat completion:

curl http://lithium.mrzk.io:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "user: Hello\nassistant:",
    "n_predict": 512,
    "temperature": 0.7,
    "stop": ["user:", "</s>"]
  }'

Embeddings:

curl http://lithium.mrzk.io:8080/embedding \
  -H "Content-Type: application/json" \
  -d '{
    "content": "Hello world"
  }'

Model Registry#

Model Storage#

Directory structure:

/models/
├── qwen3.5-122b-a10b.gguf    # 122B model, main inference on lithium
├── gemma4-26b-a4b-q4_0.gguf   # 19GB, fallback for helium
├── nomic-embed-text-v1.5.gguf  # 274MB, embeddings
└── models.json                 # Metadata registry

models.json:

{
  "qwen3.5-122b-a10b": {
    "path": "/models/qwen3.5-122b-a10b.gguf",
    "size_gb": 80,
    "quantization": "A10B",
    "context": 16384,
    "parameters": "122B",
    "purpose": "main-inference"
  },
  "gemma4-26b-a4b-q4_0": {
    "path": "/models/gemma4-26b-a4b-q4_0.gguf",
    "size_gb": 19,
    "quantization": "Q4_0",
    "context": 4096,
    "parameters": "26B",
    "purpose": "fallback"
  }
}

Model Download Script#

download-model.sh:

#!/bin/bash
set -e

MODEL_NAME=$1
HUGGINGFACE_REPO=$2

echo "Downloading $MODEL_NAME from $HUGGINGFACE_REPO"

# Use huggingface-cli if available
if command -v huggingface-cli &> /dev/null; then
  huggingface-cli download $HUGGINGFACE_REPO \
    --include "*.gguf" \
    --local-dir /models
else
  # Fallback to wget
  wget -P /models "https://huggingface.co/$HUGGINGFACE_REPO/resolve/main/$MODEL_NAME.gguf"
fi

# Verify checksum if provided
if [ -f "${MODEL_NAME}.sha256" ]; then
  sha256sum -c "${MODEL_NAME}.sha256"
fi

echo "Download complete: /models/${MODEL_NAME}.gguf"

Usage:

./download-model.sh qwen3.5-122b-a10b your-huggingface-repo/qwen3.5-122b-gguf

Janky Configuration#

~/.config/janky/config.toml:

[llm]
base_url = "http://lithium.mrzk.io:8080"
model = "qwen3.5-122b-a10b"
temperature = 0.7
max_tokens = 4096
context_window = 8192

[discord]
enabled = true
channel_id = "1491598715805372416"

[skills]
path = "~/.janky/skills"
auto_load = true

[wiki]
path = "~/.janky/wiki"
auto_commit = true

Lithium Server (M3 Ultra)#

Configuration:

host: 0.0.0.0
port: 8080
model: /models/gemma4-26b-a4b-q4_0.gguf
n_ctx: 8192
n_batch: 512
n_ubatch: 512
n_threads: 12
n_gpu_layers: 999  # Full offload to M3 Ultra unified memory
flash_attn: true
cache_type_k: q8_0
cache_type_v: f16

Run:

./server -c server.yml

Access: http://lithium.mrzk.io:8080

Performance Tuning#

Helium (RTX 5000s - 16GB VRAM each)#

GPU Offloading:

# Check VRAM
nvidia-smi --query-gpu=memory.total --format=csv

Adjust n_gpu_layers:

VRAM 16GB → n_gpu_layers: 50-60
n_ctx: 8192
n_batch: 512
flash_attn: true

Lithium (M3 Ultra - 96GB Unified Memory)#

GPU Offloading:

n_gpu_layers: 999  # Full offload (unified memory)
n_ctx: 16384
n_batch: 1024
flash_attn: true

For large contexts (>8k) on helium, use the Q4_0 quantization to fit more in VRAM.