Inference Stack#
Overview#
Both helium (RTX 5000s) and lithium (M3 Ultra) run llama.cpp servers for distributed inference.
llama.cpp Server Configuration#
Helium Server (RTX 5000s)#
Installation:
# Clone and build
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make -j8
# Or use prebuilt
cargo install llama-cpp-server
Server Configuration#
server.yml:
host: 0.0.0.0
port: 8080
model: /models/gemma4-26b-a4b-q4_0.gguf
n_ctx: 8192
n_batch: 512
n_ubatch: 512
n_threads: 12
n_gpu_layers: 50
flash_attn: true
cache_type_k: q8_0
cache_type_v: f16
Run:
./server -c server.yml
API Endpoints#
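Health check: before sending requests, it can help to wait until the server has finished loading the model. The sketch below polls llama.cpp's `/health` endpoint, which answers with HTTP 200 once the server is ready. It assumes the build in use exposes that endpoint; `wait_for_server`, the retry count, and the delay are illustrative names and defaults, not part of the setup described here.

```shell
# Poll /health until the server answers with a 2xx status, or give up.
# wait_for_server is a hypothetical helper; the defaults are arbitrary.
wait_for_server() {
  local base_url="$1"
  local tries="${2:-30}"
  local delay="${3:-1}"
  local i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl exit non-zero on HTTP errors such as 503 while the model loads
    if curl -fsS --max-time 2 "$base_url/health" > /dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_for_server http://lithium.mrzk.io:8080 60 && echo ready` waits up to roughly a minute before giving up.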
Completion (chat-style prompt):
curl http://lithium.mrzk.io:8080/completion \
-H "Content-Type: application/json" \
-d '{
"prompt": "user: Hello\nassistant:",
"n_predict": 512,
"temperature": 0.7,
"stop": ["user:", "</s>"]
}'
Embeddings:
curl http://lithium.mrzk.io:8080/embedding \
-H "Content-Type: application/json" \
-d '{
"content": "Hello world"
}'
Model Registry#
Model Storage#
Directory structure:
/models/
├── qwen3.5-122b-a10b.gguf # 122B model, main inference on lithium
├── gemma4-26b-a4b-q4_0.gguf # 19GB, fallback for helium
├── nomic-embed-text-v1.5.gguf # 274MB, embeddings
└── models.json # Metadata registry
models.json:
{
"qwen3.5-122b-a10b": {
"path": "/models/qwen3.5-122b-a10b.gguf",
"size_gb": 80,
"quantization": "A10B",
"context": 16384,
"parameters": "122B",
"purpose": "main-inference"
},
"gemma4-26b-a4b-q4_0": {
"path": "/models/gemma4-26b-a4b-q4_0.gguf",
"size_gb": 19,
"quantization": "Q4_0",
"context": 4096,
"parameters": "26B",
"purpose": "fallback"
}
}
Model Download Script#
download-model.sh:
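#!/bin/bash
set -euo pipefail
# Usage: ./download-model.sh MODEL_NAME HUGGINGFACE_REPO
if [ "$#" -ne 2 ]; then
  echo "Usage: $0 MODEL_NAME HUGGINGFACE_REPO" >&2
  exit 1
fi
MODEL_NAME=$1
HUGGINGFACE_REPO=$2
MODEL_DIR=/models
echo "Downloading $MODEL_NAME from $HUGGINGFACE_REPO"
# Use huggingface-cli if available
if command -v huggingface-cli &> /dev/null; then
  huggingface-cli download "$HUGGINGFACE_REPO" \
    --include "*.gguf" \
    --local-dir "$MODEL_DIR"
else
  # Fallback to wget
  wget -P "$MODEL_DIR" "https://huggingface.co/$HUGGINGFACE_REPO/resolve/main/$MODEL_NAME.gguf"
fi
# Verify checksum if provided (assumes the .sha256 file sits next to the model)
if [ -f "$MODEL_DIR/${MODEL_NAME}.sha256" ]; then
  # sha256sum -c resolves file names relative to the current directory
  (cd "$MODEL_DIR" && sha256sum -c "${MODEL_NAME}.sha256")
fi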
echo "Download complete: /models/${MODEL_NAME}.gguf"
Usage:
./download-model.sh qwen3.5-122b-a10b your-huggingface-repo/qwen3.5-122b-gguf
Janky Configuration#
~/.config/janky/config.toml:
[llm]
base_url = "http://lithium.mrzk.io:8080"
model = "qwen3.5-122b-a10b"
temperature = 0.7
max_tokens = 4096
context_window = 8192
[discord]
enabled = true
channel_id = "1491598715805372416"
[skills]
path = "~/.janky/skills"
auto_load = true
[wiki]
path = "~/.janky/wiki"
auto_commit = true
Lithium Server (M3 Ultra)#
Configuration:
host: 0.0.0.0
port: 8080
model: /models/qwen3.5-122b-a10b.gguf
n_ctx: 8192
n_batch: 512
n_ubatch: 512
n_threads: 12
n_gpu_layers: 999 # Full offload to M3 Ultra unified memory
flash_attn: true
cache_type_k: q8_0
cache_type_v: f16
Run:
./server -c server.yml
Access: http://lithium.mrzk.io:8080
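Since both hosts serve the same API on port 8080, a client can prefer lithium and fall back to helium when lithium is unreachable. The following is a minimal sketch of that selection, assuming both servers answer llama.cpp's `/health` endpoint. `pick_server` is a hypothetical helper, and the helium hostname used in the usage line is an assumption, since this page only gives the lithium address.

```shell
# Print the first base URL whose /health endpoint responds with a 2xx status.
# pick_server is a hypothetical helper; candidates are tried in the order given.
pick_server() {
  local url
  for url in "$@"; do
    if curl -fsS --max-time 2 "$url/health" > /dev/null 2>&1; then
      printf '%s\n' "$url"
      return 0
    fi
  done
  # No candidate answered
  return 1
}
```

A call such as `BASE_URL=$(pick_server http://lithium.mrzk.io:8080 http://helium.mrzk.io:8080)` then yields the address to use for requests, and the function returns non-zero when neither host is up.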
Performance Tuning#
Helium (RTX 5000s - 16GB VRAM each)#
GPU Offloading:
# Check VRAM
nvidia-smi --query-gpu=memory.total --format=csv
Adjust n_gpu_layers:
VRAM 16GB → n_gpu_layers: 50-60
n_ctx: 8192
n_batch: 512
flash_attn: true
Lithium (M3 Ultra - 96GB Unified Memory)#
GPU Offloading:
n_gpu_layers: 999 # Full offload (unified memory)
n_ctx: 16384
n_batch: 1024
flash_attn: true
For large contexts (>8k) on helium, use the Q4_0-quantized model so that more VRAM is left for the KV cache.
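To see how n_ctx and the cache types trade off against VRAM, the KV cache can be estimated as layers × context × KV heads × head dimension elements for each of K and V, multiplied by the bits per element of the chosen cache type. The sketch below does that arithmetic in shell. `kv_cache_mib` and `cache_half_bits` are hypothetical helpers, and the model dimensions must be read from the GGUF metadata; the ones in the usage line are placeholders, not the real dimensions of either model. The per-element sizes follow ggml's block layouts, and the estimate ignores architecture-specific savings such as sliding-window layers.

```shell
# Half-bits per cached element for a ggml cache type (doubled to stay in integers).
cache_half_bits() {
  case "$1" in
    f16)  echo 32 ;;  # 16 bits per element
    q8_0) echo 17 ;;  # 34-byte block per 32 elements = 8.5 bits
    q4_0) echo 9 ;;   # 18-byte block per 32 elements = 4.5 bits
    *) echo "unknown cache type: $1" >&2; return 1 ;;
  esac
}

# Estimate KV cache size in MiB. Arguments:
#   n_layers n_ctx n_kv_heads head_dim cache_type_k cache_type_v
kv_cache_mib() {
  local k_bits v_bits elems
  k_bits=$(cache_half_bits "$5") || return 1
  v_bits=$(cache_half_bits "$6") || return 1
  # One K and one V tensor per layer, each n_ctx x n_kv_heads x head_dim elements
  elems=$(( $1 * $2 * $3 * $4 ))
  # half-bits / 16 = bytes, then bytes / 1048576 = MiB
  echo $(( elems * (k_bits + v_bits) / 16 / 1048576 ))
}
```

With placeholder dimensions, `kv_cache_mib 48 8192 8 128 q8_0 f16` prints 1176, compared with 1536 for f16 on both K and V and 432 for q4_0 on both.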