Homelab#
Documentation for the marzukia homelab infrastructure.
Architecture Diagram#
graph TB
    subgraph "External Access"
        Internet[Internet]
    end
    subgraph "Network Layer"
        Hydrogen[hydrogen.mrzk.io<br/>Intel NUC<br/>Router/Traefik/DNS]
    end
    subgraph "Compute Layer"
        Helium[helium.mrzk.io<br/>Ryzen 7 9700X<br/>2x RTX 5000 Turing<br/>64GB RAM]
        Lithium[lithium.mrzk.io<br/>M3 Ultra<br/>96GB Unified Memory]
    end
    subgraph "GPU Resources"
        GPU1[Quadro RTX 5000 x2<br/>16GB NVLink Turing]
        GPU2[M3 Ultra<br/>96GB Unified Memory]
    end
    subgraph "Services"
        LLM[LLM Server<br/>Qwen3.5-122B A10B]
        Discord[Discord Bot]
        Web[Web Interface]
    end
    Internet --> Hydrogen
    Hydrogen --> Helium
    Hydrogen --> Lithium
    Helium --> GPU1
    Lithium --> GPU2
    GPU1 --> LLM
    GPU2 --> LLM
    LLM --> Discord
    LLM --> Web
DevOps Infrastructure Stack#
- code repo: GitHub (external)
- ci/cd: Drone CI (helium)
- container mgmt: Portainer (helium)
- auth: Authentik (helium)
- monitoring: Prometheus + Grafana (helium)
- network: UniFi Controller (helium)
- storage: MinIO (helium)
- paas: Coolify (helium, optional)
- automation: n8n (helium, optional)
- uptime: Uptime Kuma (helium)
- reverse proxy: Traefik (hydrogen)
- dns: Pi-hole (hydrogen)
- security: Fail2Ban (hydrogen)
- inference: llama.cpp (helium + lithium)
Logging Strategy#
Stack: Loki + Promtail + Grafana
loki.mrzk.io: Log aggregation on helium (port 3100)
- Compressed storage, indexed by labels (not full-text)
- Stores logs from all Docker containers via Promtail
- Retention: 15 days default, configurable per-service
promtail: Log collector running on each host
- Scrape Docker container logs via /var/run/docker.sock
- Add labels: service, container, host, env
- Ship to Loki at http://loki:3100/loki/api/v1/push
grafana: Unified query interface
- Query logs with LogQL (similar to PromQL)
- Correlate logs with metrics on same dashboards
- Alert on log patterns (errors, warnings, spikes)
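To make the collector-to-Loki hand-off and the query side more concrete, here is a minimal Python sketch of the two HTTP requests involved. It only builds the request data and sends nothing. It assumes the push address given above (http://loki:3100) and Loki's standard JSON push and query_range endpoints; the container name `drone-runner` and the helper names `build_push_payload` and `build_query_url` are hypothetical and used purely for illustration.

```python
import json
import time
from urllib.parse import urlencode

# Assumption: base URL taken from the push address in the text; adjust as needed.
LOKI_BASE_URL = "http://loki:3100"


def build_push_payload(line, service, container, host, env, ts_ns=None):
    """Build the JSON body for POST /loki/api/v1/push carrying one log line."""
    if ts_ns is None:
        ts_ns = time.time_ns()
    return {
        "streams": [
            {
                # A stream is identified by its label set; Loki indexes
                # these labels, not the text of the log line.
                "stream": {
                    "service": service,
                    "container": container,
                    "host": host,
                    "env": env,
                },
                # Each entry is [nanosecond timestamp as a string, log line].
                "values": [[str(ts_ns), line]],
            }
        ]
    }


def build_query_url(logql, start_ns, end_ns, limit=100):
    """Build a GET URL for /loki/api/v1/query_range running a LogQL query."""
    params = urlencode(
        {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    )
    return f"{LOKI_BASE_URL}/loki/api/v1/query_range?{params}"


if __name__ == "__main__":
    # Hypothetical log line and container name, for illustration only.
    payload = build_push_payload(
        "build failed: exit code 1",
        service="drone",
        container="drone-runner",
        host="helium",
        env="prod",
        ts_ns=1_700_000_000_000_000_000,
    )
    print(json.dumps(payload, indent=2))
    print(
        build_query_url(
            '{service="drone"} |= "error"',
            1_700_000_000_000_000_000,
            1_700_000_300_000_000_000,
        )
    )
```

In practice Promtail produces the push payload itself, so the first helper is mainly useful for understanding what the four labels look like on the wire or for pushing a test line by hand. The second helper is one way to run a LogQL query from a script instead of from Grafana.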
integration by service:
- github: Webhook events, action logs (via Drone)
- drone ci: Build logs, deployment output, job failures
- authentik: Security events, authentication failures, user activity
- minio: Access logs, bucket operations, error events
- coolify: Deployment logs, container events, build output
- postgres: Query logs, slow queries, connection events
- traefik: Access logs, routing errors, upstream failures
- prometheus: Scrape errors, rule evaluation failures
- portainer: Container events, deployment changes, user activity
- n8n: Workflow execution logs, webhook events, integration failures
- uptime kuma: Uptime checks, status changes, alert notifications
query examples:
# Filter by service
{service="drone"} |= "error"
# Count errors by container
sum by (container) (count_over_time({env="prod"} |= "error" [5m]))
# Alert on authentication failures
{service="authentik"} |= "authentication failed"
TODO - Documentation Gaps#
- Backup strategy: Automated backup schedule, retention policy, and testing procedures
- Monitoring gaps: Alerting rules, dashboard strategy, and notification channels
- Log aggregation (done): Loki + Promtail + Grafana stack documented with integration strategy, see Logging Strategy above
- Secret management: Vault/sealed-secrets implementation for API keys and credentials
- Disaster recovery: Documented RTO/RPO and recovery procedures
- Database backups: Postgres backup retention policy and recovery testing schedule