Homelab#

Documentation for the marzukia homelab infrastructure.

Architecture Diagram#


graph TB
    subgraph "External Access"
        Internet[Internet]
    end
    
    subgraph "Network Layer"
        Hydrogen[hydrogen.mrzk.io<br/>Intel NUC<br/>Router/Traefik/DNS]
    end
    
    subgraph "Compute Layer"
        Helium[helium.mrzk.io<br/>Ryzen 7 9700X<br/>2x RTX 5000 Turing<br/>64GB RAM]
        Lithium[lithium.mrzk.io<br/>M3 Ultra<br/>96GB Unified Memory]
    end
    
    subgraph "GPU Resources"
        GPU1[Quadro RTX 5000 x2<br/>16GB NVLink Turing]
        GPU2[M3 Ultra<br/>96GB Unified Memory]
    end
    subgraph "Services"
        LLM[LLM Server<br/>Qwen3.5-122B A10B]
        Discord[Discord Bot]
        Web[Web Interface]
    end
    
    Internet --> Hydrogen
    Hydrogen --> Helium
    Hydrogen --> Lithium
    Helium --> GPU1
    Lithium --> GPU2
    GPU1 --> LLM
    GPU2 --> LLM
    LLM --> Discord
    LLM --> Web

DevOps Infrastructure Stack#

  • code repo: GitHub (external)
  • ci/cd: Drone CI (helium)
  • container mgmt: Portainer (helium)
  • auth: Authentik (helium)
  • monitoring: Prometheus + Grafana (helium)
  • network: UniFi Controller (helium)
  • storage: MinIO (helium)
  • paas: Coolify (helium, optional)
  • automation: n8n (helium, optional)
  • uptime: Uptime Kuma (helium)
  • reverse proxy: Traefik (hydrogen)
  • dns: Pi-hole (hydrogen)
  • security: Fail2Ban (hydrogen)
  • inference: llama.cpp (helium + lithium)

Logging Strategy#

Stack: Loki + Promtail + Grafana

  • loki.mrzk.io: Log aggregation on helium (port 3100)

    • Compressed storage, indexed by labels (not full-text)
    • Stores logs from all Docker containers via Promtail
    • Retention: 15 days default, configurable per-service
  • promtail: Log collector running on each host

    • Scrape Docker container logs via /var/run/docker.sock
    • Add labels: service, container, host, env
    • Ship to Loki at http://loki:3100/loki/api/v1/push
  • grafana: Unified query interface

    • Query logs with LogQL (similar to PromQL)
    • Correlate logs with metrics on same dashboards
    • Alert on log patterns (errors, warnings, spikes)
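
The "ship to Loki" step above can be illustrated with the request body that Loki's push endpoint accepts. This is a minimal sketch of the JSON shape only, not of how Promtail is implemented: a `streams` list where each stream carries a label set and a list of `[nanosecond timestamp as string, log line]` pairs. The helper name `build_push_payload` and the sample label values are hypothetical.

```python
import json
import time


def build_push_payload(labels, lines, now_ns=None):
    """Build a JSON-serialisable body for Loki's /loki/api/v1/push endpoint.

    labels: dict of label name -> value (e.g. service, container, host, env).
    lines:  iterable of log line strings, all stamped with the same time.
    now_ns: nanoseconds since the Unix epoch; defaults to the current time.
    """
    if now_ns is None:
        now_ns = time.time_ns()
    # Loki expects each entry as [timestamp-as-string, log line].
    values = [[str(now_ns), line] for line in lines]
    return {"streams": [{"stream": dict(labels), "values": values}]}


# Hypothetical label values, using the label names listed above.
example = build_push_payload(
    {"service": "drone", "container": "drone-server", "host": "helium", "env": "prod"},
    ["build started", "build failed: exit code 1"],
    now_ns=1_700_000_000_000_000_000,
)
body = json.dumps(example)  # would be POSTed with Content-Type: application/json
```

Keeping the label set small (service, container, host, env) matters because Loki indexes by labels rather than by log content, so every distinct label combination creates a separate stream.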

integration by service:

  • github: Webhook events, action logs (via Drone)
  • drone ci: Build logs, deployment output, job failures
  • authentik: Security events, authentication failures, user activity
  • minio: Access logs, bucket operations, error events
  • coolify: Deployment logs, container events, build output
  • postgres: Query logs, slow queries, connection events
  • traefik: Access logs, routing errors, upstream failures
  • prometheus: Scrape errors, rule evaluation failures
  • portainer: Container events, deployment changes, user activity
  • n8n: Workflow execution logs, webhook events, integration failures
  • uptime kuma: Uptime checks, status changes, alert notifications

query examples:

# Error lines from a single service
{service="drone"} |= "error"

# Count error lines by container over the last 5 minutes
sum by (container) (count_over_time({env="prod"} |= "error" [5m]))

# Match authentication failures (basis for an alert rule)
{service="authentik"} |= "authentication failed"
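
To run queries like these outside Grafana, Loki also exposes an HTTP API. The following is a minimal sketch that builds a request URL for the `query_range` endpoint, assuming Loki is reachable at `http://loki:3100` (the address used for pushes above). The helper name `query_range_url` and the one-hour window are illustrative choices, not part of the documented setup.

```python
from urllib.parse import urlencode


def query_range_url(base_url, logql, start_ns, end_ns, limit=100):
    """Return a GET URL for Loki's /loki/api/v1/query_range endpoint."""
    params = {
        "query": logql,
        "start": str(start_ns),  # nanoseconds since the Unix epoch
        "end": str(end_ns),
        "limit": str(limit),
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


end_ns = 1_700_000_000_000_000_000
start_ns = end_ns - 3600 * 1_000_000_000  # one hour earlier
url = query_range_url("http://loki:3100", '{service="drone"} |= "error"', start_ns, end_ns)
```

The resulting URL can be fetched with any HTTP client. The response is JSON, with the matching streams under `data.result`.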

TODO - Documentation Gaps#

  • Backup strategy: Automated backup schedule, retention policy, and testing procedures
  • Monitoring gaps: Alerting rules, dashboard strategy, and notification channels
  • Log aggregation (done): Loki + Promtail + Grafana stack documented with integration strategy; see Logging Strategy
  • Secret management: Vault/sealed-secrets implementation for API keys and credentials
  • Disaster recovery: Documented RTO/RPO and recovery procedures
  • Database backups: Postgres backup retention policy and recovery testing schedule