Homelab

Hardware

The homelab is made up of three Beelink Mini SER5 Maxes. Each has 32 GB of DDR4 memory and 500 GB of NVMe storage. I also added 2 TB of SSD storage to each. So in total there's:

  • 24 cores / 48 threads
  • 96 GB memory
  • 1.5 TB NVMe storage and 6 TB SATA storage

The main reason I opted for the SER5 Max is that they're cheap (~$350 each), have low power draw (advertised at 54 W per node), and don't take up much space. That said, everything in these is slower than a consumer desktop: the 5800H is a laptop CPU, storage is over PCIe 3 rather than 4, and the memory is DDR4. That's all fine, though, since this is for tinkering.

Deployment

Terraform

I deploy applications into the homelab with Nomad job specs applied through Terraform's nomad_job resource. Secrets are populated from 1Password.

# deploy job specs to Nomad, configure Cloudflare, etc
make tf

# tf/nomad.tf
resource "nomad_job" "jobs" {
  # one nomad_job per job spec file in ../nomad
  for_each = fileset(path.module, "../nomad/*")

  jobspec = file("${path.module}/${each.key}")
}
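
Secrets take a similar path. A hedged sketch with the 1Password Terraform provider (the vault, item, and variable names here are all made up):

// hypothetical: pull a secret out of 1Password into a Nomad Variable
data "onepassword_item" "consul_token" {
  vault = var.op_vault_id      // 1Password vault UUID (placeholder)
  title = "consul-read-token"  // hypothetical item name
}

resource "nomad_variable" "example" {
  path = "nomad/jobs/example"
  items = {
    consul_token = data.onepassword_item.consul_token.password
  }
}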

Consul

Consul is used as a service catalog -- to configure Traefik -- and to bootstrap Nomad.
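
Concretely, each Nomad agent points at its local Consul agent and discovers the rest of the cluster through it. A minimal sketch of the consul stanza in a Nomad agent config (values are placeholders, not from this repo):

# nomad agent config (sketch)
consul {
  address = "127.0.0.1:8500"
  ssl     = true
  token   = "<acl-token-for-nomad>"
}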

ansible-consul

I deployed it using the ansible-consul role.

Compared to deploying Consul manually, ansible-consul has more knobs, sanity checks, and baked-in best practices. It's also nice for setting things up once and automating away the tedium of SSH'ing into every host. That said, ansible-consul is very stale and its defaults are for development Consul clusters. I've changed a couple dozen settings in ./ansible/tasks/consul.yaml to enable TLS, gossip encryption, ACLs, and Prometheus metrics.

# install the role, then generate a CA and server certs for the cluster
cd ./ansible/roles/ansible-consul/files
ansible-galaxy install -r ../requirements.yml
consul tls ca create
consul tls cert create -server -dc dc1 -domain consul
cd -

# roll Consul out to every host
ansible-playbook -i hosts.yaml ./tasks/consul.yaml
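
For a sense of what those overrides look like, here's a sketch of the role variables involved (names recalled from ansible-consul's defaults -- double-check against the role itself):

# ansible-consul vars: harden the cluster beyond the dev defaults
consul_tls_enable: true        # TLS, using the certs generated above
consul_encrypt_enable: true    # gossip encryption
consul_acl_enable: true
consul_acl_default_policy: deny
consul_config_custom:          # extra config merged in, e.g. telemetry
  telemetry:
    prometheus_retention_time: 24h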

Nomad

Nomad orchestrates and deploys the rest of the services (besides itself and Consul). When I was deploying it, I first wrote an Ansible playbook myself, but then found ansible-nomad: https://github.com/ansible-community/ansible-nomad.

ansible-playbook -i hosts.yaml ./tasks/nomad.yaml

PITA bug from Beelink Product ID

While setting up Nomad I hit an annoying bug where 2 of 3 clients were unable to register themselves:

client.rpc: error performing RPC to server: error="rpc error: rpc error: node secret ID does not match. Not registering node." rpc=Node.Register server=192.168.0.138:46>
client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: node secret ID does not match. Not registering node." >

I spent 3-4 hours re-reading the same bug reports (1, 2, 3) and purging + reinstalling Nomad. I got lucky and eventually noticed that the client IDs (not the secret IDs, which I'd checked) were identical between clients:

# host 1
client: started client: node_id=9466df29-582e-42cf-84dc-296acd1886c9
# host 2
client: started client: node_id=9466df29-582e-42cf-84dc-296acd1886c9

I worked around the issue by setting random client IDs via Ansible (below), and they all came up fine:

- name: Check for an existing client ID
  stat:
    path: /var/nomad/client/client-id
  register: client_id_file

- name: Generate lowercase UUID using Python
  command: python3 -c "import uuid; print(str(uuid.uuid4()).lower())"
  register: generated_uuid
  when: not client_id_file.stat.exists

- name: Save the UUID to /var/nomad/client/client-id
  copy:
    content: "{{ generated_uuid.stdout }}"
    dest: /var/nomad/client/client-id
  when: not client_id_file.stat.exists

Curious why I was getting duplicates, I found that Nomad uses a host-level UUID retrieved from /sys/class/dmi/id/product_uuid, which is frequently (and in my case was) bogus. The UUID on all of my machines was whatever hard-coded value Beelink had set, so all of my nodes were starting with identical node IDs:

# host 1
root@ser5-1:/home/josh# cat /sys/class/dmi/id/product_uuid
03000200-0400-0500-0006-000700080009
# host 2
root@ser5-2:/home/josh# cat /sys/class/dmi/id/product_uuid
03000200-0400-0500-0006-000700080009

This is still weird, though, because Nomad defaults to generating a unique client ID randomly on start-up: the host UUID is only used when no_host_uuid is false, and it defaults to true (including in v1.6.1, which is what I was running):

	// only derive the node ID from the host's DMI UUID when NoHostUUID is
	// explicitly disabled; otherwise Nomad falls back to a random ID
	hostInfo, err := host.Info()
	if !conf.NoHostUUID && err == nil {
		if hashed, ok := helper.HashUUID(hostInfo.HostID); ok {
			hostID = hashed
		}
	}

Unfortunately for me, ansible-nomad is stale and has nomad_no_host_uuid: no as its default. I thought I'd worked around that by setting it to yes during one of the reinstalls, but I hadn't deleted the stale /var/nomad/client/client-id files, so the duplicate IDs stuck around.
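
The durable fix is the one-line role var below, plus deleting the stale /var/nomad/client/client-id on each host before restarting Nomad:

# ansible-nomad var: always generate a random node ID
nomad_no_host_uuid: yes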

Access

There are two types of access to the homelab:

  1. HTTP API access to services and Nomad
  2. SSH access to the homelab nodes

I use Cloudflare Tunnels to address both: one ingress rule points at Traefik (for HTTP access to hosted APIs) and another points at the SSH server for remote access. I then use sshuttle over that SSH connection to reach the rest of the services in the homelab.

Traefik

Traefik routes to the services using Consul's catalog for service discovery.

I haven't spent the time to wire all the services into Traefik (yet). It's tricky with services like Nomad because you can't configure the path the UI queries (e.g. this Github thread for Nomad). So even if I put Nomad behind /nomad in Traefik, the UI will blissfully query the (default) /ui path, which 404s.

I also haven't spent the time to move everything into a service mesh. I should -- it would be nice to offload mTLS and intentions to Consul -- but I ran into an issue trying to join all the MinIO instances into a cluster (see: MINIO_VOLUMES). In theory I could use a Nomad template that interpolates the Consul minio service instances, but that creates a chicken-and-egg problem: MinIO won't start without MINIO_VOLUMES, so there's nothing to populate the minio service in Consul and fill the template. I've since found the {{ service "web|any" }} filter, which should return all instances, healthy or otherwise, but by then I'd come to prefer static ports and host networking since it's faster.
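
For reference, here's roughly what that template would look like with the any filter (hypothetical -- it assumes a minio service registered in Consul):

# hypothetical Nomad template: the "|any" filter returns instances regardless
# of health, sidestepping the chicken-and-egg problem above
MINIO_VOLUMES='{{ range service "minio|any" }}http://{{ .Address }}:{{ .Port }}/mnt/sata {{ end }}'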

Traefik configuration

I create one allocation of Traefik on each node using a system-type job, and pass a read-only Consul token to Traefik through a Nomad Variable:

// tf/nomad.tf
resource "nomad_variable" "traefik" {
  path = "nomad/jobs/traefik"
  items = {
    read_token = data.consul_acl_token_secret_id.traefik.secret_id
  }
}

// jobs/traefik.hcl
job "traefik" {
    datacenters = ["dc1"]
    type        = "system"

    group "traefik" {
...

      config {
        image        = "traefik:2.10"
        network_mode = "host"
        ports        = ["http", "admin"]
        volumes      = ["local/traefik.yaml:/etc/traefik/traefik.yaml"]
      }

      template {
        # https://developer.hashicorp.com/nomad/tutorials/load-balancing/load-balancing-traefik
        destination = "local/traefik.yaml"
        data        = <<EOF
...
providers:
  consulCatalog:
    exposedByDefault: false
    connectAware: true
    cache: false
    connectByDefault: false

    endpoint:
      address: {{ env "NOMAD_IP_http" }}:8500
      scheme: http
      token: {{ with nomadVar "nomad/jobs/traefik" }}{{ .read_token }}{{ end }}
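
With exposedByDefault: false, a service only gets a route when its Nomad job opts in through Consul tags. A hypothetical registration (the service name and rule are made up):

// hypothetical service stanza in a Nomad job, picked up via consulCatalog
service {
  name = "whoami"
  port = "http"
  tags = [
    "traefik.enable=true",
    "traefik.http.routers.whoami.rule=Host(`whoami.homelab.com`)",
  ]
}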

Cloudflare Tunnel

I want to dispatch Nomad jobs into the homelab from Github Actions. I thought about setting up a daemon that pulls configs from an S3 bucket that I'd push to from Github Actions, to avoid exposing my home network to the internet, but opted instead for a Cloudflare Tunnel (throwing security concerns in the garbage).

# the end result: the Nomad API, reachable from anywhere through the tunnel
curl -H "CF-Access-Client-Id: xx" -H "CF-Access-Client-Secret: xx" -v https://nomad.homelab.com

I didn't choose Tailscale Funnel or ngrok because Cloudflare has some extra nice-to-have features like domain management, access policies, IdP support, etc.

Cloudflare configuration

Each node runs cloudflared for a cloudflare_tunnel created with Terraform. The cloudflared process points at the Nomad endpoint on each host:

// creating a tunnel with a secret
resource "cloudflare_tunnel" "auto_tunnel" {
  name       = "homelab"
  secret     = random_id.tunnel_secret.b64_std
}

// configuring egress to Nomad
resource "cloudflare_tunnel_config" "auto_tunnel" {
  tunnel_id  = cloudflare_tunnel.auto_tunnel.id
  account_id = var.cloudflare_account_id

  config {
    // catch-all rule: anything without a hostname match goes to the local Nomad API
    ingress_rule {
      service = "http://localhost:4646"
    }
  }
}

// skipping stuff

// making an access policy associated w/ the domain that uses a cloudflare service token
resource "cloudflare_access_policy" "nomad_token" {
  application_id = cloudflare_access_application.nomad.id
  zone_id        = var.cloudflare_zone_id
  precedence     = "1"
  decision       = "non_identity"

  include {
    service_token = [cloudflare_access_service_token.token.id]
  }
}

// make the service token for machine <> machine calls
resource "cloudflare_access_service_token" "token" {
  zone_id = var.cloudflare_zone_id
  name    = "homelab-token"
}
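
On each host, cloudflared itself then just runs as a service bound to that tunnel. Something like the below (the exact token plumbing is an assumption):

# install cloudflared as a systemd service using the tunnel's token
cloudflared service install <tunnel-token>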

I can then use the access service token (cloudflare_access_service_token.token above) to access Nomad remotely:

# this calls the nomad api
curl -H "CF-Access-Client-Id: xx" -H "CF-Access-Client-Secret: xx" -v https://nomad.homelab.com

sshuttle

I use a combination of sshuttle and cloudflared to access services' UIs when I'm not at home, e.g. sshuttle -NHr homelab 0/0.

sshuttle configuration

I do it with an ingress rule for cloudflared using Github as an IdP:

  config {
    // ingress rule pointing at the ssh server
    ingress_rule {
      hostname = "ssh.homelab.com"
      service  = "ssh://localhost:22"
    }

    ingress_rule {
      service = "http://localhost:4646"
    }
  }

And a configuration for my SSH client to use cloudflared as a proxy:

Host homelab
    ProxyCommand cloudflared access ssh --hostname ssh.homelab.com
    StrictHostKeyChecking no
    User josh

And I can then remotely access any service in the homelab with sshuttle -NHr homelab 0/0.

Storage

MinIO

MinIO is a distributed storage engine with an S3-compatible interface. This means I can write data from one node into MinIO and access it from applications running on other nodes, with some redundancy during deployments or restarts of any one node.
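
Anything that speaks S3 can use it. For example, with MinIO's mc client (the alias, endpoint, and credentials are placeholders):

# point mc at the homelab MinIO and round-trip an object
mc alias set homelab http://192.168.0.138:9000 ACCESS_KEY SECRET_KEY
mc cp ./backup.tar.gz homelab/backups/
mc ls homelab/backups/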

MinIO configuration

I tried hard to make Ceph work. I wanted both a distributed filesystem plus an S3-compatible API. I tried cephadm, ceph-ansible, hopping from node to node running systemctl restart ceph-mon... I hit countless bugs, lost two days of my life, and learned nothing except that Ceph is a beast. It felt like joining a backend team and trying to set up the E2E test environment using a wiki that's two years out of date.

Installing MinIO was so much simpler that I became even more pissed at Ceph. In 10 minutes I'd copied their installation guide into an Ansible playbook and had it running across all the hosts. I then switched to running it in Nomad after creating and mounting the volume:

# ansible-nomad var
nomad_host_volumes:
  - name: sata
    path: /mnt/sata
    owner: minio-user
    group: minio-user
    mode: "0755"
    read_only: false

// tf/jobs/minio.hcl
job "minio" {
  datacenters = ["dc1"]
  type        = "system"

  group "minio" {
    volume "minio" {
      type      = "host"
      source    = "sata"
      read_only = false
    }
...
    task "minio" {
      driver = "docker"

      config {
        image        = "minio/minio:RELEASE.2023-08-16T20-17-30Z.hotfix.60799aeb0"
        network_mode = "host"
        ports        = ["minio-api", "minio-console"]
        args = ["server",
          "--address", ":${NOMAD_PORT_minio_api}",
          "--console-address", ":${NOMAD_PORT_minio_console}",
        ]
      }

      volume_mount {
        volume           = "minio"
        destination      = "/mnt/sata"
        propagation_mode = "private"
      }

      template {
        destination = ".env"
        env         = true
        data        = <<EOF
MINIO_VOLUMES = '{{ range service "consul" }}http://{{ .Address }}:9000/mnt/sata {{ end }}'
EOF
