Talos is the OS I use to run my Kubernetes cluster. It's a very minimal distribution dedicated solely to providing the host layer for running Kubernetes. It is only manageable remotely, using the talosctl CLI, and is configured with YAML files that will feel familiar to anyone used to typical Kubernetes manifests. This makes it a very low-maintenance and reliable OS for everything from small homelabs such as mine to very large clusters in the cloud.
Inevitably it still needs to be updated, but that is fairly easy to handle. This guide is my standard upgrade process for Talos: relatively simple, just verify, run commands, and verify again. I'll add a section at the end for Kubernetes upgrades, which I do after each Talos upgrade; that part is entirely automated by Talos and is just a single command.
Health Check
First, ensure the cluster and its dependents are healthy. Most important are stateful workloads whose data is itself clustered. Since I have only three workers, I do not have spare capacity and will face some slight interruption as each node reboots. I really do not want either of the other two nodes to be having any kind of issue while that is occurring.
Etcd
Check the status of etcd, ensuring there is a leader and there are no errors.
talosctl -n <ip address of control node> etcd status

Rook-Ceph
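In addition to the status output, it can help to confirm membership. `talosctl etcd members` lists the current etcd members, which is a quick way to check that every control plane node is present and none are stuck as learners; the IP below is a placeholder.

```shell
# List etcd members as seen from one control plane node; every control plane
# node should appear, and LEARNER should be false for established members.
talosctl -n <ip address of control node> etcd members
```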
Either browse to the dashboard, if that is enabled, or exec into the tools container with the following command:
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[*].metadata.name}') -- bash

Inside the rook-ceph-tools container, check the status:
ceph status

There should be no errors, warnings, or other issues, and the cluster should be fully healthy. Especially make sure no OSDs or MONs are down.
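If `ceph status` does report a warning, the following standard Ceph commands, run in the same tools container, help narrow it down. These are a general suggestion rather than part of my usual flow:

```shell
# Expand any HEALTH_WARN/HEALTH_ERR into the specific checks that triggered it.
ceph health detail

# Per-OSD view: confirm every OSD is up and in before rebooting a node.
ceph osd status
```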
CloudNativePG
Check the status of the CloudNativePG clusters to ensure they are all healthy. There is potential for data loss if a worker node fails or a local volume isn't reattached. In my lab I have a Grafana dashboard as well as Prometheus alerts to inform me of this status.
Here is a link to the GitHub repo for CNPG's Grafana dashboard.
I follow the Prometheus rules in their cluster Helm Chart for alerts.
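Without a dashboard, the same state can be checked from the CLI. This is a hedged suggestion: it assumes the CNPG kubectl plugin is installed, and `<cluster>` and `<namespace>` are placeholders.

```shell
# List all CNPG clusters across namespaces with their reported status.
kubectl get clusters.postgresql.cnpg.io -A

# Detailed view of one cluster: instance roles, replication status, and
# backup information.
kubectl cnpg status <cluster> -n <namespace>
```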
Garage
Check the status of the Garage cluster, which backs the local S3 store. If this cluster fails, the short-term WALs stored there will be lost.
In my case I use Garage Web UI as a simple monitoring dashboard.
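The same check can also be done with the garage CLI inside the Garage container. The namespace and pod name below are placeholders, and the binary path assumes the official image layout:

```shell
# garage status prints the cluster layout; every node should be listed as
# healthy, with nothing under "failed nodes".
kubectl -n <garage-namespace> exec -it <garage-pod> -- /garage status
```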
Upgrade
I reference my config repo for the exact commands, links to the factory page, and where I update the image versions. However, I keep this repo private to ensure I don't leak secrets.
Exactly how to set up configuration patches is something I'll consider covering outside this guide. I referenced the Talos docs on reproducible machine configuration when setting mine up.
A rough outline of what I have can be seen in the following command. I have a common patch, a patch for either NUC or RPi, and a patch for control plane or worker.
talosctl gen config cl01tl <cluster-endpoint> \
--with-secrets default-config/secrets.yaml \
--kubernetes-version <kubernetes-version> \
--talos-version <talos-version-contract> \
--config-patch patches/common.yaml \
--config-patch patches/common-rpi.yaml \
--config-patch patches/controlplane.yaml \
--output-types controlplane \
--output generated/controlplane-rpi.yaml

Once I have the config generated, I usually upgrade the node first. The following is an example of upgrading an RPi node.
talosctl upgrade --nodes <ip address of RPi> --image factory.talos.dev/metal-installer/495176274ce8f9e87ed052dbc285c67b2a0ed7c5a6212f5c4d086e1a9a1cf614:v1.12.0

However, when there is a breaking or significant change, I apply the config first, because I need the node to boot with that configuration after the upgrade. I did this when upgrading to 1.12.0, since I used the newer disk management config documents that replaced disk management inside the machine config document.
Apply new configuration
Here is an example of applying the generated config to an RPi node:

talosctl apply-config -f generated/controlplane-rpi.yaml -n <ip address of RPi>

Verification
Verify all is healthy on the Talos node dashboard:
talosctl -n 10.232.1.23 dashboard

Once this node becomes healthy, loop back to the start and verify everything is healthy again before proceeding to the next node.
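Alongside the dashboard, `talosctl health` runs Talos's built-in cluster health checks and blocks until they pass or time out, which makes it easier to script than eyeballing the dashboard. The IP below is a placeholder:

```shell
# Run the built-in health checks; exits non-zero if the cluster is unhealthy.
talosctl -n <ip address of control node> health
```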
Kubernetes
Talos has an automated process for handling these upgrades, as explained in the docs. This automated process is also why the config should be regenerated, as explained in this section, to ensure there is no drift in the static files.
talosctl --nodes <controlplane node> upgrade-k8s --to ${k8s_release}

I haven't encountered an issue with this upgrade process, though I do hold off until major components, such as Cilium, Ceph, and others, have published support for the latest major release. This means I usually do major upgrades annually.
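After the upgrade finishes, a quick way to confirm every node is on the new release is to check the kubelet versions that Kubernetes reports:

```shell
# The VERSION column should show the target release on every node.
kubectl get nodes -o wide
```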
Future Automation
I have thought about creating some kind of CLI script to run health checks instead of bouncing between my various dashboards; it's on my personal Kanban board. That way I could do a quick bundle of verify, gen config, upgrade, apply, and verify all at once. At scale, perhaps with hundreds or thousands of nodes, this would be a necessity, though other options would be ephemeral worker nodes or PXE booting.
My current process isn't a terrible amount of work, but if I find the time, automating it would be a useful exercise. As it is now, it's robust and reliable enough for my needs.
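As a starting point, the verify/upgrade/verify bundle could be sketched as a small shell script. This is a hypothetical sketch, not my actual tooling: the node IPs and installer image are placeholders, and the per-app checks (Ceph, CNPG, Garage) are reduced to the etcd check plus Talos's own health gate.

```shell
#!/usr/bin/env bash
# Sketch of a bundled verify -> upgrade -> verify loop. Assumes talosctl is
# installed and configured; NODES and IMAGE are hypothetical placeholders.
set -euo pipefail

NODES=("10.0.0.11" "10.0.0.12" "10.0.0.13")   # placeholder node IPs
IMAGE="factory.talos.dev/metal-installer/<schematic-id>:<talos-version>"

check_etcd() {
  # Fail the run early if etcd reports errors on this control plane node.
  talosctl -n "$1" etcd status
}

wait_healthy() {
  # talosctl health blocks until the built-in checks pass (or time out),
  # which makes it a convenient gate between nodes.
  talosctl -n "$1" health
}

upgrade_node() {
  talosctl upgrade --nodes "$1" --image "$IMAGE"
}

main() {
  for node in "${NODES[@]}"; do
    check_etcd "$node"
    upgrade_node "$node"
    wait_healthy "$node"
  done
}

# Uncomment to run against a real cluster:
# main
```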
