Installing the nodes with Ceph and K8S

Network 🌐

In this post, I’ll walk through how I install and prepare the host nodes. Each host is set up with both Ceph and Kubernetes (K8s). The full configuration this blog post is based on can be found at https://codeberg.org/mpiscaer/myOpenstackCluster under Git tag 0.1.0.

As described in my previous blog post, 📡 My First Network Connectivity, the network is fully routed.

Most environments use LACP to create a redundant network connection, with multiple hosts on the same broadcast domain. I deliberately chose not to use LACP on the interfaces and instead rely on OSPF and BGP to provide redundancy. In the end, I use this setup to evaluate whether it is workable in practice.

The advantage of using BGP and OSPF over LACP is better integration between the network and the hosts: traffic can flow directly and take the shortest path. A layer-3-only setup also eliminates the need for Spanning Tree, so all links can stay active.

I have three nodes, and all three will end up with the following functions:


IPv6 Addressing 🌍

The following IPv6 ranges are used in the current setup. In this blog post I will use the 3fff:0:0 prefix; in Git this is stored in the secret variable IPV6_PREFIX, which holds the prefix assigned by my internet provider.

IPv6 address range    Description
3fff::/20             Documentation
fd40:10::/64          Kubernetes Pod addresses
fd40:10:100::/112     Kubernetes Service IPs
fc00:0:cef::/48       Ceph network
fc00:0:8::            Kubernetes VIP address

IPv6 Issues Encountered ⚠️

During the installation of Ceph and Kubernetes, I encountered several IPv6-related issues:

  1. GitHub accessibility
    GitHub was not reachable over IPv6 in the target environment, which prevented direct access to repositories. As a workaround, a proxy had to be configured to allow installations and playbook execution to proceed.

  2. quay.io reliability
    Access to quay.io over IPv6 was unreliable, causing intermittent failures when pulling container images. To mitigate this, traffic to quay.io was routed through a proxy so that it could fail over to the IPv4 endpoint.

  3. Ceph Ansible playbook
    An IPv6-specific issue was discovered in the Ceph Ansible collection. The problem was analyzed and fixed, and the solution was contributed upstream in the following pull request: https://github.com/vexxhost/ansible-collection-ceph/pull/80

  4. Kubernetes Ansible playbook
    Several IPv6-related configuration issues were identified in the Kubernetes Ansible collection and documented here: https://github.com/vexxhost/ansible-collection-kubernetes/pull/221

    • IPv6 was not enabled in Cilium and kubeadm
    • IPv6 forwarding was not enabled on the hosts
  5. Because the physical interface itself does not carry an address, systemd-resolved had trouble with DNS resolution. I ended up disabling systemd-resolved and letting the node use the DNS resolver directly, which means no DNS caching happens on the local node.
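A minimal sketch of that last workaround (the resolver address 2001:db8::53 is a placeholder; use the upstream resolver of your own environment):

# stop and disable the local stub resolver (no local DNS caching afterwards)
systemctl disable --now systemd-resolved
# replace the stub-resolver symlink with a static resolv.conf pointing at the upstream resolver
rm -f /etc/resolv.conf
printf 'nameserver 2001:db8::53\n' > /etc/resolv.conf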

Atmosphere

Atmosphere is a set of Ansible collections and Helm charts for installing OpenStack on top of Kubernetes. To quote the Atmosphere documentation¹:


Atmosphere is an advanced OpenStack distribution that is powered by open source technologies & built by VEXXHOST powered by Kubernetes, which allows you to easily deliver virtual machines, Kubernetes and bare-metal on your on-premise hardware.

The difference between Atmosphere and other deployment tools is that it is fully open source with batteries included. It ships with settings that are curated by years of experience from our team alongside other features such as:


Ceph install

Let's start with installing Ceph. The first step is testing the SD card, followed by the actual Ceph installation.

Once OSPF-based network connectivity was in place, I moved on to setting up storage.

For storage, I use Ceph, a distributed storage system. Each Ceph node is configured with a single OSD, backed by an SD card. Before deploying Ceph, I ran some basic performance tests to understand the characteristics and limitations of the underlying storage²:

Throughput Test 🚀

root@node1:~# sudo fio --filename=/dev/mmcblk1 --direct=1 --rw=write --bs=256k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4 --time_based --group_reporting --name=throughput-test-job --eta-newline=1
throughput-test-job: (groupid=0, jobs=4): err= 0: pid=1674: Tue Dec  9 13:31:13 2025
  write: IOPS=257, BW=64.4MiB/s (67.6MB/s)(7798MiB/121026msec)
...

The results show a sustained write throughput of roughly 64 MiB/s, which is acceptable for this setup, given the constraints of SD-card-based storage.

Latency Test ⏱️
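The exact fio invocation for this run is not shown here; following the same Oracle examples from footnote 2, it was presumably something along these lines (the device matches the throughput test above, the remaining parameters are my assumption):

# random read/write latency test (sketch based on the fio examples in footnote 2)
fio --filename=/dev/mmcblk1 --direct=1 --rw=randrw --bs=4k --ioengine=libaio --iodepth=1 --runtime=120 --numjobs=1 --time_based --group_reporting --name=rwlatency-test-job --eta-newline=1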


rwlatency-test-job: (groupid=0, jobs=1): err= 0: pid=1715: Tue Dec  9 13:35:54 2025
  read: IOPS=742, BW=2972KiB/s (3043kB/s)(348MiB/120003msec)
  write: IOPS=740, BW=2963KiB/s (3035kB/s)(347MiB/120003msec)
  ...

Latency remains relatively stable for most operations, though there are occasional high-latency outliers during writes. This behavior is expected with consumer-grade SD cards and reinforces the need to keep expectations realistic.

Ceph Deployment Approach

To deploy Ceph, I build a custom Docker image that is used during the CD process to run an Ansible playbook. This image includes the Ansible Galaxy role from:

Atmosphere ships Ceph Reef, but I ended up using Ceph Squid due to an ARM-related issue³ that was resolved in the Squid release. For the Ceph network, I decided to use the ULA range fc00:0:cef::/48⁴.
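As a rough illustration of that deployment image (a sketch, not the actual Dockerfile from my repository; the base image and packaging details are assumptions):

FROM python:3.12-slim

# Ansible plus the Ceph collection used by the playbook
RUN pip install --no-cache-dir ansible-core \
 && ansible-galaxy collection install vexxhost.ceph

# the playbooks and inventory are mounted into /src at runtime (see note 7)
WORKDIR /src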

The Ceph configuration for this setup looks like this:

ceph_version: 19.2.3
cephadm_version: 19.2.3
ceph_mon_public_network: "fc00:0:cef::/48"
ceph_fsid: 91b16323-eee2-4b2e-8416-8b08bc27b463
ceph_osd_devices:
  - /dev/mmcblk1

For each node, I define a Ceph ULA address as a variable in the Ansible host_vars directory:

ceph_ula_address: "fc00:0:cef:11::"
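The other nodes follow the same pattern; node2's address also shows up later in the mon-add step, and node3's value below is my assumption based on that pattern:

# host_vars for node2
ceph_ula_address: "fc00:0:cef:12::"

# host_vars for node3 (assumed to be the next suffix in the range)
ceph_ula_address: "fc00:0:cef:13::"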

Ceph challenge 🚧

The Vexxhost Ceph collection was not IPv6-ready, so I created a PR to get this issue fixed⁵.

In my setup, the nodes only carry their primary address on the loopback interface⁶, and the physical interfaces do not have IPv6 addresses assigned. As a result, Ceph is unable to automatically determine the correct network interface, and the vexxhost.ceph playbook does not handle this scenario out of the box.
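To illustrate why the detection fails: the Ceph ULA address is bound to the loopback interface only, so nothing in the /48 ever appears on a physical NIC (a sketch; the /128 prefix length is my assumption):

# node1: the Ceph ULA lives on lo, not on a physical interface
ip -6 addr add fc00:0:cef:11::/128 dev lo
# none of the physical interfaces carry an address inside fc00:0:cef::/48,
# so cephadm cannot match the public network to an interface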

When I ran the playbook I got the error: The public CIDR network xxxx:xxx:xx:xxxx:x::/80 (from -c conf file) is not configured locally⁶

So I ended up using the following procedure:

  1. Bootstrap the first Ceph mon and run the command with the option --skip-mon-network:
touch /tmp/ceph_f2gnlv58.conf

cephadm bootstrap --fsid 91b16323-eee2-4b2e-8416-8b08bc27b463 --mon-ip fc00:0:cef:11:: --cluster-network fc00:0000:cef::/48 --ssh-user cephadm --config /tmp/ceph_f2gnlv58.conf --skip-monitoring-stack --skip-mon-network
  2. Run the playbook manually⁷.
  3. The Ceph mon does not come up, so I ended up doing the following (make sure you use the right path and node names in the command):
rmdir /var/lib/ceph/91b16323-eee2-4b2e-8416-8b08bc27b463/mon.node2/config/ && touch /var/lib/ceph/91b16323-eee2-4b2e-8416-8b08bc27b463/mon.node2/config && chown 167:167 /var/lib/ceph/91b16323-eee2-4b2e-8416-8b08bc27b463/mon.node2/config
  4. Fill that config file with the same content as the config on node1. This does not create the Ceph mon database, so to get that done I remove the broken mons and add them manually.
  5. Remove the Ceph mons from node2 & node3 (node2 shown here; the node3 equivalent is sketched after this list):
ceph orch daemon rm --force mon.node2
  6. Add the mon on nodes 2 & 3:
ceph orch daemon add mon node2:"fc00:0:cef:12::"
  7. Run the playbook again.
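For node3 the same removal and re-add apply; assuming its Ceph ULA address follows the same pattern as node1 and node2 (fc00:0:cef:13:: is my assumption, check the host_vars), that looks like:

# same steps for node3; the address is assumed to follow the node1/node2 pattern
ceph orch daemon rm --force mon.node3
ceph orch daemon add mon node3:"fc00:0:cef:13::"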

Kubernetes

After installing Ceph, the next topic is Kubernetes. Here too, Atmosphere is used to install Kubernetes.

The Atmosphere Galaxy role depends on several Ansible collections:

All of these collections are maintained by Vexxhost. However, the Kubernetes collection was not fully IPv6-ready for my environment. To address this, I made several IPv6-related adjustments and am currently using a locally cloned and modified version of the collection, which is included directly in the Docker image.

I have submitted a pull request to incorporate these changes upstream, but at the time of writing this article, it is still in draft status.

This pull request includes the following changes:

The Kubernetes cluster runs on an IPv6-only network. The kube-apiserver is exposed via a virtual IP (VIP) defined by the variable kubernetes_keepalived_vip.
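In my inventory that variable points at the VIP from the address table above, roughly like this (exactly where it lives in the inventory is not shown here):

# control-plane VIP, to be announced to the network via BGP
kubernetes_keepalived_vip: "fc00:0:8::"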

kube-vip is responsible for managing this VIP and ensuring it can failover between control-plane nodes when required.

Failover of the VIP is implemented using BGP. BGP is responsible for advertising the active VIP to the network so that traffic can be routed to the node currently hosting the kube-apiserver.

Current Cluster State

Kubernetes nodes:

root@node1:~# kubectl get nodes
NAME    STATUS   ROLES           AGE   VERSION
node1   Ready    control-plane   28h   v1.28.13
node2   Ready    control-plane   28h   v1.28.13
node3   Ready    control-plane   27h   v1.28.13

Pods:

root@node1:~# kubectl get pods -A -o wide
NAMESPACE     NAME                               READY   STATUS    RESTARTS        AGE   IP                     NODE    NOMINATED NODE   READINESS GATES
kube-system   cilium-98vqp                       1/1     Running   2 (16h ago)     27h   3fff:0:0:40::13     node3   <none>           <none>
kube-system   cilium-lck9w                       1/1     Running   1 (16h ago)     27h   3fff:0:0:40::12     node2   <none>           <none>
kube-system   cilium-n4bwt                       1/1     Running   1 (16h ago)     27h   3fff:0:0:40::11     node1   <none>           <none>
kube-system   cilium-operator-677b74f4db-6tlp7   1/1     Running   2 (16h ago)     27h   3fff:0:0:40::12     node2   <none>           <none>
kube-system   cilium-operator-677b74f4db-cgqvx   1/1     Running   3 (16h ago)     27h   3fff:0:0:40::11     node1   <none>           <none>
kube-system   coredns-77cccfdc44-5bwtl           1/1     Running   1 (16h ago)     27h   fdac:bb5e:e415::ac  node1   <none>           <none>
kube-system   coredns-77cccfdc44-pbz5j           1/1     Running   1 (16h ago)     27h   fdac:bb5e:e415::63  node1   <none>           <none>
kube-system   etcd-node1                         1/1     Running   57 (16h ago)    27h   3fff:0:0:40::11     node1   <none>           <none>
kube-system   etcd-node2                         1/1     Running   68 (16h ago)    27h   3fff:0:0:40::12     node2   <none>           <none>
kube-system   etcd-node3                         1/1     Running   2 (16h ago)     27h   3fff:0:0:40::13     node3   <none>           <none>
kube-system   kube-apiserver-node1               1/1     Running   82 (16h ago)    27h   3fff:0:0:40::11     node1   <none>           <none>
kube-system   kube-apiserver-node2               1/1     Running   339 (16h ago)   27h   3fff:0:0:40::12     node2   <none>           <none>
kube-system   kube-apiserver-node3               1/1     Running   416 (16h ago)   27h   3fff:0:0:40::13     node3   <none>           <none>
kube-system   kube-controller-manager-node1      1/1     Running   34 (16h ago)    27h   3fff:0:0:40::11     node1   <none>           <none>
kube-system   kube-controller-manager-node2      1/1     Running   20 (16h ago)    27h   3fff:0:0:40::12     node2   <none>           <none>
kube-system   kube-controller-manager-node3      1/1     Running   17 (16h ago)    27h   3fff:0:0:40::13     node3   <none>           <none>
kube-system   kube-proxy-2trcg                   1/1     Running   2 (16h ago)     27h   3fff:0:0:40::13     node3   <none>           <none>
kube-system   kube-proxy-fvp4k                   1/1     Running   1 (16h ago)     27h   3fff:0:0:40::12     node2   <none>           <none>
kube-system   kube-proxy-vtwpz                   1/1     Running   1 (16h ago)     27h   3fff:0:0:40::11     node1   <none>           <none>
kube-system   kube-scheduler-node1               1/1     Running   46 (16h ago)    27h   3fff:0:0:40::11     node1   <none>           <none>
kube-system   kube-scheduler-node2               1/1     Running   21 (16h ago)    27h   3fff:0:0:40::12     node2   <none>           <none>
kube-system   kube-scheduler-node3               1/1     Running   18 (16h ago)    27h   3fff:0:0:40::13     node3   <none>           <none>
kube-system   kube-vip-node1                     1/1     Running   1 (16h ago)     16h   3fff:0:0:40::11     node1   <none>           <none>
kube-system   kube-vip-node2                     1/1     Running   1 (16h ago)     16h   3fff:0:0:40::12     node2   <none>           <none>
kube-system   kube-vip-node3                     1/1     Running   1 (16h ago)     16h   3fff:0:0:40::13     node3   <none>           <none>

(All control-plane components, etcd, Cilium, and kube-vip are running and healthy on all three nodes.)

Problem Description

During installation, the cluster experienced issues related to kube-vip and BGP. Specifically, kube-vip does not advertise the control-plane VIP via BGP.

As a result:

To work around this, I temporarily used bird to advertise the VIP, which I already run to announce node addresses. This made the control plane reachable again.
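A minimal sketch of that temporary workaround (assuming bird2 and an existing BGP export filter that accepts static routes; fc00:0:8:: is the VIP from the address table above):

# announce the control-plane VIP as a /128 static route so BGP exports it
protocol static static_k8s_vip {
  ipv6;
  route fc00:0:8::/128 via "lo";
}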

Once the cluster was fully installed, I filtered the VIP advertisement back out. This made the kube-apiserver unreachable from the rest of the network. However, because all nodes are control-plane nodes with a local etcd instance, each node can keep using the kube-apiserver on localhost, which in turn uses the etcd server on the same node, so Kubernetes continues to function.

BGP Status and Observations

[admin@homelab-top] > /routing/route/print where bgp

[admin@homelab-top] > /routing/route/print detail where bgp-net
Flags: X - disabled, F - filtered, U - unreachable, A - active; c - connect, s - static, r - rip, b - bgp, n - bgp-net, o - ospf, i - isis, d - dhcp, v - vpn, m - modem, a - ldp-address, l - ldp-mapping, g - slaac, y - bgp-mpls-vpn, e - evpn; H - hw-offloaded;
+ - ecmp, B - blackhole
  n B afi=ipv6 contribution=candidate dst-address=3fff:0:0::/48 routing-table=main immediate-gw="" distance=255 belongs-to="bgp-output-3fff:0:0:40::3"
       debug.fwp-ptr=0x20382060

  n B afi=ipv6 contribution=candidate dst-address=3fff:0:0::/48 routing-table=main immediate-gw="" distance=255 belongs-to="bgp-output-3fff:0:0:40::12"
       debug.fwp-ptr=0x20382060

  n B afi=ipv6 contribution=candidate dst-address=3fff:0:0::/48 routing-table=main immediate-gw="" distance=255 belongs-to="bgp-output-3fff:0:0:40::13"
       debug.fwp-ptr=0x20382060

  n B afi=ipv6 contribution=candidate dst-address=3fff:0:0::/48 routing-table=main immediate-gw="" distance=255 belongs-to="bgp-output-3fff:0:0:40::11"
       debug.fwp-ptr=0x20382060

No advertised prefixes:

[admin@homelab-top] > /routing/bgp/advertisements/print

[admin@homelab-top] > /routing/bgp/session/print
Flags: E - established
 0 E name="node1-1" instance=bgp_instance
     remote.address=3fff:0:0:40::11 .as=65001 .id=172.17.40.11 .capabilities=mp,rr,as4,fqdn .afi=ipv6 .hold-time=30s .messages=6246 .bytes=118674 .eor=""
     local.address=3fff:0:0:40::2 .as=65000 .id=172.17.40.2 .cluster-id=172.17.40.2 .capabilities=mp,rr,enhe,gr,as4 .afi=ip .messages=6247 .bytes=118693 .eor=""
     output.procid=20 .network=bgp-networks
     input.procid=20 .last-notification=FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF0015030400 ebgp
     multihop=yes hold-time=30s keepalive-time=10s uptime=17h21m8s150ms last-started=2025-12-18 21:57:27 last-stopped=2025-12-18 21:56:09 prefix-count=0

 1 E name="homelab-under-1" instance=bgp_instance
     remote.address=3fff:0:0:40::3 .as=65000 .id=172.17.40.3 .capabilities=mp,rr,enhe,gr,as4 .afi=ip .hold-time=30s .messages=6435 .bytes=122265 .eor=""
     local.role=ibgp .address=3fff:0:0:40::2 .as=65000 .id=172.17.40.2 .cluster-id=172.17.40.2 .capabilities=mp,rr,enhe,gr,as4 .afi=ip .messages=6435 .bytes=122265 .eor=""
     output.procid=21 .network=bgp-networks
     input.procid=21 ibgp
     multihop=yes hold-time=30s keepalive-time=10s uptime=17h52m30s190ms last-started=2025-12-18 21:25:11 prefix-count=0

 2 E name="node3-1" instance=bgp_instance
     remote.address=3fff:0:0:40::13 .as=65001 .id=172.17.40.13 .capabilities=mp,rr,as4,fqdn .afi=ipv6 .hold-time=30s .messages=6260 .bytes=118940 .eor=""
     local.address=3fff:0:0:40::2 .as=65000 .id=172.17.40.2 .cluster-id=172.17.40.2 .capabilities=mp,rr,enhe,gr,as4 .afi=ip .messages=6261 .bytes=118959 .eor=""
     output.procid=23 .network=bgp-networks
     input.procid=23 .last-notification=FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF0015030603 ebgp
     multihop=yes hold-time=30s keepalive-time=10s uptime=17h23m24s120ms last-started=2025-12-18 21:55:11 last-stopped=2025-12-18 21:53:43 prefix-count=0

 3 E name="node2-1" instance=bgp_instance
     remote.address=3fff:0:0:40::12 .as=65001 .id=172.17.40.12 .capabilities=mp,rr,as4,fqdn .afi=ipv6 .hold-time=30s .messages=6262 .bytes=118978 .eor=""
     local.address=3fff:0:0:40::2 .as=65000 .id=172.17.40.2 .cluster-id=172.17.40.2 .capabilities=mp,rr,enhe,gr,as4 .afi=ip .messages=6263 .bytes=118997 .eor=""
     output.procid=22 .network=bgp-networks
     input.procid=22 .last-notification=FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF0015030603 ebgp
     multihop=yes hold-time=30s keepalive-time=10s uptime=17h23m47s700ms last-started=2025-12-18 21:54:48 last-stopped=2025-12-18 21:53:26 prefix-count=0

All BGP sessions show ESTABLISHED, but with:

prefix-count=0

This confirms that:

Current Conclusion

At this point, the most likely issue is that kube-vip is not injecting the VIP into BGP, despite the sessions being established. The network and BGP configuration itself appear to be functioning correctly.

The next troubleshooting step is probably to reset the Kubernetes cluster and first build the BGP setup via bird to validate the switch configuration.

Next Steps 🛠️

The next step will likely be to reset the Kubernetes cluster and rebuild it while:

  1. Validating the BGP configuration independently using bird
  2. Confirming correct VIP advertisement and failover behavior
  3. Reintroducing kube-vip only after the BGP setup is proven to work

This should help determine whether the issue lies in kube-vip configuration, bootstrap timing, or an interaction between the control-plane setup and BGP advertisement.


Notes

  1. Atmosphere documentation: https://vexxhost.github.io/atmosphere/#atmosphere

  2. FIO tests – https://docs.oracle.com/en-us/iaas/Content/Block/References/samplefiocommandslinux.htm

  3. ARM ceph issue: https://github.com/rook/rook/issues/14502

  4. Reserved IP addresses: https://en.wikipedia.org/wiki/Reserved_IP_addresses

  5. https://github.com/vexxhost/ansible-collection-ceph/pull/80

  6. Ceph loopback issues: https://docs.clyso.com/docs/kb/cephadm/ipv6-deployment

  7. When I run the playbook manually, I do the following:

    1. Start the Docker image and make sure the required secrets are loaded:

docker run --rm -it -v ${PWD}:/src -e IPV6_PREFIX=3fff:0:0 -e ANSIBLE_VAULT_PASSWORD_FILE=vault -e DOMAINNAME=mydomain.tld mpiscaer/atmosphere:vlatest-hl4

    2. Make sure the SSH private key and the ansible-vault password file are in the nodes directory.

    3. Go to the Ansible directory and prepare the SSH key:

cd /src/nodes
cp id_ed25519 /tmp/
chmod 600 /tmp/id_ed25519

    4. Run the Ansible playbook:

ansible-playbook -i inventory.yml site.yml

  8. Bird config: https://sapiolab.nl/install-my-first-network-connectivity

#Kubernetes #k8s #Ceph #IPv6 #Homelab #OpenStack #CloudNative #DevOps