Installing the nodes with Ceph and K8S
Network
In this post, I'll walk through how I install and prepare the host nodes. Each host is set up with both Ceph and Kubernetes (K8s). The full configuration this blog post is based on can be found at https://codeberg.org/mpiscaer/myOpenstackCluster under Git tag 0.1.0.
As described in my previous blog post, My First Network Connectivity, the network is fully routed.
Most environments use LACP to create redundant network connections, with multiple hosts sharing the same broadcast domain. I deliberately chose not to use LACP on the interfaces and instead rely on OSPF and BGP for redundancy. In the end, this setup lets me evaluate whether it is actually usable.
The advantage of BGP and OSPF over LACP is better integration between the network and the hosts: traffic can flow directly and take the shortest path. A layer 3 only setup also eliminates the need for Spanning Tree, so all links stay active.
I have three nodes, and all three will end up with the following functions:
- Ceph MON
- Ceph OSD
- Ceph MGR
- Kubernetes control node
- OpenStack control node
- OpenStack compute node
IPv6 Addressing
The following IPv6 ranges are used in the current setup; in this blog post I will use the 3fff:0:0 prefix. In Git this is stored as the secret variable IPV6_PREFIX, which holds the prefix assigned by my internet provider.
| IPv6 address range | Description |
|---|---|
| 3fff::/20 | Documentation |
| fd40:10::/64 | Kubernetes Pod addresses |
| fd40:10:100::/112 | Kubernetes Service IPs |
| fc00:0:cef::/48 | Ceph network |
| fc00:0:8:: | Kubernetes VIP address |
IPv6 Issues Encountered ⚠️
During the installation of Ceph and Kubernetes, I encountered several IPv6-related issues:
- GitHub accessibility: GitHub was not reachable over IPv6 in the target environment, which prevented direct access to repositories. As a workaround, a proxy had to be configured to allow installations and playbook execution to proceed.
- quay.io reliability: Access to quay.io over IPv6 was unreliable, causing intermittent failures when pulling container images. To mitigate this, traffic to quay.io was routed through a proxy so that it could fail over to the IPv4 endpoint.
- Ceph Ansible playbook: An IPv6-specific issue was discovered in the Ceph Ansible collection. The problem was analyzed and fixed, and the solution was contributed upstream in the following pull request: https://github.com/vexxhost/ansible-collection-ceph/pull/80
- Kubernetes Ansible playbook: Several IPv6-related configuration issues were identified in the Kubernetes Ansible collection and documented here: https://github.com/vexxhost/ansible-collection-kubernetes/pull/221
  - IPv6 was not enabled in Cilium and kubeadm
  - IPv6 forwarding was not enabled on the hosts
Because the interfaces themselves do not carry any address, systemd-resolved had trouble with DNS resolution. I ended up disabling systemd-resolved and letting the hosts query the DNS resolver directly, which means no caching happens on the local node.
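In practice this comes down to something like the following; a minimal sketch, where the nameserver address is just a placeholder (my router) and not a statement about which resolver the hosts actually use:
systemctl disable --now systemd-resolved
rm /etc/resolv.conf
echo "nameserver 3fff:0:0:40::2" > /etc/resolv.conf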
Atmosphere
Atmosphere is a set of Ansible collections and Helm charts to install Openstack on top of Kubernetes. To quote the Atmosphere documentation¹:
Atmosphere is an advanced OpenStack distribution that is powered by open source technologies & built by VEXXHOST powered by Kubernetes, which allows you to easily deliver virtual machines, Kubernetes and bare-metal on your on-premise hardware.
The difference between Atmosphere and other deployment tools is that it is fully open source with batteries included. It ships with settings that are curated by years of experience from our team alongside other features such as:
- Built on top of Kubernetes for the life-cycle of the OpenStack cloud.
- Native integration with Keycloak for robust identity management.
- Native integration with Cluster API driver for Magnum
- Simplified deployment and management of the cloud using Kubernetes.
- Pre-integrated with many popular OpenStack projects with no additional configuration.
- Native integration with many storage platforms out of the box.
- Full day 2 operations for monitoring, logging and alerting out of the box using Prometheus, AlertManager, Grafana, Loki and more.
Ceph install
Let's start with installing Ceph. The first step is testing the SD card, followed by the actual Ceph installation.
Once OSPF-based network connectivity was in place, I moved on to setting up storage.
For storage, I use Ceph, a distributed storage system. Each Ceph node is configured with a single OSD, backed by an SD card. Before deploying Ceph, I ran some basic performance tests to understand the characteristics and limitations of the underlying storage²:
Throughput Test
root@node1:~# sudo fio --filename=/dev/mmcblk1 --direct=1 --rw=write --bs=256k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4 --time_based --group_reporting --name=throughput-test-job --eta-newline=1
throughput-test-job: (groupid=0, jobs=4): err= 0: pid=1674: Tue Dec 9 13:31:13 2025
write: IOPS=257, BW=64.4MiB/s (67.6MB/s)(7798MiB/121026msec)
...
The results show a sustained write throughput of roughly 64 MiB/s, which is acceptable for this setup, given the constraints of SD-card-based storage.
Latency Test ⏱️
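The exact fio command is not shown here; based on the referenced FIO examples² and the job name, block size, and single job in the output below, it was presumably along these lines (a sketch, not the literal command used):
fio --filename=/dev/mmcblk1 --direct=1 --rw=randrw --bs=4k --ioengine=libaio --iodepth=1 --runtime=120 --numjobs=1 --time_based --group_reporting --name=rwlatency-test-job --eta-newline=1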
rwlatency-test-job: (groupid=0, jobs=1): err= 0: pid=1715: Tue Dec 9 13:35:54 2025
read: IOPS=742, BW=2972KiB/s (3043kB/s)(348MiB/120003msec)
write: IOPS=740, BW=2963KiB/s (3035kB/s)(347MiB/120003msec)
...
Latency remains relatively stable for most operations, though there are occasional high-latency outliers during writes. This behavior is expected with consumer-grade SD cards and reinforces the need to keep expectations realistic.
Ceph Deployment Approach
To deploy Ceph, I build a custom Docker image that is used during the CD process to run the Ansible playbook. This image includes the Ansible Galaxy role from Vexxhost (the vexxhost.ceph collection).
Atmosphere uses Ceph Reef, but I ended up using Ceph Squid because of an ARM-related issue³ that was resolved in the Squid release. For the Ceph network, I decided to use the ULA range fc00:0:cef::/48⁴.
The Ceph configuration for this setup looks like this:
ceph_version: 19.2.3
cephadm_version: 19.2.3
ceph_mon_public_network: "fc00:0:cef::/48"
ceph_fsid: 91b16323-eee2-4b2e-8416-8b08bc27b463
ceph_osd_devices:
- /dev/mmcblk1
For each node, I define a Ceph ULA address as a variable in the Ansible host_vars directory:
ceph_ula_address: "fc00:0:cef:11::"
Ceph Challenge 🚧
The Vexxhost Ceph collection was not IPv6-ready, so I created a PR to get this issue fixed⁵.
In my setup, the nodes only use the loopback interface⁶ as their primary address, and the physical interfaces do not have IPv6 addresses assigned. As a result, Ceph is unable to automatically determine the correct network interface. Additionally, the vexxhost.ceph playbook does not handle this scenario out of the box.
When I ran the playbook, I got the error "The public CIDR network xxxx:xxx:xx:xxxx:x::/80 (from -c conf file) is not configured locally"⁶.
So I ended up using the following procedure:
- Bootstrap the first Ceph mon by running the bootstrap command with the --skip-mon-network option:
touch /tmp/ceph_f2gnlv58.conf
cephadm bootstrap --fsid 91b16323-eee2-4b2e-8416-8b08bc27b463 --mon-ip fc00:0:cef:11:: --cluster-network fc00:0000:cef::/48 --ssh-user cephadm --config /tmp/ceph_f2gnlv58.conf --skip-monitoring-stack --skip-mon-network
- Run the playbook manually⁷.
- The Ceph mon does not come up, so I ended up doing the following (make sure you use the right path and node names in the command):
rmdir /var/lib/ceph/91b16323-eee2-4b2e-8416-8b08bc27b463/mon.node2/config/ && touch /var/lib/ceph/91b16323-eee2-4b2e-8416-8b08bc27b463/mon.node2/config && chown 167:167 /var/lib/ceph/91b16323-eee2-4b2e-8416-8b08bc27b463/mon.node2/config
- I fill the config with the same content as the config on node1. But this does not create the Ceph mon database, so to get that done I remove the broken mons and add them manually.
- Remove the Ceph mons from node2 and node3:
ceph orch daemon rm --force mon.node2
- Add the mon on nodes 2 & 3:
ceph orch daemon add mon node2:"fc00:0:cef:12::"
- Run the playbook again.
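After that final run, it is worth checking that all three mons are back and the daemons are placed as expected. Standard Ceph commands such as these will show it (output of course depends on your cluster):
ceph -s
ceph orch ps
ceph mon dump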
Kubernetes
After installing Ceph, the next topic is Kubernetes. Here, too, Atmosphere is used for the installation.
The Atmosphere Galaxy role depends on several Ansible collections:
- ansible-collection-ceph
- ansible-collection-containers
- ansible-collection-kubernetes
All of these collections are maintained by Vexxhost. However, the Kubernetes collection was not fully IPv6-ready for my environment. To address this, I made several IPv6-related adjustments and am currently using a locally cloned and modified version of the collection, which is included directly in the Docker image.
I have submitted a pull request to incorporate these changes upstream, but at the time of writing this article, it is still in draft status.
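How the patched clone ends up in the image is not covered in this post; conceptually it just needs to sit in Ansible's default collection path so it is picked up instead of the upstream release. A rough sketch, assuming the playbook runs as root inside the image (the paths are assumptions, not the actual repository layout):
mkdir -p /root/.ansible/collections/ansible_collections/vexxhost
cp -r ansible-collection-kubernetes /root/.ansible/collections/ansible_collections/vexxhost/kubernetes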
This pull request includes the following changes (a configuration sketch follows the list):
- Cilium: IPv6 was not enabled and the configuration was missing clusterPoolIPv6PodCIDRList, routingMode, and enableIPv6Masquerade.
- Kernel: IPv6 forwarding was not enabled.
- kubeadm: Configuration was missing the podSubnet and serviceSubnet address ranges.
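To make this concrete, the settings in question look roughly as follows. This is a sketch only: it uses the address ranges from the table above and assumes native routing, which is not necessarily what the pull request ends up with. First the kubeadm ClusterConfiguration networking section, then the corresponding Cilium Helm values:
# kubeadm ClusterConfiguration (fragment)
networking:
  podSubnet: "fd40:10::/64"
  serviceSubnet: "fd40:10:100::/112"
# Cilium Helm values (fragment)
ipv6:
  enabled: true
routingMode: native
enableIPv6Masquerade: false
ipam:
  operator:
    clusterPoolIPv6PodCIDRList:
      - "fd40:10::/64"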
The Kubernetes cluster runs on an IPv6-only network. The kube-apiserver is exposed via a virtual IP (VIP) defined by the variable kubernetes_keepalived_vip.
kube-vip is responsible for managing this VIP and ensuring it can failover between control-plane nodes when required.
Failover of the VIP is implemented using BGP. BGP is responsible for advertising the active VIP to the network so that traffic can be routed to the node currently hosting the kube-apiserver.
Current Cluster State
Kubernetes nodes:
root@node1:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
node1 Ready control-plane 28h v1.28.13
node2 Ready control-plane 28h v1.28.13
node3 Ready control-plane 27h v1.28.13
Pods:
root@node1:~# kubectl get pods -A -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system cilium-98vqp 1/1 Running 2 (16h ago) 27h 3fff:0:0:40::13 node3 <none> <none>
kube-system cilium-lck9w 1/1 Running 1 (16h ago) 27h 3fff:0:0:40::12 node2 <none> <none>
kube-system cilium-n4bwt 1/1 Running 1 (16h ago) 27h 3fff:0:0:40::11 node1 <none> <none>
kube-system cilium-operator-677b74f4db-6tlp7 1/1 Running 2 (16h ago) 27h 3fff:0:0:40::12 node2 <none> <none>
kube-system cilium-operator-677b74f4db-cgqvx 1/1 Running 3 (16h ago) 27h 3fff:0:0:40::11 node1 <none> <none>
kube-system coredns-77cccfdc44-5bwtl 1/1 Running 1 (16h ago) 27h fdac:bb5e:e415::ac node1 <none> <none>
kube-system coredns-77cccfdc44-pbz5j 1/1 Running 1 (16h ago) 27h fdac:bb5e:e415::63 node1 <none> <none>
kube-system etcd-node1 1/1 Running 57 (16h ago) 27h 3fff:0:0:40::11 node1 <none> <none>
kube-system etcd-node2 1/1 Running 68 (16h ago) 27h 3fff:0:0:40::12 node2 <none> <none>
kube-system etcd-node3 1/1 Running 2 (16h ago) 27h 3fff:0:0:40::13 node3 <none> <none>
kube-system kube-apiserver-node1 1/1 Running 82 (16h ago) 27h 3fff:0:0:40::11 node1 <none> <none>
kube-system kube-apiserver-node2 1/1 Running 339 (16h ago) 27h 3fff:0:0:40::12 node2 <none> <none>
kube-system kube-apiserver-node3 1/1 Running 416 (16h ago) 27h 3fff:0:0:40::13 node3 <none> <none>
kube-system kube-controller-manager-node1 1/1 Running 34 (16h ago) 27h 3fff:0:0:40::11 node1 <none> <none>
kube-system kube-controller-manager-node2 1/1 Running 20 (16h ago) 27h 3fff:0:0:40::12 node2 <none> <none>
kube-system kube-controller-manager-node3 1/1 Running 17 (16h ago) 27h 3fff:0:0:40::13 node3 <none> <none>
kube-system kube-proxy-2trcg 1/1 Running 2 (16h ago) 27h 3fff:0:0:40::13 node3 <none> <none>
kube-system kube-proxy-fvp4k 1/1 Running 1 (16h ago) 27h 3fff:0:0:40::12 node2 <none> <none>
kube-system kube-proxy-vtwpz 1/1 Running 1 (16h ago) 27h 3fff:0:0:40::11 node1 <none> <none>
kube-system kube-scheduler-node1 1/1 Running 46 (16h ago) 27h 3fff:0:0:40::11 node1 <none> <none>
kube-system kube-scheduler-node2 1/1 Running 21 (16h ago) 27h 3fff:0:0:40::12 node2 <none> <none>
kube-system kube-scheduler-node3 1/1 Running 18 (16h ago) 27h 3fff:0:0:40::13 node3 <none> <none>
kube-system kube-vip-node1 1/1 Running 1 (16h ago) 16h 3fff:0:0:40::11 node1 <none> <none>
kube-system kube-vip-node2 1/1 Running 1 (16h ago) 16h 3fff:0:0:40::12 node2 <none> <none>
kube-system kube-vip-node3 1/1 Running 1 (16h ago) 16h 3fff:0:0:40::13 node3 <none> <none>
(All control-plane components, etcd, Cilium, and kube-vip are running and healthy on all three nodes.)
Problem Description
During installation, the cluster experienced issues related to kube-vip and BGP. Specifically, kube-vip does not advertise the control-plane VIP via BGP.
As a result:
- The VIP initially becomes active on the first control-plane node
- Because kube-vip does not advertise the VIP, the rest of the network does not know how to reach it
- Other nodes cannot access the kube-apiserver via the VIP
To work around this, I temporarily used bird to advertise the VIP, which I already run to announce node addresses. This made the control plane reachable again.
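The bird side of that workaround essentially pins the VIP as a static route so that the BGP session I already run towards the router can export it. A minimal bird2 sketch, assuming fc00:0:8:: from the address table above is the VIP; illustrative rather than my exact configuration:
protocol static static_vip {
  ipv6;
  route fc00:0:8::/128 via "lo";
}
The export filter of the existing BGP protocol then only needs to include this prefix, and filtering it back out later is a one-line change in that same filter.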
Once the cluster was fully installed, I filtered the VIP advertisement back out. This makes the kube-apiserver unreachable from the rest of the network, but because all nodes are control-plane nodes with a local etcd instance, each node simply uses the kube-apiserver on localhost, which in turn uses the etcd server on the same node.
BGP Status and Observations
- The BGP state is ESTABLISHED
- cp_enable is true, see: https://github.com/vexxhost/ansible-collection-kubernetes/blob/main/roles/kube_vip/templates/kube-vip.yaml.j2#L18-L19
- No BGP UPDATE messages are visible in Wireshark, so no network information is exchanged
- No routes are learned on the MikroTik router; no BGP routes are present:
[admin@homelab-top] > /routing/route/print where bgp
[admin@homelab-top] > /routing/route/print detail where bgp-net
Flags: X - disabled, F - filtered, U - unreachable, A - active; c - connect, s - static, r - rip, b - bgp, n - bgp-net, o - ospf, i - isis, d - dhcp, v - vpn, m - modem, a - ldp-address, l - ldp-mapping, g - slaac, y - bgp-mpls-vpn, e - evpn; H - hw-offloaded;
+ - ecmp, B - blackhole
n B afi=ipv6 contribution=candidate dst-address=3fff:0:0::/48 routing-table=main immediate-gw="" distance=255 belongs-to="bgp-output-3fff:0:0:40::3"
debug.fwp-ptr=0x20382060
n B afi=ipv6 contribution=candidate dst-address=3fff:0:0::/48 routing-table=main immediate-gw="" distance=255 belongs-to="bgp-output-3fff:0:0:40::12"
debug.fwp-ptr=0x20382060
n B afi=ipv6 contribution=candidate dst-address=3fff:0:0::/48 routing-table=main immediate-gw="" distance=255 belongs-to="bgp-output-3fff:0:0:40::13"
debug.fwp-ptr=0x20382060
n B afi=ipv6 contribution=candidate dst-address=3fff:0:0::/48 routing-table=main immediate-gw="" distance=255 belongs-to="bgp-output-3fff:0:0:40::11"
debug.fwp-ptr=0x20382060
No advertised prefixes:
[admin@homelab-top] > /routing/bgp/advertisements/print
[admin@homelab-top] > /routing/bgp/session/print
Flags: E - established
0 E name="node1-1" instance=bgp_instance
remote.address=3fff:0:0:40::11 .as=65001 .id=172.17.40.11 .capabilities=mp,rr,as4,fqdn .afi=ipv6 .hold-time=30s .messages=6246 .bytes=118674 .eor=""
local.address=3fff:0:0:40::2 .as=65000 .id=172.17.40.2 .cluster-id=172.17.40.2 .capabilities=mp,rr,enhe,gr,as4 .afi=ip .messages=6247 .bytes=118693 .eor=""
output.procid=20 .network=bgp-networks
input.procid=20 .last-notification=FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF0015030400 ebgp
multihop=yes hold-time=30s keepalive-time=10s uptime=17h21m8s150ms last-started=2025-12-18 21:57:27 last-stopped=2025-12-18 21:56:09 prefix-count=0
1 E name="homelab-under-1" instance=bgp_instance
remote.address=3fff:0:0:40::3 .as=65000 .id=172.17.40.3 .capabilities=mp,rr,enhe,gr,as4 .afi=ip .hold-time=30s .messages=6435 .bytes=122265 .eor=""
local.role=ibgp .address=3fff:0:0:40::2 .as=65000 .id=172.17.40.2 .cluster-id=172.17.40.2 .capabilities=mp,rr,enhe,gr,as4 .afi=ip .messages=6435 .bytes=122265 .eor=""
output.procid=21 .network=bgp-networks
input.procid=21 ibgp
multihop=yes hold-time=30s keepalive-time=10s uptime=17h52m30s190ms last-started=2025-12-18 21:25:11 prefix-count=0
2 E name="node3-1" instance=bgp_instance
remote.address=3fff:0:0:40::13 .as=65001 .id=172.17.40.13 .capabilities=mp,rr,as4,fqdn .afi=ipv6 .hold-time=30s .messages=6260 .bytes=118940 .eor=""
local.address=3fff:0:0:40::2 .as=65000 .id=172.17.40.2 .cluster-id=172.17.40.2 .capabilities=mp,rr,enhe,gr,as4 .afi=ip .messages=6261 .bytes=118959 .eor=""
output.procid=23 .network=bgp-networks
input.procid=23 .last-notification=FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF0015030603 ebgp
multihop=yes hold-time=30s keepalive-time=10s uptime=17h23m24s120ms last-started=2025-12-18 21:55:11 last-stopped=2025-12-18 21:53:43 prefix-count=0
3 E name="node2-1" instance=bgp_instance
remote.address=3fff:0:0:40::12 .as=65001 .id=172.17.40.12 .capabilities=mp,rr,as4,fqdn .afi=ipv6 .hold-time=30s .messages=6262 .bytes=118978 .eor=""
local.address=3fff:0:0:40::2 .as=65000 .id=172.17.40.2 .cluster-id=172.17.40.2 .capabilities=mp,rr,enhe,gr,as4 .afi=ip .messages=6263 .bytes=118997 .eor=""
output.procid=22 .network=bgp-networks
input.procid=22 .last-notification=FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF0015030603 ebgp
multihop=yes hold-time=30s keepalive-time=10s uptime=17h23m47s700ms last-started=2025-12-18 21:54:48 last-stopped=2025-12-18 21:53:26 prefix-count=0
All BGP sessions show ESTABLISHED, but with:
prefix-count=0
This confirms that:
- The BGP transport is working
- No prefixes are being advertised by kube-vip
Current Conclusion
At this point, the most likely issue is that kube-vip is not injecting the VIP into BGP, despite the sessions being established. The network and BGP configuration itself appear to be functioning correctly.
The next troubleshooting step is probably to reset the Kubernetes cluster and first build a BGP setup with bird to validate the switch configuration.
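Before resetting anything, the kube-vip logs may already show whether the BGP advertisement code path is taken at all; for example (pod names taken from the listing above):
kubectl -n kube-system logs kube-vip-node1 | grep -i bgp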
Next Steps 🛠️
The next step will likely be to reset the Kubernetes cluster and rebuild it while:
- Validating the BGP configuration independently using bird
- Confirming correct VIP advertisement and failover behavior
- Reintroducing kube-vip only after the BGP setup is proven to work
This should help determine whether the issue lies in kube-vip configuration, bootstrap timing, or an interaction between the control-plane setup and BGP advertisement.
Notes
Atmosphere documentation: https://vexxhost.github.io/atmosphere/#atmosphere
FIO tests: https://docs.oracle.com/en-us/iaas/Content/Block/References/samplefiocommandslinux.htm
ARM ceph issue: https://github.com/rook/rook/issues/14502
Ceph loopback issues: https://docs.clyso.com/docs/kb/cephadm/ipv6-deployment
When I run the playbook manually, I do the following:
1. Start the Docker image and make sure the required secrets are also loaded:
docker run --rm -it -v ${PWD}:/src -e IPV6_PREFIX=3fff:0:0 -e ANSIBLE_VAULT_PASSWORD_FILE=vault -e DOMAINNAME=mydomain.tld mpiscaer/atmosphere:vlatest-hl4
2. Make sure the SSH private key and the ansible-vault password file are in the nodes directory.
3. Go to the Ansible directory:
cd /src/nodes
cp id_ed25519 /tmp/
chmod 600 /tmp/id_ed25519
4. Run the Ansible playbook:
ansible-playbook -i inventory.yml site.yml
#Kubernetes #k8s #Ceph #IPv6 #Homelab #OpenStack #CloudNative #DevOps