My big mistake with Tanzu Community Edition

Hi,

In this post I want to point out the mistakes I made when deploying Tanzu Community Edition, what I learned from them, how to avoid making them yourself, and how I ended up fixing things.

So, for those who don’t know, Tanzu Community Edition is a bit like Tanzu Kubernetes Grid, vSphere with Tanzu… you get it, it has to do with Kubernetes and they all have Tanzu in the name 😉

I use Tanzu Community Edition to easily provision Kubernetes clusters on my vSphere infrastructure. I use this for my Taskcluster ‘prod’ and ‘staging’ deployment, for NSX Application Platform and maybe more in the future.

Now that you know what it is, what did I do wrong? Well, the nodes (= the VMs that get deployed) require their IP addresses to always stay the same. When they were provisioned, they received an IP from a DHCP server. However, if for some reason the lease renews and a node gets a different IP address, things will break (especially after a reboot!). In my case, “break” meant that some worker nodes in my clusters were no longer working, and my load balancers (through NSX Advanced Load Balancer) were intermittently failing to route, causing all kinds of funky things to happen.
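One way to spot this from the cluster side (a minimal sketch, assuming you still have kubectl access to the affected cluster):

# Compare the INTERNAL-IP column with the addresses the nodes were originally provisioned with;
# nodes whose DHCP lease handed out a new IP typically show up as NotReady.
kubectl get nodes -o wide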

Eventually I found out that I didn’t have any DHCP reservations set, as I didn’t know that the node IPs should never change. So that’s what broke it; how did I fix it?

This is what I did:

  • SSH into each node to find out its original IP address (see the first command sketch after this list)
    • SSH into the node ( ssh capv@node-ip )
    • Change to the root user ( sudo su - )
    • View the top of the right log file:
      • Management node:
        • /var/log/containers/kube-apiserver (press tab afterwards; there will most likely be two files, pick either one and it will show the expected node IP address at the top of the file)
      • Workload node:
        • /var/log/containers/kube-proxy (press tab afterwards; there will most likely be two files, pick either one and it will show the expected node IP address at the top of the file)
  • Create an IP reservation using that IP address and the MAC address of the corresponding virtual machine
  • Once those are in place, reboot the nodes
  • Attempt to access the cluster again
  • For worker nodes: if you are getting errors from kubectl saying that a connection timed out to an IP that is in the worker node subnet but is not actively used by any node (see the second sketch after this list):
    • Use kubectl describe nodes | grep -B3 <timed-out-IP> to find the node object that still references that IP
    • Use kubectl delete node <node-name> on that node
    • This will delete the worker node, and a new one will be redeployed that functions properly
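To make the first step a bit more concrete, this is roughly what the log inspection looks like once you are on a node (a sketch; node-ip is a placeholder, and the exact file names under /var/log/containers vary per cluster, hence the wildcards):

# SSH into the node as the capv user (node-ip is a placeholder for the node's current address)
ssh capv@node-ip
# switch to the root user
sudo su -
# management node: the expected node IP appears near the top of the kube-apiserver log(s)
head /var/log/containers/kube-apiserver*.log
# workload node: same idea, but with the kube-proxy log(s)
head /var/log/containers/kube-proxy*.log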
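And for the worker node clean-up, roughly this (a sketch; the timed-out IP comes from the kubectl error message and the node name from the describe output, both are placeholders here):

# find the node object that still references the stale IP from the timeout error
kubectl describe nodes | grep -B3 <timed-out-IP>
# delete that node object; the cluster will redeploy a new worker node in its place
kubectl delete node <node-name>
# watch the replacement node join
kubectl get nodes -o wide -w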

It was a learning experience for me, and I hope it was an interesting read. Until next time!

And please don’t make the same mistake I did, as it cost me a few hours.