The How and Why Behind Tiller-less Helm
If you use Kubernetes, you’ve probably heard of Helm, the Kubernetes package manager, by now. Helm makes it quick to install packages on a Kubernetes cluster, and it adds extras such as release tracking for easy rollbacks. Unfortunately, Helm has an Achilles heel if you want to use it in a shared cluster: its server-side component, “Tiller”. Tiller is a service that accepts manifests from the Helm client and applies them to the cluster on the user's behalf.
Having a server-side component is useful because it allows tasks to be executed asynchronously. For example, you can deploy a chart that uses Helm hooks from your laptop, and the hooks can still execute even if your laptop is closed or disconnected.
In its default configuration, however, Tiller runs as the cluster administrator, and anyone with access to the cluster can ask Tiller to do things on their behalf. So even though an engineer may only have access to staging, they can run helm delete production, and Tiller will willingly execute it for them. For more details, check out this article. So, how do we fix this?
Step one, set up a local Tiller instance
As that article explains, Helm can run in “Tiller-less” mode: instead of a shared, cluster-wide Tiller service that runs as cluster admin, you run a local Tiller that uses your own credentials!
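As a rough illustration of the idea (not necessarily the exact setup the article or our pipeline uses), the sketch below starts a throwaway Tiller on localhost with the Helm 2 tiller binary, points the helm client at it through the HELM_HOST environment variable, and stops Tiller when the command finishes. It assumes the tiller and Helm 2 helm binaries are on your PATH and that your kubeconfig already points at the cluster; the helm_local function and the LOCAL_TILLER_PORT variable are hypothetical names.

```bash
#!/usr/bin/env bash
# Sketch (Helm 2): run one helm command against a private, local Tiller that
# talks to the cluster with YOUR kubeconfig credentials instead of a shared
# cluster-admin Tiller. Assumes the tiller and helm v2 binaries are on PATH.
# helm_local and LOCAL_TILLER_PORT are hypothetical names.
helm_local() {
  local port="${LOCAL_TILLER_PORT:-44134}"  # 44134 is Tiller's default port
  local tiller_pid status
  # Keep release records in Secrets rather than Tiller's default ConfigMaps.
  tiller --storage=secret --listen="127.0.0.1:${port}" >/dev/null 2>&1 &
  tiller_pid=$!
  sleep 1  # crude wait for Tiller to start listening
  # HELM_HOST tells the helm client which Tiller to talk to; the subshell
  # keeps the variable from leaking into the caller's environment.
  ( export HELM_HOST="127.0.0.1:${port}"; helm "$@" )
  status=$?
  # Stop the local Tiller whether or not the helm command succeeded.
  kill "$tiller_pid" 2>/dev/null || true
  wait "$tiller_pid" 2>/dev/null || true
  return "$status"
}
```

Usage would look like helm_local list --all. Because this Tiller acts with your kubeconfig credentials, Kubernetes RBAC evaluates your identity rather than a cluster-admin Tiller's. You may need to run helm init --client-only once so the helm client has its local home directory, and the --storage=secret choice is why existing ConfigMap-based releases have to be converted (step four).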
Here at TenX, we use Concourse CI for our CI system and, accordingly, the concourse-helm-resource to perform deployments. Since the resource's default image doesn't support this mode, we made some changes that we hope will be accepted upstream. With the first hurdle out of the way, the next problem was that CI didn't have an account of its own to connect to GCP, so on to the next step.
Step two, set up RBAC for CI
Since we deploy our services on Google Kubernetes Engine (GKE), step two was a tad more complicated. As a recap, GKE (through Google Cloud IAM) and Kubernetes each have their own concepts of service accounts, role bindings, and users. So the first thing to do was to create a GKE service account for Concourse with developer permissions.
As an aside, I recently discovered a tool called rbac-lookup that made debugging and verifying the setup much easier. On GKE, your effective permissions are the union of your IAM roles (for example, those granted to a GKE service account) and your Kubernetes RBAC rules. So, for example, if your service account has Cluster Admin privileges, you can grant yourself whatever privileges you'd like on the cluster. rbac-lookup shows the combination of GKE IAM and Kubernetes RBAC rules, so you can confirm the rules are correct.
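Creating the account can be sketched with standard gcloud commands: create the service account, grant it a Kubernetes Engine IAM role on the project, and download a JSON key for CI to use. The function name and key file name are hypothetical, and the role is a parameter; roles/container.developer is the predefined Kubernetes Engine Developer role that matches "developer permissions".

```bash
#!/usr/bin/env bash
# Sketch: create a GCP service account for CI, grant it a GKE IAM role at the
# project level, and download a JSON key for it. Prints the account's email.
# create_ci_service_account and the key file name are hypothetical.
create_ci_service_account() {
  local project="$1" name="$2" role="$3"
  # Google service account emails follow NAME@PROJECT.iam.gserviceaccount.com.
  local email="${name}@${project}.iam.gserviceaccount.com"
  gcloud iam service-accounts create "$name" --project="$project" \
    --display-name="Concourse CI" || return
  # Project-level IAM binding: this is the "GKE side" that rbac-lookup --gke shows.
  gcloud projects add-iam-policy-binding "$project" \
    --member="serviceAccount:${email}" --role="$role" || return
  # The key is a long-lived secret; store it somewhere CI can read it safely.
  gcloud iam service-accounts keys create "${name}-key.json" \
    --iam-account="$email" || return
  echo "$email"
}
```

For example, create_ci_service_account tenx concourseci roles/container.developer would print concourseci@tenx.iam.gserviceaccount.com, the subject that shows up in the rbac-lookup output.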
After creating the IAM account, your permissions should look like this:
```
$ rbac-lookup firstname.lastname@example.org --gke
SUBJECT                                    SCOPE          ROLE
concourseci@tenx.iam.gserviceaccount.com   project-wide   IAM/gke-cluster-admin
```
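Then, we bound the GKE service account to a Kubernetes ClusterRole. GKE presents a Google service account to Kubernetes as a user whose name is the account's email address, so the binding targets a --user rather than a Kubernetes ServiceAccount:
```bash
kubectl create clusterrolebinding <role-binding-name> --clusterrole cluster-admin --user <gke-service-account-email>
```
After creating the binding, the permissions should look like this: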
```
$ rbac-lookup email@example.com --gke
SUBJECT                                    SCOPE          ROLE
concourseci@tenx.iam.gserviceaccount.com   cluster-wide   ClusterRole/cluster-admin
firstname.lastname@example.org             project-wide   IAM/gke-cluster-admin
```
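If you prefer to keep RBAC in version control, the same binding can be expressed declaratively. This sketch renders a ClusterRoleBinding manifest equivalent to the kubectl create clusterrolebinding command above, ready to pipe into kubectl apply -f -. The subject uses kind: User because GKE identifies a Google service account to Kubernetes by its email address; render_binding and the binding name in the usage line are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: print a ClusterRoleBinding manifest that grants a cluster role to a
# GKE service account (seen by Kubernetes as a User named by its email).
# render_binding is a hypothetical helper name.
render_binding() {
  local binding="$1" cluster_role="$2" user="$3"
  cat <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: ${binding}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: ${cluster_role}
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: ${user}
EOF
}
```

Usage: render_binding concourse-ci cluster-admin concourseci@tenx.iam.gserviceaccount.com | kubectl apply -f -. Unlike kubectl create, kubectl apply can be re-run safely, which suits a bootstrap script.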
Step three, take it for a spin
With that completed, you can use the service account like a developer would:
```
$ rbac-lookup email@example.com --gke
SUBJECT                                               SCOPE          ROLE
concourseci@tenx-production.iam.gserviceaccount.com   cluster-wide   ClusterRole/cluster-admin
firstname.lastname@example.org                        project-wide   IAM/gke-cluster-admin
```
In our case we tested it, and it worked as expected. Hurray.
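To repeat that test from a workstation without clobbering your own gcloud login, you can point gcloud and kubectl at a temporary directory, activate the service account's key there, and run a command as that account. This is a sketch assuming gcloud is installed and the project and zone are supplied through gcloud's CLOUDSDK_CORE_PROJECT and CLOUDSDK_COMPUTE_ZONE environment variables (the temporary configuration starts out empty); as_ci_account is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: run a command as the CI service account using a throwaway gcloud
# config directory and kubeconfig, then clean both up.
# Usage: as_ci_account KEY_FILE CLUSTER COMMAND [ARGS...]
as_ci_account() {
  local key_file="$1" cluster="$2" tmp status
  shift 2
  tmp="$(mktemp -d)"
  (
    export CLOUDSDK_CONFIG="$tmp"            # isolated gcloud configuration
    export KUBECONFIG="$tmp/kubeconfig"      # get-credentials writes here
    gcloud auth activate-service-account --key-file="$key_file" &&
      gcloud container clusters get-credentials "$cluster" &&
      "$@"
  )
  status=$?
  rm -rf "$tmp"   # drop the service account's cached credentials
  return "$status"
}
```

For example, as_ci_account ci-key.json production kubectl auth can-i delete deployments --namespace kube-system answers the RBAC question directly as the CI account.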
Step four, perform the migration
If you installed Tiller the default way described in the article, you’ll need to perform an extra step before switching to “Tiller-less Helm”. By default, Tiller stores release information in ConfigMaps, whereas the recommended configuration for production is to use Secrets. Fortunately, I found a tool that does the conversion for you, aptly named tiller-releases-converter, which performs the migration in two commands.
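Whatever performs the conversion, it is worth checking the result before removing anything. This sketch (a hypothetical helper, not part of tiller-releases-converter) lists Helm 2 release records that exist as ConfigMaps but have no Secret of the same name. It relies on Helm 2 labeling its release records OWNER=TILLER and naming them <release>.v<revision>, and it assumes Tiller's namespace is kube-system unless TILLER_NAMESPACE says otherwise.

```bash
#!/usr/bin/env bash
# Sketch: print Helm 2 release records stored as ConfigMaps that have no
# matching Secret yet. Empty output means every record was converted.
missing_release_secrets() {
  local ns="${TILLER_NAMESPACE:-kube-system}" cms secrets
  cms="$(mktemp)"; secrets="$(mktemp)"
  # `-o name` prints kind/name; strip the kind prefix so the lists compare.
  kubectl get configmaps -n "$ns" -l OWNER=TILLER -o name \
    | sed 's|^[^/]*/||' | sort > "$cms"
  kubectl get secrets -n "$ns" -l OWNER=TILLER -o name \
    | sed 's|^[^/]*/||' | sort > "$secrets"
  comm -23 "$cms" "$secrets"   # names present only in the ConfigMap list
  rm -f "$cms" "$secrets"
}
```

Once this prints nothing, the old ConfigMap records are redundant, and a local Tiller started with --storage=secret will see the same release history.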
Finally, switch your deploy pipeline over to Tiller-less Helm and remove Tiller from the cluster.
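Removing Tiller amounts to deleting the Deployment and Service that helm init created. The sketch below assumes a default helm init install, whose resources (both named tiller-deploy) carry the labels app=helm and name=tiller in kube-system; TILLER_NAMESPACE overrides the namespace, and remove_cluster_tiller is a hypothetical name. If you created a dedicated service account or ClusterRoleBinding for Tiller, delete those separately. While the in-cluster Tiller is still reachable, helm reset is the built-in alternative.

```bash
#!/usr/bin/env bash
# Sketch: delete the in-cluster Tiller created by a default `helm init`.
remove_cluster_tiller() {
  local ns="${TILLER_NAMESPACE:-kube-system}"
  # List first so the log shows exactly what is about to be removed.
  kubectl -n "$ns" get deployment,service -l app=helm,name=tiller &&
    kubectl -n "$ns" delete deployment,service -l app=helm,name=tiller
}
```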
Step five, get a usable token
It’s important to note that you still need to “log in” to get an access token that can be used to make requests. The right thing to do would be to have Vault generate the token on demand, but since we were under time pressure to get this migration done, we opted to have Concourse fetch a new token automatically every 14 minutes so that jobs requesting the token would get a working one (ignoring the race condition where a job requests a token at the 14-minute mark and takes longer than a minute to use it).
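For that, we came up with a pipeline like this:
```yaml
resources:
- name: cred-timer
  type: time
  source: {interval: 14m}
jobs:
- name: update-cred
  plan:
  - get: cred-timer
    trigger: true
  - task: update-cred
    config:
      platform: linux
      # Hypothetical placeholders: the image (needs gcloud, jq, vault), VAULT_ADDR, ROLE_ID, SECRET_ID, ((gcp-key))
      image_resource: {type: docker-image, source: {repository: example/gcloud-vault}}
      run:
        path: bash
        args:
        - -ec
        - |
          export CLOUDSDK_CONFIG=$(mktemp -d gcloud-XXXXXX)
          cat >application_default_credentials.json <<EOF
          ((gcp-key))
          EOF
          gcloud auth activate-service-account --key-file=application_default_credentials.json
          gcloud container clusters get-credentials production
          access_token=$(gcloud config config-helper --format=json | jq -r '.credential.access_token')
          export VAULT_TOKEN=$(vault write -field=token auth/approle/login role_id="$ROLE_ID" secret_id="$SECRET_ID")
          vault write concourse/infra/kubernetes-token value="$access_token"
```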
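On the consuming side, a deploy job has to turn the stored token into something kubectl, Helm and a local Tiller can use. One option, sketched here under stated assumptions, is to write a minimal kubeconfig that presents the token as a bearer credential. The server URL and base64-encoded CA certificate are placeholders (gcloud container clusters describe can print them from the endpoint and masterAuth.clusterCaCertificate fields), and render_kubeconfig plus the cluster and user names are hypothetical. If Concourse's Vault credential manager uses its default /concourse prefix, a pipeline in a team named infra could reference the stored value as ((kubernetes-token)).

```bash
#!/usr/bin/env bash
# Sketch: print a kubeconfig that authenticates with a bearer token.
# Usage: render_kubeconfig SERVER_URL CA_DATA_BASE64 TOKEN > kubeconfig
render_kubeconfig() {
  local server="$1" ca_data="$2" token="$3"
  cat <<EOF
apiVersion: v1
kind: Config
clusters:
- name: production
  cluster:
    server: ${server}
    certificate-authority-data: ${ca_data}
users:
- name: concourse
  user:
    token: ${token}
contexts:
- name: production
  context:
    cluster: production
    user: concourse
current-context: production
EOF
}
```

A job would then export KUBECONFIG pointing at the rendered file before running helm. Since the token expires, the file should be rendered fresh in each job rather than cached.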
With that out of the way, we can now remove Tiller from the cluster, and we have much better security and auditing than we did just a few short hours ago.
I hope you enjoyed this look behind the scenes at what goes on in the DevOps team at TenX. If you’re interested in this sort of thing and want to learn more, or better yet, want to join us, drop me a line on Twitter @edude03 or via email at email@example.com.