More fun with k8s and python

Yet another adventure in automation land

Bryan Kroger
Apr 20, 2021

In this article we’re going to be using these tools:

  • AWS — EC2 + ASG and some other things
  • AWS EKS for our k8s provider
  • CloudFormation for managing secrets
  • Terraform for managing clusters
  • Python as our “glue” language, plus some jinja2 and boto stuff
  • Rancher as a sort of middleware layer to help us coordinate everything

User Story

Let’s say we have a customer called “SeaView.” SeaView is a company that produces vehicles of all sorts, from trucks to trains to consumer vehicles like minivans. SeaView has an IoT device in each vehicle that sends telemetry to the cloud:

  • Blob data from several cameras ( in the case of trains, potentially dozens of cameras )
  • LiDAR devices, again potentially dozens depending on the platform
  • Direction, GPS, acceleration, cell signal strength, outside temp, etc…
  • And of course, some metrics on the operation of the cluster itself

All of this data is stored on the local machine and transmitted up to the cloud when possible by a data flow coordination system. This system knows the connection status and the priority of each piece of data, which lets us control what data ends up in the cloud first. All data is important, but the blob data goes last because it’s only needed later.

Each on-vehicle device runs a series of workloads to process the data; for instance, TensorFlow is used to detect objects in the camera feeds. The TensorFlow results can be tagged as highly important to ensure they reach the cloud sooner than anything else. This is useful for cases like pedestrian detection or Amber Alert catches.

Additionally, SeaView has a requirement for something that we’ll call the “white label” feature. They want the ability to create platforms for their various divisions. With most applications we generally build out a single production instance, then separate groups of things within that production application. In this case we’ll be producing entire production environments via an API call, which will set up and update an environment upon request.

We see this model often in industries that require hard separation between groups ( HIPAA is a great example ). SeaView needs to be able to push a button and have a brand new, isolated, fully separate space for a new business unit. A business unit could be the obvious, like the “truck hauling” division, but it could also be something like a group of charging stations in California.

This is nothing short of a herculean task. Most startups would wait to solve this kind of problem until they’ve grown past a certain size, ensuring that resources are devoted to the most immediate problems first. This is a good pattern for startups because it ensures that changes can happen quickly. Building something of this scale before knowing what the customer finds interesting is ambitious, to say the least.

Why K8s?

There are two great articles out recently talking about why certain companies didn’t use Kube. ( here’s one of them )

Both make valid arguments, and both emphasize something very important: focus on the problem you’re actually solving.

In this case we’ll be using Kube for what Kube is good for: separating things via namespaces. Along with this, Rancher has a great UI for helping upgrade helm charts. More on that later. Between these two things I’m thinking I’ll be able to manage deployments easily enough.

What I won’t focus on here are the things we usually build first in a production app, like logging, monitoring, redundancy, security, or managing upgrades to the software. Instead we’re going to build out the “rubber stamp machine” first, then worry about everything else later.

Let’s get this party started

This build uses both Terraform and AWS::CloudFormation. Here’s why:

  • Terraform is going to be great for managing the large EKS cluster for the SeaView organization. We can split this into dev/stage/prod profiles later if we want, but for now we’re going to stick with building a single, large, ASG-backed EKS cluster for the entire org. K8s namespaces will be used to separate groups inside of SeaView.
  • CloudFormation is going to be used for dynamically provisioning the secrets and IAM users required by the application. CF works very well for many large, complicated groups of resources. We could use Terraform workspaces, but dynamically provisioning things isn’t a great use case for TF; CF handles this kind of work well.

For example, let’s say we have a SeaView group named “Trucking.” If we want to find the resources for the SeaView::Trucking group, we can do a tag search for a stack, then use the same API to find the resources within that stack.
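As a rough sketch of that lookup with boto3 ( the tag keys here are illustrative assumptions, not a real schema ):

import boto3

def find_group_resources(org: str, group: str) -> list:
    """Find the CloudFormation stack tagged for an org/group pair,
    then list the resources it contains."""
    cfn = boto3.client("cloudformation")
    for page in cfn.get_paginator("describe_stacks").paginate():
        for stack in page["Stacks"]:
            tags = {t["Key"]: t["Value"] for t in stack.get("Tags", [])}
            if tags.get("org") == org and tags.get("group") == group:
                # Same API surface: enumerate what the stack created.
                summaries = cfn.list_stack_resources(StackName=stack["StackName"])
                return summaries["StackResourceSummaries"]
    return []

# find_group_resources("SeaView", "Trucking")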

The nature of this project is to create things dynamically, which is the big hang-up here. If we were taking a different approach, and we just had a single production environment, none of this would be required. We’d have something to set up the infrastructure ( Terraform ) and something to provision and deploy the software ( helm charts with encrypted values files ).

The dynamic, “on the fly” production environments are what make this specifically pernicious.

I believe in using the best tool for the job, and in this case, combining the best of TF and the best of CF is the way to go. However, this opens us up to the inevitable problems of using two tools to do some complicated things. This entire domain of white-labeling software is complicated in general; there isn’t much we can do to avoid that, so let’s embrace it and try to enjoy the ride!

CloudFormation specifics

AWS::CF is going to handle:

  • Secrets for the application, things like private repo keys to pull down images from our internal repos, OAuth secrets, and db username/password pairs.
  • IAM users which allow some services to talk to S3.
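To make that concrete, here’s a minimal sketch of what one per-group stack might look like, driven from Python with boto3. The resource names, tag keys, and bucket layout are all illustrative assumptions, not the real template:

import json
import boto3

# Minimal per-group stack: one generated db secret plus one IAM user
# scoped to an S3 prefix. Everything here is an illustrative assumption.
TEMPLATE = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Parameters": {"GroupName": {"Type": "String"}},
    "Resources": {
        "DbCredentials": {
            "Type": "AWS::SecretsManager::Secret",
            "Properties": {
                "Name": {"Fn::Sub": "${GroupName}/db"},
                "GenerateSecretString": {
                    "SecretStringTemplate": '{"username": "app"}',
                    "GenerateStringKey": "password",
                },
            },
        },
        "S3User": {
            "Type": "AWS::IAM::User",
            "Properties": {
                "Policies": [{
                    "PolicyName": "s3-access",
                    "PolicyDocument": {
                        "Version": "2012-10-17",
                        "Statement": [{
                            "Effect": "Allow",
                            "Action": ["s3:GetObject", "s3:PutObject"],
                            "Resource": {"Fn::Sub": "arn:aws:s3:::seaview-data/${GroupName}/*"},
                        }],
                    },
                }],
            },
        },
    },
}

def provision_group(group: str) -> str:
    """Stand up the per-group secrets/IAM stack and return its stack id."""
    cfn = boto3.client("cloudformation")
    resp = cfn.create_stack(
        StackName=f"seaview-{group.lower()}",
        TemplateBody=json.dumps(TEMPLATE),
        Parameters=[{"ParameterKey": "GroupName", "ParameterValue": group}],
        Capabilities=["CAPABILITY_IAM"],  # needed because the stack creates IAM users
        Tags=[{"Key": "org", "Value": "SeaView"}, {"Key": "group", "Value": group}],
    )
    return resp["StackId"]

The tags on the stack are what make the tag-search lookup from the previous section work.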

In some cases, people might use Vault or something else for this, which is a perfectly valid way to go. We could also make our lives easier by only provisioning a single user and using it to talk to S3.

However, we’re really trying to focus on separation of concerns here, so going the extra mile to harden things is worth it. I’m 100% down with using Vault, but we use AWS::SecretsManager here because it’s one less thing we have to manage right now. Our main focus is getting this product up and running, so the less we have to set up and manage, the better.

Terraform specifics

I’m going to use workspaces to separate companies. In this case, SeaView will have its own workspace handling:

  • IAM roles for managing ALBs and other objects required by the services
  • ASG with user-data that can help us connect to our central Rancher service and some other things
  • EKS cluster
  • Not implemented now, but eventually we’ll want CloudWatch triggers for a few things, like scaling EKS capacity based on Prometheus metrics.

Running Terraform will be the only manual step for creating the new company-wide EKS cluster. Once that’s done, we can keep using the same cluster and separate things out by namespace.
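If we ever want to script that one manual step, the workspace dance is just a couple of CLI calls. A sketch of how the glue code might wrap it ( the var-file name is an assumption ):

import subprocess

def terraform_apply(company: str, var_file: str = "seaview.tfvars") -> None:
    """Select ( or create ) the per-company workspace, then apply."""
    # "workspace select" exits non-zero if the workspace doesn't exist yet,
    # so fall back to "workspace new" on failure.
    if subprocess.run(["terraform", "workspace", "select", company]).returncode != 0:
        subprocess.run(["terraform", "workspace", "new", company], check=True)
    subprocess.run(["terraform", "apply", f"-var-file={var_file}"], check=True)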

Rancher

For the purposes of the MVP, I’ll be using a single non-enterprise Rancher server to connect all of our customer clusters.

I’ll also be using Rancher’s application system, which is a wrapper around helm charts. This means we’ll have one place to talk to for all customer clusters. I can pull the Kube config down from Rancher if I want, but for the most part this gives me a single, unified way of talking to all Kube clusters.

The importance of this becomes more obvious when we look at how the edge nodes will be connecting to the data ingestion services. I have another blog post detailing that.

I might end up creating some TF bits to create a Rancher host for each customer environment, but for now, this should be fine.

The magic sauce

The “glue” code is where we put everything together. It’s responsible for:

  • Looking up secrets from AWS::SecretsManager
  • Using Jinja2 to create the values overrides for the helm charts
  • Firing off the application manifest to Rancher
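The first two pieces are small. Here’s a sketch of what they might look like ( the templates directory layout and secret naming are assumptions ):

import json
import boto3
from jinja2 import Environment, FileSystemLoader

def get_secret(secret_name: str) -> dict:
    """Pull a JSON secret out of AWS::SecretsManager."""
    sm = boto3.client("secretsmanager")
    return json.loads(sm.get_secret_value(SecretId=secret_name)["SecretString"])

def render_template(template_path: str, context: dict) -> str:
    """Render a helm values override from a jinja2 template on disk."""
    env = Environment(loader=FileSystemLoader("templates"))
    return env.get_template(f"{template_path}/values.yaml.j2").render(**context)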

I have another post detailing how I use Rancher with many edge nodes, using some k3s and Rancher automation. I’ll be using some of those techniques to automate the Kube cluster creation steps with EC2, but I won’t go over the specifics of that work here.

The focus here is going to be on how I can deliver a manifest of applications to a namespace using Rancher. I looked at several of the Rancher API integrations available in the community, but most of them seemed outdated or non-functional, so I wrote my own. Fortunately Rancher makes this process super easy; it’s basically just python requests with a Bearer token authorization header. Nothing fancy. ;)

This is the function that deploys the application to Rancher, which is basically using helm on the backend.
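A minimal sketch of such a function, using the Rancher v3 project-apps API as I understand it ( the payload field names are assumptions, not verified against a live server ):

import requests

class RancherDeployer:
    def __init__(self, rancher_url: str, token: str):
        self.rancher_url = rancher_url.rstrip("/")
        # Rancher API tokens go in a standard Bearer authorization header.
        self.headers = {"Authorization": f"Bearer {token}"}

    def deploy_app(self, values: str, app_name: str, project_id: str,
                   helm_chart_url: str, target_namespace: str) -> dict:
        """Create a Rancher project app; Rancher runs helm on the backend."""
        payload = {
            "name": app_name,
            "targetNamespace": target_namespace,
            "externalId": helm_chart_url,  # the catalog://... chart reference
            "valuesYaml": values,          # rendered values override
        }
        resp = requests.post(
            f"{self.rancher_url}/v3/projects/{project_id}/app",
            headers=self.headers,
            json=payload,
        )
        resp.raise_for_status()
        return resp.json()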

Now I can cobble together the top-level deployment sequence:

self.deploy_foundation()
self.deploy_certs()
self.deploy_influxdb()
self.deploy_dsm()

Each method can have its own set of things:

def deploy_influxdb(self):
    values = self.render_template("insight/cloud/influxdb", self.variable_context)
    self.deploy_app(
        values=values,
        app_name="influxdb",
        project_id=self.rancher_project_id,
        helm_chart_url="catalog://?catalog=stable&type=clusterCatalog&template=influxdb&version=4.3.2",
        target_namespace="bkroger-insight"
    )

In some cases we might want to do extra processing, like fetching a specific config or secret. In those cases we can put the specific code into the method, which then passes the values into the template.

I can also put wait conditions in place for cases where a specific service needs to wait on something else. This pattern lets me control the flow of things while maintaining order of operations.
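A wait condition can be as simple as polling the app until Rancher reports it active. A sketch ( the endpoint shape and “active” state value are assumptions based on the v3 API ):

import time
import requests

def wait_for_app(self, app_name: str, project_id: str, timeout: int = 300) -> None:
    """Poll the Rancher API until the app reports an 'active' state."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        resp = requests.get(
            f"{self.rancher_url}/v3/projects/{project_id}/apps",
            headers=self.headers,
            params={"name": app_name},
        )
        resp.raise_for_status()
        apps = resp.json().get("data", [])
        if apps and apps[0].get("state") == "active":
            return
        time.sleep(5)
    raise TimeoutError(f"{app_name} did not become active within {timeout}s")

With this in place, deploy_dsm() can simply call wait_for_app() on whatever it depends on before deploying itself.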

Conclusion

At some point someone got the idea that because Kube supports namespaces, we should use them to separate data instead of doing it at the application level. That decision appears to be the impetus for this project, and it was outside my sphere of influence.

Although I tried to push things in a different direction, we are where we are because of some ambitious thinking about what might be possible with Kube.

I think the lesson to learn here is that just because you can do something doesn’t mean you should.
