Throughout the year, we will be sharing insights into how we are continually evolving and improving the Tidepool infrastructure and services. If you are an open source developer who is curious about how Tidepool leverages open source, has considered contributing software to Tidepool, or is simply curious about how the sausage gets made, then we think that you will find these blog posts enlightening.
Tidepool launched our web service over five years ago! We have continually updated the user interface, but the backend has largely stayed the same. Over that time, Kubernetes has become the de facto standard for distributed systems. So in January 2019, we decided to jump on board! This post describes our exciting journey in migrating to Kubernetes.
Motivation
The Tidepool backend consists of a distributed system of ~18 microservices written in Node.js and Go that are deployed on Amazon EC2 instances. Through 2018, we managed our infrastructure using AWS CloudFormation with Lambda and Ansible.
Included among those 18 microservices are our own custom API gateway (styx), our own custom service discovery system (hakken), and our own custom load balancing system (shio). The engineers who built these tools are long gone, but we still need new features.
Instead of continuing to invest in our custom tools, Tidepool decided to migrate our services to containers in early 2019, and to manage those containers with Kubernetes using the Amazon Elastic Kubernetes Service (EKS). We want to focus our engineering resources on building software for diabetes treatment, not tools or infrastructure.
Kubernetes
Over the last five years, a new paradigm arose for packaging services: containers. To manage a distributed system of containers, Google created and open-sourced Kubernetes in 2014. Kubernetes has since become the de facto standard for managing containers.
Kubernetes is managed by the Cloud Native Computing Foundation (CNCF). As of this writing, 77,648 people have contributed to CNCF projects. That’s a lot of people! So last year, Tidepool decided to shift to Kubernetes and other CNCF projects.
Containerization
Our modernization journey actually began prior to 2019 with the containerization of our microservices. This enabled developers to become familiar with containers, and to use containers in their day-to-day development.
Our initial foray into containers focused on this developer workflow. Using Docker and Docker Compose, we enabled developers to run the Tidepool service on their local machines, and we published that configuration in our development repo. Our internal developers use it every day.
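To give a flavor of that setup, here is a minimal, hypothetical docker-compose.yml in the spirit of our development configuration; the service name, image, and ports are illustrative stand-ins, not our actual values.

```yaml
# Illustrative sketch only -- the service name, image, and ports are
# hypothetical stand-ins, not Tidepool's actual development configuration.
version: "3"
services:
  mongo:
    image: mongo:3.6            # ephemeral database for local development
  example-api:
    image: tidepool/example-api:latest
    depends_on:
      - mongo
    environment:
      - MONGO_HOST=mongo        # reach the database by its Compose service name
    ports:
      - "8080:8080"             # expose the API on localhost:8080
```

A developer can then bring the whole stack up or tear it down with a single `docker-compose up` or `docker-compose down`.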
Kubernetes Service Discovery and Routing
With the Tidepool service running locally under Docker Compose, it was relatively easy to get it running in a local Kubernetes (minikube) environment with an ephemeral Mongo database in a container. In January 2019, we converted our Docker Compose scripts to Kubernetes manifests using Kompose. We had achieved the first step in our migration to Kubernetes!
Part of the excitement behind Kubernetes is the way in which it makes common tasks simple and difficult tasks possible. Specifically, Kubernetes allows one to create a set of replicas (called Pods) of a stateless service as easily as specifying the number of replicas desired in a Kubernetes Deployment. Kubernetes Services allow one to use a single DNS name to refer to any of the replicas so constructed.
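As a sketch of what this looks like in practice (with a hypothetical service name, image, and port, similar in shape to the manifests Kompose generated for us), a Deployment and its Service might be declared like this:

```yaml
# Hypothetical example: three replicas of a stateless service, reachable
# inside the cluster at http://example-api:8080 regardless of which Pod answers.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api
spec:
  replicas: 3                    # Kubernetes keeps three Pods running at all times
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
        - name: example-api
          image: tidepool/example-api:latest
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: example-api              # one DNS name that load-balances across the replicas
spec:
  selector:
    app: example-api
  ports:
    - port: 8080
      targetPort: 8080
```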
In our legacy system, we have a service called hakken that provides service discovery and routing. With Kubernetes Services and Deployments, hakken is not needed. To migrate away from hakken, we introduced feature flags to allow the same microservice to work in the Kubernetes environment (in a Docker container) and in the legacy environment (as a raw executable).
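As a rough illustration, that choice can be expressed as environment variables in the container spec of a Deployment like the one above; the variable names here are hypothetical stand-ins, not our actual flags.

```yaml
# Hypothetical fragment of a container spec such as the one above: the same
# image runs in both environments, and environment variables select the
# discovery mode. The variable names are illustrative, not Tidepool's actual flags.
          env:
            - name: DISCOVERY_BACKEND
              value: "kubernetes"        # the legacy deployment sets this to "hakken"
            - name: UPSTREAM_API_HOST
              value: "example-api"       # a plain Kubernetes Service DNS name
```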
Cloud Deployment
Once we had the basic services running locally, we began investigating how best to run Kubernetes in the cloud. The Kubernetes control plane itself consists of a number of processes that manage the cluster.
In the early days of Kubernetes, running Kubernetes meant creating and launching those processes using tools like kops or kubeadm. With this approach, one assumes responsibility for maintaining Kubernetes itself. For a small non-profit, this approach made no sense.
However, by the beginning of our Kubernetes migration, every major cloud vendor had launched a managed Kubernetes service, including Amazon, our hosting provider. Our hosting needs were modest. Amazon is not a leader in the Kubernetes space, but their hosted Kubernetes offering appeared more than adequate.
In February 2019, we brought up our first cloud-hosted Kubernetes cluster on Amazon EKS, using eksctl from Weaveworks to create it. eksctl is a simple command-line tool that generates AWS CloudFormation templates, a technology we had used for years. Moreover, Amazon itself has acknowledged eksctl as the official CLI for managing EKS clusters.
Using eksctl, we can create new Kubernetes clusters, add or remove entire node groups, and create or delete Kubernetes service accounts that are bound to AWS IAM roles. By assigning special IAM roles to specific service accounts and binding those service accounts to specific pods, we can tightly control which pods have which IAM privileges. We use this capability, for example, to give our blob, image, and hydrophone services access to the S3 buckets that they need; no other pods have such access. This follows the Principle of Least Privilege.
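To illustrate, eksctl is driven by a declarative ClusterConfig file, roughly like the following; the cluster name, region, node sizes, service account, and policy ARN are hypothetical placeholders, not our production configuration.

```yaml
# Hypothetical ClusterConfig sketch -- names, region, node sizes, and the
# policy ARN are placeholders, not Tidepool's production configuration.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: example-cluster
  region: us-west-2
nodeGroups:
  - name: workers
    instanceType: m5.large
    desiredCapacity: 3              # eksctl can later grow or shrink this group
iam:
  withOIDC: true                    # enables binding IAM roles to service accounts
  serviceAccounts:
    - metadata:
        name: blob                  # only pods using this service account get the policy
        namespace: default
      attachPolicyARNs:
        - arn:aws:iam::123456789012:policy/example-blob-s3-access
```

Running `eksctl create cluster -f cluster.yaml` then turns a description like this into the corresponding CloudFormation stacks.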
Resource Constraints
Kubernetes was derived from a Google-internal tool called Borg. Google needs to run millions of containers in such a way that no single container can cause other containers to fail. Kubernetes offers similar guarantees, in part by limiting the CPU and memory that a container may use.
By contrast, our legacy services run without CPU or memory constraints. An errant service can hog either resource and destabilize the entire system; overuse of a single resource can trigger a cascading failure. Running without constraints in such conditions is hazardous.
In an era of slow user growth, running without resource constraints was an acceptable risk. However, since we announced our new Tidepool Loop project, we have seen significant growth in our user base. To reduce the risk of failure, we have added resource constraints to our services, including CPU limits, memory limits, and HTTP request timeouts.
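The CPU and memory constraints are declared in each container spec, along the lines of the following fragment; the values shown are illustrative, not the limits we actually use.

```yaml
# Illustrative resource constraints for one container in a Deployment's pod
# spec; the actual values vary per Tidepool service.
          resources:
            requests:
              cpu: 100m             # scheduling guarantee: a tenth of a CPU
              memory: 128Mi
            limits:
              cpu: 500m             # throttled if it tries to use more CPU
              memory: 256Mi         # killed and restarted if it exceeds this
```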
By adding resource constraints, we immediately discovered that our simple highwater service, which forwards metric data to an external BI provider, had a bug! Previous modifications intended to improve performance had actually increased both latency and memory use. Every six hours or so, the service ran out of memory and restarted. Had we not imposed memory limits, we might never have noticed the problem; the service would have continued to consume excessive memory and to penalize its clients with long latencies.
Current State
We now operate 5 Kubernetes clusters on Amazon EKS. Two clusters are used for internal testing, one for integration testing with our partners, one for administering shared tools, and one for production.
On the production cluster, we run our Tidepool production environment, which consists of 44 pods providing our 18 microservices. We run another 66 pods to provide support services such as our API gateway, telemetry, and logging. All of our software is open source, as are all of the support services we run.
Conclusion
Our migration to Kubernetes has enabled us to leverage the efforts of far more people than Tidepool could ever employ. We are already seeing benefits! We have been able to retire custom code and use open source alternatives that are actively supported and offer broad feature sets. This leverage allows us to focus our efforts on developing diabetes software.
Upcoming
In upcoming blog posts, we will dive into more of the details of our backend modernization project. We will explore the open source tools that we use, the tools that we built, and the processes that we put in place to manage this infrastructure.
In our next blog post, we will focus on our new API Gateway.