Tidepool stores most of our data in MongoDB, and we host those MongoDB servers ourselves. When our user base was small and load was minimal, this was not a problem. However, our database has grown to over 2B records, and backups using basic mongodump and mongorestore became both slow and unreliable.
We considered whether to continue self-hosting MongoDB. If so, should we run MongoDB within Kubernetes? We experimented with the many ways to configure Mongo in Kubernetes with sharding and replica sets. We got it working, but we ran into issues with persistent volumes that troubled us.
For backup, we evaluated several solutions, from creating our own Kubernetes cron jobs using mongodump and mongorestore to using the MongoDB Enterprise Operator. We concluded that having someone else host our Mongo instances and take responsibility for backups was the best option for a company as small as ours, with a tiny operations staff. We began evaluating the two obvious candidates: Amazon DocumentDB and MongoDB Atlas. We discovered that both choices would require us to upgrade our old 3.2 Mongo drivers.
We migrated our Mongo drivers to version 3.6, the version Amazon DocumentDB specifically requires and also the easiest upgrade for our Node microservices. Then we cloned one of our test databases to a MongoDB Atlas instance, which, incidentally, is itself hosted on AWS. That made it simple to set up a secure VPC peering relationship between the VPC that hosts our Kubernetes services and the VPC that hosts the Atlas database. We ultimately chose not to evaluate Amazon DocumentDB because it does not offer a path forward beyond Mongo 3.6.
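For the curious, the switch is mostly a matter of connection strings. Here is a minimal sketch, assuming a recent 3.x release of the mongodb Node driver; the cluster hostname, user, and database name are placeholders, not our real ones:

```typescript
// Minimal sketch: connecting a Node/TypeScript service to MongoDB Atlas.
// With VPC peering in place, the mongodb+srv hostname resolves to addresses
// inside the peered Atlas VPC. All names below are placeholders.
import { MongoClient } from "mongodb";

async function connectToAtlas(): Promise<MongoClient> {
  const uri =
    process.env.MONGODB_URI ||
    "mongodb+srv://svc-user:<password>@example-cluster.mongodb.net/tidepool?retryWrites=true";

  return MongoClient.connect(uri, {
    useNewUrlParser: true,
    useUnifiedTopology: true,
  });
}
```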
With everything working in Kubernetes against one of our main test databases, we decided in June 2019 to offer our Kubernetes cluster to our QA team and developers for daily use.
Migrating our 3B Record Production Database
Our initial attempt to migrate our production Mongo database from our self-hosted servers to MongoDB Atlas was a colossal failure. Despite working with Atlas engineers to size our system, we under-provisioned our storage subsystem, which resulted in an immediate failure of the Tidepool service. Our users experienced significant downtime. This was unacceptable.
It did not take long for us to identify the problem. Our legacy database used SSD drives, but we had provisioned our new hosted service with standard drives. We did not have nearly enough disk IOPS to support our workload!
Fixing this mistake was straightforward, but the damage to our team’s reputation was done. We needed to make sure that our next attempt to migrate was successful.
We spent several weeks attempting to shadow our production traffic against a live mirror of our new hosted database, using the traffic shadowing feature of our Gloo API Gateway.
However, shadowing traffic at the application level also meant that we needed to address other unintended side effects, including requests to third-party services.
We could suppress communications with third-party services without causing our services to fail by simply providing an alternative set of Kubernetes secrets that effectively disabled those services.
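As a sketch of the pattern (the integration, function, and variable names here are hypothetical, not our actual code), each third-party client reads its credentials from a Kubernetes secret exposed as an environment variable and treats an empty value as "disabled":

```typescript
// Hypothetical sketch: a third-party integration that no-ops when its
// credentials are absent. In the shadow environment, the alternate Kubernetes
// secret supplies an empty API key, so the call is skipped instead of failing.

interface EmailMessage {
  to: string;
  subject: string;
  body: string;
}

// The secret is surfaced to the pod as an environment variable.
const EMAIL_API_KEY = process.env.EMAIL_API_KEY ?? "";

export async function sendEmail(message: EmailMessage): Promise<void> {
  if (EMAIL_API_KEY === "") {
    // Integration disabled by the alternate secrets: report success without
    // contacting the third party.
    console.log(`email suppressed: ${message.subject}`);
    return;
  }
  // Otherwise, forward to the real provider here (call elided in this sketch).
}
```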
The problem was that we also needed a way to suppress writes to the live mirror of our Mongo database. Our first attempt was to configure our Mongo clients to authenticate with a Mongo user that had read-only privileges. This would cause writes to fail, and a mirror in which every write fails no longer behaves like production. This approach was untenable.
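For completeness, here is roughly what that first attempt amounts to. On Atlas itself, database users and their roles are managed through the Atlas console, but on a plain MongoDB deployment an equivalent read-only user can be created with the createUser command. The user, password, and database names below are hypothetical:

```typescript
// Sketch only: create a Mongo user that can read, but not write, the shadow
// database. User, password, and database names are hypothetical.
import { MongoClient } from "mongodb";

async function createReadOnlyUser(uri: string): Promise<void> {
  const client = await MongoClient.connect(uri, { useNewUrlParser: true });
  try {
    await client.db("admin").command({
      createUser: "shadow-reader",
      pwd: "change-me",
      roles: [{ role: "read", db: "tidepool" }],
    });
  } finally {
    await client.close();
  }
}
```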
We also tried to introduce a proxy between our Mongo clients and the Mongo service that would suppress writes and simply return success codes. We found a 5-year-old Mongo proxy written in Go and a more recent proxy written in Rust. The former project had been abandoned and the latter was brand new. No one on our team had experience with Rust, and though we found the language appealing, we did not want to spend another several weeks writing a proxy that we would ultimately abandon.
So, after two months of hand-wringing, we migrated to a properly provisioned, hosted Mongo database. We completed the process with a total of 7 minutes of downtime, with our CEO and a dozen other members of the Tidepool staff watching via Zoom. Success!
Performance
One of the great benefits of migrating to Atlas has been the visibility we get from the dashboards it provides. We can now visualize memory use, slow queries, connection usage, and numerous other key metrics.
Immediately, we realized that we had some serious performance problems! Our slow query log was filling up so fast that the dashboard could only show a couple of hours' worth of data! The Atlas performance advisor recommended an index, which we introduced.
But this one index was not enough. One of my colleagues had already designed a new set of indices to use. We introduced them and got real-time visual feedback on their effectiveness.
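Introducing an index through the driver is a small operation in itself. Here is a hypothetical sketch; the database, collection, fields, and index name are placeholders, not our actual schema:

```typescript
// Hypothetical sketch: adding a compound index on a collection.
// Database, collection, and field names are placeholders.
import { MongoClient } from "mongodb";

async function addDeviceDataIndex(uri: string): Promise<void> {
  const client = await MongoClient.connect(uri, { useNewUrlParser: true });
  try {
    const collection = client.db("tidepool").collection("deviceData");
    // background: true avoids blocking other operations while the index
    // builds on MongoDB 3.6; later server versions ignore this option.
    await collection.createIndex(
      { userId: 1, time: -1 },
      { name: "userId_time", background: true }
    );
  } finally {
    await client.close();
  }
}
```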
The new indices sped up some queries. However, adding indices also adds latency to the write path and increases memory demands, so we needed to retire some indices as well.
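To pick candidates for retirement, MongoDB's $indexStats aggregation stage reports how often each index has been used since the server last started. Here is a sketch of reading it from the driver, again with a placeholder database and collection:

```typescript
// Sketch: list index usage so rarely-used indices can be considered for
// removal. Database and collection names are placeholders.
import { MongoClient } from "mongodb";

async function reportIndexUsage(uri: string): Promise<void> {
  const client = await MongoClient.connect(uri, { useNewUrlParser: true });
  try {
    const stats = await client
      .db("tidepool")
      .collection("deviceData")
      .aggregate([{ $indexStats: {} }])
      .toArray();

    for (const index of stats) {
      // accesses.ops counts operations that used this index since the
      // mongod process started.
      console.log(`${index.name}: ${index.accesses.ops} ops`);
    }
  } finally {
    await client.close();
  }
}
```

An index that shows essentially no operations over a representative window is a good candidate to drop, keeping in mind that the counters reset whenever a node restarts.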
Index manipulation is a great tool, but eliminating inefficient queries is even better!
We have now embarked on a project to improve our queries and the underlying database schemas. This project will take several months. Stay tuned for updates!
Upcoming
In our next blog post, we will discuss how we handle logging.