Instant rollbacks without interruption: how we ship new versions of Gearset every day
Gwilym Kuiper on October 12th 2020
Releasing changes to production is scary. All sorts of things can go wrong. What if that new feature you just implemented contains a bug that you didn't spot in testing? Or maybe that bug you fixed causes an issue in an unrelated area? Maybe there's a performance issue that doesn’t show up until you hit production workloads?
At least you know you have a version of the product that works: the previous version. With an instant rollback, you can switch to the old version immediately, and if you’re lucky your customers won't even notice.
Once you've got instant rollbacks in place, you can release more often and with confidence - knowing that if something goes wrong, there's a safety net in place.
What are blue-green deployments?
Blue-green deployments are a way to release your product without downtime. Rather than having a single production instance, you have two, traditionally labelled 'blue' and 'green'. At any time, only one colour is active and serving traffic.
These two production instances might be two processes running on the same machine, or two entirely separate clusters of virtual machines. Either way, the downside is that you effectively need to double up on all infrastructure, while only one copy is serving traffic at any one time - increasing both complexity and cost.
In order to deploy a new version, you tear down the inactive colour and replace it with the newer version. After it has started up, you switch all user traffic to the newly started instance. The new colour is now active, and the old colour is inactive.
In theory, this switchover can be gradual, but at Gearset we currently switch all user traffic at once. By doing this, it's much easier to know what state all your users are in at any one time, and that users are always being served with the latest version. However, with a gradual switchover, you can monitor production workloads running on the new version and abort the rollout with fewer impacted customers if there is something wrong.
How do blue-green deployments help with instant rollbacks?
Once you have blue-green deployments in place, instant rollbacks are easy (in theory). All you need to do is switch which colour is currently active, without replacing the inactive colour as you would as part of a release.
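To make the difference between a release and a rollback concrete, here is a minimal sketch in Python. It is not Gearset's actual tooling: the BlueGreenRouter class, its method names and the version strings are all hypothetical, and a real switch would reconfigure a load balancer or proxy rather than update a field in memory.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


def other(colour: str) -> str:
    """Return the colour that is not the one given."""
    return "green" if colour == "blue" else "blue"


@dataclass
class BlueGreenRouter:
    """Hypothetical model: which colour serves traffic, and what each colour runs."""
    active: str = "blue"
    versions: Dict[str, Optional[str]] = field(
        default_factory=lambda: {"blue": None, "green": None}
    )

    def deploy(self, version: str) -> None:
        """Replace the inactive colour with `version`, then switch traffic to it."""
        inactive = other(self.active)
        self.versions[inactive] = version  # tear down and start the new version
        self.active = inactive             # switch all user traffic at once

    def rollback(self) -> None:
        """Switch traffic to the other colour without replacing anything."""
        self.active = other(self.active)

    def serving(self) -> Optional[str]:
        """Version currently handling user requests."""
        return self.versions[self.active]


router = BlueGreenRouter()
router.deploy("v1")  # green now serves v1
router.deploy("v2")  # blue now serves v2, green still holds v1
router.rollback()    # traffic returns to green (v1); v2 keeps running, inactive
```

In this model a rollback is a single flip of the active colour, which is why it can be instant: the previous version never stopped running.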
There are things to consider when doing this. Speaking in the context of a web application like Gearset, the user's web browser will have loaded the new version of the application, but requests will be handled by the previous version of the web server.
This can be a major issue if the release added endpoints that the front-end relies on, since these will no longer exist. Or, you may have updated existing endpoints to return additional fields, which again will no longer be there. We'll cover solutions to these problems later. Fortunately, users tend to refresh their web browser if they find things aren't working, but ideally we’d like to avoid this as much as possible.
Another thing to consider is that although you have rolled back your application, you haven't rolled back your database - you need to make sure your database changes are compatible. And finally, the new version of your application is still running (as the inactive colour). Any long-running tasks that were kicked off in the new version will still be running in the now inactive instance.
How do you structure your code to handle blue-green deployments correctly?
Since the front end and back end of your application update independently, all developers need to keep in mind both forwards and backwards compatibility. In general, customers will be using the current version of the front end. However, just after a release, customers will be using the previous version of the front end but their requests will be going to the new back-end version. In Gearset, a popup appears letting customers know that there is an update available and they should refresh their browsers (bringing their front-end version in line with the current back-end version).
If the next Gearset version adds some functionality that requires the front end to send extra information in a request, it cannot assume this information will always be included, because customers might still be running an old front-end version. Any new endpoint needs to be able to handle this. Similarly, if the new back end wants to stop sending some information to the front end, the front end needs to be able to handle this information being missing before the release.
This discrepancy between the front end and back end generally only exists for one release. Customers rarely use Gearset for so long that their browser session isn’t refreshed after two releases. Having said that, the faster we get our release cadence, the more likely it is!
If a rollback happens, you get the opposite situation. Customers will have a new version of the Gearset front end communicating with the old version of the back end. Because of all of this, you can find yourself in a situation where one version of the back end is simultaneously handling requests from three different versions of the front end.
To ensure that users won't be affected during a release or after a rollback, we follow these rules:
- Release new endpoints first without using them yet. This way, if you need to roll back, requests made to the new endpoints won’t fail. Often, we'll write the front end anyway and then hide it from users using feature flagging. This makes testing the changes in the staging environment much easier.
- When deleting endpoints, make sure that the front end doesn’t rely on their existence for at least two releases. Doing this makes sure that customers never end up hitting the endpoints with an old version of the front end.
- If you are updating an existing endpoint, allow the old form of the request for at least one release, since the old version of the front end won't be sending the new form. Also, ensure the old version of the endpoint can parse the new request format, so that rollbacks are handled gracefully. This can be done by allowing empty values in new request objects.
Since everyone on the engineering team takes part in releases, everyone thinks about how their changes will be released. And because of that, these rules become second nature.
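The rule about updating an existing endpoint can be sketched as follows. This is an illustration only, assuming a hypothetical 'rename' endpoint that gains an optional reason field; the type and function names are invented for the example and are not Gearset's code.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class RenameRequestV1:
    """Shape understood by the previous back end (hypothetical endpoint)."""
    name: str


@dataclass
class RenameRequestV2:
    """Shape understood by the new back end: `reason` is new and optional."""
    name: str
    reason: Optional[str] = None  # old front ends will not send this


def parse_v1(body: Dict[str, Any]) -> RenameRequestV1:
    # Read only the fields this version knows about, so an unknown
    # `reason` from a newer front end is ignored rather than rejected.
    return RenameRequestV1(name=body["name"])


def parse_v2(body: Dict[str, Any]) -> RenameRequestV2:
    # Accept the old form of the request: a missing `reason` becomes None.
    return RenameRequestV2(name=body["name"], reason=body.get("reason"))


old_front_end = {"name": "nightly"}
new_front_end = {"name": "nightly", "reason": "clearer label"}

# Just after a release: new back end, both front-end versions in the wild.
assert parse_v2(old_front_end).reason is None
assert parse_v2(new_front_end).reason == "clearer label"

# Just after a rollback: old back end, new front end still open in browsers.
assert parse_v1(new_front_end).name == "nightly"
```

The important property is that every pairing of front-end and back-end version parses successfully, so neither a release nor a rollback turns an in-flight browser session into a failed request.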
What about database changes?
It's unlikely that your database is deployed with blue-green deployments, because having two copies of your database introduces a lot of complexity when making sure that the data stays consistent between them. As a result, database migrations need to be thought about carefully. At Gearset, we always run pending database migrations in advance of the release. This means the new database version will be used across three different versions of the product at some point: in the previous version, the current version and the next version. Because of this, you have to structure your database changes carefully.
If you are only adding a new table to the database, then you can add it in one migration and immediately start using it. However, if you're altering an existing table, some extra thought will need to go into the release strategy.
Adding a new column to a table
Suppose you want to add a new required column to a table in the database. You can’t add it in immediately, because any rows inserted into the table before releasing the new version will fail due to a missing required field. Adding the column over the course of three releases allows this to be done safely.
In the first release, you add the new column but allow it to be null (or set a sensible default). You can start writing to this column in the same release that the change is introduced, but you cannot yet guarantee that there is always a value for that column, because there may be a gap between adding the column and releasing the new version of the code, and any rows written in that gap won’t have a value for the new column. Also, in case of a rollback, any data written after the rollback will also result in null values.
After two releases, both the active and inactive colour will be writing to this new field. So now you can use a third release to run an update migration that will fill in the remaining null fields, and you can start reading from the new column. Since we release two to three times per day, this isn't a very long wait.
Almost all database changes can be done using this pattern, and the few that can't will need special consideration to handle the fact that three versions of Gearset need to be able to read and write to any one database version.
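The three-release pattern for a new column can be walked through with an in-memory SQLite database. This is a sketch under assumptions: the jobs table, the priority column and the backfill value of 0 are hypothetical, and a real setup would run each step through its own migration tooling against its production database engine.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")


def insert_old(name: str) -> None:
    """Old code, unaware of the new column."""
    db.execute("INSERT INTO jobs (name) VALUES (?)", (name,))


def insert_new(name: str, priority: int) -> None:
    """New code: writes the column, but must not rely on reading it yet."""
    db.execute("INSERT INTO jobs (name, priority) VALUES (?, ?)", (name, priority))


insert_old("before migration")

# Release 1: add the column as nullable, so inserts from old code keep working.
db.execute("ALTER TABLE jobs ADD COLUMN priority INTEGER")
insert_old("written in the gap, or after a rollback")
insert_new("written by the new version", 5)

# Release 2: both colours now run code that writes the column.
insert_new("written by either colour", 3)

# Release 3: backfill the remaining nulls; reads are safe from here on.
db.execute("UPDATE jobs SET priority = 0 WHERE priority IS NULL")
nulls = db.execute("SELECT COUNT(*) FROM jobs WHERE priority IS NULL").fetchone()[0]
priorities = [row[0] for row in db.execute("SELECT priority FROM jobs ORDER BY id")]
```

After the backfill, nulls is zero and every row has a value, including the rows written by old code before and after the column existed.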
An example of a release strategy for a new feature
Suppose a Gearset customer wants to include some free-form notes to go along with their Gearset deployment, in order to keep track of what they’ve changed. In order to add this, we need to be able to collect this information from the user and store it in the database for later retrieval.
We already have a deployments table in the database, and endpoints to both create and read these deployments. However, storing this free-form text is not currently implemented.
In the first release, add a notes column to the deployments table which is nullable (this is a case where we can never make the column not nullable, because we have old deployments that don't have notes). We can also add support for receiving a notes field in the 'create deployment' request, which for now may be omitted entirely. Finally, we return the deployment notes if they exist when getting a deployment.
This can be released, and is entirely compatible with the current state of the front-end because we didn't change it.
Suppose there was a bug either in the writing of the deployment notes, or in something unrelated included in the same release, which made us roll back. Writing deployment notes causes no issues because the column is allowed to be null, and the client doesn’t expect deployment notes to be included in the 'get deployment' response.
For the front-end client changes, when a customer creates their Gearset deployment, we include deployment notes in the 'create deployment' request that gets sent to the back-end server, and also display deployment notes if we receive them from the back-end.
These changes can then be released, as they are entirely compatible with the back-end changes - provided they have already been released.
Again, if there was an issue with these changes that forced us to roll back, then customers with the new front end would enjoy deployment notes until they refreshed their client, and customers with the old front end would never have seen them, making instant rollbacks seamless to the customer.
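The front-end behaviour described above can be sketched as two small functions. A real front end would be written in a browser language; Python is used here only to keep the examples in one language. The function names and the name field are hypothetical, while the notes field is the one from the example.

```python
from typing import Any, Dict, Optional


def build_create_request(name: str, notes: Optional[str]) -> Dict[str, Any]:
    """New front end: include notes in the 'create deployment' request."""
    body: Dict[str, Any] = {"name": name}
    if notes:
        body["notes"] = notes  # an old back end is expected to ignore this field
    return body


def render_deployment(response: Dict[str, Any]) -> str:
    """Show notes only when the back end returned them."""
    notes = response.get("notes")  # absent after a rollback, or on old deployments
    line = f"Deployment: {response['name']}"
    return f"{line}\nNotes: {notes}" if notes else line
```

Because the display code treats notes as optional, the same front end renders correctly against either back-end version: with notes when they come back, and without them otherwise.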
Instant rollbacks and downtime-free deployments are both easy with blue-green deployments. Getting them to work effectively requires some discipline, and how to release your change must always be on your mind. However, the benefits for you and your customers are massive. Blue-green deployments make your releases less scary, they allow you to innovate and deliver value to your customers quickly, and they provide a safety net for when things go wrong.