Breaking the Monolith at Twitch: Part 2
The Story So Far:
This is Part Two, the second and final post chronicling Twitch’s journey from monolithic architecture to microservices. In Part One, we talked about how Twitch relied on a monolithic architecture in its early days to grow quickly and efficiently, and where that architecture became more of a bottleneck than a benefit. With the rise of Go, we began our first experiments with microservices and eventually decided to switch our entire tech stack to a microservice architecture. This post covers that long transition.
My name is Mario Izquierdo, and I’ve been a Full Stack Engineer at Twitch since 2017. I’ve worked on a variety of systems in my time here, including the public API, infrastructure, and tools for services.
The Massive Migration: Wexit
Back in 2015, Twitch started a company-wide campaign to move all of its backend code into Go microservices. The codename was “Wexit.”
With the exception of a few microservices like Jax, the vast majority of the site was still working inside the Rails monolith. Developers would highlight various challenges when working with the existing codebase. For example:
- Using a single deploy pipeline for everyone
- Using spreadsheets to coordinate staging environments
- Running Puppet in one of the production boxes to test infra changes
- Waiting for increasingly slow builds
- Struggling to trace errors back to their owners
- Slow routing due to too many API endpoints
- Slow database queries
- Constantly dealing with merge conflicts
With a growing group of developers, the freedom offered by the Ruby language was no longer an advantage.
Multiple efforts ran in parallel to improve the situation. One of them was to unravel the spaghetti of Rails controllers in order to decouple the code inside the monolith.
This helped us identify potential services that could be extracted, but the refactor ultimately proved too difficult while other teams were constantly making changes to the same codebase.
Perhaps the most important effort was the addition of an extra layer in front of the backend API: an NGINX reverse proxy, used to redirect a percentage of traffic to a new API edge. The new API edge was written in Go, using common middleware for validation, authorization, and rate limiting, but delegating all the business logic to the underlying microservices.
With the new proxy in place, the process to migrate endpoints was defined:
- Develop your new microservice.
- Replicate the old endpoint in the new edge.
- Update the NGINX config file to send a percentage of traffic to the new endpoint.
- Verify metrics and keep increasing traffic until 100%.
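The percentage-based step can be sketched in Go as follows. This is only an illustration of the idea, not the NGINX configuration Twitch used (the post does not show it). The function `useNewEdge` and the choice of a client ID as the routing key are assumptions. Hashing a stable key keeps each client on the same side of the split as the percentage grows, which makes metrics easier to compare.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// useNewEdge reports whether a request should go to the new edge.
// key is any stable request attribute (hypothetical: a client ID) and
// percent is the rollout percentage, from 0 to 100.
func useNewEdge(key string, percent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(key)) // Write on an FNV hash never returns an error
	// Map the key to a bucket in [0, 100) and route the lowest buckets.
	return h.Sum32()%100 < percent
}

func main() {
	for _, percent := range []uint32{0, 10, 50, 100} {
		routed := 0
		for i := 0; i < 10000; i++ {
			if useNewEdge(fmt.Sprintf("client-%d", i), percent) {
				routed++
			}
		}
		fmt.Printf("%3d%% configured: %d of 10000 clients routed to the new edge\n", percent, routed)
	}
}
```

Because a key’s bucket never changes, raising the percentage only adds clients to the new path; nobody who was already migrated flips back to the old endpoint.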
Teams were encouraged to build new functionality this way, effectively making the monolith a big deprecated piece of software. The campaign was huge: a difficult move that engaged all our backend engineers in one way or another from 2016 to 2018.
The toolchain changed drastically. It may be hard to envision, but developers had to write code in multiple repositories with different deploy cycles to deliver a single feature, all while learning how to provision resources in a new cloud service (AWS) using a new programming language (Go). Calling a module inside the monolith means calling a function or a method, but calling another service requires a lot more: error handling, monitoring, circuit breakers, rate limiting, request tracing, API versioning, service-to-service (s2s) security, restricted authorization, reliable integration tests, and more. When someone figured something out, their solution was often copy-pasted at an alarming rate, replicating issues faster than they could be fixed. There was a lot of duplication.
Despite all the chaos, useful frameworks emerged out of necessity. Some became popular inside the company, and some became popular open source projects, like Twirp.
When companies like Twitch grow so rapidly, the resulting organizational structure also needs to change fast. This adds an extra level of difficulty to the ongoing migrations.
Organizations tend to mirror their communication structure in their services (Conway’s law), so developers enjoy the independence of managing their own infrastructure. But if each team focuses only on its own area of responsibility, who is looking out for blind spots?
An outage that happened in 2017 is a good example of this. The mobile team released a new version of the Twitch mobile client that caused the API servers to overload and crash. The root cause was a new retry behavior on the mobile client that was intended to make the app feel more stable, but ended up causing a request storm on the API. Both mobile and API teams did plenty of testing before the release, but no one had the responsibility to verify that the two parts worked well together.
In an attempt to address blind spots like this one, companies perform so-called “reorgs”. When a reorg happens, services that were perfectly aligned with the old org structure no longer fit the new one, effectively creating tech debt. After a reorg, it is common to have engineers reconsidering ongoing migrations, splitting services into new services, merging other services, updating access privileges, refactoring code, and doing knowledge transfers to other teams. This problem is unavoidable, but it can be minimized by making sure that services are modeled after the end-user experience (Amazon Leadership Principles like Customer Obsession can provide guidance here).
The Long Tail
Every long migration suffers from the Pareto principle, where the last 20% of the work takes 80% of the time. It was no different at Twitch. The endpoints that were harder to migrate lingered until the end. Other functionality remained happily unowned, working as expected from the business’s point of view, but falling behind on security and quality best practices. To address the long tail, some employees formed an internal working group with the specific goal of monitoring and migrating those last services.
Leftover endpoints were still getting traffic from unknown sources, maybe bots or scripts, and it was unclear what those endpoints were supposed to do. The only way to tell if it was safe to shut them down was to force 10 minutes of downtime and see if anyone would complain.
By the end of 2018, when all the traffic was moved away from the monolith, it was finally time to stop all the EC2 instances on the main AWS account. It would still take another year to fully clean up the account of dependencies and access roles (work that was completed through company-wide security audits and technology upgrades), but for all intents and purposes, the Wexit campaign was completed. Everyone in the working group got trophies, and there was a good party to celebrate!
At the time of writing, most teams at Twitch own multiple services. The language of choice for server code is Go (Twitch also uses TypeScript, Python, C++, Java/Kotlin, and Objective-C/Swift on other platforms). Each service environment is isolated in a separate AWS account, a pattern that we call “Tiny Bubbles.”
Different product teams can operate independently, managing their own infrastructure and tweaking their services as needed. Strong foundational teams provide frameworks and specific tools to help standardize the engineering stack and operational practices. Of course, code migrations are still happening as new technology becomes available, but the distributed nature of microservices allows teams to contribute at scale.
Make sure to read Part One of this series to get the full story!
Thanks for reading! This blog post was developed with the help of many individuals, most of whom were directly involved in the migration. Thank you: Angelo Paxinos, Daniel Lin, Jordan Potter, Jos Kraaijeveld, Marc Cerqueira, Matt Rudder, Mike Gaffney, Nicholas Ngorok, Raymon Lin, Rhys Hiltner, Ross Engers, Ryan Szotak, Shiming Ren, Tony Ghita.
Want to join our quest to empower live communities on the internet? Find out more about what it is like to work at Twitch on our Career Site, LinkedIn, and Instagram, or check out our Job Openings and apply.