Twitch Data Analysis – Part 2 of 3: The Architectural Decision Making ProcessApr 16 2014 · 2 comments · Engineering, Tech
This is part two in a three part series on our statistics pipeline. The first part described our pipeline. In this part, we’ll go into detail concerning particular design decisions. The third part will cover the history of analytics at Twitch. Head over to our team page to get to know more about what we do.
When bringing together our Data Science team we established a clear goal:
Make our co-workers want to give us their data.
A desirable “3rd party system” (i.e. non-mission critical and not built by the core engineering team) must be rock solid and fail predictably. Those that violate these criteria tend not to be used for long. With this in mind, we created our “trifecta rubric”:
- Ease of integration
Each of our key architectural decisions are motivated by at least two of these principles.
As detailed in the first post, our clients send base64 encoded json data using HTTP GETs. At the core of this decision was ease of integration; by choosing to stay Mixpanel compatible we were able to use our existing Mixpanel client libraries. This permitted our engineering team to continue to use the clients they trust and removed the need for us to build and maintain a new set of clients in various languages. The use of GETs is driven by the same-origin policy that browsers enforce. We’ve a far wider range of clients than browsers and as such it is on our short term roadmap to support POSTs. As CORS becomes more widely supported, we’ll embrace POSTs there also.
Storage & Querying
Simplicity and scale form the basis of our choice for RedShift. We needed a system that could store many terabytes of data while permitting fast and simple fetching of that data. Any time you talk about huge volumes of data, Hadoop and Cassandra are natural contenders. RedShift was the standout option for us on account of it being PostgreSQL based and, as such, having great client support along with the ability to support arbitrary joins. When coupled with the COPY command for bulk inserts (we get 100k inserts a second, and haven’t begun optimizing yet) RedShift is a tectonic level play by Amazon; our buddies over at Pinterest and Airbnb agree and both have great write-ups on how they use it.
Core Pipeline Implementation
The decision for how to transform data from our clients into rows in RedShift was influenced by all three elements of our rubric. We knew we would need something fast with good concurrency primitives, and support for the libraries that we’d want to use. We ended up with Clojure and Golang as our main candidates. Java and C ruled themselves out for not having simple concurrency primitives. Clojure, on the other hand, has great concurrency primitives and can interop with Java very elegantly.
In the end, Golang won out over Clojure due to institutional knowledge; we’ve a number of engineers at Twitch that are deeply into Golang. It is not unusual to stumble into a lunchtime conversation covering the pros and cons of the new precise garbage collector and the various patches to Golang we need due to page size allocation changes introduced as part of Go 1.2. We have reached this point through being an early adopter of Golang and running it under some incredibly high loads, oftentimes discovering aspects of the runtime that were interesting to the Golang core development team. Golang’s approach to concurrency is beautiful and the decision to build statically linked binaries makes deployment very easy. As a result almost all green field projects at Twitch are in Golang. I can honestly say, if you want to work with Go there is nowhere better than Twitch.
Given we’re hosted on Amazon we could also have opted to use Data Pipeline or Kinesis. Kinesis didn’t exist when we first started work on this project, though with it being so highly Java orientated we may have still opted not to use it. One of the things we’re super excited about is open sourcing all of our work, as we firmly believe that making good product decisions shouldn’t be something that is bound by available technology. We’re currently sweeping through code replacing AWS credentials with IAM roles from the UserData API (we recently patched goamz to fix role timeout issues); when we’re done we’ll release everything! Expect more on this in a couple of weeks.
Personally one of the most significant technology decisions I think we’ve made is to host our pipeline on AWS. We have historically bought our own hardware and co-located it at various datacenters. This works wonders for managing the cost of a world wide video CDN; our Ops and Video engineers are constantly finding ways to squeeze more out of our current set of machines, while ensuring our next generation of machines can achieve the best QoS for our users. When it came to our data pipeline, we really had no idea what we would be bound by; we initially opted for two beefy boxes in our SFO datacenter. The first weekend taking significant traffic we exhausted our available RAM and our pipeline ground to a halt. Faced with the option of finding significant memory improvements or provisioning more capacity, we chose to move to the cloud as it would remove the burden we’d put on our Ops team by iterating on our stack. When it came to cloud partners we had a couple of options: Digital Ocean, Rackspace, or AWS. Since we were already using S3 and RedShift, AWS made the most sense.
Our decisions to make things simple, scalable, and easy to integrate has permitted us to build a very efficient pipeline. If you share our passion for making smart decisions and being a member of a high leverage team, then let us know as we’re currently looking for our 5th team member.