Thanks for visiting Twitch’s Engineering Blog! This article is mainly written for folks with a basic understanding of Databases and Data Lakes. If you aren’t sure what those are, click here to learn more.

In this piece, you’ll learn how Twitch has used Views with our Data Lake as quick, two-way doors to minimize data downtime, empower developer workflows, quickly adapt to new challenges, and reduce costs. This is also an endorsement of AWS Glue Data Catalog Views which was released on August 8, 2024.

Overview

When I started building our Data Lake back in early 2017, I never imagined it would grow to be over 100 petabytes, over 8,000 datasets, and consumed by half of Twitch’s staff. At the time, I was just trying to keep our single 96 ds2.16xlarge Redshift cluster’s disk below the red line and keep its users from clogging the query queues. Maintaining a reliable service while also swapping out all of its internals is challenging, and it didn’t always go well. It was a rough time for my team, and we have hundreds of related memes to prove it.

The largest datasets in our cluster were clickstream data from our inhouse version of Mixpanel. I started building our Data Lake by offloading the older data as Parquet onto S3, and reattaching that data to the cluster as External Tables queryable by Redshift Spectrum. This initial implementation succeeded at offloading data, but it required users to know which interface they were querying and Redshift Spectrum queries took longer to execute.

Thankfully, Late Binding Views released quickly thereafter and I was able to leverage them for unioning the recent and trailing datasets into a consistent data interface. At this point, we felt we had a handle on horizontally scaling our storage, but we were running out of compute capacity and needed to shift out of a single cluster.

To accomplish this, I moved all data loads to our Data Lake and pointed each View to just the Data Lake Tables. I also built a service to keep each attached cluster in sync with all of the Tables and Views in the Data Lake. This yielded our current data architecture where each dataset is composed of at least one External Table fronted by a View. Today, this service supports hundreds of Redshift clusters (both provisioned and serverless) to meet the needs of current Twitch staff.

Through this effort, this service needed to support existing rename (altering a Table by giving a column a new name) and varchar resize (altering a Table by changing the size of a string column) operations on all datasets. These use cases aren’t supported well by classic Data Lake technologies, so you will read below how they were easily accomplished using Views.

Use Cases

A View provides logical data independence through a consistent, adaptable data interface and has enabled us to realize a number of unexpected benefits over the years in the areas of data agility, downtime, and reprocessing. The following sections elaborate on some specific situations.

Development workflows

A critical capability of our ETL system involves being able to stage changes to a daily partitioned dataset in parallel to the live version before promotion. This allows backfilling, soak time, and comparison testing. We do this by setting up versions of a dataset each with their own View, and the act of promotion is changing the live View to point to the staged External Table (much like an atomic swap). While it is possible to do this same workflow by modifying the partitions of the live External Table, it involves data downtime.

Column Renames

Imagine you constructed the perfect dataset only to realize you misspelled a column name, or after you built your dataset the company rebranded the product. In a classic Data Lake, changing the name of your column would require you to reprocess all of your data with the new column name. Even though none of the underlying data has changed, it is expensive to fetch, modify and store a completely new copy on S3. In many cases, companies accept the ongoing cognitive burden of using the original names instead of making the data interface more usable. We are able to handle this without restating data nor imposing data downtime by adding a simple column alias to the View.

SELECT "referer" as "referrer" ...

VARCHAR Resizes

The size of VARCHAR columns in data can have a significant impact on query results spilling to disk (link), but truncating data when we write it to S3 is a one-way door. Our Data Lake writes the full length string data in Parquet on S3 while limiting the length of it via a typecast in the Views. If we made a mistake and needed longer lengths, we can modify it with a simple View change, and all the historical data is immediately available at that increased length. As far as I know, this feature is unique to our Data Lake.

SELECT channel_name::varchar(25) ...

This setup also allowed us to work around a longstanding Redshift Spectrum bug where multi-byte characters spanning the configured length of a column would cause simple string functions (like LEN) to throw an error. In our Data Lake, all External Tables are defined at varchar(max) and the Views themselves perform truncation.

Rapid Response

Views have allowed us to respond rapidly to data quality issues that otherwise would be more difficult to quickly resolve like broken transformations in our pipelines or clients sending malformed, duplicated, or incorrect telemetry data.

To address these, I would either add predicate logic to the View query or add custom transformation logic on the columns. Once service was restored, I could schedule when to clean up the underlying data and eventually undo the View changes.

View changes are Two-way doors

Changing a View is easy, quick to rollout, and quick to rollback. They are ideal for risky changes that require a two-way door. In addition to the operations defined in the previous sections, we have used them to gradually roll out breaking performance changes to core datasets and are also exploring using them for implementing soft retention periods on datasets.

Rainbows and Unicorns

Building out and coordinating External Tables with late binding Views has come at the cost of replicating the View definitions across our fleet of Redshift clusters. View modifications competing for cluster resources is a constant source of synchronization delays resulting in inconsistent query results across clusters. We would like to prevent unapproved, accidental External Table usage; and, finally, adoption of third-party tools is slow because they often need a translation layer to work with our Views.

These limitations are why I’m super excited about transitioning our setup to utilize AWS Glue Data Catalog Views. Once completed, users will no longer have access to the underlying Table definitions, and replication is handled for us by AWS Resource Access Manager. We can also define a different View with reduced scope for each use case on top of a single Table, whereas today we have to duplicate the data to achieve the same result. Finally, the dialects on these Views will allow us to easily support Athena and, eventually, Spark.

Conclusion

In nearly all complex situations, Views have allowed us to proceed with speed and confidence. They provide a better, more durable interface to your data than the conventional Data Lake Tables. I hope I have convinced you that you should be standardizing on Views as your data interface, and that you utilize AWS Glue Data Catalog Views when you do so.

About Me

My name is Sean Lydon and I am a Principal Engineer in the Data Platform team. I have worked at Twitch since February 2017. I am passionate about building large data systems that enable my coworkers and the products they build. When I’m not wrangling my two little boys, I like to play Guild Wars 2, and my streamer handle is vacuuna.

Views pwn Tables as data interfaces

Overview

Use Cases

Rainbows and Unicorns

Conclusion

About Me

Introducing Shared Viewership: The Next Step in Making Collaborations Work Better on Twitch

The 2024 Game Awards are live on Twitch December 12th