By: Jeff Gong, Software Engineer, firstname.lastname@example.org; Sahil Dhanju, Software Engineer Intern; Chih-Chiang Lu, Senior Software Engineer, email@example.com; Yueshi Shen, Principal Research Engineer, firstname.lastname@example.org
Special thanks go to Christopher Kennedy, Staff Video Engineer at Crunchyroll/Ellation, and John Nichols, Principal Software Engineer at Xilinx, email@example.com, for their information on FFmpeg and for reviewing this article.
Note: This is the first part of a 2-part series. Please read part 2 after finishing this article.
Twitch is the world’s leading live streaming platform for video games, esports, and other emerging creative content. Every month, more than 2.2 million unique content creators stream or upload video on our website. At its peak, Twitch ingests tens of thousands of concurrent live video streams and delivers them to viewers across the world.
Figure 1 depicts the architecture of our live video Content Delivery Network (CDN) which delivers tens of thousands of concurrent live streams internationally.
Twitch, like many other live streaming services, receives live stream uploads in Real-Time Messaging Protocol (RTMP) from its broadcasters. RTMP is a protocol designed to stream video and audio on the Internet, and is mainly used for point-to-point communication. To then scale our live stream content to countless viewers, Twitch uses HTTP Live Streaming (HLS), an HTTP-based media streaming protocol that most video websites also use.
Within the live stream processing pipeline, the transcoder module is in charge of converting an incoming RTMP stream into the HLS format with multiple variants (e.g., 1080p, 720p, etc.). These variants have different bitrates so that viewers with different levels of download bandwidth are able to consume live video streams at the best possible quality for their connection.
Figure 2 depicts the input and output of the transcoder module in our live video CDN.
In this article, we will discuss:
how FFmpeg satisfies most of the live transcoding requirements,
what features FFmpeg doesn’t provide, and
why Twitch builds its own in-house transcoder software stack.
Using FFmpeg Directly
FFmpeg (https://www.ffmpeg.org/) is a popular open-source software project, designed to record, process and stream video and audio. It is widely deployed by cloud encoding services for file transcoding and can also be used for live stream transmuxing and transcoding.
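Suppose we are receiving an RTMP stream encoded with H.264, the most widely used video compression standard, at 6 Mbps and 1080p60 (a resolution of 1920 by 1080 at a frame rate of 60 frames per second). We want to generate 4 HLS variants from it, among them the 1080p60, 720p60, and 720p30 variants used in the examples throughout this article.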
One solution is to run 4 independent instances of FFmpeg, each of them processing one variant. Here we set all of their Instantaneous Decoding Refresh (IDR) intervals to 2 seconds and turn off scene change detection, so that the output HLS segments of all variants are perfectly time aligned, which the HLS standard requires.
FFmpeg 1-in-1-out Sample Commands:
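ffmpeg -i <input file> -c:v libx264 -x264opts keyint=<IDR interval in frames>:no-scenecut -s <resolution> -r <fps> -b:v <target bitrate> -profile:v <profile> -c:a aac -sws_flags <scaling algorithm> -hls_list_size <number of playlist entries> <output file>.m3u8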
All parameters surrounded by ‘<>’ require inputs from the user. Some of these are described in greater detail below:
c:v specifies the video codec to use, in our case, libx264
x264opts specifies libx264-specific options. Here, the IDR interval (keyint) should be 2 * your desired FPS, so 720p60 would yield an IDR interval of 120 frames while 720p30 would require an IDR interval of 60 frames. no-scenecut disables scene change detection
s specifies video size which can be in the form of “width x height” or the name of a size abbreviation
r specifies the FPS
b:v specifies a target video bitrate, which is most useful for streaming when there are bandwidth targets and requirements; there is a companion b:a for audio
profile:v specifies the H.264 profile
sws_flags determines what scaling algorithm should be used
hls_list_size sets the maximum number of segments in the playlist (e.g., we can use 6 for live streaming, or set it to 0 to have a playlist of all the segments). The segment duration (the optional hls_time flag) will be the same as the IDR interval, which in our case is 2 seconds.
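The flag descriptions above can be sketched in Python as a small helper that derives the IDR interval from the frame rate and assembles the 1-in-1-out argument list. This is only an illustration: the function names, the input URL, the bitrate, and the output file name are assumptions rather than values from this article, and the script prints the command instead of running FFmpeg.

```python
import shlex


def idr_interval_frames(fps: int, idr_seconds: int = 2) -> int:
    """IDR interval in frames: the IDR period in seconds times the frame rate."""
    return idr_seconds * fps


def one_in_one_out_args(src, out, size, fps, bitrate,
                        profile="main", scaler="bilinear", list_size=6):
    """Build the FFmpeg argument list for one transcoded HLS variant."""
    keyint = idr_interval_frames(fps)
    return [
        "ffmpeg", "-i", src,
        "-c:v", "libx264",
        "-x264opts", f"keyint={keyint}:no-scenecut",
        "-s", size, "-r", str(fps), "-b:v", bitrate,
        "-profile:v", profile,
        "-c:a", "aac",
        "-sws_flags", scaler,
        "-hls_list_size", str(list_size),
        out,
    ]


# Hypothetical input URL, bitrate, and output name, for illustration only.
args = one_in_one_out_args("rtmp://example.invalid/live/stream",
                           "720p30.m3u8", "1280x720", 30, "2500k")
print(shlex.join(args))
```

Running one such command per variant corresponds to the "4 independent instances" approach described above.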
Since H.264 is a lossy compression standard, transcoding will inevitably trigger video quality degradation. Moreover, encoding is a very computationally expensive process, particularly for high resolution and high frame rate video. Due to these two constraints, we would ideally like to transmux rather than transcode the highest variant from the source RTMP to save the computational power and preserve the video quality.
In the above example, if we want to transmux an input 1080p60 RTMP source to HLS, we can actually use the above commands without specifying a size or target FPS and specifying copy for the codecs (to avoid decoding and re-encoding the source):
ffmpeg -i <input file> -c:v copy -c:a copy -hls_list_size <number of playlist entries> <output file>.m3u8
Transmuxing the source bitstream is an effective technique, but it might cause the output HLS to lose its spec compliance, making it unplayable on certain devices. We will explain the nature of the problem and its ramifications in the next section.
On the other hand, FFmpeg does have the functionality of taking in 1 input and producing N outputs, which we demonstrate with the following FFmpeg command.
FFmpeg 1-in-N-out Sample Commands (use Main Profile, x264 veryfast preset, and the bilinear scaling algorithm):
ffmpeg -i <input file> \
-c:v libx264 -x264opts keyint=120:no-scenecut -s 1920x1080 -r 60 -b:v <target bitrate> -profile:v main -preset veryfast -c:a libfdk_aac -sws_flags bilinear -hls_list_size <number of playlist entries> <output file>.m3u8
If we wanted to transmux instead of transcode the highest variant while transcoding the rest of the variants, we can replace the first output configuration with the previously specified copy codec:
-c:v copy -c:a copy -hls_list_size <number of playlist entries> <output file>.m3u8
With the above command, we transcode multiple variants out of a single input file. Each “\” denotes a line continuation, after which we can specify a different combination of flags as well as a unique output name. Each output configuration is independent of the others and can use any combination of flags.
The primary differences between the output configurations can be seen in the s and r flags, which are explained earlier in the article.
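As a rough sketch of how such a 1-in-N-out command is put together, the Python helper below appends one group of output flags per variant to a single `ffmpeg -i` invocation, optionally using the copy codecs for the first output. The variant table, bitrates, input URL, and file names are hypothetical placeholders, and the command is only printed, not executed.

```python
import shlex


def transcode_output(size, fps, bitrate, out, list_size=6):
    """Flags for one transcoded output (Main Profile, veryfast, bilinear)."""
    return [
        "-c:v", "libx264",
        "-x264opts", f"keyint={2 * fps}:no-scenecut",
        "-s", size, "-r", str(fps), "-b:v", bitrate,
        "-profile:v", "main", "-preset", "veryfast",
        "-c:a", "libfdk_aac", "-sws_flags", "bilinear",
        "-hls_list_size", str(list_size), out,
    ]


def one_in_n_out(src, variants, transmux_first=False, list_size=6):
    """One input, N outputs: each variant contributes its own flag group."""
    cmd = ["ffmpeg", "-i", src]
    for index, (size, fps, bitrate, out) in enumerate(variants):
        if index == 0 and transmux_first:
            # Copy the source bitstream instead of decoding and re-encoding it.
            cmd += ["-c:v", "copy", "-c:a", "copy",
                    "-hls_list_size", str(list_size), out]
        else:
            cmd += transcode_output(size, fps, bitrate, out, list_size)
    return cmd


# Hypothetical variant table; the bitrates are illustrative assumptions.
VARIANTS = [
    ("1920x1080", 60, "6000k", "1080p60.m3u8"),
    ("1280x720", 60, "3500k", "720p60.m3u8"),
    ("1280x720", 30, "2500k", "720p30.m3u8"),
]
cmd = one_in_n_out("rtmp://example.invalid/live/stream", VARIANTS,
                   transmux_first=True)
print(shlex.join(cmd))
```

Only the size, frame rate, bitrate, and output name change between the flag groups, which mirrors the observation above about the s and r flags.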
An alternative to running the above transcodes in a single FFmpeg instance is to run multiple instances in parallel, one for each desired output. The 1-in-N-out FFmpeg is the computationally cheaper option, for reasons we will explain next.
A Few Technical Issues
The previous section demonstrated how FFmpeg can be used to generate HLS for live streams. While useful, a few technical issues make FFmpeg a less-than-ideal solution.
Transmux and Transcode
In HLS, a variant consists of a series of segments, each starting with an IDR frame. The HLS spec requires that the IDR frames of the variants’ corresponding segments be aligned, such that they have the same Presentation Timestamp (PTS). Only in this way can an HLS Adaptive Bitrate (ABR) player seamlessly switch among the variants when the viewer’s network condition changes (see Figure 3).
If we transcode both the source and the rest of the variants, we will get perfectly time-aligned HLS segments since we force FFmpeg to encode IDRs precisely every 2 seconds. However, we don’t have control over the IDR intervals in the source RTMP bitstream, which are completely determined by the broadcast software’s configuration. If we transmux the source, the segments of the transmuxed and transcoded variants are not guaranteed to align (see Figure 4). This misalignment can cause playback issues. For example, we have noticed that Chromecast constantly exhibits playback pauses when it receives HLS streams with misaligned segments.
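To make the alignment requirement concrete, here is a minimal Python sketch that compares the PTS of the IDR frame starting each segment across variants and reports the segment indices that do not line up. The function name and the sample timestamps are invented for illustration; they are not measurements from a real stream.

```python
def misaligned_segments(variant_idr_pts):
    """Return the segment indices whose starting IDR PTS differ across variants.

    variant_idr_pts maps a variant name to the list of PTS values (here in
    whole seconds) of the IDR frames that start each of its segments.
    """
    names = list(variant_idr_pts)
    reference = variant_idr_pts[names[0]]
    bad = set()
    for name in names[1:]:
        pts = variant_idr_pts[name]
        for i in range(max(len(reference), len(pts))):
            # A missing segment on either side also counts as misaligned.
            if i >= len(reference) or i >= len(pts) or reference[i] != pts[i]:
                bad.add(i)
    return sorted(bad)


# Transcoded variants get an IDR exactly every 2 seconds, while the
# transmuxed source keeps whatever IDR spacing the broadcaster produced.
example = {
    "720p60 (transcoded)": [0, 2, 4, 6],
    "720p30 (transcoded)": [0, 2, 4, 6],
    "source (transmuxed)": [0, 3, 5, 6],
}
print(misaligned_segments(example))
```

With these sample values, the second and third segments of the transmuxed source start at different timestamps than their transcoded counterparts, which is the situation an ABR player cannot switch across cleanly.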
For the source RTMP stream with variable IDR intervals, we ideally want the output HLS to look aligned like in Figure 5:
However, in both the 1-in-1-out and the 1-in-N-out FFmpeg instances, the N encoders associated with the N output variants are independent. As explained above, the resulting IDRs will not be aligned (see Figure 4) unless all variants are transcoded (i.e., the highest variant is also transcoded rather than transmuxed from the source).
As shown in Figure 2, our RTMP-HLS transcoder takes in an input of 1 stream and produces an output of N streams (N is the number of HLS variants; e.g., N = 4 in Figure 5). The simplest way to achieve this output is to create N independent 1-in-1-out transcoders, with each generating 1 output stream. The FFmpeg solution described above utilizes this model and has N FFmpeg instances.
There are 3 components within a 1-in-1-out transcoder, namely decoder, scaler, and encoder (see Figure 6). Therefore, for N FFmpeg instances, we will have N decoders, N scalers, and N encoders altogether.
Since the N decoders are identical, the transcoder should ideally eliminate the redundant N-1 decoders and feed decoded images from the only decoder to the N downstream scalers and encoders (see Figure 7).
We talked about the decoder redundancy above. Now, let’s take another look at the four-variant example, which includes the 720p60 and 720p30 HLS/H.264 variants.
The two transcoded variants 720p60 and 720p30 in this example can share one scaler, since both are scaled to the same 720p resolution. Our experiments show that scaling is a very computationally expensive step in the transcoding pipeline, so avoiding the repeated scaling can significantly improve the performance of our transcoder. Figure 8 depicts a threading model that combines the scalers of the 720p60 and 720p30 variants.
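The savings from sharing the decoder and the scalers can be tallied with a short Python sketch. It compares N independent 1-in-1-out transcoders against a layout with a single decoder and one scaler per distinct output resolution. The function name and the tuple layout are assumptions made for this example; the variant list uses only the two 720p variants named above.

```python
from collections import defaultdict


def plan_components(variants):
    """Compare component counts for the naive and the shared layouts.

    variants is a list of (name, width, height, fps) tuples.
    """
    n = len(variants)
    # Naive layout: every variant has its own decoder, scaler, and encoder.
    naive = {"decoders": n, "scalers": n, "encoders": n}

    # Shared layout: one decoder, and variants with the same output
    # resolution share a single scaler. Each variant keeps its own encoder.
    scaler_groups = defaultdict(list)
    for name, width, height, _fps in variants:
        scaler_groups[(width, height)].append(name)
    shared = {"decoders": 1, "scalers": len(scaler_groups), "encoders": n}
    return naive, shared, dict(scaler_groups)


naive, shared, groups = plan_components([
    ("720p60", 1280, 720, 60),
    ("720p30", 1280, 720, 30),
])
print("naive: ", naive)
print("shared:", shared)
print("scaler groups:", groups)
```

For these two variants the shared layout needs one decoder and one scaler instead of two of each, while the number of encoders stays the same.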
Besides decoder and scaler sharing, a more important feature is multithreading. Since both the encoder and the scaler are very computationally expensive, it is critical to utilize modern multi-core CPU architectures and process the multiple variants simultaneously. In our experiments, we found multithreading very useful for achieving higher density, and it is critical for certain applications like 4K.
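The threading structure can be outlined in Python with one worker thread per variant fed by a single shared decode loop. This is a structural sketch only, not Twitch's implementation: the per-frame work is a string-formatting stand-in for scaling and encoding, the names are made up, and a production transcoder would do the heavy work in native code where threads actually run in parallel on separate cores.

```python
import queue
import threading


def run_pipeline(frames, variant_names):
    """Decode once, then fan each frame out to one worker thread per variant."""
    queues = {name: queue.Queue() for name in variant_names}
    outputs = {name: [] for name in variant_names}

    def worker(name):
        while True:
            frame = queues[name].get()
            if frame is None:  # sentinel: the decoder has no more frames
                break
            # Stand-in for the real scale + encode step of this variant.
            outputs[name].append(f"{name}:{frame}")

    threads = [threading.Thread(target=worker, args=(name,))
               for name in variant_names]
    for thread in threads:
        thread.start()

    for frame in frames:  # the single shared "decoder"
        for q in queues.values():
            q.put(frame)
    for q in queues.values():
        q.put(None)
    for thread in threads:
        thread.join()
    return outputs


result = run_pipeline(range(3), ["720p60", "720p30"])
print(result)
```

Each worker writes only to its own output list, so no locking is needed beyond what the queues already provide.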
FFmpeg is a versatile video processing tool that supports various video/audio formats for the standard ABR transcoding workflow. However, it cannot handle a number of technical requirements that are specific to Twitch’s operation. For example:
1) Frame rate downsampler
Our typical incoming bitstreams have 60fps (frames per second), and we transcode them to 30fps for lower bitrate variants (e.g., 720p30, 480p30, etc.). On the other hand, since Twitch is a global platform, we often receive incoming streams of 50fps, most of which are from PAL countries. In this case, the lower bitrate variant should be downsampled to 25fps, instead of 30fps.
Simply deleting every second frame is not a good solution here. Our downsampler needs to behave differently for the two different kinds of incoming bitstreams. One kind has constant frame rates less than 60fps, and the other has irregular frame dropping, which makes its average frame rates less than 60fps.
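One possible way to handle both cases, sketched below in Python, is to choose frames by timestamp rather than by position: each output frame slot keeps the first input frame whose PTS falls into it. This is a simplified illustration and not a description of Twitch's downsampler; the 90 kHz timebase, the function names, and the sample timestamps are assumptions, and the input is assumed to be in presentation order.

```python
TICKS_PER_SECOND = 90_000  # assumed PTS timebase for this sketch


def pick_target_fps(source_fps):
    """Halve high frame rates: 60fps becomes 30fps and 50fps becomes 25fps."""
    return source_fps // 2 if source_fps > 30 else source_fps


def downsample(pts_list, out_fps):
    """Keep the first frame that lands in each output frame slot.

    pts_list holds integer PTS values in presentation order.
    """
    kept = []
    last_slot = None
    for pts in pts_list:
        slot = pts * out_fps // TICKS_PER_SECOND
        if slot != last_slot:
            kept.append(pts)
            last_slot = slot
    return kept


# Constant 60fps input: one frame every 1500 ticks.
regular = [1500 * i for i in range(8)]
# The same stream with the frame at 3000 ticks dropped by the broadcaster.
irregular = [0, 1500, 4500, 6000, 7500]

print(downsample(regular, pick_target_fps(60)))
print(downsample(irregular, pick_target_fps(60)))
```

For the constant-rate input this is equivalent to keeping every second frame. For the input with a dropped frame, it keeps the frames at 0, 4500, and 6000 ticks, whereas keeping every second frame by position would pick 0, 4500, and 7500 and leave a more uneven cadence.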
2) Metadata insertion
Certain information needs to be inserted into the HLS bitstreams to enhance the user experience on the player side. By building our own transcoder and player, Twitch can control the full end-to-end ingest-transcode-CDN-playback pipeline. This allows us to insert proprietary metadata structures into the transcoder output, which are eventually parsed by our player and utilized to produce effects unique to Twitch.