
Live Video Transmuxing/Transcoding: FFmpeg vs TwitchTranscoder, Part I

Oct 10 2017 - By Yueshi Shen

By: Jeff Gong, Software Engineer (jeffgon@twitch.tv); Sahil Dhanju, Software Engineer Intern; Chih-Chiang Lu, Senior Software Engineer (chihchil@twitch.tv); Yueshi Shen, Principal Research Engineer (yshen@twitch.tv)

Special thanks go to Christopher Kennedy, Staff Video Engineer at Crunchyroll/Ellation, and John Nichols, Principal Software Engineer at Xilinx (jnichol@xilinx.com), for sharing their knowledge of FFmpeg and reviewing this article.

Note: This is the first part of a 2-part series. Please read part 2 after finishing this article.

Background

Twitch is the world’s leading live streaming platform for video games, esports, and other emerging creative content. Every month, more than 2.2 million unique content creators stream or upload video on our website. At its peak, Twitch ingests tens of thousands of concurrent live video streams and delivers them to viewers across the world.

Figure 1 depicts the architecture of our live video Content Delivery Network (CDN) which delivers tens of thousands of concurrent live streams internationally.

Twitch, like many other live streaming services, receives live stream uploads in Real-Time Messaging Protocol (RTMP) from its broadcasters. RTMP is a protocol designed to stream video and audio on the Internet, and is mainly used for point-to-point communication. To then scale our live stream content to countless viewers, Twitch uses HTTP Live Streaming (HLS), an HTTP-based media streaming communications protocol that most video websites also use.

Within the live stream processing pipeline, the transcoder module is in charge of converting an incoming RTMP stream into the HLS format with multiple variants (e.g., 1080p, 720p, etc.). These variants have different bitrates so that viewers with different levels of download bandwidth are able to consume live video streams at the best possible quality for their connection.

Figure 2 depicts the input and output of the transcoder module in our live video CDN.

In this article, we will discuss how FFmpeg can be used for live transmuxing and transcoding, and the technical issues that motivated us to build our own transcoder.

Using FFmpeg Directly

FFmpeg (https://www.ffmpeg.org/) is a popular open-source software project, designed to record, process and stream video and audio. It is widely deployed by cloud encoding services for file transcoding and can also be used for live stream transmuxing and transcoding.

Suppose we are receiving a stream encoded with H.264 (the most widely used video compression standard) over RTMP at 6 Mbps and 1080p60 (a resolution of 1920x1080 at a frame rate of 60 frames per second). We want to generate 4 HLS variants: 1080p60, 720p60, 720p30, and 480p30.

One solution is to run 4 independent instances of FFmpeg, each processing one variant. Here we set all of their Instantaneous Decoding Refresh (IDR) intervals to 2 seconds and turn off scene change detection, so that the output HLS segments of all variants are perfectly time-aligned, as required by the HLS standard.
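The keyint value passed to x264 follows directly from the frame rate and the 2-second IDR interval (keyint = fps × 2). A trivial arithmetic check:

```python
# Sanity check: with scene-cut detection off, forcing an IDR every 2 seconds
# means keyint must equal fps * 2 in the x264 options.
def keyint_for(fps, idr_interval_s=2):
    """x264 keyint that forces an IDR every idr_interval_s seconds."""
    return fps * idr_interval_s

print(keyint_for(60))   # 120 -> used for the 60fps variants
print(keyint_for(30))   # 60  -> used for the 30fps variants
```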

FFmpeg 1-in-1-out Sample Commands:

ffmpeg -i <input> -c:v libx264 -x264opts keyint=<keyint>:no-scenecut -s <resolution> -r <fps> -b:v <bitrate> -profile:v <profile> -c:a aac -sws_flags <scaler> -hls_list_size <size> <output>.m3u8

Notes:

Since H.264 is a lossy compression standard, transcoding inevitably degrades video quality. Moreover, encoding is a very computationally expensive process, particularly for high-resolution, high-frame-rate video. Due to these two constraints, we would ideally like to transmux rather than transcode the highest variant from the source RTMP, saving computational power and preserving video quality.

In the above example, if we want to transmux an input 1080p60 RTMP source to HLS, we can use the same command without specifying a size or target FPS, and specify copy for the codecs (to avoid decoding and re-encoding the source):

ffmpeg -i <input> -c:v copy -c:a copy -hls_list_size <size> <output>.m3u8

Transmuxing the source bitstream is an effective technique, but it may cause the output HLS to lose spec compliance, making it unplayable on certain devices. We will explain the nature of the problem and its ramifications in the next section.

On the other hand, FFmpeg does have the ability to take in 1 input and produce N outputs, which we demonstrate with the following FFmpeg command.

FFmpeg 1-in-N-out Sample Commands (use Main Profile, x264 veryfast preset, and the bilinear scaling algorithm):

ffmpeg -i <input> \
-c:v libx264 -x264opts keyint=120:no-scenecut -s 1920x1080 -r 60 -b:v <bitrate> -profile:v main -preset veryfast -c:a libfdk_aac -sws_flags bilinear -hls_list_size <size> <output1>.m3u8 \
-c:v libx264 -x264opts keyint=120:no-scenecut -s 1280x720 -r 60 -b:v <bitrate> -profile:v main -preset veryfast -c:a libfdk_aac -sws_flags bilinear -hls_list_size <size> <output2>.m3u8 \
-c:v libx264 -x264opts keyint=60:no-scenecut -s 1280x720 -r 30 -b:v <bitrate> -profile:v main -preset veryfast -c:a libfdk_aac -sws_flags bilinear -hls_list_size <size> <output3>.m3u8 \
-c:v libx264 -x264opts keyint=60:no-scenecut -s 852x480 -r 30 -b:v <bitrate> -profile:v main -preset veryfast -c:a libfdk_aac -sws_flags bilinear -hls_list_size <size> <output4>.m3u8

If we wanted to transmux instead of transcode the highest variant while transcoding the rest of the variants, we can replace the first output configuration with the previously specified copy codec:

-c:v copy -c:a copy -hls_list_size <size> <output1>.m3u8

Notes:

A Few Technical Issues

The previous section demonstrated how FFmpeg can be used to generate HLS for live streams. While useful, a few technical issues make FFmpeg a less-than-ideal solution.

Transmux and Transcode

In HLS, a variant consists of a series of segments, each starting with an IDR frame. The HLS spec requires that the IDR frames of the variants' corresponding segments be aligned, such that they have the same Presentation Timestamp (PTS). Only in this way can an HLS Adaptive Bitrate (ABR) player seamlessly switch among the variants when the viewer's network conditions change (see Figure 3).

If we transcode both the source and the rest of the variants, we will get perfectly time-aligned HLS segments, since we force FFmpeg to encode IDRs precisely every 2 seconds. However, we don't have control over the IDR intervals in the source RTMP bitstream, which are completely determined by the broadcast software's configuration. If we transmux the source, the segments of the transmuxed and transcoded variants are not guaranteed to align (see Figure 4). This misalignment can cause playback issues. For example, we have noticed that Chromecast constantly exhibits playback pauses when it receives HLS streams with misaligned segments.
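The alignment argument can be checked with a little arithmetic: a forced IDR every keyint frames places segment boundaries at exact multiples of keyint/fps, so variants sharing a 2-second interval always cut at the same presentation times. A small sketch (the 2.5-second source cadence below is an assumed example, not a measured value):

```python
# With forced IDRs, segment boundaries fall at exact multiples of the IDR
# interval, so every transcoded variant cuts at the same presentation times.
def segment_boundaries_s(keyint, fps, duration_s):
    """PTS (in seconds) of each forced IDR, i.e., each HLS segment start."""
    interval = keyint / fps
    t, boundaries = 0.0, []
    while t < duration_s:
        boundaries.append(t)
        t += interval
    return boundaries

# 720p60 (keyint=120) and 720p30 (keyint=60) both cut every 2 seconds:
print(segment_boundaries_s(120, 60, 10))  # [0.0, 2.0, 4.0, 6.0, 8.0]
print(segment_boundaries_s(60, 30, 10))   # [0.0, 2.0, 4.0, 6.0, 8.0]

# A transmuxed source keeps the broadcaster's own IDR cadence (assume, say,
# one IDR every 2.5 s), so its segment starts drift away from the 2 s grid:
print(segment_boundaries_s(150, 60, 10))  # [0.0, 2.5, 5.0, 7.5]
```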

For the source RTMP stream with variable IDR intervals, we ideally want the output HLS to look aligned like in Figure 5:

However, in both the 1-in-1-out and the 1-in-N-out FFmpeg setups, the N encoders associated with the N output variants are independent. As explained above, the resulting IDRs will not be aligned (see Figure 4) unless all variants are transcoded (i.e., the highest variant is also transcoded rather than transmuxed from the source).

Software Performance

As shown in Figure 2, our RTMP-to-HLS transcoder takes in 1 input stream and produces N output streams, where N is the number of HLS variants (e.g., N = 4 in Figure 5). The simplest way to achieve this is to create N independent 1-in-1-out transcoders, each generating 1 output stream. The FFmpeg solution described above follows this model, running N FFmpeg instances.

There are 3 components within a 1-in-1-out transcoder: a decoder, a scaler, and an encoder (see Figure 6). Therefore, with N FFmpeg instances, we have N decoders, N scalers, and N encoders altogether.

Since the N decoders are identical, the transcoder should ideally eliminate the redundant N-1 decoders and feed decoded images from the only decoder to the N downstream scalers and encoders (see Figure 7).
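A minimal sketch of this shared-decoder fan-out, using Python threads and queues with stand-in decode/encode functions (purely illustrative, not Twitch's actual code; a real transcoder would wrap libavcodec or similar):

```python
# One decoder thread fans decoded frames out to N per-variant queues,
# so the source is decoded exactly once instead of N times.
import queue
import threading

N_VARIANTS = 4

def decode(stream):
    """Stand-in decoder: yields 'decoded frames' from the input stream."""
    for frame in stream:
        yield frame

def decoder_thread(stream, variant_queues):
    for frame in decode(stream):
        for q in variant_queues:   # fan out: one decode, N consumers
            q.put(frame)
    for q in variant_queues:
        q.put(None)                # sentinel: end of stream

def encoder_thread(q, out):
    while (frame := q.get()) is not None:
        out.append(f"encoded({frame})")  # stand-in for scale + encode

source = range(3)                  # pretend input frames
queues = [queue.Queue() for _ in range(N_VARIANTS)]
outputs = [[] for _ in range(N_VARIANTS)]
workers = [threading.Thread(target=encoder_thread, args=(q, o))
           for q, o in zip(queues, outputs)]
for w in workers:
    w.start()
decoder_thread(source, queues)
for w in workers:
    w.join()
print(outputs[0])  # ['encoded(0)', 'encoded(1)', 'encoded(2)']
```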

We talked about the decoder redundancy above. Now, let’s take another look at the four-variant example.

The two transcoded variants in this example, 720p60 and 720p30, can share one scaler. Our experiments show that scaling is a very computationally expensive step in the transcoding pipeline, so avoiding repeated scaling significantly improves the performance of our transcoder. Figure 8 depicts a threading model in which the 720p60 and 720p30 variants share a scaler.
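The shared-scaler idea can be sketched with stand-in scale and frame-drop steps (illustrative only): each 1080p frame is scaled to 720p exactly once, and the same scaled frame feeds both the 720p60 branch and the 720p30 branch, which keeps every other frame.

```python
# Scale each decoded frame to 720p once, then feed the same scaled frame to
# both the 720p60 and the 720p30 encoder paths.
def scale_720p(frame):
    return f"720p({frame})"          # stand-in for one shared scaler

def share_scaler(decoded_frames):
    out_720p60, out_720p30 = [], []
    for i, frame in enumerate(decoded_frames):
        scaled = scale_720p(frame)   # scaled exactly once per source frame
        out_720p60.append(scaled)
        if i % 2 == 0:               # 60fps -> 30fps: keep alternate frames
            out_720p30.append(scaled)
    return out_720p60, out_720p30

v60, v30 = share_scaler(range(4))
print(v60)  # ['720p(0)', '720p(1)', '720p(2)', '720p(3)']
print(v30)  # ['720p(0)', '720p(2)']
```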

Besides decoder and scaler sharing, an even more important feature is multithreading. Since both encoding and scaling are very computationally expensive, it is critical to exploit modern multi-core CPU architectures and process the multiple variants simultaneously. In our experiments, multithreading proved very useful for achieving higher density and is critical for demanding applications such as 4K.

Custom Features

FFmpeg is a versatile video processing tool supporting various video/audio formats for the standard ABR transcoding workflow. However, it cannot handle a number of technical requirements that are specific to Twitch's operation. For example:

1) Frame rate downsampler

Our typical incoming bitstreams arrive at 60fps (frames per second), and we transcode them to 30fps for lower-bitrate variants (e.g., 720p30, 480p30, etc.). On the other hand, since Twitch is a global platform, we often receive incoming streams at 50fps, mostly from PAL countries. In this case, the lower-bitrate variants should be downsampled to 25fps, not 30fps.

Simply deleting every second frame is not a good solution here. Our downsampler needs to behave differently for two kinds of incoming bitstreams: one with a constant frame rate below 60fps, and one with irregular frame drops that bring its average frame rate below 60fps.
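One way to handle both cases is to select frames by timestamp rather than by index: emit the first frame at or after each output tick. The sketch below is purely illustrative (not Twitch's actual downsampler) and uses exact fractions to keep the tick arithmetic precise:

```python
# Timestamp-driven frame-rate downsampler: emit the first frame at or after
# each output tick. Works for constant sub-60fps input and for streams with
# irregular frame drops alike, unlike naive "drop every second frame".
from fractions import Fraction

def downsample(pts_list, target_fps):
    """pts_list: ascending presentation timestamps in seconds (Fractions)."""
    tick = Fraction(1, target_fps)
    next_pts = Fraction(0)
    kept = []
    for pts in pts_list:
        if pts >= next_pts:
            kept.append(pts)
            next_pts += tick      # advance to the next output slot
    return kept

# Constant 60fps -> 30fps keeps every other frame:
sixty = [Fraction(i, 60) for i in range(6)]
print([str(p) for p in downsample(sixty, 30)])   # ['0', '1/30', '1/15']

# Irregular input (frames 1 and 4 dropped upstream) still tracks 30fps ticks:
ragged = [Fraction(i, 60) for i in (0, 2, 3, 5)]
print(len(downsample(ragged, 30)))               # 3
```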

2) Metadata insertion

Certain information needs to be inserted into the HLS bitstreams to enhance the user experience on the player side. By building our own transcoder and player, Twitch can control the full end-to-end ingest-transcode-CDN-playback pipeline. This allows us to insert proprietary metadata structures into the transcoder output, which are eventually parsed by our player and utilized to produce effects unique to Twitch.
