By: Akrum Elkhazin, Video Algorithm Architect, NGCodec, firstname.lastname@example.org Avinash Ramachandran, Video Software Architect, NGCodec, email@example.com Roshan Baliga, Product Manager, Google, firstname.lastname@example.org Jai Krishnan, Product Manager, Google, email@example.com Tarek Amara, Senior Video Specialist, Twitch, firstname.lastname@example.org Alex Converse, Senior Software Engineer, Twitch, email@example.com Yueshi Shen, Principal Research Engineer, Twitch, firstname.lastname@example.org
Video compression is the key to successful delivery of digital video across various applications like broadcast, teleconference, surveillance, and online streaming services. Since 2003 (i.e., 15 years ago), H.264 has been the state-of-the-art video compression format and has enabled HDTV, Blu-ray Disc, Internet video websites (e.g., YouTube, Twitch), and so on. Nevertheless, according to Twitch’s recent analysis, H.264 has reached its compression performance limit, particularly for real-time encoding of gaming content at the HD resolution (1080p60). On the other hand, newer-generation video standards, namely VP9, HEVC, and AV1, show significant compression gain, which can bring considerable commercial benefits to content platforms (e.g., offering viewers better video quality, reducing the video loading time and the buffering rate, increasing the customer reach, decreasing the IP transit cost).
Currently, although decoding and playback of VP9 video are widely supported on devices and browsers used by Twitch’s audience, encoding gaming video content with a high efficiency and real-time performance is a substantial challenge due to the high complexity of VP9. Through a rigorous feasibility study, we have eventually selected FPGA as the hardware platform for real-time VP9 encoding and are deploying it to broadcast our premium eSports and partner channels in the near future (please watch the presentation given by Twitch’s Principal Research Engineer, Dr. Yueshi Shen and Xilinx’s CEO, Victor Peng, during the keynote talk of Xilinx Developer Forum 2018).
In this article, we will show that the FPGA-based real-time VP9 encoding can deliver at least 25% bitrate savings compared to the highest-quality H.264 encoders deployed in Twitch’s production today. Also, we will deep dive into VP9’s compression tools to explain how these features are used in the encoder implementation to realize the compression performance improvement promised by the simulations during the standardization process.
In the following study, we are not running an academic evaluation of the VP9 compression standard, but comparing a practical 1080p60 real-time VP9 encoder implementation against a commercial state-of-the-art real-time H.264 encoder.
For the anchor, we picked the open-source x264 encoder running on Intel Xeon E5–2697 V4 (18 cores, 2.3GHz, 145W) CPU. The configuration of x264 is:
constant bitrate (CBR) mode,
GOPs of 2 seconds,
lookahead of 1 second, and
VBV buffer of 1 second.
Based on our knowledge on encoder’s performance and Twitch’s content/operational point (i.e., resolution, frame rate, bitrate), we find the above x264 configuration can deliver a comparable or even higher video quality, compared to most professional broadcast quality video encoders in the market. Furthermore, running a higher quality x264 preset (like slow, veryslow) can neither hit the 25% bitrate saving goal nor achieve the real-time encoding (see our experiment result below).
In our comparison, we used the following five 1080p60 gaming clips:
The uncompressed raw video can be downloaded from the official webpage of the Alliance for Open Media (AOM)’s test sequences (search “Twitch”). These five titles cover a wide set of content characteristics (e.g., fast motion, high texture, saturated colors, and large contrast) and are very challenging for video encoding, in fact, much harder than camera-filmed video. Below is a screenshot from the game Witcher 3, and you can find the picture has high contrast, sharp edges, and a lot of details.
With a subjective viewing test at Twitch’s operational point, we concluded that VP9 can deliver the same or better video quality than H.264 at 6Mbps while only using 4.5Mbps. Below is a screenshot of H.264 6Mbps (left) vs. VP9 4.5Mbps (right). The VP9 has a cleaner road surface and has less mosquito noise around the edge of the text, but the H.264 has slightly more texture on the tree and the grass.
You can download the demo video of 1080p60 H.264 6Mbps vs. VP9 4.5Mbps here.
For the objective analysis, we use both PSNR and VMAF and the video bitrate are swept from 2Mbps to 12Mbps. The average PSNR and VMAF values over the 5 titles are provided in the table below. Both metrics show that in the 4Mbps to 6Mbps range, VP9 provides a bitrate savings of approximately 25% over H.264. Additionally, VP9 maintains significant compression efficiency advantage over H.264 for all bitrates.
Here are the PSNR and VMAF curves for x264 and NGCodec VP9.
Using VMAF, at a score of 80 or higher, Twitch’s video quality is comparable to broadcast quality video. Some analysis suggests that in general, a score of 90 or higher indicates a high video quality. However, due to the high texture nature of gaming content (as explained in the previous section of “Test Content”), the score of 80 indeed reflects a very good quality for Twitch’s titles.
The compression gain claimed in the previous section stems from a number of new or improved coding tools defined in the VP9 standard. In this section, we will demonstrate the effect of these new tools that have been implemented in our FPGA VP9 encoder through sophisticated video compression algorithms.
VP9 divides a picture into 64x64 regions known as superblocks that can be further subdivided in a quadtree structure into smaller regions down to 4x4 for prediction. Rectangular partitions such as 32x16 or 4x8 are also supported. Larger prediction block sizes are particularly useful for saving inter-frame bits on predictable content (e.g., flat area). By minimizing signaling overhead, larger block sizes also provide good compression efficiency for higher resolution content.
As can be seen below for a sample frame in the EurotruckSimulator clip, the VP9 encoder uses larger block sizes for relatively flat areas like the sky, roads and pavements and smaller block sizes in highly textured areas to preserve fine details. While H.264 uses 16x16 macroblock for the entire picture, which wastes bits in flat areas and sacrifices video quality in highly textured areas.
Along with larger prediction block sizes, VP9 also supports transform sizes up to the prediction block size or 32x32, whichever is less. Therefore, the transform size can be 4x4, 8x8, 16x16 and 32x32. In comparison, H.264 only supports 4x4 transform throughout (8x8 transform in High Profile only), and a special two-step transform for 16x16 intra prediction. The figures below show the transform sizes using a blue grid from the EurotruckSimulator clip. Larger transform sizes reaching 32x32 lead to better preservation of detail in the smooth areas such as the sky and regular textures such as the road. On the other hand, smaller transform sizes are better able to capture the fine details of the road, lampposts, and building in the distance, as well as the changing reflections on the side of the truck.
In order to fully take advantage of larger prediction and transform sizes, the VP9 encoder uses dedicated hardware FPGA acceleration to compute the most optimal RDO mode decision through evaluating the options of all intra modes and a comprehensive list of inter mode candidates, as well as all prediction and transform block sizes.
The exhaustive RDO evaluation ensures that the best prediction and transform modes are selected, which directly accounts for the video compression efficiency gain described in the previous section of “Subjective and Object Test Results”. For example, in the figure below, the VP9 output is consistently sharper on the road surface because of the large prediction and transform sizes, and has finer details on the side of the truck because of a mix of large and small transform sizes. All of these mode decisions are optimally chosen through the exhaustive RDO mode search, which is only made possible by FPGA.
Besides finding the most efficient way to predict pixels, another key aspect of improving encoder compression is to plan and execute the bit expenditure in a smart way, e.g., to give more bits to areas that human eyes are more sensitive to, to avoid video quality fluctuation that can be annoying to viewers. In this section, we show two powerful implementations in the NGCodec VP9 encoder that budgets and controls the bit allocation among blocks within a frame and across multiple frames in a video sequence.
The NGCodec VP9 encoder has advanced spatial and temporal AQ algorithms coupled with a dual-pass process for scene content and lighting condition analysis. These technologies help to calculate and optimally allocate the bits at the block level, based on scene complexity. In the example illustrated below, flat areas like the sky or the side of the truck do not cost as many bits as complex areas like the road, wall, or building. However, these areas can exhibit blocking artifacts that human eyes are more sensitive to, particularly when they are not allocated with sufficient bits.
In VP9, quant offsets are mapped to one of eight segments which allow a balanced spatial quality throughout the picture. The AQ mapping is depicted in the above figure, where the segment values are mapped to a luma heat map. As explained above, flatter areas like the sky and the truck (shown in a lighter shade) are given more bits with a negative quant offset, avoiding visual artifact. On the other hand, higher textured areas like the logos and the side of the building (shown in a darker shade) are given positive quant offsets.
Aside from the block-level adaptive quantization within a frame, bit allocation across multiple frames are even more critical for achieving decent compression performance. The goal of a rate control algorithm is to avoid violating the Video Buffer Verifier (VBV) model and to maximize the overall quality of a video sequence by allocating the proper amount of bits to different video frames (e.g., reference/non-reference frames, frames at scene changes). In other words, VBV is the mathematical model that defines how the bitrate of a video sequence can be regarded as constant.
The NGCodec VP9 encoder’s robust rate control system uses Machine-Learning techniques to realize more consistent video quality than x264, especially after scene changes, the most challenging situation for rate control. The algorithm has been validated on a wide range of content and encoding parameters and is particularly optimized for gaming content.
The two figures above show the NGCodec VP9 encoder’s and x264’s average frame quantization parameter (QP) and their corresponding VBV buffer level, during a scene change situation in the video sequence of Witchers 3. Note that the VP9 quantizer levels are mapped to the equivalent H.264 QP values in the first diagram above. We can see that around frame 35, the x264’s rate control unnecessarily panics which pushes its QP 51 leading to very poor visual quality (see the left side of the figure below). On the other hand, the NGCodec VP9 encoder’s rate control chooses to keep the QP stable and avoid the VBV buffer level hitting 0% (i.e., buffer underflow).