Inside Facebook’s VR Audio Initiative

MENLO PARK, CA—Facebook engineers have begun to reveal some of the technical details behind the platform’s suite of 360-degree audio tools. Key to them is “something that we call hybrid higher order ambisonics,” says Varun Nair, a software engineer on Facebook’s VR audio team.

The first peek behind the curtain came in a post on Facebook’s Code blog, “Spatial audio—bringing realistic sound to 360 video,” which revealed some of the key features of the Spatial Workstation tool suite, as well as the technical challenges the team encountered in its search for an audio codec that could deliver a consistent experience across all viewing platforms. “New 360 audio encoding and rendering technology makes maintaining high-quality spatial audio throughout the pipeline—from editor to user—possible for large-scale consumption for the first time,” wrote Nair and fellow software engineer Hans Fugal.

Nair was co-founder of UK-based immersive and interactive audio technology company Two Big Ears, which Facebook acquired in 2016. Facebook immediately rebranded the developer’s software and made the suite available for free, providing producers with the first end-to-end 360 audio solution.

Speaking to PSN, Nair explains, “We optimize the system in a way that we can derive a lot of the benefits of higher-order ambisonics, which is more accurate positioning, higher spatial resolution and better timbral quality. And we ensure that it’s built in such a way that it’s completely fine-tuned to our rendering technology.”

Hybrid higher-order ambisonics is “an 8-channel system with rendering optimizations to incorporate the quality of higher-order ambisonics with fewer channels, ultimately saving on bandwidth,” states the blog post. The Spatial Workstation can additionally output two channels of head-locked audio, enabling a static stereo track for music or voiceover, for example. “Rendering with hybrid higher-order ambisonics and head-locked audio simultaneously is a first for the industry,” wrote Fugal and Nair.
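The core idea is easier to see in a simplified form. The sketch below is a minimal first-order illustration in Python, not Facebook’s actual eight-channel hybrid format: the ambisonic bed is counter-rotated against the listener’s head orientation before being decoded to stereo, while the head-locked pair bypasses rotation and decoding entirely.

```python
import numpy as np

def render_frame(ambi_wxyz, head_locked_lr, head_yaw):
    """Toy renderer: head-tracked ambisonic bed plus head-locked stereo.

    ambi_wxyz      -- (4, n) first-order B-format block (W, X, Y, Z)
    head_locked_lr -- (2, n) static stereo block (e.g., music, voiceover)
    head_yaw       -- listener yaw in radians, counterclockwise
    """
    w, x, y, z = ambi_wxyz
    # Counter-rotate the sound field so sources stay fixed in the world
    # while the listener's head turns. (A full renderer would rotate all
    # components; this simple left/right decode only needs the Y term.)
    c, s = np.cos(-head_yaw), np.sin(-head_yaw)
    yr = s * x + c * y
    # Decode to stereo with two virtual cardioids facing left and right.
    left, right = w + yr, w - yr
    # The head-locked channels are summed in, untouched by rotation.
    return np.stack([left + head_locked_lr[0], right + head_locked_lr[1]])

# A source dead ahead, with the head turned 90 degrees to the left,
# should end up at the listener's right ear.
n = 4
ambi = np.array([[1.0] * n, [1.0] * n, [0.0] * n, [0.0] * n])
out = render_frame(ambi, np.zeros((2, n)), head_yaw=np.pi / 2)
print(out[:, 0])  # right channel carries the signal, left is silent
```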

The high-bit-rate Spatial Workstation Encoder marries the 10 audio channels with the video in a single file for uploading to Facebook. The platform’s native format—MP4 with H.264 video and AAC audio—is used to minimize quality loss in subsequent transcoding.

But AAC supports only eight channels, arranged in a 7.1 configuration, and applies processing to the channel it expects to carry low-frequency effects (LFE). To avoid those limitations, the Facebook team splits the eight ambisonic channels across two four-channel tracks, neither of which includes an LFE slot, and carries the head-locked stereo pair in a third track. Metadata defines the channel layout for each track.
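In outline, the packing step looks something like the following sketch; the track names and metadata keys are illustrative stand-ins, not Facebook’s actual schema.

```python
import numpy as np

def pack_for_aac(ambisonic8, head_locked):
    """Split a 10-channel spatial mix into AAC-friendly tracks.

    ambisonic8  -- (8, n) hybrid higher-order ambisonic channels
    head_locked -- (2, n) static stereo channels
    """
    tracks = {
        "spatial_a": ambisonic8[:4],  # four channels: no 7.1 layout, no LFE
        "spatial_b": ambisonic8[4:],  # remaining four channels
        "head_locked": head_locked,   # plain stereo
    }
    # Sidecar metadata tells the decoder how to reassemble the mix.
    metadata = {
        "spatial_a": {"role": "ambisonic", "ambisonic_channels": [0, 1, 2, 3]},
        "spatial_b": {"role": "ambisonic", "ambisonic_channels": [4, 5, 6, 7]},
        "head_locked": {"role": "head_locked"},
    }
    return tracks, metadata
```

Keeping each spatial track to four channels means the AAC encoder never sees a 7.1 layout, so none of the ambisonic channels is mistaken for LFE and processed accordingly.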

Fugal has been with Facebook for more than seven years, having worked in infrastructure, networking, distributed systems and video before volunteering to join the 360 audio and video team. “We took all the great things that Two Big Ears had done and helped to bring them onto mobile devices and all of our Facebook players on Android and iOS and Gear VR and even on the web,” he says.

The goal, says Nair, is to ensure end-to-end quality, from mixing and editing to playout, across a wide range of iOS and Android mobile devices, VR headsets and web browsers. The rendering SDK supports Facebook and Oculus.

Once a file containing 360 video and 360 audio is uploaded, Facebook prepares it for delivery to a variety of client devices, typically using multiple encoder settings to ensure the best possible listener experience. A stereo binaural rendering is also prepared for playback on legacy devices and as a fallback on other platforms.
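The fallback logic can be pictured as a simple capability check. The flags and rendition names below are hypothetical; Facebook has not published its delivery API.

```python
def pick_audio_rendition(client_caps):
    """Choose a delivery rendition from (hypothetical) client capabilities."""
    if client_caps.get("spatial_decode"):
        return "spatial_10ch"    # hybrid HOA bed plus head-locked stereo
    return "stereo_binaural"     # pre-rendered fallback for legacy devices

print(pick_audio_rendition({"spatial_decode": True}))  # spatial_10ch
print(pick_audio_rendition({}))                        # stereo_binaural
```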

Because of AAC’s limitations on various platforms, Facebook elected to use the open-standard Opus codec, which other spatial audio platforms also use, for client delivery. Since the company controls both the encoding and decoding processes, it can optimize the codec’s behavior and performance. The blog also reports that Facebook is interested in ongoing work to improve spatial audio compression in Opus.

Facebook’s 360 audio solution is also efficient, the blog notes. “The spatial audio renderer supports real-time experiences with less than half a millisecond of latency…an order of magnitude lower than most renderers, making it more than ideal for real-time experiences such as head-tracked videos.” At a 48 kHz sample rate, half a millisecond amounts to roughly 24 samples of buffering.

As the technology behind VR evolves, so too will Facebook’s spatial audio tool suite. Commenting on future directions, the blog notes, “Currently, we are working toward supporting an upload file format that can store all audio in one track, and potentially use a lossless encoding…. We are interested in exploring adaptive bitrate and perhaps adaptive channel layout to improve the experience for people with limited bandwidth, or enough bandwidth to receive even higher quality.”

Facebook’s spatial audio platform supports a degree of interoperability, as Nair explains: “If you’ve got a recording from an ambisonic microphone or a video that was uploaded to YouTube, we can ingest those formats as well. Because at no point do we want to create friction during the process.”
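Much of that ingestion comes down to format normalization. YouTube’s 360 videos, for instance, carry first-order ambiX audio (ACN channel order, SN3D normalization), while many ambisonic microphones record FuMa B-format; the standard conversion between the two is sketched below.

```python
import numpy as np

def fuma_to_ambix(b_format):
    """Convert first-order FuMa B-format (W, X, Y, Z) to ambiX,
    the ACN/SN3D layout used by YouTube's first-order 360 audio.

    b_format -- (4, n) array of FuMa channels
    """
    w, x, y, z = b_format
    # FuMa's W channel carries a -3 dB pad; SN3D's does not.
    w = w * np.sqrt(2.0)
    # ambiX channel order is ACN: W, Y, Z, X.
    return np.stack([w, y, z, x])
```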

But it’s too early in VR’s development to expect standards-based interoperability. As Fugal observes, “We have been working in a regime where the limitations of existing formats and software have caused us the need to work around things. As those barriers start to break down—and they are beginning to do so—that will make standards feasible.”

Facebook Code
code.facebook.com