Ambisonics in Wwise: Overview
Released with Wwise 2016.1, the key feature of the ambisonics pipeline is the ability to set the channel configuration of busses to 1st, 2nd, or 3rd-order ambisonics. From there, any non-ambisonic signal routed to such a bus is automatically encoded to ambisonics, and any ambisonic signal routed to a non-ambisonic bus is automatically decoded, according to the chosen positioning settings. Finally, ambisonic signals routed to ambisonic busses are either passed through unchanged (2D positioning) or rotated according to the relative orientation of the game object and listener (3D positioning), using the properties of ambisonics. In addition, you may:
- Import and play back B-format assets (up to 3rd order);
- Use Effect plug-ins to customize decoding;
- Use Effect plug-ins such as Auro Headphone to convert ambisonics to binaural;
- Record an ambisonic bus to disk and re-import it using the Wwise Recorder;
- Use your favorite Wwise plug-ins* to process ambisonics as they would process other formats.
*Except Stereo Delay and Matrix Reverb.
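To give a feel for what the automatic encoding step does, here is a minimal sketch of panning a mono sample into first-order B-format (FuMa WXYZ ordering). This is an illustration of the classic encoding equations only, not Wwise's internal implementation; the function name and conventions (azimuth counter-clockwise from front, elevation positive upward) are assumptions for the example.

```python
import math

def encode_foa_fuma(sample, azimuth_deg, elevation_deg):
    """Encode a mono sample into first-order B-format (FuMa WXYZ).

    Azimuth is measured counter-clockwise from the front (0 = front,
    90 = left); elevation is positive upward. Gains follow the classic
    FuMa convention, where W carries a -3 dB (1/sqrt(2)) weight.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample * (1.0 / math.sqrt(2.0))
    x = sample * math.cos(az) * math.cos(el)
    y = sample * math.sin(az) * math.cos(el)
    z = sample * math.sin(el)
    return [w, x, y, z]
```

A source straight ahead encodes with full gain on X and nothing on Y or Z; higher orders add more spherical-harmonic channels following the same principle.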
Using Ambisonics for VR
It is commonly agreed that games for Virtual Reality (VR) should render a binaural mix, which is produced by passing sounds through position-dependent filters (called Head-Related Transfer Functions, or HRTFs). These filters model the interaction of sounds with the head, and may succeed in giving the impression that sounds of the game actually come from outside the player's head, instead of from inside the head, as standard intensity stereo does when used with headphones.
Intermediate spatial representation for the purpose of binauralization
In order to apply proper HRTFs, binaural processing needs to know the wavefronts' direction of arrival. One way to do this is to filter each positioned source independently. However, it has been shown that panning and applying binaural filtering all at once has drawbacks (see: Object Based Audio blog). In particular, it prevents sound designers from submixing groups of sounds in order to apply audio Effects, such as dynamic range compression. To counteract these drawbacks, we need to decouple the panning stage from the binaural processing stage. In the middle, we need a signal format that carries information about the spatial layout of the sounds constituting the submixed audio. We call this format an intermediate spatial representation.
It has been suggested that fixed objects be used as an intermediate spatial representation. The concept of fixed objects is quite simple. It consists of a multi-channel format, where each channel represents an object (or loudspeaker) with a fixed and known position. Sounds are thus panned and mixed onto this multi-channel format using standard panning laws, as if the channels represented virtual loudspeakers with known positions. Later on, the signal of each channel is filtered independently by the HRTF corresponding to the position of the virtual loudspeaker.
In fact, the fixed-objects representation is no different from a classic multi-channel signal. The difference here is the context in which it is used. When mixing for a traditional 5.1 setup, sounds would be panned and mixed directly onto a 5.1 bus, and each channel of this bus would be used to drive a loudspeaker. In our case, however, we are panning onto a configuration that purposefully has more channels than the output (which has only two). The directional information implicitly conveyed in the fixed-object channels ends up being embedded into the binaural signal by way of filtering.
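The two stages described above can be sketched in a few lines. The layout, the cosine panning law, and the gain-only "HRTF bank" below are all simplifications chosen for brevity (real panners such as VBAP use only the nearest speaker pair, and real HRTFs are FIR filters, not single gains); nothing here reflects Wwise's actual implementation.

```python
import math

# Hypothetical fixed-object layout: four virtual loudspeakers on the
# horizontal plane (azimuths in degrees, counter-clockwise from front).
LAYOUT = [45.0, 135.0, -135.0, -45.0]

def pan_to_fixed_objects(sample, azimuth_deg):
    """Pan a mono sample onto the fixed layout with simple cosine gains,
    normalized to constant overall energy."""
    gains = []
    for spk_az in LAYOUT:
        diff = math.radians(azimuth_deg - spk_az)
        gains.append(max(0.0, math.cos(diff)))
    norm = math.sqrt(sum(g * g for g in gains)) or 1.0
    return [sample * g / norm for g in gains]

def binauralize(channels, hrtf_bank):
    """Filter each fixed-object channel by the HRTF pair for its known
    direction, then sum into left/right ears. hrtf_bank maps a channel
    index to a (left_gain, right_gain) pair standing in for the filters."""
    left = sum(c * hrtf_bank[i][0] for i, c in enumerate(channels))
    right = sum(c * hrtf_bank[i][1] for i, c in enumerate(channels))
    return left, right
```

The key point is the decoupling: any number of sources can be panned and submixed onto the fixed channels (and processed there), and binauralization happens once, at the end, per channel rather than per source.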
Ambisonics can be viewed as another intermediate spatial representation. It models a sound field, which is sound coming from all directions at once. Each channel represents a so-called spherical harmonic, and they all work together in approximating this sound field with increasing angular precision. First-order ambisonics consists of 4 channels, while 2nd- and 3rd-order ambisonics consist of 9 and 16 channels respectively, and represent increasingly precise sound fields from a spatial standpoint. In ambisonic terminology, panning to ambisonics corresponds to encoding, while converting ambisonics to a binaural setup (or a loudspeaker feed) corresponds to decoding. Compared to typical fixed-object representations, such as 7.1.4, ambisonics is symmetrical and regular and can represent wavefronts coming from under the listener. It also exhibits constant spread, whereas panning onto fixed objects is more precise when the virtual source falls exactly on an object than when it falls exactly in the middle of the three surrounding objects. On the other hand, ambisonics is blurrier than the fixed-object representation with an equal number of channels.
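The channel counts follow (order + 1)^2: 4, 9, 16. The constant-spread property can also be checked numerically: with first-order encoding (FuMa weights, as in the sketch above), the total energy across the four channels is the same no matter where the source points. This is an illustrative calculation, not Wwise code.

```python
import math

def foa_energy(azimuth_deg, elevation_deg=0.0):
    """Total energy of a unit sample encoded to first-order B-format
    with FuMa weights: W^2 + X^2 + Y^2 + Z^2."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    w = 1.0 / math.sqrt(2.0)
    x = math.cos(az) * math.cos(el)
    y = math.sin(az) * math.cos(el)
    z = math.sin(el)
    return w * w + x * x + y * y + z * z

# foa_energy(...) evaluates to 1.5 for every direction, since the three
# dipole channels together contribute cos^2 + sin^2 terms that sum to 1.
```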
At the moment, some VR vendors already provide SDKs that accept ambisonic signals and perform binaural virtualization. Others don't. When such an SDK is available, Wwise can feed it directly with ambisonics. Otherwise, Wwise lets you decode the ambisonics and convert them to binaural using your favorite 3D audio plug-in. Auro Headphone, distributed with Wwise, is one of them.
In order to do so, create a bus under the Master Audio Bus to act as your master 3D bus. All audio requiring binaural processing should eventually be routed there. Set this bus to Ambisonics N-N, where N is the order and determines precision (2 or 3 is recommended). Feed the output of this bus to your favorite ambisonics-capable binaural virtualizer. Since Wwise does not currently allow metering of audio signals between bus Effects, you should place this Effect on a separate parent bus, which you will need to create with the same configuration.
How does it rotate with respect to the head-mounted device?
Normally, the head-mounted device (HMD) head-tracking data should be continuously passed to Wwise by the game engine as the listener orientation, via the SetListenerPosition() API. Sounds whose positioning is set to 3D are attached to game objects. When they are mixed into an ambisonic bus, the encoding angles depend on the game object position relative to the listener's orientation. Thus, the ambisonic downmix is made of sound sources that are already rotated with respect to the HMD.
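The listener-relative encoding angle mentioned above amounts to subtracting the listener's facing angle from the emitter's world-space bearing. The sketch below shows that computation on the horizontal plane only; the function name is hypothetical and the real engine works in 3D with full orientation vectors.

```python
import math

def relative_azimuth(listener_pos, listener_front, emitter_pos):
    """Horizontal angle of an emitter relative to the listener's
    orientation (degrees, counter-clockwise, 0 = straight ahead).
    Positions and the facing vector are (x, y) pairs."""
    to_src = (emitter_pos[0] - listener_pos[0],
              emitter_pos[1] - listener_pos[1])
    world_az = math.atan2(to_src[1], to_src[0])
    facing_az = math.atan2(listener_front[1], listener_front[0])
    # Wrap the difference into (-180, 180].
    diff = (world_az - facing_az + math.pi) % (2.0 * math.pi) - math.pi
    return math.degrees(diff)
```

When the game feeds fresh head-tracking orientation every frame, this relative angle changes accordingly, so each source is encoded into the ambisonic submix already rotated for the current head pose.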
You may also use 3D positioning with B-format sources. In such a case, they are considered as sound fields and are rotated according to the (fully qualified) game object vs listener relative orientation (see Ambisonics as sound fields, below). 2D sounds are encoded with fixed angles and, as such, always follow the listener.
Ambisonics can be rotated with minimal CPU and memory usage by computing rotation matrices in the ambisonics domain. This makes it an ideal format for exchanging audio for VR. For example, one may build a complete auditory scene with sources coming from any direction, encode them to an ambisonic signal, and store it to disk using an ambisonics-capable DAW (such as Wwise). At playback time on the VR device, the playback engine just has to read the head-tracking coordinates, rotate the ambisonic signal in the opposite direction, and then decode/virtualize it to a binaural signal for headphones.
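For first order, the rotation matrix is especially cheap: a yaw rotation leaves W and Z untouched and only mixes the two horizontal dipole channels. The sketch below illustrates this (FuMa WXYZ ordering assumed); compensating head tracking means passing the negative of the head's yaw. Higher orders use larger per-order rotation blocks built on the same principle.

```python
import math

def rotate_foa_yaw(wxyz, yaw_deg):
    """Rotate a first-order B-format frame (FuMa WXYZ ordering) about
    the vertical axis. W (omnidirectional) and Z (vertical dipole) are
    invariant under yaw; only X and Y mix, via a 2x2 rotation."""
    w, x, y, z = wxyz
    c = math.cos(math.radians(yaw_deg))
    s = math.sin(math.radians(yaw_deg))
    return [w, x * c - y * s, x * s + y * c, z]
```

Rotating a front-encoded source (X = 1, Y = 0) by 90 degrees yields the same frame as encoding that source at 90 degrees to the left, which is exactly the property the playback engine relies on.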
In the previous section, rotation was achieved interactively. You could instead use Wwise to produce non-interactive (cinematic) content for VR, where the only remaining interaction is the rotation due to head tracking. To do so, simply replace the aforementioned binaural virtualizer Effect with a Wwise Recorder Effect on the ambisonic master 3D bus. The recorded file will be a compatible AMB file (FuMa ordering, MaxN normalization), with the same order as that of the bus. You can then embed this file into your 360 video and let the player rotate the sound field using the process described above.
More or less the same considerations apply to Auxiliary Busses used for environmental Effects. Currently, the RoomVerb and the Convolution Reverb support ambisonics natively; the Matrix Reverb does not. You may use the Matrix Reverb on a stereo or 4.0 (or larger) bus, depending on whether you want to use the front-back delays, and route that bus to an ambisonic bus. Its output will be encoded to ambisonics following the same rules that apply to 2D sounds. While you can do the same with the RoomVerb, you may also use it directly on ambisonic busses. The directional channels will then consist of decorrelated signals, as with standard multichannel configurations. Higher orders result in more decorrelated channels and therefore require more processing. We encourage you to experiment to find the desired balance between quality and performance.
Ambisonics Panning versus VBAP
Encoding of mono sources into an ambisonics bed may be used for aesthetic reasons. The default panning algorithm of 3D sounds implemented in Wwise is based on the ubiquitous VBAP algorithm, which maximizes localization accuracy with constant overall energy, at the expense of variability of the energy spread. That is, the energy spread is minimal when the virtual source is aligned with a loudspeaker and is maximal when it is exactly in the middle of a loudspeaker arc (for a 2D config such as 7.1) or triangle (for a 3D config such as 7.1.4). Ambisonics, on the other hand, has a constant energy spread regardless of the source direction and loudspeaker layout. Its spread is inversely proportional to the order. First-order ambisonics is therefore very blurry. Since ambisonics submixes are decoded automatically to the standard configuration of their parent bus (such as 5.1 or 7.1.4), using the all-round ambisonics decoding technique, all-round ambisonics _panning_ may be implemented in Wwise by routing 3D sounds to an ambisonics bus and then routing this bus to a parent bus that has a standard config.
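The spread variability of VBAP is easy to see on a single 2D loudspeaker pair. The sketch below implements the basic pairwise VBAP principle (express the source direction as a linear combination of the speaker directions, then normalize to constant energy); it is a simplified illustration, not Wwise's panner.

```python
import math

def vbap_pair_gains(source_az_deg, spk1_az_deg=45.0, spk2_az_deg=-45.0):
    """2D VBAP on one loudspeaker pair: solve [l1 l2] g = p for the
    gains, then normalize so that g1^2 + g2^2 = 1 (constant energy)."""
    def unit(deg):
        r = math.radians(deg)
        return (math.cos(r), math.sin(r))
    p = unit(source_az_deg)
    l1, l2 = unit(spk1_az_deg), unit(spk2_az_deg)
    det = l1[0] * l2[1] - l1[1] * l2[0]  # nonzero for distinct speakers
    g1 = (p[0] * l2[1] - p[1] * l2[0]) / det
    g2 = (l1[0] * p[1] - l1[1] * p[0]) / det
    norm = math.hypot(g1, g2)
    return g1 / norm, g2 / norm
```

A source aligned with a speaker gets gains (1, 0), so all the energy comes from one physical direction (minimal spread), while a source exactly between the pair gets equal gains on both speakers (maximal spread). Ambisonic panning, by contrast, produces the same spread for every direction.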
Ambisonics as sound fields
B-format sources set to 3D positioning should be viewed as sound fields, with their orientation given by the associated game object's orientation. When they are routed to an ambisonic bus, Wwise computes the mixing matrix so as to rotate the sound field with respect to the listener. This makes ambisonic recordings (or synthesized sources - see below) a great way to implement dynamic ambiances in games. Play them back on game objects whose orientation is constant and points to some reference. For example, if there is an auditory element with clear directionality in a recording, which maps to a visual element in the game, you will want to orient the game object such that they align. When the listener navigates in the space and its orientation changes, the ambisonic sound field is automatically rotated while being mixed into the bus. Consequently, the sound field stays coherent with the visuals.
You can import ambisonic recordings in Wwise that you capture using an appropriate coincident microphone and convert to B-format (*most microphones come with their own software to transform the captured format, typically A-format, to B-format). You may also synthesize B-format ambiances in Wwise by using discrete mono sources with 3D user-defined positioning, which are played using a Soundcaster session and routed to an ambisonics bus with a Wwise Recorder plug-in applied to it. You may then re-import the file right away using the Recorder's controls. See more details on the Wwise Recorder plug-in.
Ambisonics represents wavefronts coming towards the listener (*although it can also be used to represent audio radiating from an object), so the sources that constitute the sound field are always at a distance from the listener. It is thus awkward to use this representation to translate within the sound field. In other words, it is awkward to walk around these sources. This has two consequences on using ambisonics for ambiances in Wwise. First, if you set an attenuation curve, the volume of the sound field diminishes evenly in all directions when you move away from its central position (the position of the associated game object). This might complicate things when trying to get in and out of the sound field. Second, if you move to one side of the sound field - away from the center in one direction - elements in this direction will not sound louder than the elements in the opposite direction.