The Recent “OGG Opus” Codec

One of the uses which I’ve had for OGG Files has been as a container file for music, compressed using the lossy “Vorbis” Codec. This has given me superior sound to what MP3 Files once delivered, assuming that I set my Vorbis-encoded streams to a higher bit-rate than most people do, that being 256kbps, or, Quality Level 8.

But the same people who invented the Vorbis Codec have embarked on a more recent project, called “OGG Opus”, which is a Codec that can switch back and forth seamlessly between a lossy, Linear Predictive Coding mode (“SILK”), and a mode based on the Type 4 Discrete Cosine Transform (‘DCT’), the latter of which dominates when the Codec is used for high-fidelity music. This music mode is defined by “The CELT Codec”, which has a detailed write-up from its developers dating from 2010, which This Link points to.

I have read the write-up and offer an interpretation of it here, one which requires less technical comprehension to be understood than the write-up itself does.

Essentially, the developers have made a radical departure from the approaches previously used to compress audio in the frequency domain. The least of the changes is that shorter sampling windows are used, such as the 512-sample window which has been sketched, as well as a possible 256-sample window, which was also mentioned. In return, both the even and odd coefficients of these sampling windows – aka Frames – are used, so that only very little overlap exists between them. Hence, even though some overlap remains, these are mainly just Type 4 Discrete Cosine Transforms.

The concept has been abandoned, that the Codec should reconstruct the spectral definition of the original sound as faithfully as possible, apart from the simplification needed to represent it with far fewer bits than the original sound occupied. A 44.1kHz, 16-bit, stereo, uncompressed Wave File consumes about 1.4Mbps, while compressed bit-rates as low as 64kbps are achievable, and music will still sound decently like music. The emphasis here seems to be that only the subjective perception of the sound is supposed to remain accurate.
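The figure of 1.4Mbps can be verified with a few lines of arithmetic (the 64kbps comparison is the one just given):

```python
# Verifying the uncompressed figure quoted above:
# 44.1kHz, 16-bit, stereo Wave audio.
sample_rate = 44100       # samples per second
bits_per_sample = 16
channels = 2

bitrate_bps = sample_rate * bits_per_sample * channels
print(bitrate_bps)          # 1411200 bits per second, i.e. ~1.4Mbps

# Compression ratio relative to a 64kbps Opus stream:
print(bitrate_bps / 64000)  # 22.05
```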

(Updated 8/03/2019,16h00 … )

(As of 8/01/2019, 17h10 : )

Essentially, the way in which ‘OGG Opus’ / ‘CELT’ simplifies its transmission of the spectral nature of sound granules is that, over the entire audible spectrum, a fixed number of sub-bands is defined – equal to (19) for a Frame Size of 256 samples – and the spacing of the borders between these bands is almost logarithmic, since Human hearing is largely logarithmic in nature. I.e., given the index number of any DCT coefficient, it’s possible to decide unambiguously which of these “Critical Sub-Bands” it belongs to.

The sub-bands could hypothetically be determined by repeatedly dividing 20kHz by the square root of two, until the size of the Frames causes the remaining sub-bands to have only 6 coefficients each, after which they would go down to the minimum frequency linearly. The minimum frequency, because the Type 4 DCT applies a half-sample shift to the frequencies that actual coefficient indexes refer to.
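This hypothetical construction can be sketched in code. The following is only my illustration of the rule just described – the 20kHz top frequency, the square-root-of-two ratio, and the 6-coefficient minimum come from the paragraph above, while the 48kHz sample rate is my assumption – and the band count it produces need not match the actual (19) bands which CELT defines:

```python
import math

def hypothetical_band_edges(frame_size=256, top_hz=20000.0,
                            sample_rate=48000.0, min_coeffs=6):
    """Hypothetical construction of the near-logarithmic band borders:
    divide the top frequency repeatedly by the square root of two,
    until a band would span fewer than min_coeffs DCT coefficients,
    then step down linearly towards the minimum frequency."""
    hz_per_coeff = (sample_rate / 2.0) / frame_size  # width of one DCT bin
    edges = [top_hz]
    f = top_hz
    # Logarithmic part: each border is the previous one divided by sqrt(2).
    while (f - f / math.sqrt(2.0)) / hz_per_coeff >= min_coeffs:
        f /= math.sqrt(2.0)
        edges.append(f)
    # Linear part: steps of min_coeffs coefficients, down towards zero.
    step = min_coeffs * hz_per_coeff
    while f - step > 0.0:
        f -= step
        edges.append(f)
    edges.append(0.0)
    return sorted(edges)
```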

For each of these sub-bands, the amount of signal energy is computed and encoded accurately, in a way that is not quantized. That signal energy is converted into an equivalent amplitude, by computing its square root. But then, the result is used to normalize the spectral information also transmitted for each sub-band.
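As a sketch, in code, of what that normalization amounts to – my paraphrase of the step, not CELT’s actual implementation:

```python
import math

def normalize_band(coeffs):
    """Compute the band's signal energy, take its square root to
    obtain an equivalent amplitude (the vector's norm), and divide
    the coefficients by it. The returned pair is (gain, unit_vector):
    the gain is what gets transmitted accurately, while the unit
    vector is what gets coarsely quantized."""
    energy = sum(c * c for c in coeffs)
    gain = math.sqrt(energy)
    if gain == 0.0:
        return 0.0, [0.0] * len(coeffs)
    return gain, [c / gain for c in coeffs]

gain, shape = normalize_band([3.0, 4.0])
# gain == 5.0, and shape has (approximately) unit length
```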

This precise spectral information is no longer to be entropy-coded. Instead, each sub-band is to be defined by a Code-Word of fixed length, making this scheme a truly Constant-Bit-Rate scheme.

The sub-bands share a parameter which the linked article names (K), and which was arbitrarily set to (4). This is the total number of amplitude peaks that are to exist in each sub-band, at different frequencies, all of which are quantized as having an amplitude of ± (1.0/K) instead of zero. Thus, the size of each Code-Word depends on what (K) is, and it states the frequency coefficient – which I’ve been calling an ‘index’ – of each of these (K) peaks, which the linked article names “Frequency Pulses”. They are not arranged according to time.

(Update 8/03/2019, 16h00 … )

The reason for which the linked article refers to a hyper-sphere is a kind of Math the developers do, according to which the individual coefficients before quantization are the elements of a vector, and are divided by the norm of this vector. The result is a unit vector, which also corresponds to a common-sense definition of a sphere. Only, the dimensionality of this sphere corresponds to the number of coefficients per sub-band, which may in fact be what the variable (N) refers to.

A hypothetical code-word would consist of a series of actual bytes, followed by a sequence of raw bits. Each byte would state the index of one coefficient, between (0) and (N-1) inclusively, that has been quantized to non-zero, while an equal number of raw bits would state the signs of the selected, quantized coefficients… The total number of bits needed would get rounded up to an integer number of bytes.
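A sketch of the content of such a hypothetical code-word – this is my reading of the scheme, not the actual CELT bit-stream format (which, in reality, uses a more compact enumeration), and the function names are mine:

```python
def encode_band_sketch(coeffs, K=4):
    """Pick the K coefficients with the largest magnitudes, and encode
    each as (index, sign); every selected peak will be quantized to an
    amplitude of +/- 1/K, all other coefficients to zero."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]),
                   reverse=True)
    peaks = sorted(order[:K])                  # indexes of the K peaks
    signs = [coeffs[i] >= 0.0 for i in peaks]  # one raw bit per peak
    return peaks, signs

def decode_band_sketch(peaks, signs, N, K=4):
    """Rebuild the quantized approximation of the unit vector."""
    out = [0.0] * N
    for i, positive in zip(peaks, signs):
        out[i] = (1.0 / K) if positive else -(1.0 / K)
    return out

peaks, signs = encode_band_sketch([0.1, -0.7, 0.0, 0.5, -0.2, 0.05], K=2)
# peaks == [1, 3], signs == [False, True]
```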

If a higher bit-rate is desired, the spectrum can be subdivided into a larger number of sub-bands, each made narrower. Arithmetically, this would be a question of computing the anti-logarithm of the negative reciprocal of the required quality factor, thereby obtaining a multiplier different from the square root of one half, which can again be applied until the subdivision of the spectrum into bands needs to become linear.
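Assuming that the anti-logarithm is taken base two – an assumption on my part, chosen because a quality factor of (2) then reproduces the square root of one half – the arithmetic would be:

```python
def band_ratio(quality_factor):
    """Ratio between successive band borders, as described above,
    assuming base-2 anti-logarithms: 2 ** (-1 / Q). A quality factor
    of 2 recovers the square root of one half; larger Q gives a ratio
    closer to 1, hence narrower bands and a higher bit-rate."""
    return 2.0 ** (-1.0 / quality_factor)

print(band_ratio(2))  # ~0.7071, the square root of one half
print(band_ratio(4))  # ~0.8409, narrower bands
```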

The reason for which transparency can eventually be reached is the combination of these facts:

  • The sub-bands could be made so narrow that each only needs to have (1) peak in principle, even though it will receive (4) regardless,
  • Even demanding music does not typically require that the peaks be close together – as in belonging to adjacent coefficients – in the higher sub-bands.

According to what I recently read, the Codec is capable of assigning zero bits to a sub-band, in which case its quantized elements are “folded” from lower sub-bands when decoding, in a way that resembles Spectral Band Replication, after which the folded elements are nevertheless multiplied by the square root of the spectral energy recorded for the band.
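A sketch of that folding step as I understand it; which lower band supplies the copied elements is left out here, since I have not verified that detail:

```python
import math

def fold_band_sketch(lower_band_shape, band_energy):
    """Sketch of spectral folding: when a band received zero bits,
    reuse (fold) the quantized shape of a lower band, then scale it
    by the square root of the energy recorded for this band."""
    gain = math.sqrt(band_energy)
    return [gain * c for c in lower_band_shape]

# A band with energy 4.0 borrows a lower band's quantized shape:
print(fold_band_sketch([0.25, -0.25, 0.25, 0.25], 4.0))
# [0.5, -0.5, 0.5, 0.5]
```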

There is no Scientific, Mathematical reason to think that sound was ever organized in this way. It just sounds right when played back for Human Hearing that way. Well, it almost sounds right. There are basically two advantages to be gained from doing things that way:

  1. The shorter sampling windows – Frames – used in full, imply better temporal resolution, to the point where this Codec can be used in live performances, for lip-syncing, for Karaoke, etc.,
  2. Bit-rates lower than those obtained through the use of the ‘AAC’ Codec can be used, and still for music.

However, advantage (1) will generally come in step with a disadvantage: better temporal resolution brings poorer spectral resolution, in this case to the point where the exact note of the music could not be held, if it were not for a pre-filter that is also applied. What ‘CELT’ does is determine one fundamental frequency accurately – which could also be seen as ‘the main note being played or sung at any one time’ – and it tunes a comb filter designed to bias the Codec towards encoding that one frequency, as well as all the harmonics of that frequency. This actually happens before the DCT.

What exactly a comb filter is, is beyond the scope of this article to explain. However, This Link points to an educational site which explains comb filters in detail. I think that one mistake which some people make is to expect that the effects of a comb filter are always audible. It just happens that, if the specific time delay that defines the comb filter is changed over time, the notches in its frequency response move along the spectrum, and at that point, the listener notices a “flanger effect”. However, comb filters also exist all the time in room acoustics, in which case most residents spend much of their time ‘tuning them out’ of their awareness.
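For concreteness, here is the textbook feed-forward form of a comb filter – not CELT’s actual pitch pre-filter, which is a tuned design applied before the DCT – just enough to show where the notches come from:

```python
def comb_filter(samples, delay, g=0.5):
    """Minimal feed-forward comb filter: each output sample is the
    input plus a scaled copy of the input 'delay' samples earlier.
    Its frequency response has regularly spaced notches (or peaks,
    depending on the sign of g) at multiples of sample_rate / delay;
    changing 'delay' over time is what produces a flanger effect."""
    out = []
    for n, x in enumerate(samples):
        delayed = samples[n - delay] if n >= delay else 0.0
        out.append(x + g * delayed)
    return out

# An impulse recurring at the delay period is reinforced:
print(comb_filter([1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0], delay=4))
```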

What the developers of ‘CELT’, who are therefore also developers of ‘OGG Opus’, shrugged off in 2010, was the fact that with ‘OGG Vorbis’ at least, if the music was both very tonal and polyphonic, all the main frequencies were reproduced accurately, while ‘CELT’ could no longer pull that off. This can be phrased less technically. When different people listen to their music, the emphasis can be on differing aspects of it. Some people might enjoy the very-high-frequency parts of the sound more, while other, more tonal listeners tend to enjoy the fact that an entire chord of lower-midrange frequencies can be played, and that every note in that chord will be highly accurate. That second capability was lacking in earlier implementations of ‘CELT’.

For people who, for sport, want to compress their music down to 64kbps or less and still be able to listen to it, ‘OGG Opus’ is for you. What I am now hoping is that, at a bit-rate of 96kbps, I will have come close to transparency.

(Important : )

There is a detail of how ‘CELT’ works which I had overlooked. It can vary the trade-off between temporal resolution and spectral resolution dynamically, not just by splitting the size of actual Frames mid-stream, but also by computing a Hadamard Transform – on one coefficient from granule to granule, to increase spectral resolution, or on multiple coefficients belonging to one long granule, to increase temporal resolution. Either way, this Hadamard Transform boosts the chosen resolution by a power of (2).

This added feature was only documented in 2013, and is referred to, in This linked document, as Variable Time-Frequency Resolution (section 5.2). For each sub-band, ‘CELT’ will select to what level the Hadamard Transform is to be applied, when encoding.
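The building block of that transform is the 2-point Hadamard butterfly, sketched below; how CELT applies it per sub-band is described in section 5.2 of the linked document, and this snippet only shows the arithmetic itself:

```python
import math

def hadamard_pair(a, b):
    """The 2-point Hadamard butterfly: an orthonormal sum/difference
    pair. Applied across the same coefficient of two short granules,
    it trades temporal for spectral resolution; applied to two
    adjacent coefficients of one long granule, it trades the other
    way. Each application changes the resolution by a factor of 2."""
    s = 1.0 / math.sqrt(2.0)
    return s * (a + b), s * (a - b)

# The transform is its own inverse: applying it twice
# recovers the original pair (up to rounding error).
x, y = hadamard_pair(3.0, 1.0)
a, b = hadamard_pair(x, y)
# a ~= 3.0, b ~= 1.0
```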

Whether one specific implementation of ‘CELT’ achieves everything that the standard allows is questionable. Debian / Stretch supports ‘OGG Opus’ v1.2. And I think that the way OGG Opus works is such that later decoders may be required, for the latest features to play.

OGG Opus additionally has the ability to apply both Linear Predictive Coding, in the form of the ‘SILK’ Codec, to speech frequencies up to 8kHz, and ‘CELT’ to frequencies between 8kHz and 20kHz, simultaneously. This is substantially better than the operation would be, if Opus were simply to switch back and forth.

Dirk

 
