## An observation about how the OGG Opus CODEC may do Stereo.

One of the subjects which I’ve written about before, is the fact that the developers of the original OGG Vorbis form of music compression, have more recently developed the OGG Opus CODEC, which is partially the CELT CODEC. And, in studying the manpage on how to use the ‘opusenc’ command (under Linux), I ran across the following detail:


--no-phase-inv
Disable use of phase inversion for intensity stereo. This trades
some stereo quality for a higher quality mono  downmix,  and  is
useful when encoding stereo audio that is likely to be downmixed
to mono after decoding.



What does this mean? Let me explain.

I should first preface that with an admission, of the fact that an idea which was true for the original version of the Modified Discrete Cosine Transform, as introduced by MP3 compression and then reused frequently by other CODECs, may not always be the case. That idea was that, when defining monaural sound, each frequency coefficient needed to be signed. Because CELT uses a form of the Type 4 Discrete Cosine Transform which is only partially lapped, it may be that all the coefficients are assumed to be positive.

This will work as long as there is no destructive interference between the same coefficient, in the overlapping region, from one frame to the next, in spite of the half-sample shift of each frequency-value. Also, a hypotenuse function should be avoided, as that would present itself as distortion. One explicit way to achieve this could be, to rotate the reference-waves (n)·90° + 45° for coefficient (n):

Where ‘FN‘ refers to the current Frame-Number.

In general, modern compressed schemes will subdivide the audible spectrum into sub-bands, which in the case of CELT are referred to as its Critical Bands. And for each frame, the way stereo is encoded for each critical band, switches back and forth between X/Y intensity stereo, and Mid/Side stereo, which also just referred to as M/S stereo. What will happen with M/S stereo is, that the (L-R) channel has its own spectral shape, independent of the (L+R) channel’s, while with X/Y stereo, there is only one spectral pattern, which is reproduced by a linear factor, as both the (L+R) component, and the (L-R) component.

Even if the (L+R) is only being recorded as having positive DCT coefficients, with M/S stereo, the need persists for the (L-R) channel to be signed. Yet, even if M/S stereo is not taking place, implying that X/Y stereo is taking place, what can happen is that:

|L-R| > (L+R)

This would cause phase-inversion to take place between the two channels, (L) and (R). Apparently, a setting will prevent this from happening.

Further, because CELT has as its main feature, that it first states the amplitude of the critical band, and then a Code-Word which identifies the actual non-zero coefficients, which may only number 4, the setting may also affect critical bands for which M/S stereo is being used during any one frame. I’m not really sure if it does. But if it does, it will also make sure that the amplitude of the (L+R) critical band exceeds or equals that of the (L-R) critical band.

The way in which the CODEC decides, whether to encode the critical band using X/Y or M/S, for any one frame, is to detect the extent to which the non-zero coefficients coincide. If the majority of them do, encoding automatically switches to X/Y… Having said that, my own ideas on stereo perception are such that, if none of the coefficients coincide, it should not make any difference whether the specific coefficients belonging to the (L-R) channel are positive or negative. And finally, a feature which CELT could have enabled constantly, is to compute whether the (L-R) critical band correlates positively or negatively with the (L+R), independently of what the two amplitudes are. And this last observation suggests that even when encoding in M/S mode, the individual coefficients may not be signed.

(Update 10/03/2019, 9h30 … )

## The Recent “OGG Opus” Codec

One of the uses which I’ve had for OGG Files has been, as a container-file for music, which has been compressed using the lossy “Vorbis” Codec. This has given me superior sound to what MP3 Files once delivered, assuming that I’ve set my Vorbis-encoded streams to a higher bit-rate than what most people set, that being 256kbps, or, Quality Level 8.

But the same people who invented the Vorbis Codec, have embarked on a more recent project, which is called “OGG Opus”, which is a Codec that can switch back and forth seamlessly, between a lossy, Linear Predictive Coding mode (“SILK”), and a mode based on the Type 4 Discrete Cosine Transform (‘DCT’), the latter of which will dominate, when the Codec is used for high-fidelity music. This music-mode is defined by “The CELT Codec”, which has a detailed write-up dating in the year 2010 from its developers, that This Link points to.

I have read the write-up and offer an interpretation of it, which does not require as much technical comprehension, as the technical write-up itself requires, to be understood.

Essentially, the developers have made a radical departure from the approaches used previously, when compressing audio in the frequency domain. Only the least of the changes is, that shorter sampling windows are to be used, such as the 512-sample window which has been sketched, as well as a possible 256-sample window, which was mentioned as well. In return, both the even and odd coefficients of these sampling windows – aka Frames – are used, so that only very little overlap will exist between them. Hence, even though there will still be some overlap, these are mainly just Type 4 Discrete Cosine Transforms.

The concept has been abandoned, that the Codec should reconstruct the spectral definition of the original sound as much as possible, minus the fact that it has to be simplified, in order to be represented with far fewer bits, than the original sound was defined as having. A 44.1kHz, 16-bit, stereo, uncompressed Wave-File consumes about 1.4Mbps, while compressed sampling rates as low as 64kbps are achievable, and music will still sound decently like music. The emphasis here seems to be, that only the subjective perception of the sound is supposed to remain accurate.

(Updated 8/03/2019,16h00 … )

## Some Trivia about Granules of Sound

One of the subjects which I’ve blogged about often, is the compression of sound, including Codecs which are based in the frequency-domain, rather than in the time-domain. What I’ve basically written is that in such cases, the time-domain samples of sound generate granules of frequency-domain coefficients, which are then in turn quantized. What tends to happen is that a new granule of sound is encoded every 576 time-domain samples, but that each time, a 1152-sample sampling window is used, and that due to the application of the “Modified Discrete Cosine Transform” (the ‘MDCT’), what amounts to all the odd coefficients of the Type 2 ‘DCT‘ are encoded, resulting in 576 coefficients being encoded each time. The present sampling window’s cosine function corresponds to the previous and next sampling window’s sine function, so that in a way that is orthogonal, these overlapping sampling windows also have the potential to preserve phase-information.

One observation which my readers may have about this, is the fact that while it does a good job at maintaining spectral resolution, this granule-size does not provide good temporal resolution. Therefore, a mechanism which MP3 compression introduced already, was ‘transient detection’. This feature can arbitrarily replace one of these full-length granules with 3 granules that only generate 192 frequency coefficients, and that recur as frequently.

The method by which transients are detected may be simple. For example, these short granules may tentatively have the stream subdivided all the time, but if any one of them contains more than average variance – which corresponds to signal energy – for example, if one shorter granule contains 1.5 times the average signal energy between the current 3, then this switch can take place.

What I do know is that when granules of sound – or rather, the quantized spectral information from granules of sound – are included in the stream, they include two extra bits each time, that define what the “Zone” of the present granule is. This can be one of four zones:

• A full-sized granule belonging to a stream of them,
• A shortened granule, belonging to a stream of them,
• A shortened granule, that precedes a full-sized granule,
• A shortened granule, that follows a full-sized granule.
• Because it’s inherent in MP3 compression that the entire current sampling window must overlap, partially with the preceding, and partially with the following one, there may be no special rule for how to shape a sampling window, that corresponds to a long granule, both preceded and followed by shortened ones. However, when that happens, both the preceding and following shortened granules will be encoded, to be followed and preceded respectively, by a long granule, for which reason those granules will already have long overlap-portions. Therefore, the current granule in such a case can be encoded as though it was just part of a sequence of long granules.

This information is ultimately non-trivial because it also affects the computation of sampling windows, i.e., it also affects the exact windowing function to be used when encoding. If the granule is followed or preceded by short granules, then either side of the windowing function must also be shortened. (:1)

Now, in the case of other Codecs, such as ‘OGG Vorbis’, a similar approach is taken. But I can well imagine that if specific ideals were simply implemented exactly as they were with MP3 sound, then eventually, the owners of the MP3 Codec might cry foul, over software patent violations. And yet, this problem can easily be sidestepped, let’s say by deciding that the shortened granules be made 1/2 the length of the full-sized granule, instead of 1/3 that length. And at that point the implementation would be sufficiently different from the original idea, that it would no longer constitute a patent violation.

## Which of my articles might paraphrase frequency-domain-based sound compression best.

I have written numerous postings about sound-compression, in which I did acknowledge that certain forms of it are based on time-domain signal-processing, but where several important sound-compression techniques are based in the frequency-domain. Given numerous postings from me, a reader might ask, ‘Which posting summarizes the blogger’s understanding of the concept best?’

And while many people directly pull up a posting, which I explicitly stated, describes something which will not work, but displays that concept as a point-of-view, to compare working concepts to, instead of recommending that posting again, I would recommend this posting.

Dirk