An observation about how the OGG Opus CODEC may do Stereo.

One of the subjects which I’ve written about before, is the fact that the developers of the original OGG Vorbis form of music compression, have more recently developed the OGG Opus CODEC, which is partially the CELT CODEC. And, in studying the manpage on how to use the ‘opusenc’ command (under Linux), I ran across the following detail:


--no-phase-inv
Disable use of phase inversion for intensity stereo. This trades
some stereo quality for a higher quality mono  downmix,  and  is
useful when encoding stereo audio that is likely to be downmixed
to mono after decoding.



What does this mean? Let me explain.

I should first preface that with an admission, of the fact that an idea which was true for the original version of the Modified Discrete Cosine Transform, as introduced by MP3 compression and then reused frequently by other CODECs, may not always be the case. That idea was that, when defining monaural sound, each frequency coefficient needed to be signed. Because CELT uses a form of the Type 4 Discrete Cosine Transform which is only partially lapped, it may be that all the coefficients are assumed to be positive.

This will work as long as there is no destructive interference between the same coefficient, in the overlapping region, from one frame to the next, in spite of the half-sample shift of each frequency-value. Also, a hypotenuse function should be avoided, as that would present itself as distortion. One explicit way to achieve this could be, to rotate the reference-waves (n)·90° + 45° for coefficient (n):

Where ‘FN‘ refers to the current Frame-Number.

In general, modern compressed schemes will subdivide the audible spectrum into sub-bands, which in the case of CELT are referred to as its Critical Bands. And for each frame, the way stereo is encoded for each critical band, switches back and forth between X/Y intensity stereo, and Mid/Side stereo, which also just referred to as M/S stereo. What will happen with M/S stereo is, that the (L-R) channel has its own spectral shape, independent of the (L+R) channel’s, while with X/Y stereo, there is only one spectral pattern, which is reproduced by a linear factor, as both the (L+R) component, and the (L-R) component.

Even if the (L+R) is only being recorded as having positive DCT coefficients, with M/S stereo, the need persists for the (L-R) channel to be signed. Yet, even if M/S stereo is not taking place, implying that X/Y stereo is taking place, what can happen is that:

|L-R| > (L+R)

This would cause phase-inversion to take place between the two channels, (L) and (R). Apparently, a setting will prevent this from happening.

Further, because CELT has as its main feature, that it first states the amplitude of the critical band, and then a Code-Word which identifies the actual non-zero coefficients, which may only number 4, the setting may also affect critical bands for which M/S stereo is being used during any one frame. I’m not really sure if it does. But if it does, it will also make sure that the amplitude of the (L+R) critical band exceeds or equals that of the (L-R) critical band.

The way in which the CODEC decides, whether to encode the critical band using X/Y or M/S, for any one frame, is to detect the extent to which the non-zero coefficients coincide. If the majority of them do, encoding automatically switches to X/Y… Having said that, my own ideas on stereo perception are such that, if none of the coefficients coincide, it should not make any difference whether the specific coefficients belonging to the (L-R) channel are positive or negative. And finally, a feature which CELT could have enabled constantly, is to compute whether the (L-R) critical band correlates positively or negatively with the (L+R), independently of what the two amplitudes are. And this last observation suggests that even when encoding in M/S mode, the individual coefficients may not be signed.

(Update 10/03/2019, 9h30 … )

Modern Consumer Sound Appreciation

Over recent months, I have been racking my brain, trying to answer questions I have, about how sound that was compressed in the frequency-domain, may or may not be able to preserve phase-information. This does not mean that I, personally, can hear phase-information, nor that specific MP3 Files I have been listening to, would even be good examples of how well modern MP3s compress sound. I suspect that in order to stay in business, the developers of MP3 have in fact been improving their codec, so that when played back correctly, the quality of MP3s will stay in line with more-recent formats that exist, such as OGG Vorbis…

But I think that people under-appreciate my intellectual point of view.

For many months and years, I had my doubts, that MP3 Files can in fact encode ± 180⁰ phase-shifts, i.e. a stereo-difference channel that has the correct polarity with respect to the stereo-sum channel, over a range of frequencies. What my own musings have only taught me in recent days, is that in fact, MP3 is capable of ± 180⁰ phase-separation.

Further, similar types of compression should be capable of better phase-separation than that, If their bit-rates are set high enough, that not too many of their frequency-coefficients get chopped down – according to what I have reasoned out today.

What I also know, is that the sound-formats AC3 and AAC have as an explicit feature, to store surround-sound. MPEG-2 Video Files more-or-less require the use of the AC3 codec for sound, and MP4 Files absolutely require the use of the AAC codec. And, stored in its compressed format, the surround-effect only requires ± 180⁰ phase-accuracy.

This subject is orthogonal to debate which exists, about whether it is of benefit to human listeners, to have sound reproduced at very high sample-rates, or at great bit-depths. Furthermore, I do not fully know what good a very high sample-rate – such as “192kHz” – is supposed to do any listener, if his sound has been MP3-compressed. As far as I am concerned, ultra-high sample-rates have to do with lossless compression, or no compression, which also happen to produce the same file-sizes at that signal-format.

What I did was just check, in what format iTunes downloads music by default. And it downloads its music in AAC Format. All this does for me, is corroborate a claim a friend of mine made, that he can hear his music with full positioning, since that is also the main feature of AAC, and not of MP3.

A Thought on SRS

Today, when we buy a laptop, we assume that its internal speakers offer inferior sound by themselves, but that through the use of a feature named ‘SRS’, they are enhanced, so that sound which simply comes from two speakers in front of us, seems to fill the space around us, kind of how surround-sound would work.

The immediate problem with Linux computers is, that they do not offer this enhancement. However, technophiles have known for a long time that this problem can be solved.

The underlying assumption here is, that the stereo being sent to the speakers should act as if each channel was sent to one ear in an isolated way, as if we were using headphones.

The sound that leaves the left speaker, reaches our right ear with a slightly longer time-delay, than the time-delay with which it reaches our left ear, and a converse truth exists for the right speaker.

It has always been possible to time-delay and attenuate the sound that came from the left speaker in total, before subtracting the result from the right speaker-output, and vice-verso. That way, the added signal that reaches the left ear from the left speaker, cancels with the sound that reached it from the right speaker…

The main problem with that effect, is that it will mainly seem to work when the listener is positioned in front of the speakers, in exactly one position.

I have just represented a hypothetical setup in the time-domain. There can exist a corresponding representation in the frequency-domain. The only problem is, that this effect cannot truly be achieved just with one graphical equalizer setting, because it affects (L+R) differently from how it affects (L-R). (L+R) would be receiving some recursive, negative reverb, while (L-R) would be receiving some recursive, positive reverb. But reverb can also be expressed by a frequency-response curve, as long as that has sufficiently fine resolution.

This effect will also work well with MP3-compressed stereo, because with Joint Stereo, an MP3 stream is spectrally complex in its reproduction of the (L-R) component.

I expect that when companies package SRS, they do something similar, except that they may tweak the actual frequency-response curves into something simpler, and they may also incorporate a compensation, for the inferior way the speakers reproduce frequencies.

Simplifying the curves would allow the effect to break down less, when the listener is not perfectly positioned.

We do not have it under Linux.

(Edit 02/24/2017 : A related effect is possible, by which 2 or more speakers are converted into an effectively-directional speaker-system. I.e., the intent could be, that sound which reaches our filter as the (L) channel, should predominantly leave the speaker-set at one angle, while sound which reaches our filter as the (R) channel, should leave the speaker-set at an opposing angle.

In fact, if we have an entire array of speakers – i.e. a speaker-bar – then we can apply the same sort of logic to them, as we would apply to a phased-array radar system.

The main difference with such a system, as opposed to one based on the Inter-Aural Delay, is that this one would absolutely require we know the distance between the speakers. And then we would use that distance, as the basis for our time-delays… )

Emphasizing a Presumed Difference between OGG and MP3 Sound Compression

In this posting from some time ago, I wrote down certain details I had learned about MP3 sound compression. I suppose that while I did write, that the Discreet Cosine Transform coefficients get scaled, I may have missed to mention in that same posting, that they also get quantized. But I did imply it, and I also made up for the omission in this posting.

But one subject which I did mention over several postings, was my own disagreement with the practice, of culling frequency-coefficients which are deemed inaudible, thus setting those to zero, just to reduce the bit-rate in one step, hoping to get better results, ‘because a lower initial bit-rate also means that the user can select a higher final bit-rate…’

In fact, I think that some technical observers have confused two separate processes that take place in MP3:

1. An audibility threshold is determined, so that coefficients which are lower than that are set to zero.
2. The non-zero coefficients are quantized, in such a way that the highest of them fits inside a fixed maximum, quantized value. Since a scale-factor is computed for one frequency sub-band, this also implies that close to strong frequency coefficients, weaker ones are just quantized more.

In principle, concept (1) above disagrees with me, while concept (2) seems perfectly fine.

And so based on that I also need to emphasize, that with MP3, first a Fast-Fourier Transform is computed, the exact implementation of which is not critical for the correct playback of the stream, but the only purpose of which is to determine audibility thresholds for the DCT transform coefficients, the frequency-sub-bands of which must fit the standard exactly, since the DCT is actually used to compress the sound, and then to play it back.

This FFT can serve a second purpose in Stereo. Since this transform is assumed to produce complex numbers – unlike the DCT – it is possible to determine whether the Left-Minus-Right channel correlates positively or negatively with the Left-Plus-Right channel, regarding their phase. The way to do this effectively, is to compute the dot-product between two complex numbers, and to see whether this dot-product is positive or negative. The imaginary component of one of the sources needs to be inverted for that to work.

But then negative or positive correlation can be recorded once for each sub-band of the DCT as one bit. This will tell, whether a positive difference-signal, is positive when the left channel is more so, or positive if the right channel is more so.

You see, in addition to the need to store this information, potentially with each coefficient, there is the need to measure this information somehow first.

But an alternative approach is possible, in which no initial FFT is computed, but in which only the DCT is computed, once for each Stereo channel. This might even have been done, to reduce the required coding effort. And in that case, the DCT would need to be computed for each channel separately, before a later encoding stage decides to store the sum and the difference for each coefficient. In that case, it is not possible first to determine, whether the time-domain streams correlate positively or negatively.

This would also imply, that close to strong frequency-components, the weaker ones are only quantized more, not culled.

So, partially because of what I read, and partially because of my own idea of how I might do things, I am hoping that OGG sound compression takes this latter approach.

Dirk