An observation about how the OGG Opus CODEC may do Stereo.

One of the subjects which I’ve written about before is the fact that the developers of the original OGG Vorbis form of music compression have more recently developed the OGG Opus CODEC, which is partially based on the CELT CODEC. While studying the manpage on how to use the ‘opusenc’ command (under Linux), I ran across the following detail:

 


       --no-phase-inv
              Disable use of phase inversion for intensity stereo. This trades
              some stereo quality for a higher quality mono  downmix,  and  is
              useful when encoding stereo audio that is likely to be downmixed
              to mono after decoding.

 

What does this mean? Let me explain.

I should first preface this with an admission: an idea which was true for the original version of the Modified Discrete Cosine Transform, as popularized by MP3 compression and then reused frequently by other CODECs, may not always hold. That idea was that, when defining monaural sound, each frequency coefficient needed to be signed. Because CELT uses a form of the Type 4 Discrete Cosine Transform which is only partially lapped, it may be that all the coefficients are assumed to be positive.

This will work as long as there is no destructive interference between the same coefficient, in the overlapping region, from one frame to the next, in spite of the half-sample shift of each frequency-value. Also, a hypotenuse function should be avoided, as that would present itself as distortion. One explicit way to achieve this could be to rotate the reference-waves by (n)·90° + 45° for coefficient (n):

[Equation image: MDCT_2]

Where ‘FN’ refers to the current Frame-Number.

In general, modern compression schemes will subdivide the audible spectrum into sub-bands, which in the case of CELT are referred to as its Critical Bands. For each frame, the way stereo is encoded for each critical band switches back and forth between X/Y intensity stereo and Mid/Side stereo, which is also just referred to as M/S stereo. What will happen with M/S stereo is that the (L-R) channel has its own spectral shape, independent of the (L+R) channel’s, while with X/Y stereo, there is only one spectral pattern, which is reproduced by a linear factor, as both the (L+R) component and the (L-R) component.

Even if the (L+R) is only being recorded as having positive DCT coefficients, with M/S stereo, the need persists for the (L-R) channel to be signed. Yet, even if M/S stereo is not taking place, implying that X/Y stereo is taking place, what can happen is that:

|L-R| > (L+R)

This would cause phase-inversion to take place between the two channels, (L) and (R). Apparently, the ‘--no-phase-inv’ setting will prevent this from happening.
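To make the inequality concrete, here is a minimal Python sketch. It assumes the conventional definitions mid = (L+R) and side = (L-R) for a single coefficient; the function names are hypothetical, and this is plain arithmetic, not a description of what the Opus decoder does internally.

```python
def decode_ms(mid, side):
    """Recover (left, right) from mid = L + R and side = L - R."""
    left = (mid + side) / 2.0
    right = (mid - side) / 2.0
    return left, right


def is_phase_inverted(mid, side):
    """True when the recovered L and R have opposite signs."""
    left, right = decode_ms(mid, side)
    return left * right < 0.0


# Case 1: |L-R| <= (L+R), so both channels keep the same sign.
in_phase = decode_ms(1.0, 0.5)        # (0.75, 0.25)

# Case 2: |L-R| > (L+R), so L and R end up with opposite signs.
out_of_phase = decode_ms(1.0, 3.0)    # (2.0, -1.0)

# A mono downmix (L + R) only retains the mid value in both cases.
mono_in_phase = in_phase[0] + in_phase[1]              # 1.0
mono_out_of_phase = out_of_phase[0] + out_of_phase[1]  # 1.0
```

In the second case, the left channel alone has an amplitude of 2.0, yet the mono downmix only retains 1.0. That partial cancellation is the kind of loss which the manpage says the option trades against stereo quality.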

Further, because the main feature of CELT is that it first states the amplitude of the critical band, and then a Code-Word which identifies the actual non-zero coefficients, which may only number 4, the setting may also affect critical bands for which M/S stereo is being used during any one frame. I’m not really sure if it does. But if it does, it will also make sure that the amplitude of the (L+R) critical band exceeds or equals that of the (L-R) critical band.
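The idea of stating a band’s amplitude separately from its shape can be sketched in Python as follows. This is only an illustration under loud assumptions: keeping the 4 largest-magnitude coefficients is a crude stand-in for the real Code-Word, which it does not attempt to reproduce, and the names ‘split_gain_shape’ and ‘rebuild_band’ are hypothetical.

```python
import numpy as np


def split_gain_shape(band, keep=4):
    """Split one band into an amplitude and a sparse, unit-length shape."""
    band = np.asarray(band, dtype=float)
    gain = float(np.linalg.norm(band))
    if gain == 0.0:
        return 0.0, np.zeros_like(band)
    # Assumption: keep only the 'keep' largest-magnitude coefficients.
    order = np.argsort(np.abs(band))[::-1]
    sparse = np.zeros_like(band)
    sparse[order[:keep]] = band[order[:keep]]
    # Normalise, so that the shape carries no amplitude information.
    shape = sparse / np.linalg.norm(sparse)
    return gain, shape


def rebuild_band(gain, shape):
    """Scale the unit-length shape back up to the stated amplitude."""
    return gain * shape


band = np.array([0.1, -2.0, 0.05, 1.5, 0.0, -0.7, 0.3, 0.02])
gain, shape = split_gain_shape(band)
approx = rebuild_band(gain, shape)
```

The rebuilt band has exactly the amplitude of the original, even though only 4 of its coefficients are non-zero. The amplitude survives, while the fine detail of the spectrum does not.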

The way in which the CODEC decides, whether to encode the critical band using X/Y or M/S, for any one frame, is to detect the extent to which the non-zero coefficients coincide. If the majority of them do, encoding automatically switches to X/Y… Having said that, my own ideas on stereo perception are such that, if none of the coefficients coincide, it should not make any difference whether the specific coefficients belonging to the (L-R) channel are positive or negative. And finally, a feature which CELT could have enabled constantly, is to compute whether the (L-R) critical band correlates positively or negatively with the (L+R), independently of what the two amplitudes are. And this last observation suggests that even when encoding in M/S mode, the individual coefficients may not be signed.
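The two quantities mentioned in this paragraph can be written down in a few lines of Python. This is a sketch of the ideas only, and both function names are hypothetical. It makes no claim about how the actual encoder reaches its decision.

```python
import numpy as np


def coincidence(mid_band, side_band):
    """Fraction of occupied positions at which both bands are non-zero."""
    mid_nz = np.asarray(mid_band) != 0
    side_nz = np.asarray(side_band) != 0
    either = np.count_nonzero(mid_nz | side_nz)
    if either == 0:
        return 0.0
    return np.count_nonzero(mid_nz & side_nz) / either


def correlation_sign(mid_band, side_band):
    """+1, -1 or 0, depending only on the sign of the dot product."""
    dot = float(np.dot(mid_band, side_band))
    if dot == 0.0:
        return 0
    return 1 if dot > 0 else -1


mid = np.array([0.0, 2.0, 0.0, 1.0, 0.0, 0.5])
side = np.array([0.0, -1.0, 0.0, -0.4, 0.0, 0.0])

overlap = coincidence(mid, side)             # 2 of 3 occupied positions
sign = correlation_sign(mid, side)           # negative correlation
sign_scaled = correlation_sign(mid, 10.0 * side)
```

Scaling the (L-R) band by any positive factor leaves the sign unchanged, which is what ‘independently of what the two amplitudes are’ amounts to.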

 

(Update 10/03/2019, 9h30 … )

There exists a related question, which I did not know the answer to when I was beginning to muse over the subject of compressed sound, but which in a roundabout way, my blog has given an answer to. The problem with this situation is, that few or no readers actually read my whole blog. Therefore, I will state and answer this question here, to the best of my present knowledge:

‘Given that any Discrete Cosine Transform only computes the product of an original signal with a reference wave, which happens to be a cosine wave, is it still possible to reconstruct the original signal, thereby inverting the transform losslessly, if the coefficients resulting from the first application have not been quantized?’

The answer is ‘Yes, as long as both the even and the odd coefficients are being taken into account.’
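This can be checked numerically. The Python sketch below uses an unnormalised Type 4 DCT, which is its own inverse up to a factor of 2/N. Zeroing every second coefficient is an illustrative stand-in, chosen for this example, for what happens when half the coefficients are not taken into account.

```python
import numpy as np


def dct4_matrix(n_points):
    """Reference waves cos(pi/N * (n + 1/2) * (k + 1/2))."""
    idx = np.arange(n_points) + 0.5
    return np.cos(np.pi / n_points * np.outer(idx, idx))


def dct4(x):
    return dct4_matrix(len(x)) @ x


def idct4(coeffs):
    # The same matrix inverts the transform, scaled by 2/N.
    n_points = len(coeffs)
    return (2.0 / n_points) * (dct4_matrix(n_points) @ coeffs)


rng = np.random.default_rng(1)
signal = rng.standard_normal(16)

# All coefficients kept, none quantized: the round trip is lossless.
roundtrip_error = np.max(np.abs(idct4(dct4(signal)) - signal))

# Every second coefficient discarded: the round trip is no longer exact.
partial = dct4(signal)
partial[1::2] = 0.0
partial_error = np.max(np.abs(idct4(partial) - signal))
```

With all the coefficients present, the error is at the level of floating-point rounding. With half of them zeroed, it is not.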

However, one observation which I’ve already made about the CELT CODEC is, that it has abandoned any potential to reconstruct the original wave, instead only aiming to make sure that the sound, the way Humans perceive it, ‘seems unchanged’, after decoding. Yet, every time their DCT ‘flips the sign’ of a frequency-domain coefficient, this also results in some phase-shift, within the reconstructed signal.

Yet, the way in which the DCT is being applied by the CELT CODEC seems to be such that it does in fact take both the even and the odd coefficients into account. The reason why MP3 does not has to do with the notion that the goal is data-reduction, and that if 1152-sample windows were merely repeated every 576 samples, there would be an initial doubling of the number of resulting, frequency-domain samples. With the additional assumption that MP3 needs to encode at least one bit for every two coefficients, this might not be optimal for overall data-reduction.

But, if CELT uses a 20 millisecond frame, which at a 48kHz sample rate is a 960-sample frame-size, and repeats it every 480 samples, the radical way in which individual coefficients are not being encoded seems to be enough to result in eventual data-reduction.

If a 576-sample, Type 4 DCT is being computed, that results in 576 frequency-domain samples. But if the length over which it’s to be computed is merely doubled, then, because of the half-sample shift in the frequency-values of the resulting coefficients, the result is equivalent to computing a 1152-sample, Type 2 DCT, but only keeping the odd coefficients.
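The equivalence can be verified numerically with the following Python sketch. It uses unnormalised transforms and leaves out the additional time offset which an actual MDCT applies to its reference waves, so it only checks the statement as worded above. The function names are hypothetical.

```python
import numpy as np


def dct2_coefficients(x):
    """Unnormalised Type 2 DCT: Y[m] = sum of x[n] * cos(pi/M * (n + 1/2) * m)."""
    length = len(x)
    n = np.arange(length) + 0.5
    m = np.arange(length)
    return np.cos(np.pi / length * np.outer(m, n)) @ x


def extended_dct4_coefficients(x):
    """Type 4 reference waves for N bins, evaluated over 2N samples."""
    half = len(x) // 2
    n = np.arange(len(x)) + 0.5
    k = np.arange(half) + 0.5
    return np.cos(np.pi / half * np.outer(k, n)) @ x


rng = np.random.default_rng(2)
block = rng.standard_normal(1152)

odd_type2 = dct2_coefficients(block)[1::2]          # m = 1, 3, 5, ...
extended_type4 = extended_dct4_coefficients(block)  # k = 0 .. 575

max_difference = np.max(np.abs(odd_type2 - extended_type4))
```

Both computations yield 576 values, and they agree to within rounding error, because substituting m = 2k + 1 into the Type 2 reference wave gives exactly the Type 4 reference wave at half the length.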

Hence, what MP3 and its family of CODECs needed to do, in order even to ensure that all the frequency components are being detected by the MDCT, is to rely on the overlap between consecutive granules of sound, so that they complement each other and allow the reconstruction of both amplitude and phase-position, before quantization is applied. And, if indeed only the odd coefficients are non-zero, the result is that the reference waves will be phase-shifted 90° from one granule to the next.
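How overlapping granules complement each other can be shown with a small Python sketch of the MDCT and its inverse. The sine window and the 2/N scaling are the usual textbook choices and are assumptions of this example, as is the tiny granule size of 8 samples. Nothing here is taken from the MP3 specification.

```python
import numpy as np


def mdct_basis(n_bins):
    """Reference waves cos(pi/N * (n + 1/2 + N/2) * (k + 1/2)), shape (N, 2N)."""
    n = np.arange(2 * n_bins) + 0.5 + n_bins / 2.0
    k = np.arange(n_bins) + 0.5
    return np.cos(np.pi / n_bins * np.outer(k, n))


def mdct(frame, window, basis):
    """2N windowed samples in, N coefficients out."""
    return basis @ (window * frame)


def imdct(coeffs, window, basis):
    """N coefficients in, 2N windowed samples out, still containing aliasing."""
    return window * (2.0 / len(coeffs)) * (basis.T @ coeffs)


N = 8
window = np.sin(np.pi * (np.arange(2 * N) + 0.5) / (2 * N))
basis = mdct_basis(N)

rng = np.random.default_rng(0)
signal = rng.standard_normal(6 * N)

# Overlap-add: frames of 2N samples, advancing by N samples each time.
output = np.zeros_like(signal)
for start in range(0, len(signal) - 2 * N + 1, N):
    frame = signal[start:start + 2 * N]
    output[start:start + 2 * N] += imdct(mdct(frame, window, basis), window, basis)

# Samples covered by two frames are reconstructed.
interior = slice(N, len(signal) - N)
overlap_error = np.max(np.abs(output[interior] - signal[interior]))

# A single frame, decoded on its own, is not.
single = imdct(mdct(signal[:2 * N], window, basis), window, basis)
single_frame_error = np.max(np.abs(single - signal[:2 * N]))
```

Each frame on its own only yields N coefficients for 2N samples and decodes with aliasing. The aliasing of one frame cancels against that of its neighbour, which is why the reconstruction is only exact where two frames overlap.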


 

This line of reasoning could be taken one step further, and the hypothesis could be analyzed that with MP3 compression, the (L+R) coefficients could also be unsigned, so that the signs of the (L-R) coefficients would be multiplied by those of the (L+R) … , before the (L+R) coefficients were ‘made positive’.

What I wrote above would seem to suggest that, because the reference-waves between granules are 90° out of phase, the resulting amplitudes would still be correct, just not the final phase-positions of the reconstructed (L+R) channel. One side effect of this might be that not only is a phase-shift being applied to the result, but that this phase-shift will also change for every granule, and therefore change roughly 80 times per second (44100 ÷ 576 ≈ 77 at a 44.1kHz sample rate).

There are specific situations in which the listener can hear a phase-shift. If this phase-shift changes quickly over time, it turns into a frequency-shift, because phase modulation also implies frequency modulation. Eventually, listeners would be able to hear frequency modulation.
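The relationship between the two can be illustrated in Python. Instantaneous frequency is the rate of change of phase, divided by 2π. All the numbers below (a 1000Hz tone, a phase offset which advances by 50 cycles per second, a 48kHz sample rate) are arbitrary values chosen for the example.

```python
import numpy as np

sample_rate = 48000.0
t = np.arange(4800) / sample_rate      # 0.1 seconds of sample times

base_frequency = 1000.0                # Hz, the nominal tone
phase_drift = 2.0 * np.pi * 50.0       # radians per second of added phase

# Total phase of the tone, including the steadily changing phase offset.
phase = 2.0 * np.pi * base_frequency * t + phase_drift * t

# Instantaneous frequency = (d phase / dt) / (2 * pi).
instantaneous_frequency = np.diff(phase) * sample_rate / (2.0 * np.pi)
mean_frequency = float(np.mean(instantaneous_frequency))
```

A phase offset which advances by 50 full cycles per second is indistinguishable from a tone which is 50Hz higher, so the result is 1050Hz. A constant phase offset, by contrast, contributes nothing to the derivative and leaves the frequency at 1000Hz.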

However, the result might be, that the same base-vector is being reversed, because the change in sign is being applied consistently to every second granule of sound. If that is the case, then the reconstructed (L+R) channel will simply end up with a constant phase-shift, which should not be audible.

Dirk

 
