An observation about how the OGG Opus CODEC may do Stereo.

One of the subjects I've written about before is the fact that the developers of the original OGG Vorbis form of music compression have more recently developed the OGG Opus CODEC, which is partially based on the CELT CODEC. While studying the manpage on how to use the 'opusenc' command (under Linux), I ran across the following detail:

 


       --no-phase-inv
              Disable use of phase inversion for intensity stereo. This trades
              some stereo quality for a higher quality mono  downmix,  and  is
              useful when encoding stereo audio that is likely to be downmixed
              to mono after decoding.

 

What does this mean? Let me explain.

I should first preface that with an admission: an idea which was true for the original version of the Modified Discrete Cosine Transform, as introduced by MP3 compression and then reused frequently by other CODECs, may not always hold. That idea was that, when defining monaural sound, each frequency coefficient needed to be signed. Because CELT uses a form of the Type 4 Discrete Cosine Transform which is only partially lapped, it may be that all of its coefficients are assumed to be positive.
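
For reference, this is the standard, fully-lapped MDCT as MP3 and AAC use it, in which a window of 2N input samples yields N signed coefficients:

    X_k = \sum_{n=0}^{2N-1} x_n \cos\left[ \frac{\pi}{N} \left( n + \frac{1}{2} + \frac{N}{2} \right) \left( k + \frac{1}{2} \right) \right], \qquad k = 0, \ldots, N-1

It is the signs of these coefficients that would be dropped, if CELT does indeed assume positive values.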

This will work as long as there is no destructive interference between the same coefficient, in the overlapping region, from one frame to the next, in spite of the half-sample shift of each frequency value. Also, a hypotenuse function should be avoided, as that would present itself as distortion. One explicit way to achieve this could be to rotate the reference waves by (n)·90° + 45°, for coefficient (n):

(Equation image 'MDCT_2', defining the rotated reference waves.)

Where 'FN' refers to the current Frame-Number.

In general, modern compression schemes will subdivide the audible spectrum into sub-bands, which in the case of CELT are referred to as its Critical Bands. For each frame, the way stereo is encoded for each critical band switches back and forth between X/Y intensity stereo and Mid/Side stereo, the latter also just referred to as M/S stereo. What will happen with M/S stereo is that the (L-R) channel has its own spectral shape, independent of the (L+R) channel's, while with X/Y stereo there is only one spectral pattern, which is reproduced by a linear factor, as both the (L+R) component and the (L-R) component.
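
To make that distinction concrete, here is a minimal numerical sketch of one critical band, in both representations. This is my own toy illustration, not code taken from the CELT source, and the coefficient values are invented:

    import numpy as np

    # Hypothetical MDCT coefficients of one critical band, for each channel.
    L = np.array([0.9, 0.0, 0.4, 0.0])
    R = np.array([0.7, 0.0, 0.1, 0.3])

    # Mid/Side (M/S): the (L-R) channel keeps its own, signed spectral shape.
    mid  = L + R                  # (L+R)
    side = L - R                  # (L-R), an independent, signed pattern

    # Intensity (X/Y): only one spectral pattern survives; the stereo image of
    # the band is reduced to per-band levels applied to that one pattern.
    shape  = mid / np.linalg.norm(mid)       # the shared pattern
    g_sum  = np.dot(mid,  shape)             # level of the (L+R) component
    g_diff = np.dot(side, shape)             # level of the (L-R) component

    L_xy = 0.5 * (g_sum + g_diff) * shape    # both reconstructed channels are
    R_xy = 0.5 * (g_sum - g_diff) * shape    # scaled copies of the same shape

In the X/Y case, left and right only differ by their levels; in the M/S case, the (L-R) pattern is preserved separately, signs and all.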

Even if the (L+R) channel is only being recorded as having positive DCT coefficients, with M/S stereo the need persists for the (L-R) channel to be signed. Yet, even if M/S stereo is not taking place, implying that X/Y stereo is taking place, what can happen is that:

|L-R| > (L+R)

This would cause phase inversion to take place between the two channels, (L) and (R). Apparently, the setting above will prevent this from happening.
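
As a rough sketch of what I understand this guards against (my own interpretation, not code from the reference encoder), the per-band check could look like this:

    import numpy as np

    def band_amplitudes(L, R):
        # Return the amplitudes of the (L+R) and (L-R) components of one band.
        mid  = L + R
        side = L - R
        return np.linalg.norm(mid), np.linalg.norm(side)

    # Hypothetical band in which the two channels are largely out of phase:
    L = np.array([ 0.5, -0.2,  0.3])
    R = np.array([-0.4,  0.3, -0.2])

    a_sum, a_diff = band_amplitudes(L, R)

    # When the (L-R) amplitude exceeds the (L+R) amplitude, an intensity-stereo
    # reconstruction would have to flip the sign of one channel.  My reading of
    # --no-phase-inv is that the encoder then avoids that flip, at some cost to
    # the stereo image.
    if a_diff > a_sum:
        print("This band would require phase inversion between (L) and (R).")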

Further, because one of CELT's main features is that it first states the amplitude of each critical band, and then a Code-Word which identifies the actual non-zero coefficients, which may only number 4, the setting may also affect critical bands for which M/S stereo is being used during any one frame. I'm not really sure whether it does. But if it does, it will also make sure that the amplitude of the (L+R) critical band equals or exceeds that of the (L-R) critical band.
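
The amplitude-plus-Code-Word idea can be illustrated as follows. The real encoder uses Pyramid Vector Quantization for the band's shape; this toy stand-in just keeps a small number of non-zero coefficients, and is only meant as a sketch:

    import numpy as np

    def encode_band(coeffs, k=4):
        # Toy gain/shape split: one amplitude per band, plus a sparse shape with
        # at most k non-zero entries, standing in for the real PVQ code-word.
        gain = np.linalg.norm(coeffs)
        if gain == 0.0:
            return 0.0, np.zeros_like(coeffs)
        shape = coeffs / gain
        keep = np.argsort(np.abs(shape))[-k:]    # indices of the k largest entries
        sparse = np.zeros_like(shape)
        sparse[keep] = shape[keep]
        sparse /= np.linalg.norm(sparse)         # re-normalize the kept shape
        return gain, sparse

    def decode_band(gain, sparse):
        return gain * sparse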

The way in which the CODEC decides whether to encode a critical band using X/Y or M/S, for any one frame, is to detect the extent to which the non-zero coefficients coincide. If the majority of them do, encoding automatically switches to X/Y… Having said that, my own ideas on stereo perception are such that, if none of the coefficients coincide, it should not make any difference whether the specific coefficients belonging to the (L-R) channel are positive or negative. And finally, a feature which CELT could have enabled constantly is to compute whether the (L-R) critical band correlates positively or negatively with the (L+R) band, independently of what the two amplitudes are. This last observation suggests that, even when encoding in M/S mode, the individual coefficients may not be signed.
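
A toy version of that decision, again my own sketch of the idea rather than CELT's actual rule, could simply measure how many of the channels' non-zero coefficients coincide within the band:

    import numpy as np

    def choose_stereo_mode(L, R, threshold=0.5):
        # Pick intensity (X/Y) when the channels' non-zero coefficients largely
        # coincide, and Mid/Side (M/S) otherwise.  The threshold is invented.
        nz_L = np.abs(L) > 1e-9
        nz_R = np.abs(R) > 1e-9
        union = np.count_nonzero(nz_L | nz_R)
        if union == 0:
            return "X/Y"
        overlap = np.count_nonzero(nz_L & nz_R) / union
        return "X/Y" if overlap >= threshold else "M/S"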

 

(Update 10/03/2019, 9h30 … )


Different types of music, for testing Audio Codecs – Or Maybe Not.

One of my recent activities has been to take Audio CDs from the 1980s and 1990s, whose encoding was limited only by the 44.1kHz sample rate, the bit-depth, and whatever type of Sinc Filter was once used to master them, but not by any sort of lossy compression, and to "rip" those into different types of lossy compression, in order to evaluate the latter. The two types of compression I recently played with were 'AAC' (plain) and 'OGG Opus', at 128kbps both times.
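
For reference, roughly equivalent encodes can be scripted as follows. The file names are placeholders, and, as the updates below explain, exactly which tool and version does the Opus encode matters:

    import subprocess

    wav = "track01.wav"          # hypothetical file ripped from the CD

    # Plain AAC at 128 kbps, via ffmpeg's built-in encoder.
    subprocess.run(["ffmpeg", "-i", wav, "-c:a", "aac", "-b:a", "128k",
                    "track01.m4a"], check=True)

    # OGG Opus at 128 kbps, via the opusenc tool from the opus-tools package.
    subprocess.run(["opusenc", "--bitrate", "128", wav, "track01.opus"],
                   check=True)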

One of the apparent facts I learned was that Phil Collins' music is not the best type to try this with. The reason is that much of his music was recorded using electronic instruments of his era, whose main function was to emulate standard acoustical instruments, but in a way that was 'acoustically pure'. The fact that Phil Collins started his career as a drummer did not prevent him from releasing later, solo albums.

If somebody is listening to an entire string section of an orchestra, or to a brass section, then one factor which contributes to the nature of the sound is that every Violin is minutely off-pitch, as would be every French Horn. What that also means is that the resulting sound is "Thick": its spectral energy is spread in subtle ways. By contrast, if somebody mixes two sine-waves that have exactly the same frequency, he obtains another sine-wave, and if he mixes 10 sine-waves that have exactly the same frequency, he still obtains one sine-wave.
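
That claim is easy to verify numerically. The following sketch, using made-up frequencies and detunings, compares the spectrum of ten identical sine-waves with that of ten slightly detuned ones:

    import numpy as np

    fs = 48000
    t = np.arange(fs) / fs                      # one second of audio

    # Ten sine-waves at exactly 440 Hz: the sum is still a single sine-wave.
    identical = sum(np.sin(2 * np.pi * 440 * t + phi)
                    for phi in np.random.uniform(0, 2 * np.pi, 10))

    # Ten sine-waves detuned by up to +/- 3 Hz: the sum beats, and its energy
    # is smeared over neighbouring frequencies, the "Thick" sound of a section.
    detuned = sum(np.sin(2 * np.pi * f * t)
                  for f in 440 + np.random.uniform(-3, 3, 10))

    for name, signal in (("identical", identical), ("detuned", detuned)):
        spectrum = np.abs(np.fft.rfft(signal))
        busy = np.count_nonzero(spectrum > 0.01 * spectrum.max())
        print(name, "- bins above 1% of the peak:", busy)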

Having sound to start from which is 'Thick' is a good basis for testing Codecs. Phil Collins does not provide that. Therefore, if the acoustical nature of the recording is boring, I have no way of knowing whether it was the Codec that failed to bring out greater depth, or whether that was the fault of Phil Collins.

(Update 8/03/2019, 12h55 : )

Since the last time I edited this posting, I have learned that Debian / Stretch, the Linux version I have installed on the computer I name 'Phosphene', only ships with 'libopus v1.2~alpha2-1' from the package repositories. Apparently, when using this version, the best one can hope for is an equivalent to 128kbps MP3 quality. This was the true reason for which I was obtaining inferior results, along with the fact that I had given the command to encode my Opus Files using a version of 'ffmpeg' that just happened to include marginal support for 'Opus', instead of using the actual 'opus-tools' package.

What I have now done is to download and custom-compile 'libopus v1.3.1', as well as its associated tools, and to make sure that the programs work. Rumour has it that, when this version is used at a bit-rate of 96kbps, virtual transparency will result.

And I've written quite a long synopsis as to why this might be so.

(Update 8/03/2019, 15h50 : )

I have now run an altered experiment, encoding my Opus Files at 96kbps, and discovered to my amazement that the sound I obtained seemed better than what I had already obtained above, using 128kbps AAC-encoded Files.


 

(Update 10/02/2019, 12h25 : )

When I use the command 'opusenc', which I've custom-compiled as written above, it defaults to a frame-size of 20 milliseconds. Given a sampling rate of 48kHz, this amounts to a frame-size, or granule, of 960 samples. This is very different from what the developers were suggesting in their article in 2010 (see the posting linked to above). With that sampling interval, the tonal accuracy will be approximately twice as good as it was with MP3 encoding or with AAC encoding, without requiring that any "Hadamard Transforms" be used. With the default setting, Opus will generally be able to distinguish frequencies that are 25Hz apart, while it was in the nature of MP3 only to be able to distinguish frequencies that are 40Hz apart, except for the fact that the lowest distinguishable frequency was near 20Hz.
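
The arithmetic behind those two numbers is straightforward, if each frame of N samples is taken to be analysed together with the N samples of the following frame (a rough estimate, since the actual lapping differs between the CODECs):

    def bin_spacing(sample_rate, frame_samples):
        # Approximate MDCT frequency resolution: a frame of N samples analysed
        # together with the next N gives bins sample_rate / (2 * N) apart.
        return sample_rate / (2 * frame_samples)

    print("Opus, 20 ms frame at 48 kHz:", bin_spacing(48000, 960), "Hz")    # 25.0
    print("MP3, long granule at 44.1 kHz:", bin_spacing(44100, 576), "Hz")  # ~38.3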

If Hadamard Transforms are in fact used, then it now strikes me as more likely that they will be used to increase temporal resolution at the expense of spectral resolution, not the reverse. And the reason I think this is the following paragraph in the manpage for the 'opusenc' command-line:

 


       --framesize N
              Set maximum frame size in milliseconds (2.5, 5, 10, 20, 40,  60,
              default: 20)
              Smaller  framesizes  achieve lower latency but less quality at a
              given bitrate.
              Sizes greater than 20ms  are  only  interesting  at  fairly  low
              bitrates.


(Update 8/12/2019, 6h00 : )

I have now listened very carefully to the Phil Collins music encoded with AAC at 128kbps, and to the exact same songs encoded with Opus at 96kbps. What I've come to find is that Opus seems to preserve spectral complexity better than AAC. However, the AAC-encoded versions of the same music seem to provide slightly better perception of the positioning, all around the listener, of instruments and voices in the lower-mid-range frequencies, as a result of Stereo. And this would be when the listener is using a good set of headphones.

Dirk

 

A Gap in My Understanding of Surround-Sound Filled: Separate Surround Channel when Compressed

In this earlier posting of mine, I had written about certain concepts in surround-sound which were based on Pro Logic and the analog days. But I had gone on to write that, in the case of the AC3 or the AAC audio CODEC, the actual surround channel could be encoded separately from the stereo. The purpose in doing so would have been that, if decoded on the appropriate hardware, the surround channel could be sent directly to the rear speakers, thus giving 6-channel output.

While writing what I just linked to above, I had not yet realized that either channel of the compressed stream could have its phase information conserved. This had caused me some confusion. Now that I realize that the phase information could be correct, and not based on the sampling windows themselves, a conclusion comes to mind:

Such a separate, compressed surround-channel would already be 90° phase-shifted with respect to the panned stereo. And what this means could be that, if the software recognizes that only 2 output channels are to be decoded, the CODEC might just mix the surround channel directly into the stereo. The resulting stereo would then also be prepped for Pro Logic decoding.
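
A minimal sketch of that kind of downmix follows. It is my own illustration of the general, Pro Logic-compatible idea, not the documented behaviour of any particular decoder: the decoded surround channel is phase-shifted by 90° and folded into the two front channels with opposite signs:

    import numpy as np
    from scipy.signal import hilbert

    def fold_surround_into_stereo(left, right, surround, gain=0.7071):
        # Shift the surround channel by 90 degrees (imaginary part of the
        # analytic signal), then add it to the two channels with opposite
        # signs, which is what a Pro Logic decoder expects to find.
        shifted = np.imag(hilbert(surround))
        lt = left  - gain * shifted
        rt = right + gain * shifted
        return lt, rt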

Dirk