A Gap in My Understanding of Surround-Sound Filled: Separate Surround Channel when Compressed

In This earlier posting of mine, I wrote about certain concepts in surround-sound, which were based on Pro Logic and the analog days. But I went on to write that, in the case of the AC3 or AAC audio CODECs, the actual surround channel can be encoded separately from the stereo. The purpose in doing so would be that, if decoded on the appropriate hardware, the surround channel could be sent directly to the rear speakers, thus giving 6-channel output.

While writing what I just linked to above, I had not yet realized that either channel of the compressed stream could have its phase information conserved. This had caused me some confusion. Now that I realize that the phase information can be correct, and not based on the sampling windows themselves, a conclusion comes to mind:

Such a separate, compressed surround channel would already be 90° phase-shifted with respect to the panned stereo. And what this could mean is that, if the software recognizes that only 2 output channels are to be decoded, the CODEC might just mix the surround channel directly into the stereo. The resulting stereo would then also be prepped for Pro Logic decoding.
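As a thought experiment, here is a minimal sketch of that kind of 2-channel downmix, in Python. The 90° shift comes from a Hilbert transform, and the -3 dB mixing level is an assumption of mine; a real matrix encoder's exact levels and phase conventions differ:

import numpy as np
from scipy.signal import hilbert

def downmix_with_surround(left, right, surround):
    """Fold a separate surround channel into 2-channel stereo, the way
    a matrix encoder would: phase-shift it by 90 degrees, and mix it
    out-of-phase between the two stereo channels."""
    # The imaginary part of the analytic signal is the input,
    # phase-shifted by 90 degrees.
    s90 = np.imag(hilbert(surround))
    k = np.sqrt(0.5)  # assumed -3 dB mixing level
    return left + k * s90, right - k * s90

A Pro Logic decoder would then recover the surround channel mainly from the difference between the two resulting stereo channels.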



Why Humans Can Hear the Inter-Aural Delay

I’ve given this subject some attention in the past. It seems to be a fact that, as long as a sound stream is temporally complex, humans can use the Inter-Aural Delay as one of several hints about the direction from which the sound came. But as soon as the sound is temporally uniform, we cannot.

The way I’d explain this is without physical controversy. Depending on how they fire, neurons can be seen to carry either binary or analog information. In short, one firing of a neuron can be like a ‘1’ as opposed to a ‘0’. Or, the steady rate at which a different type of neuron fires can encode an analog level. Well, some neurons seem to operate in both modes: at the onset of a signal, they fire a short burst, after which a steady firing rate indicates a sustained amplitude.

The length of the path which signals from the left auditory nerve need to take, to reach the left auditory cortex, may be exactly the same as the length of the path which signals from the right auditory nerve take, to reach the left auditory cortex.

Therefore, the auditory cortex should be in a good position to discern the order in which pulses of sound, or onsets of sound, reach it, as part of its information for determining direction, and hence for perceiving the IAD.

But AFAICT, if the amplitude of a sine-wave is constant, then there is no real way in which our cortex can discern in what relative phase it has reached our two ears.
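To make this concrete, here is a small Python experiment (my own sketch, not a model of the auditory pathway). It estimates the delay as the lag of the peak of the cross-correlation between the two ear signals:

import numpy as np

fs = 48000                      # sample rate, in samples per second
true_delay = 20                 # about 0.42 milliseconds
t = np.arange(0, 0.05, 1 / fs)

def estimated_delay(left, right):
    """Return the lag, in samples, of the cross-correlation peak."""
    corr = np.correlate(left, right, mode="full")
    return int(np.argmax(corr)) - (len(right) - 1)

# A temporally complex signal: a 2-millisecond noise burst. Its onset
# pins the correlation peak to the true delay.
burst = np.random.randn(len(t)) * (t < 0.002)
print(estimated_delay(np.roll(burst, true_delay), burst))   # prints 20

# A steady 2 kHz sine wave: the correlation is periodic, so the peak
# only matches the true delay up to a whole number of periods.
sine = np.sin(2 * np.pi * 2000 * t)
print(estimated_delay(np.roll(sine, true_delay), sine))

For the burst, the peak sits at the true delay; for the sine wave, the reported lag is off by some multiple of the 24-sample period, which is exactly the ambiguity described above.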

(Updated 11/22/2018, 18h55 … )


Some Specific Detail about MP3 Compression of Sound

In This Posting, I wrote at length about a weakness that exists in MP3-compressed sound, basing my text on the Discrete Cosine Transform and on some of its implications. I wrote about ‘a rational approach’, according to which it might make sense to use a sampling interval of 1024 samples. But in fact, with MP3 compression, each frame has 1152 samples, divided into two granules of 576 samples each. Somebody please correct me, if I have this wrong.

But there is a certain aspect of MP3 encoding which I did not mention, that has to do with the actual representation of the coefficients, and that has implications for what can and cannot be done in the realm of the Fourier Transform used. A Fourier Transform by itself does not compress data. It only alters the representation of the data, from the time-domain into the frequency-domain. This is useful in sound compression, because altering the data in the frequency-domain does not damage its suitability for listening, the way that altering its representation in the time-domain would.

I.e., we can quantize the signal after having performed the Fourier Transform on it, but not before.
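As a small illustration of that ordering, in Python (the 576-sample block length is borrowed from MP3's granules; the quantizer step size is an arbitrary assumption):

import numpy as np
from scipy.fft import dct, idct

block = np.sin(2 * np.pi * 440 / 44100 * np.arange(576))   # one granule

# Transform first, quantize in the frequency-domain, then invert.
coeffs = dct(block, type=2, norm="ortho")
step = 0.05                                  # assumed quantizer step
quantized = np.round(coeffs / step).astype(int)   # small, repeating integers
restored = idct(quantized * step, type=2, norm="ortho")

# The quantization error is spread thinly across the whole block,
# instead of appearing as direct waveform distortion.
print(np.max(np.abs(block - restored)))

The small integers that result are also exactly the kind of symbols that the entropy encoding described below depends on.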

One of the aspects of MP3 compression which truly reduces the resulting bit-rates substantially is called “Entropy Encoding”. This is an encoding scheme by which each of a finite set of symbols is assigned a string of bits to represent it in the data stream, with the shortest bit-strings going to the most frequently-occurring symbols, to result in the shortest possible bit-stream.

  1. One aspect of Entropy Encoding which I do not see mentioned often enough, is the fact that the symbols need to repeat themselves, in order for this scheme to achieve any compression. Hence, if the coefficients used in sound compression were to consist of floating-point numbers, the probability that any one of them would actually occur twice in the data stream would be small, and Entropy Encoding would not be a suitable means to reduce the bit-rate.
  2. Further, traditionally, in order for Entropy Encoding to be decoded, the data stream needed to be accompanied by a decoding table, which maps each variable-bit-length code to its intended symbol. In sound compression, even if each symbol was only an exact 15-bit value, that value would need to be stated once, in the header of each frame, for every variable-bit-length code. And having to do so would result in unacceptably high bit-rates overall.

And so both of these limitations of Entropy Encoding had to be surmounted, in order for MP3 compression to exist as we have it today.
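To illustrate both points, here is a minimal Huffman coder in Python. This is my sketch of the general technique, not the fixed tables of the MP3 standard; it only pays off because the quantized integer coefficients repeat so often:

import heapq
from collections import Counter

def huffman_code(symbols):
    """Assign the shortest bit-codes to the most frequent symbols."""
    heap = [[n, [sym, ""]] for sym, n in Counter(symbols).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = "0" + pair[1]
        for pair in hi[1:]:
            pair[1] = "1" + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return dict(map(tuple, heap[0][1:]))

# Quantized coefficients cluster around zero and repeat constantly:
coeffs = [0, 0, 1, 0, -1, 0, 0, 2, 0, 0, 1, 0, 0, 0, -1, 0]
code = huffman_code(coeffs)
print(code)                                   # zero gets a 1-bit code
print(sum(len(code[c]) for c in coeffs), "bits in total")

And because the decoding table is fixed by the standard, as noted below, only the bit-stream itself needs to be transmitted.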

(As of 05/23/2016, I have learned the following about this scheme: )

What happens with MP3 at the encoding level, after the coefficients have passed the filtering of the psychoacoustic criteria, is that they are scaled. The scale-factor is written once for each of 22 bands of frequencies, before the Huffman Codes are written that state all the frequency coefficients.

Further, because Huffman Encoding by itself does not yield enough compression, pairs of coefficients are encoded instead of single coefficients. The statistics of this yield better compression, mainly because certain pairs, especially (0, 0), occur so frequently that one short code covers two coefficients at once.

What also happens with MP3 is that this fixed table (for pairs of integers) is assumed by the standard, so that no decoding table needs to be transmitted with the stream.
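Here is a rough sketch of what that scaling step accomplishes. The uniform band edges and the 0..15 integer range are simplifying assumptions of mine; the real scale-factor bands follow the ear's frequency resolution:

import numpy as np

def scale_per_band(coeffs, n_bands=22):
    """One scale-factor per band maps that band's coefficients onto
    small integers, which the Huffman tables encode compactly."""
    scales, ints = [], []
    for band in np.array_split(coeffs, n_bands):
        scale = max(float(np.max(np.abs(band))) / 15, 1e-12)
        scales.append(scale)
        ints.append(np.round(band / scale).astype(int))
    return scales, ints

The decoder multiplies each band back by its scale-factor, so the cost of transmitting the 22 scale-factors is paid only once per granule.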

(What had caused me to follow a misconception until 05/23/2016 :

Apparently, a Huffman Code of 15 signals that a full-precision ‘big value’ is written following that Huffman Code, with a precision of 13 bits.

The crucial note to myself here is that the Entropy Encoding table is specifically the Huffman Coding Table, and that for this reason, integers greater than 15 can also be encoded. But by that time, we would have reached the point of diminishing returns. And more precisely, it is the Huffman Coding Table modified to encode Pairs of integers, so that a maximum compression down to 12.5% becomes possible, instead of merely 25%. )
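In hypothetical Python, that escape mechanism would look something like this (the sign bit travels separately in the real format):

def encode_magnitude(v):
    """Magnitudes below 15 become Huffman symbols directly; anything
    larger is sent as the escape symbol 15, followed by the remainder
    (v - 15), written out in 13 raw bits."""
    if v < 15:
        return (v, None)
    assert v - 15 < 2 ** 13     # must fit in the 13 escape bits
    return (15, v - 15)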

(Edit 06/06/2016 : ) It should be noted that the practice of Huffman Encoding pairs of values is really only advantageous if at least one of them was equal to zero, often. Otherwise, it would work just as well to encode them individually.
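A quick way to see why, in Python:

from collections import Counter

coeffs = [0, 0, 1, 0, 0, 0, -1, 0, 0, 2, 0, 0]
pairs = list(zip(coeffs[0::2], coeffs[1::2]))
print(Counter(pairs))   # (0, 0) dominates, so it earns the shortest code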

(Edit 05/28/2016 : ) What strikes me as most plausible is that with MP3, initially the odd-numbered DCT coefficients are computed, to avoid missing out-of-phase sine-waves. But then, even-numbered coefficients may be derived from them, so that the stream can be decoded again efficiently. The even-numbered coefficients will have the property of being 180 degrees out of phase between two 50%-overlapping sampling intervals / frames. This can make playback easier, in that the decoder only needs to keep track of even-numbered and odd-numbered frames / granules.
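What I am describing resembles the Modified Discrete Cosine Transform, in which 50%-overlapping windows produce aliasing terms that cancel between adjacent frames. Here is a generic MDCT round-trip in Python; this is my sketch of the textbook transform, whereas the real MP3 filterbank applies its MDCT per sub-band, not across a whole 576-sample granule:

import numpy as np

def mdct(x):
    """Forward MDCT: 2N windowed samples in, N coefficients out."""
    N = len(x) // 2
    n, k = np.arange(2 * N), np.arange(N)[:, None]
    return (x * np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))).sum(axis=1)

def imdct(X):
    """Inverse MDCT: N coefficients in, 2N samples out, still aliased."""
    N = len(X)
    n, k = np.arange(2 * N)[:, None], np.arange(N)
    return (2 / N) * (X * np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))).sum(axis=1)

N = 576
signal = np.random.randn(3 * N)
w = np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))   # sine window

# Two 50%-overlapping windows: the aliasing of each cancels the other's.
a = imdct(mdct(signal[0:2 * N] * w)) * w
b = imdct(mdct(signal[N:3 * N] * w)) * w
print(np.allclose(a[N:] + b[:N], signal[N:2 * N]))   # True

This also shows why the stream decodes efficiently: each frame only needs to be overlap-added with its neighbour.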

Now, I would not say that people should never use MP3. It has its uses. But it also has drawbacks, which are usually correlated with the use-case that MP3 was originally designed for: listening to music over limited, early Internet data-connections. It may be just as useful for compressing speech, if the aim is to reduce the bit-rate strongly, and to accept some level of information-loss.

At the bit-rates used today, it leaves the user with a sound quality superior to what the old tape cassettes offered, but inferior to what Raw CDs offered.

It was never really intended to encode movie sound-tracks, especially since those often involve ‘Surround Sound’, which MP3 generally does not capture. Yet, I can see myself using it to encode the audio portion of certain video-clips, if I know that those clips do not include surround sound. An example might be a rock concert, or some random clip I was experimenting with, for which my original production never even included any surround information.

There exist numerous alternatives to MP3, that are also available to ordinary users today.


(Edit 05/24/2016 : ) There are some other idiosyncrasies in real MP3 compression, which I had noted at some earlier point in time, but which I had since forgotten:

One of them is that, because the “Fast Fourier Transform” is really just a fast algorithm for computing the ‘Discrete Fourier Transform’, a DFT is in effect computed in order to derive the psychoacoustic parameters. Separately, MP3’s filter bank first splits the signal into 32 frequency sub-bands. But then the DCT gets used, actually to compress the sound.

Another idiosyncrasy is that MP3 will use discrete transient detection, to replace one granule that had a length of 576 with 3 granules that have a length of 192, thus implying a new sampling interval of 384. This defines 4 window types to which any granule can belong: ‘start’, ‘normal’, and ‘stop’ types, as well as the short, ‘fast’ type. Each type has its own sampling window defined.
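As a toy version of that decision in Python (the energy-ratio test is my own stand-in; the real encoder consults its psychoacoustic model, and uses the ‘start’ and ‘stop’ windows to bridge between the long and short shapes):

import numpy as np

def classify_granules(samples, granule=576, jump=4.0):
    """Flag a granule for short windows when its energy rises sharply
    relative to the previous granule (a crude transient test)."""
    kinds, prev = [], None
    for i in range(0, len(samples) - granule + 1, granule):
        energy = float(np.sum(samples[i:i + granule] ** 2)) + 1e-12
        # A flagged granule would be encoded as 3 windows of 192.
        kinds.append("short" if prev and energy / prev > jump else "long")
        prev = energy
    return kinds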

(Edit 06/06/2016 : ) There was an interesting detail I read about, according to which the scale factor of each of the 22 encoded sub-bands is stored in the per-granule information, with the exclusion of the highest-frequency sub-band. Apparently, to have the encoder compute a scale factor for all the sub-bands would have implied that a balanced amount of information was to be allocated to each one.

However, the highest sub-band was thought by the designers to contain less-pleasant information than the others, which is not supposed to take up as many bits necessarily. Therefore, the decoder is expected to reuse the scale factor of the second-highest sub-band, as the one for the highest.

The highest sub-band will then still take up many bits, if its amplitudes were quite high during encoding.

Also, whether the Fourier Transform used to derive the psychoacoustic parameters is an ‘FFT’ or a plain ‘DFT’ is a decision left to the programmers of the codec, since this transform is not used actually to encode the granules. If a programmer wanted to use a plain DFT here, with 32 sub-bands of its own, then that programmer was recognizing the fact that today’s CPUs have far more power than older ones did, and was trying to improve the quality with which the granules are encoded.

By default, an FFT is used as the first transform, simply because doing so follows the general principle of trying to reduce the total number of computations needed by the encoder. Its purpose is to determine the audibility thresholds, according to which some of the coefficients of the DCT are set to zero, on the grounds that those should be inaudible.
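A crude sketch of that zeroing step follows; the flat -60 dB floor and the uniform band edges are stand-ins of my own invention, whereas a real psychoacoustic model computes frequency-dependent masking curves:

import numpy as np
from scipy.fft import rfft, dct

def zero_inaudible(block, bands=32, drop_db=60.0):
    """Use an FFT to gauge each band's level, then zero the DCT
    coefficients of every band lying far below the loudest one.
    The block length is assumed to be 576 samples."""
    coeffs = dct(block, type=2, norm="ortho")
    level = np.abs(rfft(block))[:len(block) // 2]   # drop the Nyquist bin
    band_level = level.reshape(bands, -1).sum(axis=1)
    keep = band_level > band_level.max() * 10 ** (-drop_db / 20)
    return coeffs * np.repeat(keep, len(coeffs) // bands)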

This was also why a ‘DCT’ was used for the actual sound information. That could also have been a DFT, but with the phase information later ignored…