Some Specific Details about MP3 Compression of Sound

In This Posting, I wrote at length about a weakness that exists in MP3-compressed sound, basing my text on the Discrete Cosine Transform and some of its implications. I wrote about ‘a rational approach’, according to which it might make sense to use a sampling interval of 1024 samples. But I do know that with actual MP3 compression, each frame has 1152 samples, and each of its two granules is 576 samples long. Somebody please correct me, if I have this wrong.

But there is a certain aspect of MP3 encoding which I did not mention, that has to do with the actual representation of the coefficients, and that has implications for what can and cannot be done in the realm of the Fourier Transform used. A Fourier Transform by itself does not compress data. It only changes the representation of the data, from the time-domain into the frequency-domain, which is useful in sound compression because altering the data in the frequency-domain does not damage its suitability for listening, the way that altering it in the time-domain would.

I.e., we can quantize the signal after having performed the Fourier Transform on it, but not before.
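As a minimal sketch of that principle (using NumPy and SciPy, a single 576-sample block, and a crude uniform quantizer which no real codec would use as-is):

```python
import numpy as np
from scipy.fft import dct, idct

def quantize_in_frequency_domain(block, step=0.05):
    """Transform one block of samples, coarsely quantize the
    frequency-domain coefficients, and transform back."""
    coeffs = dct(block, type=2, norm='ortho')     # time-domain -> frequency-domain
    quantized = np.round(coeffs / step) * step    # uniform quantizer (illustrative only)
    return idct(quantized, type=2, norm='ortho')  # frequency-domain -> time-domain

# A 576-sample test block: a quiet tone plus a little noise.
t = np.arange(576)
block = 0.5 * np.sin(2 * np.pi * 440 * t / 44100) + 0.01 * np.random.randn(576)

reconstructed = quantize_in_frequency_domain(block)
print("max error:", np.max(np.abs(reconstructed - block)))
```

The error introduced this way is confined to the frequency-domain coefficients, which is what makes it possible to shape it around what the ear will or will not notice; quantizing the raw samples directly offers no such control.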

One of the aspects of MP3 compression which truly reduces the bit-rates obtained substantially is called “Entropy Encoding”. This is an encoding scheme by which each of a finite set of symbols is assigned a string of bits to represent it in a data stream, with bit-lengths that vary inversely with each symbol’s frequency of occurrence, so as to result in the shortest possible bit-stream.

  1. One aspect of Entropy Encoding which I do not see mentioned often enough, is the fact that the symbols need to repeat themselves, in order for this scheme to achieve any compression (as the sketch after this list illustrates). Hence, if the coefficients used in sound compression were to consist of floating-point numbers, the probability that any one of them would actually occur twice in the data stream would be small, and Entropy Encoding would not be a suitable means to reduce the bit-rate.
  2. Further, traditionally, in order for Entropy Encoding to be decoded, the data stream needed to be accompanied by a decoding table, which maps each of the variable-bit-length codes onto its intended symbol. In sound compression, even if each coefficient only needed a 15-bit value to be stated exactly, we would nevertheless need to state that 15-bit value once, for each variable-bit-length code, in the header of each frame. Having to do so would result in unacceptably high bit-rates overall.

And so both of these limitations of Entropy Encoding had to be surmounted, in order for MP3 compression to exist as we have it today.
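To illustrate the first limitation, here is a classical Huffman coder in miniature; the table-building routine is my own and has nothing to do with any table from the MP3 standard. It only compresses the stream because the quantized integer coefficients repeat:

```python
import heapq
from collections import Counter

def build_huffman_code(symbols):
    """Build a classical Huffman code: symbols that occur more often
    receive shorter bit-strings."""
    freq = Counter(symbols)
    # Each heap entry: (frequency, tie-breaker, {symbol: code-so-far})
    heap = [(f, i, {s: ''}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, table1 = heapq.heappop(heap)
        f2, _, table2 = heapq.heappop(heap)
        merged = {s: '0' + c for s, c in table1.items()}
        merged.update({s: '1' + c for s, c in table2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

# Quantized integer coefficients repeat, so the code compresses them:
quantized = [0, 0, 1, 0, 2, 0, 0, 1, 0, 3, 0, 0]
code = build_huffman_code(quantized)
bits = ''.join(code[s] for s in quantized)
print(code, len(bits), "bits instead of", 16 * len(quantized))
```

If the symbols were unique floating-point values, every one of them would end up with a code roughly as long as the value itself, plus the cost of the table; hence the two limitations above.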

(As of 05/23/2016, I have learned the following about this scheme: )

What happens with MP3, at the encoding level, is that after the coefficients have passed filtering through the psychoacoustic criteria, they are scaled. The scale-factor is written once for each of 22 bands of frequencies, before the Huffman Codes are written that state all the frequency coefficients.
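A drastically simplified sketch of that step; the band boundaries and the choice of scale-factor below are placeholders of my own, not the 22 scale-factor bands defined by the standard:

```python
import numpy as np

def scale_and_quantize(coeffs, band_edges):
    """For each band, choose one scale-factor and reduce the band's
    coefficients to small integers; the scale-factor is what gets
    written once per band, before the Huffman codes."""
    scale_factors = []
    quantized = np.empty_like(coeffs, dtype=int)
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        band = coeffs[lo:hi]
        scale = max(np.max(np.abs(band)) / 15.0, 1e-9)  # map the band into roughly -15..15
        scale_factors.append(scale)
        quantized[lo:hi] = np.round(band / scale).astype(int)
    return scale_factors, quantized

# 576 coefficients of one granule, split into a handful of made-up bands:
coeffs = np.random.randn(576) * np.linspace(4.0, 0.1, 576)
band_edges = [0, 4, 12, 30, 72, 150, 300, 576]   # placeholder boundaries, not MP3's bands
factors, ints = scale_and_quantize(coeffs, band_edges)
print(len(factors), "scale-factors; largest integer:", int(np.max(np.abs(ints))))
```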

Further, because Huffman Encoding by itself does not yield enough compression, pairs of coefficients are encoded instead of single coefficients. The statistics of adjacent pairs, which are dominated by small values and zeroes, yield better compression than the statistics of single values would.

What also happens with MP3 is that this fixed table (for pairs of integers) is assumed by the standard, so that it never needs to be transmitted with the stream, which overcomes the second limitation noted above.

(What had caused me to follow a misconception until 05/23/2016 :

Apparently, a Huffman Code for the value 15 signals that a full-precision ‘big value’ is written following that Huffman Code, with a precision of 13 bits.

The crucial note to myself here is that the Entropy Encoding table is specifically the Huffman Coding Table, and that for this reason, integers greater than 15 could also be encoded. But by that point, we would have reached the point of diminishing returns. And more precisely, it is the Huffman Coding Table, modified to encode Pairs of integers, so that a maximum compression down to 12.5% becomes possible, instead of merely 25%. )
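The way I picture the decoding side of that, roughly (the bit-reading helper, the two-entry table and its codes below are invented purely for illustration; they do not reproduce any actual table from the standard):

```python
def decode_pair(read_bits, pair_table, linbits=13):
    """Decode one Huffman-coded pair of coefficients.  A decoded value of
    15 acts as an escape: the real 'big value' follows at full precision."""
    code = ''
    while code not in pair_table:          # read bits until a codeword matches
        code += read_bits(1)
    x, y = pair_table[code]
    if x == 15:                            # escape: full-precision value follows
        x = 15 + int(read_bits(linbits), 2)
    if y == 15:
        y = 15 + int(read_bits(linbits), 2)
    return x, y

# A toy bit-stream reader and a two-entry stand-in for a real pair table:
stream = iter('10' + '0000000000101' + '0')    # hypothetical bits
read_bits = lambda n: ''.join(next(stream) for _ in range(n))
pair_table = {'10': (15, 0), '0': (1, 1)}      # invented codes, purely illustrative
print(decode_pair(read_bits, pair_table))      # -> (20, 0)
```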

(Edit 06/06/2016 : ) It should be noted that the practice of Huffman-encoding pairs of values is really only advantageous if at least one of them is often equal to zero. Otherwise, it would work just as well to encode them individually.

(Edit 05/28/2016 : ) What strikes me as most plausible is that with MP3, initially the odd-numbered DCT coefficients are computed, to avoid missing out-of-phase sine-waves. But then, even-numbered coefficients may be derived from them, so that the stream can be decoded again efficiently. The even-numbered coefficients will have the property that they are 180 degrees out of phase between two 50%-overlapping sampling intervals / frames. This can make playback easier, in that the decoder only needs to keep track of even-numbered and odd-numbered frames / granules.
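Whatever the exact derivation, the property being described is that of a lapped transform: coefficients are computed over 50%-overlapping blocks, and the signal is reconstructed by windowed overlap-add. Here is a minimal sketch of that mechanism, using the textbook MDCT formulas and a sine window as my own choice of illustration, not the standard's exact hybrid filter-bank:

```python
import numpy as np

def mdct(block, N):
    """Forward transform of one 2N-sample block into N coefficients."""
    n = np.arange(2 * N)
    k = np.arange(N)
    basis = np.cos(np.pi / N * (n[None, :] + 0.5 + N / 2) * (k[:, None] + 0.5))
    return basis @ block

def imdct(coeffs, N):
    """Inverse transform of N coefficients back into 2N samples."""
    n = np.arange(2 * N)
    k = np.arange(N)
    basis = np.cos(np.pi / N * (n[:, None] + 0.5 + N / 2) * (k[None, :] + 0.5))
    return (2.0 / N) * (basis @ coeffs)

N = 576                                   # one granule's worth of coefficients
window = np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))
signal = np.random.randn(6 * N)

# Analyze 50%-overlapping blocks, then reconstruct by windowed overlap-add.
output = np.zeros_like(signal)
for start in range(0, len(signal) - 2 * N + 1, N):
    coeffs = mdct(window * signal[start:start + 2 * N], N)
    output[start:start + 2 * N] += window * imdct(coeffs, N)

# The middle of the signal (where blocks fully overlap) is reconstructed exactly.
middle = slice(N, len(signal) - N)
print("max reconstruction error:", np.max(np.abs(output[middle] - signal[middle])))
```

The aliasing terms that each individual block introduces cancel between neighbouring, 50%-overlapping blocks, which is consistent with the out-of-phase behaviour described above.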

Now, I would not say that people should never use MP3. It has its uses. But it also has drawbacks, which are usually correlated with the use that MP3 was originally designed to fill. It was designed for listening to music over limited, early Internet data-connections, and it may be just as useful for compressing speech, if the aim is to reduce the bit-rate strongly and to accept some level of information-loss.

At the bit-rates used today, it leaves the user with a sound quality superior to what the old tape cassettes offered, but inferior to what Raw CDs offered.

It was never really intended to encode movie sound-tracks, especially since those often involve ‘Surround Sound’, which MP3 generally does not capture. Yet, I can see myself using it to encode the audio portion of certain video-clips, if I know that those clips do not include surround sound. An example might be a rock concert, or some random clip I was experimenting with, for which my original production never even included any surround information.

There exist numerous alternatives to MP3, that are also available to ordinary users today.

Dirk

(Edit 05/24/2016 : ) There are some other idiosyncrasies in real MP3 compression, which I had noted at some earlier point in time, but which I had since forgotten:

One of them is that, because it is popular right now to refer to the ‘Discrete Fourier Transform’ as a “Fast Fourier Transform”, a DFT is actually computed in order to derive the psychoacoustic parameters. In this transform, there are 32 frequency sub-bands. But then the DCT gets used to actually compress the sound.

Another idiosyncrasy is that MP3 will use discrete transient detection, to replace one granule that had a length of 576 with 3 short blocks that have a length of 192, thus implying a new sampling interval of 384. This defines 4 window types to which any granule can belong: a ‘start’, a ‘normal’ and an ‘end’ type, as well as a ‘fast’ type. Each type has its own sampling window defined.
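A crude sketch of the kind of decision involved follows; the energy-ratio test and threshold are placeholders of my own, not the psychoacoustic model's actual transient criterion, and the labels simply reuse the names from the paragraph above:

```python
import numpy as np

def classify_granules(samples, granule=576, threshold=4.0):
    """Label each 576-sample granule.  A sudden jump in energy marks a
    transient: the preceding granule becomes a 'start' granule, the one
    containing the attack is 'fast' (i.e. would be replaced by 3 short
    blocks of 192), and the one after it becomes an 'end' granule."""
    labels = []
    previous_energy = None
    for i in range(0, len(samples) - granule + 1, granule):
        energy = float(np.sum(samples[i:i + granule] ** 2))
        if previous_energy is not None and energy > threshold * previous_energy:
            labels[-1] = 'start'
            labels.append('fast')
        else:
            labels.append('normal')
        previous_energy = energy
    # Return to long blocks via an 'end' granule after each 'fast' one.
    return ['end' if prev == 'fast' and cur == 'normal' else cur
            for prev, cur in zip(['normal'] + labels, labels)]

# A quiet passage followed by a sudden attack:
x = np.concatenate([0.01 * np.random.randn(576 * 3), np.random.randn(576 * 2)])
print(classify_granules(x))     # e.g. ['normal', 'normal', 'start', 'fast', 'end']
```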

(Edit 06/06/2016 : ) There was an interesting detail I read about, according to which the scale factor of each of the 22 encoded sub-bands is stored in the per-granule information, with the exception of the highest-frequency sub-band. Apparently, to have the encoder compute a scale factor for all the sub-bands would have implied that a balanced amount of information is to be allocated to each one.

However, the highest sub-band was thought by the designers to contain less-pleasant information than the others, which is not necessarily supposed to take up as many bits. Therefore, the decoder is expected to reuse the scale factor of the second-highest sub-band as the one for the highest.

The highest sub-band will then store many bits, if its amplitudes were quite high during encoding.
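For the decoder side, that reuse could look like the following sketch; the band boundaries and values are invented again, and only the line marked with the comment is the detail in question:

```python
import numpy as np

def apply_scale_factors(quantized, scale_factors, band_edges):
    """Rescale the integer coefficients of one granule.  The highest band has
    no stored scale-factor of its own; it reuses the one from the band below."""
    factors = list(scale_factors) + [scale_factors[-1]]   # reuse for the top band
    coeffs = np.array(quantized, dtype=float)
    for (lo, hi), factor in zip(zip(band_edges[:-1], band_edges[1:]), factors):
        coeffs[lo:hi] *= factor
    return coeffs

# 4 made-up bands, but only 3 stored scale-factors:
print(apply_scale_factors([1, 2, 0, 3, 1, 0, 2, 5], [0.5, 0.25, 0.125], [0, 2, 4, 6, 8]))
```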

Also, whether the Fourier Transform used to derive the psychoacoustic parameters is an ‘FFT’ or a ‘DFT’ is a decision left to the programmers of the codec, since this transform is not used to actually encode the granules. If a programmer wanted to use a DFT here, with 32 sub-bands of its own, then that programmer was recognizing the fact that today's CPUs have far more power than older ones did, and he was trying to improve the quality with which the granules are encoded.

By default, an FFT is used as the first transform, simply because doing so follows the general principle of trying to reduce the total number of computations needed by the encoder. Its purpose is to determine the audibility thresholds, according to which some of the coefficients of the DCT are set to zero, on the grounds that those should be inaudible.
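As a toy version of that idea only (the ‘threshold’ below is a naive fraction of the peak magnitude, nothing like a real psychoacoustic model, and the mapping of DCT coefficients onto FFT bins is equally rough):

```python
import numpy as np
from scipy.fft import dct, rfft

def zero_inaudible(block, fraction=0.01):
    """Use an FFT of the block to estimate which frequencies fall below a
    crude audibility threshold, then zero the corresponding DCT coefficients."""
    spectrum = np.abs(rfft(block))
    threshold = fraction * np.max(spectrum)
    coeffs = dct(block, type=2, norm='ortho')
    # Map each DCT coefficient index onto the nearest FFT bin (rough, illustrative).
    bins = np.linspace(0, len(spectrum) - 1, len(coeffs)).astype(int)
    coeffs[spectrum[bins] < threshold] = 0.0
    return coeffs

t = np.arange(576)
block = np.sin(2 * np.pi * 1000 * t / 44100) + 0.001 * np.random.randn(576)
kept = zero_inaudible(block)
print("non-zero coefficients:", int(np.count_nonzero(kept)), "of", len(kept))
```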

This was also why a ‘DCT’ was used for the actual sound information. That could also have been a DFT, but with the phase information later ignored…