Why the Temporal Resolution of MP3s is Poor.

I have spent a lot of my private time, thinking about lossy sound compression, and then, simplifying my ideas to something more likely to have been implemented in actual MP3 compression. In order to understand this concept, one needs to be familiar with the concept of Fourier Transforms. There are two varieties of them, which are important in sound compression: The “Discreet Fourier Transform” (‘DFT’), and the “Discreet Cosine Transform” (‘DCT’), the latter of which has several types again.

I did notice that the temporal resolution of MP3s I listen to is poor, and it was an important realization I finally came to, that this was not due to the actual length of the sampling window.

If we were to assume for the moment that the sampling interval was 1024 samples long – and for MP3, it is not – then to compute the DFT of that would produce 1024 frequency coefficients, from (0-1023 / 2) cycles / sampling interval. Each of these coefficients would be a complex number, and the whole set of them can be used to reconstruct the original sample-set accurately, by inverting the DFT. The inverse of the DFT is actually the DFT computation again, but with the imaginary (sine) component inverted (negated).

But, MP3s do not use the DFT, instead using the DCT, the main difference in which is, that the DCT does not record a complex number for each coefficient, rather just stating a real number, which would normally correspond only to the cosine function within the DFT… Admittedly, each of these absolute amplitudes may possibly be negated.

If the time-domain signal consisted of a 5 kHz wave, which was pulsating on and off 200 times per second – which would actually sound like buzzing to human ears – then the DCT would record a frequency component at 5kHz, but as long as they are not suppressed due to the psychoacoustic models used, would also record ‘sidebands’ at 4800 and at 5200 Hz, each of which has 1/2 the amplitude of the center frequency at 5 kHz. I know this, because for the center frequency to be turned on and off, it must be amplitude modulated, virtually. And so what has this time-domain representation, even though this happens faster than once per sampling window, also has a valid frequency-domain representation.

When this gets decoded, the coefficient-set will reproduce a sample-set, whose 5 kHz center frequency again seems to ‘buzz’ 200 times per second, due to the individual frequency components interfering constructively and then destructively, even though they are being applied equally across the entire sampling interval.

But because the coefficient-set was produced by a DCT, it has no accurate phase information. And so the exact time each 5 kHz burst has its maximum amplitude, will not correspond to the exact time it did before. This will only seem to correct itself once per frame. If the sampling interval was truly 1024 samples long, then a frame will recur every 512 samples, which is ~80 times per second.

Now the question could be asked, why it should not be possible to base lossy audio compression on the DFT instead. And the answer is that in principle, it would be possible to do so. Only, if each coefficient-set consisted of complex numbers, it would also become more difficult to compress the number of kbps kept in the stream, in an effective way. It would probably still not be possible to preserve the phase information perfectly.

And then as a side-note, this one hypothetical sample-set started out as consisting of real numbers. But with the DFT, the sample-set could carry complex numbers as easily as the coefficient-set did. If the coefficients were compressed-and-simplified, then the samples reproduced would probably end up being so, with complex values. In a case like this, the correct thing to do is to ignore the imaginary component, and only output the real component, as the decoded result…

When using a DCT to encode a stream of sound, which is supposed to be continuous, a vulgarization of the problem could be, that the stream contains ‘a sine wave instead of a cosine wave’, which would therefore get missed by all the sampling intervals, because only the product with the cosine function is being computed each time, for a specific coefficient. The solution that comes from the Math of the DCT itself is, that the phase of the unit vector generally rotates 90 degrees ~from each frame to the next~. To the best of my understanding, two sampling intervals will generally overlap by 50% in time, resulting in one frame half as long. It may be that the designers only compute the odd-numbered coefficients. Then, the same coefficient belonging to the next frame should be aware of this wave notwithstanding. Further, the sampling intervals are made to overlap when the stream is decoded again, such that a continuous wave can be reconstructed. ( :1 )

The only question I remain curious about, is whether a need exists when encoding with a DCT, to blend any given coefficient as belonging to two consecutive frames, the current one plus the previous one.

While it can be done, to use rectangular sampling windows for encoding, the results from that are likely to be offensive to listen to. So in practice, Blackman Windows should ideally be used for encoding (that are twice as long as a frame).

The choice of whether decoders should use a Hanning Window or a Linear Taper, can depend on what sort of situation should best be reproduced.

Decoding with a linear taper, will cause crescendos to seem maximally smooth, and perfectly so if the crescendo is linear. But considering that linear crescendos might be rare in real music, a Hanning Window will minimize the distortion that is generated, when a burst of sound is decoded, just as a Blackman Window was supposed to do when encoding. Only, a Blackman Window cannot be used to decode, because coefficients being constant from one frame to the next would result in non-constant (output) sample amplitudes.

Dirk

(Edit 05/18/2016 : ) One related fact should be acknowledged. The DCT can be used to reconstruct a phase-correct sample-set, if non-zero even-numbered as well as odd-numbered coefficients are included. This follows directly from the fact that a ‘Type 3′ DCT is the inverse of the ‘Type 2′. But, the compression method used by several codecs is such, that a psychoacoustic model suppresses coefficients, on the assumption that they should be inaudible, because they are too close spectrally, to stronger ones. This would almost certainly go into effect, between complementary even-numbered and odd-numbered DCT coefficients.

( 05/31/2016 : ) One detail which was not made clear to me, was whether instead, coefficients which are in the same sub-band as one that has a stronger peak, are merely quantized more, due to the scale-factor of that sub-band being higher, to capture this higher peak. This would strike me as favorable, but also results in greater bit-rates, than what would follow, from setting supposedly-inaudible coefficients to zero. Due to Huffman Encoding, the bit-length of a more-quantized coefficient, is still longer than that for the value (zero).

In any type of Fourier Transform, signal energy at one frequency cannot be separated fully from energy measured at a frequency different by only half a cycle per frame. When the difference is at least by one cycle per frame, energy and therefore amplitude become isolated. This does not mean however, that the presence of a number of coefficients equal to the number of samples, is always redundant.

And so, one good way to achieve some phase-correctness might be, to try designing a codec, which does not rely too strongly on the customary psychoacoustic models. For example, a hypothetical codec might rely on Quantization, followed by Exponential-Golomb Coding of the coefficients, being sure to state the scale of quantization in the header information of each frame.

It is understood that such approaches will produce ‘poorer results’ at a given bit-rate. But then, simply choosing a higher bit-rate (than what might be appropriate for an MP3) could result in better sound.

And then, just not to make our hypothetical codec too primitive, one could subdivide the audible spectrum into 8 bands, each one octave higher than the previous, starting from coefficient (8), so that each of these bands can be quantized by a different scale, according to the Threshold Of Audibility. These Human Loudness Perception Curves may be a simple form of psychoacoustics, but are also thought to be reliable fact, as perceived loudness does not usually correspond consistently to uniform spectral distribution of energy.

Parts of the spectrum could be quantized less, for which ‘the lower threshold of hearing’ is lower with respect to a calculable loudness value, at which the Human ears are thought to be uniformly sensitive to all frequencies.

Assigning such a header-field 8 times for each frame would not be prohibitive.

1: ) ( 05/31/2016 ) An alternative approach, which the designers of MP3 could just as easily have used, would have been first to compute the DCT, including both even- and odd-numbered coefficients F(k), but then to derive only the even-numbered coefficients from that. The best way would have been, for even numbered, derived coefficient G(k) to be found as

r = F(k)

i = F(k+1) – F(k-1)

G(0) = F(0)

G(k) = sqrt( r^2 + i^2 )