An Update about MP3-Compressed Sound

In many of my earlier postings, I stated what happens in MP3-compressed sound somewhat inaccurately. One reason is that putting together an overview requires combining information from numerous sources. While earlier Wikipedia articles tended to be quite incomplete on this subject, the more recent Wikipedia coverage has become quite thorough, yet it still requires that users click deeper and deeper, into subjects such as the Type 4 Discrete Cosine Transform, the Modified Discrete Cosine Transform, and Polyphase Quadrature Filters.

What seems to happen with MP3 compression (formally, MPEG-1 or MPEG-2 Audio Layer III) is that the Discrete Cosine Transform is not applied to the audio directly; rather, the audio stream is first divided into 32 sub-bands, and the MDCT is applied to each sub-band individually.

Actually, after the coefficients are computed, a specific filter is applied to them, to reduce the aliasing that was introduced by the PQF filter bank itself.

I cannot be sure that this was always how MP3 was implemented, because, taking into account the fact that with a PQF, every second sub-band is frequency-inverted, equivalent results might be obtainable just by performing the required Discrete Cosine Transform directly on the audio. But apparently, there is some advantage in subdividing the spectrum into its 32 sub-bands first.

One advantage could be, that doing so reduces the amount of computation required. Another advantage could be the reduction of round-off errors. Computing many smaller Fourier Transforms has generally accomplished both.

Also, if the spectrum is first subdivided in this way, it becomes easier to extract, from each sub-band, the parameters that determine how best to quantize its coefficients, or to cull the ones deemed either inaudible or aliased artifacts.
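
If it helps to make that structure concrete, here is a minimal Python sketch of the two-stage idea. The brick-wall band split below is only a crude stand-in for the real Polyphase Quadrature Filter bank, the MDCT is written out directly from its definition, and the block sizes are chosen to mimic one MP3 granule (32 sub-bands × 18 coefficients = 576 frequency lines); none of this is meant as the actual MP3 code path.

import numpy as np

def mdct(block):
    # MDCT of one block of 2*N samples -> N coefficients, written out from its definition.
    two_n = len(block)
    N = two_n // 2
    n = np.arange(two_n)
    k = np.arange(N)
    basis = np.cos(np.pi / N * (n[None, :] + 0.5 + N / 2) * (k[:, None] + 0.5))
    return basis @ block

def crude_subband_split(x, bands=32):
    # Brick-wall band split plus decimation; only a stand-in for the real PQF.
    X = np.fft.rfft(x)
    edges = np.linspace(0, len(X), bands + 1).astype(int)
    subbands = []
    for b in range(bands):
        Xb = np.zeros_like(X)
        Xb[edges[b]:edges[b + 1]] = X[edges[b]:edges[b + 1]]
        subbands.append(np.fft.irfft(Xb, len(x))[::bands])  # critically decimate by 32
    return subbands

fs = 44100                                    # hypothetical sample rate
t = np.arange(1152) / fs                      # 1152 samples = 36 per sub-band
x = np.sin(2 * np.pi * 5000 * t)              # a 5 kHz test tone

subbands = crude_subband_split(x, 32)
coeffs = [mdct(s) for s in subbands]          # 36 sub-band samples -> 18 MDCT lines each
print(len(coeffs), coeffs[0].shape)           # 32 sub-bands, (18,) coefficients per sub-band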


Why the Temporal Resolution of MP3s is Poor.

I have spent a lot of my private time thinking about lossy sound compression, and then simplifying my ideas to something more likely to have been implemented in actual MP3 compression. In order to understand this subject, one needs to be familiar with the concept of Fourier Transforms. There are two varieties of them which are important in sound compression: the “Discrete Fourier Transform” (‘DFT’), and the “Discrete Cosine Transform” (‘DCT’), the latter of which again has several types.

I did notice that the temporal resolution of the MP3s I listen to is poor, and the important realization I finally came to was that this is not due to the actual length of the sampling window.

If we were to assume for the moment that the sampling interval was 1024 samples long (for MP3, it is not), then computing the DFT of that interval would produce 1024 frequency coefficients; for real-valued input, the independent ones cover frequencies from 0 to 512 cycles per sampling interval. Each of these coefficients is a complex number, and the whole set of them can be used to reconstruct the original sample-set exactly, by inverting the DFT. The inverse of the DFT is actually the DFT computation again, but with the imaginary (sine) component inverted (negated), and scaled by 1/N.
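
As a quick numerical check of that last statement, a sketch using NumPy (the random sample-set is just a stand-in for audio):

import numpy as np

N = 1024
x = np.random.randn(N)                   # a hypothetical 1024-sample interval
X = np.fft.fft(x)                        # forward DFT: 1024 complex coefficients

# The forward DFT multiplies by exp(-2j*pi*k*n/N); negating the sine (imaginary)
# part of that kernel gives exp(+2j*pi*k*n/N), which inverts the transform up to 1/N.
n = np.arange(N)
kernel = np.exp(+2j * np.pi * np.outer(n, n) / N)
x_rec = (kernel @ X).real / N
print(np.allclose(x, x_rec))             # True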

But MP3s do not use the DFT, instead using the DCT, the main difference being that the DCT does not record a complex number for each coefficient, but only a real number, which would normally correspond to just the cosine function within the DFT… Admittedly, each of these absolute amplitudes may possibly be negated.

If the time-domain signal consisted of a 5 kHz wave, pulsating on and off 200 times per second (which would actually sound like buzzing to human ears), then the DCT would record a frequency component at 5 kHz, and, as long as they are not suppressed by the psychoacoustic model used, would also record ‘sidebands’ at 4800 Hz and at 5200 Hz, each of which has half the amplitude of the center frequency at 5 kHz. I know this because, for the center frequency to be turned on and off, it must effectively be amplitude-modulated. And so whatever has this time-domain representation also has a valid frequency-domain representation, even though the modulation happens faster than once per sampling window.
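
To make the arithmetic behind those sidebands explicit: assuming, for simplicity, a raised-cosine on/off envelope at 200 Hz, the product (1 + cos(2π·200t)) · cos(2π·5000t) expands to cos(2π·5000t) + 0.5·cos(2π·4800t) + 0.5·cos(2π·5200t). A quick numerical check in Python (the frequencies below are scaled so that everything lands on exact DFT bins):

import numpy as np

n = np.arange(1024)
carrier = np.cos(2 * np.pi * 160 * n / 1024)              # stands in for the 5 kHz tone
envelope = 0.5 * (1 + np.cos(2 * np.pi * 8 * n / 1024))   # raised-cosine on/off modulation
X = np.abs(np.fft.rfft(carrier * envelope)) / 512          # amplitude spectrum
print(np.round(X[[152, 160, 168]], 3))                     # [0.25 0.5 0.25]: sidebands at half the carrier amplitude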

When this gets decoded, the coefficient-set will reproduce a sample-set whose 5 kHz center frequency again seems to ‘buzz’ 200 times per second, due to the individual frequency components interfering constructively and then destructively, even though each of them is applied evenly across the entire sampling interval.

But because the coefficient-set was produced by a DCT, it has no accurate phase information. And so the exact time at which each 5 kHz burst has its maximum amplitude will not correspond to the exact time it did before. This will only seem to correct itself once per frame. If the sampling interval was truly 1024 samples long, then a frame would recur every 512 samples, which at a 44.1 kHz sample rate is roughly 86 times per second.

Now the question could be asked, why it should not be possible to base lossy audio compression on the DFT instead. And the answer is that, in principle, it would be possible to do so. Only, if each coefficient-set consisted of complex numbers, it would also become more difficult to compress the stream to a given number of kbps effectively. And it would probably still not be possible to preserve the phase information perfectly.

And then, as a side-note: this one hypothetical sample-set started out consisting of real numbers. But with the DFT, the sample-set could carry complex numbers as easily as the coefficient-set does. If the coefficients were compressed and simplified, then the reproduced samples would probably end up having complex values as well. In a case like this, the correct thing to do is to ignore the imaginary component, and to output only the real component as the decoded result…

When using a DCT to encode a stream of sound which is supposed to be continuous, a simplified statement of the problem could be that the stream contains ‘a sine wave instead of a cosine wave’, which would therefore get missed by all the sampling intervals, because for any specific coefficient, only the product with the cosine function is computed each time. The solution that comes from the Math of the DCT itself is that the phase of the unit vector generally rotates 90 degrees from each frame to the next. To the best of my understanding, two sampling intervals will generally overlap by 50% in time, resulting in frames half as long. It may be that the designers only compute the odd-numbered coefficients. Even so, the same coefficient belonging to the next frame should pick up this wave. Further, the sampling intervals are made to overlap again when the stream is decoded, such that a continuous wave can be reconstructed. ( :1 )

The only question I remain curious about is whether, when encoding with a DCT, a need exists to blend any given coefficient between two consecutive frames, the current one and the previous one.

While rectangular sampling windows can be used for encoding, the results are likely to be unpleasant to listen to. So in practice, Blackman Windows (twice as long as a frame) should ideally be used for encoding.

The choice of whether decoders should use a Hanning Window or a Linear Taper can depend on what sort of situation should best be reproduced.

Decoding with a linear taper will cause crescendos to seem maximally smooth, and perfectly so if the crescendo is linear. But considering that linear crescendos might be rare in real music, a Hanning Window will minimize the distortion generated when a burst of sound is decoded, just as a Blackman Window was supposed to do when encoding. Only, a Blackman Window cannot be used to decode, because coefficients being constant from one frame to the next would result in non-constant (output) sample amplitudes.
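
That last point is easy to check numerically: at 50% overlap, shifted copies of a Hanning window sum to a constant, while shifted copies of a Blackman window do not. A minimal Python sketch (the window length is an arbitrary assumption):

import numpy as np

N = 1024                                            # hypothetical window length
n = np.arange(N)
hann = 0.5 - 0.5 * np.cos(2 * np.pi * n / N)        # periodic Hanning window
blackman = 0.42 - 0.5 * np.cos(2 * np.pi * n / N) + 0.08 * np.cos(4 * np.pi * n / N)

def overlap_sum(w):
    # Add the window to a copy of itself shifted by half its length (50% overlap).
    return w[:N // 2] + w[N // 2:]

print(np.ptp(overlap_sum(hann)))       # ~0: Hanning windows at 50% overlap sum to a constant
print(np.ptp(overlap_sum(blackman)))   # ~0.32: Blackman windows do not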

Dirk

(Edit 05/18/2016 : ) One related fact should be acknowledged. The DCT can be used to reconstruct a phase-correct sample-set, if non-zero even-numbered as well as odd-numbered coefficients are included. This follows directly from the fact that a ‘Type 3’ DCT is the inverse of the ‘Type 2’. But the compression method used by several codecs is such that a psychoacoustic model suppresses coefficients, on the assumption that they should be inaudible because they are too close, spectrally, to stronger ones. This would almost certainly go into effect between complementary even-numbered and odd-numbered DCT coefficients.
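
A small numerical check of that inverse relationship, using SciPy's DCT routines (the test signal is an arbitrary choice):

import numpy as np
from scipy.fft import dct, idct

n = np.arange(1024)
x = np.sin(2 * np.pi * 5 * n / 1024 + 0.7)    # a sine with an arbitrary phase offset
X = dct(x, type=2, norm='ortho')              # forward Type 2 DCT
x_rec = idct(X, type=2, norm='ortho')         # its inverse, i.e. a Type 3 DCT
print(np.allclose(x, x_rec))                  # True: phase is preserved when all coefficients are kept

Suppressing some of those coefficients before the inverse transform is exactly what destroys this property.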

( 05/31/2016 : ) One detail which was not made clear to me was whether, instead, coefficients which are in the same sub-band as one that has a stronger peak are merely quantized more coarsely, due to the scale-factor of that sub-band being higher in order to capture the higher peak. This would strike me as favorable, but it also results in greater bit-rates than what would follow from setting supposedly-inaudible coefficients to zero. Due to Huffman Encoding, the bit-length of a more coarsely quantized coefficient is still longer than that for the value zero.

In any type of Fourier Transform, signal energy at one frequency cannot be separated fully from energy measured at a frequency that differs by only half a cycle per frame. When the difference is at least one cycle per frame, the energies, and therefore the amplitudes, become isolated. This does not mean, however, that having a number of coefficients equal to the number of samples is always redundant.
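
A rough illustration of this, using the DFT of an arbitrary 1024-sample frame: a tone at a whole number of cycles per frame lands in a single coefficient, while a tone offset from it by half a cycle smears across its neighbours.

import numpy as np

n = np.arange(1024)
for cycles in (10.0, 10.5, 11.0):        # frequency, in cycles per 1024-sample frame
    X = np.abs(np.fft.rfft(np.cos(2 * np.pi * cycles * n / 1024)))
    print(cycles, np.round(X[8:14], 1))  # coefficients 8..13: isolated for 10.0 and 11.0, smeared for 10.5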

And so, one good way to achieve some phase-correctness might be to try designing a codec which does not rely too strongly on the customary psychoacoustic models. For example, a hypothetical codec might rely on quantization, followed by Exponential-Golomb coding of the coefficients, being sure to state the scale of quantization in the header information of each frame.
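
Exponential-Golomb coding itself is simple to write down; a minimal Python sketch follows (the signed-to-unsigned mapping is one common convention, not something taken from any particular codec):

def exp_golomb_unsigned(v):
    # Order-0 Exponential-Golomb code for a non-negative integer.
    v += 1
    return '0' * (v.bit_length() - 1) + format(v, 'b')

def exp_golomb_signed(q):
    # Map a signed quantized coefficient onto the non-negative integers first.
    return exp_golomb_unsigned(2 * q - 1 if q > 0 else -2 * q)

for q in (0, 1, -1, 2, -2, 5):           # a few hypothetical quantized coefficients
    print(q, exp_golomb_signed(q))

Values near zero get the shortest codes, which suits a coefficient-set in which most quantized values are small.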

It is understood that such approaches will produce ‘poorer results’ at a given bit-rate. But then, simply choosing a higher bit-rate (than what might be appropriate for an MP3) could result in better sound.

And then, just so as not to make our hypothetical codec too primitive, one could subdivide the audible spectrum into 8 bands, each one octave higher than the previous, starting from coefficient (8), so that each of these bands can be quantized by a different scale, according to the Threshold Of Audibility. These Human loudness-perception curves may be a simple form of psychoacoustics, but they are also thought to be reliable fact, since perceived loudness does not correspond consistently to a uniform spectral distribution of energy.

Parts of the spectrum for which ‘the lower threshold of hearing’ is lower, relative to some calculable loudness value at which the Human ear is thought to be uniformly sensitive to all frequencies, could be quantized less coarsely.

Assigning such a header-field 8 times for each frame would not be prohibitive.
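
A minimal sketch of that band layout and of per-band quantization, with placeholder scale factors standing in for values that would really be derived from the loudness curves and stated in each frame's header:

import numpy as np

def octave_bands(first=8, count=8):
    # Band b covers coefficient indices [first * 2**b, first * 2**(b+1)).
    return [(first * 2 ** b, first * 2 ** (b + 1)) for b in range(count)]

scales = np.array([1.0, 1.0, 0.5, 0.25, 0.25, 0.5, 1.0, 2.0])   # placeholder per-band scale factors
coeffs = np.random.randn(2048)              # a hypothetical frame of 2048 coefficients

quantized = coeffs.copy()                   # coefficients 0..7 are left untouched in this sketch
for (lo, hi), s in zip(octave_bands(), scales):
    quantized[lo:hi] = np.round(coeffs[lo:hi] / s)   # coarser quantization where s is larger

print(octave_bands())                       # [(8, 16), (16, 32), ..., (1024, 2048)]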

1: ) ( 05/31/2016 ) An alternative approach, which the designers of MP3 could just as easily have used, would have been first to compute the DCT, including both even- and odd-numbered coefficients F(k), but then to derive only the even-numbered coefficients from that. The best way would have been for each even-numbered, derived coefficient G(k) to be found as:

r = F(k)

i = F(k+1) - F(k-1)

G(0) = F(0)

G(k) = sqrt( r^2 + i^2 )
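
Written out as a small Python sketch of that derivation (the function name and the packing of the result into an array are my own assumptions):

import numpy as np

def derive_even_coefficients(F):
    # F: the full set of N DCT coefficients F(0)..F(N-1).
    # Returns the derived even-numbered coefficients G, following the formula above.
    N = len(F)
    G = np.zeros(N // 2)
    G[0] = F[0]
    for j in range(1, N // 2):
        k = 2 * j
        r = F[k]
        i = F[k + 1] - F[k - 1]
        G[j] = np.sqrt(r * r + i * i)
    return G

# Example: G = derive_even_coefficients(np.random.randn(1024))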