In many of my earlier postings, I described what happens in MP3-compressed sound somewhat inaccurately. One reason is that an overview requires information to be combined from numerous sources. While earlier Wikipedia articles tended to be quite incomplete on this subject, more-recent coverage has become quite complete, yet still requires that users click deeper and deeper, into subjects such as the Type 4 Discrete Cosine Transform, the Modified Discrete Cosine Transform, and Polyphase Quadrature Filters.
What seems to happen with MP3 compression, which is formally known as MPEG-1 Audio Layer III (and which was later extended under MPEG-2), is that the Discrete Cosine Transform is not applied to the audio directly. Rather, the audio stream is first divided into 32 sub-bands, and the MDCT is applied to each sub-band individually.
After the coefficients are computed, a specific filter is applied to them, to reduce the aliasing that the PQF filter-bank introduced.
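The block sizes that come up later in this posting follow from that layout. Here is a minimal sketch of the arithmetic, assuming the long-block case of MPEG-1 Layer III (18 MDCT coefficients per sub-band, computed from a 36-sample window of sub-band samples). The variable names are only illustrative:

```python
# Frame arithmetic for MPEG-1 Layer III, long blocks only (an assumption;
# short blocks use three smaller transforms per sub-band instead).
SUBBANDS = 32                          # outputs of the polyphase filter-bank
COEFFS_PER_SUBBAND = 18                # MDCT coefficients per sub-band
MDCT_WINDOW = 2 * COEFFS_PER_SUBBAND   # 36 sub-band samples, 50% overlap

# Frequency-domain values produced per block, over all sub-bands.
coeffs_per_block = SUBBANDS * COEFFS_PER_SUBBAND

# Time-domain samples spanned by one MDCT window, ignoring the length
# of the filter-bank's own prototype filter.
window_span = SUBBANDS * MDCT_WINDOW

print(coeffs_per_block, window_span)   # 576 1152
```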
I cannot be sure that this was always how MP3 was implemented, because if we take into account the fact that with PQF, every second sub-band is frequency-inverted, we may be able to obtain equivalent results just by performing the Discrete Cosine Transform which is needed, directly on the audio. But apparently, there is some advantage in subdividing the spectrum into its 32 sub-bands first.
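The frequency inversion of every second sub-band can be seen from the aliasing arithmetic alone. The following is a minimal sketch, which assumes ideal (brick-wall) sub-bands and ignores the actual filter. The function name `position_after_decimation` is hypothetical:

```python
def position_after_decimation(band, position, bands=32):
    """Where a tone lands in the decimated sub-band's own spectrum.

    `position` runs from 0.0 (bottom edge of the sub-band) to 1.0 (top
    edge). The return value uses the same scale, measured after the
    sub-band has been decimated by a factor of `bands`.
    """
    # Frequency in cycles per sample; the range 0..0.5 holds `bands`
    # sub-bands, each 1 / (2 * bands) wide.
    freq = (band + position) / (2 * bands)
    # Keeping every `bands`-th sample multiplies the frequency by `bands`,
    # and whatever exceeds the new Nyquist limit folds back.
    aliased = (freq * bands) % 1.0
    if aliased > 0.5:
        aliased = 1.0 - aliased
    return aliased / 0.5

# A tone one quarter of the way up each of the first four sub-bands.
for band in (0, 1, 2, 3):
    print(band, round(position_after_decimation(band, 0.25), 6))
```

Under these assumptions, the even-numbered sub-bands keep the tone at 0.25 of the way up, while the odd-numbered ones report it at 0.75, which is the mirror image.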
One advantage could be, that doing so reduces the amount of computation required. Another advantage could be the reduction of round-off errors. Computing many smaller Fourier Transforms, instead of one large one, generally accomplishes both.
Also, if the spectrum is first subdivided in this way, it becomes easier to extract the parameters from each sub-band, that will determine how best to quantize its coefficients, or to cull ones either deemed to be inaudible, or aliased artifacts.
I suppose a suitable question to ask might be, ‘Why did the designers of MP3 not choose to use a Quadrature Mirror Filter, to obtain its 32 sub-bands?’
The QMF would have two wavelets defined, one acting as a low-pass filter, and the other acting as the so-called “band-pass filter”, which I tend to refer to as the ‘high-pass filter’. These two wavelets will attenuate the frequency-components originally present at their crossover. While both systems are critically-sampled, the QMF will produce less aliasing.
But there is no word that a QMF can be applied successfully, 5 stages deep, to arrive at 32 sub-bands. There is only word they can be applied 2 stages deep, to arrive at 4 sub-bands. At 32 sub-bands, the main problem may become signal attenuation.
OTOH, Polyphase Quadrature Filters only have one wavelet defined, as their low-pass component. Their high-pass component seems to arise as a difference between what was introduced into the low-pass component, and ‘every second sample of the original signal’. Hence, this sort of filter requires less computation, and also tends not to attenuate any part of the spectrum in total.
However, the main drawback with PQF remains, that this leads to aliasing, for which MP3 specifically has its reduction algorithm.
Further, I feel that the windowing function deserves mention, specifically because there is none. It is my own assumption, that if a Fourier Transform of any kind is computed on a block of samples, let us say with 576 samples, the transform will also have 576 coefficients, in its pure form. But, if a continuous stream is to undergo the transformation, a 50% overlap will be applied, between sampling windows. Hence, 1152-sample windows are possible, each of which produces 1152 coefficients.
In order to preserve the spectral quality as much as possible, I would say that a Blackman Window should be applied. Yet, to reproduce the time-domain equivalent, the Blackman Window is not suitable, because at a 50% overlap, successive windows would not add up to a constant amplitude. Therefore, for playback, I would suggest a Hann Window (often called the ‘Hanning Window’).
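The constant-amplitude point can be checked numerically. This is a minimal sketch, which assumes the periodic form of both windows (the cosine argument is divided by the window length, not by the length minus one) and a 50% overlap. The helper name `overlap_sum` is hypothetical:

```python
import math

def hann(n, length):
    return 0.5 - 0.5 * math.cos(2 * math.pi * n / length)

def blackman(n, length):
    x = 2 * math.pi * n / length
    return 0.42 - 0.5 * math.cos(x) + 0.08 * math.cos(2 * x)

def overlap_sum(window, length):
    """Sum of two copies of `window`, offset by half its length."""
    half = length // 2
    return [window(n, length) + window(n + half, length) for n in range(half)]

LENGTH = 1152
hann_sum = overlap_sum(hann, LENGTH)
blackman_sum = overlap_sum(blackman, LENGTH)
print(min(hann_sum), max(hann_sum))          # both very close to 1.0
print(min(blackman_sum), max(blackman_sum))  # ripples between about 0.68 and 1.0
```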
(Edit 08/07/2017 : )
Apparently, the designers of MP3-compressed sound have opted for a kind of Modified Discrete Cosine Transform, which corresponds to a Type 4, which means a half-sample shift on both the time-domain and the frequency-domain sample-functions. The MDCT aims to double the number of time-domain samples, with respect to how many frequency-domain samples there are, so that each time, a sampling window of 1152 time-domain samples produces a frequency-domain block, of 576 samples. The time-domain samples are meant to overlap 50%.
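For reference, this is the MDCT as it is commonly written, with 2N time-domain samples going in and N coefficients coming out. The symbols N, x_n and X_k are notation assumed here, not taken from the MP3 standard:

```latex
X_k = \sum_{n=0}^{2N-1} x_n \cos\left[ \frac{\pi}{N} \left( n + \frac{1}{2} + \frac{N}{2} \right) \left( k + \frac{1}{2} \right) \right], \qquad k = 0, \dots, N-1
```

With N = 576, this takes 1152 samples in and gives 576 coefficients out. The two half-sample shifts are the terms n + 1/2 and k + 1/2.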
Well, if we simply extend the Discrete Cosine Transform over 2x as many time-domain samples as were originally used to define it, then we are also doubling all the frequencies. And then each frequency-coefficient, which started out with a half-sample shift, corresponds to an odd coefficient, of a Type 2 transform, with 1152 time-domain samples.
And so what the designers of Frequency-Domain-Compressed sound have done, was initially to accept the fact that the same coefficient expresses base-vectors that are orthogonal from one sampling window to the next, so that a hypothetical series of amplitudes such as ( +1, 0, -1, 0, +1 ) could form, to represent a continuous sine-wave. ( :1 )
But then, what the developers did next, was simply to resynthesize the time-domain stream, by allowing the resulting 1152-sample windows to overlap again, and by adding their values. This apparently causes errors to cancel out, and in the hypothetical example of my previous paragraph, causes a sine-wave to reform, out of the non-zero 1st, 3rd and 5th sampling-intervals – a sine-wave that would be intact.
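The hypothetical series of amplitudes, and the overlap-and-add step, can be tried out on a small scale. This is a minimal sketch under stated assumptions: a block size of N = 8 instead of 576, no windowing function, a 1/N scale factor on the inverse transform, and a test tone chosen to coincide with one base-vector. The names `basis`, `mdct` and `imdct` are my own, not those of any library:

```python
import math

N = 8    # coefficients per block; each block spans 2 * N samples
K0 = 0   # the coefficient whose base-vector the test tone follows

def basis(n, k):
    return math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))

def mdct(block):
    """2 * N samples in, N coefficients out (no windowing)."""
    return [sum(block[n] * basis(n, k) for n in range(2 * N))
            for k in range(N)]

def imdct(coeffs):
    """N coefficients in, 2 * N samples out, still containing aliasing."""
    return [sum(coeffs[k] * basis(n, k) for k in range(N)) / N
            for n in range(2 * N)]

# A tone that coincides with base-vector K0 in the first window.
signal = [basis(n, K0) for n in range(6 * N)]

# Five windows, each advanced by N samples (50% overlap).
spectra = [mdct(signal[m * N:m * N + 2 * N]) for m in range(5)]
amplitudes = [round(s[K0] / N, 6) + 0.0 for s in spectra]
print(amplitudes)  # close to +1, 0, -1, 0, +1

# Overlap-add the inverse transforms.
output = [0.0] * len(signal)
for m, spectrum in enumerate(spectra):
    for n, value in enumerate(imdct(spectrum)):
        output[m * N + n] += value

# In general the first and last N samples are covered by only one
# window, so the comparison is limited to the interior.
interior_error = max(abs(output[n] - signal[n]) for n in range(N, 5 * N))
print(interior_error)  # on the order of floating-point round-off
```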
My own failure to predict this solution, stems from my own premise, that some sort of windowing-function must be applied, to smooth any transitions from one sampling-window to the next, when decoding. The standard sampling window-functions from signal analysis – based on the cosine function – will destroy the effectiveness of this sort of “lapped transform”, and so my mind analyzed no further.
But there exist sampling windows not based on the cosine function, but rather on the sine function, which provide what the standard windowing functions provide, and which are compatible with lapped transforms.
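The property that makes a window compatible with the lapped transform is usually called the Princen-Bradley condition: the squares of the two overlapping window halves must add up to 1. A minimal check for the sine window follows, assuming that the same window is applied once before the MDCT and once after the IMDCT. The name `sine_window` is my own:

```python
import math

N = 576  # hop size; each window spans 2 * N samples

def sine_window(n):
    return math.sin(math.pi * (n + 0.5) / (2 * N))

# The window is applied twice (once when encoding, once when decoding),
# so the condition for the overlap to add up to unity is on its square.
worst = max(abs(sine_window(n) ** 2 + sine_window(n + N) ** 2 - 1.0)
            for n in range(N))
print(worst)  # on the order of floating-point round-off
```

Since the square of this window is 0.5 - 0.5 cos(π (n + 0.5) / N), the combined effect of applying it twice has the shape of a raised-cosine window.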
So the conclusion is, that the lapped transform is not only more efficient than standard cosine transforms, by providing 576 frequency-domain samples for every 576 time-domain samples, but also invertible after all, so that potentially, even the phase-position of the sine-waves would be correct – i.e., implying mathematically-correct inversion.
However, whether in the resynthesized audio-stream, the phase-information is ultimately correct, depends on the degree to which the frequency-coefficients have either been quantized or culled. Missing coefficients, that have been reset to zero, seem to be the main enemy of accurate signal-reconstruction here.
1: ) There is a reason for which practical examples of signals do not line up that clearly, which has to do with the frequencies not being exactly equal to the frequencies corresponding to the coefficients. In that case, phase-positions will follow which also do not align perfectly to 0°, 90°, 180° … positions, as seen in the local time-frame of any one sampling-window.
But in this more-general case, the reproduced waves from applying the IMDCT, will also appear as components which are 90° out-of-phase between odd and even windows, and which when added, can reproduce a desired, arbitrary phase-position.
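The identity behind this is the usual decomposition of a phase-shifted wave into two components in quadrature. The symbols A, ω and φ are notation assumed here:

```latex
A \cos(\omega t + \varphi) = A \cos\varphi \, \cos(\omega t) - A \sin\varphi \, \sin(\omega t)
```

Roughly speaking, cos(ωt) plays the part of the component carried by one set of windows, and sin(ωt) the part carried by the other, with the two amplitudes A cos φ and A sin φ setting the phase-position of their sum.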
2: ) Any windowing function will imply, that for an instant in time, the decoded output-stream is being defined entirely by either the odd or the even sampling-interval, and not by the other. For this reason, they’ll all cause some amount of phase-distortion. Their underlying premise seems to be an input sine-wave at 45° to the base-function of the cosine transform, which should maintain a constant amplitude.
The OGG Vorbis codec seems to apply a more-complex windowing function than the rest. Compared with the plain sine window, it stays close to zero for longer near the edges of the window, and close to full amplitude for longer near the centre, which leaves a shorter run over which both sets of sampling intervals contribute strongly to the decoded output.
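The shape of the Vorbis window can be read off directly by tabulating it next to the sine window. This is a minimal sketch, using the window formula as the Vorbis specification gives it, a small hop size of N = 8 for readability, and function names of my own:

```python
import math

N = 8  # hop size; each window spans 2 * N samples

def sine_window(n):
    return math.sin(math.pi * (n + 0.5) / (2 * N))

def vorbis_window(n):
    inner = math.sin(math.pi * (n + 0.5) / (2 * N)) ** 2
    return math.sin(math.pi / 2 * inner)

# Rising half of each window, side by side.
for n in range(N):
    print(n, round(sine_window(n), 4), round(vorbis_window(n), 4))

# Like the sine window, the squares of the two overlapping halves add to 1.
worst = max(abs(vorbis_window(n) ** 2 + vorbis_window(n + N) ** 2 - 1.0)
            for n in range(N))
print(worst)  # on the order of floating-point round-off
```

In the first half of the rise, the Vorbis values are smaller than the sine values, and in the second half they are larger, so the crossover between the two overlapping windows is steeper.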
Having said that, I recognize that most Sound Experts are more concerned, with the unintended frequency-components that result, because a continuous input stream is being amplitude-modulated into pulses, and with how to minimize those. Again, a sine-squared pulse seems to minimize those better, than most pulse-envelopes.