Both the Discreet Fourier Transform and the Fast Fourier Transform produce complex-numbered coefficients, the non-zero amplitudes of which will represent frequency components in the signal. They both produce a more accurate measure of this property of the signal, than the Discreet Cosine Transforms do.
Without getting into rigorous Math,
If we have a 1024-sample interval in the time-domain, then the DFT of that simply computes the coefficients from 0 through to 1023, half-cycles. A frequency component present at one coefficient, let us say an even-numbered coefficient, will also have a non-zero effect on the adjacent, odd-numbered coefficients, which can therefore not be separated fully, by a Fourier Transform that defines both sets. A DFT will generally compute them all.
An FFT has as a premise, a specific number of coefficients per octave. That number could be (1), but seldom actually is. In general, an FFT will at first compute (2 * n) coefficients over the full sampling interval, will then fold the interval, and will then compute another (n) coefficients, and will fold the interval again, until the highest-frequency coefficient approaches 1/2 the number of time-domain samples in the last computed interval.
This will cause the higher-octave coefficients to be more spread out and less numerous, but because they are also being computed for successively shorter sampling intervals, they also become less selective, so that all the signal energy is eventually accounted for.
Also, with an FFT, it is usually the coefficients which correspond to the even-numbered ones in the DFT which are computed, again because one frequency component from the signal does not need to be accounted for twice. Thus, whole-numbers of cycles per sampling interval are usually computed.
For example, if we start with a 1024-sample interval in the time-domain, we may decide that we want to achieve (n = 4) coefficients per octave. We therefore compute 8 over the full interval, including (F = 0) but excluding (F = 8). Then we fold the interval down to 512 samples, and compute the coefficients from (F = 4) through (F = 7).
A frequency component that completes the 1024-sample interval 8 times, will complete the 512-sample interval 4 times, so that the second set of coefficients continues where the first left off. And then again, for a twice-folded interval of 256 samples, we compute from (F = 4) through (F = 7)…
After we have folded our original sampling interval 6 times, we are left with a 16-sample interval, which forms the end of our series, because (F = 8) would fit in exactly, into 16 samples. And, true to form, we omit the last coefficient, as we did with the DFT.
210 = 1024
10 – 6 = 4
24 = 16
So we would end up with
(1 * 8) + (6 * 4) = 32 Coefficients .
And this type of symmetry seemed relevant in this earlier posting.
(Edit 11/17/2016 : ) The above synopsis should be regarded as penultimate. There are at least two major ways in which it is flawed:
1 ) Even though this observation might seem a bit wonky, I tend to think of this first. If every time we fold the sampling interval, we halve the frequencies, then it might not be necessary ever to compute the even-numbered coefficients, because after sufficient repetitions of folding, they should turn into odd-numbered ones. But in order for this to be useful, the coefficients computed after folding need to be as selective in units of frequency, as those were in the first, longest sampling interval.
To help with that, a type of folding can be used – to increase the accuracy of the cosine-function – in which the samples in the second half of the interval are reversed in order, before being added to those in the first half of the interval.
And, a corresponding algorithm is also possible, in which the samples are reversed and negated, to increase the accuracy with which the sine-function can be computed each time.
But using either method, the fact has not been changed that a sine-component will have a non-zero effect, on the adjacent cosine-components, and vice-versa…
2 ) The practical utility of this latter observation affects my assumption, that with MP3 compression, the sampling intervals are 1152 samples long, that they overlap 50%, and that they thus yield an output frame every 576 input samples. The most obvious question to ask then would be, ‘Why does each output frame not contain 1152 coefficients?’
I think that the first step which MP3 undertakes at data-reduction, is to halve the number of coefficients, into equivalents of even-numbered ones, that are reliably independent.
In principle, one might strive to derive 576 accurate coefficients, based on a 576-sample interval. This would double temporal accuracy, without compromising spectral accuracy.
I think that soon after MP3 compression, schemes were developed, that produce significant output corresponding to the odd-numbered coefficients, and which were based on the shorter sampling interval.
Such a scheme is inherently easier to encode than to decode, because when encoding, the odd components wander into and out of phase by 90° from one sampling interval to the next, again assuming that the intervals overlap 50%. This means that the output from 2 consecutive intervals needs to be combined, and that when decoding, a set of 4 possible ‘rotators’ will need to be applied, so that a continuous sine- or cosine-wave will be reconstructed.
As long as only the even numbers were used when encoding, decoding them either meant a 180° inversion from one frame to the next, or to keep them in-phase, at 0°.
(Edit 11/21/2016 : ) The fact that an effective cosine-function becomes a sine-function, and vice-versa, from one granule to the next, does not get lost by the Discreet Cosine Transform, because the DCT can also be inverted in such a way as to reproduce the sine-function, as long as both the odd and the even coefficients were preserved.
…However, doing so would also require that each non-zero coefficient be stored with its sign-bit preserved, and would require that twice the number of coefficients be stored over the same real time interval, than are stored in practice.
…Further, doing so would negate any advantages, that were associated with practical implementations of Joint Stereo, which many compression schemes, including MP3 and OGG Vorbis use.
In practice, each coefficient is stored as an absolute integer, which can be derived either by combining the coefficient-values over two granules and keeping the spectral resolution, or by simplifying the coefficient-values down to only the even-numbered ones, and doubling the temporal resolution.
MP3 happens to possess explicit transient detection, due to which that scheme will switch to frames which have 1/3 the length of a full-frame.
As an alternative, it is still possible to presume that high spectral resolution is only needed up to a specific frequency, and that pairs of granules will generally be combined into frames, in which frequencies higher than that one will have the faster temporal resolution, thus requiring no explicit transient detection.
And whenever I write ‘combine’, this does not mean to apply an average, or an average between two absolutes. To combine might mean
a = F(2k, t)
b = F(2k-1, t)
c = F(2k+1, t)
G(k, t) = sqrt( a2 + b2/2 + c2/2 )
Or, to combine might mean
a = F(k, 2t)
b = F(k, 2t+1)
G(k, t) = sqrt(a2 + b2)