There is a fundamental tradeoff which takes place, when we use some sort of Fourier Transform, to help compress a sound stream, in a way that is already known to be lossy. Higher spectral resolution requires longer sampling intervals, which also imply poorer temporal resolution. Higher temporal resolution requires shorter sampling intervals, which also imply poorer spectral resolution.
I believe that one way in which the ear can outperform this limitation, is in having its cilia work in a massively parallel way. I think that our ears also have poor temporal resolution at the lower frequencies, but that our Human temporal resolution improves at higher frequencies.
And so one way in which I think that sound could be compressed, would be not to stick to one length of sampling interval.
For example, it might be possible to have a longest sampling window, 2048 samples long. Even-numbered coefficients could be computed for it using a Modified Discreet Cosine Transform, which range from 0 to 23 cycles / window. After that, the same interval of time could be subdivided into shorter sampling windows, each only 1024 samples long, and coefficients could be computed from them, which go from 12 to 23 cycles / window, thus completing 3 granules.
4 more ‘octaves’ should be possible, with sampling window lengths of 512, 256, 128 and 64 samples. Most of them would derive coefficients from 12 to 23 cycles / window again, with the exception of the 64-sample windows, which would derive from 12 to 31 cycles / window.
I would maintain the assumption, that from each length of sampling, a granule would result which is half as long, and for which coefficients would be stored.
This should result in 6 ‘octaves’ in total, each of which would have its own scale factor, stored once per frame interval (1024 samples), corresponding to the slowest granule. To simplify computing this scale factor, a global quality level could simply decide how many integers all the coefficients should be quantized to. For each ‘octave’, the peak amplitude within all the granules would be taken, and divided by this quality level, to arrive at the scale factor.
Each frame would store 63 granules, the 32 of which with the highest frequencies, would have 20 coefficients each, 30 of which would have 12 coefficients, and the longest granule of which would have 24 coefficients. This would result in 1024 coefficients / frame, in a fixed order.
To reduce waste, the scale factor of the highest, 6th octave, could simply be the same as that of the previous, 5th octave, as long as using that one yields lower quantized integers than the global quality level.
The resulting, quantized amplitudes could again, be encoded in a variable-length scheme, such as Exponential-Golomb, optionally plus a sign-bit.
One adverse side-effect of this would be, the complex and tedious computation of the scale factors. I do not assume that I would be using any Fast Fourier Transform, to determine audibility thresholds, and to set many of the DCT coefficients to zero, the way it is done with MP3. Then, it would make most sense to determine the scale factors from DCT values very closely analogous to how they are encoded.
The problems start, with the fact that each sampling interval is assumed to have a windowing function, when encoding. This turns into a major CPU load, once a scale factor needs to be computed 32×20 times per frame.
So one simplification I could offer, would be to begin by computing and temporarily storing all the DCT coefficients as 15-bit values, with the mere notion that they will later be quantized, but that a maximum value for them is kept up-to-date, once per ‘octave’ as defined above. After that, the scale factor can be computed from this maximum
(Edit 06/07/2016 : ) This hypothetical scheme has a major drawback as it stands. Even though it will inherently detect and bracket transients, it would also have poor recovery from transients. After and before a transient, the above method will remain insensitive to sounds in the same octave, for up to 1024 samples in the case of a 44.1 kHz format, which translates into 25 milliseconds. In my opinion, the human ear can detect this as a ‘sound shadow’.
MP3 recovers from transients within 576 samples.
One way to correct this could be, to arrange for not one but two scale factors to be encoded for each octave, except for the lowest octave. The first scale factor would apply to the first half of the frame, while the second scale factor would apply to the second half.
In principle this idea could be extended, all the way until there is a separate scale factor for each granule, with the ostensible exception of the shortest, highest-frequency granules / octave… But then doing so would also imply the intent, of allocating a uniform number of bits / a uniform amount of information, to each granule, knowing that their number doubles temporally with each octave. This would not be, what I would want compressed sound to do.