A Possible Path to High-Resolution, Compressed Sound

There is a fundamental tradeoff which takes place, when we use some sort of Fourier Transform, to help compress a sound stream, in a way that is already known to be lossy. Higher spectral resolution requires longer sampling intervals, which also imply poorer temporal resolution. Higher temporal resolution requires shorter sampling intervals, which also imply poorer spectral resolution.

I believe that one way in which the ear can outperform this limitation, is in having its cilia work in a massively parallel way. I think that our ears also have poor temporal resolution at the lower frequencies, but that our Human temporal resolution improves at higher frequencies.

And so one way in which I think that sound could be compressed, would be not to stick to one length of sampling interval.

For example, it might be possible to have a longest sampling window, 2048 samples long. Even-numbered coefficients could be computed for it using a Modified Discreet Cosine Transform, which range from 0 to 23 cycles / window. After that, the same interval of time could be subdivided into shorter sampling windows, each only 1024 samples long, and coefficients could be computed from them, which go from 12 to 23 cycles / window, thus completing 3 granules.

4 more ‘octaves’ should be possible, with sampling window lengths of 512, 256, 128 and 64 samples. Most of them would derive coefficients from 12 to 23 cycles / window again, with the exception of the 64-sample windows, which would derive from 12 to 31 cycles / window.

I would maintain the assumption, that from each length of sampling, a granule would result which is half as long, and for which coefficients would be stored.

This should result in 6 ‘octaves’ in total, each of which would have its own scale factor, stored once per frame interval (1024 samples), corresponding to the slowest granule. To simplify computing this scale factor, a global quality level could simply decide how many integers all the coefficients should be quantized to. For each ‘octave’, the peak amplitude within all the granules would be taken, and divided by this quality level, to arrive at the scale factor.

Each frame would store 63 granules, the 32 of which with the highest frequencies, would have 20 coefficients each, 30 of which would have 12 coefficients, and the longest granule of which would have 24 coefficients. This would result in 1024 coefficients / frame, in a fixed order.

To reduce waste, the scale factor of the highest, 6th octave, could simply be the same as that of the previous, 5th octave, as long as using that one yields lower quantized integers than the global quality level.

The resulting, quantized amplitudes could again, be encoded in a variable-length scheme, such as Exponential-Golomb, optionally plus a sign-bit.

One adverse side-effect of this would be, the complex and tedious computation of the scale factors. I do not assume that I would be using any Fast Fourier Transform, to determine audibility thresholds, and to set many of the DCT coefficients to zero, the way it is done with MP3. Then, it would make most sense to determine the scale factors from DCT values very closely analogous to how they are encoded.

The problems start, with the fact that each sampling interval is assumed to have a windowing function, when encoding. This turns into a major CPU load, once a scale factor needs to be computed 32×20 times per frame.

So one simplification I could offer, would be to begin by computing and temporarily storing all the DCT coefficients as 15-bit values, with the mere notion that they will later be quantized, but that a maximum value for them is kept up-to-date, once per ‘octave’ as defined above. After that, the scale factor can be computed from this maximum

Dirk

(Edit 06/07/2016 : ) This hypothetical scheme has a major drawback as it stands. Even though it will inherently detect and bracket transients, it would also have poor recovery from transients. After and before a transient, the above method will remain insensitive to sounds in the same octave, for up to 1024 samples in the case of a 44.1 kHz format, which translates into 25 milliseconds. In my opinion, the human ear can detect this as a ‘sound shadow’.

MP3 recovers from transients within 576 samples.

One way to correct this could be, to arrange for not one but two scale factors to be encoded for each octave, except for the lowest octave. The first scale factor would apply to the first half of the frame, while the second scale factor would apply to the second half.

In principle this idea could be extended, all the way until there is a separate scale factor for each granule, with the ostensible exception of the shortest, highest-frequency granules / octave… But then doing so would also imply the intent, of allocating a uniform number of bits / a uniform amount of information, to each granule, knowing that their number doubles temporally with each octave. This would not be, what I would want compressed sound to do.

Why the Temporal Resolution of MP3s is Poor.

I have spent a lot of my private time, thinking about lossy sound compression, and then, simplifying my ideas to something more likely to have been implemented in actual MP3 compression. In order to understand this concept, one needs to be familiar with the concept of Fourier Transforms. There are two varieties of them, which are important in sound compression: The “Discreet Fourier Transform” (‘DFT’), and the “Discreet Cosine Transform” (‘DCT’), the latter of which has several types again.

I did notice that the temporal resolution of MP3s I listen to is poor, and it was an important realization I finally came to, that this was not due to the actual length of the sampling window.

If we were to assume for the moment that the sampling interval was 1024 samples long – and for MP3, it is not – then to compute the DFT of that would produce 1024 frequency coefficients, from (0-1023 / 2) cycles / sampling interval. Each of these coefficients would be a complex number, and the whole set of them can be used to reconstruct the original sample-set accurately, by inverting the DFT. The inverse of the DFT is actually the DFT computation again, but with the imaginary (sine) component inverted (negated).

But, MP3s do not use the DFT, instead using the DCT, the main difference in which is, that the DCT does not record a complex number for each coefficient, rather just stating a real number, which would normally correspond only to the cosine function within the DFT… Admittedly, each of these absolute amplitudes may possibly be negated.

If the time-domain signal consisted of a 5 kHz wave, which was pulsating on and off 200 times per second – which would actually sound like buzzing to human ears – then the DCT would record a frequency component at 5kHz, but as long as they are not suppressed due to the psychoacoustic models used, would also record ‘sidebands’ at 4800 and at 5200 Hz, each of which has 1/2 the amplitude of the center frequency at 5 kHz. I know this, because for the center frequency to be turned on and off, it must be amplitude modulated, virtually. And so what has this time-domain representation, even though this happens faster than once per sampling window, also has a valid frequency-domain representation.

When this gets decoded, the coefficient-set will reproduce a sample-set, whose 5 kHz center frequency again seems to ‘buzz’ 200 times per second, due to the individual frequency components interfering constructively and then destructively, even though they are being applied equally across the entire sampling interval.

But because the coefficient-set was produced by a DCT, it has no accurate phase information. And so the exact time each 5 kHz burst has its maximum amplitude, will not correspond to the exact time it did before. This will only seem to correct itself once per frame. If the sampling interval was truly 1024 samples long, then a frame will recur every 512 samples, which is ~80 times per second.

Now the question could be asked, why it should not be possible to base lossy audio compression on the DFT instead. And the answer is that in principle, it would be possible to do so. Only, if each coefficient-set consisted of complex numbers, it would also become more difficult to compress the number of kbps kept in the stream, in an effective way. It would probably still not be possible to preserve the phase information perfectly.

And then as a side-note, this one hypothetical sample-set started out as consisting of real numbers. But with the DFT, the sample-set could carry complex numbers as easily as the coefficient-set did. If the coefficients were compressed-and-simplified, then the samples reproduced would probably end up being so, with complex values. In a case like this, the correct thing to do is to ignore the imaginary component, and only output the real component, as the decoded result…

When using a DCT to encode a stream of sound, which is supposed to be continuous, a vulgarization of the problem could be, that the stream contains ‘a sine wave instead of a cosine wave’, which would therefore get missed by all the sampling intervals, because only the product with the cosine function is being computed each time, for a specific coefficient. The solution that comes from the Math of the DCT itself is, that the phase of the unit vector generally rotates 90 degrees ~from each frame to the next~. To the best of my understanding, two sampling intervals will generally overlap by 50% in time, resulting in one frame half as long. It may be that the designers only compute the odd-numbered coefficients. Then, the same coefficient belonging to the next frame should be aware of this wave notwithstanding. Further, the sampling intervals are made to overlap when the stream is decoded again, such that a continuous wave can be reconstructed. ( :1 )

The only question I remain curious about, is whether a need exists when encoding with a DCT, to blend any given coefficient as belonging to two consecutive frames, the current one plus the previous one.

While it can be done, to use rectangular sampling windows for encoding, the results from that are likely to be offensive to listen to. So in practice, Blackman Windows should ideally be used for encoding (that are twice as long as a frame).

The choice of whether decoders should use a Hanning Window or a Linear Taper, can depend on what sort of situation should best be reproduced.

Decoding with a linear taper, will cause crescendos to seem maximally smooth, and perfectly so if the crescendo is linear. But considering that linear crescendos might be rare in real music, a Hanning Window will minimize the distortion that is generated, when a burst of sound is decoded, just as a Blackman Window was supposed to do when encoding. Only, a Blackman Window cannot be used to decode, because coefficients being constant from one frame to the next would result in non-constant (output) sample amplitudes.

Dirk

(Edit 05/18/2016 : ) One related fact should be acknowledged. The DCT can be used to reconstruct a phase-correct sample-set, if non-zero even-numbered as well as odd-numbered coefficients are included. This follows directly from the fact that a ‘Type 3′ DCT is the inverse of the ‘Type 2′. But, the compression method used by several codecs is such, that a psychoacoustic model suppresses coefficients, on the assumption that they should be inaudible, because they are too close spectrally, to stronger ones. This would almost certainly go into effect, between complementary even-numbered and odd-numbered DCT coefficients.

( 05/31/2016 : ) One detail which was not made clear to me, was whether instead, coefficients which are in the same sub-band as one that has a stronger peak, are merely quantized more, due to the scale-factor of that sub-band being higher, to capture this higher peak. This would strike me as favorable, but also results in greater bit-rates, than what would follow, from setting supposedly-inaudible coefficients to zero. Due to Huffman Encoding, the bit-length of a more-quantized coefficient, is still longer than that for the value (zero).

In any type of Fourier Transform, signal energy at one frequency cannot be separated fully from energy measured at a frequency different by only half a cycle per frame. When the difference is at least by one cycle per frame, energy and therefore amplitude become isolated. This does not mean however, that the presence of a number of coefficients equal to the number of samples, is always redundant.

And so, one good way to achieve some phase-correctness might be, to try designing a codec, which does not rely too strongly on the customary psychoacoustic models. For example, a hypothetical codec might rely on Quantization, followed by Exponential-Golomb Coding of the coefficients, being sure to state the scale of quantization in the header information of each frame.

It is understood that such approaches will produce ‘poorer results’ at a given bit-rate. But then, simply choosing a higher bit-rate (than what might be appropriate for an MP3) could result in better sound.

And then, just not to make our hypothetical codec too primitive, one could subdivide the audible spectrum into 8 bands, each one octave higher than the previous, starting from coefficient (8), so that each of these bands can be quantized by a different scale, according to the Threshold Of Audibility. These Human Loudness Perception Curves may be a simple form of psychoacoustics, but are also thought to be reliable fact, as perceived loudness does not usually correspond consistently to uniform spectral distribution of energy.

Parts of the spectrum could be quantized less, for which ‘the lower threshold of hearing’ is lower with respect to a calculable loudness value, at which the Human ears are thought to be uniformly sensitive to all frequencies.

Assigning such a header-field 8 times for each frame would not be prohibitive.

1: ) ( 05/31/2016 ) An alternative approach, which the designers of MP3 could just as easily have used, would have been first to compute the DCT, including both even- and odd-numbered coefficients F(k), but then to derive only the even-numbered coefficients from that. The best way would have been, for even numbered, derived coefficient G(k) to be found as

r = F(k)

i = F(k+1) – F(k-1)

G(0) = F(0)

G(k) = sqrt( r^2 + i^2 )