I’ve written quite a few postings in this blog about sound compression based on the Discrete Cosine Transform. Mixed in with my thoughts about that – where I was still, basically, trying to figure the subject out – were my statements to the effect that frequency-coefficients which are below a certain threshold of perceptibility could be set to zero, thus reducing the total number of bits taken up when Huffman-encoded.
My biggest problem in trying to analyze this is the fact that I’m considering generalities, when in fact, specific compression methods based on the DCT may or may not apply threshold-elimination at all. As an alternative, the compression technique could just rely on the quantization to reduce how many bits per second it’s going to allocate to each sub-band of frequencies. ( :1 ) If the quantization step / scale-factor was high enough – suggesting the lowest quality-level – then many coefficients could still end up set to zero, just because their values, as first computed from the DCT, were below the quantization step used.
My impression is that the procedure which gets used to compute the quantization step remains straightforward:
- Subdivide the frequencies into an arbitrary set of sub-bands – fewer than 32.
- For each sub-band, first compute the DCTs to scale.
- Take the (absolute of the) highest coefficient that results.
- Divide that by the quality-level ( + 0.5 ) , to arrive at the quantization step to be used for that sub-band.
- Divide all the actual DCT-coefficients by that quantization step, so that the maximum, (signed) integer value that results, will be equal to the quality-level.
- How many coefficients end up being encoded to having such a high integer value, remains beyond our control.
- Encode the quantization step / scale-factor with the sub-band, as part of the header information for each granule of sound.
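The steps listed above can be sketched in a few lines of Python. This is only a minimal sketch under my own assumptions: the function name 'quantize_sub_band' is hypothetical, and I assume that the division is followed by truncation towards zero, because that is what makes the strongest coefficient land exactly on the quality-level after ( + 0.5 ) was added to the divisor.

```python
import math

def quantize_sub_band(coefficients, quality_level):
    """Quantize the DCT coefficients of one sub-band (hypothetical helper).

    Returns the quantization step (scale-factor) and the signed integers.
    """
    peak = max(abs(c) for c in coefficients)
    if peak == 0.0:
        # A silent sub-band has nothing to quantize.
        return 0.0, [0] * len(coefficients)
    # Quantization step: highest absolute coefficient / (quality-level + 0.5)
    step = peak / (quality_level + 0.5)
    # Assumption: truncate towards zero, so the peak maps to quality_level.
    quantized = [math.trunc(c / step) for c in coefficients]
    return step, quantized

step, levels = quantize_sub_band([0.9, -0.3, 0.05, 0.0], quality_level=4)
print(step, levels)
```

In this example, the weakest non-zero coefficient already ends up as a zero, purely because it was smaller than the quantization step, and without any threshold-elimination having been applied.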
The sub-band which I speak of has nothing to do with the fact that additionally, in MP3-compression, the signal is first passed through a quadrature filter-bank, resulting in 32 sub-bands that are evenly-spaced in frequencies by nature, and that the DCT is computed of each sub-band. This latter feature is a refinement, which as best I recall, was not present in the earliest forms of MP3-compression, and which does not affect how an MP3-file needs to be decoded.
(Updated 03/10/2018 : )
(As of 03/03/2018 : )
This also means that each of the sub-bands being referred to now, still has 18 DCT coefficients.
If threshold-elimination is to be used in any one scheme, then my main alarm was that eliminating individual coefficients, or setting them to zero, would result in considerable aliasing, because the culled coefficients might be necessary complements to coefficients which are meant to be heard, and without which the phase-positions, and thus the timing, of the heard coefficients could end up becoming wrong when played back. And, the coefficient which the present one complements could belong to an overlapping granule of sound, and thus not even be accessible when the present granule is being encoded.
But, if the encoder is already subdividing the band into 32 sub-bands, then one advantage this can bring, is that the variance and therefore signal-energy of each sub-band can be computed easily and accurately. And therefore, an entire sub-band could be squelched, hypothetically, because its energy is below the threshold, and not individual coefficients. ( :2 )
One problem with the DCT was, that its coefficients by nature do not state the true signal-energy of a signal-component, because to do so really requires that both the sine-product and the cosine-product be computed, that each product be squared, and that the results be added.
But merely computing the variance that results in each sub-band, surely leads to an accurate measure of how much energy it has…
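The point about the cosine-product can be illustrated numerically. What follows is a sketch under my own assumptions, and it is not a true DCT: the hypothetical function 'measure' just multiplies a test tone by a cosine and by a sine of the same frequency, over a whole number of cycles, and also computes the plain variance of the samples.

```python
import math

def measure(phase, n=64, k=4):
    """Project a unit-amplitude tone (k cycles in n samples) onto cos and sin.

    Returns (cosine_product, sine_product, variance). Hypothetical helper.
    """
    angles = [2.0 * math.pi * k * i / n for i in range(n)]
    samples = [math.cos(a + phase) for a in angles]
    cos_product = 2.0 / n * sum(s * math.cos(a) for s, a in zip(samples, angles))
    sin_product = 2.0 / n * sum(s * math.sin(a) for s, a in zip(samples, angles))
    mean = sum(samples) / n
    variance = sum((s - mean) ** 2 for s in samples) / n
    return cos_product, sin_product, variance

for phase in (0.0, math.pi / 3.0, math.pi / 2.0):
    c, s, var = measure(phase)
    # c alone changes with the phase; c*c + s*s and the variance do not.
    print(round(c, 3), round(c * c + s * s, 3), round(var, 3))
```

The cosine-product by itself falls from 1.0 to 0.0 as the phase of the tone shifts by 90⁰, even though the tone is equally strong each time. The sum of the two squares, and the variance, both stay constant.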
The approach which I outlined above for computing the scale-factors, if left to itself, will produce one major problem: Each of the sub-bands used – of which there are fewer than 32 – will receive equal representation in the encoded signal, even if it had no meaningful content prior to compression.
Scale-Factor == Quantization Step
In other words, even the sub-band that corresponds to ’15kHz and over’, will be computed to have some DCT coefficients, one of which will be the maximum, given an uncompressed signal which had little high-frequency content. The scale-factor of that sub-band will be computed like that of any other, so that its strongest coefficient will still be represented by an integer, whose maximum value also follows as ‘the quality factor’.
This is a type of problem which must be solved, before a practical compression-scheme can be designed. And of course, because we know that numerous compression-schemes exist, we can also conclude that each of those has solved this problem in some way.
(Edit 03/05/2018 :
Hypothetically, somebody could suggest, just to treat the entire band as one sub-band, with only one scale-factor.
The problem with this idea would lie in the fact that typical audio streams contain strong amplitudes at frequencies which Humans hear poorly – such as at 250Hz – as they would be seen on an oscilloscope, or as they’re often visualized today, in the GUI of digital audio workstation software. Typically, the actual amplitudes at the higher frequencies, are lower. In that case, the corresponding frequency-components will also be quantized to result in the suggested, maximum integer-value. At the same time, frequency-components which Humans hear well by comparison – such as at 1.5kHz – would end up being quantized to lower integers.
The result would also follow, that frequency-components which we hear most clearly, would receive a diminished number of bits in the compressed stream, to define what the listener will be able to hear upon decoding.
And so, some middle-ground must be found, between the two extremes which I just outlined. )
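The weakness of a single scale-factor can be shown with made-up numbers. This is only a sketch: the function name 'quantize_with_one_step' and the two amplitudes are my own assumptions, chosen so that the 250Hz component is five times as strong as the 1.5kHz component, and the truncation towards zero is assumed as well.

```python
import math

def quantize_with_one_step(amplitudes, quality_level):
    """Quantize every component with one shared step (hypothetical helper)."""
    peak = max(abs(a) for a in amplitudes.values())
    step = peak / (quality_level + 0.5)
    # Assumption: truncation towards zero after the division.
    return {name: math.trunc(a / step) for name, a in amplitudes.items()}

# Made-up amplitudes: strong at 250Hz, weaker at 1.5kHz.
levels = quantize_with_one_step({"250Hz": 1.0, "1.5kHz": 0.2}, quality_level=5)
print(levels)
```

Here the poorly-heard component receives the full integer range, while the component which is heard well is left with a single quantization level.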
One assumption that I’m making is, that the famous bathtub-curve, which depicts absolute thresholds of audibility as a function of a single frequency occurring by itself, will continue to be a valid point-of-reference for relative audibility thresholds, when frequency-components are mixed. This pretty assumption is not proven. ( :4 )
One way to solve the problem above could be, for the encoder to recognize which ( arbitrary, ) quantization sub-band had the strongest DCT coefficient, and to compute the quantization step according to that sub-band. Next, the resulting scale-factor could be divided by the (lowest) audibility-threshold within the same sub-band, before being multiplied by the corresponding audibility threshold within each other sub-band, to determine the scale-factors of all the sub-bands.
- Frequencies above 16kHz or below 250Hz, would be treated as equivalent to those, for 16kHz or 250Hz.
- Because the result is to be in amplitude-units, and not energy-units, the square root needs to be computed for the antilogarithm, of the SPL Decibel values I linked to. Or, the Decibel values used must first be halved.
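The first approach above could be sketched as follows. The threshold values in the table are placeholders which I made up for illustration, not measured data, and the names 'THRESHOLD_DB', 'db_to_amplitude' and 'scale_factors' are hypothetical. Dividing the Decibel value by 20 instead of by 10 is the same thing as halving the Decibels first, or as taking the square root of the energy-ratio.

```python
# Placeholder audibility thresholds in dB SPL, one (lowest) value per sub-band.
# These numbers are invented for the example and are NOT measured data.
THRESHOLD_DB = {"250Hz": 12.0, "1.5kHz": 2.0, "4kHz": -4.0, "16kHz": 32.0}

def db_to_amplitude(db):
    """Convert Decibels to an amplitude ratio (not an energy ratio)."""
    return 10.0 ** (db / 20.0)

def scale_factors(reference_band, reference_step, threshold_db=THRESHOLD_DB):
    """Derive every sub-band's step from the step of the strongest sub-band."""
    ref_amplitude = db_to_amplitude(threshold_db[reference_band])
    return {
        band: reference_step / ref_amplitude * db_to_amplitude(db)
        for band, db in threshold_db.items()
    }

# Suppose the strongest coefficient was found in the 250Hz sub-band.
for band, step in scale_factors("250Hz", reference_step=0.1).items():
    print(band, step)
```

With these placeholder numbers, the 16kHz sub-band has a threshold 20dB higher than the reference sub-band, so that its quantization step comes out 10 times as coarse.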
Another way to solve the same problem could be, that a line across the audibility-threshold curve could be drawn to meet the threshold that’s valid for 16kHz … , and that this line could represent a reference-level.
The variance of any one sampling interval could be treated as equivalent to that level, and represented in the software as (1.0) , so that all the other thresholds would be represented by floating-point numbers smaller than (1.0) , as antilogarithms of the Decibel values I linked to …
The sampling-interval variance could next be divided by the quality-level which is set, squared ( + 0.5 ) , after which it could be multiplied by the (lowest) relative threshold within each sub-band to be used (squared), to arrive at an actual threshold for that sub-band. Alternatively, it could additionally be divided by the absolute audibility-threshold for 16kHz (squared). ( :3 )
- This approach assumes that the values to be eliminated are accurate measures of signal-energy.
The intended meaning of this second concept is, that at a quality-level set to (1) , a frequency above 16kHz or below 250Hz would be audible, if and only if it was the only frequency-component in the signal. But, if a quality-level of maybe (5) was set, then there could be as many as 25 frequency-components above 16kHz or below 250Hz, again, provided that those are the only frequency-components that make up the signal.
In practice, applying this concept means, that if the quality-level was set to (1) , then there would be no real chance of signal-components above 16kHz being encoded. But few sampling-intervals would really be encoded with a quality-level set so low.
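The second concept can also be sketched in code. I’m reading ‘the quality-level, squared ( + 0.5 )’ as (quality-level² + 0.5), and I’m assuming that the variance of the sampling interval equals the sum of the sub-band energies. The function names are hypothetical, and the relative thresholds default to (1.0) , which corresponds to the reference-level for 16kHz.

```python
def sub_band_threshold(total_variance, quality_level, relative_threshold=1.0):
    """Energy threshold for one sub-band, under my reading of the formula."""
    return total_variance / (quality_level ** 2 + 0.5) * relative_threshold ** 2

def surviving_sub_bands(energies, quality_level, relative_thresholds=None):
    """Return the indices of the sub-bands which are not squelched."""
    if relative_thresholds is None:
        relative_thresholds = [1.0] * len(energies)
    # Assumption: the interval's variance is the sum of the sub-band energies.
    total = sum(energies)
    return [
        i for i, (energy, rel) in enumerate(zip(energies, relative_thresholds))
        if energy >= sub_band_threshold(total, quality_level, rel)
    ]

print(len(surviving_sub_bands([1.0] * 25, quality_level=5)))  # 25 equal components
print(len(surviving_sub_bands([1.0] * 26, quality_level=5)))  # 26 equal components
print(surviving_sub_bands([1.0, 1.0], quality_level=1))       # two equal components
```

At a quality-level of (5) , 25 equally-strong components at the reference-level all survive, but 26 of them do not. At a quality-level of (1) , even two equally-strong components squelch each other.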
(Edit 03/05/2018 : )
If it is a known fact, that only 32 sub-bands are to be analyzed for their energy, then the quality-level used to determine the threshold can be capped at (6), because:
6^2 = 36 >= 32
(Edit 03/10/2018 : )
If the reader thinks that admitting signal-energy in one sub-band, which is ‘only 1/36’ the signal-energy of the entire signal, represents a low threshold, this is not true. The reason is the fact that 1/36 the energy, which can approximately be written as -15.6dB, is actually as high as 1/6 the original amplitude ! This would make the amplitude consistent with the quantization step. Thus, we can visualize a waveform in graphical representation, and we can visualize another, that has 1/6 the amplitude of the first. And then, both waveforms would be clearly recognizable. It’s not obvious then, that the waveform with 1/6 amplitude would not be audible. At best, this would happen at frequencies which Humans hear most-poorly.
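The arithmetic behind this can be checked with a few lines of Python. The variable names are just my own, and the figure of 32 sub-bands is the one stated above.

```python
import math

SUB_BANDS = 32

# Smallest quality-level whose square covers all 32 sub-bands.
quality_cap = math.ceil(math.sqrt(SUB_BANDS))

energy_ratio = 1.0 / quality_cap ** 2          # 1/36 of the signal-energy
amplitude_ratio = math.sqrt(energy_ratio)      # 1/6 of the amplitude

energy_db = 10.0 * math.log10(energy_ratio)        # Decibels from energy
amplitude_db = 20.0 * math.log10(amplitude_ratio)  # Decibels from amplitude

print(quality_cap, round(amplitude_ratio, 4), round(energy_db, 2))
```

Both ways of computing the Decibel value agree, because an energy-ratio is the square of the corresponding amplitude-ratio.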
Above, I wrote that a certain, pretty assumption was ‘not proven’. I suspect that it is not proven, because it’s actually untrue. And if that’s the case, then compression-schemes which do not apply threshold-elimination, if they’ve been set to quality-levels high enough, will quantize the poorly-heard frequencies more coarsely (than the easily-heard frequencies), but will at least encode the poorly-heard ones also, so that some facsimile of those frequencies is also played back.
I suspect that the not-proven concept would be difficult to verify experimentally. Audio Experts actually have a hard time, just to measure the absolute audibility thresholds. To do so, they’ve had to modify their laboratory procedures, because procedures used decades ago, already did not produce accurate results. I don’t know what procedures could be devised, to measure the relative thresholds, ‘When many frequency-components are already-present in the signal, and mixed.’