Threshold Elimination in Compressed Sound

I’ve written quite a few postings in this blog about sound compression based on the Discrete Cosine Transform. Mixed in with my thoughts about that – where I was still, basically, trying to figure the subject out – were my statements to the effect that frequency-coefficients which are below a certain threshold of perceptibility could be set to zero, thus reducing the total number of bits taken up when Huffman-encoded.

My biggest problem in trying to analyze this is the fact that I’m considering generalities when, in fact, specific compression methods based on the DCT may or may not apply threshold-elimination at all. As an alternative, the compression technique could just rely on the quantization to reduce how many bits per second it’s going to allocate to each sub-band of frequencies. ( :1 ) If the quantization step / scale-factor was high enough – suggesting the lowest quality-level – then many coefficients could still end up set to zero, just because, as first computed from the DCT, they were below the quantization step used.

My impression is that the procedure which gets used to compute the quantization step remains straightforward:

  • Subdivide the frequencies into an arbitrary set of sub-bands – fewer than 32.
  • For each sub-band, first compute the DCTs to scale.
  • Take the largest absolute value among the coefficients that result.
  • Divide that by the quality-level ( + 0.5 ), to arrive at the quantization step to be used for that sub-band.
  • Divide all the actual DCT-coefficients by that quantization step, so that the largest (signed) integer value that results will be equal in magnitude to the quality-level.
  • How many coefficients end up being encoded to having such a high integer value, remains beyond our control.
  • Encode the quantization step / scale-factor with the sub-band, as part of the header information for each granule of sound.
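The steps above can be sketched in a few lines of Python. This is only a minimal illustration under my own assumptions, not the procedure of any specific codec: the names `quantize_sub_band` and `dequantize_sub_band` are hypothetical, the step is taken to be the peak divided by (quality-level + 0.5), and the division result is truncated toward zero so that the peak coefficient lands exactly on the quality-level.

```python
import math

def quantize_sub_band(coeffs, quality_level):
    """Quantize one sub-band of DCT coefficients.

    Returns (step, integers). Truncation toward zero is an assumption
    of this sketch; a real codec may round differently.
    """
    peak = max(abs(c) for c in coeffs)
    if peak == 0.0:
        # A silent sub-band: every quantized value is zero, any step will do.
        return 1.0, [0] * len(coeffs)
    # The '+ 0.5' keeps the peak safely inside the top integer bin.
    step = peak / (quality_level + 0.5)
    quantized = [math.trunc(c / step) for c in coeffs]
    return step, quantized

def dequantize_sub_band(step, quantized):
    """Decoder side: scale the integers back up by the stored step."""
    return [q * step for q in quantized]

# Made-up coefficients for one sub-band, purely for illustration.
coeffs = [0.9, -12.0, 3.1, 0.4, -0.2, 7.5]
step, quantized = quantize_sub_band(coeffs, quality_level=3)
print(step, quantized)
```

With a low quality-level such as 3, every coefficient smaller in magnitude than the step comes out as zero without any explicit threshold being applied, which is the effect described earlier.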

The sub-bands which I speak of have nothing to do with the fact that additionally, in MP3-compression, the signal is first passed through a quadrature filter-bank, resulting in 32 sub-bands that are evenly spaced in frequency by nature, and that a DCT is computed for each of those sub-bands. This latter feature is a refinement which, as best I recall, was not present in the earliest forms of MP3-compression, and which does not affect how an MP3-file needs to be decoded.

(Updated 03/10/2018 : )


A single time-delay can also be expressed in the frequency-domain.

Another way to state, that a stream of time-domain samples has been given a time-delay, is simply to state that each frequency-coefficient has been given a phase-shift, that depends both on the frequency of the coefficient, and on the intended time-delay.
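As a rough illustration of that statement, here is a minimal Python sketch that delays a short block of samples purely by rotating the phase of each DFT coefficient. The function names are my own, the DFT is the naive O(n²) form, and the delay is assumed to be a whole number of samples. Because the block is finite, the delay that results is circular: samples pushed off the end wrap around to the start.

```python
import cmath

def dft(x):
    """Naive forward DFT of a list of samples."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Naive inverse DFT, normalized by 1/n."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def delay_via_phase(x, delay):
    """Delay x by 'delay' samples using only per-frequency phase-shifts."""
    n = len(x)
    X = dft(x)
    # The phase-shift grows with both the frequency index k and the delay.
    shifted = [X[k] * cmath.exp(-2j * cmath.pi * k * delay / n) for k in range(n)]
    return [v.real for v in idft(shifted)]

signal = [1.0, 0.5, -0.25, 0.0, 0.0, 0.0, 0.0, 0.0]
delayed = delay_via_phase(signal, 3)
print([round(v, 6) for v in delayed])
```

No amplitude is changed anywhere in `delay_via_phase`; each coefficient is only multiplied by a complex number of magnitude 1.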

A concern that some readers might have with this is the fact that a number of samples need to be stored in order for a time-delay to be executed in the time-domain. But as soon as differing values for the coefficients of a Fourier Transform are spaced closer together, indicating in this case a longer time-delay, its computation also requires that a longer interval of time-domain samples be combined.

Now, if the reader would like to visualize what this would look like, as an analogy to a graphical equalizer, then he would need to imagine a graphical equalizer the sliders of which can be made negative – i.e. one that can command that one frequency come out inverted – so that then, if he was to set his sliders into the accurate shape of a sine-wave that goes both positive and negative in its settings, he should obtain a simple time-delay.

But there is one more reason for which this analogy would be flawed. The type of Fourier Transform which is best-suited for this would be the Discrete Fourier Transform, not one of the Discrete Cosine Transforms. The reason is the fact that the DFT accepts complex numbers as its terms. And so the reader would also have to imagine that his equalizer not only has sliders that move up and down, but sliders with little wheels on them, with which he can give a phase-shift to one frequency without changing its amplitude. Obviously, graphical equalizers for music are not made that way.


Why The Discrete Cosine Transform Is Invertible

There are people who would answer this question entirely using Algebra, but unfortunately, my Algebra is not up to standard, specifically when applied to Fourier Transforms. Yet, I can often visualize such problems and reason them out, which can provide a kind of common-sense answer, even to this type of a question.

If a DCT is fed a time-domain sine-wave, the frequency of which exactly corresponds to an odd-numbered frequency coefficient, but which is 90 degrees out of phase with that coefficient, the fact stands, that the coefficient in question remains zero for the current sampling interval.

But in that case, the even-numbered coefficients, and not only the two directly adjacent to this center frequency, will alternate between positive and negative values. When the coefficients are then laid out, a kind of decaying wave-pattern becomes humanly discernible, which happens to have its zero-crossings, directly at the odd coefficients.

Also, in this case, if we were just to add all the coefficients, we should obtain zero, which would also be what the time-domain sample at n=0 should be equal to, consistently with a sine wave and not a cosine wave.

And this is why, if an IDCT is applied to the coefficients, and if the phase information of this chosen IDCT is correct, the original sine wave can be reconstructed.

Note: If the aim is to compress and then reproduce sound, we normalize the DCT, but do not normalize the IDCT. Hence, with the Inverse, if a coefficient stated a certain magnitude, then that one coefficient by itself is also expected to produce a ‘sine-wave’, with the corresponding amplitude. ( :1 )
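For readers who would rather check this numerically, here is a minimal Python sketch. It rests on assumptions which the text above does not fix: I picked the DCT-II with its half-sample offset, put the whole normalization (a factor of 2/N) into the forward transform and left the inverse un-normalized, as the note suggests, and chose N = 16 with coefficient 5 as the 'odd-numbered' one. The names `dct` and `idct` are just this sketch's own.

```python
import math

def dct(x):
    """Forward DCT-II, carrying the full 2/N normalization."""
    n = len(x)
    return [(2.0 / n) * sum(x[t] * math.cos(math.pi * k * (t + 0.5) / n)
                            for t in range(n))
            for k in range(n)]

def idct(X):
    """Un-normalized inverse: a coefficient of size A yields a cosine of amplitude A."""
    n = len(X)
    return [X[0] / 2.0 + sum(X[k] * math.cos(math.pi * k * (t + 0.5) / n)
                             for k in range(1, n))
            for t in range(n)]

N, K0 = 16, 5
# A sine at the frequency of coefficient K0, 90 degrees out of phase with its cosine.
sine = [math.sin(math.pi * K0 * (t + 0.5) / N) for t in range(N)]
coeffs = dct(sine)
rebuilt = idct(coeffs)

print([round(c, 4) for c in coeffs])
print(max(abs(a - b) for a, b in zip(sine, rebuilt)))
```

Under this particular definition, coefficient 5 comes out as zero to within rounding error, the signal shows up in the even-numbered coefficients instead, and the inverse still rebuilds the sine wave.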

I think that it is a kind of slip which people can make, to regard a Fourier Transform ‘as if it was a spectrum analyzer’, the ideal behavior of which, in response to an analog sine-wave of one frequency, was just to display one line, which represents a single non-zero data-point, in this case corresponding to a frequency coefficient. In particular because Fourier Transforms are often computed for finite sampling intervals, the latter can behave differently. And the DCT seems to display this the most strongly.

While it would be tempting to say, that a DFT might be better behaved, the fact is that when computers crunch complex numbers, they represent those as pairs of real numbers. So while there is a ‘real’ component that results from the cosine-multiplication, and an ‘imaginary’ component that results from the sine-multiplication, each of these components could leave a human viewer equally confused as a DCT might, because again, each of these is just an orthogonal component vector.

So even in the case of the DFT, each number is initially not yet an amplitude. We still need to square the real and the imaginary component, and to add the two squares. Only then, depending on whether we take the square root or not, are we left with an amplitude or a signal energy, finally.

When using a DFT, it can be easy to forget, that if we feed it a time-domain single-pulse, what it will yield in the frequency-domain, is actually a series of complex numbers, the absolutes of which do not change, but which do a rotation in the complex plane, when plotted out along the frequency-domain. And then, if all we could see was either their real or their imaginary component, we would see that the DFT also produces a fringing effect initially.
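A small Python sketch can make this visible. It assumes a naive DFT, a 16-sample frame, and a single pulse placed at sample 3 (a pulse at sample 0 would give no rotation at all); these choices are mine, purely for illustration.

```python
import cmath

def dft(x):
    """Naive forward DFT of a list of samples."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

N = 16
pulse = [0.0] * N
pulse[3] = 1.0  # a single unit pulse, three samples into the frame

spectrum = dft(pulse)
# Energy: sum of the squared real and imaginary components.
energies = [c.real ** 2 + c.imag ** 2 for c in spectrum]
# Amplitude: the square root of that energy.
magnitudes = [e ** 0.5 for e in energies]
# What we would see if only one orthogonal component was displayed.
real_parts = [c.real for c in spectrum]

print([round(m, 6) for m in magnitudes])
print([round(r, 3) for r in real_parts])
```

The amplitudes are all equal to 1, while the real components alone swing between positive and negative values along the frequency axis.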

The fact that these numerical tools are not truly spectrographs, can make them unsuitable for direct use in Psychoacoustics, especially if they have not been adapted in some special way for that use.

Dirk

1: ) This latter observation also has a meaning for when we want to entropy-encode a (compressed) sound file, and when the time-domain signal was white noise. If we can assume that each frame states 512 coefficients, and that the maximum amplitude of the simulated white noise is supposed to be +/- 32768, then the amplitude of our ‘small numbers’ would really only need to reach 64, so that when they interfere constructively and destructively over an output interval, they will produce this effect.

Now, one known fact about musical sounds which are based on white noise is that they are likely to be ‘colored’, meaning that the distribution of signal energy is usually not uniform over the entire audible spectrum. Hence, if we wanted just 1/8 of the audible spectrum to be able to produce a full signal strength, then we would need for the entropy-encoded samples to reach 512. And, we might not expect the ‘small numbers’ to be able to reproduce white noise at full amplitude, since the length of the big numbers is ‘only’ 15 bits+ anyway. One entropy-encoded value might already have a length of ~3 bits. So it could also be acceptable if as many as 1/6 of the coefficients were encoded as ‘big numbers’, so that again, the maximum amplitude of the ‘small numbers’ would not need to carry the sound all by itself…
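The two figures above follow from one division each, which the short Python sketch below spells out. It assumes the worst case that I have in mind, namely that all the active coefficients happen to add up constructively at some output sample; the names are hypothetical.

```python
FULL_SCALE = 32768       # peak amplitude of 16-bit output
COEFFS_PER_FRAME = 512   # coefficients stated per frame, as assumed above

def per_coefficient_amplitude(active_count):
    """Amplitude each active coefficient needs if all of them add up at one sample."""
    return FULL_SCALE / active_count

whole_spectrum = per_coefficient_amplitude(COEFFS_PER_FRAME)       # all 512 in play
one_eighth = per_coefficient_amplitude(COEFFS_PER_FRAME // 8)      # only 64 in play
print(whole_spectrum, one_eighth)
```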

And yet, some entropy-encoding tables with high amplitudes might be defined, just in case the user asks for the lowest-possible bit-rates.