I can offer a sound-compression scheme that I know will not work, as a point of reference.

In This Posting, I suggested a way of using a Discreet Fourier Transform, which I suspect may be in use in sound compression techniques such as MP3, with the exception of the fact that I think MP3 uses sampling intervals of 1152 samples, while in theory I was suggesting 1024.

What I was suggesting, was that if the sampling intervals overlap by 50%, Because they only use the odd-numbered coefficients, each of them would analyze a unit vector, as part of a phase-vector diagram, which would have been 90 degrees out of phase with the previous. And every second sampling interval would also have a base vector, which is 180 degrees out of phase with the earlier one.

If the aim was, to preserve the phase-position of the sampled sound correctly, it might seem that all we need to do, is to preserve the sign of each coefficient, so that when the sampling intervals are reconstructed as overlapping, a wave will result, that has the correct phase angle, between being a ‘cosine’ and a ‘sine’ wave.

But there would be yet another problem with that, specifically in sound compression, if the codec is using customary psychoacoustic models.

By its nature, such a scheme would produce amplitudes which, in addition to requiring a sign bit to store, would be substantially different between even-numbered and odd-numbered sampling intervals, not because of time-based changes in the signal, but because they are orthogonal unit vectors.

An assumption that several sound compression schemes also make, is that If the amplitude of a certain frequency component was at say 100% at t=-1, Then at t=0 that same frequency component has an inherited ‘audibility threshold’ of maybe 80%, ? Thus, if the later frequency coefficient does exceed 80%, it will be deemed inaudible and suppressed. Hence, an entire ‘skirt’ of audibility thresholds tends to get stored, for all the coefficients, which not only surrounds peak amplitudes with slopes, but which additionally decays from one frame to the next.

Hence, even if our frames were intended to complement each other as being orthogonal, practical algorithms will nonetheless treat them as having the same meaning, but consecutive in time. And then, if one or the other is simply suppressed, our phase accuracy is gone again.

This thought was also the main reason for which I had suggested, that the current and the previous sampling interval should have their coefficients blended, to arrive at the coefficient for the current frame. And the blending method which I would have suggested, was not a linear summation, but to find the square root, of the sums, of the squares of theĀ  two coefficients.

As soon as anybody has done that, they have computed the absolute amplitude, and have destroyed all phase-information.

But there is an observation about surround-sound which comes as comforting for this subject. Methods that exist today, to decode stereo into 5.1 surround, only require phase-accuracy to within 90 degrees, as far as I know, to work properly. This would be due to the industrious way in which Pro Logic 1 and 2 were designed.

And so one type of information which could be added back in to the frequency coefficients, would be of whether the cosine and the sine function are each positive or negative, with respect to the origin of each frame. This will result in sound reproduction which is actually 45 degrees out of phase, from how it started, yet possessing 4 possible phase positions, that correspond to the 4 quadrants of the sine and cosine functions.

And this could even be encoded, simply by giving the coefficient a single sign-bit, with respect to each frame.

And this could cause some perception oddity when the weaker coefficients are played back, with an inaccurate phase position with respect to the dominant coefficients. Yet, a system is plausible, that at least states a phase position, for the stronger, dominant frequency components.

What I could also add, is that in a case where Coefficient 1 had an amplitude of 100%, and the audibility threshold of Coefficient 2 followed as being at 80%, their computation does not always require that these values be represented in decibels.

Obviously, in order for any psychoacoustic model to work, the initial research needs to reveal relationships in decibels. But if we can at least assume that Coefficient 2 was always a set number of decibels lower than Coefficient 1, even at odd numbers of decibels, this relationship can be converted into a fraction, which can be applied to amplitude units, instead of to decibel values.

And, if we are given a signed 16-bit amplitude, and would wish to multiply it by a fraction, which has also been prepared as an unsigned 15-bit value expressing values from 0 to 99%, then to perform an integer multiplication between these two will yield a 32-bit integer. Because we have right-shifted the value of one of our two integers, from the sense in which they usually express +32767 … -32768, we do have the option of next ignoring the least-significant word from our product, and using only the most-significant 16-bit word, to result in a fractional multiplication.

The same can be done with 8-bit integers.

Further, if we did not want our hypothetical scheme to introduce a constant 45 degree phase-shift, there would be a way to get rid of that, which would add some complexity at the encoding level.

For each pair of sampling intervals, it could be determined rather easily, which of thee two was the even-numbered, and which was the odd-numbered. Then, we could find whether the absolute of the sine or the absolute of the cosine component was greater, and record that as 1 bit. Finally, we would determine whether the given component was negative or not, and record that as another bit.

(Edit 05/23/2016 : ) Further, there is no need to encode both frames in MPEG-2, where one frame is stored in their place, as derived from both sampling intervals. Hence, the default pattern would be shortened to [ (0,0), (1,0) ] .

(Edit 12/31/2016 : The default pattern would be shortened to [ (0,0), (0,1) ] .

When playing back the frames, the second granule of each could follow from the first, by default, in the following pattern:


+cos -> -sin
-sin -> -cos
-cos -> +sin
+sin -> +cos

We would need access to the sine-counterpart, of the IDCT, to play this back.

End of Edit . )

But then we should also ask ourselves, what has truly been gained. A phase-position that lines up with an assumed cosine or an assumed sine vector, should be remembered as only being lined up, with the origin of each sampling window. But the exact timing of each sampling window is arbitrary, with respect to the input signal. There is really no reason to assume any exact correspondence, since the source of the signal is typically from somewhere else, than the provider of the codec.

And so in my opinion, to have all the reproduced waves consistently 45 degrees out of phase, only puts them that way, with respect to a sampling window whose timing is unknown. According to Pro Logic, Surround-Sound decoding, what should really matter, is whether two waves belonging to the same stream, are in-phase with each other, or out-of-phase, to whatever degree of accuracy can be achieved.

( This last concept, actually contradicts being able to reconstruct a waveform accurately, because a constant phase-shift is inconsistent with a constant time-delay, over a range of frequencies. When a complex waveform is actually time-shifted, so as to keep its shape, then this conversely implies different phase-shifts for its frequency components. )

Dirk

 

Why the Temporal Resolution of MP3s is Poor.

I have spent a lot of my private time, thinking about lossy sound compression, and then, simplifying my ideas to something more likely to have been implemented in actual MP3 compression. In order to understand this concept, one needs to be familiar with the concept of Fourier Transforms. There are two varieties of them, which are important in sound compression: The “Discreet Fourier Transform” (‘DFT’), and the “Discreet Cosine Transform” (‘DCT’), the latter of which has several types again.

I did notice that the temporal resolution of MP3s I listen to is poor, and it was an important realization I finally came to, that this was not due to the actual length of the sampling window.

If we were to assume for the moment that the sampling interval was 1024 samples long – and for MP3, it is not – then to compute the DFT of that would produce 1024 frequency coefficients, from (0-1023 / 2) cycles / sampling interval. Each of these coefficients would be a complex number, and the whole set of them can be used to reconstruct the original sample-set accurately, by inverting the DFT. The inverse of the DFT is actually the DFT computation again, but with the imaginary (sine) component inverted (negated).

But, MP3s do not use the DFT, instead using the DCT, the main difference in which is, that the DCT does not record a complex number for each coefficient, rather just stating a real number, which would normally correspond only to the cosine function within the DFT… Admittedly, each of these absolute amplitudes may possibly be negated.

If the time-domain signal consisted of a 5 kHz wave, which was pulsating on and off 200 times per second – which would actually sound like buzzing to human ears – then the DCT would record a frequency component at 5kHz, but as long as they are not suppressed due to the psychoacoustic models used, would also record ‘sidebands’ at 4800 and at 5200 Hz, each of which has 1/2 the amplitude of the center frequency at 5 kHz. I know this, because for the center frequency to be turned on and off, it must be amplitude modulated, virtually. And so what has this time-domain representation, even though this happens faster than once per sampling window, also has a valid frequency-domain representation.

When this gets decoded, the coefficient-set will reproduce a sample-set, whose 5 kHz center frequency again seems to ‘buzz’ 200 times per second, due to the individual frequency components interfering constructively and then destructively, even though they are being applied equally across the entire sampling interval.

But because the coefficient-set was produced by a DCT, it has no accurate phase information. And so the exact time each 5 kHz burst has its maximum amplitude, will not correspond to the exact time it did before. This will only seem to correct itself once per frame. If the sampling interval was truly 1024 samples long, then a frame will recur every 512 samples, which is ~80 times per second.

Now the question could be asked, why it should not be possible to base lossy audio compression on the DFT instead. And the answer is that in principle, it would be possible to do so. Only, if each coefficient-set consisted of complex numbers, it would also become more difficult to compress the number of kbps kept in the stream, in an effective way. It would probably still not be possible to preserve the phase information perfectly.

And then as a side-note, this one hypothetical sample-set started out as consisting of real numbers. But with the DFT, the sample-set could carry complex numbers as easily as the coefficient-set did. If the coefficients were compressed-and-simplified, then the samples reproduced would probably end up being so, with complex values. In a case like this, the correct thing to do is to ignore the imaginary component, and only output the real component, as the decoded result…

When using a DCT to encode a stream of sound, which is supposed to be continuous, a vulgarization of the problem could be, that the stream contains ‘a sine wave instead of a cosine wave’, which would therefore get missed by all the sampling intervals, because only the product with the cosine function is being computed each time, for a specific coefficient. The solution that comes from the Math of the DCT itself is, that the phase of the unit vector generally rotates 90 degrees ~from each frame to the next~. To the best of my understanding, two sampling intervals will generally overlap by 50% in time, resulting in one frame half as long. It may be that the designers only compute the odd-numbered coefficients. Then, the same coefficient belonging to the next frame should be aware of this wave notwithstanding. Further, the sampling intervals are made to overlap when the stream is decoded again, such that a continuous wave can be reconstructed. ( :1 )

The only question I remain curious about, is whether a need exists when encoding with a DCT, to blend any given coefficient as belonging to two consecutive frames, the current one plus the previous one.

While it can be done, to use rectangular sampling windows for encoding, the results from that are likely to be offensive to listen to. So in practice, Blackman Windows should ideally be used for encoding (that are twice as long as a frame).

The choice of whether decoders should use a Hanning Window or a Linear Taper, can depend on what sort of situation should best be reproduced.

Decoding with a linear taper, will cause crescendos to seem maximally smooth, and perfectly so if the crescendo is linear. But considering that linear crescendos might be rare in real music, a Hanning Window will minimize the distortion that is generated, when a burst of sound is decoded, just as a Blackman Window was supposed to do when encoding. Only, a Blackman Window cannot be used to decode, because coefficients being constant from one frame to the next would result in non-constant (output) sample amplitudes.

Dirk

(Edit 05/18/2016 : ) One related fact should be acknowledged. The DCT can be used to reconstruct a phase-correct sample-set, if non-zero even-numbered as well as odd-numbered coefficients are included. This follows directly from the fact that a ‘Type 3′ DCT is the inverse of the ‘Type 2′. But, the compression method used by several codecs is such, that a psychoacoustic model suppresses coefficients, on the assumption that they should be inaudible, because they are too close spectrally, to stronger ones. This would almost certainly go into effect, between complementary even-numbered and odd-numbered DCT coefficients.

( 05/31/2016 : ) One detail which was not made clear to me, was whether instead, coefficients which are in the same sub-band as one that has a stronger peak, are merely quantized more, due to the scale-factor of that sub-band being higher, to capture this higher peak. This would strike me as favorable, but also results in greater bit-rates, than what would follow, from setting supposedly-inaudible coefficients to zero. Due to Huffman Encoding, the bit-length of a more-quantized coefficient, is still longer than that for the value (zero).

In any type of Fourier Transform, signal energy at one frequency cannot be separated fully from energy measured at a frequency different by only half a cycle per frame. When the difference is at least by one cycle per frame, energy and therefore amplitude become isolated. This does not mean however, that the presence of a number of coefficients equal to the number of samples, is always redundant.

And so, one good way to achieve some phase-correctness might be, to try designing a codec, which does not rely too strongly on the customary psychoacoustic models. For example, a hypothetical codec might rely on Quantization, followed by Exponential-Golomb Coding of the coefficients, being sure to state the scale of quantization in the header information of each frame.

It is understood that such approaches will produce ‘poorer results’ at a given bit-rate. But then, simply choosing a higher bit-rate (than what might be appropriate for an MP3) could result in better sound.

And then, just not to make our hypothetical codec too primitive, one could subdivide the audible spectrum into 8 bands, each one octave higher than the previous, starting from coefficient (8), so that each of these bands can be quantized by a different scale, according to the Threshold Of Audibility. These Human Loudness Perception Curves may be a simple form of psychoacoustics, but are also thought to be reliable fact, as perceived loudness does not usually correspond consistently to uniform spectral distribution of energy.

Parts of the spectrum could be quantized less, for which ‘the lower threshold of hearing’ is lower with respect to a calculable loudness value, at which the Human ears are thought to be uniformly sensitive to all frequencies.

Assigning such a header-field 8 times for each frame would not be prohibitive.

1: ) ( 05/31/2016 ) An alternative approach, which the designers of MP3 could just as easily have used, would have been first to compute the DCT, including both even- and odd-numbered coefficients F(k), but then to derive only the even-numbered coefficients from that. The best way would have been, for even numbered, derived coefficient G(k) to be found as

r = F(k)

i = F(k+1) – F(k-1)

G(0) = F(0)

G(k) = sqrt( r^2 + i^2 )