## Deriving a workable entropy-encoding scheme, based on the official explanation of CABAC.

One of the subjects I recently blogged about is that, when encoding video-streams, some Codecs use 8×8-sample Discrete Cosine Transforms, but as with many DCTs, the coefficients produced tend to be values which would take up too much space to store in a fixed-length format. And so a family of techniques gets applied, loosely referred to as ‘Entropy Encoding’, with the key point being that the Entropy Encoding used for compressed video is different again from the Entropy Encoding used for compressed audio. The scheme used for video has the advantage that the encoding itself is lossless. Apparently, there are two variants actually used with H.264-encoded videos, which some people group together as MPEG-4:

1. An unspecified form of variable-length encoding,
2. CABAC,

The latter promises better compression, at the cost of greater CPU power required, both to encode and to decode. I’m going to focus on ‘CABAC’ in this posting. There is an official explanation of how CABAC works, which I will refer to. In order to understand my posting here, the reader will need to have read the documentation I just linked to.

From first impressions – yesterday evening was the first day on which I examined CABAC – I’d say that the official explanation contains an error. And I’ll explain why, by offering a version of Entropy Encoding which I know can work, based on the link above, but different from it:

• Integers that have been “Binarized” are meant to be encoded.
• The probability with which the first “Bin” has become (1) instead of (0) can be analyzed as described, resulting in one Context Model out of (0, 1, 2), as described.
• The next four Bins need not have individual probabilities computed; they only result in Context Models (3, 4, 5, 6) when they are (1) instead of (0), which override the Context Model that the first Bin would generate.
• The resulting, single Context Model can be averaged over the previous Values.
• Using as a pair of values the Context Model just computed (from the previous values) and the (present) Integer Value, a look-up can take place in a 2-dimensional table, of which sequence of bits to use to encode (both).
• Because the decoder has chosen the integer value out of a known row of the same look-up table, it can also update the Context Model being used, so that future look-ups when decoding remain unambiguous.
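As a sketch of this scheme in Python (the table contents, the value range and the context rule below are all invented for illustration; real CABAC adapts its probability models continuously, per the standard):

```python
# Toy sketch of entropy coding driven by a context model and a 2D look-up
# table. The tables and the context rule are invented for illustration.

# Each row of the table must be a prefix-free code over the integer values.
TABLE = {
    0: {0: '0', 1: '10', 2: '110', 3: '111'},   # context 0: small values probable
    1: {0: '111', 1: '110', 2: '10', 3: '0'},   # context 1: large values probable
}

def context(prev):
    """Hypothetical context rule: pick the table row from the previous value."""
    return 0 if prev < 2 else 1

def encode(values):
    bits, prev = '', 0
    for v in values:
        bits += TABLE[context(prev)][v]
        prev = v                      # the decoder can track the same state
    return bits

def decode(bits):
    values, prev, pos = [], 0, 0
    while pos < len(bits):
        row = TABLE[context(prev)]
        for v, code in row.items():   # prefix-free row: exactly one code matches
            if bits.startswith(code, pos):
                values.append(v)
                pos += len(code)
                prev = v
                break
    return values

print(decode(encode([0, 1, 3, 3, 2, 0])))  # [0, 1, 3, 3, 2, 0]
```

Because the decoder recomputes the same Context Model from values it has already decoded, no side information needs to be transmitted, which is the property the last bullet point relies on.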

The main problem I see with the official explanation is that, because up to 6 Context Models can be computed, each of which supposedly has its own probability, the look-up table in which binary values (entropy encodings) are to be found would effectively need to be a 6-dimensional table! Officially, all the Context Models found have equal meaning. Software which uses a 2D table is much more probable than software which uses a 6-dimensional table, although according to Theoretical Math, 6-dimensional tables are also possible.

But then, a property of Variable-Length Coding which has been observed for some time was that small integers, such as (0), (1) and (2), were assigned very short bit-sequences to be recognized, while larger integers, such as (16) or (17), were assigned recognizable bit-sequences which would sometimes have been impractically long, and which resulted in poor compression when the probability of the integer actually being (0), (1) or (2) decreased.

So, because we know that we can have at least one Context Model based on the actual, local probabilities, when the probabilities of very small integers become smaller, a series of entropy encodings can be selected in the table whose bit-lengths are more uniform. This results in smaller encodings overall than what straight Variable-Length Encoding would have generated, CABAC instead being adapted to probable, larger integers.

The fact will remain, that the smaller integers will require fewer bits to encode, in general, than the larger integers. But when the smallest integers become very improbable, the bit-lengths for all the integers can be evened out. This will still result in longer streams overall, as larger integers become more-probable, but in shorter streams than the streams that would result, if the encodings for the smallest integers remained the shortest they could be.

## Why The Discrete Cosine Transform Is Invertible

There are people who would answer this question entirely using Algebra, but unfortunately, my Algebra is not up to standard, specifically when applied to Fourier Transforms. Yet, I can often visualize such problems and reason them out, which can provide a kind of common-sense answer, even to this type of question.

If a DCT is fed a time-domain sine-wave, the frequency of which exactly corresponds to an odd-numbered frequency coefficient, but which is 90 degrees out of phase with that coefficient, the fact stands, that the coefficient in question remains zero for the current sampling interval.

But in that case, the even-numbered coefficients, and not only the two directly adjacent to this center frequency, will alternate between positive and negative values. When the coefficients are then laid out, a kind of decaying wave-pattern becomes humanly discernible, which happens to have its zero-crossings, directly at the odd coefficients.

Also, in this case, if we were just to add all the coefficients, we should obtain zero, which would also be what the time-domain sample at n=0 should be equal to, consistently with a sine wave and not a cosine wave.

And this is why, if an IDCT is applied to the coefficients, and if the phase information of this chosen IDCT is correct, the original sine wave can be reconstructed.
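This can be checked numerically. The sketch below uses SciPy’s DCT-II with orthonormal scaling for simplicity (a different normalization convention than the one I describe for sound, below): it feeds in a sine at the frequency of odd-numbered basis function k0, but 90 degrees out of phase, and confirms that the coefficient at k0 stays zero, while the inverse transform still reconstructs the signal.

```python
import numpy as np
from scipy.fft import dct, idct

N, k0 = 64, 5                     # 64 samples; odd-numbered coefficient index 5
n = np.arange(N)

# DCT-II basis function k0 is cos(pi*(2n+1)*k0/(2N)); take the 90-degree
# out-of-phase version of the same frequency:
x = np.sin(np.pi * (2 * n + 1) * k0 / (2 * N))

X = dct(x, type=2, norm='ortho')  # orthonormal DCT-II, so idct() inverts it

print(abs(X[k0]))                 # ~0: the matching coefficient stays zero
print(abs(X[k0 + 1]))             # nonzero: energy lands on even coefficients

x_back = idct(X, type=2, norm='ortho')
print(np.allclose(x, x_back))     # True: the transform is still invertible
```

Printing the full array `X` also shows the alternating, decaying wave-pattern across the even-numbered coefficients, with zero-crossings at the odd ones.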

Note: If the aim is to compress and then reproduce sound, we normalize the DCT, but do not normalize the IDCT. Hence, with the Inverse, if a coefficient stated a certain magnitude, then that one coefficient by itself is also expected to produce a ‘sine-wave’, with the corresponding amplitude. ( :1 )

I think that it is a kind of slip which people can make, to regard a Fourier Transform ‘as if it were a spectrum analyzer’, whose ideal behavior, in response to an analog sine-wave of one frequency, would just be to display one line, representing a single non-zero data-point, in this case corresponding to one frequency coefficient. In particular because Fourier Transforms are often computed over finite sampling intervals, they can behave differently. And the DCT seems to display this the most strongly.

While it would be tempting to say that a DFT might be better behaved, the fact is that when computers crunch complex numbers, they represent those as pairs of real numbers. So while there is a ‘real’ component that results from the cosine-multiplication, and an ‘imaginary’ component that results from the sine-multiplication, each of these components could leave a human viewer just as confused as a DCT might, because again, each of these is just an orthogonal component vector.

So even in the case of the DFT, each number is initially not yet an amplitude. We still need to square each of these components and add them. Only then, depending on whether we also take the square root, are we finally left with an amplitude, or with a signal energy.

When using a DFT, it can be easy to forget that if we feed it a time-domain single-pulse, what it will yield in the frequency-domain is actually a series of complex numbers, the absolutes of which do not change, but which rotate in the complex plane when plotted out along the frequency-domain. And then, if all we could see was either their real or their imaginary component, we would see that the DFT also produces a fringing effect initially.
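This is easy to check with NumPy (the pulse position here is an arbitrary choice for illustration):

```python
import numpy as np

N = 32
pulse = np.zeros(N)
pulse[3] = 1.0                 # a single-sample pulse, offset from n = 0

spectrum = np.fft.fft(pulse)   # one complex number per frequency coefficient

# The magnitude is the same for every coefficient...
print(np.allclose(np.abs(spectrum), 1.0))   # True

# ...but the phase rotates steadily along the frequency axis, so the real
# part alone oscillates: the 'fringing effect' mentioned above.
print(np.round(spectrum.real[:6], 3))
```

Only the combination of squared real and imaginary parts yields the flat magnitude one would naively expect from a single pulse.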

The fact that these numerical tools are not truly spectrographs, can make them unsuitable for direct use in Psychoacoustics, especially if they have not been adapted in some special way for that use.

Dirk

1: ) This latter observation also has a meaning for when we want to entropy-encode a (compressed) sound file, and when the time-domain signal was white noise. If we can assume that each frame states 512 coefficients, and that the maximum amplitude of the simulated white noise is supposed to be +/- 32768, then the amplitude of our ‘small numbers’ would really only need to reach 64, so that when they interfere constructively and destructively over an output interval, they will produce this effect.

Now, one known fact about musical sounds which are based on white noise is, that they are likely to be ‘colored’, meaning that the distribution of signal energy is usually not uniform over the entire audible spectrum. Hence, if we wanted just 1/8 of the audible spectrum to be able to produce a full signal strength, then we would need the entropy-encoded samples to reach 512. And, we might not expect the ‘small numbers’ to be able to reproduce white noise at full amplitude, since the length of the big numbers is ‘only’ 15 bits+ anyway. One entropy-encoded value might already have a length of ~3 bits. So it could also be acceptable if as many as 1/6 of the coefficients were encoded as ‘big numbers’, so that again, the maximum amplitude of the ‘small numbers’ would not need to carry the sound all by itself…
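The arithmetic behind those two amplitude figures can be spelled out, assuming, as above, 512 coefficients per frame, a peak time-domain amplitude of 32768, and fully constructive interference as the extreme case:

```python
# Extreme case: all coefficients of one frame interfere constructively, so
# the peak time-domain amplitude is (number of coefficients) x (amplitude).
COEFFS_PER_FRAME = 512
PEAK_AMPLITUDE = 32768

# If every coefficient can contribute, each only needs to reach 64:
small = PEAK_AMPLITUDE // COEFFS_PER_FRAME
print(small)                                   # 64

# If only 1/8 of the spectrum is to carry full signal strength by itself,
# those 64 coefficients must instead reach 512:
needed = PEAK_AMPLITUDE // (COEFFS_PER_FRAME // 8)
print(needed)                                  # 512
```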

And yet, some entropy-encoding tables with high amplitudes might be defined, just in case the user asks for the lowest-possible bit-rates.

## Some Specific Detail, about MP3 Compression of Sound

In this posting, I wrote at length about a weakness that exists in MP3-compressed sound, basing my text on the Discrete Cosine Transform and on what some of its implications are. I wrote about ‘a rational approach’, according to which it might make sense to use a sampling interval of 1024 samples. But I do know in fact that with MP3 compression, each sampling interval has 1152 samples, and the length of each frame is 576 samples. Somebody please correct me if I have this wrong.

But there is a certain aspect to MP3 encoding which I did not mention, that has to do with the actual representation of the coefficients, and that has implications for what can and cannot be done in the realm of the Fourier Transform used. A Fourier Transform by itself does not succeed at compressing data. It only alters the representation of the data, from the time-domain into the frequency-domain, which is useful in sound compression because altering the data in the frequency-domain does not damage its suitability for listening, the way that altering its representation in the time-domain would damage it.

I.e., We can quantize the signal after having performed the Fourier Transform on it, but not before.

One of the aspects of MP3 compression which truly reduces the bit-rates obtained substantially is called “Entropy Encoding”. This is an encoding scheme by which a finite number of symbols are each assigned a set of bits to represent them in a data stream, with bit-lengths that invert the frequency of occurrence, to result in the shortest possible bit-stream.

1. One aspect of Entropy Encoding which I do not see mentioned often enough, is the fact that the symbols need to repeat themselves in order for this scheme to achieve any compression. Hence, if the coefficients used in sound compression were to consist of floating-point numbers, the probability that any one of them would actually occur twice in the data stream would be small, and Entropy Encoding would not be a suitable means to reduce the bit-rate.
2. Further, traditionally, in order for Entropy Encoding to be decoded, a data stream needed to be accompanied by a decoding table, that maps each of the variable-bit-length codes to the intended symbol. In sound compression, even if we only needed to state what the exact 15-bit value was for each variable-bit-length encoding, doing so would nevertheless require that we state the 15-bit value once, in the header of each frame. And having to do so would result in unacceptably high bit-rates overall.

And so both of these limitations of Entropy Encoding had to be surmounted, in order for MP3 compression to exist as we have it today.
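The frequency-inverting assignment of bits I described is classically built with a Huffman tree. A minimal sketch (the symbol frequencies below are invented for illustration):

```python
import heapq

def huffman_code(freq):
    """Build a prefix-free code: frequent symbols get the shortest codes."""
    heap = [[weight, [symbol, '']] for symbol, weight in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)           # two least-frequent subtrees...
        hi = heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = '0' + pair[1]        # ...get merged under one new bit
        for pair in hi[1:]:
            pair[1] = '1' + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return {symbol: code for symbol, code in heap[0][1:]}

# Invented frequencies: small coefficients dominate, as they do in practice.
codes = huffman_code({0: 40, 1: 25, 2: 20, 15: 15})
print(codes)
print(len(codes[0]) < len(codes[15]))      # True: frequency is inverted
```

This also illustrates limitation (2) above: the decoder must somehow know `codes`, which is why MP3 fixes its tables in the standard, instead of transmitting them per frame.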

(As of 05/23/2016, I have learned the following about this scheme: )

What happens with MP3, at the encoding level, after the coefficients have passed filtering through the psychoacoustic criteria, is that the coefficients are scaled. The scale-factor is written once for each of 22 bands of frequencies, before Huffman Codes are written that state all the frequency coefficients.

Further, because Huffman Encoding by itself does not yield enough compression, pairs of coefficients are encoded instead of single coefficients. Somehow, the Statistics of this yield better compression.

What also happens with MP3, is that this fixed table (for pairs of integers) is assumed by the standard.

(What had caused me to follow a misconception until 05/23/2016 :

Apparently, a Huffman Code for 15 signals that a full-precision, ‘big value’ is written, following that Huffman Code, with a precision of 13 bits.

The crucial note to myself here is, that the Entropy Encoding table is specifically the Huffman Coding Table, and that for this reason, integers greater than 15 could also be encoded. But by that point, we would have reached diminishing returns. And more precisely, it is the Huffman Coding Table, modified to encode Pairs of integers, so that a maximum compression down to 12.5% becomes possible, instead of merely 25%. )
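The ‘big value’ escape mechanism can be sketched as follows. The unary table here is invented for illustration; the real MP3 Huffman tables, and their pairing of coefficients, are defined by the standard:

```python
# Invented prefix-free table: unary codes for 0..14, with a run of 15 ones
# as the escape symbol that announces a raw 13-bit 'big value'.
ESCAPE = 15
TABLE = {i: '1' * i + '0' for i in range(15)}
TABLE[ESCAPE] = '1' * 15

def encode(values):
    bits = ''
    for v in values:
        if v < ESCAPE:
            bits += TABLE[v]
        else:                          # escape code, then 13 raw bits
            bits += TABLE[ESCAPE] + format(v, '013b')
    return bits

def decode(bits):
    values, pos = [], 0
    while pos < len(bits):
        ones = 0
        while ones < ESCAPE and bits[pos + ones] == '1':
            ones += 1
        if ones < ESCAPE:
            values.append(ones)
            pos += ones + 1            # the ones, plus the terminating '0'
        else:                          # escape: read the raw 13-bit value
            pos += ESCAPE
            values.append(int(bits[pos:pos + 13], 2))
            pos += 13
    return values

print(decode(encode([0, 3, 15, 100, 8191])))  # [0, 3, 15, 100, 8191]
```

With 13 raw bits, values up to 8191 can follow the escape code, which is the sense in which integers greater than 15 remain encodable without growing the Huffman table itself.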

(Edit 06/06/2016 : ) It should be noted, that the practice of Huffman-Encoding pairs of values is really only advantageous if at least one of them is often equal to zero. Otherwise, it would work just as well to encode them individually.

(Edit 05/28/2016 : ) What strikes me as most plausible, is that with MP3, initially the odd-numbered DCT coefficients are computed, to avoid missing out-of-phase sine-waves. But then, even-numbered coefficients may be derived from them, so that the stream can be decoded again efficiently. The even-numbered coefficients will have as property, that they are 180 degrees out of phase, between two 50% overlapping sampling intervals / frames. This can make playback easier, in that the decoder only needs to keep track, of even-numbered and odd-numbered frames / granules.

Now, I would not say that people should never use MP3. It has its uses. But it also has drawbacks, which are usually correlated with the use that MP3 was originally designed to fill. It was designed for listening to music over limited, early Internet data-connections, and may be just as useful for compressing speech, if the aim is to reduce the bit-rate strongly, and to accept some level of information-loss.

At the bit-rates used today, it leaves the user with a sound quality superior to what the old tape cassettes offered, but inferior to what Raw CDs offered.

It was never really intended to encode movie sound tracks, especially since those often involve ‘Surround Sound’. MP3 generally does not capture surround sound. Yet, I can see myself using it to encode the audio portion of certain video-clips myself, if I know that those clips do not include surround sound. An example might be a rock concert, or some random clip I was experimenting with, but for which my original production never even included any surround information.

There exist numerous alternatives to MP3, that are also available to ordinary users today.

Dirk

(Edit 05/24/2016 : ) There are some other idiosyncrasies in real MP3 compression, which I had noted at some earlier point in time, but which I had since forgotten:

One of them is, that because it is popular right now to refer to the ‘Discrete Fourier Transform’ as a “Fast Fourier Transform”, the DFT is actually computed in order to derive the psychoacoustic parameters. In this transform, there are 32 frequency sub-bands. But then the DCT gets used, actually to compress the sound.

Another idiosyncrasy is, that MP3 will use discrete transient detection, to replace one granule that had a length of 576 with 3 granules that have a length of 192, thus implying a new sampling interval of 384. This defines 4 window types, to which any granule can belong: a ‘start’, a ‘normal’ and an ‘end’ type, as well as a ‘short’ one. Each type has its own sampling window defined.

(Edit 06/06/2016 : ) There was an interesting detail I read about, according to which the scale factor of each of the 22 encoded sub-bands is stored in the per-granule information, with the exclusion of the highest-frequency sub-band. Apparently, to have the encoder compute a scale factor for all the sub-bands would have implied, that a balanced amount of information is to be allocated to each one.

However, the highest sub-band was thought by the designers, to contain less-pleasant information than the others, which is not supposed to take up as many bits necessarily. Therefore, the decoder is expected to reuse the scale factor of the second-highest sub-band, as the one for the highest.

The highest sub-band will then store many bits, if its amplitudes were quite high during encoding.

Also, whether the Fourier Transform used to derive the psychoacoustic parameters is an ‘FFT’ or a ‘DFT’, is a decision left to the programmers of the codec, since this transform is not used actually to encode the granules. If there was a programmer who wanted to use a DFT here, with 32 sub-bands of its own, then that programmer was recognizing the fact that today, CPUs have far more power than older ones did, and he was trying to improve the quality with which the granules are encoded.

By default, an FFT is used as the first transform, simply because doing so follows the general principle of trying to reduce the total number of computations needed by the encoder. Its purpose is to determine the audibility thresholds, according to which some of the coefficients of the DCT are set to zero, on the grounds that those should be inaudible.

This was also why a ‘DCT’ was used for the actual sound information. That could also have been a DFT, but with the phase information later ignored…