## Deriving a workable entropy-encoding scheme, based on the official explanation of CABAC.

One of the subjects which I recently blogged about, is that when encoding video-streams, some Codecs use 8×8 sample Discrete Cosine Transforms, but as with many DCTs, the coefficients produced tend to be values, which would take up too much space to store, in a fixed-length format. And so a family of techniques which gets applied, is loosely referred to as ‘Entropy Encoding’, with the key idea being, that the Entropy Encoding used for compressed video, is different again, from the Entropy Encoding used for compressed audio. And the scheme used for video has as advantage, that the encoding itself is lossless. Apparently, there are two variants actually used with H.264-encoded videos, which some people group together as MPEG-4:

1. An unspecified form of variable-length encoding,
2. CABAC,

The latter of which promises better compression, at the cost of greater CPU-power required, both to encode and to decode. I’m going to focus on ‘CABAC’ in this posting. There is an official explanation for how CABAC works, which I will refer to. In order to understand my posting here, the reader will need to have read the documentation I just linked to.

From first impressions – yesterday evening was the first day on which I examined CABAC – I’d say that the official explanation contains an error. And I’ll explain why, by offering a version of Entropy-Encoding, which I know can work, based on the link above, but different from it:

• Integers are meant to be encoded, that are “Binarized”.
• The probability with which the first “Bin” has become (1) instead of (0) can be analyzed as described, resulting in a Context Model of one out of (0, 1, 2), as described.
• The next four Bins may not have individual probabilities computed, only resulting in Context Models (3, 4, 5, 6) when they are (1) instead of (0), which override the Context Model that the first Bin would generate.
• The resulting, one Context Model could be averaged over the previous Values.
• Using As a Pair of values, the Context Model (from the previous values) which was just computed, And the (present) Integer Value, a look-up can take place in a 2-dimensional table, of which sequence of bits to use, to encode (both).
• Because the decoder has chosen the integer value out of a known row in the same look-up table, it can also update the Context Model being used, so that future look-ups when decoding remain unambiguous.

The main problem I see with the official explanation is, that because up to 6 Context Models can be computed, each of which supposedly has its own probability, based on that, the lookup-table in which binary values (entropy encodings) are to be found, would effectively need to be a 6-dimensional table ! Officially, all the Context-Models found, have equal meaning. Software is much-more probable, which uses a 2D table, than software which uses a 6-dimensional table, although according to Theoretical Math, 6-dimensional tables are also possible.

But then, a property of Variable Length Coding which has been observed for some time, was that small integers, such as (0), (1) and (2), were assigned very short bit-sequences to be recognized, while larger integers, such as (16) or (17), were assigned recognizable bit-sequences, which would sometimes have been impractically long, and which resulted in poor compression, when the probability of the integer actually being (0), (1) or (2) decreased.

So, because we know that we can have at least one Context-Model, based on the actual, local probabilities, when the probabilities of very small integers become smaller, a series of entropy-encodings can be selected in the table, the bit-length of which can be made more-uniform, resulting in smaller encodings overall, than what straight Variable-Length Encoding would have generated, CABAC instead being adapted to probable, larger integers.

The fact will remain, that the smaller integers will require fewer bits to encode, in general, than the larger integers. But when the smallest integers become very improbable, the bit-lengths for all the integers can be evened out. This will still result in longer streams overall, as larger integers become more-probable, but in shorter streams than the streams that would result, if the encodings for the smallest integers remained the shortest they could be.

## Exploring Joint Stereo Encoding, with Non-Negative Integers

A concept can exist, by which a stereo signal consists of a left channel L and a right channel R, and by which it gets translated in the time-domain, into sample streams M and S, such that M = L+R and S = L-R. In this case, L and R can be reconstructed as


L = (M+S) / 2
R = (M-S) / 2



This seems trivial. but a more specific context for this set of equations could be, the variables could be frequency coefficients, and

L >= 0
R >= 0
M >= 0
L, R, M, S are all Integers.


Because the equations for L and R are truly the inverse, of the definition of M and S, it would follow that in order for them to be true, (M+S) and (M-S) must also be even integers.

If we were encoding the integers M and S in a variable-length scheme, then the bit-length of S has already been compromised by 1 bit, because somewhere we need to state its sign. Yet, we might want to be certain, that the encoding of (M,S) is not longer than that of (L,R).

And so an implication of this which we might want to take advantage of, is knowing that


If M is Even, S Must Also Be Even.
If M is Odd,  S Must Also Be Odd.



And so one idea that might be helpful, would be to define a derived value S’ , such that


S' = S / 2, Rounded Down,



meaning, rounded to the More Negative, If S was Odd.

We could then store (M,S’). The length of S’ is the length of S reduced by at least one bit. Then, when the time comes to decode the stream, we could compute


IF M Is Even, S = (S' * 2)

and,

IF M Is Odd,  S = (S' * 2) + 1



Thereby not wasting any bits. And, depending on what type of variable-length encoding was being used, shortening the length of the integer S’ by 1 bit, may in fact shorten its encoding by more than 1 bit.

Dirk

## One reason Why, It is Difficult For Me to Guess, at the Variable-Length Encoding of numbers, Chosen By Other People

The problem can come up often in Computing, that instead of having a fixed-length encoding for numbers, we may want to encode a majority of integers which lie in a small range, but out of which a progressively smaller number have larger values. This can lead to variable-length encoding schemes, and MP3 sound compression, logically, is one place where this happens.

If one is trying to guess at what encoding was used, the fact that can stymie a person is, that many methods exist to accomplish exactly that. Huffman Encoding has as a problem, that although the higher-value integers are assigned longer bit-sequences, the relative frequency with which these higher integers will occur, is not the inverse, of their bit-length. This can also be why, a non-default arrangement can be made, that if the size of the integer reaches 15, a full-length value needs to follow.

I have now finally learned, that with MP3 compression, at least the integers smaller than 15, Huffman-Encoded, are intended to be the default case, and values at or above 15 are intended to be the exception. Thus, if the scaling factor is increased, for sure the bit-length of the stream will decrease, until the bit-rate is achieved, that the user set. I got that.

But Why, then, You May Ask, does Dirk choose a different way to accomplish the same thing, and so often?

My answer would be, that formal solutions are often good at compressing the size of integers when those lie in a certain range, but if one value appears in the stream which is much larger than the average, those variable-length encoding schemes can become monsters. ( :1 )

As an example, I read that ‘FLAC’ will record the Linear Predictive Encoding coefficients for a frame accurately, and that this scheme will then Rice Encode the residual each time.

Pure Rice Encoding means, that a remainder of fixed bit-length is encoded with each sample, but that it is given a (variable-length) prefix encoded in straight unary, which states what the multiple of the tunable parameter is (the quotient), that fits into the fixed-length remainder. This choice of a pure unary prefix is questionable, for the reason that I just stated above.

Now, I know that there is also Exponential-Golomb encoding, which like Huffman Coding has a bit-length that grows with the size of the integer to be encoded. But Exponential-Golomb generally produces a bit-length twice as long, as what it would take just to write out the integer on paper.

And so at least a slightly more sophisticated form of encoding exists, which is called Golomb-Rice Encoding, which is essentially Rice Encoding, but in which the prefix, which states the quotient, is prepended in Exponential-Golomb format. Why would they not use it?

And, since it is possible just to put a prefix before an integer in unary, that states its length, an approach which I would be tempted to use, would be just to assume that this unary prefix should be multiplied by a factor such as 3, to arrive at the true length of the integer.

But then a problem with that would be, the fact that this type of prefix would need to be at least 2 bits long, for non-zero values, followed by this multiple of bits belonging to the value, as a minimum. So it will not compress very small values well.

And the reason for this would be the fact, that it would no longer be certain then, that the first bit which actually belongs to the integer, will always be a (1), the way it is with Exponential-Golomb.

And, while I tend to view such encoding schemes as arbitrary, the fashion these days is, always to select a formally-defined one.

In general, my approaches will work well, if a substantial number of values are high.

Dirk

1: ) And what ‘FLAC’ will do in such a case, is just switch the type of the frame this happens in, to a type ‘VERBATIM’ frame. In other words, FLAC would just decide ‘This is one frame we cannot compress’.

FLAC also has a mode, in which each sample is stated as a delta, from the previous one. This corresponds to an ‘LPE’ with one predictor, the coefficient of which is just equal to (+1), relative to which the current value is just the residual…

(Edit 05/25/2016 : ) This is another posting of mine, in which I explain an additional detail about MP3 compression.

Further, ‘FLAC’ is able to encode some of its frames as using LPE, with a variable number of coefficients. I.e., When set to compress more, it will spend its CPU time trying encodings with (!) 6 or more predictors, and will store those in cases where doing so led to more-compact encoding.

While a set of 4 or more coefficients needs to be computed specifically for one frame, via a Statistical Regression Analysis, I have read that for 1, 2, or 3, FLAC just uses a standard set of them. For 1, that will be [ +1 ] . For 2, that will be [ -1, +2 ] . It might seem like a purely academic exercise, to know a standard set of coefficients, which will generally work well if there are only 3 of them. But in fact, having this available offers a non-trivial advantage, over having to store those in the compressed stream.

Since we would presumably be multiplying signed, 16-bit samples with signed, 16-bit coefficients, it will be helpful if the latter only need to fall into the fractional range of ( -1.0 … +1.0 ) . The reason for this is the fact that If we needed to store coefficients which are allowed to exceed ( + 1.0 ) , Then we are blowing another bit of precision just so that one coefficient could do so.

The most recent coefficient will still exceed ( +1.0 ) when there are 3. But as soon as there are 4, none of them would exceed ( +1.0 ) anymore. Therefore, all the coefficients which must be stored, when their number reaches 4 or more, can be made more precise, just because there is a standard set for when we have 3.