# Deriving a workable entropy-encoding scheme, based on the official explanation of CABAC.

One of the subjects which I recently blogged about, is that when encoding video-streams, some Codecs use 8×8 sample Discrete Cosine Transforms, but as with many DCTs, the coefficients produced tend to be values, which would take up too much space to store, in a fixed-length format. And so a family of techniques which gets applied, is loosely referred to as ‘Entropy Encoding’, with the key idea being, that the Entropy Encoding used for compressed video, is different again, from the Entropy Encoding used for compressed audio. And the scheme used for video has as advantage, that the encoding itself is lossless. Apparently, there are two variants actually used with H.264-encoded videos, which some people group together as MPEG-4:

1. An unspecified form of variable-length encoding,
2. CABAC,

The latter of which promises better compression, at the cost of greater CPU-power required, both to encode and to decode. I’m going to focus on ‘CABAC’ in this posting. There is an official explanation for how CABAC works, which I will refer to. In order to understand my posting here, the reader will need to have read the documentation I just linked to.

From first impressions – yesterday evening was the first day on which I examined CABAC – I’d say that the official explanation contains an error. And I’ll explain why, by offering a version of Entropy-Encoding, which I know can work, based on the link above, but different from it:

• Integers are meant to be encoded, that are “Binarized”.
• The probability with which the first “Bin” has become (1) instead of (0) can be analyzed as described, resulting in a Context Model of one out of (0, 1, 2), as described.
• The next four Bins may not have individual probabilities computed, only resulting in Context Models (3, 4, 5, 6) when they are (1) instead of (0), which override the Context Model that the first Bin would generate.
• The resulting, one Context Model could be averaged over the previous Values.
• Using As a Pair of values, the Context Model (from the previous values) which was just computed, And the (present) Integer Value, a look-up can take place in a 2-dimensional table, of which sequence of bits to use, to encode (both).
• Because the decoder has chosen the integer value out of a known row in the same look-up table, it can also update the Context Model being used, so that future look-ups when decoding remain unambiguous.

The main problem I see with the official explanation is, that because up to 6 Context Models can be computed, each of which supposedly has its own probability, based on that, the lookup-table in which binary values (entropy encodings) are to be found, would effectively need to be a 6-dimensional table ! Officially, all the Context-Models found, have equal meaning. Software is much-more probable, which uses a 2D table, than software which uses a 6-dimensional table, although according to Theoretical Math, 6-dimensional tables are also possible.

But then, a property of Variable Length Coding which has been observed for some time, was that small integers, such as (0), (1) and (2), were assigned very short bit-sequences to be recognized, while larger integers, such as (16) or (17), were assigned recognizable bit-sequences, which would sometimes have been impractically long, and which resulted in poor compression, when the probability of the integer actually being (0), (1) or (2) decreased.

So, because we know that we can have at least one Context-Model, based on the actual, local probabilities, when the probabilities of very small integers become smaller, a series of entropy-encodings can be selected in the table, the bit-length of which can be made more-uniform, resulting in smaller encodings overall, than what straight Variable-Length Encoding would have generated, CABAC instead being adapted to probable, larger integers.

The fact will remain, that the smaller integers will require fewer bits to encode, in general, than the larger integers. But when the smallest integers become very improbable, the bit-lengths for all the integers can be evened out. This will still result in longer streams overall, as larger integers become more-probable, but in shorter streams than the streams that would result, if the encodings for the smallest integers remained the shortest they could be.

Now, I could adapt my reinterpretation of CABAC in such a way, that a 3-dimensional table is used:

• When encoding, 2 Context Models and the actual value are used to look up a bit-sequence,
• When decoding, a combination of 2 Context Models can be used, to define, how any one bit-sequence is to be decoded into an actual value,
• Both Context Models can be updated in the decoder, to match what the Encoder has.

But I cannot actually visualize, that a 6-dimensional table is being used.

• I can even imagine, that numbers which result from another table, ranging from [0 .. 2] have a meaning, and that this meaning has been named a ‘Context Model’, but that numbers resulting from the same table, ranging from [3 .. 6] have a different meaning, and that a second ‘Context Model’ can be computed from them. But then, this would result in (2) Context Models, and not (6).

According to my assumptions, it could happen that the initial (2) indexes, into a 3-dimensional table, could both state maximum probabilities. I.e., the probability of the first Bin being (1), and the probability of the next four Bins being (1), could both be maximal. And this would put the look-up into a corner of the table, where the encodings of actual, Integer Values could be defined as fixed-length.

It was always assumed, that only encodings up to some maximal value can be defined in a look-up table.

But then the problem would remain, that a DCT generates 64 values to encode, all of which would now have fixed-length encodings.

My assumption would also be, that none of the Context Models would lead to encodings, which perform worse, than ‘standard’, variable-length encoding did.

Further, I should observe that the use of an 8×8 sample DCT is also not guaranteed. In cases where the H.264 video Codec used Variable-Length Encoding in the past, instead of CABAC, apparently, they also used a 4×4 sample DCT.

BTW:

I suppose that a logical question which the reader may have, would be, ‘What does any of this have to do with Lossless?’

The short answer would be, that to skip quantization of the DCT coefficients, also requires that larger values than usual, be encoded by the CABAC compression. I think that usually, Entropy Encoding was already lossless, but also resulted in very unusable results, for example if the maximum value to encode could have been the integer (255). CABAC can overcome that limitation. However, if 10-bit or 12-bit depth is required, then I’d say that the lookup-table I described above, cannot contain the required encodings, and Variable-Length Encoding would need to be used again, specifically because quantization would not be taking place. And then, for 10-bit or 12-bit depths, lossless may become unachievable again.

By same token, if the user’s version of ‘ffmpeg’ supports 12-bits per channel – which mine does – and if the user specifies a quantization-factor of at least (16), then again, it should be possible for the program to apply CABAC, because after quantization, 8-bit values will remain – in principle, not always.

The somewhat longer answer would be:

Even if quantization is disabled, which, using ‘ffmpeg’, can be accomplished by giving the parameter ‘-crf 0‘, there are other processes taking place in the Codec, which could cause an eventual re-decoded result, not to match the original pre-encoded content exactly. One such process could be, the fact that Discrete Cosine Transforms are being computed which, though invertible in exact numbers, could suffer from roundoff-errors the way they are actually computed.

There was another person, who claims to have achieved true, lossless results, when given original content in the form of a raw, YUV-file. The fact that he succeeded with what he described, means that ‘ffmpeg’ was able to compute its DCT (twice), without producing any roundoff-errors. It also means, that after Motion Estimation was applied, the Residual was computed without any roundoff-errors.

I can see that the formation of the YUV-pixel-format would be lossy,

1. Because YUV 4:2:0 is a more common preset than YUV 4:4:4, and
2. Because ‘YUV’ is not the same as ‘YCbCr’. The way YUV is computed is correct and invertible using floating-point numbers, but actual YUV tuples use an integer representation, with the same bit-depth as the original RGB values had. And, ‘real YUV’ implies maximum (signed) values for U and V, that are less-than-half-amplitude, thereby Not utilizing the full range of values possible in a ‘YUV’ tuple. This again means, that what worked with floating-point numbers, will not work with integers.

So instead of actually shooting for ‘lossless’, which suggests, that the hash-codes will be correct after a round-trip, I think that the reader might want to shoot for ‘near-lossless’, which just suggests a conversion, that does not quantize.

The other person who I linked to above, supplied a YUV-file, but may even have put ‘YCbCr’ values into it, and then decoded back to the same set of tuples which he started with.

Dirk