A Weakness in Exponential-Golomb Encoding

I have written before about variable-length encoding, and one observation I made then was that Huffman encoding of value-pairs is only really strong as long as we do not need to encode consecutive non-zero values, or consecutive values significantly greater than zero. And so I had suggested Exponential-Golomb encoding as an alternative.

But I also see a problem with that scheme, in the way it was prescribed ages ago.

The way Exponential-Golomb works is that the value zero is encoded as a single bit equal to (1). Non-zero values are written out as they would be by hand in binary, so that their first bit is always equal to (1), and their length in bits is then signaled to the decoder with a prepended series of zeroes, one zero for each bit of the actual integer. This prefix of zeroes is one possible form of unary encoding.
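
As a minimal sketch of the scheme just described (the function names are my own):

    def eg_encode(n):
        # Zero is a lone '1'; a non-zero value is its plain binary form,
        # preceded by one '0' for each of its bits.
        if n == 0:
            return "1"
        body = bin(n)[2:]              # first bit is always '1'
        return "0" * len(body) + body

    def eg_decode(bits, pos=0):
        # Count the unary prefix of zeroes, then read that many bits.
        zeros = 0
        while bits[pos] == "0":
            zeros += 1
            pos += 1
        if zeros == 0:                 # the lone '1' encodes zero
            return 0, pos + 1
        return int(bits[pos:pos + zeros], 2), pos + zeros

    print([eg_encode(n) for n in range(4)])   # ['1', '01', '0010', '0011']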

The problem I see is with how we are supposed to encode integers that may or may not be negative. In the classical approach, we are to interleave the negative integers with the positive ones, and then encode the result as though we were encoding non-negative values. Doing so increases the binary length of non-zero integers by one bit on average.

What nobody seemed to point out was that doing so will also lengthen the unary prefix by one zero on average, to signal this longer unsigned integer. So on average, this approach actually lengthens the variable-length bit-sequence by two bits.

The main exception will be with (+1), which continues to be represented by 2 bits. Yet, if the values are signed, then I would guess that a (-1) will have an equal probability of occurring as a (+1), and (-1) will be encoded as 4 bits, thus forming an average length of 3 bits, and merely breaking even with what I am about to suggest.

And so, as far as I am concerned, it makes greater sense just to Exponential-Golomb-encode the absolute value, and if this absolute value is not zero, to append a sign bit. Doing so will only lengthen the variable-length sequence by one bit. (+1) and (-1) would both be encoded as 3 bits, and (+2), (+3), (-2) and (-3) would all be encoded as 5 bits.
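
To make the comparison concrete, here is a sketch of both signed variants; I am assuming the usual zigzag mapping for the interleaved case, since no exact interleaving was spelled out:

    def eg_len(n):
        # Bit-length of the code described above, for a non-negative integer.
        return 1 if n == 0 else 2 * n.bit_length()

    def interleaved_len(x):
        # Classical approach: interleave 0, +1, -1, +2, -2, ... as 0, 1, 2, 3, 4, ...
        u = 2 * x - 1 if x > 0 else -2 * x
        return eg_len(u)

    def sign_bit_len(x):
        # Suggested approach: encode |x|, then one sign bit if non-zero.
        return eg_len(abs(x)) + (0 if x == 0 else 1)

    for x in (1, -1, 2, -2, 3, -3):
        print(x, interleaved_len(x), sign_bit_len(x))
    # (+1): 2 vs 3   (-1): 4 vs 3   -> both approaches average 3 bits
    # (+2): 4 vs 5   (-2): 6 vs 5   -> both approaches average 5 bits
    # (+3): 6 vs 5   (-3): 6 vs 5   -> the sign bit wins by one bit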

Dirk

 

A Possible Path to High-Resolution, Compressed Sound

There is a fundamental tradeoff which takes place when we use some sort of Fourier Transform to help compress a sound stream, in a way that is already known to be lossy. Higher spectral resolution requires longer sampling intervals, which imply poorer temporal resolution; higher temporal resolution requires shorter sampling intervals, which imply poorer spectral resolution.

I believe that one way in which the ear can outperform this limitation is in having its cilia work in a massively parallel way. I think that our ears also have poor temporal resolution at the lower frequencies, but that our human temporal resolution improves at higher frequencies.

And so one way in which I think sound could be compressed, would be not to stick to one length of sampling interval.

For example, it might be possible to have a longest sampling window, 2048 samples long. Even-numbered coefficients, ranging from 0 to 23 cycles / window, could be computed for it using a Modified Discrete Cosine Transform. After that, the same interval of time could be subdivided into shorter sampling windows, each only 1024 samples long, and coefficients could be computed from them which go from 12 to 23 cycles / window, thus completing 3 granules.

Four more ‘octaves’ should be possible, with sampling window lengths of 512, 256, 128 and 64 samples. Most of them would again derive coefficients from 12 to 23 cycles / window, with the exception of the 64-sample windows, which would derive coefficients from 12 to 31 cycles / window.

I would maintain the assumption that from each length of sampling window, a granule would result which is half as long, and for which coefficients would be stored.

This should result in 6 ‘octaves’ in total, each of which would have its own scale factor, stored once per frame interval (1024 samples, which corresponds to the slowest granule). To simplify computing this scale factor, a global quality level could simply decide how many integer steps all the coefficients should be quantized to. For each ‘octave’, the peak amplitude within all its granules would be taken, and divided by this quality level, to arrive at the scale factor.
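
A minimal sketch of that quantization step, with names of my own (the quality level of 127 is only an assumed example):

    QUALITY = 127   # global quality level: quantized integers fall in -127 ... +127

    def octave_scale_factor(granules):
        # Peak amplitude across all granules of one 'octave',
        # divided by the global quality level.
        peak = max(abs(c) for g in granules for c in g)
        return peak / QUALITY if peak else 1.0

    def quantize(granules, scale):
        return [[round(c / scale) for c in g] for g in granules]

    octave = [[0.8, -3.2, 1.1], [2.5, -0.4, 0.9]]   # two toy granules
    print(quantize(octave, octave_scale_factor(octave)))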

Each frame would store 63 granules: the 32 with the highest frequencies would have 20 coefficients each, the next 30 would have 12 coefficients each, and the single longest granule would have 24 coefficients. This would result in 1024 coefficients / frame, in a fixed order.
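
The arithmetic can be checked quickly; the window lengths and coefficient counts below are the ones given above:

    windows = [2048, 1024, 512, 256, 128, 64]    # one 'octave' each
    coeffs  = [  24,   12,  12,  12,  12,  20]   # coefficients stored per granule

    # Each granule is half a window long, and a frame is 1024 samples,
    # so an octave contributes (2048 / window) granules per frame.
    granules = [2048 // w for w in windows]
    print(sum(granules))                                   # 63 granules / frame
    print(sum(g * c for g, c in zip(granules, coeffs)))    # 1024 coefficients / frame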

To reduce waste, the scale factor of the highest, 6th octave could simply be reused from the previous, 5th octave, as long as doing so yields quantized integers no larger than the global quality level.

The resulting quantized amplitudes could again be encoded in a variable-length scheme, such as Exponential-Golomb, optionally plus a sign bit.

One adverse side-effect of this would be the complex and tedious computation of the scale factors. I do not assume that I would be using any Fast Fourier Transform to determine audibility thresholds, and to set many of the DCT coefficients to zero, the way it is done with MP3. In that case, it would make most sense to determine the scale factors from DCT values very closely analogous to how they are finally encoded.

The problems start with the fact that each sampling interval is assumed to have a windowing function applied when encoding. This turns into a major CPU load, once a scale factor needs to be computed over 32×20 coefficients per frame.

So one simplification I could offer would be to begin by computing and temporarily storing all the DCT coefficients as 15-bit values, with the mere notion that they will later be quantized, but such that a maximum value for them is kept up-to-date, once per ‘octave’ as defined above. After that, the scale factor can be computed from this maximum.
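
As a sketch of that simplification, again with names of my own (‘mdct’ stands in for whatever transform is actually used):

    def analyze_octave(sample_windows, mdct, quality=127):
        # First pass: compute and store every coefficient as an integer
        # (assumed to fit in 15 bits), keeping a running maximum.
        stored, peak = [], 0
        for window in sample_windows:
            coeffs = [int(round(c)) for c in mdct(window)]
            stored.append(coeffs)
            peak = max(peak, max(abs(c) for c in coeffs))
        # Second pass: only now derive the scale factor, and quantize.
        scale = max(peak / quality, 1.0)
        return [[round(c / scale) for c in cs] for cs in stored], scale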

Dirk

(Edit 06/07/2016 : ) This hypothetical scheme has a major drawback as it stands. Even though it will inherently detect and bracket transients, it would also have poor recovery from them. Before and after a transient, the above method will remain insensitive to sounds in the same octave for up to 1024 samples which, in the case of a 44.1 kHz format, translates into roughly 23 milliseconds. In my opinion, the human ear can detect this as a ‘sound shadow’.

MP3 recovers from transients within 576 samples.

One way to correct this could be to arrange for not one but two scale factors to be encoded for each octave, except for the lowest octave. The first scale factor would apply to the first half of the frame, while the second would apply to the second half.

In principle this idea could be extended all the way, until there is a separate scale factor for each granule, with the ostensible exception of the shortest, highest-frequency granules / octave… But then doing so would also imply the intent of allocating a uniform number of bits / a uniform amount of information to each granule, knowing that their number doubles temporally with each octave. This would not be what I would want compressed sound to do.

 

One Reason Why It Is Difficult For Me to Guess at the Variable-Length Encoding of Numbers Chosen By Other People

The problem can come up often in Computing that, instead of having a fixed-length encoding for numbers, we may want to encode a majority of integers which lie in a small range, but out of which a progressively smaller number have larger values. This can lead to variable-length encoding schemes, and MP3 sound compression, logically, is one place where this happens.

If one is trying to guess at what encoding was used, the fact that can stymie a person is that many methods exist to accomplish exactly that. Huffman encoding has the problem that, although the higher-value integers are assigned longer bit-sequences, the relative frequency with which these higher integers occur is not the inverse of their bit-length. This can also be why a non-default arrangement can be made, such that if the size of the integer reaches 15, a full-length value needs to follow.
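
That escape arrangement can be sketched generically. To be clear, the table below is an invented stand-in, not an actual MP3 Huffman table, and the 13-bit field width is only an assumed example:

    ESCAPE = 15
    LINBITS = 13     # assumed width of the full-length value that follows

    # Stand-in 'Huffman' table: any prefix-free code for 0 ... 15 would do.
    table = {i: "1" * i + "0" for i in range(ESCAPE)}
    table[ESCAPE] = "1" * ESCAPE

    def encode_with_escape(v):
        # Encode min(v, 15) through the table; if the value was clipped,
        # append (v - 15) as a fixed-length field.
        bits = table[min(v, ESCAPE)]
        if v >= ESCAPE:
            bits += format(v - ESCAPE, f"0{LINBITS}b")
        return bits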

I have now finally learned that with MP3 compression, the Huffman-encoded integers smaller than 15 are intended to be the default case, and values at or above 15 are intended to be the exception. Thus, if the scaling factor is increased, the bit-length of the stream will reliably decrease, until the bit-rate the user set is achieved. I got that.

But why, then, you may ask, does Dirk so often choose a different way to accomplish the same thing?

My answer would be that formal solutions are often good at compressing the size of integers when those lie in a certain range, but if one value appears in the stream which is much larger than the average, those variable-length encoding schemes can become monsters. ( :1 )

As an example, I read that ‘FLAC’ will record the Linear Predictive Coding (LPC) coefficients for a frame accurately, and that this scheme will then Rice-encode the residual each time.

Pure Rice encoding means that a remainder of fixed bit-length is encoded with each sample, but that it is given a (variable-length) prefix encoded in straight unary, which states how many times the tunable parameter fits into the value (the quotient), beyond what the fixed-length remainder holds. This choice of a pure unary prefix is questionable, for the reason I just stated above.
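
A minimal sketch of pure Rice encoding, which also shows how one outlier becomes a monster (which digit plays the unary role is only a convention):

    def rice_encode(n, k):
        # Quotient in unary (here: a run of '1's closed by a '0'),
        # then a k-bit remainder.
        q, r = n >> k, n & ((1 << k) - 1)
        return "1" * q + "0" + format(r, f"0{k}b")

    print(rice_encode(9, 4))      # '01001' -- 5 bits
    print(rice_encode(1000, 4))   # 62 ones, a zero, 4 remainder bits -- 67 bits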

Now, I know that there is also Exponential-Golomb encoding, which like Huffman coding has a bit-length that grows with the size of the integer to be encoded. But Exponential-Golomb generally produces a bit-length twice as long as what it would take just to write out the integer on paper.

And so at least a slightly more sophisticated form of encoding exists, which is called Golomb-Rice encoding: essentially Rice encoding, but in which the prefix which states the quotient is itself written in Exponential-Golomb format rather than in plain unary. Why would they not use it?
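
Whatever the exact name, the hybrid being described might look like the following sketch, reusing the unsigned code from the first posting above:

    def eg_encode(n):   # the unsigned code from the first posting's sketch
        return "1" if n == 0 else "0" * n.bit_length() + bin(n)[2:]

    def hybrid_encode(n, k):
        # Like Rice, but the quotient's prefix is no longer plain unary.
        q, r = n >> k, n & ((1 << k) - 1)
        return eg_encode(q) + format(r, f"0{k}b")

    print(hybrid_encode(9, 4))      # '1' + '1001' -- still 5 bits
    print(hybrid_encode(1000, 4))   # 12-bit quotient code + 4 bits -- 16 bits, not 67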

And, since it is possible just to put a prefix before an integer in unary that states its length, an approach which I would be tempted to use would be just to assume that this unary prefix should be multiplied by a factor such as 3, to arrive at the true length of the integer.

But then a problem with that would be the fact that this type of prefix would need to be at least 2 bits long for non-zero values, followed, as a minimum, by this multiple of bits belonging to the value. So it will not compress very small values well.

And the reason for this would be the fact that it would no longer be certain that the first bit which actually belongs to the integer will always be a (1), the way it is with Exponential-Golomb.
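
Since this scheme is my own, hypothetical one, here is a sketch of how I would define it, which also illustrates both of the last two points:

    def times3_encode(n):
        # A unary prefix of q '0's closed by a '1'; the value then
        # occupies 3*q bits.  Zero is the bare '1'.
        if n == 0:
            return "1"
        q = (n.bit_length() + 2) // 3      # round the length up to a multiple of 3
        return "0" * q + "1" + format(n, f"0{3 * q}b")

    print(times3_encode(1))     # '01' + '001' -- 5 bits, and the first value bit is 0
    print(times3_encode(200))   # '0001' + '011001000' -- 13 bits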

And, while I tend to view such encoding schemes as arbitrary, the fashion these days is always to select a formally-defined one.

In general, my approaches will work well if a substantial number of values are high.

Dirk

1: ) And what ‘FLAC’ will do in such a case is just switch the type of the frame this happens in, to a ‘VERBATIM’ frame. In other words, FLAC would just decide, ‘This is one frame we cannot compress.’

FLAC also has a mode in which each sample is stated as a delta from the previous one. This corresponds to LPC with one predictor, the coefficient of which is just equal to (+1), relative to which the current value is just the residual…

(Edit 05/25/2016 : ) This is another posting of mine, in which I explain an additional detail about MP3 compression.

Further, ‘FLAC’ is able to encode some of its frames using LPC with a variable number of coefficients. I.e., when set to compress more, it will spend its CPU time trying encodings with (!) 6 or more predictors, and will store those in cases where doing so led to a more compact encoding.

While a set of 4 or more coefficients needs to be computed specifically for one frame, via a statistical regression analysis, I have read that for 1, 2, or 3 coefficients, FLAC just uses a standard set of them. For 1, that will be [ +1 ] . For 2, that will be [ -1, +2 ] . It might seem like a purely academic exercise to know a standard set of coefficients which will generally work well when there are only 3 of them. But in fact, having this available offers a non-trivial advantage over having to store those in the compressed stream.
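
A sketch of how such standard predictors would be applied, with the coefficient ordering the text uses (oldest sample first); the set shown for 3 is my own inference from the same pattern of differences, not something I have read:

    FIXED = {1: [+1], 2: [-1, +2], 3: [+1, -3, +3]}

    def residuals(samples, order):
        # Residual = sample minus the prediction from 'order' previous samples.
        coeffs = FIXED[order]
        return [samples[i] - sum(c * samples[i - order + j]
                                 for j, c in enumerate(coeffs))
                for i in range(order, len(samples))]

    # A slow ramp is predicted almost perfectly with 2 coefficients:
    print(residuals([10, 20, 30, 41, 52], 2))   # [0, 1, 0]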

Since we would presumably be multiplying signed, 16-bit samples with signed, 16-bit coefficients, it will be helpful if the latter only need to fall into the fractional range of ( -1.0 … +1.0 ) . The reason for this is the fact that, if we needed to store coefficients which are allowed to exceed ( +1.0 ) , then we would be blowing another bit of precision just so that one coefficient could do so.

The most recent coefficient will still exceed ( +1.0 ) when there are 3. But as soon as there are 4, none of them would exceed ( +1.0 ) anymore. Therefore, all the coefficients which must actually be stored, once their number reaches 4 or more, can be made more precise, just because there is a standard set for when we have 3.
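
As an illustration of the fractional storage being argued for (the ‘Q15’ convention here is an assumption of mine, not something stated about FLAC):

    # 'Q15' coefficients are scaled by 2**15, so a stored signed 16-bit
    # integer can only represent -1.0 up to just under +1.0.
    def to_q15(c):
        return round(c * (1 << 15))

    def predict(history, q15_coeffs):
        # Multiply 16-bit samples by 16-bit fractional coefficients,
        # then shift the wider products back down.
        acc = sum(c * s for c, s in zip(q15_coeffs, history))
        return acc >> 15

    print(predict([10000, 20000], [to_q15(-1.0), to_q15(0.999)]))   # 9979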