## Some Thoughts on Surround Sound

The way I seem to understand modern 5.1 Surround Sound, there exists a complete stereo signal, which for the sake of legacy compatibility, is still played directly to the front-left and the front-right speaker. But what also happens, is that a third signal is picked up, which acts as the surround channel, in a way that neither favors the left nor the right asymmetrically.

I.e., if people were to try to record this surround channel as being a sideways-facing microphone component, by its nature its positive signal would either favor the left or the right channel, and this would not count as a correct surround-sound mike. In fact, such an arrangement can best be used to synthesize stereo, out of geometries which do not really favor two separate mikes, one for left and one for right.

But, a single, downward-facing, HQ mike would do as a provider of surround information.

If the task becomes, to carry out a stereo mix-down of a surround signal, this third channel is first phase-shifted 90 degrees, and then added differentially between the left and right channels, so that it will interfere least with stereo sound.

In the case where such a mixed-down, analog stereo signal needs to be decoded into multi-speaker surround again, the main component of “Pro Logic” does a balanced summation of the left and right channels, producing the center channel, but at the same time a subtraction is carried out, which is sent rearward.

The advantage which Pro Logic II has over I, is that this summation first adjusts the relative gain of both input channels, so that the front-center channel has zero correlation with the rearward surround information, which has presumably been recovered from the adjusted stereo as well.

Now, an astute reader will recognize, that if the surround-sound thus recovered, was ‘positive facing left’, its addition to the front-left signal will produce the rear-left signal favorably. But then the thought could come up, ‘How does this also derive a rear-right channel?’ The reason for which this question can arise, is the fact that a subtraction has taken place within the Pro Logic decoder, which is either positive when the left channel is more so, or positive when the right channel is more so.

(Edit 02/15/2017 : The less trivial answer to this question is, A convention might exist, by which the left stereo channel was always encoded as delayed 90 degrees, while the right could always be advanced, so that a subsequent 90 degree phase-shift when decoding the surround signal can bring it back to its original polarity, so that it can be mixed with the rear left and right speaker outputs again. The same could be achieved, if the standard stated, that the right stereo channel was always encoded as phase-delayed.

However, the obvious conclusion of that would be, that if the mixed-down signal was simply listened to as legacy stereo, it would seem strangely asymmetrical, which we can observe does not happen.

I believe that when decoding Pro Logic, the recovered Surround component is inverted when it is applied to one of the two Rear speakers. )

But what the reader may already have noticed, is that if he or she simply encodes his mixed-down stereo into an MP3 File, later attempts to use a Pro Logic decoder are for not, and that some better means must exist to encode surround-sound onto DVDs or otherwise, into compressed streams.

Well, because I have exhausted my search for any way to preserve the phase-accuracy, at least within highly-compressed streams, the only way in which this happens, which makes any sense to me, is if in addition to the ‘joint stereo’, which provides two channels, a 3rd channel was multiplexed into the compressed stream, which as before, has its own set of constraints, for compression and expansion. These constraints can again minimize the added bit-rate needed, let us say because the highest frequencies are not thought to contribute much to human directional hearing…

(Edit 02/15/2017 :

Now, if a computer decodes such a signal, and recognizes that its sound card is only in  stereo, the actual player-application may do a stereo mix-down as described above, in hopes that the user has a pro Logic II -capable speaker amp. But otherwise, if the software recognizes that it has 4.1 or 5.1 channels as output, it can do the reconstruction of the additional speaker-channels in software, better than Pro Logic I did it.

I think that the default behavior of the AC3 codec when decoding, if the output is only specified to consist of 2 channels, is to output legacy stereo only.

The approach that some software might take, is simply to put two stages in sequence: First, AC3 decoding with 6 output channels, Secondly, mixing down the resulting stereo in a standard way, such as with a fixed matrix. This might not be as good for movie-sound, but would be best for music.


1.0   0.0
0.0   1.0
0.5   0.5
0.5   0.5
+0.5  -0.5
-0.5  +0.5



If we expected our software to do the steering, then we might also expect, that software do the 90° phase-shift, in the time-domain, rather than in the frequency-domain. And this option is really not feasible in a real-time context.

The AC3 codec itself would need to be capable of 6-channel output. There is really no blind guarantee, that a 6-channel signal is communicated from the codec to the sound system, through an unknown player application... )

(Edit 02/15/2017 : One note which should be made on this subject, is that the type of matrix which I suggested above might work for Pro Logic decoding of the stereo, but that if it does, it will not be heard correctly on headphones.

The separate subject exists, of ‘Headphone Spacialization’, and I think this has become relevant in modern times.

A matrix approach to Headphone Spacialization would assume that the 4 elements of the output vector, are different from the ones above. For example, each of the crossed-over components might be subject to some fixed time-delay, which is based on the Inter-Aural Delay, after it is output from the matrix, instead of awaiting a phase-shift… )

(Edit 03/06/2017 : After much thought, I have come to the conclusion that there must exist two forms of the Surround channel, which are mutually-exclusive.

There can exist a differential form of the channel, which can be phase-shifted 90⁰ and added differentially to the stereo.

And there can exist a common-mode, non-differential form of it, which either correlates more with the Left stereo or with the Right stereo.

For analog Surround – aka Pro Logic – the differential form of the Surround channel would be used, as it would for compressed files.

But when an all-in-one surround-mike is implemented on a camcorder, this originally provides a common-mode Surround-channel. And then it would be up to the audio system of the camcorder, to provide steering, according to which this channel either correlates more with the front-left or the front-right. As a result of that, a differential surround channel can be derived. )

(Updated 11/20/2017 : )

## One reason Why, It is Difficult For Me to Guess, at the Variable-Length Encoding of numbers, Chosen By Other People

The problem can come up often in Computing, that instead of having a fixed-length encoding for numbers, we may want to encode a majority of integers which lie in a small range, but out of which a progressively smaller number have larger values. This can lead to variable-length encoding schemes, and MP3 sound compression, logically, is one place where this happens.

If one is trying to guess at what encoding was used, the fact that can stymie a person is, that many methods exist to accomplish exactly that. Huffman Encoding has as a problem, that although the higher-value integers are assigned longer bit-sequences, the relative frequency with which these higher integers will occur, is not the inverse, of their bit-length. This can also be why, a non-default arrangement can be made, that if the size of the integer reaches 15, a full-length value needs to follow.

I have now finally learned, that with MP3 compression, at least the integers smaller than 15, Huffman-Encoded, are intended to be the default case, and values at or above 15 are intended to be the exception. Thus, if the scaling factor is increased, for sure the bit-length of the stream will decrease, until the bit-rate is achieved, that the user set. I got that.

But Why, then, You May Ask, does Dirk choose a different way to accomplish the same thing, and so often?

My answer would be, that formal solutions are often good at compressing the size of integers when those lie in a certain range, but if one value appears in the stream which is much larger than the average, those variable-length encoding schemes can become monsters. ( :1 )

As an example, I read that ‘FLAC’ will record the Linear Predictive Encoding coefficients for a frame accurately, and that this scheme will then Rice Encode the residual each time.

Pure Rice Encoding means, that a remainder of fixed bit-length is encoded with each sample, but that it is given a (variable-length) prefix encoded in straight unary, which states what the multiple of the tunable parameter is (the quotient), that fits into the fixed-length remainder. This choice of a pure unary prefix is questionable, for the reason that I just stated above.

Now, I know that there is also Exponential-Golomb encoding, which like Huffman Coding has a bit-length that grows with the size of the integer to be encoded. But Exponential-Golomb generally produces a bit-length twice as long, as what it would take just to write out the integer on paper.

And so at least a slightly more sophisticated form of encoding exists, which is called Golomb-Rice Encoding, which is essentially Rice Encoding, but in which the prefix, which states the quotient, is prepended in Exponential-Golomb format. Why would they not use it?

And, since it is possible just to put a prefix before an integer in unary, that states its length, an approach which I would be tempted to use, would be just to assume that this unary prefix should be multiplied by a factor such as 3, to arrive at the true length of the integer.

But then a problem with that would be, the fact that this type of prefix would need to be at least 2 bits long, for non-zero values, followed by this multiple of bits belonging to the value, as a minimum. So it will not compress very small values well.

And the reason for this would be the fact, that it would no longer be certain then, that the first bit which actually belongs to the integer, will always be a (1), the way it is with Exponential-Golomb.

And, while I tend to view such encoding schemes as arbitrary, the fashion these days is, always to select a formally-defined one.

In general, my approaches will work well, if a substantial number of values are high.

Dirk

1: ) And what ‘FLAC’ will do in such a case, is just switch the type of the frame this happens in, to a type ‘VERBATIM’ frame. In other words, FLAC would just decide ‘This is one frame we cannot compress’.

FLAC also has a mode, in which each sample is stated as a delta, from the previous one. This corresponds to an ‘LPE’ with one predictor, the coefficient of which is just equal to (+1), relative to which the current value is just the residual…

(Edit 05/25/2016 : ) This is another posting of mine, in which I explain an additional detail about MP3 compression.

Further, ‘FLAC’ is able to encode some of its frames as using LPE, with a variable number of coefficients. I.e., When set to compress more, it will spend its CPU time trying encodings with (!) 6 or more predictors, and will store those in cases where doing so led to more-compact encoding.

While a set of 4 or more coefficients needs to be computed specifically for one frame, via a Statistical Regression Analysis, I have read that for 1, 2, or 3, FLAC just uses a standard set of them. For 1, that will be [ +1 ] . For 2, that will be [ -1, +2 ] . It might seem like a purely academic exercise, to know a standard set of coefficients, which will generally work well if there are only 3 of them. But in fact, having this available offers a non-trivial advantage, over having to store those in the compressed stream.

Since we would presumably be multiplying signed, 16-bit samples with signed, 16-bit coefficients, it will be helpful if the latter only need to fall into the fractional range of ( -1.0 … +1.0 ) . The reason for this is the fact that If we needed to store coefficients which are allowed to exceed ( + 1.0 ) , Then we are blowing another bit of precision just so that one coefficient could do so.

The most recent coefficient will still exceed ( +1.0 ) when there are 3. But as soon as there are 4, none of them would exceed ( +1.0 ) anymore. Therefore, all the coefficients which must be stored, when their number reaches 4 or more, can be made more precise, just because there is a standard set for when we have 3.

## About the Length of the Sampling Interval

I would guess that Microsoft chose a sampling interval of 1152, for its MP3 compression, so that they could truly say that the frequency response goes down to 20 Hertz. With an interval of 1024 samples, one could only get down to 22 Hertz.

And the difference of 128 samples, also factorizes well with 1024.

(Edit 05/23/2016 : ) According to the terminology of some other sources, what I refer to as ‘one sampling interval’, is named “the frame”, while what I have referred to as ‘one frame’, has been referred to as “a granule”.

(Edit 05/31/2016 : ) Another reason seems to be the fact, that both 1152 and 576 are divisible by 3. When a transient is detected, the need seems to exist, always to replace an odd number of frames, with an odd number of shorter frames. It seems that during playback, a count between even-numbered and odd-numbered granules takes place, which also causes alternation between a +1 and a -1 , except for the coefficient (0) , which must be encoded with its sign bit stored. A sequence of ( +1, -1, +1 ) will replace ( +1 ), and a sequence of ( -1, +1, -1 ) will replace ( -1 ) .

## Some Specific Detail, about MP3 Compression of Sound

In This Posting, I wrote at length, about a weakness that exists in MP3-compressed sound, basing my text on the Discreet Cosine Transform, and what some of its implications are. I wrote about ‘a rational approach’, according to which it might make sense, to use a sampling interval of 1024 samples. But I do know in fact that with MP3 compression, each sampling interval has 1152 samples, and the length of each frame is 576 samples. Somebody please correct me, if I have this wrong.

But there is a certain aspect to MP3 encoding which I did not mention, that has to do with the actual representation of the coefficients, and that has implications for what can and cannot be done in the realm of the Fourier Transform used. A Fourier Transform by itself, does not succeed at compressing data. It only alters the representation of the data, from the time-domain into the frequency-domain, which is useful in sound compression, because to alter the data in the frequency-domain does not damage its suitability for listening, the same way that altering its representation in the time-domain would damage it.

I.e., We can quantize the signal after having performed the Fourier Transform on it, but not before.

One of the aspects of MP3 compression which truly reduces the bit-rates obtained substantially, is called “Entropy Encoding”. This is an encoding scheme, by which a finite number of symbols are assigned a set of bits to represent them in a data stream, which invert the frequency of occurrence, to result in the shortest possible bit-stream.

1. One aspect of Entropy Encoding which I do not see mentioned often enough, is the fact that the symbols need to repeat themselves, in order for this scheme to achieve any compression. Hence, if the coefficients used in sound compression were to consist of floating-point numbers, the probability that any one of them would actually occur twice in the data stream would be small, and  Entropy Encoding would not be a suitable means to reduce the bit-rate.
2. Further, traditionally, in order for Entropy Encoding to be decoded, a data stream needed to be accompanied with a decoding table, that defines each of the variable-bit-length codes, into the intended symbol. In sound compression, even if we needed to state what the exact 15-bit value was, for each variable-bit-length encoding, doing so would nevertheless require that we state the 15-bit value once, in the header of each frame. And having to do so, would result in unacceptably high bit-rates overall.

And so both of these limitations of Entropy Encoding had to be surmounted, in order for MP3 compression to exist as we have it today.

What happens with MP3, at the encoding level, after they have passed filtering through the psychoacoustic criteria, is that coefficients are scaled. The scale-factor is written once for each of 22 bands of frequencies, before Huffman Codes are written, that state all the frequency coefficients.

Further, because Huffman Encoding by itself does not yield enough compression, pairs of coefficients are encoded instead of single coefficients. Somehow, the Statistics of this yield better compression.

What also happens with MP3, is that this fixed table (for pairs of integers) is assumed by the standard.

Apparently, a Huffman Code for 15 signals that a full-precision, ‘big value’ is written, following that Huffman Code, with a precision of 13 bits.

The crucial note to myself here is, that the Entropy Encoding table is specifically the Huffman Coding Table, and that for this reason, integers greater than 15 could also be encoded. But by that time, we would have reached the point of diminishing return. And more precisely, it is the Huffman Coding Table, modified to encode Pairs of integers, so that a maximum compression down to 12.5% becomes possible, instead of merely 25%. )

(Edit 06/06/2016 : ) It should be noted, that the practice of Huffman Encoding pairs of values is really only advantageous, if at least one of them was equal to zero, often. Otherwise, it would work just as well to encode them individually.

(Edit 05/28/2016 : ) What strikes me as most plausible, is that with MP3, initially the odd-numbered DCT coefficients are computed, to avoid missing out-of-phase sine-waves. But then, even-numbered coefficients may be derived from them, so that the stream can be decoded again efficiently. The even-numbered coefficients will have as property, that they are 180 degrees out of phase, between two 50% overlapping sampling intervals / frames. This can make playback easier, in that the decoder only needs to keep track, of even-numbered and odd-numbered frames / granules.

Now, I would not say that people should never use MP3. It has its uses. But it also has drawbacks, which are usually correlated with the use that MP3 was originally designed to fill. It was designed for listening to music over limited, early Internet data-connections, and may be just as useful for compressing speech, if the aim is to reduce the bit-rate strongly, and to accept some level of information-loss.

At the bit-rates used today, it leaves the user with a sound quality superior to what the old tape cassettes offered, but inferior to what Raw CDs offered.

It was never really intended to encode movie sound tracks, especially since those often involve ‘Surround Sound’. MP3 generally does not capture surround sound. Yet, I can see myself using it to encode the audio portion of certain video-clips myself, if I know that those clips do not include surround sound. An example might be a rock concert, or some random clip I was experimenting with, but for which my original production never even included any surround information.

There exist numerous alternatives to MP3, that are also available to ordinary users today.

Dirk

(Edit 05/24/2016 : ) There are some other idiosyncrasies in real MP3 compression, which I had noted at some earlier point in time, but which I had since forgotten:

One of them is, that because it is popular right now to refer to the ‘Discreet Fourier Transform’, as a “Fast Fourier Transform”, the DFT is actually computed in order to derive the psychoacoustic parameters. In this transform, there are 32 frequency sub-bands. But then the DCT gets used, actually to compress the sound.

Another idiosyncrasy is, that MP3 will use discreet transient detection, to replace one granule that had a length of 576, with 3 granules that have a length of 192, thus implying a new sampling interval of 384. This defines 4 regions into which any granule can belong, a ‘start’, a ‘normal’, and an ‘end’ region, as well as a ‘fast’ region. Each region has its own sampling window defined.

(Edit 06/06/2016 : ) There was an interesting detail I read about, according to which, the scale factor of each of the 22 encoded sub-bands is stored in the per-granule information, with the exclusion of the highest-frequency sub-band. Apparently, to have the encoder compute a scale factor for all the sub-bands would have implied, that a balanced amount of information is to be allocated to each one.

However, the highest sub-band was thought by the designers, to contain less-pleasant information than the others, which is not supposed to take up as many bits necessarily. Therefore, the decoder is expected to reuse the scale factor of the second-highest sub-band, as the one for the highest.

The highest sub-band will then store many bits, if its amplitudes were quite high during encoding.

Also, whether the Fourier Transform used to derive the psychoacoustic parameters is an ‘FFT’ or a ‘DFT’, is a decision left to the programmers of the codec, since this transform is not used actually to encode the granules. If there was a programmer who wanted to use a DFT here, with 32 sub-bands of its own, then that programmer was recognizing the fact that today, CPUs have far more power than older ones did, and he was trying to improve the quality with which the granules are encoded.

By default, an FFT is used as the first transform, simply because doing so follows the general principal, of trying to reduce the total number of computations needed by the encoder. Its purpose is to determine the audibility thresholds, according to which some of the coefficients of the DCT are set to zero, on the grounds that those should be inaudible

This was also why a ‘DCT’ was used for the actual sound information. That could also have been a DFT, but with the phase information later ignored…