Not Being Sure about the Sign Bit

One format in which MP3 can encode stereo is “Joint Stereo”. In this form, the signal is sent as a sum-channel and a difference-channel. The left and right channels can be reconstructed from them just as easily as a sum and a difference could be computed. The reason this is done is to save on the bit-rate of the difference channel, on the argument that human stereo-directionality covers a more limited range of frequencies than straightforward hearing does.
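As a minimal sketch of that sum-and-difference idea (not the actual MP3 bitstream format), the round trip can be written in a few lines of Python. The function names, the halving of both channels, and the convention that the difference is positive when the left channel is larger are all assumptions made for illustration:

```python
def ms_encode(left, right):
    """Return (mid, side): the half-sum and half-difference of two channels."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]  # positive when left is larger (assumed)
    return mid, side


def ms_decode(mid, side):
    """Rebuild (left, right) from the sum and difference channels."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


left = [0.5, -0.25, 0.75]
right = [0.25, 0.5, -0.5]
mid, side = ms_encode(left, right)
restored_left, restored_right = ms_decode(mid, side)
```

If the sign of the difference channel were flipped somewhere between encoding and decoding, the two reconstructed channels would simply trade places, which is why that sign needs to be defined.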

But this is also one example, in which the difference channel needs to have a defined sign, as either being positive when left is more so, or being more positive when right is more so.

And so one way in which this could be encoded, if the decision was made to include doing so into the standard, could be to encode an additional sign-bit into each non-zero frequency coefficient of the difference channel. But doing so would also affect the overall bit-rate of the signal enough, that this deters professionals from doing it. Also, the argument is made that lower bit-rates can lead to higher sound quality, because if they want, users can increase the bit-rate anyway, resulting in greater definition of the more audible components of the sound.

And so with Joint Stereo, a feature built-in to MP3 is, “Side-Bits”. These side-bits are included if the mode is enabled, and declare as header information, how the signal should be panned, either to the left or the right, as well as what the sign of the difference-channel might be. ( :1 )

Well there is an implementation detail about the side-bits, which I do not specifically know: Whether they are stored only once per frame, or once per frequency sub-band, since the compression schemes do divide the spectrum into sub-bands already. Thus, if the side-bits were encoded for each sub-band, for each frame, then a good compromise could be reached, in terms of how much data should be spent on that.

The only coefficient which I imagine would have a sign-bit to itself each time, would be the zero coefficient, that corresponds to DC, but that also corresponds to F = Sampling Interval / 2.

This question becomes more relevant for surround-sound encoding. If, rather than using the Pro Logic method, some other scheme was decided on, for defining a ‘Back-Front’ channel, then it would suddenly become critical, that this channel have correct sign information. And then it would also become possible for this channel to correlate negatively with frequency components that also belong to the stereo channels. Hence, a more familiar method of designing the servos of Pro Logic II would be effective again, than what would be needed, if none of the cosine transforms could produce negative coefficients. ( :2 )

Dirk

P.S. When we listen to accurately-reproduced signals on stereo headphones, if the signals are perfectly in-phase, humans tend to hear them as if they came from inside our own heads. If they are 180 degrees out-of-phase, we tend to hear them as coming from a non-specific place outside our heads.

I have had a personal friend complain, that when he listens to Classical Music via MP3 Files, he cannot ascertain the direction which sounds are coming from – i.e. the positions of the instruments in the orchestra, but can hear panning and this in-versus-out condition. Listening to Classical Music via MP3 is a very bad idea.

What this tells me is that my friend has very good hearing, which is tuned to the needs of Classical Music.

The reason he hears some of the sound as being in the out-of-phase position, may well just be due to the fact that Joint Stereo was being selected by the encoder, and that certain frequency components in the difference-channel, did not have substantial counterparts in the sum-channel. Mathematically, this results in a correlation of zero between the encoded channels, but in a correlation of -1 between the reconstructed left and right…
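A small numerical sketch of that claim, under the assumption that one frequency component is present only in the difference-channel while the sum-channel is silent at that frequency. The test tone, the frame length and the helper name `correlation` are hypothetical:

```python
import math


def correlation(x, y):
    """Pearson correlation coefficient of two equal-length sample lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


N = 64
side = [math.sin(2 * math.pi * 5 * i / N) for i in range(N)]  # tone only in the difference
mid = [0.0] * N                                               # nothing in the sum

left = [m + s for m, s in zip(mid, side)]
right = [m - s for m, s in zip(mid, side)]

inner_product = sum(m * s for m, s in zip(mid, side))  # encoded channels: zero
lr_correlation = correlation(left, right)              # reconstructed channels: very close to -1
```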

Subjectively, I would say that I have observed better sound quality in this regard, when using OGG Compression, and at a higher bit-rate. I found that “Enya” required 192kbps, while “Strauss” did not sound good until I had reached 256kbps.

But I do not know objectively, what it is in the OGG Files, that gives me this better experience. I do not have the precision hearing, which said friend has. I have used FLAC to encode some “Beethoven” and “Schubert”, but mainly just in order to archive their music without any loss in information at all, and not as a testament, to the listening experience really being that much better.

1: ) In the case of Joint Stereo with MP3, what I would expect, is that the ‘pan-number’ will also direct the decoder to set the polarity of the difference-channel, to be positive with whichever side the sum-channel is being panned-towards more strongly. And I expect this to happen, regardless of what the phase information was when encoding.

If there was explicit sign information here, such information would first also have had to be measured when encoding, relative to the phase-position of the sum-channel, since phase-information is generally relative. And I have not heard of correlation information being collected first when encoding, between the stereo and difference channels.

2: ) This subject piqued my interest in how OGG Compression deals with multi-channel sound. I did an experiment using “Audacity”, in which I prepared a 6-channel project, chose to export it in a custom channel-configuration, and then chose different settings in the channel-meta-data window.

While AC3 was ‘limited’ to allowing a maximum of 7 channels, OGG allows a compressed stream with up to 32 channels. But I seem to have observed that when compressing more than 2 channels, OGG forgoes even joint stereo optimization, instead only compressing each channel individually. This seems to follow from the observation that if I mix channels 3 and 4, assigned to a hypothetical front-center and LFE, I should have turned those two into a monaural signal repeated once. But doing so does not improve the OGG File size.

There was a 3m45s stream, which took 5.2MB as a 6-channel AC3. The same stream takes up 18.1MB as a 6-channel OGG. And these bit-rates result from choosing a rate of 192kbps for the AC3 File, while choosing ‘Quality Level 8/10’ for the OGG.
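The arithmetic behind those figures can be checked with a short sketch. It assumes that “MB” here means 2^20 bytes, which is my reading and not something the file manager confirmed; the function names are only for illustration:

```python
def stream_size_mib(bit_rate_bps, duration_s):
    """Size in units of 2**20 bytes of a constant-bit-rate stream."""
    return bit_rate_bps * duration_s / 8 / 2**20


def average_bit_rate_kbps(size_mib, duration_s):
    """Average bit-rate, in kbit/s, implied by a file size and a duration."""
    return size_mib * 2**20 * 8 / duration_s / 1000


DURATION_S = 3 * 60 + 45                              # 3m45s

ac3_size = stream_size_mib(192_000, DURATION_S)       # about 5.15, matching the 5.2MB observed
ogg_rate = average_bit_rate_kbps(18.1, DURATION_S)    # about 675 kbit/s overall
ogg_rate_per_channel = ogg_rate / 6                   # if spread evenly over 6 channels
```

Under that assumption, the OGG File averages roughly three and a half times the bit-rate of the AC3 File.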

I think that one reason for the big difference in bit-rates, is the fact that my stream consisted of a stereo signal originally, of which there were merely 3 copies. The AC3 File takes advantage of the correlations to compress, while OGG is not as able to do so.

Further, I read somewhere that OGG takes the remarkable approach, to convert the stereo into joint stereo, after quantization (in the frequency domain), while MP3 does so before quantization. This makes the conversion which OGG performs, of a signal into stereo, a lossless process, and also seems to imply, encoding one sign bit with each coefficient of the difference-channel. Any advantage OGG gives to the bit-rate would need to stem, from the majority of low-amplitude coefficients in the difference-channel, as well as from limiting its frequencies.
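To illustrate why a sum-and-difference conversion performed on already-quantized integers can lose nothing, here is a sketch of a generic reversible integer transform. This is an assumption made for illustration only; it is not the coupling which OGG Vorbis actually specifies, and the function names are hypothetical:

```python
def lossless_ms_encode(left, right):
    """Integer mid/side: the bit dropped from mid survives as the parity of side."""
    mid = (left + right) >> 1
    side = left - right          # carries its own sign
    return mid, side


def lossless_ms_decode(mid, side):
    """Exact inverse of lossless_ms_encode for any pair of integers."""
    total = (mid << 1) | (side & 1)   # restore left + right, including the dropped bit
    left = (total + side) >> 1
    right = (total - side) >> 1
    return left, right


pair = lossless_ms_encode(3, -2)
restored = lossless_ms_decode(*pair)
```

The difference value is a signed integer here, so its sign travels with every non-zero coefficient, which is consistent with the implication drawn above.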

By contrast, this would seem to suggest, that MP3 will compute an ‘FFT’ of each channel, also in order to determine the side-bits, after which it will compute a sum and a difference channel in the time-domain, and then compute the ‘DCT’ of each…

An Observation about Pro Logic Versus AC3

One question that people might ask, would be ‘Why is there still any interest in Pro Logic, when in the world today, we have AC3 sound compression?’ Beyond AC3, we also have AAC sound compression, which gets used in MP4 Video Files, or by itself, in M4A Audio Files.

The answer I would give is as follows. As long as our player or player application supports AC3, it will definitely be better able to output 6 channels of sound from such a compressed, digital stream.

But it can happen to us that our speaker amplifier only accepts two analog channels, which would have been called Left and Right. In such a case, if our speaker amp possesses a Pro Logic decoder, the player of our AC3-compressed stream still has the option of Pro Logic encoding its stereo output.

In that case, our speaker amp will still try to decode that into surround sound, with as many speakers as we have connected to this amp.

But if we do that, we are subjecting the sound to a loss in quality, because the sound has been collapsed into analog stereo first.

Yet, to substitute some other, Back-Front component for the surround channel, which is being fed to the Pro Logic decoder, does not really hurt the quality of the surround decoding more, than using Pro Logic already would. And so I would see no hesitation in doing so, if the need arises.

Dirk

Some Thoughts on Surround Sound

The way I seem to understand modern 5.1 Surround Sound, there exists a complete stereo signal, which for the sake of legacy compatibility, is still played directly to the front-left and the front-right speaker. But what also happens, is that a third signal is picked up, which acts as the surround channel, in a way that neither favors the left nor the right asymmetrically.

I.e., if people were to try to record this surround channel as being a sideways-facing microphone component, by its nature its positive signal would either favor the left or the right channel, and this would not count as a correct surround-sound mike. In fact, such an arrangement can best be used to synthesize stereo, out of geometries which do not really favor two separate mikes, one for left and one for right.

But, a single, downward-facing, HQ mike would do as a provider of surround information.

If the task becomes, to carry out a stereo mix-down of a surround signal, this third channel is first phase-shifted 90 degrees, and then added differentially between the left and right channels, so that it will interfere least with stereo sound.

In the case where such a mixed-down, analog stereo signal needs to be decoded into multi-speaker surround again, the main component of “Pro Logic” does a balanced summation of the left and right channels, producing the center channel, but at the same time a subtraction is carried out, which is sent rearward.
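The mix-down and the balanced summation and subtraction described above can be sketched as follows. This is only an illustration under stated assumptions: the 90 degree shift is done with a slow, plain DFT over one block, the surround gain of 0.7071 is a guess on my part, and none of the function names come from any real Pro Logic implementation:

```python
import cmath
import math


def shift_90(x):
    """Delay every frequency component of x by 90 degrees (cos becomes sin)."""
    n = len(x)
    spectrum = [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n)]
    for k in range(n):
        if 0 < k < n / 2:
            spectrum[k] *= -1j   # positive frequencies
        elif k > n / 2:
            spectrum[k] *= 1j    # negative frequencies
        else:
            spectrum[k] = 0      # DC and Nyquist have no quadrature partner
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def mix_down(left, right, surround, gain=0.7071):
    """Add the phase-shifted surround differentially to the stereo pair."""
    s90 = shift_90(surround)
    lt = [l + gain * s for l, s in zip(left, s90)]
    rt = [r - gain * s for r, s in zip(right, s90)]
    return lt, rt


def passive_decode(lt, rt):
    """Balanced sum goes to the center, the difference is sent rearward."""
    center = [(a + b) / 2 for a, b in zip(lt, rt)]
    rear = [(a - b) / 2 for a, b in zip(lt, rt)]
    return center, rear
```

In this sketch the surround component cancels out of the sum, so it does not appear in the center channel, while a sound that is equal in left and right cancels out of the difference.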

The advantage which Pro Logic II has over I, is that this summation first adjusts the relative gain of both input channels, so that the front-center channel has zero correlation with the rearward surround information, which has presumably been recovered from the adjusted stereo as well.

Now, an astute reader will recognize, that if the surround-sound thus recovered, was ‘positive facing left’, its addition to the front-left signal will produce the rear-left signal favorably. But then the thought could come up, ‘How does this also derive a rear-right channel?’ The reason for which this question can arise, is the fact that a subtraction has taken place within the Pro Logic decoder, which is either positive when the left channel is more so, or positive when the right channel is more so.

(Edit 02/15/2017 : The less trivial answer to this question is, A convention might exist, by which the left stereo channel was always encoded as delayed 90 degrees, while the right could always be advanced, so that a subsequent 90 degree phase-shift when decoding the surround signal can bring it back to its original polarity, so that it can be mixed with the rear left and right speaker outputs again. The same could be achieved, if the standard stated, that the right stereo channel was always encoded as phase-delayed.

However, the obvious conclusion of that would be, that if the mixed-down signal was simply listened to as legacy stereo, it would seem strangely asymmetrical, which we can observe does not happen.

I believe that when decoding Pro Logic, the recovered Surround component is inverted when it is applied to one of the two Rear speakers. )

But what the reader may already have noticed, is that if he or she simply encodes his or her mixed-down stereo into an MP3 File, later attempts to use a Pro Logic decoder are for naught, and that some better means must exist to encode surround-sound onto DVDs or otherwise, into compressed streams.

Well, because I have exhausted my search for any way to preserve the phase-accuracy, at least within highly-compressed streams, the only way in which this happens, which makes any sense to me, is if in addition to the ‘joint stereo’, which provides two channels, a 3rd channel was multiplexed into the compressed stream, which as before, has its own set of constraints, for compression and expansion. These constraints can again minimize the added bit-rate needed, let us say because the highest frequencies are not thought to contribute much to human directional hearing…

(Edit 02/15/2017 :

Now, if a computer decodes such a signal, and recognizes that its sound card is only in stereo, the actual player-application may do a stereo mix-down as described above, in hopes that the user has a Pro Logic II-capable speaker amp. But otherwise, if the software recognizes that it has 4.1 or 5.1 channels as output, it can do the reconstruction of the additional speaker-channels in software, better than Pro Logic I did it.

I think that the default behavior of the AC3 codec when decoding, if the output is only specified to consist of 2 channels, is to output legacy stereo only.

The approach that some software might take, is simply to put two stages in sequence: first, AC3 decoding with 6 output channels; second, mixing down the resulting channels to stereo in a standard way, such as with a fixed matrix. This might not be as good for movie-sound, but would be best for music.


1.0   0.0
0.0   1.0
0.5   0.5
0.5   0.5
+0.5  -0.5
-0.5  +0.5
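A minimal sketch of applying that fixed matrix to one frame of 6 decoded samples. The row order (front-left, front-right, center, LFE, rear-left, rear-right) is an assumption on my part, as are the names `MATRIX` and `mix_frame`; the coefficients are the ones listed above:

```python
# One row per input channel, columns are (left output, right output).
# Assumed row order: front-left, front-right, center, LFE, rear-left, rear-right.
MATRIX = [
    (1.0, 0.0),
    (0.0, 1.0),
    (0.5, 0.5),
    (0.5, 0.5),
    (0.5, -0.5),
    (-0.5, 0.5),
]


def mix_frame(samples):
    """Mix one frame of 6 channel samples down to a (left, right) pair."""
    if len(samples) != len(MATRIX):
        raise ValueError("expected one sample per matrix row")
    left = sum(s * row[0] for s, row in zip(samples, MATRIX))
    right = sum(s * row[1] for s, row in zip(samples, MATRIX))
    return left, right


rear_left_only = mix_frame([0.0, 0.0, 0.0, 0.0, 1.0, 0.0])   # appears out of phase in the pair
center_only = mix_frame([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])      # appears equally in both outputs
```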



If we expected our software to do the steering, then we might also expect, that software do the 90° phase-shift, in the time-domain, rather than in the frequency-domain. And this option is really not feasible in a real-time context.

The AC3 codec itself would need to be capable of 6-channel output. There is really no blind guarantee, that a 6-channel signal is communicated from the codec to the sound system, through an unknown player application... )

(Edit 02/15/2017 : One note which should be made on this subject, is that the type of matrix which I suggested above might work for Pro Logic decoding of the stereo, but that if it does, it will not be heard correctly on headphones.

The separate subject exists, of ‘Headphone Spatialization’, and I think this has become relevant in modern times.

A matrix approach to Headphone Spatialization would assume that the 4 elements of the output vector, are different from the ones above. For example, each of the crossed-over components might be subject to some fixed time-delay, which is based on the Inter-Aural Delay, after it is output from the matrix, instead of awaiting a phase-shift… )

(Edit 03/06/2017 : After much thought, I have come to the conclusion that there must exist two forms of the Surround channel, which are mutually-exclusive.

There can exist a differential form of the channel, which can be phase-shifted 90° and added differentially to the stereo.

And there can exist a common-mode, non-differential form of it, which either correlates more with the Left stereo or with the Right stereo.

For analog Surround – aka Pro Logic – the differential form of the Surround channel would be used, as it would for compressed files.

But when an all-in-one surround-mike is implemented on a camcorder, this originally provides a common-mode Surround-channel. And then it would be up to the audio system of the camcorder, to provide steering, according to which this channel either correlates more with the front-left or the front-right. As a result of that, a differential surround channel can be derived. )

(Updated 11/20/2017 : )

I can offer a sound-compression scheme that I know will not work, as a point of reference.

In This Posting, I suggested a way of using a Discrete Fourier Transform, which I suspect may be in use in sound compression techniques such as MP3, with the exception of the fact that I think MP3 uses sampling intervals of 1152 samples, while in theory I was suggesting 1024.

What I was suggesting, was that if the sampling intervals overlap by 50%, because they only use the odd-numbered coefficients, each of them would analyze a unit vector, as part of a phase-vector diagram, which would have been 90 degrees out of phase with the previous. And every second sampling interval would also have a base vector, which is 180 degrees out of phase with the earlier one.

If the aim was, to preserve the phase-position of the sampled sound correctly, it might seem that all we need to do, is to preserve the sign of each coefficient, so that when the sampling intervals are reconstructed as overlapping, a wave will result, that has the correct phase angle, between being a ‘cosine’ and a ‘sine’ wave.

But there would be yet another problem with that, specifically in sound compression, if the codec is using customary psychoacoustic models.

By its nature, such a scheme would produce amplitudes which, in addition to requiring a sign bit to store, would be substantially different between even-numbered and odd-numbered sampling intervals, not because of time-based changes in the signal, but because they are orthogonal unit vectors.

An assumption that several sound compression schemes also make, is that if the amplitude of a certain frequency component was at, say, 100% at t=-1, then at t=0 that same frequency component has an inherited ‘audibility threshold’ of maybe 80%. Thus, if the later frequency coefficient does not exceed 80%, it will be deemed inaudible and suppressed. Hence, an entire ‘skirt’ of audibility thresholds tends to get stored, for all the coefficients, which not only surrounds peak amplitudes with slopes, but which additionally decays from one frame to the next.

Hence, even if our frames were intended to complement each other as being orthogonal, practical algorithms will nonetheless treat them as having the same meaning, but consecutive in time. And then, if one or the other is simply suppressed, our phase accuracy is gone again.
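A toy sketch of that failure, for a single frequency coefficient followed over successive frames. The decay factor of 0.8 is taken from the example above; the rule itself and the name `suppress_masked` are hypothetical and do not reproduce the psychoacoustic model of any real codec:

```python
def suppress_masked(amplitudes, decay=0.8):
    """Zero any coefficient that fails to exceed the threshold inherited from earlier frames."""
    threshold = 0.0
    kept = []
    for a in amplitudes:
        magnitude = abs(a)
        kept.append(a if magnitude > threshold else 0.0)
        # The next frame inherits a decayed copy of the louder of the two.
        threshold = decay * max(threshold, magnitude)
    return kept


# A steady tone whose cosine part is 1.0 and whose sine part is 0.3,
# analyzed by alternating 'cosine' and 'sine' frames.
alternating = [1.0, 0.3, 1.0, 0.3]
result = suppress_masked(alternating)   # every sine-frame coefficient is dropped
```

The weaker, orthogonal component is discarded in every second frame, even though the signal itself never changed.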

This thought was also the main reason for which I had suggested, that the current and the previous sampling interval should have their coefficients blended, to arrive at the coefficient for the current frame. And the blending method which I would have suggested, was not a linear summation, but to find the square root of the sum of the squares of the two coefficients.

As soon as anybody has done that, they have computed the absolute amplitude, and have destroyed all phase-information.
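A short sketch of why that is so, assuming a unit-amplitude tone whose two consecutive coefficients are its cosine and sine parts. The function name `blend` is hypothetical:

```python
import math


def blend(previous_coeff, current_coeff):
    """Square root of the sum of the squares of two consecutive coefficients."""
    return math.hypot(previous_coeff, current_coeff)


# The same tone at four different phase positions always blends to the same value.
blended = [blend(math.cos(math.radians(p)), math.sin(math.radians(p)))
           for p in (0, 30, 90, 200)]
```

Every entry of `blended` comes out as 1.0 to within rounding error, so nothing about the phase position can be recovered from it.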

But there is an observation about surround-sound which comes as comforting for this subject. Methods that exist today, to decode stereo into 5.1 surround, only require phase-accuracy to within 90 degrees, as far as I know, to work properly. This would be due to the industrious way in which Pro Logic 1 and 2 were designed.

And so one type of information which could be added back in to the frequency coefficients, would be of whether the cosine and the sine function are each positive or negative, with respect to the origin of each frame. This will result in sound reproduction which is actually 45 degrees out of phase, from how it started, yet possessing 4 possible phase positions, that correspond to the 4 quadrants of the sine and cosine functions.

And this could even be encoded, simply by giving the coefficient a single sign-bit, with respect to each frame.
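A sketch of that idea, under the assumption that a component is written as A·cos(ωt − φ), so that its cosine part is A·cos φ and its sine part is A·sin φ. The function names are hypothetical:

```python
import math


def sign_bits(phase_deg):
    """One sign bit each for the cosine and the sine part of a component."""
    cos_negative = math.cos(math.radians(phase_deg)) < 0
    sin_negative = math.sin(math.radians(phase_deg)) < 0
    return cos_negative, sin_negative


def reconstructed_phase_deg(cos_negative, sin_negative):
    """Phase obtained when both parts are played back with equal magnitude."""
    c = -1.0 if cos_negative else 1.0
    s = -1.0 if sin_negative else 1.0
    return math.degrees(math.atan2(s, c)) % 360


phase_in = 10.0
phase_out = reconstructed_phase_deg(*sign_bits(phase_in))   # lands in the middle of the quadrant
```

Only four output phases are possible, near 45, 135, 225 and 315 degrees, so the error never exceeds 45 degrees.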

And this could cause some perception oddity when the weaker coefficients are played back, with an inaccurate phase position with respect to the dominant coefficients. Yet, a system is plausible, that at least states a phase position, for the stronger, dominant frequency components.

What I could also add, is that in a case where Coefficient 1 had an amplitude of 100%, and the audibility threshold of Coefficient 2 followed as being at 80%, their computation does not always require that these values be represented in decibels.

Obviously, in order for any psychoacoustic model to work, the initial research needs to reveal relationships in decibels. But if we can at least assume that Coefficient 2 was always a set number of decibels lower than Coefficient 1, even at odd numbers of decibels, this relationship can be converted into a fraction, which can be applied to amplitude units, instead of to decibel values.

And, if we are given a signed 16-bit amplitude, and would wish to multiply it by a fraction, which has also been prepared as an unsigned 15-bit value expressing values from 0 to 99%, then to perform an integer multiplication between these two will yield a 32-bit integer. Because we have right-shifted the value of one of our two integers, from the sense in which they usually express +32767 … -32768, we do have the option of next ignoring the least-significant word from our product, and using only the most-significant 16-bit word, to result in a fractional multiplication.

The same can be done with 8-bit integers.
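A sketch of that fractional multiplication in integer arithmetic, together with the conversion from decibels to a fraction. It assumes the fraction is stored as a 15-bit value scaled by 2^15; under that assumption the product is shifted right by 15 bits, whereas keeping only the upper 16-bit word amounts to a shift of 16 bits and fits a fraction scaled by 2^16. The function names are hypothetical:

```python
def db_to_q15(attenuation_db):
    """Convert an attenuation in decibels into an unsigned 15-bit fraction."""
    fraction = 10 ** (-attenuation_db / 20)
    return min(round(fraction * 32768), 32767)   # clamp: 100% is not representable


def scale_q15(amplitude, fraction_q15):
    """Multiply a signed 16-bit amplitude by a 15-bit fraction, using integers only."""
    if not -32768 <= amplitude <= 32767:
        raise ValueError("amplitude must fit in a signed 16-bit word")
    if not 0 <= fraction_q15 <= 32767:
        raise ValueError("fraction must fit in 15 unsigned bits")
    product = amplitude * fraction_q15   # fits in a signed 32-bit integer
    return product >> 15                 # Python's >> floors, like an arithmetic shift


eighty_percent = 26214                   # 0.8 * 32768, rounded down
threshold = scale_q15(20000, eighty_percent)
```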

Further, if we did not want our hypothetical scheme to introduce a constant 45 degree phase-shift, there would be a way to get rid of that, which would add some complexity at the encoding level.

For each pair of sampling intervals, it could be determined rather easily, which of the two was the even-numbered, and which was the odd-numbered. Then, we could find whether the absolute of the sine or the absolute of the cosine component was greater, and record that as 1 bit. Finally, we would determine whether the given component was negative or not, and record that as another bit.
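Those two bits can be sketched as follows, again assuming a component written as A·cos(ωt − φ), with cosine part A·cos φ and sine part A·sin φ. The function names are hypothetical:

```python
import math


def encode_two_bits(cos_coeff, sin_coeff):
    """Bit 1: is the sine part the larger one? Bit 2: is that larger part negative?"""
    sine_dominant = abs(sin_coeff) > abs(cos_coeff)
    dominant = sin_coeff if sine_dominant else cos_coeff
    return sine_dominant, dominant < 0


def decode_phase_deg(sine_dominant, negative):
    """Phase position implied by the two bits: 0, 90, 180 or 270 degrees."""
    base = 90 if sine_dominant else 0
    return (base + (180 if negative else 0)) % 360


phi = math.radians(100.0)
bits = encode_two_bits(math.cos(phi), math.sin(phi))
decoded = decode_phase_deg(*bits)   # the nearest of the four axis-aligned positions
```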

(Edit 05/23/2016 : ) Further, there is no need to encode both frames in MPEG-2, where one frame is stored in their place, as derived from both sampling intervals. Hence, the default pattern would be shortened to [ (0,0), (1,0) ] .

(Edit 12/31/2016 : The default pattern would be shortened to [ (0,0), (0,1) ] .

When playing back the frames, the second granule of each could follow from the first, by default, in the following pattern:


+cos -> -sin
-sin -> -cos
-cos -> +sin
+sin -> +cos



We would need access to the sine-counterpart, of the IDCT, to play this back.

End of Edit . )

But then we should also ask ourselves, what has truly been gained. A phase-position that lines up with an assumed cosine or an assumed sine vector, should be remembered as only being lined up, with the origin of each sampling window. But the exact timing of each sampling window is arbitrary, with respect to the input signal. There is really no reason to assume any exact correspondence, since the source of the signal is typically from somewhere else, than the provider of the codec.

And so in my opinion, to have all the reproduced waves consistently 45 degrees out of phase, only puts them that way, with respect to a sampling window whose timing is unknown. According to Pro Logic, Surround-Sound decoding, what should really matter, is whether two waves belonging to the same stream, are in-phase with each other, or out-of-phase, to whatever degree of accuracy can be achieved.

( This last concept, actually contradicts being able to reconstruct a waveform accurately, because a constant phase-shift is inconsistent with a constant time-delay, over a range of frequencies. When a complex waveform is actually time-shifted, so as to keep its shape, then this conversely implies different phase-shifts for its frequency components. )
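The relationship between a time-delay and the phase-shift it implies at each frequency is simple enough to state as a sketch. The function names and the example frequencies are chosen only for illustration:

```python
def phase_shift_deg(frequency_hz, delay_s):
    """Phase-shift, in degrees, which a pure time-delay causes at one frequency."""
    return 360.0 * frequency_hz * delay_s


def delay_for_phase(frequency_hz, phase_deg):
    """Time-delay, in seconds, which a given phase-shift corresponds to at one frequency."""
    return phase_deg / 360.0 / frequency_hz


# One fixed delay of a quarter millisecond gives different phase-shifts per frequency.
at_1khz = phase_shift_deg(1000.0, 0.00025)
at_2khz = phase_shift_deg(2000.0, 0.00025)

# One fixed phase-shift of 45 degrees corresponds to different delays per frequency.
delay_1khz = delay_for_phase(1000.0, 45.0)
delay_4khz = delay_for_phase(4000.0, 45.0)
```

So a waveform that keeps its shape has been delayed by one amount of time, while a waveform whose components were all shifted by the same phase angle has been delayed by a different amount of time at each frequency.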

Dirk