A Possible Path to High-Resolution, Compressed Sound

There is a fundamental tradeoff when we use some form of Fourier Transform to help compress a sound stream, in a way that is already known to be lossy. Higher spectral resolution requires longer sampling intervals, which imply poorer temporal resolution. Higher temporal resolution requires shorter sampling intervals, which imply poorer spectral resolution.
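
To put numbers on this tradeoff, here is a minimal sketch in Python. It assumes a 44.1 kHz sample rate, and the function name is only an illustration of mine. It shows how long each window lasts, and how far apart in Hz two neighbouring 'cycles / window' values are.

```python
SAMPLE_RATE_HZ = 44_100  # assumed sample rate, as for CD audio


def window_resolution(window_length, sample_rate_hz=SAMPLE_RATE_HZ):
    """Return (window duration in ms, frequency of one cycle per window in Hz)."""
    duration_ms = 1000.0 * window_length / sample_rate_hz
    spacing_hz = sample_rate_hz / window_length
    return duration_ms, spacing_hz


if __name__ == "__main__":
    for n in (2048, 1024, 512, 256, 128, 64):
        duration_ms, spacing_hz = window_resolution(n)
        print(f"{n:5d} samples: {duration_ms:6.2f} ms long, "
              f"{spacing_hz:7.2f} Hz per cycle/window")
```

Halving the window length halves its duration, but doubles the frequency spacing, which is the tradeoff in question.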

I believe that one way in which the ear can outperform this limitation, is by having its cilia work in a massively parallel way. I think that our ears also have poor temporal resolution at the lower frequencies, but that our temporal resolution improves at higher frequencies.

And so one way in which I think that sound could be compressed, would be not to stick to one length of sampling interval.

For example, it might be possible to have a longest sampling window, 2048 samples long. Even-numbered coefficients could be computed for it using a Modified Discrete Cosine Transform, which range from 0 to 23 cycles / window. After that, the same interval of time could be subdivided into shorter sampling windows, each only 1024 samples long, and coefficients could be computed from them, which go from 12 to 23 cycles / window, thus completing 3 granules.

4 more ‘octaves’ should be possible, with sampling window lengths of 512, 256, 128 and 64 samples. Most of them would derive coefficients from 12 to 23 cycles / window again, with the exception of the 64-sample windows, which would derive from 12 to 31 cycles / window.

I would maintain the assumption, that from each length of sampling, a granule would result which is half as long, and for which coefficients would be stored.

This should result in 6 ‘octaves’ in total, each of which would have its own scale factor, stored once per frame interval (1024 samples), corresponding to the slowest granule. To simplify computing this scale factor, a global quality level could simply decide how many integers all the coefficients should be quantized to. For each ‘octave’, the peak amplitude within all the granules would be taken, and divided by this quality level, to arrive at the scale factor.
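
A minimal sketch of this scale-factor rule, in Python, might look as follows. The function names are hypothetical, and the rounding mode is an assumption of mine, since I have not specified one: Python's round() sends exact halves to the nearest even integer.

```python
def octave_scale_factor(granules, quality_level):
    """Peak absolute coefficient over all granules of one octave,
    divided by the global quality level."""
    peak = max((abs(c) for granule in granules for c in granule), default=0.0)
    return peak / quality_level


def quantize_octave(granules, quality_level):
    """Return (scale factor, quantized granules) for one octave of one frame."""
    scale = octave_scale_factor(granules, quality_level)
    if scale == 0.0:
        # A silent octave: every coefficient quantizes to zero.
        return scale, [[0 for _ in granule] for granule in granules]
    # round() is an assumed rounding rule, not something the scheme prescribes.
    return scale, [[round(c / scale) for c in granule] for granule in granules]


if __name__ == "__main__":
    octave = [[0.5, -2.0], [1.0, 0.25]]   # two made-up granules
    print(quantize_octave(octave, quality_level=8))
```

With this rule, the largest coefficient in the octave always quantizes to plus or minus the quality level, and everything else to a smaller integer.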

Each frame would store 63 granules. The 32 granules with the highest frequencies would have 20 coefficients each, the next 30 would have 12 coefficients each, and the single longest granule would have 24 coefficients. This would result in 1024 coefficients / frame, in a fixed order.
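
The counts can be checked with a short Python sketch. The table of octaves simply restates the window lengths and ranges given earlier. The assumption I am labeling here, is that each whole number of cycles / window in a range counts as one stored coefficient, which is what makes the totals agree.

```python
FRAME_SAMPLES = 1024

# (window length in samples, lowest cycles/window, highest cycles/window)
OCTAVES = [
    (2048, 0, 23),
    (1024, 12, 23),
    (512, 12, 23),
    (256, 12, 23),
    (128, 12, 23),
    (64, 12, 31),
]


def frame_layout(octaves=OCTAVES, frame_samples=FRAME_SAMPLES):
    """Return a list of (window, granules per frame, coefficients per granule)."""
    layout = []
    for window, lowest, highest in octaves:
        granule_length = window // 2             # a granule is half a window
        granules_per_frame = frame_samples // granule_length
        coefficients = highest - lowest + 1      # assumed: one per integer step
        layout.append((window, granules_per_frame, coefficients))
    return layout


if __name__ == "__main__":
    layout = frame_layout()
    for window, granules, coefficients in layout:
        print(f"window {window:4d}: {granules:2d} granules x {coefficients} coefficients")
    print("granules / frame:", sum(g for _, g, _ in layout))
    print("coefficients / frame:", sum(g * c for _, g, c in layout))
```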

To reduce waste, the scale factor of the highest, 6th octave, could simply be the same as that of the previous, 5th octave, as long as using that one yields lower quantized integers than the global quality level.

The resulting, quantized amplitudes could again, be encoded in a variable-length scheme, such as Exponential-Golomb, optionally plus a sign-bit.
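
As an illustration only, here is a sketch of order-0 Exponential-Golomb coding of the magnitude, followed by a sign bit. Bit strings stand in for a real bit-stream, and the choice to omit the sign bit when the amplitude is zero is my own assumption about what 'optionally' could mean.

```python
def exp_golomb_bits(value):
    """Order-0 Exponential-Golomb code of a non-negative integer, as a bit string."""
    if value < 0:
        raise ValueError("value must be non-negative")
    binary = bin(value + 1)[2:]              # value + 1 in binary
    return "0" * (len(binary) - 1) + binary  # prefix of zeros, then the binary


def encode_signed(amplitude):
    """Magnitude in Exp-Golomb, plus one sign bit when the magnitude is nonzero."""
    bits = exp_golomb_bits(abs(amplitude))
    if amplitude != 0:
        bits += "1" if amplitude < 0 else "0"
    return bits


def decode_signed(bits, pos=0):
    """Decode one amplitude starting at pos; return (amplitude, next position)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    start = pos + zeros
    magnitude = int(bits[start:start + zeros + 1], 2) - 1
    pos = start + zeros + 1
    if magnitude == 0:
        return 0, pos
    sign = -1 if bits[pos] == "1" else 1
    return sign * magnitude, pos + 1


if __name__ == "__main__":
    amplitudes = [0, 1, -3, 7, 0, -1]
    stream = "".join(encode_signed(a) for a in amplitudes)
    decoded, pos = [], 0
    while pos < len(stream):
        value, pos = decode_signed(stream, pos)
        decoded.append(value)
    print(stream, decoded)
```

Small amplitudes, which should be the most frequent ones after quantization, receive the shortest codes.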

One adverse side-effect of this would be, the complex and tedious computation of the scale factors. I do not assume that I would be using any Fast Fourier Transform, to determine audibility thresholds, and to set many of the DCT coefficients to zero, the way it is done with MP3. Then, it would make most sense to determine the scale factors from DCT values very closely analogous to how they are encoded.

The problems start, with the fact that each sampling interval is assumed to have a windowing function, when encoding. This turns into a major CPU load, once a scale factor needs to be computed 32×20 times per frame.

So one simplification I could offer, would be to begin by computing and temporarily storing all the DCT coefficients as 15-bit values, with the mere notion that they will later be quantized, but that a maximum value for them is kept up-to-date, once per ‘octave’ as defined above. After that, the scale factor can be computed from this maximum.


(Edit 06/07/2016 : ) This hypothetical scheme has a major drawback as it stands. Even though it will inherently detect and bracket transients, it would also have poor recovery from transients. After and before a transient, the above method will remain insensitive to sounds in the same octave, for up to 1024 samples, which in the case of a 44.1 kHz format translates into roughly 23 milliseconds. In my opinion, the human ear can detect this as a ‘sound shadow’.

MP3 recovers from transients within 576 samples.

One way to correct this could be, to arrange for not one but two scale factors to be encoded for each octave, except for the lowest octave. The first scale factor would apply to the first half of the frame, while the second scale factor would apply to the second half.

In principle this idea could be extended, all the way until there is a separate scale factor for each granule, with the ostensible exception of the shortest, highest-frequency granules / octave… But then doing so would also imply the intent, of allocating a uniform number of bits / a uniform amount of information, to each granule, knowing that their number doubles temporally with each octave. This would not be, what I would want compressed sound to do.


Exploring Joint Stereo Encoding, with Non-Negative Integers

A concept can exist, by which a stereo signal consists of a left channel L and a right channel R, and by which it gets translated in the time-domain, into sample streams M and S, such that M = L+R and S = L-R. In this case, L and R can be reconstructed as

L = (M+S) / 2
R = (M-S) / 2

This seems trivial, but in a more specific context for this set of equations, the variables could be frequency coefficients, and

L >= 0
R >= 0
M >= 0
L, R, M, S are all Integers.

Because the equations for L and R are truly the inverse, of the definition of M and S, it would follow that in order for them to be true, (M+S) and (M-S) must also be even integers.

If we were encoding the integers M and S in a variable-length scheme, then the bit-length of S has already been compromised by 1 bit, because somewhere we need to state its sign. Yet, we might want to be certain, that the encoding of (M,S) is not longer than that of (L,R).

And so an implication of this which we might want to take advantage of, is knowing that

If M is Even, S Must Also Be Even.
If M is Odd,  S Must Also Be Odd.

And so one idea that might be helpful, would be to define a derived value S’ , such that

S' = S / 2, Rounded Down,

meaning, rounded to the More Negative, If S was Odd.

We could then store (M,S’). The length of S’ is the length of S reduced by at least one bit. Then, when the time comes to decode the stream, we could compute

IF M Is Even, S = (S' * 2)
IF M Is Odd,  S = (S' * 2) + 1

Thereby not wasting any bits. And, depending on what type of variable-length encoding was being used, shortening the length of the integer S’ by 1 bit, may in fact shorten its encoding by more than 1 bit.
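
A minimal sketch of this round trip, in Python, assuming L and R are non-negative integers as stated above. The function names are hypothetical. The right-shift operator on Python integers rounds toward the more negative value, which is the rounding I asked for.

```python
def encode_ms(left, right):
    """Return (M, S') for non-negative integer coefficients L and R."""
    mid = left + right
    side = left - right
    side_half = side >> 1        # S / 2, rounded toward the more negative
    return mid, side_half


def decode_ms(mid, side_half):
    """Recover (L, R) from (M, S'), using the parity of M to restore S."""
    side = 2 * side_half + (mid & 1)   # S has the same parity as M
    return (mid + side) // 2, (mid - side) // 2


if __name__ == "__main__":
    for left, right in [(3, 4), (5, 2), (4, 2), (0, 0), (0, 7)]:
        mid, side_half = encode_ms(left, right)
        print((left, right), "->", (mid, side_half), "->", decode_ms(mid, side_half))
```

Because (M+S) and (M-S) are always even, the two integer divisions in the decoder are exact.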



Some Thoughts on Surround Sound

The way I seem to understand modern 5.1 Surround Sound, there exists a complete stereo signal, which for the sake of legacy compatibility, is still played directly to the front-left and the front-right speaker. But what also happens, is that a third signal is picked up, which acts as the surround channel, in a way that neither favors the left nor the right asymmetrically.

I.e., if people were to try to record this surround channel as being a sideways-facing microphone component, by its nature its positive signal would either favor the left or the right channel, and this would not count as a correct surround-sound mike. In fact, such an arrangement can best be used to synthesize stereo, out of geometries which do not really favor two separate mikes, one for left and one for right.

But, a single, downward-facing, HQ mike would do as a provider of surround information.

If the task becomes, to carry out a stereo mix-down of a surround signal, this third channel is first phase-shifted 90 degrees, and then added differentially between the left and right channels, so that it will interfere least with stereo sound.

In the case where such a mixed-down, analog stereo signal needs to be decoded into multi-speaker surround again, the main component of “Pro Logic” does a balanced summation of the left and right channels, producing the center channel, but at the same time a subtraction is carried out, which is sent rearward.

The advantage which Pro Logic II has over I, is that this summation first adjusts the relative gain of both input channels, so that the front-center channel has zero correlation with the rearward surround information, which has presumably been recovered from the adjusted stereo as well.

Now, an astute reader will recognize, that if the surround-sound thus recovered, was ‘positive facing left’, its addition to the front-left signal will produce the rear-left signal favorably. But then the thought could come up, ‘How does this also derive a rear-right channel?’ The reason for which this question can arise, is the fact that a subtraction has taken place within the Pro Logic decoder, which is either positive when the left channel is more so, or positive when the right channel is more so.

(Edit 02/15/2017 : The less trivial answer to this question is, A convention might exist, by which the left stereo channel was always encoded as delayed 90 degrees, while the right could always be advanced, so that a subsequent 90 degree phase-shift when decoding the surround signal can bring it back to its original polarity, so that it can be mixed with the rear left and right speaker outputs again. The same could be achieved, if the standard stated, that the right stereo channel was always encoded as phase-delayed.

However, the obvious conclusion of that would be, that if the mixed-down signal was simply listened to as legacy stereo, it would seem strangely asymmetrical, which we can observe does not happen.

I believe that when decoding Pro Logic, the recovered Surround component is inverted when it is applied to one of the two Rear speakers. )

But what the reader may already have noticed, is that if he or she simply encodes the mixed-down stereo into an MP3 File, later attempts to use a Pro Logic decoder are for naught, and that some better means must exist to encode surround-sound onto DVDs or otherwise, into compressed streams.

Well, because I have exhausted my search for any way to preserve the phase-accuracy, at least within highly-compressed streams, the only way in which this happens, which makes any sense to me, is if in addition to the ‘joint stereo’, which provides two channels, a 3rd channel was multiplexed into the compressed stream, which as before, has its own set of constraints, for compression and expansion. These constraints can again minimize the added bit-rate needed, let us say because the highest frequencies are not thought to contribute much to human directional hearing…

(Edit 02/15/2017 :

Now, if a computer decodes such a signal, and recognizes that its sound card is only in stereo, the actual player-application may do a stereo mix-down as described above, in hopes that the user has a Pro Logic II -capable speaker amp. But otherwise, if the software recognizes that it has 4.1 or 5.1 channels as output, it can do the reconstruction of the additional speaker-channels in software, better than Pro Logic I did it.

I think that the default behavior of the AC3 codec when decoding, if the output is only specified to consist of 2 channels, is to output legacy stereo only.

The approach that some software might take, is simply to put two stages in sequence: first, AC3 decoding with 6 output channels, and second, mixing the resulting channels down to stereo in a standard way, such as with a fixed matrix. This might not be as good for movie-sound, but would be best for music.


 1.0   0.0
 0.0   1.0
 0.5   0.5
 0.5   0.5
+0.5  -0.5
-0.5  +0.5
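
Applying such a fixed matrix to one frame of six decoded samples could be sketched as below. The assumption here is that each row corresponds to one of the six decoder output channels, in whatever order the decoder delivers them, and that the two columns are the left and right stereo outputs. No 90° phase-shift is applied in this sketch.

```python
# Rows: six input channels (order assumed to match the decoder's output).
# Columns: weight into the left output, weight into the right output.
MIX_MATRIX = [
    (1.0, 0.0),
    (0.0, 1.0),
    (0.5, 0.5),
    (0.5, 0.5),
    (0.5, -0.5),
    (-0.5, 0.5),
]


def mix_down(frame, matrix=MIX_MATRIX):
    """Mix one frame of six samples down to a (left, right) pair."""
    if len(frame) != len(matrix):
        raise ValueError("frame must hold one sample per matrix row")
    left = sum(sample * row[0] for sample, row in zip(frame, matrix))
    right = sum(sample * row[1] for sample, row in zip(frame, matrix))
    return left, right


if __name__ == "__main__":
    print(mix_down([1.0, 0.0, 0.0, 0.0, 0.0, 0.0]))
    print(mix_down([0.0, 0.0, 0.0, 0.0, 1.0, 0.0]))
```

One property worth noticing is that identical signals on the last two rows cancel completely in this matrix, while a signal on only one of them is added differentially to the two outputs.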


If we expected our software to do the steering, then we might also expect, that software do the 90° phase-shift, in the time-domain, rather than in the frequency-domain. And this option is really not feasible in a real-time context.

The AC3 codec itself would need to be capable of 6-channel output. There is really no blind guarantee, that a 6-channel signal is communicated from the codec to the sound system, through an unknown player application... )

(Edit 02/15/2017 : One note which should be made on this subject, is that the type of matrix which I suggested above might work for Pro Logic decoding of the stereo, but that if it does, it will not be heard correctly on headphones.

The separate subject exists, of ‘Headphone Spatialization’, and I think this has become relevant in modern times.

A matrix approach to Headphone Spatialization would assume that the 4 elements of the output vector, are different from the ones above. For example, each of the crossed-over components might be subject to some fixed time-delay, which is based on the Inter-Aural Delay, after it is output from the matrix, instead of awaiting a phase-shift… )

(Edit 03/06/2017 : After much thought, I have come to the conclusion that there must exist two forms of the Surround channel, which are mutually-exclusive.

There can exist a differential form of the channel, which can be phase-shifted 90° and added differentially to the stereo.

And there can exist a common-mode, non-differential form of it, which either correlates more with the Left stereo or with the Right stereo.

For analog Surround – aka Pro Logic – the differential form of the Surround channel would be used, as it would for compressed files.

But when an all-in-one surround-mike is implemented on a camcorder, this originally provides a common-mode Surround-channel. And then it would be up to the audio system of the camcorder, to provide steering, according to which this channel either correlates more with the front-left or the front-right. As a result of that, a differential surround channel can be derived. )

(Updated 11/20/2017 : )


Sound Compression is also Possible, in the Time Domain.

One alternative type of sound compression which exists, is ‘FLAC’. This is a method which preserves all the information in the stream (later reproducing it exactly), which uses Linear Predictive Coding (LPC), and which Rice-Encodes the Residual, by which actual samples differ from the LPC-predicted values.

When I first tried to understand FLAC, my efforts were lacking, in that I did not know that LPC takes place in the time-domain, and not in the frequency-domain.
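
To illustrate what predicting in the time-domain means, here is a minimal sketch in Python. It is not the FLAC bit-stream format. It assumes a fixed second-order predictor, which predicts each sample from the two before it, stores the first two samples verbatim, and Rice-codes a residual as a bit string after folding its sign into the lowest bit. The function names and the choice of the Rice parameter k are assumptions for illustration.

```python
def residuals_order2(samples):
    """Residual of the fixed predictor  prediction[n] = 2*x[n-1] - x[n-2]."""
    out = list(samples[:2])                  # warm-up samples, kept verbatim
    for n in range(2, len(samples)):
        prediction = 2 * samples[n - 1] - samples[n - 2]
        out.append(samples[n] - prediction)
    return out


def restore_order2(residuals):
    """Exact inverse of residuals_order2()."""
    samples = list(residuals[:2])
    for n in range(2, len(residuals)):
        prediction = 2 * samples[n - 1] - samples[n - 2]
        samples.append(residuals[n] + prediction)
    return samples


def rice_bits(value, k):
    """Rice code with parameter k for a signed integer, as a bit string."""
    folded = 2 * value if value >= 0 else -2 * value - 1   # sign folded into bit 0
    quotient, remainder = folded >> k, folded & ((1 << k) - 1)
    unary = "1" * quotient + "0"
    return unary + (format(remainder, f"0{k}b") if k else "")


if __name__ == "__main__":
    samples = [10, 12, 15, 19, 24, 30]       # a smooth, made-up waveform
    residuals = residuals_order2(samples)
    print(residuals, restore_order2(residuals) == samples)
    print([rice_bits(r, 2) for r in residuals[2:]])
```

For a smooth waveform the residuals stay small, so their Rice codes are short, and since the prediction uses only integer arithmetic on earlier samples, the decoder reproduces the stream exactly.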