I can offer a sound-compression scheme that I know will not work, as a point of reference.

In an earlier posting, I suggested a way of using a Discrete Fourier Transform, which I suspect may resemble what is in use in sound compression techniques such as MP3, with the exception that I think MP3 uses sampling intervals of 1152 samples, while in theory I was suggesting 1024.

What I was suggesting was that, if the sampling intervals overlap by 50%, then because they only use the odd-numbered coefficients, each of them would analyze a unit vector, as part of a phase-vector diagram, which would be 90 degrees out of phase with the previous one. And every second sampling interval would also have a base vector which is 180 degrees out of phase with the earlier one.

If the aim was to preserve the phase-position of the sampled sound correctly, it might seem that all we need to do is to preserve the sign of each coefficient, so that when the sampling intervals are reconstructed as overlapping, a wave will result that has the correct phase angle, between being a ‘cosine’ and a ‘sine’ wave.

But there would be yet another problem with that, specifically in sound compression, if the codec is using customary psychoacoustic models.

By its nature, such a scheme would produce amplitudes which, in addition to requiring a sign bit to store, would be substantially different between even-numbered and odd-numbered sampling intervals, not because of time-based changes in the signal, but because they are orthogonal unit vectors.

An assumption that several sound compression schemes also make is that, if the amplitude of a certain frequency component was at, say, 100% at t=-1, then at t=0 that same frequency component has an inherited ‘audibility threshold’ of maybe 80%. Thus, if the later frequency coefficient does not exceed 80%, it will be deemed inaudible and suppressed. Hence, an entire ‘skirt’ of audibility thresholds tends to get stored, for all the coefficients, which not only surrounds peak amplitudes with slopes, but which additionally decays from one frame to the next.
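The inherited threshold can be sketched as follows. This is only a minimal illustration, not the psychoacoustic model of any real codec: the decay factor of 0.8 is taken from the 80% example above, it ignores the spreading of thresholds across neighbouring frequencies, and the function name `suppress_masked` is hypothetical.

```python
def suppress_masked(frames, decay=0.8):
    """Zero out coefficients that do not exceed a threshold inherited over time.

    frames: list of frames, each a list of non-negative amplitudes per bin.
    decay:  fraction of the previous threshold carried into the next frame
            (an assumed value, taken from the 80% example in the text).
    """
    threshold = [0.0] * len(frames[0])
    result = []
    for frame in frames:
        kept = []
        for k, amplitude in enumerate(frame):
            inherited = threshold[k] * decay
            if amplitude <= inherited:
                kept.append(0.0)            # deemed inaudible, suppressed
                threshold[k] = inherited    # the threshold keeps decaying
            else:
                kept.append(amplitude)
                threshold[k] = amplitude    # a new peak resets the threshold
        result.append(kept)
    return result


# One frequency bin over three frames: 100%, then 50%, then 70%.
print(suppress_masked([[1.0], [0.5], [0.7]]))
```

If the three frames were alternately a cosine and a sine measurement of one steady tone, the middle value would be dropped here even though the tone itself never changed.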

Hence, even if our frames were intended to complement each other as being orthogonal, practical algorithms will nonetheless treat them as having the same meaning, but consecutive in time. And then, if one or the other is simply suppressed, our phase accuracy is gone again.

This thought was also the main reason for which I had suggested that the current and the previous sampling interval should have their coefficients blended, to arrive at the coefficient for the current frame. And the blending method which I would have suggested was not a linear summation, but to find the square root of the sum of the squares of the two coefficients.

As soon as anybody has done that, they have computed the absolute amplitude, and have destroyed all phase-information.
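A short numeric check of that statement, assuming the two blended coefficients behave like the cosine and sine parts of a unit-amplitude tone. The function name `blended_amplitude` is only illustrative.

```python
import math


def blended_amplitude(cos_coeff, sin_coeff):
    """Square root of the sum of the squares of the two coefficients."""
    return math.hypot(cos_coeff, sin_coeff)


# The same tone at four different phase angles gives the same blended value,
# so the phase angle cannot be recovered from that value.
for phase_deg in (0, 30, 135, 250):
    phase = math.radians(phase_deg)
    amplitude = blended_amplitude(math.cos(phase), math.sin(phase))
    print(phase_deg, round(amplitude, 6))
```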

But there is an observation about surround-sound which is comforting in this context. Methods that exist today to decode stereo into 5.1 surround only require phase-accuracy to within 90 degrees, as far as I know, to work properly. This would be due to the way in which Pro Logic I and II were designed.

And so one type of information which could be added back in to the frequency coefficients, would be of whether the cosine and the sine function are each positive or negative, with respect to the origin of each frame. This will result in sound reproduction which is actually 45 degrees out of phase, from how it started, yet possessing 4 possible phase positions, that correspond to the 4 quadrants of the sine and cosine functions.

And this could even be encoded, simply by giving the coefficient a single sign-bit, with respect to each frame.
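As a sketch of the idea, assume the encoder still has both the cosine and the sine component available before it blends them. It then keeps the blended amplitude plus the two signs, and the decoder places the wave in the middle of the matching quadrant. The names `encode_quadrant` and `decode_quadrant` are hypothetical.

```python
import math


def encode_quadrant(cos_coeff, sin_coeff):
    """Return (amplitude, cos_negative, sin_negative)."""
    return math.hypot(cos_coeff, sin_coeff), cos_coeff < 0, sin_coeff < 0


def decode_quadrant(amplitude, cos_negative, sin_negative):
    """Rebuild equal-sized cosine and sine parts, i.e. the centre of a quadrant."""
    part = amplitude / math.sqrt(2.0)
    return (-part if cos_negative else part,
            -part if sin_negative else part)


# A component at 10 degrees comes back at the centre of the first quadrant.
phase = math.radians(10)
cos_part, sin_part = decode_quadrant(*encode_quadrant(math.cos(phase), math.sin(phase)))
print(round(math.degrees(math.atan2(sin_part, cos_part)), 3))
```

The amplitude survives, while the decoded phase is one of four positions and can be off by up to 45 degrees.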

And this could cause some perceptual oddity when the weaker coefficients are played back with an inaccurate phase position with respect to the dominant coefficients. Yet a system is plausible that at least states a phase position for the stronger, dominant frequency components.

What I could also add, is that in a case where Coefficient 1 had an amplitude of 100%, and the audibility threshold of Coefficient 2 followed as being at 80%, their computation does not always require that these values be represented in decibels.

Obviously, in order for any psychoacoustic model to work, the initial research needs to reveal relationships in decibels. But if we can at least assume that Coefficient 2 was always a set number of decibels lower than Coefficient 1, even at odd numbers of decibels, this relationship can be converted into a fraction, which can be applied to amplitude units, instead of to decibel values.
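The conversion is the standard amplitude-ratio formula, ratio = 10^(dB/20). A minimal sketch, in which the function name is hypothetical:

```python
def db_to_amplitude_ratio(db):
    """Convert a level difference in decibels to a ratio of amplitudes."""
    return 10.0 ** (db / 20.0)


# A threshold that sits 2 dB below the peak is a little under 80% of its
# amplitude; the 80% figure used earlier corresponds to roughly -1.94 dB.
print(round(db_to_amplitude_ratio(-2.0), 4))
print(round(db_to_amplitude_ratio(-1.94), 4))
```

Because the fraction is a constant for a given number of decibels, it can be computed once and stored, so that no logarithms are needed while encoding.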

And, if we are given a signed 16-bit amplitude, and wish to multiply it by a fraction which has been prepared as an unsigned 15-bit value expressing values from 0 to just under 100%, then an integer multiplication between these two will yield a product that fits into a 32-bit integer. Because one of our two integers has been right-shifted by one bit, relative to the sense in which 16-bit values usually express +32767 … -32768, the product only needs to be shifted left by one bit, after which we have the option of ignoring the least-significant word, and using only the most-significant 16-bit word as the result of a fractional multiplication.

The same can be done with 8-bit integers.
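A sketch of the 16-bit case, using Python integers to stand in for fixed-width registers. One assumption needs stating loudly: with a 15-bit fraction, keeping only the high 16-bit word of the raw product amounts to a shift by 16 bits, which would also halve the result, so this sketch shifts the product right by 15 bits instead. The names `to_q15` and `q15_multiply` are hypothetical.

```python
def to_q15(fraction):
    """Convert a fraction in [0, 1) to an unsigned 15-bit integer."""
    return min(int(round(fraction * 32768)), 32767)


def q15_multiply(sample, fraction_q15):
    """Scale a signed 16-bit sample by a 15-bit fraction, integers only."""
    if not -32768 <= sample <= 32767:
        raise ValueError("sample is not a signed 16-bit value")
    if not 0 <= fraction_q15 <= 32767:
        raise ValueError("fraction is not an unsigned 15-bit value")
    product = sample * fraction_q15   # always fits in a signed 32-bit integer
    return product >> 15              # arithmetic shift, rounds toward minus infinity


threshold_fraction = to_q15(0.8)
print(threshold_fraction)                          # 26214
print(q15_multiply(10000, threshold_fraction))     # 7999
print(q15_multiply(-10000, threshold_fraction))    # -8000
```

The results are one unit away from the exact 8000 in the positive case because the fraction 0.8 cannot be represented exactly in 15 bits and the shift truncates.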

Further, if we did not want our hypothetical scheme to introduce a constant 45 degree phase-shift, there would be a way to get rid of that, which would add some complexity at the encoding level.

For each pair of sampling intervals, it could be determined rather easily which of the two was the even-numbered, and which was the odd-numbered. Then, we could find whether the absolute of the sine or the absolute of the cosine component was greater, and record that as 1 bit. Finally, we would determine whether the given component was negative or not, and record that as another bit.
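These two bits can be sketched as follows, again assuming that both components are available to the encoder. The function names are hypothetical, and the decoder shown simply puts the whole amplitude onto the recorded component.

```python
import math


def encode_two_bits(cos_coeff, sin_coeff):
    """Return (amplitude, sine_dominant, negative)."""
    sine_dominant = abs(sin_coeff) > abs(cos_coeff)
    dominant = sin_coeff if sine_dominant else cos_coeff
    return math.hypot(cos_coeff, sin_coeff), sine_dominant, dominant < 0


def decode_two_bits(amplitude, sine_dominant, negative):
    """Return (cos_part, sin_part) with all amplitude on the dominant axis."""
    value = -amplitude if negative else amplitude
    return (0.0, value) if sine_dominant else (value, 0.0)


# 10 degrees decodes as +cos, 100 degrees as +sin, 200 degrees as -cos.
for phase_deg in (10, 100, 200):
    phase = math.radians(phase_deg)
    bits = encode_two_bits(math.cos(phase), math.sin(phase))
    print(phase_deg, bits[1:], decode_two_bits(*bits))
```

The four decoded positions now sit on the axes rather than at the centres of the quadrants, so the error is at most 45 degrees in either direction instead of a constant offset.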

(Edit 05/23/2016 : ) Further, there is no need to encode both frames in MPEG-2, where one frame is stored in their place, as derived from both sampling intervals. Hence, the default pattern would be shortened to [ (0,0), (1,0) ] .

(Edit 12/31/2016 : The default pattern would be shortened to [ (0,0), (0,1) ] .

When playing back the frames, the second granule of each could follow from the first, by default, in the following pattern:

```
+cos -> -sin
-sin -> -cos
-cos -> +sin
+sin -> +cos
```

We would need access to the sine-counterpart, of the IDCT, to play this back.

End of Edit . )
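One way to read the table in the edit above is that each arrow advances the basis function by a quarter cycle. The following check confirms that reading numerically; it is an observation about the table only, not a description of how any decoder implements it, and the names `BASIS` and `NEXT_GRANULE` are hypothetical.

```python
import math

BASIS = {
    "+cos": lambda x: math.cos(x),
    "-sin": lambda x: -math.sin(x),
    "-cos": lambda x: -math.cos(x),
    "+sin": lambda x: math.sin(x),
}

# The default pattern from the table: first granule -> second granule.
NEXT_GRANULE = {"+cos": "-sin", "-sin": "-cos", "-cos": "+sin", "+sin": "+cos"}


def pattern_is_quarter_cycle_advance(samples=16):
    """True if each second granule equals the first, advanced by 90 degrees."""
    for first, second in NEXT_GRANULE.items():
        for i in range(samples):
            x = 2.0 * math.pi * i / samples
            if abs(BASIS[first](x + math.pi / 2.0) - BASIS[second](x)) > 1e-12:
                return False
    return True


print(pattern_is_quarter_cycle_advance())
```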

But then we should also ask ourselves what has truly been gained. A phase-position that lines up with an assumed cosine or an assumed sine vector should be remembered as only being lined up with the origin of each sampling window. But the exact timing of each sampling window is arbitrary with respect to the input signal. There is really no reason to assume any exact correspondence, since the source of the signal is typically somebody other than the provider of the codec.

And so in my opinion, to have all the reproduced waves consistently 45 degrees out of phase only puts them that way with respect to a sampling window whose timing is unknown. For Pro Logic surround-sound decoding, what should really matter is whether two waves belonging to the same stream are in-phase with each other, or out-of-phase, to whatever degree of accuracy can be achieved.

( This last concept, actually contradicts being able to reconstruct a waveform accurately, because a constant phase-shift is inconsistent with a constant time-delay, over a range of frequencies. When a complex waveform is actually time-shifted, so as to keep its shape, then this conversely implies different phase-shifts for its frequency components. )
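The relationship behind that remark is that a pure time delay shifts each frequency component by 360 degrees times the frequency times the delay. A small sketch, in which the delay of 0.25 ms is just an example value:

```python
def phase_shift_degrees(frequency_hz, delay_seconds):
    """Phase shift that a pure time delay imposes on one frequency component."""
    return 360.0 * frequency_hz * delay_seconds


# The same delay gives a different phase shift at every frequency, so no
# single constant phase shift can stand in for it across the spectrum.
for frequency in (500.0, 1000.0, 2000.0):
    print(frequency, round(phase_shift_degrees(frequency, 0.00025), 3))
```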

Dirk

6 thoughts on “I can offer a sound-compression scheme that I know will not work, as a point of reference.”

1. There are many video formats like mpeg, mp4, avi,
xvid, dvd and the newest Blu-ray. The issue here is not
whether you can find them, since there’s probably
so. Fast surpassing the popularity of DVDs, this new format has taken the
world by storm.

1. Dirk Mittler says:

These are all ‘media formats’ of sorts. But one detail which we should not overlook is that the extensions .MP4, .MTS, .M2TS, .MPEG, .AVI … all describe a Container File, and not an actual compression scheme. The compression scheme is defined in part by which Codec is being used to COde or DECode the Video Stream. The .AVI File format was a bit special, because it supported almost all Codecs, while most container file-formats support only a small set of Codecs, if more than one at all. The .MP4 Container File Format is frequently associated with a type of compression also known as H.264. At the same time, Blu-ray disks are commonly encoded with H.264 compression, but with much stricter constraints on the Video Stream otherwise.

Obviously, the Computing world is replete with numerous Codecs for Sound as well as for Video. I think that my main subject, in the part of the blog which you commented on, was a part of my own attempt to understand why any of them even work. If it was assumed that the sampling windows need to overlap, and that each sampling window needed to have as many frequency-coefficients as it has time-domain samples, then the first stage of our compression scheme would already double the number of data-points, and that is the opposite of compression. So right off the bat, the industry developed a Modified Discrete Cosine Transform, which allows an Audio Stream to be converted from time-domain into frequency-domain, and which preserves the number of data-points.
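To make that last point concrete, here is a small, unoptimized sketch of a textbook MDCT with a sine window. Each block of 2N samples, overlapping its neighbours by 50%, yields only N coefficients, and overlap-adding the inverse transforms gives the original samples back. This is the generic form of the transform and not the filter bank of any particular codec; the helper `round_trip` and its zero-padding at the edges are assumptions made for the demonstration.

```python
import math


def mdct(block):
    """MDCT of a block of 2N samples -> N coefficients (sine window applied)."""
    n2 = len(block)
    n = n2 // 2
    return [
        sum(math.sin(math.pi * (i + 0.5) / n2) * block[i]
            * math.cos(math.pi / n * (i + 0.5 + n / 2) * (k + 0.5))
            for i in range(n2))
        for k in range(n)
    ]


def imdct(coeffs):
    """Inverse MDCT: N coefficients -> 2N windowed, time-aliased samples."""
    n = len(coeffs)
    n2 = 2 * n
    return [
        math.sin(math.pi * (i + 0.5) / n2) * (2.0 / n)
        * sum(c * math.cos(math.pi / n * (i + 0.5 + n / 2) * (k + 0.5))
              for k, c in enumerate(coeffs))
        for i in range(n2)
    ]


def round_trip(signal, n):
    """Analyse with 50% overlap, then overlap-add the inverse transforms.

    len(signal) is assumed to be a multiple of n.
    """
    padded = [0.0] * n + list(signal) + [0.0] * n
    out = [0.0] * len(padded)
    coefficient_count = 0
    for start in range(0, len(padded) - 2 * n + 1, n):
        coeffs = mdct(padded[start:start + 2 * n])
        coefficient_count += len(coeffs)
        for i, y in enumerate(imdct(coeffs)):
            out[start + i] += y      # the aliasing cancels between neighbours
    return out[n:n + len(signal)], coefficient_count


signal = [math.sin(0.3 * i) + 0.1 * i for i in range(16)]
restored, count = round_trip(signal, 4)
print(max(abs(a - b) for a, b in zip(signal, restored)) < 1e-9, count)
```

For 16 input samples and N = 4, the sketch stores 20 coefficients: one per sample, plus a single extra frame that is needed at the edges of the stream.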

In the case of Video, a Discrete Cosine Transform is usually also used, only that being a 2D transform instead of a one-dimensional transform. In the 2D case, there is no assumed overlap. And then, a method of interpolation is applied, by which some of the frames are basically encoded ~like JPEGs~ , those becoming reference frames – aka key-frames – and the rest of the frames are either intra-predictive or forward-predictive, or bi-predictive, with respect to the key-frames. But a common mistake which some people make, is to expect that the interpolation would be a pixel-wise differentiation or subtraction.

The method of interpolation is usually based on some sort of macro-block structure, which is really a motion-following methodology.

Long story short, many of the frequency-domain-based methods of stream-compression, lossy, are based on almost the same principles, over and over again. If somebody is truly interested in Computing, then somewhere along the line it becomes important to understand the underlying system.

Dirk
