How certain signal-operations are not convolutions.

One concept that exists in signal processing, is that there could be a definition of a filter, which is based in the time-domain, and that this definition can resemble a convolution. And yet, a derived filter could no longer be expressible perfectly as a convolution.

For example, the filter in question might add reverb to a signal recursively. In the frequency-domain, the closer two frequencies are, which need to be distinguished, the longer the interval is in the time-domain, which needs to be considered before an output sample is computed.

Well, reverb that is recursive would need to be expressed as a convolution with an infinite number of samples. In the frequency-domain, this would result in sharp spikes instead of smooth curves.

I.e., If the time-constant of the reverb was 1/4 millisecond, a 4kHz sine-wave would complete within this interval, while a 2kHz sine-wave would be inverted in phase 180⁰. What this can mean is that a representation in the frequency-domain may simply have maxima and minima, that alternate every 2kHz. The task might never be undertaken to make the effect recursive.

(Last Edited on 02/23/2017 … )

Continue reading How certain signal-operations are not convolutions.

Emphasizing a Presumed Difference between OGG and MP3 Sound Compression

In this posting from some time ago, I wrote down certain details I had learned about MP3 sound compression. I suppose that while I did write, that the Discreet Cosine Transform coefficients get scaled, I may have missed to mention in that same posting, that they also get quantized. But I did imply it, and I also made up for the omission in this posting.

But one subject which I did mention over several postings, was my own disagreement with the practice, of culling frequency-coefficients which are deemed inaudible, thus setting those to zero, just to reduce the bit-rate in one step, hoping to get better results, ‘because a lower initial bit-rate also means that the user can select a higher final bit-rate…’

In fact, I think that some technical observers have confused two separate processes that take place in MP3:

  1. An audibility threshold is determined, so that coefficients which are lower than that are set to zero.
  2. The non-zero coefficients are quantized, in such a way that the highest of them fits inside a fixed maximum, quantized value. Since a scale-factor is computed for one frequency sub-band, this also implies that close to strong frequency coefficients, weaker ones are just quantized more.

In principle, concept (1) above disagrees with me, while concept (2) seems perfectly fine.

And so based on that I also need to emphasize, that with MP3, first a Fast-Fourier Transform is computed, the exact implementation of which is not critical for the correct playback of the stream, but the only purpose of which is to determine audibility thresholds for the DCT transform coefficients, the frequency-sub-bands of which must fit the standard exactly, since the DCT is actually used to compress the sound, and then to play it back.

This FFT can serve a second purpose in Stereo. Since this transform is assumed to produce complex numbers – unlike the DCT – it is possible to determine whether the Left-Minus-Right channel correlates positively or negatively with the Left-Plus-Right channel, regarding their phase. The way to do this effectively, is to compute the dot-product between two complex numbers, and to see whether this dot-product is positive or negative. The imaginary component of one of the sources needs to be inverted for that to work.

But then negative or positive correlation can be recorded once for each sub-band of the DCT as one bit. This will tell, whether a positive difference-signal, is positive when the left channel is more so, or positive if the right channel is more so.

You see, in addition to the need to store this information, potentially with each coefficient, there is the need to measure this information somehow first.

But an alternative approach is possible, in which no initial FFT is computed, but in which only the DCT is computed, once for each Stereo channel. This might even have been done, to reduce the required coding effort. And in that case, the DCT would need to be computed for each channel separately, before a later encoding stage decides to store the sum and the difference for each coefficient. In that case, it is not possible first to determine, whether the time-domain streams correlate positively or negatively.

This would also imply, that close to strong frequency-components, the weaker ones are only quantized more, not culled.

So, partially because of what I read, and partially because of my own idea of how I might do things, I am hoping that OGG sound compression takes this latter approach.

Dirk

Continue reading Emphasizing a Presumed Difference between OGG and MP3 Sound Compression

The Advantage of Linear Filters – ie Convolutions

A question might come to mind to readers who are not familiar with this subject, as to why the subset of ‘Morphologies’ that is known as ‘Convolutions’ – i.e. ‘Linear Filters’ – is advantageous in filtering signals.

This is because even though such a static system of coefficients, applied constantly to input samples, will often produce spectral changes in the signal, they will not produce frequency components that were not present before. If new frequency components are produced, this is referred to as ‘distortion’, while otherwise all we get is spectral errors – i.e. ‘coloration of the sound’. The latter type of error is gentler on the ear.

For this reason, the mere realization that certain polynomial approximations can be converted into systems, that entirely produce linear products of the input samples, makes those more interesting.

OTOH, If each sampling of a continuous polynomial curve was at a random, irregular point in time – thus truly revealing it to be a polynomial – then additional errors get introduced, which might resemble ‘noise’, because those may not have deterministic frequencies with respect to the input.

And, the fact that the output samples are being generated at a frequency which is a multiple of the original sample-rate, also means that new frequency components will be generated, that go up to the same multiple.

In the case of digital signal processing, the most common type of distortion is ‘Aliasing’, while with analog methods it used to be ‘Total Harmonic Distortion’, followed by ‘Intermodulation Distortion’.

If we up-sample a digital stream and apply a filter, which consistently underestimates the sub-sampled, then the resulting distortion will consist of unwanted modulations of the higher Nyquist Frequency.

Continue reading The Advantage of Linear Filters – ie Convolutions

A Possible Path to High-Resolution, Compressed Sound

There is a fundamental tradeoff which takes place, when we use some sort of Fourier Transform, to help compress a sound stream, in a way that is already known to be lossy. Higher spectral resolution requires longer sampling intervals, which also imply poorer temporal resolution. Higher temporal resolution requires shorter sampling intervals, which also imply poorer spectral resolution.

I believe that one way in which the ear can outperform this limitation, is in having its cilia work in a massively parallel way. I think that our ears also have poor temporal resolution at the lower frequencies, but that our Human temporal resolution improves at higher frequencies.

And so one way in which I think that sound could be compressed, would be not to stick to one length of sampling interval.

For example, it might be possible to have a longest sampling window, 2048 samples long. Even-numbered coefficients could be computed for it using a Modified Discreet Cosine Transform, which range from 0 to 23 cycles / window. After that, the same interval of time could be subdivided into shorter sampling windows, each only 1024 samples long, and coefficients could be computed from them, which go from 12 to 23 cycles / window, thus completing 3 granules.

4 more ‘octaves’ should be possible, with sampling window lengths of 512, 256, 128 and 64 samples. Most of them would derive coefficients from 12 to 23 cycles / window again, with the exception of the 64-sample windows, which would derive from 12 to 31 cycles / window.

I would maintain the assumption, that from each length of sampling, a granule would result which is half as long, and for which coefficients would be stored.

This should result in 6 ‘octaves’ in total, each of which would have its own scale factor, stored once per frame interval (1024 samples), corresponding to the slowest granule. To simplify computing this scale factor, a global quality level could simply decide how many integers all the coefficients should be quantized to. For each ‘octave’, the peak amplitude within all the granules would be taken, and divided by this quality level, to arrive at the scale factor.

Each frame would store 63 granules, the 32 of which with the highest frequencies, would have 20 coefficients each, 30 of which would have 12 coefficients, and the longest granule of which would have 24 coefficients. This would result in 1024 coefficients / frame, in a fixed order.

To reduce waste, the scale factor of the highest, 6th octave, could simply be the same as that of the previous, 5th octave, as long as using that one yields lower quantized integers than the global quality level.

The resulting, quantized amplitudes could again, be encoded in a variable-length scheme, such as Exponential-Golomb, optionally plus a sign-bit.

One adverse side-effect of this would be, the complex and tedious computation of the scale factors. I do not assume that I would be using any Fast Fourier Transform, to determine audibility thresholds, and to set many of the DCT coefficients to zero, the way it is done with MP3. Then, it would make most sense to determine the scale factors from DCT values very closely analogous to how they are encoded.

The problems start, with the fact that each sampling interval is assumed to have a windowing function, when encoding. This turns into a major CPU load, once a scale factor needs to be computed 32×20 times per frame.

So one simplification I could offer, would be to begin by computing and temporarily storing all the DCT coefficients as 15-bit values, with the mere notion that they will later be quantized, but that a maximum value for them is kept up-to-date, once per ‘octave’ as defined above. After that, the scale factor can be computed from this maximum

Dirk

(Edit 06/07/2016 : ) This hypothetical scheme has a major drawback as it stands. Even though it will inherently detect and bracket transients, it would also have poor recovery from transients. After and before a transient, the above method will remain insensitive to sounds in the same octave, for up to 1024 samples in the case of a 44.1 kHz format, which translates into 25 milliseconds. In my opinion, the human ear can detect this as a ‘sound shadow’.

MP3 recovers from transients within 576 samples.

One way to correct this could be, to arrange for not one but two scale factors to be encoded for each octave, except for the lowest octave. The first scale factor would apply to the first half of the frame, while the second scale factor would apply to the second half.

In principle this idea could be extended, all the way until there is a separate scale factor for each granule, with the ostensible exception of the shortest, highest-frequency granules / octave… But then doing so would also imply the intent, of allocating a uniform number of bits / a uniform amount of information, to each granule, knowing that their number doubles temporally with each octave. This would not be, what I would want compressed sound to do.