A Concept about Directionality In Sound Perception

We all understand that, having two ears, we can hear panning when we listen to reproduced stereo, and perhaps also that sounds can seem to come ‘from outside’ as opposed to ‘from inside’, corresponding to out-of-phase as opposed to in-phase signals. But the reality of human sound perception is that we are supposed to be capable of a more subtle perception of where sounds originate. I will call this more subtle perception of direction ‘complete stereo-directionality’.

One idea which some people have pursued is that we do not just hear amplitudes associated with frequencies, but that we might be able to perceive phase-vectors associated with frequencies as well. This idea seems to agree with the fact that at least a part of our complete stereo-directionality seems to be based on Inter-Aural Time-Differences. It also seems to agree well with the fact that in Science, and with Machines, the amplitude of any frequency component can be represented by a complex number.

But this idea does not seem to agree well, with the fact that our ultimate organ to perceive sound is not the outer ear, nor the middle ear, but the inner ear, which is also known as the cochlea. As I understand it, the cochlea is capable of differentiating along frequency-mappings incredibly precisely, but not along phase-relationships.

Now, some reason may exist to think, that the middle ear and the skull carry out some sort of mixing of sounds, that enter the outer ear, before those sounds reach the cochlea. But for the moment, I am going to regard this detail as secondary.

I think that what ultimately happens is that on the cerebral cortex, just as with the visual cortex, the auditory cortex has a mapping of fingerprint-like ‘ridges’. The long-range mapping may be according to frequency, but the short-range mapping may be such that one set of ridges corresponds to input from one ear, while the negative of that same pattern of ridges represents the input of the opposite ear.

And so what the cerebral cortex can do, is make very precise differentiations in its short-range neural systems, between what any one frequency-component has as amplitude, as perceived by one cochlea differently from the other cochlea.

When sound events reach our ears, they can follow many paths, and may perhaps also be mixed by our middle ear, so that real phase positions lead to subtle amplitude-differences, as sensed by our cochlea, and as interpreted by our cerebral cortex with its ridged mappings. Inter-Aural Time-Differences may also lead to subtle differences in per-frequency amplitudes, by the time they reach the cochlea.
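As a rough illustration of how a time difference could turn into a per-frequency amplitude difference, here is a minimal sketch. It assumes, purely hypothetically, that each cochlea receives the direct sound plus an attenuated copy that arrives later from the other side. The delay and the cross-feed gain are invented values for illustration, not measurements.

```python
import cmath
import math


def mixed_amplitude(freq_hz, delay_s, crossfeed_gain):
    """Amplitude of a unit sinusoid summed with a delayed, attenuated copy of itself."""
    phase = -2.0 * math.pi * freq_hz * delay_s
    return abs(1.0 + crossfeed_gain * cmath.exp(1j * phase))


# Hypothetical values: 0.5 ms inter-aural delay, cross-feed at 30% strength.
DELAY_S = 0.0005
GAIN = 0.3

for freq in (500.0, 1000.0, 2000.0):
    amplitude = mixed_amplitude(freq, DELAY_S, GAIN)
    print(f"{freq:7.1f} Hz -> amplitude {amplitude:.3f}")
```

With these assumed numbers, the 1 kHz component is attenuated to 0.7 while the 2 kHz component is boosted to 1.3, even though only a delay was applied. The phase itself never needs to be sensed.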

And I suspect that these subtle differences in per-frequency amplitudes are what lead to our ‘complete stereo-directionality’.

What this would also mean is that in lossy sound compression, if the programmers decided to compute a Fourier Transform of each stereo channel first – and the Discrete Cosine Transform is one type of Fourier Transform – and then to store the differences between the absolute amplitudes that result, they may quite accidentally have processed the sound closer to how human hearing processes sound.

If instead, the programmers chose to compute the L-R component in the time-domain first, and then to perform some Fourier Transform of L+R and L-R secondly, they may have been intending to capture more information than can be captured in the other way. But they may have captured information with this method, that human hearing is not able to interpret well.

This would be especially true then, in cases where L and R mainly cancel, so that the amplitude of L+R is low, while the Fourier Amplitude of L-R would be high.
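The contrast between the two approaches can be sketched with an extreme test signal, in which the right channel is the exact negative of the left. This is only an illustration using a direct DFT; the names `dft_magnitudes`, `N` and `BIN` are made up here, and no real codec is being reproduced.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Absolute amplitudes of the DFT bins from 0 up to the Nyquist bin."""
    n = len(signal)
    return [
        abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


N = 64   # samples per frame (arbitrary choice)
BIN = 4  # frequency bin of the test tone (arbitrary choice)

left = [math.sin(2.0 * math.pi * BIN * t / N) for t in range(N)]
right = [-s for s in left]  # same tone, fully out of phase

# Approach 1: transform each channel, then store differences of absolute amplitudes.
mag_left = dft_magnitudes(left)
mag_right = dft_magnitudes(right)
diff_of_magnitudes = [a - b for a, b in zip(mag_left, mag_right)]

# Approach 2: form L+R and L-R in the time domain, then transform those.
mag_mid = dft_magnitudes([l + r for l, r in zip(left, right)])
mag_side = dft_magnitudes([l - r for l, r in zip(left, right)])

print("difference of amplitudes at tone:", round(diff_of_magnitudes[BIN], 6))
print("amplitude of L+R at tone:        ", round(mag_mid[BIN], 6))
print("amplitude of L-R at tone:        ", round(mag_side[BIN], 6))
```

In the first approach the two channels look identical at the tone, because their absolute amplitudes are equal. In the second approach L+R vanishes while L-R carries twice the amplitude of either channel.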

This might sound fascinating due to whatever our middle ear next does with it, but does not lead to meaningful interpretations, of ‘where that sound even supposedly comes from’. Hence, while this could be psychedelic, it would not enhance our ‘complete stereo-directionality’.

Also, our brain may apply the idea that, whatever sound we are focusing on, ‘all the other sounds’ form a continuous background noise, so that the sound we are focusing on may seem to have negative amplitudes wherever real amplitudes locally fall below the virtual noise level. And while this may allow us to derive some sort of perception of phase-cancellation, it may not actually be due to our cochlea having picked up phase-cancellation.



Those Mysterious FFTs

One question I do not know the answer to, is why many Fourier Transforms are being named ‘FFTs’, which I feel should be named ‘DFTs’.

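According to what I read, an FFT is supposed to compute a number of coefficients per octave, while a DFT is supposed to compute a number of them per unit frequency, so that the FFT would save work by folding the spectrum and computing fewer high-frequency coefficients. That does not match standard usage, though, in which ‘FFT’ names a family of fast algorithms that compute exactly the same coefficients as the DFT, evenly spaced in frequency, only with far fewer operations. A transform that yields a fixed number of coefficients per octave is usually called a constant-Q transform instead.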

I am seeing Transforms named FFTs, that are still computing the full number of coefficients, per unit of frequency.
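That observation can be checked numerically. The sketch below compares a direct DFT with a textbook recursive radix-2 FFT on the same frame. It is not the code of any particular library, and the test signal is arbitrary.

```python
import cmath
import math


def dft(x):
    """Direct DFT: one coefficient per evenly spaced frequency bin, O(n^2) work."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


samples = [math.sin(0.3 * t) + 0.5 * math.cos(1.1 * t) for t in range(16)]
slow = dft(samples)
fast = fft(samples)
max_error = max(abs(a - b) for a, b in zip(slow, fast))

print("coefficients from DFT:", len(slow))
print("coefficients from FFT:", len(fast))
print("largest difference:   ", max_error)
```

Both functions return the same number of coefficients, and the values agree to within rounding error. The ‘folding’ only reduces the amount of arithmetic, not the number of coefficients.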



Why The Discrete Cosine Transform Is Invertible

There are people who would answer this question entirely using Algebra, but unfortunately, my Algebra is not up to standard, specifically when applied to Fourier Transforms. Yet, I can often visualize such problems and reason them out, which can provide a kind of common-sense answer, even to this type of a question.

If a DCT is fed a time-domain sine-wave, the frequency of which exactly corresponds to an odd-numbered frequency coefficient, but which is 90 degrees out of phase with that coefficient, the fact stands, that the coefficient in question remains zero for the current sampling interval.

But in that case, the even-numbered coefficients, and not only the two directly adjacent to this center frequency, will alternate between positive and negative values. When the coefficients are then laid out, a kind of decaying wave-pattern becomes humanly discernible, which happens to have its zero-crossings, directly at the odd coefficients.

Also, in this case, if we were just to add all the coefficients, we should obtain zero, which would also be what the time-domain sample at n=0 should be equal to, consistently with a sine wave and not a cosine wave.

And this is why, if an IDCT is applied to the coefficients, and if the phase information of this chosen IDCT is correct, the original sine wave can be reconstructed.
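The reasoning above can be tried out numerically. The following is a minimal sketch that assumes one specific convention, the DCT-II with half-sample offsets, with the normalization placed in the forward transform and a plain sum of cosines as the inverse. The names `dct_normalized`, `idct_unnormalized`, `N` and `K0` are chosen here for illustration, and other DCT types may behave somewhat differently.

```python
import math


def dct_normalized(x):
    """DCT-II with the normalization folded into the forward transform."""
    n = len(x)
    coeffs = []
    for k in range(n):
        total = sum(x[t] * math.cos(math.pi * k * (t + 0.5) / n) for t in range(n))
        scale = 1.0 / n if k == 0 else 2.0 / n
        coeffs.append(scale * total)
    return coeffs


def idct_unnormalized(coeffs):
    """Inverse as a plain sum of cosines; each coefficient acts as an amplitude."""
    n = len(coeffs)
    return [
        sum(coeffs[k] * math.cos(math.pi * k * (t + 0.5) / n) for k in range(n))
        for t in range(n)
    ]


N = 16  # frame length (arbitrary)
K0 = 5  # odd-numbered coefficient whose frequency the sine wave matches

# Same frequency as basis function K0, but 90 degrees out of phase with it.
sine = [math.sin(math.pi * K0 * (t + 0.5) / N) for t in range(N)]

coeffs = dct_normalized(sine)
rebuilt = idct_unnormalized(coeffs)
max_error = max(abs(a - b) for a, b in zip(sine, rebuilt))

for k, c in enumerate(coeffs):
    print(f"coefficient {k:2d}: {c:+.4f}")
print("largest reconstruction error:", max_error)
```

Under this convention, coefficient 5 and every other odd coefficient come out as zero, while the even coefficients carry all the information and decay away from the center frequency. In this particular setup they are positive below the center frequency and negative above it. The inverse then recovers the sine wave to within rounding error.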

Note: If the aim is to compress and then reproduce sound, we normalize the DCT, but do not normalize the IDCT. Hence, with the Inverse, if a coefficient stated a certain magnitude, then that one coefficient by itself is also expected to produce a ‘sine-wave’, with the corresponding amplitude. ( :1 )

I think that it is a kind of slip which people can make, to regard a Fourier Transform ‘as if it were a spectrum analyzer’, the ideal behavior of which, in response to an analog sine-wave of one frequency, would be just to display one line, which represents a single non-zero data-point, in this case corresponding to a frequency coefficient. In particular because Fourier Transforms are often computed for finite sampling intervals, such a transform can behave differently. And the DCT seems to display this the most strongly.
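A small experiment shows the difference between the two behaviors. It is a sketch only, assuming a DCT-II with the normalization in the forward transform; the frame length, the test frequencies and the threshold are arbitrary choices.

```python
import math


def dct_normalized(x):
    """DCT-II with the normalization folded into the forward transform."""
    n = len(x)
    coeffs = []
    for k in range(n):
        total = sum(x[t] * math.cos(math.pi * k * (t + 0.5) / n) for t in range(n))
        scale = 1.0 / n if k == 0 else 2.0 / n
        coeffs.append(scale * total)
    return coeffs


def significant(coeffs, threshold):
    """Indices of the coefficients whose magnitude exceeds the threshold."""
    return [k for k, c in enumerate(coeffs) if abs(c) > threshold]


N = 32
K = 8

# A cosine that coincides exactly with basis function K, in frequency and phase.
on_bin = [math.cos(math.pi * K * (t + 0.5) / N) for t in range(N)]
# The same kind of wave, but half a coefficient higher in frequency.
off_bin = [math.cos(math.pi * (K + 0.5) * (t + 0.5) / N) for t in range(N)]

on_lines = significant(dct_normalized(on_bin), 0.01)
off_lines = significant(dct_normalized(off_bin), 0.01)

print("lines for the matching wave:  ", on_lines)
print("lines for the in-between wave:", off_lines)
```

Only the first wave produces the single ‘line’ that a spectrum analyzer would show. The second wave, which is just as pure a tone, spreads over many coefficients because the frame is finite.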

While it would be tempting to say, that a DFT might be better behaved, the fact is that when computers crunch complex numbers, they represent those as pairs of real numbers. So while there is a ‘real’ component that results from the cosine-multiplication, and an ‘imaginary’ component that results from the sine-multiplication, each of these components could leave a human viewer equally confused as a DCT might, because again, each of these is just an orthogonal component vector.

So even in the case of the DFT, each number is initially not yet an amplitude. We still need to square each of these, and to add them. Only then, depending on whether we take the square root or not, we are left with an amplitude, or a signal energy, finally.
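To make this concrete, the sketch below feeds one DFT bin the same unit-amplitude wave at three different phases. The helper `dft_bin` and the chosen constants are illustrative only.

```python
import cmath
import math


def dft_bin(x, k):
    """Single complex DFT coefficient for bin k."""
    n = len(x)
    return sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))


N = 64
BIN = 6

results = []
for phase_deg in (0, 45, 90):
    phase = math.radians(phase_deg)
    wave = [math.cos(2.0 * math.pi * BIN * t / N + phase) for t in range(N)]
    c = dft_bin(wave, BIN)
    energy = c.real ** 2 + c.imag ** 2        # square each component and add them
    amplitude = 2.0 * math.sqrt(energy) / N   # square root, rescaled to the wave's amplitude
    results.append((phase_deg, c.real, c.imag, amplitude))

for phase_deg, re, im, amplitude in results:
    print(f"{phase_deg:3d} deg: real {re:+8.3f}  imag {im:+8.3f}  amplitude {amplitude:.3f}")
```

The real and imaginary components swing between zero and their maximum as the phase changes, yet the amplitude derived from both stays at 1 in every case.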

When using a DFT, it can be easy to forget that if we feed it a single pulse in the time-domain, what it will yield in the frequency-domain is actually a series of complex numbers, the absolute values of which do not change, but which rotate in the complex plane when plotted out along the frequency-domain. And then, if all we could see was either their real or their imaginary component, we would see that the DFT also produces a fringing effect initially.
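A short sketch of the single-pulse case, using a direct DFT and an arbitrarily chosen pulse position:

```python
import cmath
import math


def dft(x):
    """Direct DFT of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


N = 16
PULSE_AT = 3  # sample index of the single pulse (arbitrary)

pulse = [1.0 if t == PULSE_AT else 0.0 for t in range(N)]
spectrum = dft(pulse)

magnitudes = [abs(c) for c in spectrum]
real_parts = [c.real for c in spectrum]

for k in range(N):
    print(f"bin {k:2d}: magnitude {magnitudes[k]:.3f}  real part {real_parts[k]:+.3f}")
```

Every bin has a magnitude of 1, while the real part alone swings between positive and negative values from one bin to the next, which is the fringing described above.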

The fact that these numerical tools are not truly spectrographs, can make them unsuitable for direct use in Psychoacoustics, especially if they have not been adapted in some special way for that use.


1: ) This latter observation also has a meaning for when we want to entropy-encode a (compressed) sound file, and when the time-domain signal was white noise. If we can assume that each frame states 512 coefficients, and that the maximum amplitude of the simulated white noise is supposed to be +/- 32768, then the amplitude of our ‘small numbers’ would really only need to reach 64 (32768 / 512), so that when they interfere constructively and destructively over an output interval, they will produce this effect.

Now, one known fact about musical sounds which are based on white noise is that they are likely to be ‘colored’, meaning that the distribution of signal energy is usually not uniform over the entire audible spectrum. Hence, if we wanted just 1/8 of the audible spectrum (64 of the 512 coefficients) to be able to produce a full signal strength, then we would need for the entropy-encoded samples to reach 512 (32768 / 64). And, we might not expect the ‘small numbers’ to be able to reproduce white noise at full amplitude, since the length of the big numbers is ‘only’ 15 bits+ anyway. One entropy-encoded value might already have a length of ~3 bits. So it could also be acceptable, if as many as 1/6 of the coefficients were encoded as ‘big numbers’, so that again, the maximum amplitude of the ‘small numbers’ would not need to carry the sound all by itself…

And yet, some entropy-encoding tables with high amplitudes might be defined, just in case the user asks for the lowest-possible bit-rates.
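The arithmetic in this footnote can be laid out as a small calculation. It rests on the simplifying assumption made above, that the coefficients of a frame can add up constructively at some point of the output interval. The function name is made up for illustration.

```python
FULL_SCALE = 32768      # maximum output amplitude assumed in the text
COEFFS_PER_FRAME = 512  # coefficients per frame assumed in the text


def per_coefficient_amplitude(active_coeffs):
    """Amplitude each active coefficient needs so that their constructive sum reaches full scale."""
    return FULL_SCALE / active_coeffs


whole_spectrum = per_coefficient_amplitude(COEFFS_PER_FRAME)
one_eighth = per_coefficient_amplitude(COEFFS_PER_FRAME // 8)

print("whole spectrum active:", whole_spectrum)
print("one eighth active:    ", one_eighth)
```

This gives 64 when all 512 coefficients contribute, and 512 when only 64 of them do.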