Sound Fonts: How something that I blogged was still not 100% accurate.

Sometimes it can happen that, in order to explain a subject 100% accurately, an almost endless amount of text would seem to be required, and that, with short blog postings, I’m limited to always posting a mere approximation of the subject. The following posting is a good example:

(Link to an earlier posting.)

Its title clearly states that there are exactly two types of interpolation-for-use-in-resampling (audio). After some thought, I realized that a third type of interpolation might exist, and that it might be especially useful for Sound Fonts.

According to my posting, the situation can exist in which the relationship between the spacing of (interpolated) output samples and that of (Sound Font) input samples is irrational. In that case, one plausible approach would be to derive a polynomial’s actual coefficients from the input sample-values (which would be the result of one matrix multiplication), and then to compute the value of the resulting polynomial at (x = t), where (0.0 <= t < 1.0).
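As a minimal sketch of that approach – assuming, purely for illustration, a cubic (Catmull-Rom) polynomial derived from 4 neighbouring input samples – the coefficients follow from one matrix multiplication, after which the polynomial is evaluated at an arbitrary (t):

```python
import numpy as np

# Catmull-Rom basis matrix: maps 4 neighbouring input samples
# (p0..p3) to the coefficients of a cubic in t, valid for
# 0.0 <= t < 1.0 between p1 and p2.  (The choice of Catmull-Rom
# here is my assumption, for illustration.)
CATMULL_ROM = 0.5 * np.array([
    [ 0.0,  2.0,  0.0,  0.0],   # constant term
    [-1.0,  0.0,  1.0,  0.0],   # coefficient of t
    [ 2.0, -5.0,  4.0, -1.0],   # coefficient of t^2
    [-1.0,  3.0, -3.0,  1.0],   # coefficient of t^3
])

def interpolate(p, t):
    """Derive the cubic's actual coefficients from 4 input samples
    (one matrix multiplication), then evaluate the polynomial at t."""
    c0, c1, c2, c3 = CATMULL_ROM @ np.asarray(p, dtype=float)
    return ((c3 * t + c2) * t + c1) * t + c0

# Example: an output sample falling 0.37 of the way between the
# input samples p1 and p2.
print(interpolate([0.0, 1.0, 0.5, -0.25], 0.37))
```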

But there is an alternative: the input samples could first be up-sampled, arriving at a set of sub-samples with fixed positions, after which every output sample could arise as a mere linear interpolation between two sub-samples.
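A sketch of that alternative, under the assumption that the up-sampling is done 4x (and, purely for brevity, with a simple linear fill-in standing in for whatever higher-quality filter would really be used):

```python
import numpy as np

def upsample_4x(x):
    """Produce sub-samples at fixed positions, 4 per input sample.
    A real implementation would use a proper interpolation filter;
    np.interp here is just a placeholder."""
    n = np.arange(len(x), dtype=float)
    fine = np.arange(0.0, len(x) - 1.0 + 1e-9, 0.25)
    return np.interp(fine, n, x)

def read_sample(sub, pos):
    """One output sample at fractional input position 'pos' (in
    units of original samples): a mere linear interpolation
    between two adjacent, fixed sub-samples."""
    f = pos * 4.0                    # position in sub-sample units
    i = int(f)
    t = f - i
    return (1.0 - t) * sub[i] + t * sub[i + 1]

x = np.sin(2 * np.pi * 0.05 * np.arange(64))   # toy input clip
sub = upsample_4x(x)
step = 2.0 ** (1.0 / 12.0)                     # irrational pitch ratio
print([round(read_sample(sub, k * step), 4) for k in range(5)])
```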

It would seem, then, that a third way to interpolate is feasible, even when the spacing of output samples is irrational with respect to that of input samples.

Also, with Sound Fonts, the possibility presents itself that the Sound Font itself could have been recorded professionally, at a sample rate considerably higher than 44.1kHz – maybe at 96kHz – just so that, if the Sound Font Player relied on linear interpolation, doing so would not mess up the sound as much as if the Sound Font itself had been recorded at 44.1kHz.

Further, specifically with Sound Font Players, an added problem presents itself: the virtual instrument could get bent upward in pitch, even though its recording already had frequencies approaching the Nyquist Frequency, so that those frequencies could end up being pushed higher than the output Nyquist Frequency, thereby resulting in aliasing – i.e., getting reflected back down to lower frequencies – even though each output sample could have been interpolated super-finely by itself.
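As a quick, worked example of that reflection – the +7-semitone bend and the partial frequencies here are just made-up numbers:

```python
# A pitch-bend multiplies every frequency in the recording by one
# ratio; anything pushed past the output Nyquist Frequency gets
# reflected back down (first-order aliasing).
nyquist = 22050.0                     # output Nyquist Frequency, Hz
bend = 2.0 ** (7.0 / 12.0)            # hypothetical bend of +7 semitones
for f in (6000.0, 15000.0, 20000.0):  # partials in the recording, Hz
    f_bent = f * bend
    heard = f_bent if f_bent <= nyquist else 2.0 * nyquist - f_bent
    print(f"{f:7.0f} Hz -> {f_bent:8.1f} Hz, heard as {heard:8.1f} Hz")
```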

These are the main reasons why, as I have experienced, playing a sampled sound at a very high, bent pitch actually just results in ‘screeching’.

Yet, the Sound Font Player could again be coded cleverly, so that it also over-samples its own output. I.e., if expected to play the virtual instrument at a sample rate of 44.1kHz, it could actually compute interpolated samples closer together than that, corresponding to 88.2kHz, and then compute each ‘real output sample’ as the average between two ‘virtual, over-sampled output samples’. This would effectively insert a low-pass filter, which would flatten the screeching that results from frequencies higher than 22kHz being reflected below 22kHz, and eventually all the way back down to 0kHz. And admittedly, the type of (very simple) low-pass filter such an arrangement implies would be the Haar Wavelet again. :oops:
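A sketch of that arrangement; the ‘read’ callback is a stand-in for whichever interpolation method the player uses internally:

```python
import numpy as np

def render_with_haar(read, n_out, step):
    """Compute interpolated samples at 2x the nominal output rate,
    then average each adjacent pair -- which is exactly the Haar
    Wavelet low-pass filter described above.  'read(pos)' fetches
    one interpolated sample at fractional input position 'pos';
    'step' is the pitch ratio, in input samples per output sample."""
    out = np.empty(n_out)
    for k in range(n_out):
        a = read(k * step)             # first virtual sub-sample
        b = read((k + 0.5) * step)     # second, half a step later
        out[k] = 0.5 * (a + b)         # pair average == Haar LPF
    return out

# Toy usage: linear interpolation into a sine recording, bent
# upward by one (irrational) semitone.
x = np.sin(2 * np.pi * 0.2 * np.arange(256))
def read(pos):
    i = int(pos)
    t = pos - i
    return (1.0 - t) * x[i] + t * x[i + 1]

print(render_with_haar(read, 4, 2.0 ** (1.0 / 12.0)))
```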

If you asked me what the best was which a Soundblaster sound card from 1998 would have been able to do, I’d say, ‘Just compute each audio sample as a linear interpolation between two Sound Font samples.’ Doing so would have required an added lookup into an address in (shared) RAM, a subtraction, a multiplication, and an addition. In fact, basing this speculation on my estimation of how much circuit-complexity such an early Soundblaster card just couldn’t have had, I’d say that those cards would need to have applied integer arithmetic, with a limited number of fractional bits – maybe 8 – to state which Sound Font sample-position a given audio sample was being ‘read from’. It would have been up to the driver to approximate the integer fed to the hardware. And then, if that sound card was poorly designed, its logic might have stated, ‘Just truncate the Sound-Font sample-position being read from, down to the nearest whole sample.’
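A sketch of what I’m speculating such hardware did – this is my guess, not a documented Soundblaster register layout – using a phase accumulator with 8 fractional bits:

```python
FRAC_BITS = 8                      # hypothetical: 8 fractional bits
FRAC_MASK = (1 << FRAC_BITS) - 1

def render_fixed_point(font, phase_step, n_out):
    """Integer-only rendering: 'phase_step' is the pitch ratio in
    fixed point (256 == play at the recorded pitch).  Each output
    sample is a linear interpolation between two adjacent Sound
    Font samples: one lookup, a subtract, a multiply, an add."""
    out = []
    phase = 0
    for _ in range(n_out):
        i = phase >> FRAC_BITS            # integer sample position
        t = phase & FRAC_MASK             # fractional part, 0..255
        s0, s1 = font[i], font[i + 1]
        out.append(s0 + (((s1 - s0) * t) >> FRAC_BITS))
        phase += phase_step
    return out

font = [0, 100, 50, -25, -80, 0, 60]      # toy sample values
print(render_fixed_point(font, 384, 4))   # 384/256 = pitch ratio 1.5
```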

In contrast, when audio software is being programmed today, one of the first things the developer will insist on is to apply floating-point numbers wherever possible…

Also, if a hypothetical, superior Sound Font Player had as its logic, ‘If the sample rate of the loaded Sound Font is (< 80kHz), up-sample it 2x; if that sample rate is actually (< 40kHz), up-sample it 4x…’, just to simplify the logic to the point of making it plausible, then this up-sampling would only take place once, when the Sound Font is actually being loaded into RAM. By contrast, the over-sampling of the output of the virtual instrument, as well as the low-pass filter, would need to be applied in real-time… ‘If the output sample rate is (>= 80kHz), replace adjacent Haar Wavelets with overlapping Haar Wavelets.’
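That load-time rule, written out under the same hypothetical thresholds:

```python
def upsample_factor_at_load(font_rate_hz):
    """Hypothetical load-time rule from the text: pick a power-of-two
    up-sampling factor once, when the Sound Font is read into RAM."""
    if font_rate_hz < 40000.0:
        return 4        # e.g., a 22.05kHz or 32kHz Sound Font
    if font_rate_hz < 80000.0:
        return 2        # e.g., a 44.1kHz or 48kHz Sound Font
    return 1            # a 96kHz Sound Font is left as-is

for rate in (22050.0, 44100.0, 96000.0):
    print(rate, "->", upsample_factor_at_load(rate), "x")
```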

Food for thought.

Sincerely,
Dirk

Two inherently different types of interpolation that exist.

One observation which I have made about certain people is that they are able to conceive that, given a certain audio sample-rate, some signal-processing operation to perform interpolation may be needed, especially when up-sampling, or otherwise resampling, the stream. But what I seemed to notice was that those people failed to distinguish between two different categories of interpolation, which I would split as follows:

  1. There exist interpolations in which the samples to be interpolated have fixed positions in time, between the input samples.
  2. There exist interpolations where, for every interpolated sample, the time-position between the two adjacent input samples is not known until the very instant when the interpolation is finally computed, and where this time-position needs to be defined by an additional parameter, which may be called (t), and which would typically span the interval [0.0 .. 1.0).

For type (1) above, if polynomials are going to be used, then all the values of (t) are known in advance, and therefore all the values of (x) that define the polynomial are also known in advance. This also means that all the powers of (x) are known in advance. In that case, a fixed set of precomputed weights – one dot-product of those weights with the neighbouring input samples, per interpolated sample – can be applied to compute the interpolation.
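A minimal sketch of type (1), assuming 2x up-sampling with a Catmull-Rom cubic: the only sub-sample position is (t = 0.5), so its four weights can be computed once, in advance, and each interpolated sample is then a single dot-product:

```python
import numpy as np

# Type (1): the sub-sample position t = 0.5 is fixed, so the
# Catmull-Rom cubic evaluated there collapses into four constant
# weights, computed once in advance.
WEIGHTS_HALF = np.array([-1.0, 9.0, 9.0, -1.0]) / 16.0

def upsample_2x(x):
    """Interleave the original samples with midpoint samples, each
    midpoint being one dot-product of 4 neighbours with the
    precomputed weights -- no polynomial is derived at run-time.
    (The kernel is symmetric, so convolution equals correlation.)"""
    mids = np.convolve(x, WEIGHTS_HALF, mode="valid")
    out = np.empty(2 * len(mids))
    out[0::2] = x[1:len(mids) + 1]        # original samples
    out[1::2] = mids                      # interpolated midpoints
    return out

x = np.sin(2 * np.pi * 0.1 * np.arange(32))
print(upsample_2x(x)[:6].round(4))
```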

However, for type (2) above, IF polynomials are going to be used, then it becomes necessary to derive the actual polynomial, and then to ‘plug parameter (t) into the resulting polynomial.’ Because the polynomial could be of the 6th degree, this can become an expensive computation to perform in real-time, and implementors are likely to look for alternatives to polynomials that are also cheaper to compute.
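To put a rough number on that per-sample cost, here is a sketch that assumes a 6th-degree polynomial through 7 input samples at positions -3 through +3, with (t) falling just past the middle sample; the basis matrix is derived once, in advance:

```python
import numpy as np

NODES = np.arange(-3, 4)                  # 7 input-sample positions
# Basis matrix: maps 7 sample values to the 7 coefficients of the
# unique 6th-degree polynomial through them (inverse Vandermonde,
# computed once in advance).
BASIS = np.linalg.inv(np.vander(NODES, increasing=True).astype(float))

def interpolate(samples, t):
    """Per output sample: one 7x7 matrix-vector product to derive
    the coefficients, then Horner's rule -- 6 multiply-adds -- to
    evaluate at t, with 0.0 <= t < 1.0 past the middle sample."""
    c = BASIS @ np.asarray(samples, dtype=float)
    acc = c[6]
    for k in range(5, -1, -1):            # Horner's rule
        acc = acc * t + c[k]
    return acc

s = np.sin(2 * np.pi * 0.05 * NODES)      # toy samples
print(interpolate(s, 0.37), np.sin(2 * np.pi * 0.05 * 0.37))
```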

Also, if the polynomial is to be plotted, then the positions along the X-axis are assumed to form a continuous interval, for which reason the actual polynomial needs to be derived.

Dirk


When Audacity Down-Samples a Track

In This Posting, the reader may have seen me struggle to interpret what the application ‘QTractor’ actually does, when told to re-sample a 44.1 kHz audio clip into a 48 kHz audio clip. The conclusion I reached was that, at maximum, the source track can be over-sampled 4x, after which the maximum frequencies are also much lower than the Nyquist Frequency, so that if a Polynomial Filter is applied to pick out points sampled at 48 kHz, minimal distortion will take place.
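My reading of that process, sketched out – this is my reconstruction, not QTractor’s documented implementation, and the simple linear fill-in stands in for whatever actual filter is used:

```python
import numpy as np

def resample_up(x, src_rate=44100.0, dst_rate=48000.0):
    """Over-sample the source 4x (np.interp here is a placeholder
    for a real interpolation filter), then pick out the 48 kHz
    points from between the now densely-spaced sub-samples."""
    n = np.arange(len(x), dtype=float)
    fine_t = np.arange(0.0, len(x) - 1.0 + 1e-9, 0.25)   # 4x grid
    fine = np.interp(fine_t, n, x)
    # Positions of the destination samples, in source-sample units:
    dst_t = np.arange(0.0, len(x) - 1.0, src_rate / dst_rate)
    return np.interp(dst_t, fine_t, fine)

x = np.sin(2 * np.pi * 1000.0 * np.arange(441) / 44100.0)
print(len(x), "->", len(resample_up(x)))
```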

If the subject is instead how the application ‘Audacity’ down-samples a 48 kHz clip into a 44.1 kHz clip, the problem is not the same. Because the Nyquist Frequency of the target sample-rate is then lower than that of the source, it follows that frequencies belong to the source which will be too high for the target. And so an explicit attempt must be made to get rid of those frequency components.

The reason Audacity is capable of that is the fact that a part of its framework causes a Fourier Transform to be computed for each track, with which that track is also subdivided into overlapping sampling windows. The necessary manipulation can be performed on the Fourier Transform, which can then be inverted and merged back into a resulting track in the time-domain.

So for Audacity just to remove certain frequency ranges, before actually re-sampling the track, is trivial.
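A sketch of that kind of frequency-domain clean-up, assuming overlapping Hann windows and a brick-wall cut at the target Nyquist Frequency; this illustrates the principle, not Audacity’s actual code:

```python
import numpy as np

def lowpass_stft(x, rate, cutoff_hz, win=1024):
    """Zero every Fourier bin above 'cutoff_hz' in overlapping,
    Hann-windowed frames, then invert each frame and overlap-add
    the results back into a time-domain track."""
    hop = win // 4                      # 75% overlap
    w = np.hanning(win)
    cut = int(cutoff_hz * win / rate)   # first bin to zero out
    pad = np.concatenate([np.zeros(win), x, np.zeros(win)])
    y = np.zeros(len(pad))
    for start in range(0, len(pad) - win, hop):
        frame = np.fft.rfft(pad[start:start + win] * w)
        frame[cut:] = 0.0               # brick-wall low-pass
        y[start:start + win] += np.fft.irfft(frame, win) * w
    # Hann-squared frames at 75% overlap sum to roughly 1.5:
    return y[win:win + len(x)] / 1.5

x = np.random.randn(48000)              # 1 second of noise @ 48 kHz
y = lowpass_stft(x, 48000.0, 22050.0)   # now safe to down-sample
print(len(x), len(y))
```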

If my assumption is correct, that QTractor does not have this as part of its framework, then perhaps it would be best for this application only to offer to re-sample from 44.1 kHz to 48 kHz, and not the other way around…

Dirk