# Linear Predictive Coding

Linear Predictive Coding is a method of using a constant number of known sample-values, that precede an unknown sample-value, and to find coefficients for each of the preceding sample-values, which they should be multiplied by, and the products summed, to derive the most-probable following sample-value.

More specifically, while the exercise of applying these coefficients should not require much explaining, methods for deriving them do.

Finding these coefficients is also called an Auto-Correlation, because the following sample is part of the same sequence, as the preceding samples belonged to.

Even though LPC travels along the stream of values, each preceding position relative to the current sample to predict is treated as having a persistent meaning, and for the sake of simplicity I’ll be referring to each preceding sample-position as One Predictor.

If the Linear Predictive Coding was only to be of the 2nd order, thus taking into account 2 Predictors, then it will often be simple to use a fixed system of coefficients.

In this case, the coefficients should be { -1, +2 }, which will successfully predict the continuation of a straight line, and nothing else.

One fact about a set of coefficients is, that their sum should be equal to 1, in order to predict a DC value correctly.

( If the intent was to use a set of 3 predictors, to conserve both the 1st and the 2nd derivatives of a curve, then the 3 coefficients should automatically be { +1, -3, +3 } . But, what’s needed for Signal Processing is often not, what’s needed for Analytical Geometry. )

But for orders of LPC greater than 3, the determination of the coefficients is anything but trivial. In order to understand how these coefficients can be computed, one must first understand a basic concept in Statistics called a Correlation. A correlation supposes that an ordered set of X-value and Y-value pairs exist, which could have any values for both X and Y, but that Y is supposed to follow from X, according to a linear equation, such that

Y = α + β X

Quite simply, the degree of correlation is the ideal value for β, which achieves the closest-fitting set of predicted Y-values, given hypothetical X-values.

The process of trying to compute this ideal value for β is also called Linear Regression Analysis, and I keep a fact-sheet about it:

This little sheet actually describes Non-Linear Regression Analysis at the top, using a matrix which states the polynomial terms of X, but it goes on to show the simpler examples of Linear Regression afterward.

There is a word of commentary to make, before understanding correlations at all. Essentially, they exist in two forms

1. There is a form, in which the products of the deviations of X and Y are divided by the variance of X, before being divided by the number of samples.
2. There is a form, in which the products of the deviations of X and Y are divided by the square root, of the product of the variance of X and the variance of Y, before being divided by the number of samples.

The variance of any data-set is also its standard deviation squared. And essentially, there are two ways to deal with the possibility of non-zero Y-intercepts – non-zero values of α. One way is to compute the mean of X, and to use the deviations of individual values of X from this mean, as well as to find the corresponding mean of Y, and to use deviations of individual values of Y from this mean.

Another way to do the Math, is what my fact-sheet describes.

Essentially, Form (1) above treats Y-values as following from known X-values, and is easily capable of indicating amounts of correlation greater than 1.

Form (2) finds how similar X and Y -values are, symmetrically, and should never produce correlations greater than 1.

For LPC, Form (2) is rather useless, and the mean of a set of predictors must be found anyway, so that individual deviations from this mean are also the easiest values to compute with.

The main idea when this is to become an autocorrelation, is that the correlation of the following sample is computed individually, as if it was one of the Y-values, as following each predictor, as if that was just one of the X-values. But it gets just a touch trickier…

(Last Edited 06/07/2017 … )

(Edit 05/27/2015 : To rephrase what I just wrote, the initial premise is that any one predictor has the capacity to predict the following Y-value by itself, but each predictor will be associated with a different coefficient, corresponding to the different slopes, and if it is a 4th-order autocorellation, there will initially be 4 predictors, aiming to predict the same result, so that their values would also need to be divided by 4 in that case, so that their sum will have the correct scale. )

The basic procedure should be alike to:

1. Compute the mean of all the predictors.
2. Compute the variance of the predictors.
3. Compute by how much a known, following sample-value differs from this same mean.
4. Multiply this deviation by all the individual deviations of the predictors, from their mean.
5. Divide the resulting vector of values by the variance of the predictors.
6. Average over the full set of known, computable following values.
7. Divide it again by the number of predictors, which is also the order of the LPC.
8. Add the reciprocal of the number of predictors to all the resulting coefficients.

Now, there is a way to simplify the computation slightly, by changing Step (2) above to

• Compute the Sum of (the deviations of the predictors from their mean, squared).

And by removing Step (7). But the benefits of such a simplification may be short-lived.

I suppose that a valid question to ask might be, whether it is correct for me to use the variance of X as belonging to one sequence of predictors, to compute the correlation that leads to Y. Technically, if we read the fine-print, we might be computing the variance of one predictor, over the entire data-set, to divide the numerous products with Y by.

But over the data-set, signal-amplitudes might change, and one concept which we’d like to predict, is that the amplitude of one value of Y be consistent with the amplitude of predictors that lead up to it. All the values of X eventually belong to the same data-set, that has variance (along with the Y-values), but we’d like for predicted Y-values to track short-term amplitude changes.

So there is the correlation with respect to one sequence of predictors, as opposed to the correlation with respect to the whole data-set.

If this approach was given nothing but a straight line

{ -1, 0, +1, (+2) }

to learn from, it would compute the 3rd-order coefficient-vector

{ -2/3, +1/3, +4/3 }

and this, if given the predictors from above, would again yield

( +6/3 ) == +2 .

The correct 2nd-order coefficients follow from the sample-sequence

{ -1, +1, (+3) } ->

{ (-1.5 + 0.5), (+1.5 + 0.5) }

== { -1, +2 } .

(Edit 05/31/2017 :

If we were given the example of teaching a 4-Predictor system a straight line again, then the ‘learning’ sample-values would be:

{ -3, -1, +1, +3, (+5) }

resulting in the coefficient-set

{ -10/20, 0/20, +10/20, +20/20 } .

And then, plugging in the 4 predictors we started from, we re-obtain (+100/20 == +5) . So it would seem then, that our method for determining the scale of the following-value is correct. But clearly, there are many combinations of 4 values, that could all add up to (+5). And if we reduced the amplitude of the learning-samples, then we’d be dividing a lower following-value (first multiplied by lower predictors), by a lower variance of the predictors, in a way that should remain proportional.

These initial tests would look disappointing, if we were aiming to reproduce the efficiency with which a fixed, 3-predictor-set predicted the 2nd derivative – as deduced by me. But for that purpose we’d need to bear in mind, that over an entire data-set- with at least one full cycle of a waveform, there would be three types of shorter sub-regions:

1. A part of the curve where the 1st derivative dominates,
2. A part of the curve where the 2nd derivative dominates,
3. A part of the curve where both the 1st and 2nd derivative are mixed in contributing to the variance of X.

If the 1st derivative was close to zero, then whatever variance is present in the predictors, should be due to the 2nd and higher derivatives. And so strong results opposing the middle predictors are possible. They just don’t last throughout the whole data-set. And so I still can’t be 100% sure that this system is scaled correctly for all cases.

Parabola with a Center-Value equal to (-5), x2 term = (+4) –>

{ +4, -4, -4, +4, (+20) } –>

{ +3/2, -1, -1, +3/2 }

But we could use Human Deduction, just to add these two sets of predictors together, like so:

{ +3/4, -5/4, -3/4, +9/4 }

The problem with this is the fact that while Human Deduction can recognize that the LPC representation of Sub-Regions (1) and (2) above are orthogonal, and that they can therefore just be added – minus the (+1/4) bias per sample, simply computing the autocorrelation will tend to average them over a longer data-set, in which each 4-sample set of predictors is just a narrow window. So I really need to study the subject more closely, of whether my approach to scaling the results is correct in all cases.

The relevant question would be, If we are to compute an 8th-order autocorrelation, would it still be the case that its coefficients store two types of information, orthogonally? And what I visualize is that as one sine-wave is advanced along another with an equal amplitude, a single sine-wave results at (1/2) amplitude, with no consideration that it should be phase-shifted. And as one computes the variance of a single sine-wave, the result will be (1/2) times its amplitude squared, so that the computed, peak correlation between equal sine-waves will follow as 1.

Conclusion as of 06/02/2017: If the number of predictors equals (n), also known as the order of the LPC, Step (7) above as belonging to Pass 1 needs to remain as it is:

• Divide the Coefficients stemming from the predictor-set X, by n .

)

(Edit 06/07/2017 :

In order for a better solution to exist, if the problem I thought of above was in fact a problem, it would be necessary to compute not only the products with Y separately, but also the variance of each predictor, as belonging to the Set X. But, because the same data-set is shifted along all the predictors, their variance as such will end up being similar, over the full data-set. And so the best we can do is to differentiate how the products could be unique to each predictor, while their variance should still be consistent with the interval of X, in relationship to Y.

Further, it will not always be predictable, that the predictors foresee a continuation of the first derivative, as a reference-point for normalizing their coefficients. For example, the set of predictors could witness a single sine-wave. Such a sine-wave will seem to be more positive on one end of their interval, while it would also be more negative on the other side. If the first derivative was always to be followed, that sine-wave should be followed by a continuation in the direction of its last half.

But in that case, a repetition of the sine-wave will see a reversal, becoming more positive again, after a half that was more negative, and vice-versa. And so the notion that the first derivative should always be learned by a longer sequence of predictors, is also a foregone conclusion. For which reason that cannot be used first to amplify, and then to normalize them.

One problem in using the precomputed system of 3 coefficients above, which conserves the 1st and 2nd derivative, in compressing signals, would be that it acts as a strong high-pass filter, since two of its coefficients are an adjacent (+3) and a (-3). If the signal contained a lot of white noise, then this would be amplified strongly. The advantage which the given 4-coefficient system has to offer, is that it would amplify the white noise less.

It would actually make more sense to use the 3-coefficient system, which continues to predict a straight line, but which dampens the white noise, more than a 2-coefficient system would.

And the plausible special cases in which a 2-coefficient or even a 1-coefficient system system will work best, is if we are to FLAC-compress a very-low-amplitude signal consisting of white noise. The predictors will act as a high-pass filter, since now they are differentiating the signal. But then the residual might have a much-lower amplitude than the signal format permits, just because the signal itself does. And then, FLAC will adjust the Rice Constant, with which it encodes that residual, to some number of bits fewer than 16 or 24, resulting in a slightly-smaller file, in ideal cases.

I think that my experiments with a FLAC Codec (computed by other people who are experts, not by me) have shown me, that the real results are often much worse than ideal, and with FLAC, we are relying on the Codec choosing whichever set of predictors produces the better results, based on a practical example of a signal to be compressed.

And yet, If FLAC developers were given even a slightly-better example of a 32-coefficient implementation, the result might be, that the Codec can select this format for a larger number of blocks than it presently does, and that some very slight improvement might take place, in the overall compression given to practical examples of signals. )

This approach produces results, with high orders, i.e. with high numbers of predictors, which may be sub-optimal. I don’t know if it still works well with an order of 8, but it should start to produce sub-optimal results with orders approaching 32 !

The reason for this is the fact that this approach fails to distinguish between low correlation, and accurately-predicted, low values. By the time we have 32 predictors, some of those just won’t correlate with the following sample-value, while others will correlate more-strongly, and would predict the following value, all by themselves. But, because we are aiming for 32 coefficients, the strongly-correlating samples will get washed out, maybe being divided by 32, while the majority will simply not offer any useful output.

Thus, if we need to compute 32 coefficients, I can suggest a better way. The predictors can be broken into groups, maybe starting with a group of the 4, closest to the following-value?

After that, a second pass can be made over the sample-data, which takes the following-value predicted by the first group into account, resulting in a difference, of known following-values, from the predicted values from the previous pass. This difference can then be put through another auto-correlation, this time revealing the best-suggested values for the preceding 4 coefficients, so that we obtain a sub-total of 8 coefficients.

But then a 3rd pass can be computed, that takes passes 1 and 2 into account, and that reveals another 8 coefficients, resulting in a total of 16. And after a total of 4 passes, 32 coefficients should be revealed.

It remains possible that the last of those coefficients to be computed will not correlate much, with the difference of the known following-value, from the results of passes 1, 2, and 3. But then those coefficients will simply receive values close to zero, and also not interfere, while the first 4 or the first 8 will not be neglected either…

I would suggest that the mean be recomputed, in order to derive the individual deviations – and variance – of the predictors of later passes, but that the following-value not be compared with anything other, than what the sum of the previous passes would have predicted, before these values are multiplied – and divided.

Hypothetically, (1/4) would need to be added to the coefficients from Pass 1.

The concept which I have just sketched works not only for sound compression, but could have quite a few other uses, such as using previously-known position-vectors of a submarine, to try to predict a following position-vector for the same submarine, while it’s trying to be evasive, and while we’re trying to fire a torpedo at it, which takes time to arrive at its target…

Furthermore, I suspect that modern examples of machine learning, are also based more-or-less on Statistical Regression Analyses.

Even many decades ago, photographs of objects were passed through “masks which contained scribbly lines and patterns”. A single light-value was gathered with each mask, from the same photo.

But, a Regression Analysis was performed, which predicted how strongly each light-value correlated, with the binary truth, of whether photos which the computer was supposed to learn from, constituted chairs, houses, bridges.

After that, the computers could gather the light-values from previously-unknown photos, with each mask, and multiply those light values with the respective correlations. And the computers were able then to do better, than just guess, as to whether the photo was of a bridge a house or a chair.

(Edit 05/27/2017 : )

The astute reader will notice that if this approach is applied to signal-processing, filtering, or sound-compression, it will yield a time-domain equivalent, of phenomena which are primarily shaped spectrally. I.e., if we are to compute the autocorrelation over a 32-sample interval of predictors, then the positions of the predictors which correlate will have something to do with the frequencies of the sound or signal we are computing them from, or ‘learning’ them from.

Hence, if the original sound was that of a violin playing a steady note, with its bow slowly being drawn over the string, there is every possibility that this approach will yield good predictions. And this would be due to non-zero correlations being repeated over the entire interval of 32 predictors. In that case, each coefficient is supposed to be ‘weak’, so that 32 of them can describe the waves of the violin’s sound in a way that partially repeats itself – and do so with correct final amplitude.

But, if the signal has transient features, with a duration shorter than 32 samples, or if it has stronger short-range order than it has long-range order, then a system of 32 predictors might mysteriously yield poorer results, than a system of 4 predictors did.

And it would be for this latter case, that I would suggest computing them in passes. The way I’ve visualized these passes, each subsequent pass adds the same number of coefficients as were already provided by all the preceding passes. So if the sequence can be described with the first 4 coefficients it will. But then the next 4 will modify the results of the first 4, if the long-range order deviates from the short-range order significantly. And then the subsequent 8 coefficients have the potential to modify the results from the preceding 8 again, in case long-range order differs from short-range order predictably…

So the results in coefficients would be quite different, from what we would obtain, if we were just to attempt to compute 32 coefficients in one pass.

If we’re specifically given a compression-scheme such as FLAC, then the underlying assumption is, that a representation of the predictor as consisting of 32 coefficients will only be applied by the Codec, if it gives better results than applying 4 coefficients would have given. In that case, we are not setting out to devise a way of computing 32 coefficients, designed to be universal.

And so in that case, since an application of a static, standard set of 4 coefficients did not yield a better result, it might make sense to downgrade what I described here into a 3-pass process, in which the first pass tries to compute 8 coefficients.

According to what I’ve written, doing so could break what a 4-predictor system can predict, but then the 4-predictor system can be selected, instead of the 32-coefficient system.

(Edit 06/02/2017 : In fact, for that usage, I could further simplify my hypothetical system into a 2-pass process, in which the 2nd pass computes the remaining 24 coefficients in one shot. Doing so might improve the ability of Pass 2 to tune in, to lower frequencies, than those which a last 16-sample stretch could tune in to.

)

There is an issue with this subject which I put off mentioning so far.

I had written, that the sum of the coefficients needs to equal 1, so that any DC value is preserved. This necessity goes beyond the preservation of signals. Instead of the samples holding real numbers, they could hold position vectors, at any distance from the origin. Regardless of whatever other errors the LPC method may inject, a sudden shift of predicted position towards or away from the origin would be unacceptable. Errors in position, with respect to earlier positions, are still more likely to be acceptable.

In my summary of the method, Steps 1-6 will always produce coefficients, the sum of which is zero. And my personal enhancement to this method, which is to add predictors in groups, counts on this.

If the method was, to compute 32 coefficients in one pass, then the easiest way to ensure that their sum is 1, is to add 1/32 to all of them.

My method was, compute 4 coefficients in Pass 1, and add 1/4 to each. It would therefore become the responsibility of the first pass, to make sure that the entire set of coefficients has a sum of 1.

The only way I could make sure of this, was to recompute the mean of any group of predictors, added in Passes 2-4. The sum of the deviations, of all the samples belonging to one group, from the mean, of the same group of samples, will always equal zero. Multiplying all of them by the deviation of the one Y-value – my so-called following-value – still keeps their sum equal to zero. Dividing them all by one value for variance, or by one value for the number of predictors in the group – still keeps their sum equal to zero.

If I have a group of coefficients whose sum is equal to 1, and add another group of coefficients to the sequence, whose sum is zero, then I arrive at a new group of coefficients whose sum is 1. I can do the latter part as often as I like, and the result will still be a group of coefficients, whose sum is 1.

If I had done this any differently, I would have broken my method.

But, if the number of predictors in Pass 1 is merely 4, then 1/4 is a very strong additive. There could be an adverse effect on Passes 2-4, simply because this additive is so high. Also, if this group was fed a sequence of 4 zero sample-values, its result would be zero.

The computer on our submarine could only predict, that the enemy sub is somewhere in the vicinity of its 4 last-known positions.

If I could start with 8 predictors in Pass 1, then my corresponding additive would be 1/8, which would exert less bias on Pass 2. And, this group would need to be fed a sequence of 8 zero sample-values, so that its following-value became zero.

The computer on our submarine could now predict, that the enemy sub is somewhere in the vicinity of its 8 last-known positions…

The only real solution to this problem, might be to design a system, by which the sum of all the coefficients is actually zero, but by which a virtual set of coefficients is created by code, which have all had (1/n) added to them, whenever (n) predictors are actually being used, to predict a following-value. This virtual structure would also be used in subsequent passes, to determine by how much the known following-value differs from that predicted by earlier passes.

But, as soon as we export, say, 32 coefficients, those would have (1/32) added to them permanently, so that they can be applied easily.

I can think of two reasons not to do this:

1. In computing the effect of Pass 1 on Pass 2 etc., it creates consistency issues, which did not exist, when the coefficients from Pass 1 had one set of values, that did not change,
2. Even if the groups of predictors added by Passes 2-4 only had means of zero, If they formed a path pointing in one direction, leading up to their values or positions according to Pass 1, their variance would not be zero, and could still predict that the true following-value differs greatly from that, as predicted by Pass 1. My unease with leaving it the responsibility of Pass 1, to ensure that the sum is equal to 1, is not specific and fully-formed in my mind.

(Edit 05/31/2017 : )

There is an example I had run in to some time ago, of how certain Students of this subject might have known, that they were supposed to compute correlations between the Y-value, with the preceding X-values in their positions – i.e. with the predictors – but how those Students might have simply plugged in the wrong equation, to find correlation. These Students had used the equation, which divides the product of X and Y by the square root, of the product, of the variance of X over the set of predictors, and the variance of Y over the entire data-set. This was the type of symmetrical correlation which I had mentioned above as being incorrect, because it will never result in correlations greater than +/-1.

To understand how this might be incorrect, one needs to take into account, that all the correlations may eventually be multiplied by (1/n), if there are (n) predictors. Since none of the correlations could then be greater than 1, and since some of the correlations will be less than 1, it would follow that the amplitude of the Y-values which are predicted, will generally be lower than that of the predictors.

And so accurate predictions would not follow.

If the amplitudes of X had an average of 1, but if those of Y had an average of 2, and if perfect correlation was taking place, then

Y = 2 X

And, the variance of Y would be 4, where the variance of X would be 1. The square root in question would be (2), so that a correlation of 1 would still be computed.

However, if the autocorrelation was to be computed in some way, that causes all the products to be divided by (n/2) instead, then this conclusion is no longer obvious, because individual correlations greater than (1/n) may still result, so that the overall amplitude of the predictors might still be reproduced.

A related question turns out to become irrelevant in the long run, about whether the mean of Y needs to equal zero. It is more important that the mean of all the predictors X are eventually zero, so that greater or lesser values of Y are multiplied by positive versus negative values of X, to yield the same correlation of Y as a function, of changes in X.

(Update 4/05/2019, 6h50 : )

After much contemplation, I’ve decided that I have more to add, about the subject of LPC as it applies to audio compression, that is, as it applies to FLAC.

What FLAC does, is to subdivide the audio stream into blocks, and for each block, to decide anew, which out of a finite set of methods to use, to compress the audio samples in that block, which also results in some header information for the block. That header information may simply state, that a set of 3 Predictors is to be used as the method, or, it may state what an actual vector of coefficients is, for an arbitrary number of Predictors.

FLAC happens to possess a set of fixed Predictor-coefficients, up to an order of 4, just so that if an LPC order up to 4 is selected for any block, the header information does not need to specify the coefficients.

Without my simplifying overly, FLAC will choose whichever method results in the lowest-amplitude residual, which is also the error of the prediction, stated as an exact, literal value, to make the scheme lossless.

I’ve come to realize that from the above text, the method which had 3 coefficients would also be the least likely to be useful. The reason for that is the fact, that it just acts too strongly as a high-pass filter. But, because of the way audio compression works, any method which is the most efficient available at least part of the time, may be a useful one to include in the compression scheme.

And so as a 3-Predictor method, a set of coefficients may also be useful, that keeps predicting the continuation of whatever signal component is at the Nyquist Frequency, as well as predicting the continuation of a linear trend. That set of coefficients would just be, [ -1, +1, +1 ] .

But, one problem which may exist as explained, in incorporating this hypothetical idea into FLAC, is that aside from stating that ‘the 3-Predictor method’ is going to be used, the header information may not be able to change, whatever that set of coefficients is, in any encoded stream. Thus, the people who actually designed FLAC would either:

• Need to have predicted that this set of coefficients has a good chance of pulling its weight, before the first FLAC encoding program was written, or
• Have the option to state what coefficients are to be used, even when their number is 3.

Actually to try to change one of the fixed sets of coefficients, after the first version of FLAC has been implemented, is infeasible because it would just introduce incompatibilities.

Dirk