A Note on FLAC -Compressing 24-bit

One note which I had commented about before my blog began, was that if authors decide to capture sound at 96k samples /second, the resulting sound should compress well using FLAC.

But now that I have experimented with ‘QTractor‘ and an external sound card, I have realized that we will probably also be capturing that sound in 24-bit sample-format, instead of 16-bit. And the sad fact is, that FLAC will not compress the 24-bit format as well, as it did 16-bit.

The reason seems clear. Using ‘Linear Predictive Coding’ means that FLAC will be able to predict the next sample in a set of so-many, to maybe 8 bits of precision, except that the next sample will always deviate from this prediction by a small residual. So 8-bit sound should compress brilliantly.

But then with 16-bit, the accuracy of the encoding stays the same. So again, the ‘LPC’ is really only 8-bits accurate at best, meaning that we get a larger residual. The size of that residual is what makes up most of a FLAC File.

Well at 24-bit, again, the LPC will only predict the next sample, accurately to within 8 bits. And so the residual is likely to be twice as large, as it was with 16-bit, completing 24-bit accuracy this time. We are not left with much compression then.

When I recorded my 14-second sound session the other day, I selected FLAC as my capture file format. I had a noisy air-conditioner running in the background. Additionally, the compression level defaults to Fastest, because the file needs to be written in real-time, and not chewed on.

At 96 kHz, 24-bit stereo, raw audio will take up about 4.6 mbps. At 44.1 kHz, 16-bit stereo, raw audio takes up about 1.4 mbps.

Well I was capturing to a stereo FLAC File, but was only using one channel out of the two. So the FLAC File that resulted, had a bit-rate of 2.3 mbps. This means that FLAC recognized the silent track and used ‘Run-Length Encoding’ on it, but that was about all this CODEC could do for me.

Now, we do have a command-line tool which will-re-compress that file:


$flac -8 infile.flac -o outfile.flac$ flac -8 infile.flac --channels=1 -o outfile.flac
\$ flac -8 infile.flac --channels=1 --blocksize=8192 -o outfile.flac



The -8 means to use maximum compression.

For me, the bit-rate went down to 2.2 mbps either way.

It beats using a raw format, because using the latter would have meant, nothing would have detected my silent stereo channel, and the file would have been twice as large.

Dirk

Testing of USB Sound Device Complete.

According to my previous posting, I needed to do a more thorough test of the USB Sound Card I have bought, which is a “Focusrite Scarlett 2i2“.

In particular, I needed to address the discrepancy according to which, the Linux JACK daemon reports capture at 32 bits, while the specifications of the sound card state a 24 bit sample format.

Also, I needed to be sure whether it would run as well at 96 kHz, as it already did at 48 kHz.

According to my more complete test, the 32-bit sample-format which ‘QJackCtl‘ shows me, which can be viewed in its Messages box, state the ALSA parameters and not the JACK internals. Therefore, JACK has after all chosen to capture and/or play back audio at a physical 32 bits, at the 96 kHz sample-rate. This is not, after all, a statement of the JACK internal behavior.

Since I am using Linux, and since the manufacturer chose to rate this capture device as only being capable of 24-bit capture, I must assume that for hardware reasons the device uses 32-bit registers, but that only the first, most-significant 24 of those bits are accurate. Therefore, when I open ‘QTractor‘ – the Digital Audio Workstation / Tracker application, it is best to truncate its capture format to 24 bits as well, which is most probably what the Windows or Mac drivers for this device do.

Aside from that, using QTractor next, to capture a 96 kHz, 24-bit, stereo FLAC file was easy and uneventful. Further, the stability of my software suggests that I can play with the GUIs as much as I need to, to figure them out, and I will not screw anything up.

After I closed JACK, I next imported this FLAC file, that plays for 14 seconds, into “Audacity“, which has been set up to use the default sound settings (‘PulseAudio‘), and which performs an on-demand re-sampling of the FLAC file.

The on-demand FLAC playback is not filtered well by Audacity, but since it is running at 96 kHz, compared with the 44.1 kHz that the internal sound of the laptop runs at, this observation is not surprising.

And then the captured sound clip simply contains, what I spoke into my microphone.

Dirk

aptX Handles Polyphonic Sound Surprisingly Well.

Right now, as I am typing this, I am listening to Beethovens 9th Symphony on my real “LG Tone Pro HBS-750″ Bluetooth Headphones. The quality of sound is dramatically better, than what the fake HBS-730s had produced, simply because those were fake.

This recording of Beethoven is stored on my phone, as a series of FLAC files, and Android Lollipop devices are well-able to play back FLAC files. I did this, in order to test the fake headphones at first, because I was not sure whether their poor performance then was due to some interaction of aptX, with MP3 or OGG compression, rather than due to the implementation of aptX I was getting. Playing back a FLAC file is equivalent to playing back a raw audio file.

From what I read, aptX not only splits the uncompressed spectrum into 4 sub-bands, but then quantizes each sub-band. The 4 sub-bands are approximately from 0 to 5.5 kHz, from 5.5 to 11 kHz, from 11 kHz to 16.5 kHz, and finally from 16.5 kHz to 22 kHz. These sub-bands are then compressed using ADPCM, which allocates 8,4,2,2 bits to each.

This implies, that the first sub-band contains the bass and the mid-range, and that what I would call ‘melodic treble’ sounds, do not extend beyond sub-band 2, since treble notes with fundamental frequencies higher than 11 kHz are not usually played. And sub-bands 3 and 4 simply add texture to the sound. This means, that to allocate fewer bits of precision to sub-bands 3 and 4 ‘makes sense’, since our natural way of interpreting sound, already sees less detail at those frequencies.

A question which I had raised earlier, was if the act of quantizing the sub-bands 3 and 4 greatly – down to 2 bits in fact – will damage the degree of polyphony that can be achieved.

And now that I possess true headphones I am finding, that the answer is No. The sub-bands 3 and 4, are still capable of being played back in a multi-spectral way, even though their differentials have been quantized that much.

(Edit 06/25/2016 : ) Instead of receiving a regular sequence of +1, 0 and -1 data-points, it is possible to receive an atactic sequence of them. The first thing that happens when decoding that, is an integration, which will already emphasize lower, original frequency components that have been deemphasized. After that, the degree with which the analog signal can be reconstructed is only as good, as the interpolation. And in practice, interpolation is often provided by means of a linear filter which has more than two coefficients. Having a longer sequence of coefficients, such as maybe 6 or 8, provides better interpolation, even in sub-bands 3 and 4, which we supposedly hear less-well.

I do find though, now that the entire signal is much more clear, that when I listen closely, the highest frequencies belonging to Beethovens 9th, seem to have slightly less resolution than they are truly supposed to have. But not as much less resolution, than I am used to hearing, due to poor headphones, or due to MP3 compression.

It is already a dramatic improvement over what my past told me, that today, Some Bluetooth Headphones can play back high-quality music, in addition to being usable for telephony.

Now, Beethoven died before he finished his 9th symphony, and later artists officially completed it, by adding the 5th movement, which is actually “Shiller’s Ode To Joy”. According to what I am hearing, that 5th movement is compromised more by the aptX compression than the first 4 were, that were actually written by Beethoven.

The reason seems to be the fact, that Shiller’s work is more operatic, and has choruses singing very high notes, which results in a lot of the signal energy being in the 2nd, 3rd and 4th sub-bands. So when I hear that movement, I can hear the quantization quite clearly.

It is usually not a preference of mine, to listen to this 5th movement, because I don’t find it to be authentic Beethoven. Right now I am listening to it, and observing this effect with some fascination.

Dirk

Not Being Sure about the Sign Bit

One format in which MP3 can encode stereo, is in the form of “Joint Stereo”. In this form, the signal is sent as a sum-channel, and a difference-channel. The left and right channels can be reconstructed from them, just as easily as a sum and a difference, could be computed. The reason this is done, is to save on the bit-rate of the difference channel, along the argument that human stereo-directionality is more limited in frequencies, than straightforward hearing is.

But this is also one example, in which the difference channel needs to have a defined sign, as either being positive when left is more so, or being more positive when right is more so.

And so one way in which this could be encoded, if the decision was made to include doing so into the standard, could be to encode an additional sign-bit into each non-zero frequency coefficient of the difference channel. But doing so would also affect the overall bit-rate of the signal enough, that this deters professionals from doing it. Also, the argument is made that lower bit-rates can lead to higher sound quality, because if they want, users can increase the bit-rate anyway, resulting in greater definition of the more audible components of the sound.

And so with Joint Stereo, a feature built-in to MP3 is, “Side-Bits”. These side-bits are included if the mode is enabled, and declare as header information, how the signal should be panned, either to the left or the right, as well as what the sign of the difference-channel might be. ( :1 )

Well there is an implementation detail about the side-bits, which I do not specifically know: Whether they are stored only once per frame, or once per frequency sub-band, since the compression schemes do divide the spectrum into sub-bands already. Thus, if the side-bits were encoded for each sub-band, for each frame, then a good compromise could be reached, in terms of how much data should be spent on that.

The only coefficient which I imagine would have a sign-bit to itself each time, would be the zero coefficient, that corresponds to DC, but that also corresponds to F = Sampling Interval / 2.

This question becomes more relevant for surround-sound encoding. If, rather than using the Pro Logic method, some other scheme was decided on, for defining a ‘Back-Front’ channel, then it would suddenly become critical, that this channel have correct sign information. And then it would also become possible for this channel to correlate negatively with frequency components that also belong to the stereo channels. Hence, a more familiar method of designing the servos of Pro Logic II would be effective again, than what would be needed, if none of the cosine transforms could produce negative coefficients. ( :2 )

Dirk

P.S. When we listen to accurately-reproduced signals on stereo headphones, If the signals are perfectly in-phase, humans tend to hear them as if they came from inside our own heads. If they are 180 degrees out-of-phase, we tend to hear them as coming from a non-specific place outside our heads.

I have had a personal friend complain, that when he listens to Classical Music via MP3 Files, he cannot ascertain the direction which sounds are coming from – i.e. the positions of the instruments in the orchestra, but can hear panning and this in-versus-out condition. Listening to Classical Music via MP3 is a very bad idea.

What this tells me is that my friend has very good hearing, which is tuned to the needs of Classical Music.

The reason he hears some of the sound as being in the out-of-phase position, may well just be due to the fact that Joint Stereo was being selected by the encoder, and that certain frequency components in the difference-channel, did not have substantial counterparts in the sum-channel. Mathematically, this results in a correlation of zero between the encoded channels, but in a correlation of -1 between the reconstructed left and right…

Subjectively, I would say that I have observed better sound quality in this regard, when using OGG Compression, and at a higher bit-rate. I found that “Enya” required 192kbps, while “Strauss” did not sound good until I had reached 256kbps.

But I do not know objectively, what it is in the OGG Files, that gives me this better experience. I do not have the precision hearing, which said friend has. I have used FLAC to encode some “Beethoven” and “Schubert”, but mainly just in order to archive their music without any loss in information at all, and not as a testament, to the listening experience really being that much better.

1: ) In the case of Joint Stereo with MP3, what I would expect, is that the ‘pan-number’ will also direct the decoder to set the polarity of the difference-channel, to be positive with whichever side the sum-channel is being panned-towards more strongly. And I expect this to happen, regardless of what the phase information was when encoding.

If there was explicit sign information here, such information would first also have had to be measured, when encoding, as relative to the phase-position of the sum-channel. Since phase-information is generally relative. And I do not hear speak, of correlation information being collected first when encoding, between the stereo and difference channels.

2: ) This subject peaked my interest into how OGG Compression deals with multi-channel sound. I did an experiment using “Audacity”, in which I prepared a 6-channel project, chose to export it in a custom channel-configuration, and then chose different settings in the channel-meta-data window.

While AC3 was ‘limited’ to allowing a maximum of 7 channels, OGG allows a compressed stream with up to 32 channels. But, I seem to have observed that when compressing more than 2 channels, OGG forgoes even joint stereo optimization, instead only compressing each channel individually. This seems to follow from the observation, that If I mix channels 3 and 4, assigned to a hypothetical front-center and LFE, I should have turned those two into a monaural signal repeated once. But doing so does not improve the OGG File size.

There was a 3m45s stream, which took 5.2MB as a 6-channel AC3. The same stream takes up 18.1MB as a 6-channel OGG. And these bit-rates result from choosing a rate of 192kbps for the AC3 File, while choosing ‘Quality Level 8/10′ for the OGG.

I think that one reason for the big difference in bit-rates, is the fact that my stream consisted of a stereo signal originally, of which there were merely 3 copies. The AC3 File takes advantage of the correlations to compress, while OGG is not as able to do so.

Further, I read somewhere that OGG takes the remarkable approach, to convert the stereo into joint stereo, after quantization (in the frequency domain), while MP3 does so before quantization. This makes the conversion which OGG performs, of a signal into stereo, a lossless process, and also seems to imply, encoding one sign bit with each coefficient of the difference-channel. Any advantage OGG gives to the bit-rate would need to stem, from the majority of low-amplitude coefficients in the difference-channel, as well as from limiting its frequencies.

By contrast, this would seem to suggest, that MP3 will compute an ‘FFT’ of each channel, also in order to determine the side-bits, after which it will compute a sum and a difference channel in the time-domain, and then compute the ‘DCT’ of each…