Debiasing Bit-Streams

One subject which has received a lot of attention in popular computing is how to generate random numbers that are not merely old-fashioned pseudo-random numbers, but truly random, and of sufficiently good quality to be used in cryptography.

In order to achieve this goal, there exist certain packages such as

  • ‘bit-babbler’ (Requires specialized USB keys that must be purchased.)
  • ‘rng-tools’ (Assumes a special CPU feature, which only the most recent CPUs have.)
  • ‘haveged’ (Only present in the Debian Jessie repos and later; not present in the Debian Lenny repos.)
  • ‘randomsound’ (An age-old solution, present in all the older repos.)

An interesting observation I will make about ‘rng-tools’ is that, as of kernel version 3.17, this daemon doesn’t strictly need to be installed, because the kernel has intrinsic support for the CPU’s random-number generator, if one is detected. Higher kernel versions will incorporate its output into ‘/dev/random’ automatically, but not exclusively. Whether the reader has this H/W feature can be determined by installing the package ‘cpuid’ and then running:

cpuid | grep DRAND

The only real advantage which might remain to using ‘rng-tools’ is the configurability of this package for user-defined sources of random data. In other words, a clever user could write a custom script which gathers random data from any source he likes, hashes it, and then offers the hash as a source in his ‘rng-tools’ configuration.
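On Debian-based systems, ‘rng-tools’ is typically pointed at its source device through the file ‘/etc/default/rng-tools’. The following fragment is only a sketch, and assumes that a named pipe ‘/opt/entropy’, like the one created by the script further down, already exists:

```
# /etc/default/rng-tools  --  a sketch, not a tested configuration.
# Point rngd at a named pipe that a custom script keeps filled.
HRNGDEVICE=/opt/entropy
```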

If the user has the ALSA sound-system installed, then the following script might work:



#!/bin/bash

NOISE_CMD="sudo -u dirk arecord --file-type=raw --format=S16_LE --duration=1 -q"

if [ -e /opt/entropy ] ; then
  rm -f /opt/entropy || exit 1
fi

mkfifo /opt/entropy || exit 1
exec 3<> /opt/entropy

# Seed the pipe, so that readers have something immediately.
head -c 256 /dev/urandom >/opt/entropy

sleep 300

while true ; do

  read -n1 -t 1 FIFO_HAS </opt/entropy

  if [ $? -eq 0 ] ; then
    sleep 2

    # Verify that accessing the sound device does not cause an error.
    $NOISE_CMD >/dev/null || exit 1

    # Hash 8 consecutive 1-second captures into the pipe.
    n=1
    while [ $n -le 8 ] ; do
      $NOISE_CMD | \
        shasum -b -a 512 - | cut -c -128 | xxd -r -p >/opt/entropy
      n=$(( n+1 ))
    done
  fi

done
Because this script would run as root, especially on PulseAudio-based systems, it’s important to replace ‘dirk’ above with the main user’s username.

(Edit 01/04/2018 : One tricky fact which needs to be considered, if the above script is to be of any practical use, is that it needs to be started before ‘rng-tools’ is, so that when the latter daemon starts, the object ‘/opt/entropy’ will already exist.

If my script were simply to delete an existing object by that name and create a new one, after ‘rng-tools’ was already running, then the latter daemon would simply be left with an invalid handle for the deleted object. The newly-created pipe would not replace it, within the usage by ‘rng-tools’.

By now I have debugged the script and tested it, and it works 100%.

I leave it running all day, even though I have no use for the generated ‘/opt/entropy’ .

Further, I’ve added a test, before running the command 8 times, to verify that accessing the sound device does not result in an error condition. The default behavior of ‘arecord’ is to block, which is ideal for me, because it means that if the device is merely busy, my script’s invocation of ‘arecord’ will simply wait until the device is available again. )

(Edit 01/06/2018 : If an external application starts to read from the named pipe at an unpredictable point in time, and depletes it, the maximum amount of time the above script will wait before replenishing the pipe is 5 seconds, assuming that the CPU has the necessary amount of idle time.

The script may spend 2 seconds ‘sleep’ing, then 1 second trying to read a byte from the pipe, then 1 second testing the function that’s supposed to generate the ‘Noise’, and then 1 more second actually generating the first block of random bits, out of 8 consecutive blocks.


I should also add that the OSS sound system still exists, although its use has largely become obsolete on mainstream PCs, due to either ‘PulseAudio’ or ‘ALSA’, which coexist just fine. But there is a simple reason to avoid trying to access the device files of the sound device directly:

On-board sound devices usually default to 8-bit mode. In that mode, the probability is much too high that sequences of samples will have an unchanging value, because there is a limit to how poor the behavior can be of the analog amplifiers that feed the A/D converter. And so it is critical that the sound device be switched into some 16-bit mode. This is much easier to do using the ‘arecord’ command than it would be with direct access to the device files.

Given nothing but sequences of 8-bit sound-samples, it should come as no surprise if eventually, the resulting hash-codes actually repeat. :-)  )


I’m going to focus my attention on ‘randomsound’ for the rest of this posting, even though on most of my computers I have better solutions installed. What the package ‘randomsound’ does is put the sound-input device, which presumably has a typical, cheap A/D converter, into Unsigned, 16-bit, Mono, 8kHz mode, and extract only the least-significant bit from each sound sample. In order for that to work well, the bits extracted in this way need to be “debiased”.

What this means is that on old sound cards, the A/D converter is so bad that even the LSB does not have a probability of exactly 50% of being either a 1 or a 0. Thus, the bits obtained from the hardware in this way need to be measured first, to determine the probability of the bit being a 1, and must then be processed further in a way that compensates for the non-50% probability, before bits are derived which are supposedly random.

I’ve given some private thought to how the debiasing of this bit-stream might work, and can think of two ways:

  1. A really simple way, which only uses the XOR function between the current input bit and the most recent output bit,
  2. A more complex way, based on the idea that if 8 biased bits are combined arbitrarily into a byte, then on average, the value of this byte will not equal 128.

The first approach has the disadvantage that, if the bias were strong, the resulting sequence of bits would still not be adequately random.
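As a hypothetical C++ sketch (my own naming, not code from ‘randomsound’), the first approach amounts to a one-line feedback loop:

```cpp
#include <vector>

// Approach (1), sketched: XOR each incoming biased bit with the most
// recent output bit, feeding the result back into the next step.
std::vector<int> xor_debias(const std::vector<int> &in) {
    std::vector<int> out;
    int prev = 0;  // assumed initial state of the output
    for (int bit : in) {
        prev ^= bit;  // current input XOR most recent output
        out.push_back(prev);
    }
    return out;
}
```

Fed a heavily-biased run such as 1,1,1,1, this produces the alternating 1,0,1,0 – which illustrates the disadvantage: the output is whitened, but still far from random.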

The second approach seems better to me, but also requires that 8x as many samples be read in from the sound card as bits that are to be added to the entropy pool.

The way my own, mental version of the debiasing could work is that at first, 256x 8 bits could be read from the sound card, just to calibrate one block of random bits. The byte values could be averaged, to arrive at a temporary ‘bias’ value.

Then, the same 256x 8 bits could be processed again, so that each of the resulting bytes is compared with the value ‘bias’; if the byte read in is greater than that value, a single 1 would be derived. Otherwise, a single 0 would be derived.

The disadvantage of doing things the harder way is that 2048 samples of sound would need to be input, to produce only 256 bits of truly random noise.
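A hypothetical C++ sketch of this second approach, operating on one block of bytes (again, my own naming, not taken from any package):

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Approach (2), sketched: average one block of bytes (each packing
// 8 biased bits) to obtain a 'bias' threshold, then derive one output
// bit per byte of the same block by comparing against that threshold.
std::vector<int> threshold_debias(const std::vector<uint8_t> &block) {
    // Calibration pass: the average byte value becomes the threshold.
    unsigned long sum = std::accumulate(block.begin(), block.end(), 0ul);
    uint8_t bias = (uint8_t)(sum / block.size());

    // Derivation pass: one bit per byte, so 8 input bits yield 1 output bit.
    std::vector<int> bits;
    for (uint8_t b : block)
        bits.push_back(b > bias ? 1 : 0);
    return bits;
}
```

With a 256-byte block, this yields the 256 output bits described above, at the cost of 2048 input samples.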


(Update 11/05/2019, 14h40 : )

A reader named “Valentin” has pointed out a valid concern about my script. Firstly, whether it is of any practical use depends on the situation; it assumes that for whatever reason, ‘haveged’ is not installed. Yet the computer could be of the variety that has a high-quality sound card which, when actually capturing silence, successfully performs the 16-bit analog-to-digital conversion, resulting either in a stream of zeros, or in a stream with very low variance, so that even though each hash-code ‘looks random’, there is little or no ‘true randomness’, i.e., little or no entropy.

The script, as I wrote it, does not test for such situations. Doing so basically requires that a small C++ program be written, assuming that people like to program in C++, which reads in pairs of bytes as signed short integers, and which outputs a single integer indicating the variance of the stream. Here is an example program which does this:
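The original helper program is not reproduced in this copy of the posting; what follows is a minimal sketch of its core, under the assumptions stated in the text: the input is an array of little-endian, signed 16-bit samples, and the output is the variance, truncated to an integer.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Decode pairs of little-endian bytes into signed 16-bit samples,
// matching the raw S16_LE format that the script captures.
std::vector<int16_t> decode_le16(const std::vector<unsigned char> &raw) {
    std::vector<int16_t> out;
    for (std::size_t i = 0; i + 1 < raw.size(); i += 2)
        out.push_back((int16_t)(raw[i] | (raw[i + 1] << 8)));
    return out;
}

// Integer variance of the sample stream.  The mean is the D.C. level;
// with two's-complement samples, a level of (-1) is as likely as (0).
long long compute_variance(const std::vector<int16_t> &samples) {
    if (samples.empty()) return 0;

    double mean = 0.0;
    for (int16_t s : samples) mean += s;
    mean /= samples.size();

    // Variance is the mean squared deviation -- Standard Deviation
    // squared -- and corresponds to signal energy.
    double var = 0.0;
    for (int16_t s : samples) var += (s - mean) * (s - mean);
    var /= samples.size();

    return (long long)var;  // truncated to an integer
}
```

A complete program would read the temporary capture file into the byte vector and print the result as ASCII text, so that the script can compare it against a minimum.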

In turn, the script can be edited to call this program, in a way that tests its text output numerically against some minimum amount of variance. And if a temporary sound file does not satisfy this test, output to the entropy file can be made to wait, and a new sound capture made…


(Update 11/07/2019, 5h40 : )

If the 16-bit sound samples captured by the sound card deviate by an average of at least ±1, then the variance which the helper program above computes will be at least the integer (1). Now, sound streams are possible which deviate by ±1 less than once per sample, in which case my program will give them a variance of (0). But in such a case, I also cannot think of any function of the sample values which would give good randomness.

As long as the variance is in fact (1), given a sample rate of 44.1kHz in stereo, this means that, per second, 88200 samples will each deviate by (+1) or (-1), on average, relative to their mean, which means that the result should be random enough.

And my script does not try to force the sample rate, because some hardware may not support the full 44.1kHz sample rate that so many of us take for granted these days. In fact, one of my computers has a sound card set up with a default sample rate of 96kHz, and my script does in fact capture a variance of (0), just because that sound card is designed for 24-bit capture.




6 thoughts on “Debiasing Bit-Streams”

  1. I suppose that this Sanity Check would also need to take into account, that the 16-bit sound samples would be in two’s complement, which means that a D.C. level of (-1) is just as likely as a D.C. level of (0).


  2. sha512sum -b

    My mistake, the “-b” is “read in binary mode”, I thought it was “write in binary mode”…

    It’s easy to build a noise generator with good enough randomness (amplifier with reverse biased zener diode and automatic gain control). In this case the sanity check would be just checking if there is high enough level of the input signal.

    But in reality just plugging in a microphone and setting up amplification to a high level would work. It’s unrealistic that the attacker would record the same noise and reproduce the state of the kernel’s random number generator.

    I tested your code and it looks like “cut” is not necessary.

    valentin@computer:~$ echo test | sha512sum | cut -c -128 | xxd -r -p | hexdump
    0000000 3e0e 2375 bc4a f468 8a37 b386 b3f4 192a
    0000010 a38b 8401 0c5b e5d6 0601 74e8 5734 cc00
    0000020 6366 6ca8 a11e dc25 925e 17be 8fc9 0f9a
    0000030 ca85 5f9d 5d59 01b2 7c2f 57c3 4519 23c1
    valentin@computer:~$ echo test | sha512sum | xxd -r -p | hexdump
    0000000 3e0e 2375 bc4a f468 8a37 b386 b3f4 192a
    0000010 a38b 8401 0c5b e5d6 0601 74e8 5734 cc00
    0000020 6366 6ca8 a11e dc25 925e 17be 8fc9 0f9a
    0000030 ca85 5f9d 5d59 01b2 7c2f 57c3 4519 23c1
    You described well why the sanity check should be on the data before hashing it.

    Haveged supplies a large amount of random data; however, I do not trust it. It’s better to have another randomness source, just in case.

    Another idea – making two recordings and computing the difference between them. Then, computing difference between two computed differences. This can be applied to image data (from a webcam) too. Of course, at the end – hashing.

    Because the sanity check algorithm can become very complex, just using John Von Neumann’s debiasing algorithm will make the sanity check algorithm not needed (or it can be simple – just monitoring the level of the input signal and some rngtest).

    1. Because such a sanity check is ultimately called for, and because it would not be feasible to implement it entirely as a bash script, what I’ve just done is to write a small helper program in C++, which treats any binary file it is fed as an array of 16-bit signed integers, in little-endian format. That happens to coincide with how my script temporarily writes its captured sound. My helper program then outputs a single integer, as ASCII text, stating the variance of that captured sound, so that this level can in turn be compared with a fixed level, in a modified script.

      BTW, Variance is Standard Deviation squared, and corresponds to signal energy.


  3. Using only least significant bits and then debiasing it using algorithm like John Von Neumann’s debiasing algorithm is the usual way of doing it.

    But most practical and simple approach is to just get some data from the microphone input, make a hash from it (like sha512), get more data, make another hash, and then XOR both hashes. For better security more data can be grabbed from the microphone, hashed and XOR-ed again, etc.

    There should be also a sanity check – if the input signal (from the microphone input) is big enough and random enough.

    1. Hello.

      Using a feedback form, you had also suggested the following:

      I think that there is no need of using the rng-tools daemon. You can just throw random data to /dev/random and the kernel will handle it. Example code:

      amixer sset 'Mic Boost' 100%
      amixer sset 'Mic' 100%
      while true ; do date; arecord -f cd -t raw -d 10 > /dev/random ; done

      Also, if you want to give hashed output to the /dev/random instead of raw data:

      amixer sset 'Mic Boost' 100%
      amixer sset 'Mic' 100%
      while true ; do arecord -f cd -t raw -d 10 | sha512sum -b > /dev/random ; done

      The code can be executed by any user, because anyone can write to /dev/random. I don’t understand why you are doing shasum -b -a 512 - | cut -c -128 | xxd -r -p instead of just sha512sum -b.
      Maybe you need different size of the data chunk, but why? From the docs:

      Writing to /dev/random or /dev/urandom will update the entropy pool with the data written, but this will not result in a higher entropy count. This means that it will impact the contents read from both files, but it will not make reads from /dev/random faster.

      So, throwing some data from the microphone input will make the random numbers more random, but will not increase the counter. Usually there is a haveged and/or hardware random numbers from the CPU, so /dev/random will not block (haveged is dealing with the counters).

      Thank you for your reply. I was making the assumption, for the moment, that ‘haveged’ is not installed. For that reason, I felt that my method should increase the counters. Thus, having the ‘rng-tools’ package installed would have been the easiest way to do so.

      Simply giving the following command:

      sha512sum -b

      has the effect that it will output ASCII characters, which just happen to be the hexadecimal representation of the hash-code. Adding the code:

      cut -c -128 | xxd -r -p

      will have the effect of converting that ASCII / Hex code into binary, which is what ‘/dev/random’ is supposed to store.

      And finally, I felt that the greatest enemy of my script would have been microphone output which is just zero, because an 8-bit audio format was being used. In that case, hashing the mike output several times might on occasion result in random-looking data, but it will contain little real entropy, because most sound chips are capable of recording sound close enough to silence that the 8-bit format will render it as silence. XOR’ing two non-random bit-sequences will not make a random one out of them.

      Having said that, I no longer run that daemon on any of my computers, because I rely more on ‘haveged’ being installed and working. It’s so much less of a hassle.


    2. I should also add that, given enough ingenuity, a shell script can be written that extracts, say, the 8 least-significant bits out of every 16 input bits. However, inputting a longer message, every second byte of which happens to be zero, and then computing its hash-code, will give the same benefit.

      In fact, the whole point of my posting was the observation that computing such a hash-code is a good alternative to actually debiasing such a stream.

      Further, for a sound card’s input to read an actual analog input as 16 bits of silence would require one very quiet input amplifier. It’s not impossible, but beyond what simple motherboard sound chips can do. But just because it’s not 100% impossible, you may write your own code to perform a sanity check on the captured audio, so that if it does consist of nothing but zeros, it will be ignored. However, what I suspect your sanity check will find instead is that some sequences of input words are zeros, while certain other input words – audio samples – are not, within the same message. And then, whether your sanity check is still smart enough to categorize such input messages as null will depend on how well you wrote it.

