Why some Linux devs still use spinlocks.

There is an interesting article on another BB, which explains some of the effects of spinlocks under Linux. But, given that under Linux spinlocks are often regarded as ‘bad code’, because they constitute ‘a busy-wait loop’, I don’t think that posting explains very well why, in certain cases, they are still used.

A similar approach to acquiring an exclusive lock – one that can exist as an abstract object, and that does not really need to be linked to the resource it’s designed to protect – could be programmed by a novice programmer, if there were not already a standard implementation somewhere. That other approach could read:


from time import sleep
import inspect
import threading    # Python 3; the old Python 2 module was named 'thread'.

def acquire( obj ):
    assert inspect.isclass(obj)
    myID = threading.get_ident()
    if getattr(obj, 'owner', None) is None:
        # Replace .owner with a unique attribute name.
        # In C, .owner needs to be declared volatile.
        # Compiler: Don't optimize out any reads.
        obj.owner = myID
    while obj.owner != myID:
        if obj.owner == 0:
            obj.owner = myID
        sleep(0.02)

def release( obj ):
    assert inspect.isclass(obj)
    if getattr(obj, 'owner', None) is None:
        # In C, .owner needs to be declared volatile.
        obj.owner = 0
        return
    if obj.owner == threading.get_ident():
        obj.owner = 0


(Code updated 2/26/2020, 8h50…

I should add that, in the example above, the initial test-and-set (the ‘if’ block in acquire() that ends with ‘obj.owner = myID’) will just happen to work under Python, because multi-threading under Python is effectively single-threaded, governed by a Global Interpreter Lock. In the intended case, where this is really just pseudo-code and several clock cycles elapse between that test and that assignment, those semantics would break: the ‘truth’ which the ‘while’ condition then tests may not last, simply because another thread could have claimed ownership within that same block in the meantime. OTOH, adding another sleep() statement there is unnecessary, as those semantics are not available outside Python.

In the same vein, if the above code is ported to C, then what matters is the fact that, in the current thread, several clock-cycles elapse between the test ‘if obj.owner == 0:’ and the assignment ‘obj.owner = myID’ inside the loop. Within those clock cycles, other threads could also read that obj.owner == 0, and assign themselves. Therefore, the only sleep() instruction is dual-purpose. Its duration might exceed the period during which the cache is slowed down by multiple threads assigning to the same memory location. After that, one out of possibly several threads would have been the last to assign itself. And then, that thread would also be the one that breaks out of the loop.

However, there is more that could happen between that test and that assignment than cache-inefficiency. The O/S scheduler could cause a context-switch, and the current thread could be out of action for some time. If that amount of time exceeds 20 milliseconds, then the current thread would assign itself afterwards, even though another thread has already passed the retest at the top of the ‘while’ loop, and assumes that it owns the Mutex. Therefore, better suggested pseudo-code is offered at the end of this posting…

)



This pseudo-code has another weakness. It assumes that, every time the resource is not free, the program can afford to wait for more than 20 milliseconds before re-attempting to acquire it. The problem can crop up that the current process or thread must acquire the resource within microseconds, or even within nanoseconds, of it becoming free. And for such a short period of time, there is no way that the O/S can reschedule the current CPU core to perform any useful amount of work on another process or thread. Therefore, in such cases, a busy-wait loop becomes The Alternative.
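
Just to make the contrast concrete, below is a minimal sketch of what that busy-waiting alternative looks like, written in Python purely for illustration: the non-blocking acquire() of a ‘threading.Lock’ stands in for the atomic test-and-set instruction that real spinlock code would use, and the loop never sleeps.

import threading

class SpinLock:
    # A minimal busy-wait lock: it trades CPU time for wake-up latency.

    def __init__(self):
        self._flag = threading.Lock()   # Its non-blocking acquire() is the atomic primitive here.

    def acquire(self):
        # Busy-wait: keep retrying the atomic try-acquire, without ever
        # calling sleep() or otherwise yielding the CPU voluntarily.
        while not self._flag.acquire(blocking=False):
            pass

    def release(self):
        self._flag.release()

The whole point is the absent sleep(): the moment the resource becomes free, the spinning thread claims it within a handful of instructions, instead of waiting out the remainder of a 20-millisecond nap.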

I suppose that another reason for which some people have used spinlocks is, simply, bad code design.



Note: The subject of what the timer interrupt frequency should be has long been debated. Under kernel v2.4 and earlier it was 100Hz; as of kernel v2.6 it was raised to 1000Hz (and it has since become a configurable build option). Therefore, in principle, an interval of 2 milliseconds could be inserted above (in case the resource had not become free). However, I don’t really think that doing so would change the nature of the problem.

Explanation: One of the (higher-priority, hardware) interrupt requests consists of nothing but a steady pulse-train from a programmable oscillator. Even though the kernel can set its frequency over a wide range, this frequency is known not to be terribly accurate. Therefore, assuming that the machine has the software installed that provides ‘strict kernel marshalling’ – i.e., that disciplines the kernel clock – then, every time this interrupt fires, the system time is advanced by a floating-point increment, which is itself adjusted over a period of hours and days, so that the system time keeps in sync with an external time reference. Under Debian Linux, the package which installs that is named ‘ntp’.
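
Out of curiosity, the granularity that sleep() actually delivers on a given kernel and timer configuration can be measured with a throwaway sketch like the following; it has nothing to do with the locking code itself.

import time

def average_sleep(requested=0.002, trials=200):
    # Time a nominally 2-millisecond sleep repeatedly, and report the average.
    total = 0.0
    for _ in range(trials):
        start = time.perf_counter()
        time.sleep(requested)
        total += time.perf_counter() - start
    return total / trials

if __name__ == '__main__':
    print("Requested 2.000 ms, got %.3f ms on average." % (average_sleep() * 1000.0))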


There exist a few other tricks to understand about how, in practice, to force an Operating System which is not an RTOS to behave like a Real-Time O/S. I’ve known a person who understood computers on a theoretical basis, and who had studied Real-Time Operating Systems. That’s a complicated subject. But, given the reality that his operating system was not a Real-Time O/S, he was somewhat stumped as to why, then, it was able to play a video-clip at 24 FPS…

(Updated on 3/12/2020, 13h20 …)

Continue reading Why some Linux devs still use spinlocks.

Generating a Germain Prime efficiently, using Python and an 8-core CPU.

I have spent extensive time, as well as previous blog postings, exploring how to generate a Germain prime (number) in the Python (3.5) programming language, while taking advantage of parallel computing to some slight degree.

Before I go any further with this subject, I’d like to point out that, generally, for production-ready applications, Python is not the best language to use. The main reason is the fact that Python is an interpreted language, even though many modern interpreted languages are compiled into bytecode before being interpreted. This makes a Python script slower by nature than very well-written C, or even C++. But what I aim to do is to use Lego-Blocks to explore this exercise, yet to use tools which would be used professionally.

The main way I am generating prime numbers is to start with a pseudo-random, 512-bit number (just as a default; the user can specify different bit-lengths), and then to attempt to divide this number by a list of known, definite primes, which by now only extends to 4096 (exclusively, of course), in an attempt to disprove that the number is prime. In about 1/8 of all cases, the number survives this test, after which I undertake a more robust, Miller-Rabin approach, to try disproving it prime 192 times, probabilistically. If the number has survived these tests, my program assumes it to be prime.
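
As a point of reference, that first, cheap pass could be sketched roughly as follows. This is a generic illustration rather than my exact code, with the list of small primes below 4096 computed once up front:

def small_primes(limit=4096):
    # Simple sieve of Eratosthenes; returns the primes strictly below 'limit'.
    sieve = bytearray([1]) * limit
    sieve[0:2] = b'\x00\x00'
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = bytearray(len(sieve[i * i::i]))
    return [i for i, flag in enumerate(sieve) if flag]

SMALL_PRIMES = small_primes()

def survives_trial_division(candidate):
    # Return False if any prime below 4096 divides the candidate,
    # i.e. if the candidate is disproved prime cheaply.
    for p in SMALL_PRIMES:
        if candidate % p == 0:
            return candidate == p
    return True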

Even though I’ve never been told this, the probability of a non-prime candidate surviving even a single Miller-Rabin test, or witness, is very small, smaller than 1/8. (The standard worst-case bound is 1/4 per witness, and for random composites the fraction is far smaller.) This could be due to the fact that the witness in question is raised to a high, odd exponent in its 512-bit modulus etc., after which it would be squared some number of times. Because the candidates are chosen to be congruent to 3 in the modulus of 4, (n-1) contains only a single factor of two, so that result actually gets squared a subsequent total of zero times. And after each exponentiation, the result could be any number in the modulus, with the possible exception of zero. It needs to become either (1) or (n-1) in the modulus of (n), for the candidate to survive the test. (:1)
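
And, for reference, a single round of the test just described can be sketched like this. It is a textbook Miller-Rabin witness rather than a copy of my code; note that, for a candidate congruent to 3 in the modulus of 4, the squaring loop at the end runs zero times:

import random

def miller_rabin_witness(n, a):
    # One probabilistic round: return True if witness 'a' fails to
    # disprove that the odd candidate 'n' is prime.
    # Write (n - 1) as (2 ** s) * d, with d odd.  When n is congruent
    # to 3 (mod 4), s == 1, so there are (s - 1) == 0 extra squarings.
    d, s = n - 1, 0
    while d % 2 == 0:
        d, s = d // 2, s + 1
    x = pow(a, d, n)            # Raise the witness to the high, odd exponent, mod n.
    if x == 1 or x == n - 1:
        return True
    for _ in range(s - 1):      # Square repeatedly, looking for (n - 1).
        x = pow(x, 2, n)
        if x == n - 1:
            return True
    return False

def is_probable_prime(n, rounds=192):
    # Apply 'rounds' independent witnesses; ordinary pseudo-random
    # witnesses are fine, they need no cryptographic quality.
    if n < 4:
        return n in (2, 3)
    return all(miller_rabin_witness(n, random.randrange(2, n - 1))
               for _ in range(rounds))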

Further, there is no need for the numbers that get used as witnesses, which are pseudo-random, to be of the same high, cryptographic quality of pseudo-randomness as the candidate which is being tested.

But there is a sub-category of prime numbers which has recently been of greater interest to me, known as the Germain prime numbers, such that half the Totient of the main candidate – i.e., (p-1)/2 – should also be prime. (Strictly speaking, the main candidate is then what’s called a ‘safe prime’, and (p-1)/2 is the Germain prime.) And so, if the density of prime numbers roughly equal to (n) is (1 / log(n)), and if we can assume a uniform distribution of random numbers, then the probability of finding such a pair is roughly (1 / (log(n))²), assuming that our code was inefficient enough actually to test all numbers. The efficiency can be boosted by making sure that the random number modulo 4 equals 3.
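
To put a rough number on that, here is a back-of-envelope calculation for the default, 512-bit case (ignoring constant factors, as well as the savings from the mod-4 restriction):

import math

n = 2 ** 512
p_single = 1.0 / math.log(n)    # Chance that one random number near n is prime: about 1 in 355.
p_pair = p_single ** 2          # Rough chance that the candidate and (p - 1) / 2 are both prime.
print(round(1 / p_single), round(1 / p_pair))    # Roughly 355, and roughly 126,000 candidates.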

But one early difficulty I had with this project was that, if I needed to start with a new pseudo-random number for each candidate, on my Linux computers I’d actually break ‘/dev/urandom’! Therefore, the slightly better approach which I needed to take next was to make only my initial number the pseudo-random one, and then just to keep incrementing it by 4, until the code plodded into a Germain prime.
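
Sketched as a single-threaded loop, that incremental search looks roughly like the following. Here, sympy’s isprime() is only a stand-in for the trial-division-plus-Miller-Rabin testing described above, and secrets.randbits() supplies the one and only read from the system’s random source:

import secrets
from sympy import isprime    # Standing in for the trial-division + Miller-Rabin tests above.

def find_germain_candidate(bits=512):
    # Start from one pseudo-random number, force it congruent to 3 (mod 4),
    # then step by 4 until both p and (p - 1) // 2 test prime.
    p = secrets.randbits(bits) | (1 << (bits - 1))    # Keep the full bit-length.
    p += (3 - p) % 4                                  # Force p % 4 == 3.
    while not (isprime(p) and isprime((p - 1) // 2)):
        p += 4                                        # A step of 4 preserves p % 4 == 3.
    return p

Because the step of 4 preserves the congruence modulo 4, only the starting point ever has to come from ‘/dev/urandom’.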

Even when all this is incorporated into the solution, I find that, with Python, I need the added advantage of parallel computing. Thus, I next learned about the GIL – the Global Interpreter Lock – and the resulting pitfalls of multi-threaded Python, which is concurrent but not parallel. Multi-threading under Python tells the O/S to allocate CPU cores as usual, but then only allows one thread to be executing Python bytecode at any one time! But, even under those constraints, I found that I was benefiting from the fact that my early code was able to test at least 2 candidates for primality simultaneously, those being the actual candidate, as well as its Totient divided by 2. And, as soon as either candidate was disproved prime, testing on the other could be stopped quickly. This reduced the necessary number of computations dramatically, to make up for the slowness of multi-threaded Python, and I felt that I was on the right track.

The only hurdle which remained was how to convert my code into multi-processing code, no longer merely multi-threaded, while keeping the ability for the two processes to send each other a shutdown-command, as soon as the present process disproved its number to be prime.
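
One way to sketch that conversion, using the standard ‘multiprocessing’ module rather than my actual code, is to give both worker processes a shared Event: whichever process disproves its own number first sets the event, and the other process notices this between rounds and stops.

import multiprocessing as mp
import random

def one_round(n):
    # One Miller-Rabin witness (repeated here so that the sketch stands alone);
    # returns False if the odd candidate n is disproved prime.
    a = random.randrange(2, n - 1)
    d, s = n - 1, 0
    while d % 2 == 0:
        d, s = d // 2, s + 1
    x = pow(a, d, n)
    if x in (1, n - 1):
        return True
    for _ in range(s - 1):
        x = pow(x, 2, n)
        if x == n - 1:
            return True
    return False

def worker(n, stop, verdicts, key, rounds=192):
    # Test n, but abandon the work as soon as the other process has set 'stop'.
    for _ in range(rounds):
        if stop.is_set():          # The other candidate already failed; stop early.
            return
        if not one_round(n):       # n is disproved prime: tell the other process.
            verdicts[key] = False
            stop.set()
            return
    verdicts[key] = True           # n survived every round.

def test_pair(p):
    # Test p and (p - 1) // 2 in two separate processes, with mutual shutdown.
    with mp.Manager() as mgr:
        stop, verdicts = mgr.Event(), mgr.dict()
        procs = [mp.Process(target=worker, args=(p, stop, verdicts, 'p')),
                 mp.Process(target=worker, args=((p - 1) // 2, stop, verdicts, 'q'))]
        for pr in procs:
            pr.start()
        for pr in procs:
            pr.join()
        return verdicts.get('p', False) and verdicts.get('q', False)

if __name__ == '__main__':
    print(test_pair(23))    # 23 and 11 are both prime, so this should print True.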

(Updated 9/19/2020, 17h55 … )

Continue reading Generating a Germain Prime efficiently, using Python and an 8-core CPU.

How SDL Accelerates Video Output under Linux.

What we might know about the Linux X-server is that it offers pure X-protocol to render such features efficiently to the display as Text with Fonts, Simple GUI-elements, and small Bitmaps such as Icons… But then, when moving pictures need to be sent to the display, we need extensions, which serious Linux-users take for granted. One such extension is the Shared-Memory extension (MIT-SHM).

Its premise is that the X-server shares a region of RAM with the client application, into which the client application can draw pixels, and which the X-server then transfers to Graphics Memory.

For moving pictures, this offers one way in which they can also be ‘accelerated’, because that memory-region stays mapped, even when the client-application redraws it many times.

But this extension does not make significant use of the GPU, only of the CPU.

And so there exists something called SDL, which stands for Simple DirectMedia Layer. And one valid question we may ask ourselves about this library is how it achieves a speed improvement, if it’s only installed on Linux systems as a set of user-space libraries, not drivers.

(Updated 10/06/2017 : )

Continue reading How SDL Accelerates Video Output under Linux.

I have now installed ‘xine’ on my Linux tablet.

In this earlier posting, I had written that I’d installed Linux on an older tablet of mine, that being my Samsung Galaxy Tab S, First Generation, with only 16GB of storage.

In order to do so, I used the (non-rooted) applications from Google Play, ‘GNURoot’ and ‘XSDL’.

One feature which the author of ‘XSDL’ pointed out is the fact that we may download a shared library to run under Linux, which, when preloaded, makes the shared-memory extension available for the purpose of running one application. By default, pure X-server protocol does not offer this, even though any half-decent Linux system has the shared-memory extension, the X-Video extension, and beyond that, ‘vdpau’, to allow fast video playback.

One Linux application which I had been using this way was ‘gnome-mplayer’, for which I had also written a shell-script that preloads the shared-memory library. The video-player application was launching and running fine, but I’m no longer convinced that it was ever benefiting from shared memory. More specifically, we can set in the preferences of the player application to use ‘X11’ as its video output-mode, and ‘pulseaudio’ as its audio output-mode.

Literally, selecting X11 in this way does not mean shared memory as the output-mode, although the player could have been negotiating with the (fake) X-server over this parameter…

So. To make sure I’d be obtaining the full benefit of shared memory when playing back video-streams more seriously, I next proceeded to install ‘xine-ui’. It is highly configurable, in that we can choose shared-memory video-output explicitly.

Continue reading I have now installed ‘xine’ on my Linux tablet.