Generating a Germain Prime efficiently, using Python and an 8-core CPU.

I have spent extensive time, as well as previous blog postings, exploring the subject of how to generate a Germain prime (number), in the Python (3.5) programming language, but taking advantage of parallel computing to some slight degree.

Before I go any further with this subject, I’d like to point out that generally, for production-ready applications, Python is not the best language to use. The main reason is the fact that Python is an interpreted language, even though many modern interpreted languages are compiled into bytecode before being interpreted. This makes a Python script slower by nature, than very well-written C or even C++. But what I aim to do is to use Lego-Blocks to explore this exercise, yet, to use tools which would be used professionally.

The main way I am generating prime numbers is, to start with a pseudo-random, 512-bit number (just as a default, the user can specify different bit-lengths), and then to attempt to divide this number by a list of known, definite primes, that by now only extend to 4096 (exclusively, of course), in an attempt to disprove that number prime. In about 1/8 of all cases, the number survives this test, after which I undertake a more robust, Miller-Rabin approach to try disproving it prime 192 times, probabilistically. If the number has survived these tests, my program assumes it to be prime.

But there is a sub-category of prime numbers which have recently been of greater interest to me, which is known as the Germain prime numbers, such that the Totient of that main candidate, divided by two, should also be prime. And so, if the density of prime numbers roughly equal to (n) is (1 / log(n)), and if we can assume a uniform distribution of random numbers, then the probability of finding a Germain prime is roughly (1 / (log (n))2), assuming that our code was inefficient enough actually to test all numbers. The efficiency can be boosted by making sure that the random number equals 3 in the modulus of 4.

But one early difficulty I had with this project was, that if I needed to start with a new pseudo-random number for each candidate, on my Linux computers, I’d actually break ‘/dev/urandom’ ! Therefore, the slightly better approach which I needed to take next was, to make my initial random number the pseudo-random one, and then just to keep incrementing it by 4, until the code plodded into a Germain prime.

Even when all this is incorporated into the solution I find that with Python, I need the added advantage of parallel computing. Thus, I next learned about the GIL – The Global Interpreter Lock – and the resulting pitfalls of multi-threaded Python, which is not concurrent. Multi-threading under Python tells the O/S to allocate CPU cores as usual, but then only allows 1 core to be running at any one time! But, even under those constraints, I found that I was benefiting from the fact that my early code was able to test at least 2 candidates for primality simultaneously, those being the actual candidate, as well as its Totient divided by 2. And as soon as either candidate was disproved prime, testing on the other could be stopped quickly. This reduced the necessary number of computations dramatically, to make up for the slowness of multi-threaded Python, and I felt that I was on the right track.

The only hurdle which remained was, how to convert my code into multi-processing code, no longer merely multi-threaded, while keeping the ability for two processes, then, to send each other a shutdown-command, as soon as the present process disproved its number to be prime.

Continue reading Generating a Germain Prime efficiently, using Python and an 8-core CPU.

A disadvantage in running Linux, on a multi-core CPU that’s threaded.

One of the facts about modern computing is, that the hardware could include a multi-core CPU, with a number of virtual cores different from the number of full cores. Such CPUs were once called “Hyper-Threaded”, but are now only called “Threaded”.

If the CPU has 8 virtual cores, but is threaded as only 4 full cores, then there will only be a speed advantage, when running 4 processes. But because processes are sometimes multi-threaded, each of those 4 processes could consist of 2 fully-busy threads, and benefit from a further doubling of speed because each full core has 2 virtual cores.

It’s really a feature of Windows to exploit this fully, while Linux tends to ignore this. When Linux runs on such a CPU, it only ‘sees’ the maximum number of virtual cores, as the logical number of cores that the hardware has, without taking into account that they could be pairing in some way, to result in a lower number of full cores.

And to a certain extent, the Linux kernel is justified in doing so because unlike how it is with Windows, it’s actually just as cheap for a Linux computer to run a high number of separate processes, as it is to run processes with the same number of threads. Two threads share a code segment as well as a data segment (heap), but have two separate stack segments as well as different register-values. This makes them ‘enlightened processes’. Well they only really run faster under Windows (or maybe under OS/X).

Under Linux it’s fully feasible just to create many processes instead, so the bulk of the programming work does not make use as much of multi-threading. Of course Even under Linux, code is sometimes written to be multi-threaded, for reasons I won’t go into here.

But then under Linux, there was also never effort put into the kernel recognizing two of its logical cores, as belonging to the same full core.

(Updated 2/19/2019, 17h30 … )

Continue reading A disadvantage in running Linux, on a multi-core CPU that’s threaded.

One way, in which my earlier description of CUDA was out of touch, with the real-world implementation.

One of the subjects which many programmers have been studying, is not only, how to write highly parallel code, but how to write that code for the GPU, since the GPU is also the most-readily-available highly-parallel processor. In fact, any user with a powerful graphics card, may already have the basis to program using CUDA or using OpenCL.

I had written an earlier posting, in which I ended up trying to devise a way, by which the compiler of the synthesized C or C++, would detect that each variable is being used as ‘rvalues’ or ‘lvalues’ in different parts of a loop, and by which the compiler would then choose, to allocate a local register, allocate a shared register, or to make a local copy of a value once provided in a shared register.

According to what I think I’ve learned, this thinking was erroneous, simply because a CUDA or an OpenCL compiler, does not take this responsibility off the hands of the coder. In other words, the coder needs to declare explicitly and each time, whether a variable is to be allocated in a local or a shared register, and must also keep track of how his code can change the value in a shared register, from other threads than the current thread, which may produce errors in how the current thread computes.

But, a command which CUDA offers, and which needs to exist, is a ‘__syncthreads()’ function, which suspends the current thread, until all the threads running in one core-group have executed the ‘__sycnthreads()’ instruction, after which point, they may all resume again.

One fact which disappoints about the real ‘__syncthreads()’ instruction is, that it offers little in the way of added capabilities. One thing which I had written this function may do however, is actually give the CPU a chance to run briefly, in a way not obvious to CUDA code.

But then there exist capabilities which a CUDA or an OpenCL programmer might want, which have no direct support from the GPU, and one of those capabilities might be, to lock an arbitrary object, so that the current thread can perform some computation which reads the object – after having obtained a lock on it – and which then writes changes to the object, before giving up its lock on it.

(Updated 04/19/2018 : )

Continue reading One way, in which my earlier description of CUDA was out of touch, with the real-world implementation.