According to an earlier posting, I had applied the (probabilistic) Miller-Rabin test, after testing whether large prime number candidates are divisible by any in a list of smaller, definite prime numbers, to generate pseudo-random numbers first, and then to continue until one was found to be prime.
That earlier effort had numerous issues, including the fact that it did not generate the kind of strong prime numbers needed for Diffie-Hellman key exchange. I have broadened my horizons slightly, and written more-up-to-date Python 3, which will generate such strong primes, as well as computing the resulting, Primitive Root, which would be used as the generator (g), if the results were ever actually used in cryptography.
I’d like to thank my friend, François Dionne, for giving me help with this project.
The resulting Python script can be downloaded from this URL:
(Updated 5/23/2019, 18h40 : )
(As of 5/10/2019 : )
One observation which I should add, is that the ‘random.SystemRandom.randint()’ function is similar in use to ‘…randrange’, but instead of using the Mersennes Twister Pseudo-Random Number Generator, uses the system-provided source, if one is available on the given platform. And under Linux, this is equivalent to using the file ‘/dev/urandom’. This is a device-file that generates a potentially endless series of Pseudo-Random numbers, based on the operating system’s entropy pool as seed information. What this means is that as a source of random numbers, this function is still quite flawed, especially when used to fetch a very long series of them.
It’s non-blocking, which means that it won’t limit the length of the output read. But, after a very long stream has been read from this source, some amount of repetition and/or redundancy will creep in. For cryptographic purposes, the Linux device file ‘/dev/random’ is preferable. But, it has the disadvantage that, once the entropy pool is depleted, it will force the program to wait for further output. This can happen after a few kilobits of data have been read. Yet, this is a more-random source, of bits. Therefore, if the user’s Python installation offers the module named ‘secrets’ the following lines:
import sys from random import randrange, SystemRandom # from secrets import SystemRandom
May be changed to:
import sys from random import randrange from secrets import SystemRandom
This module is only an included part of Python 3.6 and up. However, under Debian, backported versions of the ‘secrets’ module can be installed by giving the commands:
# pip install python2-secrets # pip3 install python2-secrets
In order for this to work, the packages need to be installed first, named ‘python-pip’ and ‘python3-pip’ respectively.
The Python function ‘secrets.SystemRandom.randint()’ will generate the highest-quality random numbers that the O/S can provide, but I suspect that it, like the Linux device-file ‘/dev/random’, will block after a certain number of bits.
(Update 5/11/2019, 20h25 : )
An added feature of this latest code is its use of Python ‘ThreadPools’, in order to test candidates for primality on two CPU cores simultaneously, while using a third core minimally, in order to precompute the next prime-number candidate to be tested.
The concept of strong prime numbers is such, that the prime number (n), as well as the Totient divided by two, ((n – 1) / 2), should both be prime. This rarefies the prime numbers that are finally to be accepted, but also seems to do two things:
- Allow for the Totient to be prime-factorized, which is necessary to compute the Primitive Root,
- Make the modulus stronger, against any possible Discrete Logarithm attack.
I took advantage of this situation in order to write code that tests both the random number (m), and ((m – 1) / 2) for whether they are prime, in two separate threads. As soon as either thread disproves its value to be prime, it also sends a message, that stops the other thread in the pair, so that a new candidate (m) can start to be tested.
(Update 5/15/2019, 21h00 : )
What the previous link – which I did update – used as its main strategy, was ‘Inter-Thread Communication’, so that if either of the relevant pair of candidates was disproved to be prime, the testing of the other candidate would also be discontinued. And my reasoning was, that to disprove most non-prime candidates would only take a few microseconds, while the testing on prime candidates continues for more than a millisecond, typically. Therefore, since both candidates in such a pair would need to be prime, in order for the first to provide an answer to the problem, and, since all the Threads in the ThreadPool need to complete, before a new set of Threads is created, it seemed an optimal strategy.
But there was a flaw in this approach, which only became apparent to me yesterday. The percentage with which each CPU core was used decreased with the total number of threads created. And the reason for this seemed to be the fact that, notwithstanding, the amount of time needed to disprove a candidate prime is finally random, and random in an open-ended sense. Therefore, the more simultaneous threads needed to complete, the longer the process would take, and the higher the percentage of threads within that ThreadPool would become, which had already completed, before the entire ThreadPool had. And so I needed to rethink my paradigm.
What I seemed to discover was, that if I switched to a multiprocessing framework, using Python 3 Pools rather than ThreadPools, yes, the possibility of Inter-Thread Communication was gone. So the computation becomes less efficient off the bat. But then, if the user has an 8-core CPU, and if only half that number of processes is run, then sharing of the L1 cache between processes can be kept to a minimum – i.e., only Real Cores are being used instead of Virtual, Hyper-Threaded cores – and then the sheer number of parallel processes can overcome whatever inefficiency was otherwise inherent in the approach.
Under Linux, ‘user time’ exceeds ‘real time’ for the first time in this project, and almost doubly so. This is because ‘user time’ adds up time spent on all the CPU cores, while ‘real time’ does not. And so more computations can be carried out!
And the Python 3.4 code which implements this is to be found using the following link:
(Update 5/16/2019, 0h40 : )
One observation I had about the multiprocessing version of this algorithm was, that it was a tad slow in computing 1024-bit Germain primes, even though at its default of 512 bits, it seemed to perform adequately. And so to correct this weakness, I’ve optimized it a little, to deal with such, more-realistic scenarios.
AFAICT This slowness was due to the fact that a main process needed to serialize a list of known, definite primes, each time it sent a candidate to a worker-process, for the worker-process to disprove. That list of definite primes went up to 65535 (exclusively of that endpoint), thereby containing 5909 entries theoretically. When working with shared memory one doesn’t think twice about such things, but Inter-Process pipelines being what they are, causes the program to become one main process which has a CPU core running at 100%, while the other 4 cores are mostly idle, just reacting to its output.
What I did was to make that list smaller, down to a maximum value of 16384, therefore containing 1688 entries theoretically. Further, I no longer valued the algorithm finding the lowest real GM-prime, after the first pseudo-random candidate. Reversing the list of resulting, successive candidates wasted some more processing power, belonging to the main process, so I got rid of that.
I find that the program works well enough now, when instructed by the user to generate 1024-bit Germain primes. However, the actual time it will require has a strong random component to it, for which reason I have not benchmarked it.
(Update 5/16/2019, 13h50 : )
What I have found today was, that a similar optimization – to reduce the size of the list of known primes – also improves the speed with which the multi-threaded version runs, but for a different reason.
Apparently, each prime number in the list of known primes takes up 32 bytes of memory on a 64-bit machine. It takes 1 chunk to store, plus 1 chunk of header information, plus 2 more chunks because it’s an entry in a linked list. Well, when I had known primes up to 65535, the amount of memory being shared by two threads lapped the size of the L1 cache. Usually, an L1 cache only has about
128 KB of memory, divided into even smaller ‘lines’. Now that I have adjusted the list of known prime numbers, up to a value of 16384, that will only take up about 54 KB in total, which means that for the most part, threads sharing one L1 cache instance, can keep looking up entries within the same cache, without requiring (slower) replacement operations.
(Update 5/18/2019, 19h20 : )
Some readers may find it strange of me to write, that an object in memory ‘laps the cache’. The way the lines of the cache are computed, is simply as a modulus of the memory address. But because the strategy is ‘set-associative mapping’, each line in the cache additionally stores an ‘exact candidacy tag’, that states the most-significant bits of the real frame in memory, which that line is currently caching. If that tag does not match the most-significant bits the CPU core is asking for, then a replacement operation is triggered.
If there is a data-object in memory which is larger than a given cache, and if the first thing each thread does is to traverse the entire object, which my ‘is_prime()’ function does, because that function first divides its first parameter by every entry in that object, doing so will also traverse the L1 cache more than nonce, with certainty, at the beginning of the execution of the thread. This will force a certain number of replacement operations to take place implicitly.
And just to recap how my ‘is_prime()’ function works… It can happen that the first parameter – the candidate – is divisible by an entry in the list of known primes, long before the end of that list has been reached, in which case the candidate was easily proven not-prime. But if the function reaches the end of this list of known primes, then ‘the real work’ begins, of proving whether a 512-bit number… was actually prime. That real work is the Miller-Rabin Test, which involves long exponentiation within a modulus, repeated 192 times.
Divisibility by an entry in the list of known primes is only meant to speed up the process of disproving a number to be prime, in cases where it’s feasible to do so. I.e., having a list of known primes, up to a value of 2256, would not be feasible.
Actually, the size of the caches can be found under Linux, using the command:
Which can be run as user. But There’s an added catch, when one CPU core merely writes data to a line of the L1 cache. As long as the requested address matches the exact candidacy tag, writing data will not force a replacement operation. It will merely cause the line in the cache to be marked ‘dirty’, at which point it might not be replaced at all. It’s only once the line of the cache does get replaced that, before a new frame is read in, if the line was marked ‘dirty’, the preceding contents of the same line are written out.
What this means is that some programmers may create a piece of code, the entire purpose of which is, to address a region of memory the size of which equals or exceeds that of the L1 cache, just to force a flush. But I don’t really think that for me to do so explicitly will make this Python example more efficient.
What I can do is to write a variant, of the multi-threaded version of this program, that uses an official ‘threading.Event’ object, instead of just a flag, inside a shared list. The result which I obtain is approximately the same performance, give or take the random component present in every execution time:
As a matter of fact, the sameness of the result does surprise me. What this suggests is that, when the Python interpreter makes the work-unit cue a shared region of memory, the interpreter has already done whatever it needs to do, to optimize scheduling efficiency.
I would hazard a guess that if the Python-programmer wants to share events between more than 2 threads at a time, that have always been fed into the queue in a fixed sequence, the official Event object will perform better, because it may have been coded in low-level C functions, to order a replacement of one line of L1 cache, every time one of the threads sets or clears the object’s only Boolean value.
In addition, the amount of randomness in the execution time of the variant of this program, that uses the official Event object, may be slightly less, meaning better, just because the naked variant may not always end up being scheduled correctly, and may leave that up to chance more.
(Update 5/19/2019, 11h55 : )
There is an additional detail which I’ve improved, in the way the example ‘Generate_Prime_DM_ITC.py’ increments its prime-number candidate, to create a list of them, one of which will eventually be prime. On line 288 I changed the code to:
length_v = 1 << length
While on lines 327-329 I changed the code to:
p += 4 if p >= length_v: p -= length_v >> 1
This should not really create a dramatic improvement in speed, as the CPU cores are fast enough, to compute the earlier method quickly enough, to keep up with the worker threads testing the candidates for being prime. That earlier method was slightly less-efficient, in that it required computing a bit-wise OR of the entire 512-bit / 1024-bit / etc. candidate with a mask, as well as computing a modulus / remainder function for each of them.
But remarkably I found that making this code ‘better’, did in fact improve the speed of the multi-threaded example. A possible reason could be that again, the earlier, more cumbersome method ‘lapped’ the L1 cache on one of the cores, so that one entire pair of logical cores was slowed down, which can also slow down the amount of time it takes for the ‘async_result.get()’ call to wait for all the threads to complete.
After all, the function ‘is_prime()’ will first compute a modulus function for every entry in a list of known primes, in order to attempt disproving one candidate prime. Yet, that was an example in which I was deliberately running an odd number of threads concurrently, which mans that one of the worker threads could easily have become lined up with the asymmetrical parent thread.
I have made the corresponding correction to ‘Generate_Prime_DMFD_ITC.py’ as well, just so that if my readers ever do test either code example, they will also have a systematic basis to compare their efficiency.
(Update 5/17/2019, 17h50 : )
The most recent versions of this script make use of the Python module ‘psutil’. It may need to be installed.
(Update 5/23/2019, 18h40 : )
I have finally discovered why it is, that the core usage remains inversely proportional to the number of cores used. Apparently, Python thread-pools have as a main drawback that, even though they’re scheduled by the kernel to run on different cores, at any one point in time, only one of their threads is allowed to be running. Therefore, there is nothing else for me to do, to optimize the multi-threaded versions of my program.