A disadvantage in running Linux, on a multi-core CPU that’s threaded.

One of the facts about modern computing is that the hardware can include a multi-core CPU, with a number of virtual cores different from its number of full cores. Intel branded such CPUs “Hyper-Threaded”; the vendor-neutral term nowadays is “simultaneous multi-threading”, or SMT.

If the CPU has 8 virtual cores, but is threaded as only 4 full cores, then 4 fully-busy processes will run at full speed, one per full core. And because processes are sometimes multi-threaded, each of those 4 processes could consist of 2 fully-busy threads, occupying both virtual cores of one full core. But that second virtual core does not double the speed: the two virtual cores share the execution resources of one full core, so the further gain is partial, often quoted in the range of 10 to 30 percent.

It has really been a feature of Windows to exploit this fully, while Linux tends to ignore it. When Linux runs on such a CPU, it mostly ‘sees’ the maximum number of virtual cores as the logical number of cores that the hardware has, without giving much weight to the fact that they pair up in some way, to result in a lower number of full cores.
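To illustrate, here is a minimal C++ sketch, assuming a Linux system: the first number is the logical core count which software normally ‘sees’, while the pairing of logical cores into full cores only surfaces if one digs into sysfs explicitly. (The sysfs path is Linux-specific, and its output format can vary between kernels.)

#include <fstream>
#include <iostream>
#include <string>
#include <thread>

int main() {
    // The number the system mostly works with: logical (virtual) cores.
    std::cout << "Logical cores: "
              << std::thread::hardware_concurrency() << "\n";

    // The pairing information exists, but has to be asked for separately:
    std::ifstream f(
        "/sys/devices/system/cpu/cpu0/topology/thread_siblings_list");
    std::string siblings;
    if (std::getline(f, siblings)) {
        // E.g. "0,4" or "0-1": the logical cores sharing one full core.
        std::cout << "Siblings of cpu0: " << siblings << "\n";
    }
    return 0;
}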

And to a certain extent, the Linux kernel is justified in doing so, because unlike how it is with Windows, it’s actually about as cheap for a Linux computer to run a high number of separate processes, as it is to run processes with the same number of threads. Two threads share a code segment as well as a data segment (the heap), but have two separate stack segments, as well as separate register-values. This is what makes them ‘lightweight processes’. Well, they only really run faster under Windows (or maybe under macOS).
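A minimal sketch of that sharing, in standard C++ (all the names here are mine, for illustration): the global counter lives in the one data segment both threads share, while each thread’s local variable lives on that thread’s own stack.

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> shared_counter{0};   // one shared data segment

void worker(int id) {
    int local = id * 100;             // separate stack per thread
    (void)local;
    shared_counter += 1;              // both threads update the same object
}

int main() {
    std::thread a(worker, 1), b(worker, 2);
    a.join();
    b.join();
    // Both increments landed in the one shared counter:
    std::cout << "shared_counter = " << shared_counter << "\n";  // prints 2
    return 0;
}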

Under Linux it’s fully feasible just to create many processes instead, so the bulk of the programming work does not make as much use of multi-threading. Of course, even under Linux, code is sometimes written to be multi-threaded, for reasons I won’t go into here.
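As a hedged sketch of the many-processes approach, assuming a POSIX system: each fork()ed child receives a copy-on-write image of its parent, shares nothing with it by default, and on Linux costs roughly as much to create as a thread would.

#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const int kWorkers = 4;           // illustrative worker count
    for (int i = 0; i < kWorkers; ++i) {
        pid_t pid = fork();
        if (pid == 0) {               // child: do independent work, then exit
            std::printf("worker %d, pid %d\n", i, (int)getpid());
            _exit(0);
        }
    }
    for (int i = 0; i < kWorkers; ++i)
        wait(nullptr);                // parent: reap all the children
    return 0;
}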

But then under Linux, correspondingly less effort was put into the scheduler recognizing two of its logical cores, as belonging to the same full core.

(Updated 2/19/2019, 17h30 … )


One way, in which my earlier description of CUDA was out of touch, with the real-world implementation.

One of the subjects which many programmers have been studying is not only how to write highly parallel code, but how to write that code for the GPU, since the GPU is also the most readily available highly-parallel processor. In fact, any user with a powerful graphics card may already have the basis to program using CUDA, or using OpenCL.

I had written an earlier posting, in which I ended up trying to devise a way by which the compiler of the synthesized C or C++ would detect whether each variable is being used as an ‘rvalue’ or as an ‘lvalue’ in different parts of a loop, and by which the compiler would then choose whether to allocate a local register, to allocate a shared register, or to make a local copy of a value once provided in a shared register.

According to what I think I’ve learned, this thinking was erroneous, simply because a CUDA or an OpenCL compiler does not take this responsibility off the hands of the coder. In other words, the coder needs to declare explicitly, each time, whether a variable is to live in per-thread (local) registers, or in memory shared by the thread block, which under CUDA is done with the ‘__shared__’ qualifier. He must also keep track of how threads other than the current one can change a value in shared memory, which may produce errors in how the current thread computes.
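A minimal CUDA sketch of this division of labour (the kernel and its names are illustrative, not from my earlier posting): ‘tile’ is explicitly declared shared and is visible to every thread of the block, while ‘my_copy’ is an ordinary local variable, which the compiler places in per-thread registers. Taking the private copy is what guards the current thread against later writes to the shared slot by other threads.

// Launch with blockDim.x == 256 for this sketch.
__global__ void sketch(const float* in, float* out) {
    __shared__ float tile[256];       // explicitly shared, one array per block
    int t = threadIdx.x;

    tile[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                  // make every thread's write visible

    // Private register copy of a neighbour's shared value:
    float my_copy = tile[(t + 1) % blockDim.x];
    out[blockIdx.x * blockDim.x + t] = my_copy * 2.0f;
}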

But one function which CUDA offers, and which needs to exist, is ‘__syncthreads()’, which suspends the current thread until all the threads of the same block, i.e. all the threads running in one core-group, have executed the ‘__syncthreads()’ call, after which point they may all resume again.
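A classic use of the barrier, again as an illustrative CUDA sketch: a block-level sum, in which no thread may read its neighbours’ partial results until all of them have been written.

// Launch with a power-of-two blockDim.x of at most 256.
__global__ void block_sum(const float* in, float* out) {
    __shared__ float buf[256];
    int t = threadIdx.x;
    buf[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                  // all loads done, before any reads

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride)
            buf[t] += buf[t + stride];
        __syncthreads();              // each halving step completes in lockstep
    }
    if (t == 0)
        out[blockIdx.x] = buf[0];     // one partial sum per block
}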

One fact which disappoints about the real ‘__syncthreads()’ function is that it offers little in the way of added capabilities. One thing which I had written such a function might do, however, is actually give the CPU a chance to run briefly, in a way not obvious to the CUDA code.

But then there exist capabilities which a CUDA or an OpenCL programmer might want, but which have no direct support from the GPU. One of those capabilities might be, to lock an arbitrary object, so that the current thread can perform some computation which reads the object, after having obtained a lock on it, and which then writes changes to the object, before giving up its lock on it.
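There is no lock instruction as such, but such a lock is commonly improvised out of the atomic operations the GPU does offer. A hedged sketch, with all names being illustrative: a spinlock built on atomicCAS(), where 0 means free and 1 means held. On hardware before Volta, this pattern can deadlock if several threads of one warp contend for the same lock, which is why only one thread per block takes it here.

__device__ void acquire(int* lock) {
    while (atomicCAS(lock, 0, 1) != 0)
        ;                             // spin until we swap 0 -> 1
    __threadfence();                  // see writes made before the last release
}

__device__ void release(int* lock) {
    __threadfence();                  // flush our writes before unlocking
    atomicExch(lock, 0);
}

// 'object' is volatile so that reads under the lock bypass stale caches.
__global__ void update_object(int* lock, volatile float* object) {
    if (threadIdx.x == 0) {           // one thread locks on behalf of its block
        acquire(lock);
        *object = *object + 1.0f;     // read-modify-write, under the lock
        release(lock);
    }
}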

(Updated 04/19/2018 : )
