One of the features which CPUs were advertised as having several years ago was “hyper-threading”. It can always happen that CPU cores have some feature which I’m not up-to-date on, but my own mind holds a concept of this feature, which I find to be adequate:
Given a CPU with 8 cores, it can happen that an L1 cache is shared between every pair of cores, while the L2 and the L3 cache may be shared between all the cores. Because of the way each L1 cache instance is optimized, this can lead to a performance penalty, if threads from different processes are assigned to the 2 cores belonging to one shared L1 cache. The result is a form of cache contention, since lines in the cache will be referenced repeatedly that actually map to completely different regions of RAM. Every time that happens, a replacement operation needs to be performed in the cache, to update the offending line with the most-recently-addressed memory region (the corresponding “frame”), and this happens more slowly than the full speed at which the CPU core proper can read instructions and/or data, for which there are actually separate caches.
Meanwhile, if threads belonging to the same process are mapped to pairs of CPU cores that share an L1 cache, and if such a pair of threads needs to communicate data, there can be a boost in efficiency, because each thread in such a pair is only communicating with cache, the lines of which do map to the same regions of memory. (:1)
I had my doubts as to whether the Linux kernel can increase the probability of this second scenario taking place successfully.
And the main reason for my doubt was the simple observation that, when the kernel re-schedules a single-threaded process, it has no preference for either even-numbered or odd-numbered cores to schedule it to. I can see this because I have a widget running on my desktops, which displays continuous graphs of hardware usage, and from which I can infer such information as I use my computers day-to-day.
When a programmer writes threads to run on CPU cores, he can make sure that his first thread only communicates with his second, that his third thread only communicates with his fourth, etc. But in that case, unless the kernel actually schedules the first thread of the program to run on an even-numbered logical core (counting from core zero), these pairs of threads which the programmer intended to communicate will be communicating across the boundaries imposed by separate L1 cache instances. This will still succeed, but only at a performance penalty. (:2)
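To make the intended pairing concrete, here is a trivial sketch (the helper name is my own, hypothetical): XOR-ing a thread’s number with 1 yields its intended partner, so that threads 0 and 1 form one pair, threads 2 and 3 the next, and so on.

```python
def partner(thread_id):
    # Hypothetical helper: thread i is meant to communicate
    # only with thread i ^ 1.
    return thread_id ^ 1

print([(t, partner(t)) for t in range(4)])
# prints [(0, 1), (1, 0), (2, 3), (3, 2)]
```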
There was a Python programming exercise in which I felt I had overcome this problem, by assigning a number of threads exactly equal to the number of cores on each machine’s CPU. In that case, the kernel may schedule the threads to the cores in their natural order, so that physical pairing would be observed. But aside from this exercise, under Linux, hyper-threading mainly presented itself to my mind as something to avoid.
At the same time, a modern CPU is plausible which has 32 cores, but in which each L1 cache is actually shared between 4 cores, not 2. And so each programmer is left to his own means, to optimize any threaded code.
Furthermore, I know of one CPU architecture in which the first 4 logical CPUs are mapped to the 4 real CPUs sequentially, after which the last 4 logical CPUs are mapped in the same way again.
Under Linux, the user may type in the following command, in order actually to see how his logical cores are mapped, at least numerically:
egrep "(( id|processo).*:|^ *$)" /proc/cpuinfo
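The same information can also be pulled out programmatically. Below is a rough sketch of my own devising in Python, which groups logical processor numbers by their (physical id, core id) pairs, assuming an x86-style /proc/cpuinfo in which those fields are present:

```python
from collections import defaultdict

def group_logical_cores(path="/proc/cpuinfo"):
    """Map (physical id, core id) -> list of logical processor numbers."""
    groups = defaultdict(list)
    processor = physical = core = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("processor"):
                processor = int(line.split(":")[1])
            elif line.startswith("physical id"):
                physical = int(line.split(":")[1])
            elif line.startswith("core id"):
                core = int(line.split(":")[1])
            elif not line and processor is not None:
                # Blank line terminates one processor's entry.
                groups[(physical, core)].append(processor)
                processor = physical = core = None
    if processor is not None:  # the file may not end with a blank line
        groups[(physical, core)].append(processor)
    return dict(groups)

print(group_logical_cores())
```

Two logical processors that appear under the same (physical id, core id) key are hardware threads of one physical core, and therefore share its L1 cache.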
As it happens, I was able to find direct evidence of a Python function which actually chooses which CPU cores the present process wishes to run on. And what this means is that the kernel must also expose such a feature to the user-space application…
(Updated 5/22/2019, 7h00 … )
(As of 5/16/2019 : )
The Python 3.3 functions which accomplish this are named:

os.sched_getaffinity()
os.sched_setaffinity()
What this means is that, if the application programmer is aware for some reason that the CPU cores share cache in any specific configuration, he can choose which of those cores his application should run on. All the threads of the process will be restricted to the given set of core-numbers. And, if the CPU-core affinity of the program is set to two paired cores, the Linux O/S will take care of not running an excess of threads on the same core, thereby simplifying any approach which the programmer would need to use, to get his hyper-threaded code to run on exactly 2 cores.
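Here is a minimal, hedged sketch of such a call, using the standard-library functions os.sched_getaffinity() and os.sched_setaffinity(), and assuming, purely for illustration, that logical cores 0 and 1 are such a pair:

```python
import os

# pid 0 refers to the calling process.
available = os.sched_getaffinity(0)
print("May currently run on logical cores:", sorted(available))

# Restrict the whole process, i.e. all of its threads, to two cores.
# Whether cores 0 and 1 actually share an L1 cache is CPU-specific;
# {0, 1} is only an assumption made for this example.
if {0, 1} <= available:
    os.sched_setaffinity(0, {0, 1})
    print("Now pinned to:", sorted(os.sched_getaffinity(0)))
```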
(Update 5/17/2019, 16h15 : )
OTOH, if the Linux programmer would like to micro-manage how each of his threads runs on a given core number, then Python may not be the best platform to do so. In that case, programming in C / C++ and using ‘pthreads’ may be the way to go, and the following article explains how to proceed:
(Update 5/18/2019, 10h55 : )
There is every possibility that other programmers who write threaded code might not even dream of taking advantage of the shared L1 data-cache, but only want to double the number of threads which can run on the CPU. In my earlier Python exercise, this may also have been the only thing I was accomplishing.
(Update 5/22/2019, 7h00 : )
According to my recent findings, if a C or C++ program is to take advantage of hyper-threading, then one conditional measure would be to stop flushing data-objects from cache when they are to be synced between threads. Flushing them would actually slow down the process when hyper-threading. But additionally, their ‘cpu_id’ would need to be computed as follows:
if hyper_threading:
    x = work_unit
    p = physical_cores
    l = logical_cores // p
    cpu_id = ((x % l) * p) + (x // l)
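To illustrate what this formula accomplishes, here is a small sketch (the wrapper function is my own), under the assumption of 4 physical cores presenting 8 logical cores, numbered in the way described earlier, so that logical CPU i and logical CPU i + 4 belong to the same physical core:

```python
def cpu_id_for(work_unit, physical_cores, logical_cores):
    # Direct transcription of the formula above; 'l' is the number
    # of logical cores (hardware threads) per physical core.
    x = work_unit
    p = physical_cores
    l = logical_cores // p
    return ((x % l) * p) + (x // l)

# Assumed topology: 4 physical cores exposing 8 logical cores.
print([cpu_id_for(x, 4, 8) for x in range(8)])
# prints [0, 4, 1, 5, 2, 6, 3, 7]
```

Consecutive work units 0 and 1 thus land on logical cores 0 and 4, which, under the assumed numbering, are the two hardware threads of one physical core, so each consecutive pair of work units shares an L1 cache.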