One of the features which CPUs were advertised as having several years ago, was “hyper-threading”. It can always happen to me that CPU cores have some feature which I’m not up-to-date on, but my own mind has a concept for this feature, which I find to be adequate:
Given a CPU with 8 cores, it can happen that an L1 cache is shared between every pair of cores, while the L2 and the L3 cache may be shared between all cores. Because of the way each L1 cache instance is optimized, this can lead to a performance penalty, if threads from different processes are assigned to the 2 cores belonging to one shared L1 cache. This can also be referred to as ‘false sharing’, since lines in the cache will be referenced repeatedly, that actually map to completely different regions of RAM. Every time that happens, a replacement operation needs to be done in the cache, to update the offending line with the most-recently-addressed memory region (called the corresponding “frame”), which happens more slowly than the full speed at which the CPU core-proper can read instructions and/or data, for which there are actually separate caches.
Meanwhile, if threads belonging to the same process are mapped to pairs of CPU cores that share an L1 cache, and if such a pair of threads needs to communicate data, there can be a boost to efficiency because each thread in such a pair is only communicating with cache, the lines of which do map to the same regions of memory. (:1)
I had my doubts as to whether the Linux kernel can increase the probability of this second scenario taking place successfully.
And the main reason for my doubt was, the mere observation that when the kernel re-schedules a single-threaded process, it has no preference for either even-numbered or odd-numbered cores to schedule it to. I can see this because I have a widget running on my desktops, which displays continuous graphs of hardware usage, from which I can infer information as I’m using my computers on a day-to-day basis.
When a programmer programs threads to run on CPU cores, he can make sure that his first thread only communicates with his second, that his third thread only communicates with his fourth, etc. But in that case, unless the kernel actually schedules the first thread of the program to run on an even-numbered physical core (starting from core zero), these pairs of threads which the programmer intended to communicate, will be communicating across the boundaries imposed by separate L1 cache instances. This will still succeed, but only at a performance penalty.
There was a Python programming exercise in which I felt I had overcome this problem, by assigning a number of threads exactly equal to the number of cores on each machine’s CPU. In that case, the kernel may schedule the threads to the cores in their natural order, so that physical pairing would be observed. But aside from trying this exercise, under Linux, hyper-threading mainly presented an avoidance issue to my mind.
Simultaneously, a modern CPU is plausible, which has 32 cores, but in which each L1 cache is actually shared between 4 cores, not 2. And so each programmer is left to his means, to optimize any threaded code.
Furthermore, I know of one CPU architecture in which the first 4 logical CPUs are mapped to the existing 4 real CPUs sequentially, after which the last 4 logical CPUs are mapped that way again.
Under Linux, the user may type in the following command, in order actually to see how his logical cores are mapped, at least numerically:
egrep "(( id|processo).*:|^ *$)" /proc/cpuinfo
As it happens, I was able to find direct evidence of a Python function which actually chooses which CPU cores the present process wishes to run on. And what this means is, that the kernel must also expose such a feature to the user-space application…
(Updated 5/18/2019, 10h55 … )