My personal answer to whether Hyper-Threading works under Linux.

I have been exploring this subject through a series of experiments written in Python, and through what I learned when studying the subject of System Hardware at Concordia University.

When a person uses a Windows computer, that O/S takes care of all the details of scheduling processes and threads, and arguably it does so well. But when a person is using Linux, the kernel makes all the required information available, yet does not specifically take care of optimizing how threads are scheduled. It becomes the responsibility of the application, or of any other user-space program, to optimize how it takes up threads, either by setting CPU affinity, or by using low-level C functions that instruct the CPU to replace a single line in the L1 cache.

In the special case where a person is writing scripts in Python, because this is an interpreted language, the program which is actually running is the Python interpreter. How well the scheduling of threads works in that case depends on how well this Python interpreter has been coded to handle it. In addition, how well certain Python modules have been coded has a strong effect on how efficiently they schedule threads. It just so happens that I’ve been lucky, in that the Python versions I get from the Debian repositories happen to be programmed very well. By other people.

Dirk

 

Questioning whether the Linux kernel can take advantage of hyper-threading in a positive sense?

One of the features which CPUs were advertised as having several years ago was “hyper-threading”. It can always happen that CPU cores have some feature which I’m not up-to-date on, but my own mind has a concept for this feature which I find to be adequate:

Given a CPU that presents 8 logical cores, it can happen that an L1 cache is shared between each pair of logical cores (the two hyper-threads of one physical core), while the L2 and the L3 cache may be shared between all cores. Because of the way each L1 cache instance is organized, this can lead to a performance penalty if threads from different processes are assigned to the 2 logical cores belonging to one shared L1 cache: lines in the cache will be evicted and refilled repeatedly, because the two threads keep referencing completely different regions of RAM. Every time that happens, a replacement operation needs to be done in the cache, to update the offending line from the most-recently-addressed block of memory, and that happens more slowly than the full speed at which the CPU core proper can read instructions and/or data, for which there are actually separate caches. (This contention is sometimes loosely lumped in with ‘false sharing’, although in the strict sense that term refers to two threads writing to different variables which happen to occupy the same cache line.)

Meanwhile, if threads belonging to the same process are mapped to a pair of logical cores that share an L1 cache, and if such a pair of threads needs to communicate data, there can be a boost in efficiency, because each thread in the pair is only communicating through cache lines that map to the same regions of memory. (:1)
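Whether this sharing helps or hurts in a given case can be measured directly. What follows is only a sketch, written with Python's standard library: it pins two CPU-bound worker processes either onto a pair of logical cores assumed to belong to one physical core, or onto two cores assumed to belong to separate physical cores, and times each arrangement. The specific core numbers are assumptions about the machine's numbering, which has to be verified first (the command further down shows how).

import multiprocessing as mp
import os
import time

def busy_work(core, n=20_000_000):
    # Pin this worker process to one logical core, then do some CPU-bound work.
    os.sched_setaffinity(0, {core})
    total = 0
    for i in range(n):
        total += i * i
    return total

def run_pair(cores):
    # Run two CPU-bound workers at the same time, each pinned to one of the
    # given logical cores, and report the elapsed wall-clock time.
    start = time.time()
    workers = [mp.Process(target=busy_work, args=(c,)) for c in cores]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return time.time() - start

if __name__ == '__main__':
    # Assumed numbering: logical cores 0 and 4 are the two hyper-threads of one
    # physical core, while 0 and 1 sit on separate physical cores. The real
    # numbering differs between machines.
    print('same physical core:     ', run_pair((0, 4)))
    print('separate physical cores:', run_pair((0, 1)))

For purely CPU-bound work such as this, the 'same physical core' case would be expected to take longer, because the two workers compete for one core's execution resources and L1 cache; the picture can change when the two workers mainly exchange data that fits in that shared cache.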

I had my doubts as to whether the Linux kernel can increase the probability of this second scenario actually taking place.

And the main reason for my doubt was the simple observation that, when the kernel re-schedules a single-threaded process, it has no preference for either even-numbered or odd-numbered cores to schedule it to. I can see this because I have a widget running on my desktops, which displays continuous graphs of hardware usage, from which I can infer such information as I’m using my computers on a day-to-day basis.

When a programmer writes threads to run on CPU cores, he can make sure that his first thread only communicates with his second, that his third thread only communicates with his fourth, etc. But in that case, unless the kernel actually schedules the first thread of the program onto an even-numbered logical core (counting from core zero), the pairs of threads which the programmer intended to communicate will be communicating across the boundaries imposed by separate L1 cache instances. This will still succeed, but only at a performance penalty. (:2)
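Affinity can be used to make the intended pairing explicit, so that it no longer depends on which logical core the kernel happens to place the first thread on. Below is a sketch of that pattern, written with worker processes so that it runs as-is, and under the assumption (which has to be verified per machine) that logical cores 2k and 2k+1 share one L1 cache; the Pipe is only a stand-in for whatever data the real pair would exchange.

import multiprocessing as mp
import os

def worker(pair, conn, role):
    # Restrict this worker to the two logical cores of its assigned pair.
    os.sched_setaffinity(0, set(pair))
    if role == 'sender':
        conn.send(b'data the pair needs to exchange')
    else:
        print(os.getpid(), 'received:', conn.recv())
    conn.close()

if __name__ == '__main__':
    for k in range(os.cpu_count() // 2):
        pair = (2 * k, 2 * k + 1)      # assumed to share one L1 cache
        a, b = mp.Pipe()
        mp.Process(target=worker, args=(pair, a, 'sender')).start()
        mp.Process(target=worker, args=(pair, b, 'receiver')).start()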

There was a Python programming exercise in which I felt I had worked around this problem, by assigning a number of threads exactly equal to the number of logical cores on each machine’s CPU. In that case, the kernel may schedule the threads onto the cores in their natural order, so that the physical pairing is observed. But aside from trying that exercise, under Linux, hyper-threading mainly struck me as something to be avoided.
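A bare-bones sketch of that kind of setup (its general shape, not the original exercise), written here with multiprocessing so that the workers really do run CPU-bound in parallel, and with no explicit affinity calls, leaving the placement entirely to the kernel:

import multiprocessing as mp
import os

def crunch(n):
    # Stand-in for whatever CPU-bound work the exercise actually performed.
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    # Exactly one worker per logical core; with as many workers as cores,
    # the kernel's natural placement tends to cover every core.
    with mp.Pool(processes=os.cpu_count()) as pool:
        results = pool.map(crunch, [2_000_000] * os.cpu_count())
    print(len(results), 'results computed')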

At the same time, it’s plausible for a modern CPU to have 32 logical cores, in which each L1 cache is actually shared between 4 of them, not 2. And so each programmer is left to his own means to optimize any threaded code.

Furthermore, I know of one CPU architecture in which the first 4 logical CPUs are mapped to the 4 real CPUs sequentially, after which the last 4 logical CPUs are mapped that way again, so that logical CPUs 0 and 4 (rather than 0 and 1) end up being the two hyper-threads of the same physical core.

[Screenshot: Screenshot_20190520_132333]

Under Linux, the user may type in the following command, in order to see how his logical cores are actually mapped, at least numerically:

 


egrep "(( id|processo).*:|^ *$)" /proc/cpuinfo
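The same mapping can also be pulled out programmatically. Here is a short sketch which assumes the x86-style 'processor', 'core id' and 'physical id' fields are present in /proc/cpuinfo, and which groups the logical processors by the physical core they belong to:

# Group logical processors by (physical id, core id), as read from /proc/cpuinfo.
cores = {}
entry = {}

def record(entry, cores):
    if entry:
        key = (entry.get('physical id'), entry.get('core id'))
        cores.setdefault(key, []).append(entry.get('processor'))

with open('/proc/cpuinfo') as f:
    for line in f:
        if ':' in line:
            key, _, value = line.partition(':')
            entry[key.strip()] = value.strip()
        elif not line.strip():          # a blank line ends one processor's entry
            record(entry, cores)
            entry = {}
record(entry, cores)                    # in case the file did not end with a blank line

for (phys, core), logical in sorted(cores.items()):
    print('physical id', phys, 'core id', core, '-> logical CPUs', logical)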

 

As it happens, I was able to find direct evidence of a Python function which actually chooses which CPU cores the present process wishes to run on. And what this means is that the kernel must also expose such a feature to the user-space application…
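One function of exactly that kind, which Python's standard library provides on Linux (and which the sketches above already leaned on), is os.sched_setaffinity(), a thin wrapper around the kernel's sched_setaffinity(2) system call; its counterpart os.sched_getaffinity() reads the current setting back. The kernel also publishes which logical CPUs are hyper-thread siblings of one another, through sysfs. A small sketch, with the sysfs path being the usual one on current kernels:

import os

# Which logical CPUs is this process currently allowed to run on?
print('allowed CPUs:', os.sched_getaffinity(0))

# Which logical CPUs share the physical core of logical CPU 0?
# (The file reads e.g. "0,4" or "0-1", depending on how the CPU is enumerated.)
with open('/sys/devices/system/cpu/cpu0/topology/thread_siblings_list') as f:
    print('hyper-thread siblings of CPU 0:', f.read().strip())

# Restrict the process to logical CPU 0 only, then restore the original set.
original = os.sched_getaffinity(0)
os.sched_setaffinity(0, {0})
os.sched_setaffinity(0, original)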

(Updated 5/22/2019, 7h00 … )
