A common-sense observation about CPU caching policies.

One of the subjects which I did write about recently, concerned CPU caching, and special considerations for multi-threaded processes, in which more than one thread, running on different cores, need to share information. What I had written was that, if a CPU core merely writes data to its L1 cache, then the corresponding line in the cache is merely marked ‘dirty’ – not, flushed to RAM. And a common-sense question which my readers could have about that would be ‘Why such a policy? Why Not flush that line of the cache immediately?’ And I’d give a 2-part answer to that question:

  1. When a CPU core writes data to a line of cache – that is, to a range of memory addresses, as the program sees things – it will typically do so numerous times in succession. The cache only speeds up the operation of the CPU if replacement operations are considerably less frequent than the core’s reads and writes themselves.
  2. Most of the time, the CPU core that wrote the data will also be the core that needs to read it back ‘from memory’. The only real exception arises in a multi-threaded program.
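
As a very rough illustration of the policy just described, here is a minimal Python sketch of a write-back cache. This is purely a toy model of my own; the class and method names are inventions for the sake of the example, not how any real hardware is organized:

class CacheLine:
    def __init__(self, tag, data=None):
        self.tag = tag        # which block of RAM this line currently holds
        self.data = data
        self.dirty = False    # set by a write; cleared when the line is written back

class WriteBackCache:
    def __init__(self, ram):
        self.ram = ram        # stands in for main memory: {tag: data}
        self.lines = {}       # a tiny, fully-associative stand-in for the cache
    def write(self, tag, data):
        # A write only touches the cache and marks the line 'dirty';
        # nothing is sent to RAM at this point.
        line = self.lines.setdefault(tag, CacheLine(tag))
        line.data = data
        line.dirty = True
    def flush(self, tag):
        # Performed on an explicit flush, or as part of a replacement operation:
        # only now is the dirty data written out towards RAM.
        line = self.lines.get(tag)
        if line is not None and line.dirty:
            self.ram[tag] = line.data
            line.dirty = False

With this model, ram stays untouched through any number of calls to write(), and only receives the latest value once flush() is eventually called for that line.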

What I went on to write, however, was that once a replacement operation is performed on a dirty line of the cache, or once the core explicitly asks to flush that line, its data is written out towards RAM. Specifically in the second scenario, the operation presumably propagates from the L1 cache to the L2 cache, if there is one, and then to the L3 cache as well… There is an added observation to make about such propagation. If there is more than one cache, it also needs to be bidirectional to some extent. The reason is the infrequently stated intent that flushing a line of cache should make any changes written to its data visible to the other cores of the same CPU.

What this should also mean is that, when other L1 caches hold lines corresponding to an L2 cache line being written to, or other L2 caches hold lines corresponding to an L3 cache line being written to, those lines need to be brought up to date as well, so that ultimately an L1 cache serving a different core will not contain any stale data. And the easiest way to accomplish that might be simply to make sure that the affected lines of cache are left cleared, so that once their respective cores try to read the same memory addresses again, the updated data can be replaced into them.

Effectively, if the caching policy is inclusive, and if two separate L1 caches hold dirty lines corresponding to the one L2 cache line being written to, then one of those L1 lines is orphaned… That line of cache may best just be cleared, too bad.

An entirely separate question is whether an L3 cache line then also needs to be cleared, given that all the cores of a CPU map through it. It could simply be marked ‘dirty’ in turn.
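
Reusing the toy WriteBackCache sketched earlier, the following is a hedged illustration of only the invalidation idea from the last few paragraphs: when one core’s dirty line is written back towards a shared, inclusive L2 cache, the matching lines in the other cores’ L1 caches are simply discarded, so that their next read fetches the updated data. It is not meant to reproduce any real coherence protocol, such as MESI:

def write_back_and_invalidate(writing_l1, peer_l1_caches, shared_l2, tag):
    # Write the dirty line out towards the shared (inclusive) L2 cache...
    line = writing_l1.lines.get(tag)
    if line is not None and line.dirty:
        shared_l2[tag] = line.data
        line.dirty = False
    # ...and clear the matching lines in every other core's L1 cache, so that
    # the next read by those cores replaces the line with the updated data.
    for peer in peer_l1_caches:
        peer.lines.pop(tag, None)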

But such realizations also make the real-world design of CPU caches a nightmare, which only the highest-ranking Electronics Engineers can tackle.


My personal answer to whether Hyper-Threading works under Linux.

I have been exploring this subject through a series of experiments written in Python, and through what I learned when I studied System Hardware at Concordia University.

When a person uses a Windows computer, that O/S takes care of all the details of scheduling processes and threads, and arguably it does so well. But when a person is using Linux, the kernel makes all the required information available, yet does not specifically take care of optimizing how threads are scheduled. It becomes the responsibility of the application, or of any other user-space program, to optimize how it takes up threads, using CPU affinity, or using low-level C functions that instruct the CPU to replace a single line in the L1 cache.
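
As far as CPU affinity is concerned, the kernel does expose that control to user space, and Python’s standard library wraps the relevant system calls. A minimal, hedged example (Linux-only, Python 3.3 or later):

import os
print(os.sched_getaffinity(0))   # logical CPUs this process may run on, e.g. {0, 1, 2, 3}
os.sched_setaffinity(0, {0, 1})  # restrict the process to logical CPUs 0 and 1
print(os.sched_getaffinity(0))   # now prints {0, 1}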

In the special case where a person is writing scripts in Python, because this is an interpreted language, the program that is actually running is the Python interpreter. How well the scheduling of threads works in that case depends on how well the interpreter itself has been coded to do so. In addition, how well certain Python modules have been coded has a strong effect on how efficiently they schedule threads. It just so happens that I’ve been lucky, in that the Python versions I get from the Debian repositories happen to be programmed very well – by other people.

Dirk

 


Questioning whether the Linux kernel can take advantage of hyper-threading in a positive sense?

One of the features which CPUs were advertised as having, several years ago, was “hyper-threading”. It can always happen that CPU cores have some feature I’m not up to date on, but my own mind holds a concept of this feature which I find adequate:

Given a CPU with 8 cores, it can happen that an L1 cache is shared between every pair of cores, while the L2 and L3 caches may be shared between all the cores. Because of the way each L1 cache instance is optimized, this can lead to a performance penalty if threads from different processes are assigned to the 2 cores belonging to one shared L1 cache. This can also be referred to as ‘false sharing’, since lines in the cache that actually map to completely different regions of RAM will be referenced repeatedly. Every time that happens, a replacement operation needs to be done in the cache, to update the offending line with the most recently addressed memory region (the corresponding “frame”), and this happens more slowly than the full speed at which the CPU core proper can read instructions and/or data – for which there are actually separate caches.

Meanwhile, if threads belonging to the same process are mapped to pairs of CPU cores that share an L1 cache, and if such a pair of threads needs to communicate data, there can be a boost in efficiency, because each thread in such a pair is only communicating with cache lines that map to the same regions of memory. (:1)
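
Whether a particular machine really groups its logical cores this way can be checked under Linux, because the kernel publishes its cache topology in sysfs. The snippet below is only a sketch; which cache levels exist, and how each one is shared, varies from one CPU to the next:

import glob
def read(path):
    with open(path) as f:
        return f.read().strip()
for index_dir in sorted(glob.glob("/sys/devices/system/cpu/cpu0/cache/index*")):
    level = read(index_dir + "/level")             # 1, 2 or 3
    kind = read(index_dir + "/type")               # Data, Instruction or Unified
    shared = read(index_dir + "/shared_cpu_list")  # logical CPUs sharing this cache
    print(f"L{level} {kind} cache of cpu0: shared with CPUs {shared}")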

I had my doubts as to whether the Linux kernel can increase the probability of this second scenario taking place successfully.

And the main reason for my doubt was the simple observation that, when the kernel reschedules a single-threaded process, it shows no preference for either even-numbered or odd-numbered cores. I can see this because I have a widget running on my desktops which displays continuous graphs of hardware usage, from which I can infer such information as I use my computers on a day-to-day basis.

When a programmer writes threads to run on CPU cores, he can make sure that his first thread only communicates with his second, that his third thread only communicates with his fourth, etc. But in that case, unless the kernel actually schedules the first thread of the program onto an even-numbered physical core (counting from core zero), the pairs of threads which the programmer intended to communicate will be communicating across the boundaries imposed by separate L1 cache instances. This will still succeed, but only at a performance penalty.
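
If a programmer does not want to leave that pairing to chance, one hedged way of sketching it in Python is to have each thread set its own affinity; under Linux, passing a PID of zero applies the call to the calling thread. The assumption that logical cores 2k and 2k+1 share an L1 cache is exactly that, an assumption about the topology, which the sysfs information shown earlier can help verify:

import os, threading
def worker(pair_index, role):
    # Pin this thread to the pair of logical cores (2k, 2k+1); whether that
    # pair really shares an L1 cache depends on the actual CPU topology.
    os.sched_setaffinity(0, {2 * pair_index, 2 * pair_index + 1})
    # ...the two threads of each pair would exchange their data here...
    print(f"pair {pair_index} ({role}) restricted to CPUs {os.sched_getaffinity(0)}")
threads = [threading.Thread(target=worker, args=(k, role))
           for k in range(os.cpu_count() // 2)
           for role in ("producer", "consumer")]
for t in threads:
    t.start()
for t in threads:
    t.join()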

There was a Python programming exercise in which I felt I had overcome this problem, by assigning a number of threads exactly equal to the number of cores on each machine’s CPU. In that case, the kernel may schedule the threads onto the cores in their natural order, so that the physical pairing is observed. But aside from that exercise, under Linux, hyper-threading mainly struck me as something to be avoided.
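
The exercise itself is not reproduced here, but a hedged sketch of that thread-count choice, using only the standard library, could look something like this (keeping in mind that, with CPython’s GIL, pure-Python arithmetic will not actually run in parallel; the real benefit comes from work that releases the GIL):

import os
from concurrent.futures import ThreadPoolExecutor
def crunch(chunk):
    return sum(x * x for x in chunk)
chunks = [range(i, i + 100000) for i in range(0, 800000, 100000)]
# One worker thread per logical core, so that the kernel can map them onto cores 1:1.
with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    results = list(pool.map(crunch, chunks))
print(len(results), "partial sums computed")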

At the same time, a modern CPU is plausible which has 32 cores, but in which each L1 cache is actually shared between 4 cores, not 2. And so each programmer is left to his own means to optimize any threaded code.

Furthermore, I know of one CPU architecture in which the first 4 logical CPUs are mapped to the existing 4 real CPUs sequentially, after which the last 4 logical CPUs are mapped that way again.

Under Linux, the user may type in the following command, in order actually to see how his logical cores are mapped, at least numerically:

 


egrep "(( id|processo).*:|^ *$)" /proc/cpuinfo
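
Alternatively, the same mapping can be read out of sysfs with a few lines of Python; this, too, is only a sketch, and the exact sibling grouping it prints depends on the machine:

import glob
for cpu_dir in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*")):
    with open(cpu_dir + "/topology/thread_siblings_list") as f:
        siblings = f.read().strip()
    name = cpu_dir.rsplit("/", 1)[-1]
    print(f"{name}: shares a physical core with logical CPUs {siblings}")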

 

As it happens, I was able to find direct evidence of a Python function which actually chooses which CPU cores the present process wishes to run on. And what this means is that the kernel must also expose such a feature to user-space applications…

(Updated 5/18/2019, 10h55 … )


Latest Debian Security Update Breaks Jessie (Resolved).

In addition to my Debian / Stretch computer, I still operate two Debian / Jessie computers. Those latter two computers were subscribed to the Debian Security repository, as well as to the standard Debian / Jessie repository. Unfortunately, the package manager on one of the Jessie computers made me aware of a conflict, due to an update which Debian Security was pushing to a package and its related packages, all belonging to:

libqt4-dev

The version which Debian Security is trying to install is:

4:4.8.6+git64-g5dc8b2b+dfsg-3+deb8u2

But the version which the rest of Debian / Jessie had been using was:

4:4.8.6+git64-g5dc8b2b+dfsg-3+deb8u1

The problem was that, if I told my package manager to go ahead with its suggested updates, doing so would have forced me to reject a long, long list of packages essential to my system, including many KDE 4-related packages. Now, I could just ignore that this problem exists, and rely on my package manager, day after day, not to install packages that would break my system. But that would become a very unsafe practice in the long run. And so, the only safe course of action currently seemed to be to unsubscribe from Debian Security instead.

(Update 17h55 : )

I have resubscribed to the Debian Security repository in question and re-attempted the update, only to find that this time it worked. I can think of 2 possible reasons why it might not have worked the first time:

  1. My unattended-upgrades script is configured to break an update up into smaller pieces, and because this update involved a large number (over 20) of Qt 4 packages, that in itself could have broken the ability to perform the update, or
  2. Debian Security may not initially have put all the involved updates ‘out there’ on its servers, to be downloadable in one shot, even though every Qt 4 package needs to be updated in order for any of the updates to succeed. Only hours later, all the required packages may have become available (on the servers).

I rather think that it was due to reason (2) and not reason (1) above.

Dirk

 
