## One way in which my earlier description of CUDA was out of touch with the real-world implementation.

One of the subjects which many programmers have been studying is not only how to write highly parallel code, but how to write that code for the GPU, since the GPU is also the most readily available highly parallel processor. In fact, any user with a powerful graphics card may already have the basis to program using CUDA or OpenCL.

I had written an earlier posting, in which I ended up trying to devise a way by which the compiler of the synthesized C or C++ would detect whether each variable is being used as an ‘rvalue’ or as an ‘lvalue’ in different parts of a loop, and by which the compiler would then choose to allocate a local register, to allocate a shared register, or to make a local copy of a value once provided in a shared register.

According to what I think I’ve learned, this thinking was erroneous, simply because a CUDA or an OpenCL compiler does not take this responsibility off the hands of the coder. In other words, the coder needs to declare explicitly, each time, whether a variable is to be allocated in a local or a shared register (CUDA marks the latter with the ‘__shared__’ qualifier, and calls it shared memory). The coder must also keep track of how threads other than the current thread can change the value in a shared register, which may produce errors in what the current thread computes.

But a command which CUDA offers, and which needs to exist, is the ‘__syncthreads()’ function, which suspends the current thread until all the threads of the same thread block, i.e. those running together in one core-group, have executed the ‘__syncthreads()’ instruction, after which point they may all resume again.
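The explicit declarations and the barrier can be sketched in a few lines of CUDA. This is only an illustration under assumptions of my own: the kernel name `reverse_block`, the host helper `run_reverse_demo` and the block size of 64 are hypothetical, not taken from any earlier posting. The `__shared__` qualifier declares a variable that one thread block shares, an unqualified variable is private to each thread, and `__syncthreads()` keeps a thread from reading a shared element before the thread responsible for it has written it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int N = 64;  // one block of 64 threads (illustrative size)

// Reverses N integers in place. Each thread writes one shared element,
// then reads an element that a different thread wrote.
__global__ void reverse_block(int *data)
{
    __shared__ int tile[N];  // declared shared explicitly: one copy per thread block
    int t = threadIdx.x;     // unqualified local: one copy per thread

    tile[t] = data[t];
    __syncthreads();         // without the barrier, tile[N - 1 - t] might not be written yet
    data[t] = tile[N - 1 - t];
}

// Hypothetical host helper: returns true if the GPU reversed the array.
bool run_reverse_demo()
{
    int host[N];
    for (int i = 0; i < N; ++i) host[i] = i;

    int *dev = nullptr;
    if (cudaMalloc(&dev, sizeof(host)) != cudaSuccess) return false;
    cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice);
    reverse_block<<<1, N>>>(dev);
    cudaError_t err = cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < N; ++i)
        if (host[i] != N - 1 - i) return false;
    return true;
}

int main()
{
    std::printf("reverse %s\n", run_reverse_demo() ? "ok" : "failed");
    return 0;
}
```

Assuming the CUDA toolkit and an Nvidia GPU are present, this should build with `nvcc`. Removing the `__syncthreads()` line turns the kernel into exactly the kind of cross-thread hazard that the coder, not the compiler, has to keep track of.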

One fact which disappoints about the real ‘__syncthreads()’ instruction is that it offers little in the way of added capabilities. One thing which I had written this function may do, however, is actually to give the CPU a chance to run briefly, in a way not obvious to CUDA code.

But then there exist capabilities which a CUDA or an OpenCL programmer might want, which have no direct support from the GPU. One of those capabilities might be to lock an arbitrary object, so that the current thread can perform some computation which reads the object, after having obtained a lock on it, and which then writes changes to the object before giving up its lock on it.
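CUDA has no lock primitive of its own, but it does provide atomic functions such as `atomicCAS()` and `atomicExch()`, from which a simple spin-lock can be built. What follows is a hedged sketch, not a recommended pattern: the `Counter` struct, the kernel `add_under_lock` and the helper `run_lock_demo` are hypothetical names, and only one thread per block is allowed to compete for the lock, because having every thread of a warp spin on the same lock can deadlock on older GPUs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical shared object: a lock word (0 = free, 1 = held) plus the data it guards.
struct Counter {
    int lock;
    int total;
};

__global__ void add_under_lock(Counter *c, int amount)
{
    // Only thread 0 of each block competes for the lock.
    if (threadIdx.x == 0) {
        while (atomicCAS(&c->lock, 0, 1) != 0) {
            // spin: some other block currently holds the lock
        }
        __threadfence();                  // observe the previous holder's writes

        volatile int *total = &c->total;  // volatile: always read and write memory
        *total = *total + amount;         // read the object, compute, write it back

        __threadfence();                  // publish the write before unlocking
        atomicExch(&c->lock, 0);          // give up the lock
    }
}

// Hypothetical host helper: returns the final total, or -1 on a CUDA error.
int run_lock_demo(int blocks, int amount)
{
    Counter init = {0, 0};
    Counter *dev = nullptr;
    if (cudaMalloc(&dev, sizeof(Counter)) != cudaSuccess) return -1;
    cudaMemcpy(dev, &init, sizeof(Counter), cudaMemcpyHostToDevice);

    add_under_lock<<<blocks, 64>>>(dev, amount);

    Counter out = {0, -1};
    cudaError_t err = cudaMemcpy(&out, dev, sizeof(Counter), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return err == cudaSuccess ? out.total : -1;
}

int main()
{
    const int blocks = 32, amount = 5;
    std::printf("total = %d (expected %d)\n", run_lock_demo(blocks, amount), blocks * amount);
    return 0;
}
```

For an update as simple as this one, a single `atomicAdd()` would do the same job without any lock. The lock only earns its keep when the computation between reading and writing is more involved.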

(Updated 04/19/2018 : )

## Finding Out How Many GPU Cores We Have, Under Linux

One question which I see written about often on the Web is how to find out certain stats about our GPU under Linux. Under Windows, we have GUI-based programs such as ‘GPU-Z’, but under Linux, the information can be just a bit harder to find.

I think that one thing which helps is to have ‘OpenCL’ installed, as well as the command-line utility ‘clinfo’, which is the name of one out of several packages, as well as the name of the actual command which that package installs.

If we’re serious about programming our GPU, then having a GUI won’t help us much. We’d need to get dirty with code in that case, and text-based solutions are then suitable. But if we’re just spectators in this sport, then two stats we may nevertheless want to know are:

1. How many GPU-Core-Groups do we have – since GPU-Cores are organized as Groups, and
2. How many actual Shader-Cores do we have in each Group?

Interestingly, the grouping of shader-cores also represents how many vector-processors such GPU-computing tools as OpenCL see. And so, on the computer which I name ‘Klystron’, which is running Debian / Jessie, when I type in these commands as user, I get the following results:


    dirk@Klystron:~$ clinfo | grep units
      Max compute units:                             4
      Max compute units:                             6
    dirk@Klystron:~$ clinfo | grep multiple
      Kernel Preferred work group size multiple:     1
      Kernel Preferred work group size multiple:     64
    dirk@Klystron:~$



This needs some explaining. On ‘Klystron’, I have the proprietary AMD packages for OpenCL installed, since that computer has both an AMD CPU and a Radeon GPU. This means that OpenCL is able to carry out computing on both, and so I get the stats for both. In each pair of lines, the first entry belongs to the CPU and the second to the GPU.

In this case, the second entries reveal that I have 6×64 cores on the GPU, or 384 in total.
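For readers with an Nvidia card and CUDA rather than an AMD card and OpenCL, roughly comparable numbers can be read from the CUDA runtime. The sketch below is an assumption-laden illustration and would not run on ‘Klystron’ itself: `estimate_lanes` is a hypothetical helper which only repeats the multiplication above, while `multiProcessorCount` and `warpSize` are the fields of `cudaDeviceProp` which, as far as I can tell, play the roles of ‘Max compute units’ and of the ‘preferred work group size multiple’.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: compute units multiplied by lanes per unit.
int estimate_lanes(int compute_units, int group_multiple)
{
    return compute_units * group_multiple;
}

int main()
{
    // The GPU entries from the clinfo output shown above.
    std::printf("Klystron GPU: %d\n", estimate_lanes(6, 64));

    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("No CUDA device found.\n");
        return 0;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        std::printf("%s: %d multiprocessors, warp size %d\n",
                    prop.name, prop.multiProcessorCount, prop.warpSize);
    }
    return 0;
}
```

One caveat: on Nvidia hardware, the product of these two numbers should not be read as the core count, because the number of cores per multiprocessor differs from one GPU architecture to the next.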

Please note that this posting does not describe:

• Android GPUs, or
• Graphics Chips on PCs and Laptops, which use shared memory.

I am writing about the big graphics cards which power-users and gamers install into their PCs, which have a special bus-slot, and which cost as much money in themselves as some computers cost.

The way those are organized physically, they possess one or more GPUs and DDR Graphics RAM, which loosely correspond to the CPU and RAM on the motherboard of your PC.

The GPU itself contains registers, which are essentially of two types:

• Per-core, and
• Shared

When coding shaders for 3D games, the GPU-registers do not fulfill the same function as addresses in Graphics RAM (GRAM). The addresses in GRAM typically store texture images, vertex arrays in their various formats, and index buffers, as well as frame-buffers for the output. In other words, the GRAM typically stores model-geometry and 2D or 3D images. The registers on the GPU are typically used as temporary storage-locations for the work of shaders, which are, again, loaded separately onto the GPU after they have been compiled by the device-drivers.

A major feature which the designers of graphics cards have given them is to extend the system memory of the PC onto the graphics card, in such a way that most of the card’s memory actually has hardware-addresses as well.

This might not include the GPU-registers that are specific to one core, but I think it does include shared GPU-registers.
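As a rough way of seeing how large each kind of storage described in this section is, the CUDA runtime reports them separately on Nvidia hardware. This is only a sketch under assumptions: `to_mib` is a hypothetical helper, device 0 is assumed to be the card of interest, and the mapping of ‘shared registers’ onto what CUDA calls shared memory is my reading, not something the runtime states.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: bytes converted to mebibytes, for printing.
double to_mib(size_t bytes)
{
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("No CUDA device found.\n");
        return 0;
    }
    // Graphics RAM on the card, visible to every thread.
    std::printf("Global memory:              %.0f MiB\n", to_mib(prop.totalGlobalMem));
    // On-chip storage shared by the threads of one block.
    std::printf("Shared memory per block:    %zu bytes\n", prop.sharedMemPerBlock);
    // Per-thread registers, counted for a whole block.
    std::printf("32-bit registers per block: %d\n", prop.regsPerBlock);
    return 0;
}
```

The difference in scale, between a Graphics RAM measured in megabytes or gigabytes and on-chip storage measured in kilobytes, is the reason the registers serve only as temporary storage.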

## “Hardware Acceleration” is a bit of a Misnomer.

It gets mentioned quite frequently that certain applications offer to give the user services with “Hardware Acceleration”. This terminology can in fact be misleading, in a way that has no consequences, because computations that are hardware-accelerated are still being executed according to software that has either been compiled or assembled into micro-instructions. Only, those micro-instructions are not to be executed on the main CPU of the machine.

Instead, those micro-instructions are to be executed either on the GPU or on some other coprocessor, which provides the accelerating hardware.
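A small CUDA sketch can make this point visible: the very same source-level function is compiled once for the CPU and once for the GPU, and both versions are plain software. The names `scale_and_add`, `scale_and_add_gpu` and `count_mismatches` are hypothetical, and the input values are chosen so that the results are exact in floating point on either processor.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One function in the source, compiled twice: for the CPU and for the GPU.
__host__ __device__ float scale_and_add(float a, float x, float y)
{
    return a * x + y;
}

__global__ void scale_and_add_gpu(float a, const float *x, const float *y, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = scale_and_add(a, x[i], y[i]);
}

// Hypothetical helper: runs both versions, returns how many elements differ (-1 on error).
int count_mismatches(int n)
{
    std::vector<float> x(n), y(n), cpu(n), gpu(n, -1.0f);
    for (int i = 0; i < n; ++i) { x[i] = float(i); y[i] = 1.0f; }

    // The CPU executes the function in an ordinary loop.
    for (int i = 0; i < n; ++i) cpu[i] = scale_and_add(2.0f, x[i], y[i]);

    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *dx = nullptr, *dy = nullptr, *dout = nullptr;
    bool ok = cudaMalloc(&dx, bytes) == cudaSuccess &&
              cudaMalloc(&dy, bytes) == cudaSuccess &&
              cudaMalloc(&dout, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(dx, x.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dy, y.data(), bytes, cudaMemcpyHostToDevice);
        // The GPU executes the same function, one element per thread.
        scale_and_add_gpu<<<(n + 255) / 256, 256>>>(2.0f, dx, dy, dout, n);
        ok = cudaMemcpy(gpu.data(), dout, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dx); cudaFree(dy); cudaFree(dout);  // freeing a null pointer is harmless
    if (!ok) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (cpu[i] != gpu[i]) ++mismatches;
    return mismatches;
}

int main()
{
    std::printf("mismatches: %d\n", count_mismatches(1000));
    return 0;
}
```

Nothing in the GPU version is hard-wired. What the hardware contributes is the ability to run the same few instructions on many elements at once.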

Often, the compiling of code meant to run on a GPU, even though it is the same in theory as compiling regular software, has its own special considerations. For example, this code often consists of only a few micro-instructions, over which great care must be taken to make sure that they run correctly on as many GPUs as possible. I.e., when we are coding a shader, KISS is often a main paradigm. And the possibility crops up often in practice that, even though the code is technically correct, it does not run correctly on a given GPU.

I do not really know how it is with SIMD coprocessors.

But this knowledge would be useful to have, in order to understand this posting of mine.

Of course, there exists a major contradiction to what I just wrote, in OpenCL and CUDA.