One way in which my earlier description of CUDA was out of touch with the real-world implementation.

One of the subjects which many programmers have been studying is not only how to write highly parallel code, but how to write that code for the GPU, since the GPU is also the most readily available highly parallel processor. In fact, any user with a powerful graphics card may already have the hardware needed to program using CUDA or using OpenCL.

I had written an earlier posting, in which I ended up trying to devise a way by which the compiler of the synthesized C or C++ would detect whether each variable is being used as an ‘rvalue’ or an ‘lvalue’ in different parts of a loop, and by which the compiler would then choose to allocate a local register, to allocate a shared register, or to make a local copy of a value once provided in a shared register.

According to what I think I’ve learned, this thinking was erroneous, simply because a CUDA or an OpenCL compiler does not take this responsibility off the coder’s hands. In other words, the coder needs to declare explicitly, each time, whether a variable is to be allocated in a local or a shared register, and must also keep track of how his code can change the value in a shared register from threads other than the current thread, which may produce errors in how the current thread computes.
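What follows is only a minimal sketch of that explicit choice, written in CUDA rather than OpenCL. The kernel name, the buffer names, and the assumption that the kernel is launched as a single block of at most 256 threads are all my own, made up for illustration:

__global__ void scale_and_store(const float *in, float *out, float factor)
{
    /* A plain local variable: the compiler will normally place this in a
       per-thread register, and no other thread can see or change it. */
    float local_value = in[threadIdx.x] * factor;

    /* Explicitly shared: one copy per thread block, visible to every thread
       of that block. Any thread may overwrite any element, so the coder must
       keep track of when other threads write to it. */
    __shared__ float block_buffer[256];

    block_buffer[threadIdx.x] = local_value;

    /* Without synchronization, reading a *different* thread's slot here
       could observe a stale or partially-written value. */
    out[threadIdx.x] = block_buffer[threadIdx.x];
}

The point of the sketch is only that the distinction is spelled out by the coder, through the presence or absence of the ‘__shared__’ qualifier, and not inferred by the compiler.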

But, a command which CUDA offers, and which needs to exist, is the ‘__syncthreads()’ function, which suspends the current thread until all the threads running in one core-group (a thread block, in CUDA’s terminology) have executed the ‘__syncthreads()’ instruction, after which point they may all resume again.
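Below is a short sketch, again in CUDA with invented names, of the situation this barrier exists for: each thread first writes its own slot of a shared array, all the threads of the block wait at the barrier, and only then does each thread read a neighbour’s slot. It assumes a launch with a block size of exactly 256:

__global__ void neighbour_sum(const float *in, float *out)
{
    __shared__ float tile[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];

    /* Every thread of the block stops here, until all of them have written
       their element of 'tile'. */
    __syncthreads();

    /* Only now is it safe to read another thread's element. */
    int neighbour = (threadIdx.x + 1) % blockDim.x;
    out[i] = tile[threadIdx.x] + tile[neighbour];
}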

One fact which disappoints about the real ‘__syncthreads()’ instruction is that it offers little in the way of added capabilities. One thing which, in that earlier posting, I had suggested this function might do, however, is actually give the CPU a chance to run briefly, in a way not obvious to the CUDA code.

But then there exist capabilities which a CUDA or an OpenCL programmer might want, but which have no direct support from the GPU. One of those capabilities might be to lock an arbitrary object, so that the current thread can perform some computation which reads the object (after having obtained a lock on it) and which then writes changes to the object, before giving up its lock on it.
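What the GPU does offer directly are atomic read-modify-write operations on addressable memory, out of which such a lock can be built by hand. The sketch below does this with CUDA’s ‘atomicCAS()’ and ‘atomicExch()’; the variable names are my own, and, to sidestep the deadlocks which naive per-thread spin-locks can cause within a warp on older GPU architectures, only one thread per block contends for the lock:

__device__ int object_lock = 0;        /* 0 = free, 1 = held */
__device__ float shared_object = 0.0f;

__global__ void guarded_update(float increment)
{
    if (threadIdx.x == 0) {
        /* Spin until the compare-and-swap succeeds in changing 0 -> 1. */
        while (atomicCAS(&object_lock, 0, 1) != 0) { }

        /* Critical section: read, compute, write back. A volatile pointer is
           used so the accesses really go to global memory, not a stale cache. */
        volatile float *obj = &shared_object;
        float value = *obj;
        *obj = value + increment;

        /* Publish the write to the whole device, then release the lock. */
        __threadfence();
        atomicExch(&object_lock, 0);
    }
}

This is only a sketch of the technique, not a recommendation; getting such hand-rolled locks both correct and fast on a GPU is notoriously delicate.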

(Updated 04/19/2018 : )


The PC Graphics Cards have specifically been made Memory-Addressable.

Please note that this posting does not describe

  • Android GPUs, or
  • Graphics Chips on PCs and Laptops which use shared memory.

I am writing about the big graphics cards which power-users and gamers install into their PCs, which plug into a special bus-slot, and which by themselves cost as much money as some complete computers.

The way those cards are organized physically, they possess one or more GPUs, as well as dedicated DDR graphics RAM, and these loosely correspond to the CPU and the RAM on the motherboard of your PC.

The GPU itself contains registers, which are essentially of two types:

  • Per-core, and
  • Shared

When coding shaders for 3D games, the GPU registers do not fulfill the same function as addresses in GRAM. The addresses in Graphics RAM typically store texture images, vertex arrays in their various formats, and index buffers, as well as frame-buffers for the output. In other words, the GRAM typically stores model geometry and 2D or 3D images. The registers on the GPU are typically used as temporary storage locations for the work of the shaders, which are in turn loaded onto the GPU separately, after they have been compiled by the device-drivers.
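The sketch below restates that division of labour in CUDA terms rather than through a 3D API, since that is the register this discussion has been using; the buffer names and the kernel are invented. The large arrays are allocated in the card’s graphics RAM from the host side, while the kernel’s temporary value lives in a GPU register:

#include <cuda_runtime.h>

__global__ void brighten(const float *pixels_in, float *pixels_out, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        /* 'p' is a temporary, register-resident value. */
        float p = pixels_in[i];
        pixels_out[i] = p * 1.1f;
    }
}

int main(void)
{
    const int count = 1 << 20;
    float *d_in, *d_out;

    /* These allocations land in the card's own DDR graphics RAM. */
    cudaMalloc(&d_in,  count * sizeof(float));
    cudaMalloc(&d_out, count * sizeof(float));
    cudaMemset(d_in, 0, count * sizeof(float));

    brighten<<<(count + 255) / 256, 256>>>(d_in, d_out, count);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}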

A major feature which the designers of graphics cards have given them is to extend the memory-map of the PC onto the graphics card, in such a way that most of the card’s memory actually has hardware addresses as well.

This might not include the GPU registers that are specific to one core, but I think it does include the shared GPU registers.
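One programmer-visible consequence of this addressability, which I can sketch in CUDA under the assumption of a 64-bit system and a toolkit of version 10 or later, is unified virtual addressing: a pointer into graphics RAM is an ordinary address in the process’s address space, and the runtime can report what sort of memory it points to. This does not display the hardware bus addresses themselves, only their software-visible side-effect:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    float *d_buf = NULL;
    cudaMalloc(&d_buf, 4096);    /* a small buffer in graphics RAM */

    cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, d_buf);

    printf("pointer value      : %p\n", (void *)d_buf);
    printf("resides on device? : %s (device %d)\n",
           attr.type == cudaMemoryTypeDevice ? "yes" : "no",
           attr.device);

    cudaFree(d_buf);
    return 0;
}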
