“Hardware Acceleration” is a bit of a Misnomer.

The term comes up quite frequently: certain applications offer to provide the user with services “with Hardware Acceleration”. This terminology can in fact be misleading (in a way that has no consequences), because computations that are hardware-accelerated are still executed according to software that has been either compiled or assembled into micro-instructions. Only, those micro-instructions are not executed on the main CPU of the machine.

Instead, those micro-instructions are executed either on the GPU or on some other coprocessor, which provides the accelerating hardware.

Often, compiling code meant to run on a GPU, even though it is in theory the same as compiling regular software, has its own special considerations. For example, this code often consists of only a few micro-instructions, over which great care must be taken to make sure that they run correctly on as many GPUs as possible. I.e., when we are coding a shader, KISS is often the main paradigm. And the possibility often crops up in practice that, even though the code is technically correct, it does not run correctly on a given GPU.

I do not really know how it is with SIMD coprocessors.

But this knowledge would be useful to have, in order to understand this posting of mine.

Of course, there exists a major contradiction to what I just wrote, in OpenCL and CUDA.

These are both platforms in which the GPU is not being used to create graphics output, but rather to perform computations that benefit from the high number of parallel cores that strong GPUs have to offer. OpenCL provides a C-like language for this (OpenCL C, which is based on C99), together with a compiler that translates this apparent C into GPU instructions. And CUDA goes so far as to support a large part of C++.

The C or C++ code run on either platform can become complex.

(Edit 04/09/2017 : Unlike how it is with certain other platforms, such as OpenMPI for example, OpenCL and CUDA are both platforms where the code is written as standard C, with some awareness on the part of the programmer of the fact that it is being submitted for parallel processing, but where the compiler actually controls this aspect. In the case of OpenMPI in contrast, explicit instructions need to be given by programmers, to cause parallel processing. )

But my main explanation for why this is possible, is the premise that both these compilers rely heavily on out-of-line code. I.e., whenever either compiler translates something into GPU language, that originated as a valid C syntax component, the compiler will either insert a fixed set of instructions that stand for the given C component, or actually compile it in the case of computational syntax. And this will happen at some stage after the apparent C has been optimized.

Those out-of-line code-blocks were hand-coded, very expertly, such as always to do what they are supposed to do.

And I know that this does not correspond to the standard definition of what out-of-line code is. By default, a C++ compiler will emit out-of-line code as a subroutine that gets called, which earns this code the name ‘out-of-line’. In OpenCL or CUDA, these pieces of code are actually inline.


One of the interesting facts about this, is that C-like for loops have a special meaning to the OpenCL or the CUDA compiler. We can still nest them, but then the innermost for loop will execute as parallel instances, invoked by the outer loop. And they will execute on separate cores of the GPU.

If this did not in fact happen, then there would be no advantage to programming with OpenCL or CUDA, because our C-like code would never get to exploit the parallel hardware. ( :2 )
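To make the picture above concrete, here is a minimal sketch in Python, with CPU threads standing in for GPU cores. It is only a model of the idea (a sequential outer loop whose inner iterations are handed out as independent tasks), not OpenCL or CUDA code, and the names `inner_body` and `run_nested` are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def inner_body(row, col):
    # Work of one inner-loop iteration; it depends only on its own indices.
    return row * 10 + col


def run_nested(rows, cols):
    result = []
    with ThreadPoolExecutor(max_workers=cols) as pool:
        for row in range(rows):  # outer loop: runs sequentially
            # inner loop: each iteration is submitted as an independent task
            futures = [pool.submit(inner_body, row, col) for col in range(cols)]
            # collect in index order, regardless of completion order
            result.append([f.result() for f in futures])
    return result
```

Because each task only depends on its own `(row, col)` pair, the order in which the tasks actually finish does not affect the result.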

More peculiarly, allocating a GPU core requires some participation of the CPU, to the best of my knowledge. A GPU core is not allowed to wake up another GPU core. And this means that when we run practical examples of OpenCL or CUDA code, there is a load on the CPU as well as the load on the GPU.

This depiction agrees well with the fact that the GPU possesses registers which are shared between cores belonging to the same core-group. But it raises the question of where the computational work belonging to the outer loop gets done:

  1. It could be that each compiler is capable of generating a CPU-instruction output, as easily as a GPU-instruction output, corresponding to the same C-language source-code.
  2. It could be that (some of) the computations of the outer loop are performed on the same core as one invocation of the inner loop (while other computations, which affect the behavior of the inner loop, would need to be performed on the core running the outer loop).

(Edit 04/09/2017 : Throughout I have been using the terminology, that one invocation of a loop includes its full set of iterations. And, once the program-flow is such that a loop is only nested 1 deep, there is no longer any reason to run its iterations in parallel. At that point all we need is a counter and a conditional branch. )

( … )

(Edit 04/09/2017 : As it happens, code that generates values which affect the behavior of the inner loop, is likely to occur before said loop in a program, and should be run on the core assigned to the outer loop, while code that post-processes the output of the inner loop, is likely to occur after said loop in the program, and should be run on the core assigned to the inner loop.

If instead the programmer placed code after the inner loop, that changes a variable, in a way meant to influence the next invocation of that loop, then that variable also had to be declared before the occurrence of the inner loop, according to what is legal in C. In this case, that variable is being shared, and being used as both an lvalue and an rvalue, in code that is running on a core assigned to the inner loop. In that case, all that I would expect of the compiler, is not to be able to vectorize the code.

However, upon reading some other postings on the subject, what I discover is that often, CUDA will vectorize the computation anyway, even if the code cannot safely be vectorized, and that this will cause severe malfunctions, thus requiring even more awareness on the part of the C or C++ programmer. Also, real-world GPU computing seems to regard only the actual for loop as the asynchronous subroutine. )

But in order to know for certain, I would actually need to study the subject formally.

AFAIK, it would pose a problem if each invocation of the inner loop were to read and write the same output-location (and if that location was also used by the outer loop). We would not even know in what order the invocations complete. As long as each inner-loop iteration writes to a different element of an array, it should still be fine. And if one invocation of the inner loop was merely followed by some computation (on its output) belonging to the same pass of the outer loop, this would need to take place on the same GPU core on which that inner-loop invocation was computed.
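The point about each iteration writing to a different array element can be illustrated with a small Python model. The unknown completion order is imitated by shuffling the iteration order; the function names and the use of a seed are assumptions made purely for this sketch.

```python
import random


def inner_iteration(out, i):
    # Each iteration writes only to its own element of the output array.
    out[i] = i * i


def run_in_random_order(n, seed):
    out = [None] * n
    order = list(range(n))
    random.Random(seed).shuffle(order)  # stands in for an unknown completion order
    for i in order:
        inner_iteration(out, i)
    return out
```

Whatever the seed, the returned array is the same, because no two iterations touch the same element. If every iteration wrote to one shared location instead, the final value would depend on which iteration happened to run last.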

(Edit 04/09/2017 : The inner loop should not use the shared version of a GPU-register representing a variable of the outer loop, as an rvalue, because it is bound to change from one pass of the outer loop to the next, yet synchronous execution of the inner loop is to be simulated. Instead, when the inner code-block and loop are first executed, a local copy of certain shared registers needs to be made.

It is already an established fact, that compilers in general will keep track, of whether a given block of code uses a specific variable as an lvalue or as an rvalue. It just needs to be recognized as special, if the block of code does both, and if that variable is shared. If so, then that block of code cannot be run asynchronously.

However, it seems like the CUDA compiler may just plow ahead and try anyway. )

Yet, the language C possesses operators that write changes to their left-side variable, such as += and *= , for which the order in which right-side expressions are applied does not matter, as long as they do not include the left-side variable again. In each case, either a summation or an overall multiplication of all the right-side values will result… ( :1 )

(Edit 04/09/2017 : )

What this thinking suggests is that the compiler optimizes the code very differently from how it is normally done when the object-code is supposed to run on the main CPU: The syntax is first parsed and tokenized, the resulting tokens are optimized with emphasis on trying to vectorize their execution, the tokens resulting from that are converted into static pieces of GPU-code, and no further optimization takes place after that.


There is a caveat to this which I might as well point out. Earlier in this posting I loosely wrote, that in order for a GPU-program to allocate another GPU-core, the CPU must be involved. This implies that a way exists for the GPU-program to signal to the CPU, that it wants something done, even though it would not be normal for the CPU to be reading the values of GPU registers (only setting them). ( :2 )

It would seem that I have a whole concept in my head about how this could be managed, which differs from what real-world GPU computing does. It seems that real-world GPU computing relies rather heavily on this shaky concept of the programmer declaring any number of shared-register variables, and then of inner and outer loops accessing them.

A method has existed since advanced shaders by which they can do something else: A shader is generally allowed to write output to a buffer which has been set up by the CPU, as belonging to the GPU-program, yet CPU-readable. This output-buffer does not necessarily need to contain a pretty image; it may in fact just contain numbers.

The problem with this – or the benefit, from a certain point of view – is that once the shader has done so, it has finished its execution-cycle. The GPU-program is not expected to be doing anything, after all its output-buffers have been written.

What this could mean for GPU computing, is that the asynchronous execution of code would need to be split in half, with each half consisting of a separate GPU-program. The first half would contain what must be run, before each invocation of our inner loop, while the second half would contain whatever needs to be run, after the outer loop has finished. And then the output-buffer of the first GPU program is also the ideal place to state the parameters passed in to the inner loop (or to define uninitialized rvalues).

Luckily, a given GPU-program does not need to be reloaded onto a core, in order to be triggered many times. This is consistent with how in 3D gaming, a model can be rendered at many FPS, without its shaders needing to be reloaded. And so it would follow that the first half of our GPU-program-pair can remain efficient, and also count how many times it has called the inner loop, effectively forming an outer loop. It just needs to be retriggered as many times as needed by the CPU.

But the CPU would need to set the parameters with which a third GPU-program representing an inner loop is to be run, every time that inner loop is invoked – even though its GPU-code does not need to be reloaded – taking them from the output-buffer of the outer loop. One reason for this is the fact that even a header of instructions belonging to this inner loop, that would copy the shared GPU-registers before doing anything else, might be too slow to execute, to beat changes which the outer loop is making to those parameters. More securely, the CPU can transfer those parameters to the registers of the inner loop, before retriggering the outer loop.

Unfortunately, I think that the CPU must reset all the Uniforms with which it runs a shader, each time, including whatever parameters our outer loop was first called with. And, it would then follow that the conditional part of the outer for loop, that decides whether it is to be repeated each time, is actually being observed by the CPU.
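The scheme described in the last few paragraphs can be sketched as a host-side loop. This is a purely hypothetical model in Python of the arrangement proposed in this posting, not a description of how OpenCL or CUDA actually work; `outer_program`, `inner_program` and `host_loop` are invented names, and dictionaries stand in for output buffers and Uniforms.

```python
def outer_program(state):
    # "First half": runs before each invocation of the inner loop and
    # writes the inner loop's parameters to its output buffer.
    state = dict(state)
    state["pass"] += 1
    return {"inner_params": {"scale": state["pass"]}, "state": state}


def inner_program(thread_id, uniforms):
    # One instance of the inner loop body, parameterised by Uniforms.
    return thread_id * uniforms["scale"]


def host_loop(num_passes, num_threads):
    state = {"pass": 0}
    results = []
    while state["pass"] < num_passes:       # loop condition observed by the CPU
        out_buf = outer_program(state)      # CPU retriggers the outer program
        state = out_buf["state"]
        uniforms = out_buf["inner_params"]  # CPU copies parameters into Uniforms
        results.append([inner_program(t, uniforms) for t in range(num_threads)])
    return results
```

In this model the outer program never talks to the inner program directly; every hand-over goes through the CPU, which is the point of the arrangement.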


It is my assumption that although registers exist on the GPU, that are shared between cores, those registers are only meant to be accessed by GPU-programs, not by the CPU. In theory, there may be ways for the CPU to access them, but attempting this strikes me as improper coding.

Those would be an ideal place to store variables, which have been declared inside a loop or a subroutine, compiled to run on the GPU, for use as return-values.

Actually what I am learning is, that the C or C++ programmer using OpenCL or CUDA, must in fact state in his source code, whether he wants to allocate the variable as a local or a shared register.


1: ) Many mundane examples of loops perform some type of mathematical operations on an output variable, that are of the summation and subtraction family. For example, to compute the average of many values can be seen as to apply the operators += and ++ to certain values within a loop, to form a sum, and a count of iterations that pass a certain test. After the loop has finished, the sum only needs to be divided by the count once, to find the average.
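As a minimal sketch of the averaging example, the following Python function accumulates a sum and a count with `+=` and divides once at the end. The iteration order is shuffled on purpose to show that it does not matter; the function name and the `seed` parameter are assumptions made for illustration.

```python
import random


def average_of_passing(values, passes_test, seed=0):
    order = list(values)
    random.Random(seed).shuffle(order)  # the order of accumulation is arbitrary
    total = 0
    count = 0
    for v in order:
        if passes_test(v):
            total += v   # summation
            count += 1   # count of iterations that pass the test
    # divide once, after the loop has finished
    return total / count if count else None
```

With integer inputs the result is exactly the same for every ordering. With floating-point inputs, different orderings can differ in the last few bits because of rounding.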

Similarly, there is a way to format the computation of Correlation and Standard Deviation, again just based on averaging per-iteration values.
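For Standard Deviation, one such formulation (assuming the population form) uses the identity that the variance equals the mean of the squares minus the square of the mean. The sketch below only needs three accumulators updated with `+=`; the function name is invented for illustration.

```python
import math


def std_dev_from_sums(values):
    n = 0
    total = 0.0
    total_sq = 0.0
    for v in values:
        n += 1
        total += v          # running sum of the values
        total_sq += v * v   # running sum of their squares
    mean = total / n
    # population variance = mean of squares minus square of mean
    return math.sqrt(total_sq / n - mean * mean)
```

Correlation can be arranged the same way by also accumulating the sum of the products of the two variables. This one-pass form can lose precision when the mean is large compared to the spread, so it is a sketch of the idea rather than a numerically robust routine.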

But, the availability of these operators to the compiled C-like language, actually depends on the existence of accumulator-registers at the machine-level. Even the GPU executes machine instructions, and machine instructions often take this form,

  1. Form a mathematical result of the accumulator-register with another, supplied register,
  2. Store the result back into the accumulator.

The catch here is that this must exist as one machine-instruction, even when applied to a shared register. It is acceptable if several cores attempt to perform this operation simultaneously and one or more of them are slowed down as a result, as long as the resulting value in the accumulator remains accurate. It is useless if the operation is not available at the machine-level, because any attempts at the code-level to read that register, and then to write data back to it, would fail. There is certainly no way to lock a shared register at the GPU-core level, to be available only to one core.
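The requirement of an indivisible read-modify-write can be modelled on the CPU as follows. In this Python sketch a `threading.Lock` merely stands in for the guarantee that the hardware instruction would provide (CUDA exposes operations of this kind as `atomicAdd`, for example); the class and function names are invented for illustration.

```python
import threading


class AtomicAccumulator:
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()  # models the indivisibility of one instruction

    def add(self, amount):
        with self._lock:  # read, add and write back as a single step
            self._value += amount

    @property
    def value(self):
        return self._value


def parallel_sum(values, num_threads=4):
    acc = AtomicAccumulator()

    def worker(tid):
        # each thread accumulates its own strided share of the values
        for v in values[tid::num_threads]:
            acc.add(v)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return acc.value
```

Threads that arrive at the accumulator at the same time are briefly slowed down, but the final sum is accurate.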

Can I be sure that the corresponding, accumulator-operation exists, for multiplication? No. But if it does not, then one possible workaround would be, to compute the logarithm per-iteration, and to perform a summation of those. After the loop has finished, the corresponding exponent-function can be computed (once). And this would be an example of the C-programmer being aware, of the fact that his code is supposed to run multi-core.
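The logarithm workaround can be sketched in a few lines of Python. It assumes that every value is strictly positive, since the logarithm is not defined otherwise, and the function name is invented for illustration.

```python
import math


def product_via_logs(values):
    log_sum = 0.0
    for v in values:
        log_sum += math.log(v)  # per-iteration logarithm, accumulated with +=
    return math.exp(log_sum)    # exponent-function computed once, after the loop
```

The result is subject to floating-point rounding, so it should be compared against the exact product with a tolerance rather than for equality.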

Can we determine whether the number of iterations in which a value is negative, is odd or even? Certainly:

  1. Determine whether each iteration meets an arbitrary condition.
  2. If it does, count it with ++ .
  3. After the loop is finished, compute mod2(sum) once.
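The steps above can be sketched in Python. As one possible use (an assumption made here, not something stated above), the parity of the count of negative values supplies the sign that a product computed through logarithms of absolute values would otherwise lose. The function name is invented, and no value may be zero.

```python
import math


def signed_product(values):
    negatives = 0
    log_sum = 0.0
    for v in values:
        if v < 0:                    # step 1: test the condition
            negatives += 1           # step 2: count it
        log_sum += math.log(abs(v))  # magnitude via summed logarithms
    magnitude = math.exp(log_sum)
    # step 3: take the count modulo 2 once, after the loop
    return -magnitude if negatives % 2 == 1 else magnitude
```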

Actually, one limitation that I was aware of, was the fact that this fast access to a shared register only exists within one core-group – aka within one block. However, another fact which I was not aware of, was that if a summation is required across multiple core-groups, there is a way for CUDA users to code around that limitation.
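One common way to picture such a work-around is a two-stage summation: each block first forms its own partial sum, and the partial sums are then combined in a second step. The Python sketch below models only the arithmetic, not any particular CUDA technique, and its function names are invented for illustration.

```python
def block_partial_sums(values, block_size):
    partials = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        partials.append(sum(block))  # models a summation within one block
    return partials


def grid_sum(values, block_size=4):
    # second stage: combine the per-block results
    return sum(block_partial_sums(values, block_size))
```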

And there can in fact sometimes be ugly hacks that allow arbitrary locked operations on shared registers.

2: ) By now I have learned, that real-world GPU computing not only has the concepts of threads and blocks, the latter meaning core-groups, but additionally the concept of warps. The concept of a warp seems to be, that instead of launching a piece of source-code – a kernel – to run on one GPU-core at a time, that same piece of code can be sent by the CPU, to start running on multiple cores at the same time – thus forming a warp.

In return for this, its own ID information is made available to each thread, to use in its calculations. That way, if 10 threads were meant to write their output to 1 array-element each, which element they are computing can be made to follow from the thread ID, and can therefore also be used to index the array.
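Here is a minimal sketch of that idea in Python, with CPU threads standing in for GPU threads. The same function is started once per thread, and the only thing that differs between the instances is the ID each one receives. The names `double_kernel` and `launch_threads` are invented for illustration.

```python
import threading


def double_kernel(thread_id, inputs, outputs):
    # The thread ID decides which array element this instance computes.
    outputs[thread_id] = inputs[thread_id] * 2


def launch_threads(inputs):
    outputs = [None] * len(inputs)
    threads = [
        threading.Thread(target=double_kernel, args=(tid, inputs, outputs))
        for tid in range(len(inputs))
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outputs
```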

(Edit 04/10/2017 : )

I think that the main way in which real-world GPU computing differed from my personal, optimized suggestion, is merely in the fact that code belonging to an outer loop, which follows the invocation of an inner loop, is still to be run on the core assigned to run the outer loop.

From a certain perspective, any subroutine which does not produce a return value per se, is a candidate for being run asynchronously, and a for loop just seems to be a special case of that. What does not seem to change, is the fact that source code running on the core assigned to run the outer loop, needs to decide at what point in its computations the asynchronous subroutine should be called, while to activate another GPU core remains the responsibility of the CPU.

Therefore, the thinking still strikes me as plausible, that the compiler will split the code in the outer loop in half if needed, and wait for the half that precedes the invocation of the inner loop, to finish, before running the inner loop.

But, practical examples on the Web on how to use CUDA, do not usually explain how the actual compiler works.


I suppose that an alternative approach might also make sense, according to which the outer loop simply performs a synchronization instruction that carries a special parameter, making it a system-defined sync as opposed to a user-defined sync. This capability would also enable the conditional execution of the asynchronous subroutine.

(Edit 04/14/2017 : In such a case, the convention could exist by which one output buffer has the maximum size required between all output buffers. And the CPU could examine one value in the output buffer, to interpret what any given thread has asked it to do. In order for this to work however, the ability must additionally exist, for the CPU only to resume specific, synced threads, not all the synced threads. Alternatively, the onus might be on the coder, not to over-allocate cores. )

The documentation for the ‘__syncthreads()’ command may be failing to point out, that before all the threads are allowed to resume running, which have called this function from the GPU, some code on the CPU might get to run first.

Also, some people might find it to be a mystery, as to why GPU computing can write results to a large array – which is quite common. The way to visualize this, is that large or arbitrary arrays are represented on the GPU as 2D or 3D images, where one element corresponds to one pixel. They represent an additional output-buffer for a kernel to complete. The formats of the hypothetical pixels can be adjusted, so that these images accept the format of numbers that the kernel wants as elements.
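Under this way of visualizing things, a linear array index and a pixel coordinate are related by simple arithmetic. The sketch below assumes a row-major layout with a fixed image width; both function names are invented for illustration.

```python
def index_to_pixel(i, width):
    # element i of the array sits at column x, row y of the image
    return (i % width, i // width)


def pixel_to_index(x, y, width):
    return y * width + x
```

For example, with an image 4 pixels wide, array element 10 would sit at pixel (2, 2).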

It represents a challenge, however, to set up these output buffers correctly. This challenge has been solved in the design of the compiler. The main problem I see is that each thread represents a fragment shader, running for one fragment and outputting one pixel. When potentially many fragment shaders are doing this, we do not want a performance penalty. And so this must be set up (by the compiler) as if it was a bucket-rendering, to one pixel at a time. I think that the details of doing so go beyond my personal expertise.

But then to read elements from an array is a question of sampling the same buffer as if it was a texture image, again while trying to avoid a performance penalty that could easily arise, just because many threads could be doing so at the same time.

Yet, it would become inefficient if the source-code asked for a subroutine to be invoked asynchronously which only needs to be fed 3 parameters, and if that subroutine was always set up to receive those as a texture image with 3 texels, directly from the calling function. Doing so would spare the CPU from passing the parameters through, but would ultimately result in way too many texture-images being allocated and sampled. In such a case it makes more sense to my mind that the CPU be involved, and that it pass in whatever parameters as Uniforms.

Also, I expect that the CPU is really only signaled by the fact that the warp it has sent out to the GPU has finished running – i.e. that each GPU-program has reached the end of its code. And at that time, the expectation is that all the output buffers have been written, that the GPU-program was supposed to write. And this observation was also my initial reason for thinking that the compiler maybe splits the source code into parts, when translating it into GPU-instructions.

What some programmers seem to have missed, according to what I have seen on bulletin boards, is the fact that the CPU has sent out a warp to numerous GPU-cores – i.e. sent the same code many times – that had as purpose to assign enough cores, to solve the computing problem with one warp. Those programmers have then sometimes, written for loops inside that code, erroneously thinking that they needed to do so in order to perform the required set of computations. What some examples seem to suggest, is that to make use of the Thread ID in each thread would have solved their problem, together with writing a while loop.

So in those cases, each thread belonging to the warp started to run nested for loops, that should not have been needed.
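The pattern those examples seem to suggest can be sketched as follows: each thread starts at its own ID and steps forward by the total number of threads, using a while loop, so that the threads together cover the array exactly once. This is a sequential Python model, and the names `stride_kernel` and `launch_strided` are invented for illustration.

```python
def stride_kernel(thread_id, num_threads, data, out):
    i = thread_id           # start at this thread's own element
    while i < len(data):
        out[i] = data[i] + 1
        i += num_threads    # skip the elements that belong to the other threads


def launch_strided(data, num_threads):
    out = [None] * len(data)
    for tid in range(num_threads):  # sequential stand-in for a parallel launch
        stride_kernel(tid, num_threads, data, out)
    return out
```

No thread needs a nested for loop over the whole problem; the work is split purely by thread ID, and the while loop only matters when there are more elements than threads.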

I only learned what a warp was yesterday. I have never written C designed to be run on a GPU. But I would assume that if I was to take courses in the subject, somebody else would have informed me about the existence of warps.

(Edit 04/15/2017 : )

I should elaborate on why I find it difficult to picture how the sync command can allow a GPU core to request that additional threads be started and called asynchronously.

By default, the sync command will cause the core that gave it, to stop running, until all the cores in the core-group have also given the sync command and stopped. It does not by default wait until all the cores belonging to one warp have given it, nor until all the cores have given a sync command of the same type.
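The default behavior of the sync command within one core-group can be modelled on the CPU with `threading.Barrier`. In this Python sketch, every thread writes its own slot of a shared array, waits at the barrier, and only then reads a slot that a different thread wrote. The function name and the choice of computation are assumptions made for illustration.

```python
import threading


def run_block(num_threads):
    shared = [0] * num_threads
    result = [0] * num_threads
    barrier = threading.Barrier(num_threads)

    def thread_body(tid):
        shared[tid] = tid * tid          # phase 1: write own slot
        barrier.wait()                   # stop until every thread in the group has arrived
        neighbour = (tid + 1) % num_threads
        result[tid] = shared[neighbour]  # phase 2: read a slot written by another thread

    threads = [threading.Thread(target=thread_body, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

Without the barrier, a thread could read its neighbour's slot before that neighbour had written it.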

The reasoning could exist by which all the threads of a kernel will eventually give a sync command – thereby sidestepping the first problem.

The CPU can scan all the synced threads to see which of them have asked that a new core be allocated, in order for something to be run asynchronously. But the next problem would be, that there may not be any unallocated cores available, because they are all running threads – but synced.

In that case it would be most-useful if the CPU could force the threads to wait, which have asked for this, and tell all the others to resume running. But I have no indication that this ability exists.

Now, one way in which this problem could be handled – the most likely way according to what I have read about CUDA – is that it could be the responsibility of the GPU code, not to request more cores than the GPU has. This is plausible, because if variables are fed in that take into account how many cores the GPU has, and if GPU code also takes into consideration what the size of one warp was, intelligent programming can avoid the problem of more cores being allocated, than the GPU has.

Another way in which this problem could be solved, would be that every command by the GPU to launch an asynchronous additional thread, could be wrapped by code which checks for a return-code from the CPU, potentially telling it that to launch another thread was unsuccessful. And then, GPU-code which is invisible to the programmer could simply repeat, to sync the present thread, as many times as is required, until the additional thread was successfully launched.

I think that this second solution is less likely to be in place, simply because the Engineers behind CUDA have traditionally kept all these aspects of how the code works in the hands of CUDA-code-programmers.

If I had ever done any coding for CUDA, I would know for certain.

