One way in which my earlier description of CUDA was out of touch with the real-world implementation.

One of the subjects which many programmers have been studying is not only how to write highly parallel code, but how to write that code for the GPU, since the GPU is also the most readily available highly parallel processor. In fact, any user with a powerful graphics card may already have the basis to program using CUDA or OpenCL.

I had written an earlier posting, in which I ended up trying to devise a way by which the compiler of the synthesized C or C++ would detect whether each variable is being used as an ‘rvalue’ or as an ‘lvalue’ in different parts of a loop, and by which the compiler would then choose to allocate a local register, to allocate a shared register, or to make a local copy of a value once provided in a shared register.

According to what I think I’ve learned, this thinking was erroneous, simply because a CUDA or an OpenCL compiler does not take this responsibility off the hands of the coder. In other words, the coder needs to declare explicitly, each time, whether a variable is private to the current thread or shared within the block (in CUDA, with the ‘__shared__’ qualifier, which places the variable in on-chip shared memory rather than in a register proper). The coder must also keep track of how threads other than the current thread can change a shared value, which may produce errors in how the current thread computes.
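
What follows is only a minimal sketch of what such explicit declarations look like, assuming a single block of at most 256 threads. The names ‘scaleViaShared’, ‘tile’, ‘MAX_THREADS’ and the host helper ‘runScale’ are my own inventions for illustration, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define MAX_THREADS 256  // assumed upper bound on the block size

__global__ void scaleViaShared(const int *in, int *out, int factor)
{
    int mine = in[threadIdx.x];          // ordinary variable: private to this thread
    __shared__ int tile[MAX_THREADS];    // one copy, visible to the whole block

    // Each thread touches only its own slot, so no other thread can
    // change the value between the write and the read below.
    tile[threadIdx.x] = mine * factor;
    out[threadIdx.x] = tile[threadIdx.x];
}

// Hypothetical host helper: runs one block of n threads (n <= MAX_THREADS).
// Return codes of the CUDA calls are not checked in this sketch.
std::vector<int> runScale(const std::vector<int> &host, int factor)
{
    int n = static_cast<int>(host.size());
    int *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(int));
    cudaMemcpy(dIn, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    scaleViaShared<<<1, n>>>(dIn, dOut, factor);

    std::vector<int> result(n);
    // This copy waits for the kernel to finish before reading the output.
    cudaMemcpy(result.data(), dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

The sketch deliberately avoids having one thread read a slot written by another, because that would require coordination between the threads.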

But a command which CUDA offers, and which needs to exist, is the ‘__syncthreads()’ function, which suspends the current thread until all the threads running in one core-group (one thread block) have executed the ‘__syncthreads()’ instruction, after which point they may all resume.
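
As a hedged illustration of why this barrier needs to exist, the sketch below reverses a small array within one block. Each thread reads a slot that a different thread wrote, so every write has to be complete before any read starts. The kernel name ‘reverseInBlock’ and the host helper ‘runReverse’ are hypothetical, and I assume one block of at most 256 threads.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define MAX_THREADS 256  // assumed upper bound on the block size

__global__ void reverseInBlock(int *data, int n)
{
    __shared__ int tile[MAX_THREADS];
    int i = threadIdx.x;

    tile[i] = data[i];            // every thread writes its own slot

    // Without this barrier, a thread could read tile[n - 1 - i]
    // before the thread responsible for that slot has written it.
    __syncthreads();

    data[i] = tile[n - 1 - i];    // read a slot written by another thread
}

// Hypothetical host helper: launches exactly n threads in one block,
// so that every thread of the block reaches the barrier.
std::vector<int> runReverse(const std::vector<int> &host)
{
    int n = static_cast<int>(host.size());
    int *dData = nullptr;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMemcpy(dData, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    reverseInBlock<<<1, n>>>(dData, n);

    std::vector<int> result(n);
    cudaMemcpy(result.data(), dData, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    return result;
}
```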

One fact which disappoints about the real ‘__syncthreads()’ instruction is that it offers little in the way of added capabilities. One thing which I had written that this function may do, however, is actually to give the CPU a chance to run briefly, in a way not obvious to CUDA code.

But then there exist capabilities which a CUDA or an OpenCL programmer might want, which have no direct support from the GPU. One of those capabilities might be to lock an arbitrary object, so that the current thread can perform some computation which reads the object after having obtained a lock on it, and which then writes changes to the object before giving up its lock.

(Updated 04/19/2018 : )

(As of 04/16/2018 : )

Complex schemes have been suggested, by which data computed by one thread can be broadcast to other threads. But the main strategy by which threads can decide to lock an object has to do with setting aside a shared variable, to which all the contending threads simply write their thread-ID, after which each thread executes the ‘__syncthreads()’ instruction. After all the contending threads have done so, each thread can read the value of this variable and compare it with its own thread-ID. If the value read back equals the thread-ID of the current thread, then the current thread is agreed to have a lock on the object which this variable manages.

If the thread-ID read back from this shared variable does not equal that of the current thread, then the current thread can conclude that it does not have a lock, and may repeat the attempt to obtain one, after which, of course, it must execute another ‘__syncthreads()’ instruction.

Assuming that each thread which obtains the lock performs whatever computations it needs on the locked object, and then executes one final ‘__syncthreads()’ instruction, this should lead to each contending thread eventually obtaining a lock on the object in question.
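
The scheme described above could be sketched roughly as follows. This is only my own illustration, not an established recipe. It assumes a single thread block, and it assumes that when several threads store to the same 32-bit shared word, exactly one of the stored values survives. The names ‘takeTurns’, ‘owner’, ‘log’ and ‘runTakeTurns’ are hypothetical. To keep every thread calling the barrier the same number of times, the loop condition uses ‘__syncthreads_count()’, which acts as a barrier and also reports how many threads still want the lock.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Every thread of one block appends its ID to a log, one thread at a time.
__global__ void takeTurns(int *log, int *logLen)
{
    __shared__ int owner;   // the shared variable which manages the lock
    bool done = false;      // private flag: has this thread had its turn?

    // Barrier plus count: the result is the same for every thread,
    // so all threads leave the loop together.
    while (__syncthreads_count(!done) > 0) {
        if (!done) {
            owner = threadIdx.x;   // contenders race; one write survives (assumed)
        }
        __syncthreads();           // all writes land before anyone reads

        if (!done && owner == threadIdx.x) {
            // Critical section: only the elected thread gets here.
            log[*logLen] = threadIdx.x;
            *logLen = *logLen + 1;
            done = true;           // give up the lock for good
        }
        // The barrier in the loop condition ends this round.
    }
}

// Hypothetical host helper: one block of n threads, returns the log.
std::vector<int> runTakeTurns(int n)
{
    int *dLog = nullptr, *dLen = nullptr;
    cudaMalloc(&dLog, n * sizeof(int));
    cudaMalloc(&dLen, sizeof(int));
    cudaMemset(dLen, 0, sizeof(int));

    takeTurns<<<1, n>>>(dLog, dLen);

    std::vector<int> log(n);
    cudaMemcpy(log.data(), dLog, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dLog);
    cudaFree(dLen);
    return log;
}
```

With n threads, this takes n rounds of two barriers each, which makes the cost of the scheme rather visible. A stricter variant might write ‘owner’ with ‘atomicExch()’ instead of a plain store.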

But this sort of example illustrates how little the actual compiler, or the GPU, supports what the programmer is trying to do. Further, it should be clear that a considerable performance loss takes place if such a scheme is implemented, as compared to how fast computation continues when no thread requires a lock on some object. And so programmers will try to avoid writing code that requires each thread to obtain a lock on an object before they can all continue computing…

(Edit 04/18/2018 : )

I suppose that one situation which could eventually stop a CUDA kernel from functioning would be if numerous threads were to have invoked the ‘__syncthreads()’ function, but one or more threads never did so. The threads which called this function would continue to wait until all the threads in the so-called ‘thread-block’ (i.e., running on the same core-group of the GPU) also did so, which maybe not all the threads would do.

  • One way in which the coder could avoid such a problem would be to make sure that each of his threads does call this function periodically, perhaps even inside some loop which his threads are running.
  • Another way in which the coder might try to make up for this inadequacy could be to make sure that each of the threads will have executed the ‘__syncthreads()’ function a fixed number of times, at certain positions in his loop. In the example I gave above, the earliest threads to obtain a lock on the object would have called this function the fewest times, compared with the other threads trying to accomplish the same task. But if the coder were to compensate for this by calling the ‘__syncthreads()’ function additional times after having obtained and relinquished the lock, to make the total number of calls equal for all the threads, then a plausible argument against this might be, ‘Why slow down some of the threads? If at some point all the threads call this function again periodically, we might as well have all the cores perform the maximum amount of work, on different parts of our loop.’
  • Another way to state this problem would be that all the threads might have resumed from a ‘__syncthreads()’ call, but not in the same part of the kernel, if the kernel contains this call in different places. In that case, the need might additionally arise to make sure that all the threads are at the same point in the kernel for one specific call. As far as I can tell, this can be accomplished using the ‘__syncthreads_and()’ function, which CUDA offers on devices of compute capability 2.x and higher, and which returns non-zero only if the argument passed by every thread in the block is non-zero. Each thread would pass a flag stating whether it has reached the point in question, and would repeat the call until it returns non-zero…
  • Finally, the responsibility of the infrastructure would be for each thread’s call of this function NOT to have to wait on threads which have exited or terminated in the meantime. If the infrastructure can make sure of that, then certain details also become less important, such as whether each thread will call this function again at the end of its loop, prior to exiting…
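
One way to picture the points above is a loop in which the threads have different amounts of work to do, yet each thread still reaches the barrier on every iteration, and none leaves until all agree that they are finished. The following is only a sketch under my own assumptions: one thread block, non-negative step counts, and the made-up names ‘unevenWork’ and ‘runUnevenWork’.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Thread i performs steps[i] units of work, but every thread calls the
// barrier the same number of times, so none is left waiting forever.
__global__ void unevenWork(const int *steps, int *out)
{
    int remaining = steps[threadIdx.x];
    int acc = 0;
    int finished;

    do {
        if (remaining > 0) {       // threads with work left do one unit
            acc += remaining;
            --remaining;
        }
        finished = (remaining <= 0);
        // Barrier plus vote: non-zero only once every thread is finished.
    } while (!__syncthreads_and(finished));

    out[threadIdx.x] = acc;
}

// Hypothetical host helper: one block, one thread per entry of 'steps'.
std::vector<int> runUnevenWork(const std::vector<int> &steps)
{
    int n = static_cast<int>(steps.size());
    int *dSteps = nullptr, *dOut = nullptr;
    cudaMalloc(&dSteps, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(int));
    cudaMemcpy(dSteps, steps.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    unevenWork<<<1, n>>>(dSteps, dOut);

    std::vector<int> result(n);
    cudaMemcpy(result.data(), dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dSteps);
    cudaFree(dOut);
    return result;
}
```

The threads which finish early simply idle at the barrier on each remaining iteration, which is the slow-down that the second point above complains about.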




2 thoughts on “One way, in which my earlier description of CUDA was out of touch, with the real-world implementation.”

  1. You actually make it appear really easy with your presentation, however I find this topic to be really one thing that I feel I’d by no means understand. It kind of feels too complex and extremely huge for me. I am looking forward to your subsequent post, I’ll try to get the grasp of it!

    1. Dear Boris,

      I can empathize with you, because in reality such subjects as GPU-computing are vast and complex. My own main fear is only that some people may not understand that the GPU is in fact hardware, and that therefore, GPU-computing truly exists. If I could convince everybody of that, I’d already be happy.
      But some of your confusion about the details of GPU-computing may simply stem from the possibility that my postings contain technical errors, especially since I do not have a machine set up in front of me for GPU-computing.

      In any case, I wish you good progress, in your quest for the detailed answers.

