A project which has fixated me for several days has been to custom-compile the API named “ArrayFire”, specifically version 3.5.1, with CUDA support enabled. This poses a special problem because, when we use the Debian / Stretch repositories to install the CUDA Run-Time, we are limited to installing version 8.0.44. This run-time is too old to be compatible with the standard GCC / CPP / C++ compiler-set available under the same distribution of Linux. Therefore, it can be hard for users and Debian Maintainers alike to build projects that use CUDA and that are compatible with Stretch.
Why use ArrayFire? Because writing kernels for parallel computing on the GPU is hard. ArrayFire is supposed to make the idea more accessible to people like me, who have no specific training in writing highly parallel code.
When we install ArrayFire from the repositories, OpenCL support is provided, but not CUDA support.
I’ve hatched a plan by which I can install an alternative compiler set, GCC / CPP / C++ version 4.9.2, on the computer I name ‘Phosphene’, and switch my compilers to that version, so that maybe I can compile future projects that do use CUDA, and that are sophisticated enough to require a full compiler-suite to build.
What I’ve found was that, contrary to how it went the previous time, this time I’ve scored a success. Not only did the project compile without errors, but a specific Demo that ships with the source-code version, which uses the ability of CUDA to pass graphics to the OpenGL rendering of the GPU, also ran without errors…
So now I can be sure that the tool-chain which I’ve installed is up to the task of compiling highly complex CUDA projects.
(As of 5/02/2019, 13h05 : )
When I was running the Demos with graphics belonging to ArrayFire 3.7.0, I noticed a slow frame-rate. But now, with this build of ArrayFire, v3.5.1, what I notice instead is that, during the initial few seconds of OpenGL-based graphics output, the display appears frozen, while after that initial period, the frame-rate seems high and smooth.
This happens both when I’m running the CUDA-based, as well as when I’m running the OpenCL-based versions of each Demo. It represents a successful result.
In GPU computing, it’s normal that setting up a vector-processing pipeline incurs an initially high amount of overhead, in CPU time and other time, but that once the pipeline has been initialized, its computing speed on the GPU is much higher than CPU-computing speeds would be. This is also why the primitive “helloworld” example cannot be made to run efficiently: It sets up a trivially small amount of GPU computing to be performed, using a non-trivial amount of CPU time, as well as whatever amount of time is required in any case for all GPU-processes to sync. (:1)
A possible reason for the improvement in graphics output could be the fact that, with ArrayFire 3.7.0, I had compiled Forge 1.0.4 separately, and its project configuration could only accept CUDA parameters, while with ArrayFire v3.5.1, Forge 1.0.2 was compiled in an integrated way, which means that the project parameters stated both CUDA and OpenCL libraries and header files, potentially for both compiled libraries to link to…
I may also mention the fact that, the way I used ‘cmake-gui’ to build ArrayFire 3.5.1, the option ‘CUDA_USE_STATIC_CUDA_RUNTIME’ was checked by default, and I left it checked. What this option means is that the compilation of the project links to static CUDA RT libraries, presently of v8.0.44. And what it should also mean is that, if Debian ever updates their shipped CUDA RT to something higher, I should not need to recompile my project.
Effectively, the libraries which I just compiled contain their own CUDA RT, just like the binary installer I had complained about earlier did. Only, in my case, it’s CUDA RT 8.0, not CUDA 10.
(Update 5/02/2019, 18h50 : )
The observed, momentary freeze of the display can be easy to explain.
The GPU cores are organized into ‘blocks’, and one of the operations which all the threads can execute, that could belong to some vector-computation, is to ‘Wait, until all the threads in the same block have encountered the same operation.’
If the vector-computation were all there was, one might never notice that this instruction was given. But in the example above, output is simultaneously being sent as a graphic. One assumption that I would make, on a GPU that possesses 7 core-groups, is that the core-group required to run graphics output as part of the ‘desktop compositing’ is generally kept separate from the core-group intended to perform a vector-computation.
One small piece of speculation which I’ve stated in my blog before was that, once all the threads running on a GPU block have reached such a checkpoint, some involvement of the CPU is required to set them all back in motion.
Well, in this case, what may be required is that the block generating output to the screen may also have to wait, until all the threads in the vector-computing block have been synced.
Even though the PC hasn’t frozen, the mere fact that the appearance of the screen has, may make it look so. Further, this hypothesis highlights a vulnerability in the system: If one thread simply fails to reach such a checkpoint, the display-output may remain frozen, and the whole session could therefore end up locked, requiring a hard boot and a subsequent file-system check. Yet, the actual vector-computation can be defined in user-space. In that regard, I hope that reliable GPU-code has been incorporated into my ArrayFire libraries, which are accessible from user-space.
(Update 5/03/2019, 7h30 : )
I have observed this initial freezing of the display output, from Forge, more closely. It appears that when I run the CUDA-based Demo with graphical output, only the display window of this one application initially seems frozen, for approximately one second, not the entire desktop. I.e., the ‘gkrellm’ widget and desktop compositing continue to animate. What this suggests is that whatever synchronization is required is limited to GPU threads belonging to the one application window, and not to the entire desktop session.
A logical way to organize GPU computing might be that more than one (logical) block may be allowed to run on one (physical) GPU core group simultaneously, as long as the number of threads doesn’t exceed the number of cores in the core group. That way, threads belonging to one block could need to wait until all the others reach the same checkpoint, but without interfering with threads running on the same core group that belong to a different block.
One of the ‘facts’ which I seem to have observed before is that the GPU core’s instruction set is so simplified that there is really only one ‘syncthreads’ instruction at the hardware level. But a more advanced API can seem to offer variants of it. When these variants are examined closely, it can be seen that they can all be implemented by leaving certain values in special GPU registers, and/or executing the hardware instruction conditionally, and by the CPU examining those registers before allowing the synced threads to resume execution.
The thought has also occurred to me that, when the GPU is left to perform hardware graphics-rendering, and if the GPU in question is at least capable of the Unified Shader Model, as all OpenGL 3 -capable GPUs are, each stage of the rendering pipeline could also be running as one block. Invisibly to the shader-programmer, each GPU program could contain a ‘syncthreads’ instruction, which causes each stage of the pipeline to sync once per output frame-cycle. Thus, one specific rendering pipeline can be made to hang, pending the completion of such a synchronization.