In this earlier posting, I described how I had replaced the open-source ‘Nouveau’ drivers for my ‘nVidia’ graphics card with nVidia’s proprietary drivers. One of my goals was to enable ‘OpenCL’ as well as ‘CUDA’ capabilities, both of which are vehicles towards ‘GPU-computing’.
In order to test my new setup, I had subscribed to some ‘BOINC Projects’, some of which in turn used OpenCL to power GPU Work Units.
The way in which I was setting up my computer ‘Plato’, on which all of this was to happen, was that I’d be able to use that computer, among other things, to run OpenGL applications and play 3D games during the day, but that at night-time hours, while I was in bed, the computer would fetch BOINC Work Units and run them – partially, on my GPU.
(Updated 05/04/2018, 13h50 … )
What was already observed:
When my BOINC-client starts up, it first consumes some GPU power, in order to reserve resources on the GPU for those work-units. Then, when the signal is given to actually run the work-units, even more Graphics RAM is allocated, and actual GPU Cores are allocated to run them. Shutting down or restarting the BOINC-client could lead to corruption, because doing so creates a lack of synchronization between the GPU Work Units as they exist in system RAM and on the CPU, and what they’re supposed to have allocated on the GPU. Rebooting the computer can resolve that issue.
What was expected to happen:
When I merely run an OpenGL application / 3D game, the OpenCL drivers are supposed to keep the resources which BOINC has reserved on the GPU segregated from the resources taken just to perform 3D rendering. That way, once the 3D gaming has stopped, during the night-time scheduling-period, the GPU Work Units should be able to run again, without being compromised.
What really happens:
If I did run certain 3D games – actually, only “Summoning Wars”, apparently – this prevents the BOINC GPU Work Units from running again. Instead, after running for 19 seconds, those report some sort of internal computing error, and not even a project result which needs to be validated.
An added observation about this would be, that starting from the very first work-unit run after the daytime session, they all produced compute errors. If the compute errors were due, for example, to an overheated GPU, then I wouldn’t expect this exact onset of errors, directly after the one day on which I played “Summoning Wars”. Instead, I would expect that some number of work-units would be reported Okay, after which the GPU would have become overheated, and therefore, after which further work-units would error out.
Also, nobody – such as the Project Admins – had decided to change the ‘GPU Utilization Factor’, which could otherwise have caused overheating.
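(As an aside, the ‘GPU Utilization Factor’ is a setting that the Project defines. A client-side counterpart exists, in the form of an ‘app_config.xml’ file placed in the project’s data directory; the application name in this sketch is purely hypothetical:)

```xml
<!-- Sketch of a client-side app_config.xml, placed in a BOINC project's
     data directory. The application name 'example_app' is hypothetical. -->
<app_config>
  <app>
    <name>example_app</name>
    <gpu_versions>
      <!-- Fraction of one GPU that each task claims;
           0.5 allows two GPU tasks to share the card at once. -->
      <gpu_usage>0.5</gpu_usage>
      <!-- Fraction of one CPU core budgeted per GPU task. -->
      <cpu_usage>0.2</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```

(Raising ‘gpu_usage’ beyond what the card can dissipate thermally is one way such a setting could, in principle, cause overheating.)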
The Conclusion for me:
I could either use this computer for BOINC Computing, or for daytime play, but not both. Having played a 3D game effectively requires that I reboot the computer next, before BOINC Computing can resume. This is not feasible for my lifestyle.
In reality, I may or may not be able to run OpenCL kernels from someplace else while not gaming – we’ll have to wait and see about that – but I cannot keep resources on the GPU reserved for BOINC in the background, while I do 3D Gaming.
(Update 05/04/2018, 13h50 : )
During a typical night when my computer was scheduled to run BOINC Work Units, the temperature of its GPU would climb to about 70°C. Under those conditions, one imaginable type of hardware-problem is that thermal agitation results in enough signal-noise to cause some computational errors on the GPU. Those errors are supposed to be recognized when BOINC validates the Work Unit – not through a complete failure of the work unit.
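(One way to keep an eye on this condition is to log the GPU temperature while Work Units run. The following is a minimal sketch in Python, which assumes that the reading comes from ‘nvidia-smi –query-gpu=temperature.gpu –format=csv,noheader,nounits’; the sample string below merely stands in for that command’s output, and the 80°C threshold is my own assumption:)

```python
def parse_temps(nvidia_smi_output):
    """Parse per-GPU temperatures (one integer per line) out of the text
    produced by: nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits"""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]

def too_hot(temps, limit_c=80):
    """Flag whether any GPU has reached the assumed 80 C danger zone."""
    return any(t >= limit_c for t in temps)

# Stand-in for the command's output, for a single GPU reading 70 C:
sample = "70\n"
print(parse_temps(sample))            # [70]
print(too_hot(parse_temps(sample)))   # False
```

(Run periodically – say, from a cron job – such a check would reveal whether the card ever strays above 70°C overnight.)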
But, during a typical night when my computer was scheduled to run BOINC Work Units, one of exactly two results would greet me the next day:
- A total of maybe 5 completed GPU Work Units, which always passed validation,
- A long slew of computation errors – more than 20 – which could not even be submitted for validation, and zero completed Work Units.
This behavior would not be consistent with thermal agitation causing bit-errors on the GPU, and the GPU should have been just fine running at 70°C periodically.
However, this observation is not airtight proof that my GPU is working fine. There exist other possible events which can take place, including temperatures exceeding 80°C and resulting in the catastrophic failure of the GPU. I’ve never seen the temperatures go that high. And, FWIW, I’m still able to enjoy the graphics normally, including (unaffected) desktop compositing, and including 3D games such as “Summoning Wars”.
But, just to be sure, I now intend to take further steps, to test whether my GPU might not be working correctly at the hardware-level.