Kernel Update today, Downtime, Multiple Reboot-Attempts

Today, the PC which is hosting my site and blog, which I name ‘Phoenix’, received a kernel update.

The Debian Team has not been following standard guidelines in propagating its kernel updates, as the last 3 updates all produced the same kernel-version number:


3.16.0-6-amd64
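A minimal sketch, assuming a Linux host with Python 3, of how two such updates could be told apart. The string above is what 'uname -r' reports. As far as I understand it, on Debian that string names the kernel ABI and can stay the same across updates, while the build string from 'uname -v' carries the version of the package that was actually installed. The function name 'kernel_identity' is only an illustration.

```python
import platform

def kernel_identity():
    """Return (release, build) for the running kernel.

    release corresponds to `uname -r`, build to `uname -v`.
    """
    return platform.release(), platform.version()

release, build = kernel_identity()
print("release:", release)
print("build:  ", build)
```

If two updates show the same release but a different build, the kernel really was replaced, even though the version number did not appear to change.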



Because even Linux computers require a reboot after a kernel-update, this blog was temporarily off-line from about 13h05 until 13h25. I apologize for any inconvenience to my readers.

There is a fact about the build of Linux on this computer which I should bring up. I have the following on-board graphics-chip:


GeForce 6150SE nForce 430/integrated/SSE2



And this proprietary graphics driver is the only one capable of working with the said graphics-chip:


NVIDIA 304.137



The graphics driver is installed from standard Debian repositories.

Somewhere between these software-packages there is a problem, which the Debian Team has never been aware of, but which has existed ever since I installed Debian / Jessie on this computer. Directly after a reboot, the X-server does not start reliably. Sometimes it starts on the first try, but on other occasions I need to make 7 reboot attempts before it will start, and from one reboot-attempt to the next, I change nothing.

Once the X-server has started successfully, this graphics-chip will work 100% for 30 days!

I have been reluctant to point this out for the past few years, because if a Debian developer finds out about it, he will try to fix this problem. And when he does, he will brick my computer.

This afternoon, 7 reboots were in fact required before the X-server started. That is why the reboot-procedure took 20 minutes.

(Updated 07/14/2018, 16h45 … )

Another Caveat, To GPU-Computing

I had written in previous postings that, on the computer which I name ‘Plato’, I had replaced the open-source ‘Nouveau’ graphics-drivers with proprietary ‘nVidia’ drivers, which offer more capabilities. In one previous posting, I described a bug that had developed between these recent graphics-drivers and ‘xscreensaver’.

Well, there is more that can go wrong between the CPU and the GPU of a computer, if the computer is operating a considerable GPU.

When applications set up ‘rendering pipelines’ (aka contexts), they load data-structures as well as register-values onto the graphics card and into its graphics memory. If such an application, which according to older standards would only have resided in system memory, either crashes or gets forcibly closed using a ‘kill -9’ instruction, then the kernel and the graphics driver will fail to clean up whatever data-structures it had set up on the graphics card.

The ideal behavior would be that, if an application crashes, the kernel cleans up not only whatever resources it was using in system memory and within the O/S, but also those in graphics memory. And for all I know, the programmers of the open-source drivers under Linux may have made this a top priority. But apparently, nVidia did not.

And so a scenario which can take place is that the user needs to kill a hung application that was making heavy use of the graphics card, and that afterward the state of the graphics card is corrupted, so that, for example, ‘OpenCL’ kernels will no longer run on it correctly.
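One reason the application itself cannot help in the ‘kill -9’ case is that SIGKILL can never be caught, so no clean-up code inside the process gets a chance to run, and whatever release of GPU resources happens is left to the kernel and the driver. The following is a minimal sketch of that fact only, assuming a POSIX system; the helper name 'can_install_handler' is hypothetical, and nothing here touches the graphics driver.

```python
import signal

def can_install_handler(signum):
    """Return True if this process is allowed to install a handler for signum."""
    try:
        previous = signal.signal(signum, lambda sig, frame: None)
    except (OSError, ValueError):
        # The OS refuses handlers for SIGKILL (and SIGSTOP).
        return False
    if previous is not None:
        signal.signal(signum, previous)  # put the earlier handler back
    return True

print("SIGTERM catchable:", can_install_handler(signal.SIGTERM))
print("SIGKILL catchable:", can_install_handler(signal.SIGKILL))
```

This suggests trying a plain ‘kill’ (SIGTERM) on a hung GPU application first, since that at least gives a cooperating program the opportunity to tear down its contexts.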

I question the amount of VRAM on Phoenix.

I am still contemplating why the server-box I name ‘Phoenix’ was crashing, and my attention keeps coming back to the graphics chip. Before this computer was resurrected, it was running in 32-bit mode, as ‘‘. At that time, it only had 2GB of RAM. But now it runs in 64-bit mode, with 4GB of RAM.

When I boot, the BIOS message still tells me that there are 128MB of shared memory for the graphics chip. But strangely enough, the piece of text I pasted into this posting reads that the graphics driver has set aside 256MB of VRAM, near the top of the 4GB of physical addresses. I did not know that the kernel can override a BIOS setting in this way, let us say just because processing has been switched to 64-bit mode.

One thing which could naively go wrong is that the legacy driver, unaware of the specifics of this motherboard, could be allocating 256MB of shared memory, while physically, the hardware cannot share past the address ‘‘. That is, the address ‘‘ may have become forbidden territory for the graphics card. It is however uncommon for the programmers of kernel-space modules to make such a simple mistake.

This is a 64-bit system, which only accepts up to 4GB of RAM, thus only possessing 32-bit physical addresses, to go with its 64-bit virtual addresses.

According to this screen-shot:

I only have 3.74GB of RAM available to the system, instead of 4GB. The reason is that 256MB have in fact been reserved for the graphics chip. By itself this would seem to suggest that the allocation has succeeded.
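The arithmetic behind that observation can be checked with a short sketch. It assumes that the 4GB and the 256MB are both binary quantities (4096MiB and 256MiB); the function name 'usable_gib' is only an illustration.

```python
MIB_PER_GIB = 1024

def usable_gib(installed_mib, reserved_mib):
    """RAM left for the system after a fixed reservation, in GiB."""
    return (installed_mib - reserved_mib) / MIB_PER_GIB

# 4 GiB installed, 256 MiB set aside for the graphics chip
print(usable_gib(4096, 256))  # 3.75
```

The result of 3.75 is close to the 3.74GB reported. The small remainder would presumably be memory the kernel or firmware reserves for other purposes, but that part is a guess.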

Also, the fact that 49.26MB of shared memory was momentarily being indicated is not too telling, because several types of processes could be using shared memory for some purpose. This feature does not only exist for user-space processes to make texture images available to the graphics card.
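For anyone who wants to watch that figure without a screen-shot, here is a minimal sketch that pulls one field out of text in the format of '/proc/meminfo'. It assumes the shared-memory figure a system monitor displays corresponds to the 'Shmem' line, which I have not verified for every tool. The sample values are made up, and 'meminfo_field_mib' is a hypothetical helper.

```python
def meminfo_field_mib(meminfo_text, field):
    """Return the value of one /proc/meminfo field in MiB, or None if absent."""
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == field and rest.split():
            return int(rest.split()[0]) / 1024  # the file reports kB
    return None

# Made-up sample in the /proc/meminfo layout
SAMPLE = "MemTotal:        3921920 kB\nShmem:             50442 kB\n"

print(meminfo_field_mib(SAMPLE, "Shmem"))
```

On a real Linux system the text would come from reading '/proc/meminfo' instead of the sample string.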

There has been a Dist-Upgrade on my Server.

This server is hosted on a Debian / Jessie (Linux) computer which I own and run myself. Under Debian Linux systems, the most thorough kind of update which can be carried out is called a ‘dist-upgrade’, or a ‘d-u’ for short. Just this evening, I saw that suddenly there were 93 software packages which all needed an upgrade, and I saw that I could not just leave this type of upgrade to the usual, automated services. Therefore, I decided to administer the 93 package-upgrades via a dist-upgrade command. This can be stressful, or exciting, or both, because it can give a Linux user an improvement, or it can in some cases actually cripple the system. I’m glad to say that this Linux box I name ‘Phoenix’ did not get crippled. It’s still fully bootable.
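Before committing to an upgrade of this size, the same command can be run in simulation with 'apt-get -s dist-upgrade', which prints what would happen without changing anything. The sketch below only parses the summary line of such a run. The sample line is made up apart from the count of 93, and 'parse_apt_summary' is a hypothetical helper.

```python
import re

SUMMARY = re.compile(
    r"(\d+) upgraded, (\d+) newly installed, "
    r"(\d+) to remove and (\d+) not upgraded"
)

def parse_apt_summary(output):
    """Extract the package counts from apt-get's summary line, or None."""
    match = SUMMARY.search(output)
    if match is None:
        return None
    keys = ("upgraded", "newly_installed", "to_remove", "not_upgraded")
    return dict(zip(keys, map(int, match.groups())))

# Made-up sample; only the 93 comes from the posting
sample = "93 upgraded, 0 newly installed, 0 to remove and 0 not upgraded."
print(parse_apt_summary(sample))
```

A non-zero 'to_remove' count in the simulation is the usual warning sign that a dist-upgrade is about to do more than just replace packages.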

But due to this procedure, the Web-server was also down, from 20h15 until 20h40 or so. I see that my blog is still here though, after the d-u.

I think that most software updates can be fun and games. But this particular upgrade also chose to include my graphics driver, which I was particularly fussy about. The past version of the graphics driver on this box was extremely stable, and I was trying to avoid doing any sort of upgrade to it, but now doing so was the only way to keep my box compatible with future upgrades.

It has sometimes happened to me that the screen might just freeze, even though it’s a Linux computer, due to stability problems with other graphics drivers, especially with the ‘mesa’ driver, which tries to software-render an OpenGL equivalent. But what has been most stable for me in recent months was the ‘GLX’ driver, which does full hardware OpenGL rendering as it’s supposed to, and which under modern Linux systems is even capable of a ‘TDR’ equivalent, a Timeout Detection and Recovery, which will restart a crashed GPU without harming the active session.

If in the near future I find that my screen does freeze, or that there are TDR issues, a sinking feeling will go through my heart, because that would signal that a completely stable graphics driver has been replaced unnecessarily, with an unstable one. And in the act of doing so, all my package-management scripts even recompiled the DKMS kernel module for the graphics driver in question, because that is the correct way to install it.

Oh Yes, I see that the Apache Web-server software, which my machine hosts, has been given an upgrade as well. But as I see it, this was the least likely set of packages, for the maintainers to have botched. So it’s my full assumption that Web-server activity will continue without error.

Dirk