In previous postings, I wrote about crashes that the computer I name ‘Phoenix’ was suffering from. I also wrote that one possible reason could have been the failed case-fan, which could have been causing something on the motherboard to overheat.
Just today, this box suffered another, similar crash. This time, I opened up the case and replaced the 92mm case-fan. The reader might therefore expect some optimism on my part that this server-box will not crash again. But in reality, I have two reasons for which my optimism is limited:
- If an overheated chip has already caused crashes, there is some tendency for it to suffer from a memory-effect, failing again whenever it gets even slightly warm. So if the first crash happened for that reason, this machine could now have a penchant for crashing, even though the initial cause has been removed.
- The cause may not have been an overheated chip, but rather a pure software problem with the legacy graphics driver (nVidia). On such a big display, the graphics driver may have been suffering from some sort of resource leak, such as a memory leak, and during boot-up the BIOS displays that the graphics chip only possesses 128MB of shared RAM! Thus, the problem could be cumulative and result from regular copying-and-pasting, with many HW-accelerated drawing surfaces and many compositing effects enabled. Once we have an unstable graphics driver, and the graphics driver has received several updates recently, getting back to a stable one could be a luxury we cannot easily reproduce.
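One way to test the leak hypothesis would be to record free memory at intervals and see whether it trends downward over the days leading up to a crash. What follows is only a minimal sketch. The function names and the log path are hypothetical, and it is an assumption on my part that a leak in the driver's shared RAM would show up in the MemFree figure of /proc/meminfo at all. MemFree also fluctuates with normal caching, so a downward trend would be suggestive rather than conclusive.

```shell
#!/bin/sh
# Print the MemFree value (in kB) from a meminfo-style file.
# Defaults to /proc/meminfo; the optional argument exists so that
# the parsing can be tried on a saved copy.
mem_free_kb() {
    awk '/^MemFree:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Append one line, "UTC-timestamp free-kB", to a log file.
# The default log path is a hypothetical choice, not an established one.
log_mem_free() {
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(mem_free_kb)" \
        >> "${1:-$HOME/memfree.log}"
}
```

Calling log_mem_free once an hour, for example from cron, would give a record that survives the crash and can be compared between one crash and the next.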
I was down from roughly 19h00 until 20h00, and apologize to my readers for any inconvenience.
BTW: I have an additional reason to doubt that these crashes are due to an overheated graphics chip. During the actual reboot, the graphics chip should get especially hot, all the more so if the case-fan is not turning.
I can see that if this chip did overheat, the TDR (Timeout Detection and Recovery) would not be able to reset it.
But the crashes never seem to occur directly after the reboot. I generally seem to obtain about 6 days of smooth computing before another crash happens.
Also, it should not be a VRAM leak, because this is a pre-GPU-type graphics chip. With the old graphics chips, which had at most several pixel pipelines and several vertex pipelines, VRAM consumption was more or less static, while with the more modern GPUs, some amount of VRAM-creep is at least plausible.
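The figure of about 6 days between crashes could be checked against actual reboot times, for example those listed by 'last reboot'. The helper below is a minimal sketch with a hypothetical function name. It assumes GNU date, whose -d option parses a date string, and it interprets both timestamps as UTC.

```shell
#!/bin/sh
# Whole days elapsed between two timestamps.
# Relies on GNU date's -d option; both arguments are read as UTC.
days_between() {
    start=$(date -u -d "$1" +%s) || return 1
    end=$(date -u -d "$2" +%s) || return 1
    echo $(( (end - start) / 86400 ))
}

# Example with made-up dates:
#   days_between "2016-01-01 19:00" "2016-01-07 20:00"
```

If the intervals cluster tightly, that would fit a cumulative cause such as a leak better than it fits random overheating.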
root@Phoenix:/home/dirk# lspci | grep vga
root@Phoenix:/home/dirk# lspci | grep VGA
00:0d.0 VGA compatible controller: NVIDIA Corporation C61 [GeForce 6150SE nForce 430] (rev a2)
root@Phoenix:/home/dirk# lspci -v -s 00:0d.0
00:0d.0 VGA compatible controller: NVIDIA Corporation C61 [GeForce 6150SE nForce 430] (rev a2) (prog-if 00 [VGA controller])
	Subsystem: Hewlett-Packard Company Device 2a61
	Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 21
	Memory at fb000000 (32-bit, non-prefetchable) [size=16M]
	Memory at e0000000 (64-bit, prefetchable) [size=256M]
	Memory at fc000000 (64-bit, non-prefetchable) [size=16M]
	[virtual] Expansion ROM at f4000000 [disabled] [size=128K]
	Capabilities:  Power Management version 2
	Capabilities:  MSI: Enable- Count=1/1 Maskable- 64bit+
	Kernel driver in use: nvidia
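
The first grep in the session above printed nothing because grep is case-sensitive by default, and lspci writes 'VGA' in capitals. A small sketch that avoids the retry, with a hypothetical function name, uses grep's -i option:

```shell
#!/bin/sh
# Filter lspci-style output on stdin for VGA controllers, ignoring case.
find_vga() {
    grep -i 'vga'
}

# On a live system:
#   lspci | find_vga
```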