Another possible hypothesis, for why my server-box sometimes crashes.

I have written before that my Linux computer ‘Phoenix’, which acts as both my server and a workstation, sometimes crashes. I now have another possible explanation for why.

The graphics chip on this machine is only a , capable of OpenGL 2.1.2 using proprietary (legacy) drivers. It only has 128MB of shared memory with my motherboard.

Under Windows 10, this chipset is no longer supported at all.

I may simply be pushing this old GPU too hard.

My display is a 1600×1200 monitor, and much of the graphics memory is taken up by that fact alone. I also have many forms of desktop compositing switched on. And at the time of the last crash, I had numerous applications open at once which use hardware 2D acceleration as part of their canvas, and I was copying and pasting between them.
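To put a rough number on how much memory the resolution alone consumes, here is a back-of-the-envelope sketch in Python. The 32-bit colour depth, the double buffering, and the idea that every maximised, composited window keeps one full-screen off-screen buffer are assumptions made for illustration, not measurements from this machine, and the function names are hypothetical.

```python
# Rough estimate of how much of a 128 MB graphics allocation the display uses.
# Assumptions (not from the post): 32 bits per pixel, double buffering,
# and one full-screen off-screen buffer per maximised, composited window.

BYTES_PER_PIXEL = 4      # assumed 32-bit colour
MIB = 1024 * 1024


def buffer_mib(width, height, bytes_per_pixel=BYTES_PER_PIXEL):
    """Size of one uncompressed pixel buffer, in MiB."""
    return width * height * bytes_per_pixel / MIB


def estimate_usage_mib(width, height, fullscreen_windows, double_buffered=True):
    """Screen buffers plus one backing buffer per maximised window, in MiB."""
    screen_buffers = 2 if double_buffered else 1
    return buffer_mib(width, height) * (screen_buffers + fullscreen_windows)


if __name__ == "__main__":
    print(f"One 1600x1200 buffer: {buffer_mib(1600, 1200):.1f} MiB")
    for windows in (0, 4, 8):
        used = estimate_usage_mib(1600, 1200, windows)
        print(f"{windows} maximised windows: about {used:.1f} MiB of 128")
```

Under these assumptions a single screen-sized buffer is a little over 7 MiB, so a handful of large composited windows could plausibly claim a substantial share of 128MB before any textures or 2D-accelerated canvases are counted.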

I am hoping that all this hardware acceleration is easing the burden on my equally dated CPU.

But then the triggering factor may simply be an eventual error in the GPU.

The fact that Timeout Detection and Recovery (‘TDR’) does not kick in to save the session may be because TDR only works in specific situations, such as OpenGL 3D-rendering windows. If the GPU crash happens as part of the compositing, it may take out the X-server, and therefore my whole system.

The only workaround I may have is to avoid using this box as a workstation. When I avoid doing that, it has been known to run for 60 days straight without crashing…

Dirk

(Edit 01/28/2017 : )

I use a widget on my desktops, which is named ‘‘, and I find that it gives me a good intuitive grasp of what is happening on my Linux computers.


 

[Screenshot: phoenix_temperatures_1]


 

This widget has one disadvantage: when extensions have been installed to display temperatures, we sometimes do not know which temperature sensor stands for which temperature. This is because Linux developers have to design their software without any knowledge of the specific hardware it is going to run on. Conversely, the makers of proprietary drivers know exactly which machine their drivers are going to run on, and can therefore identify what each sensor stands for.

This also means, sometimes we have temperature readings in ‘‘, which may just be spurious, and which may just constantly display one meaningless number, in which case we reduce our selection of indicated temperatures to ones we can identify.
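As an aside, the raw numbers behind such a widget can also be inspected directly. The sketch below assumes the sensors are exposed through the Linux kernel's hwmon interface under `/sys/class/hwmon`, where `temp*_input` files hold millidegrees Celsius and `fan*_input` files hold RPM; whether this particular widget reads from there is an assumption, and `read_hwmon` is a hypothetical helper name. Sensors without a `_label` file show up as unlabelled, which is exactly the identification problem described above.

```python
from pathlib import Path


def read_hwmon(root="/sys/class/hwmon"):
    """Return a list of (chip, sensor, label, value) tuples.

    Temperatures are converted from millidegrees to degrees Celsius;
    fan speeds are left in RPM.
    """
    readings = []
    for chip_dir in sorted(Path(root).glob("hwmon*")):
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else chip_dir.name
        for input_file in sorted(chip_dir.glob("*_input")):
            sensor = input_file.name[: -len("_input")]  # e.g. "temp1", "fan2"
            if not sensor.startswith(("temp", "fan")):
                continue  # skip voltages and other sensor types
            try:
                raw = int(input_file.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor is listed but cannot be read
            label_file = chip_dir / f"{sensor}_label"
            label = label_file.read_text().strip() if label_file.exists() else "(unlabelled)"
            value = raw / 1000.0 if sensor.startswith("temp") else float(raw)
            readings.append((chip, sensor, label, value))
    return readings


if __name__ == "__main__":
    for chip, sensor, label, value in read_hwmon():
        unit = "°C" if sensor.startswith("temp") else "RPM"
        print(f"{chip:12} {sensor:8} {label:16} {value:8.1f} {unit}")
```

Running it a few times while the machine is idle and then under load is one way to spot the readings that never change, which are the likely candidates for spurious sensors.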

In the context of answering my own question, another detail which becomes relevant, is the fact that this tower computer has a failed case-fan, which is accurately being indicated as the ‘‘ entry, running at 46 RPM at the moment of the screen-shot. I know that this case-fan is in fact stalled, from past occasions when I opened up the tower.

In other situations I might have simply dismissed the temperature shown here as ‘‘. But given the fact that spurious crashes have taken place, I am beginning to suspect that this is some real temperature on my motherboard, especially since the displayed number frequently changes.

That could be my GPU. But in reality, since generic software cannot place the temperature, it could be some other sample. Either way, this could be related to the crashes.

The proper solution to this problem would be to replace the case fan.

When it comes to the actual CPU temperature, which is indicated as ‘‘ above, this is under control, because the CPU has its own fan ‘‘, which is in fact spinning at 6000 RPM as shown.


[Screenshot: phoenix_temperatures_2]


Also, there is no guarantee that the temperature-numbers reported by the sensors represent degrees Celsius, quoted at 1:1. There could exist sensors which are calibrated to state 5× the temperature in °C as their output values. There is really no way for Linux developers to know this, because they do not take specific hardware into account.

The normally has an idle temperature around 45°C. So I could arbitrarily decide to adjust my ‘‘ widget to multiply the sensor output by 0.2, just based on guesswork. But then what the widget displays will only be as accurate as my guesswork:


 

[Screenshot: phoenix_temperatures_3]


 

What I do know is that the crashes this posting describes only take place infrequently, so far. And so a real temperature of 248°C is unlikely, because that would already be hot enough to melt some types of solder. So some correction factor would be called for…
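The arithmetic behind such a correction factor is simple enough to sketch in Python. The 0.2 multiplier is the guess discussed above (the inverse of a hypothetical 5× calibration), while the other candidate factors and the 20 to 100 °C plausibility window are arbitrary assumptions added for illustration.

```python
def corrected_celsius(raw_reading, scale=0.2):
    """Apply a guessed correction factor to a raw sensor reading."""
    return raw_reading * scale


def is_plausible(celsius, lowest=20.0, highest=100.0):
    """Crude sanity check: is this a believable temperature inside a running PC?"""
    return lowest <= celsius <= highest


if __name__ == "__main__":
    raw = 248.0  # the implausible raw value discussed in the text
    for scale in (1.0, 0.5, 0.2, 0.1):  # candidate factors, all guesswork
        value = corrected_celsius(raw, scale)
        verdict = "plausible" if is_plausible(value) else "implausible"
        print(f"scale {scale}: {value:.1f} °C ({verdict})")
```

With a factor of 0.2, the raw 248 becomes about 49.6 °C, which at least lands in a believable range. More than one factor can pass such a check, though, so this narrows the guess without confirming it.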

Another oddity about this sensor seems to be that its reported values wrap around, back to 0.0°C, beyond some temperature. This could also have been an additional reason for me to have skipped its readings during an initial setup of ‘‘, at which point it might have been indicating a temperature close to the freezing point of water.

I can now observe that this happens at higher temperatures and not lower ones, because it happens when my Web-browser is loading, i.e. when the GPU is doing more work. And so, without knowing the exact temperatures, I can treat this reading as indicating something potentially dangerous when it has wrapped around to ‘0.8 °C’, ‘1.5 °C’, etc.
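That rule of thumb can be written down as a small sketch in Python: a reading counts as wrapped when it falls from a high value to near zero between two consecutive samples, and it stays flagged until it climbs back up. The thresholds (200 and 10, in raw sensor units) and the sample series are invented for illustration; only the 0.8 and 1.5 values come from what I observed.

```python
def find_wraparounds(samples, high=200.0, low=10.0):
    """Indices where a reading drops from >= high to <= low in a single step."""
    return [
        i for i in range(1, len(samples))
        if samples[i - 1] >= high and samples[i] <= low
    ]


def classify(samples, high=200.0, low=10.0):
    """Label each sample 'normal' or 'wrapped' (near zero right after a high value)."""
    labels = []
    wrapped = False
    previous = None
    for value in samples:
        if previous is not None and previous >= high and value <= low:
            wrapped = True   # sudden fall from hot to near zero
        elif value > low:
            wrapped = False  # reading has climbed back into the normal range
        labels.append("wrapped" if wrapped else "normal")
        previous = value
    return labels


if __name__ == "__main__":
    raw = [240.0, 254.5, 0.8, 1.5, 250.0]  # made-up raw series
    print(find_wraparounds(raw))
    for value, label in zip(raw, classify(raw)):
        print(f"{value:6.1f}  {label}")
```

A series that starts out near zero is deliberately left as 'normal', since without a preceding high value there is no way to tell a wrapped reading from a genuinely cold one.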

Dirk

 
