Overheated Circuitry

One of the things I do frequently is ‘walk around’, or ‘use public transit’, with my disposable earphones plugged in to my Samsung Galaxy S9 smart-phone, listening to music. These earphones are clearly not the AKG-approved ones which shipped with the phone. But this week-end marks the second heat-wave this Summer during which outside daytime temperatures exceeded 31⁰C, with direct sunlight and not a cloud in the sky. And under those conditions, the battery of my phone starts to hit a temperature of 42⁰C. One of the facts which I know is that Lithium-Ion batteries, like the one in my phone, do not tolerate temperatures exceeding 41⁰C.
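
(As an aside, one way to verify a reading like that 42⁰C figure is to poll the phone’s own battery sensor from a PC. The following is only a minimal sketch, assuming that ‘adb’ is installed, that USB debugging is enabled on the phone, and that ‘dumpsys battery’ reports the temperature in tenths of a degree Celsius, which is the usual Android convention but not guaranteed on every device.)

#!/usr/bin/env python3
# Minimal sketch: poll the phone's battery temperature over ADB.
# Assumes 'adb' is on the PATH, USB debugging is enabled, and that
# 'dumpsys battery' reports "temperature:" in tenths of a degree Celsius.

import re
import subprocess
import time

def battery_temp_celsius() -> float:
    out = subprocess.run(
        ["adb", "shell", "dumpsys", "battery"],
        capture_output=True, text=True, check=True
    ).stdout
    match = re.search(r"temperature:\s*(\d+)", out)
    if not match:
        raise RuntimeError("No temperature line found in dumpsys output.")
    return int(match.group(1)) / 10.0

if __name__ == "__main__":
    while True:
        temp = battery_temp_celsius()
        warning = "  <-- above 41 C!" if temp > 41.0 else ""
        print(f"Battery: {temp:.1f} C{warning}")
        time.sleep(60)   # poll once per minute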

A peculiar behaviour which has set in for the second time, during this second heat-wave of the season, is that the music I was listening to would either skip back to the beginning of the song, or skip ahead one song, or just stop. One catastrophic sort of explanation I could think of would be that the entire phone, with its battery, is finally just having a meltdown. But a second possibility exists: that merely the chip in the earphone-cord is malfunctioning. After all, the little pod in the earphone-cord has one button and a mike, and it’s actually cheaper to mass-produce the one chip that makes it work than it would be to mass-produce the equivalent discrete components. One cheap chip could just be malfunctioning in the extreme heat, and not the entire, complex circuitry of the phone. (:1)
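
(For what it’s worth, on a typical 4-pole headset the single button and the microphone share the MIC line: pressing the button pulls the line’s resistance low, and the phone interprets single, double or triple clicks as play/pause, next track and previous track. The sketch below only illustrates how a heat-stressed chip or contact, dipping the line low at the wrong moments, could register as phantom gestures; the threshold, timings and click-mapping are assumptions, not any phone’s real values.)

# Illustrative sketch only: a single-button in-line remote shorts the MIC line,
# and the phone maps single / double / triple clicks to transport controls.
# All numbers here are assumptions made for illustration.

def pressed(ohms: float) -> bool:
    return ohms < 100          # near-short = button down (assumed threshold)

def interpret(click_count: int) -> str:
    return {1: "PLAY_PAUSE", 2: "NEXT_TRACK", 3: "PREVIOUS_TRACK"}.get(click_count, "IGNORED")

# Hypothetical samples taken every 50 ms.  A healthy, idle line sits at the
# microphone's resistance (thousands of ohms); a glitching chip can dip low
# twice in quick succession, which then decodes as a double-click:
samples = [4700, 4700, 30, 4700, 40, 4700, 4700, 4700]

clicks = 0
prev = False
for ohms in samples:
    now = pressed(ohms)
    if now and not prev:
        clicks += 1
    prev = now

print("Phantom gesture decoded as:", interpret(clicks))   # -> NEXT_TRACK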

The earphones cost me about $15, while the phone is much more expensive than that.

But even if it were true that only the little remote-control in the earphone-cord was malfunctioning, this could lead to impractical situations, because random patterns of unreal button-press combinations could also send the software of my phone into a confused state, even if the circuitry in the smart-phone itself never malfunctioned. This behaviour could get misinterpreted by the security apps of the phone, let’s say, as though somebody had ripped the earphone-cord off my head and thrown all my possessions around.

All that was really happening was that my music was no longer playing, as I was walking home normally, in the heat, with my overheated electronics. And when I got home, my actual phone never displayed any signs of having malfunctioned.

(Updated 8/17/2019, 17h50 … )


One of my A/Cs has just failed (Not).

In the Greater Montreal Area (Canada), we have been subject to a prolonged heat-wave, with daily high temperatures of 35⁰C or more, for approximately a week in a row now. This is expected to continue at least until tomorrow (Thursday, July 5). Luckily, my own home has been protected by two working, 8,500-BTU air-conditioners until now.

The way an A/C works is that it has a compressor-motor, the windings of which are cooled by the return-flow of refrigerant in its gaseous form, after that refrigerant has evaporated in the evaporator and done its job of cooling the home. Yet, these motors are not designed to run at 100% duty, 24/7. They need to cycle off periodically, and one reason for which they normally cycle off is that they have achieved some sort of (low) target temperature. But another reason for which the compressor-motor can switch off is the possibility that its windings themselves, which are linked to yet another temperature-sensor, have overheated.
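
(To make those two cut-out conditions concrete, here is a minimal sketch of the sort of control decision I am describing. The target, hysteresis and winding-temperature limits are hypothetical numbers; a real A/C’s controller will differ.)

# Minimal sketch of the two reasons a compressor cycles off:
#  1. the evaporator reaches its target temperature (with some hysteresis), or
#  2. the winding temperature-sensor trips the over-temperature protection.
# All numbers are hypothetical.

TARGET_C      = 20.0   # evaporator target temperature
HYSTERESIS_C  = 2.0    # only restart after it warms back up a bit
WINDING_MAX_C = 105.0  # assumed winding cut-out temperature

def compressor_should_run(running: bool, evap_c: float, winding_c: float) -> bool:
    if winding_c >= WINDING_MAX_C:                 # thermal protection always wins
        return False
    if running:
        return evap_c > TARGET_C                   # keep running until the target is reached
    return evap_c > TARGET_C + HYSTERESIS_C        # restart only once it has warmed back up

print(compressor_should_run(True, evap_c=19.5, winding_c=80.0))    # False: target reached
print(compressor_should_run(True, evap_c=24.0, winding_c=110.0))   # False: windings too hot
print(compressor_should_run(True, evap_c=24.0, winding_c=80.0))    # True: keep cooling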

Even worse than having the temperature-protection trip once, because of overheated windings, is the very common problem that eventually, the enamel insulation of the windings may itself fail, causing a permanently defective motor! This tends to happen sooner or later, because of the cheap way the motors are made.

If that happens, certain turns of the enamelled wire within the motor-windings will act as if they were the secondary winding of a transformer, to which the still-healthy turns form the primary winding. A heavy current flows through the short-circuited turns in this way, which can be hard to detect, unless one also measures the exact amount of current drawn by the running motor and compares it to a known, correct amount of current, which I do not know for the motor in question. But if a winding has in fact started to short in this way, the amount of heat that builds up inside it becomes that much more acute, of course.
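
(To put rough numbers on why one shorted turn runs so hot, here is a back-of-the-envelope sketch. Every value in it is an assumed, merely plausible figure, not a measurement of the motor in question, and the leakage inductance of the shorted turn is ignored.)

# Back-of-the-envelope sketch: circulating current in a single shorted turn.
# Each turn of a winding sees roughly the same induced EMF, so a shorted turn
# has the full per-turn EMF driving current through nothing but its own tiny
# resistance.  All numbers below are assumptions.

supply_emf_v   = 120.0    # voltage across the whole winding (assumed)
turns          = 400      # number of turns in that winding (assumed)
turn_length_m  = 0.25     # length of wire in one turn (assumed)
wire_ohm_per_m = 0.02     # resistance per metre of the magnet wire (assumed gauge)

emf_per_turn    = supply_emf_v / turns              # ~0.3 V across the shorted turn
turn_resistance = turn_length_m * wire_ohm_per_m    # ~0.005 ohm
circulating_a   = emf_per_turn / turn_resistance    # ~60 A circulating locally
heat_w          = circulating_a ** 2 * turn_resistance

print(f"EMF per turn:        {emf_per_turn:.2f} V")
print(f"Circulating current: {circulating_a:.0f} A")
print(f"Heat in that turn:   {heat_w:.0f} W")   # tens of watts, inside one turn of thin wire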

What I am used to from my A/Cs, is that they will run for about 15 minutes, if they fail to reach a target temperature, before their compressors cycle ‘off’. But the A/C in my bedroom, where the temperature is 24⁰C right now, has started to run for only 5 minutes, before turning off. And I have it set to achieve an evaporator-temperature of 20⁰C.

I’ve decided to switch off both my A/Cs temporarily, even though the temperature outside is 35⁰C at the moment, in hopes that they will recuperate. While they are switched off, of course, it will start to get warmer in all parts of my home, including in the computer room.

If I should not be able to keep my indoor temperatures under control, I will need to shut down my actual computers next, which are more important to me than the A/Cs, or than my own, personal comfort. In such an event, my blog will also go offline. For the moment, my site and blog are still accessible. But depending on what happens next, there could be some downtime.

(Edit 07/04/2018, 23h05 : )

Apparently, my A/C is still fine. But in order for me to understand this strange behaviour, I need to take into account the peculiar way in which my present A/Cs are designed. They are both indoor, portable A/Cs, which have air-ducts that send warm air, carrying the waste heat, out a window.


New Case-Fan Installed

In previous postings, I had written about crashes which the computer I name ‘Phoenix’ was suffering from. And I had written that one possible reason could have been the failed case-fan, which could have been causing something on the motherboard to overheat.

Just today, this box suffered another similar crash. This time, I opened up the case and replaced the 92mm case-fan. Therefore, the reader might expect some optimism on my part, that this server-box will not crash again. But in reality, I have two reasons for which my optimism remains limited:

  1. If an overheated chip has already caused crashes, there is some tendency for it to suffer from a memory-effect, of wanting to fail again whenever it gets even slightly warm. Therefore, because the first crash may have happened for that reason, this machine could now have a penchant for crashing, even though the initial cause has been removed.
  2. The cause may not have been an overheated chip, but rather a pure software problem with the legacy graphics driver (nVidia). On such a big display, the graphics driver may have been suffering from some sort of resource leak – aka memory leak – and during boot-up, the BIOS reports that the chip only possesses 128MB of shared RAM! Thus, the problem could be cumulative, resulting from regular copying-and-pasting, with many HW-accelerated drawing surfaces and many compositing effects enabled. Once we have an unstable graphics driver – and the graphics driver has received several updates recently – having a stable one could be a luxury we cannot easily reproduce. (A crude way to watch for such a leak is sketched right after this list.)
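
(The following is only a sketch of that crude leak-watch: it logs the resident memory of the X-server once an hour, so that any steady upward creep becomes visible over a few days. It assumes a Linux system on which the display server’s process is named ‘Xorg’, and it will not see the BIOS-reserved 128MB directly; but since this chip shares RAM with the motherboard, a driver-side leak would tend to show up in the X-server’s ordinary footprint.)

#!/usr/bin/env python3
# Crude leak-watch sketch: log the X-server's resident memory (VmRSS) once an
# hour.  Assumes the display server's process is named 'Xorg'.

import subprocess
import time

def xorg_rss_kb() -> int:
    pid = subprocess.run(["pidof", "Xorg"],
                         capture_output=True, text=True).stdout.split()[0]
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])    # the kernel reports this in kB
    raise RuntimeError("VmRSS not found")

if __name__ == "__main__":
    while True:
        print(time.strftime("%Y-%m-%d %H:%M"), xorg_rss_kb(), "kB")
        time.sleep(3600)                       # once per hour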

I was down from roughly 19h00 until 20h00, and apologize to my readers for any inconvenience.

Dirk

BTW: I have an additional reason not really to believe that these crashes are due to an overheated graphics chip. During an actual reboot, the graphics chip should get especially hot, and especially so if the case-fan is not turning.

I can see that if this chip did overheat, the TDR would not be able to reboot it.

But the crashes never seem to occur, directly after the reboot. I generally seem to obtain about 6 days of smooth computing, before another crash happens.

Also, it should not be a VRAM leak, because this is a pre-GPU-type graphics chip (the lspci output below identifies it). With the old graphics chips, which maximally had several pixel and several vertex pipelines, VRAM consumption was more or less static, while with the more-modern GPUs, some amount of VRAM-creep is at least plausible.

 


root@Phoenix:/home/dirk# lspci | grep vga
root@Phoenix:/home/dirk# lspci | grep VGA
00:0d.0 VGA compatible controller: NVIDIA Corporation C61 [GeForce 6150SE nForce 430] (rev a2)
root@Phoenix:/home/dirk# lspci -v -s 00:0d.0
00:0d.0 VGA compatible controller: NVIDIA Corporation C61 [GeForce 6150SE nForce 430] (rev a2) (prog-if 00 [VGA controller])
        Subsystem: Hewlett-Packard Company Device 2a61
        Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 21
        Memory at fb000000 (32-bit, non-prefetchable) [size=16M]
        Memory at e0000000 (64-bit, prefetchable) [size=256M]
        Memory at fc000000 (64-bit, non-prefetchable) [size=16M]
        [virtual] Expansion ROM at f4000000 [disabled] [size=128K]
        Capabilities: [48] Power Management version 2
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
        Kernel driver in use: nvidia

root@Phoenix:/home/dirk#


 

Another possible hypothesis, for why my server-box sometimes crashes.

I have written before, that my Linux computer ‘Phoenix’, which acts both as my server and a workstation, sometimes crashes. I have another possible explanation for why.

The graphics chip on this machine is only a GeForce 6150SE (nForce 430), capable of OpenGL 2.1.2 using proprietary (legacy) drivers. It only has 128MB of memory, shared with my motherboard.

Under Windows 10, this chipset is no longer supported at all.

I may simply be pushing this old GPU too hard.

My display is a 1600×1200 monitor, and much of the graphics memory is simply being taken up by that fact. Also, I have many forms of desktop compositing switched on. And at the time of the last crash, I had numerous applications open at the same time, which use hardware 2D acceleration as part of their canvas. And I was copying and pasting between them.

I am hoping that this is easing the burden on my equally-dated CPU.
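
(To give a sense of scale for the memory claim above, here is a rough, assumed calculation of how quickly 128MB disappears at 1600×1200 once compositing gives every window its own off-screen buffer. The window count and sizes are made up for illustration.)

# Rough arithmetic sketch: how far 128 MB of shared graphics memory goes at
# 1600x1200 with 32-bit colour.  Window counts and sizes are assumptions.

bytes_per_pixel = 4
screen = 1600 * 1200 * bytes_per_pixel             # one full-screen surface (~7.3 MB)
double_buffered_desktop = 2 * screen               # front + back buffer

# With compositing, every top-level window also gets its own off-screen buffer:
windows = 8                                        # assumed number of open windows
avg_window_pixels = 1200 * 900                     # assumed average window size
window_buffers = windows * avg_window_pixels * bytes_per_pixel

total_mb = (double_buffered_desktop + window_buffers) / (1024 * 1024)
print(f"Full-screen surface:   {screen / 2**20:.1f} MB")
print(f"Rough composited load: {total_mb:.1f} MB of 128 MB, before any textures or pixmaps")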

But then the triggering factor may simply be an eventual error in the GPU.

The fact that the Timeout Detection and Recovery (‘TDR’) does not kick in to save the session may be due to the possibility that the TDR only works in specific situations, such as OpenGL, 3D-rendering windows. If the GPU crash happens as part of the compositing, it may take out the X-server, and therefore my whole system.

The only workaround I may have, is to avoid using this box as a workstation. When I avoid doing that, it has been known to run for 60 days straight, without crashing…

Dirk

(Edit 01/28/2017 : )

I use a widget on my desktops, which is named ‘‘, and I find that it gives me a good intuitive grasp of what is happening on my Linux computers.


 

[Screen-shot: phoenix_temperatures_1 – the widget’s temperature and fan-speed read-out on ‘Phoenix’]

This widget has the disadvantage that, when extensions have been installed to display temperatures, we sometimes do not know which temperature-sensor stands for which temperature. This is due to the fact that Linux developers have to design their software without any knowledge of the specific hardware it is going to run on. Conversely, the makers of proprietary drivers know exactly which machines those are going to run on, and can therefore identify what each sensor stands for.

This also means that sometimes we have temperature readings in ‘‘ which may just be spurious, and which may constantly display one meaningless number, in which case we reduce our selection of indicated temperatures to the ones we can identify.

In the context of answering my own question, another detail which becomes relevant is the fact that this tower computer has a failed case-fan, which is accurately being indicated as the ‘‘ entry, running at 46 RPM at the moment of the screen-shot. I know that this case-fan is in fact stalled, from past occasions when I opened up the tower.
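
(For anyone who wants to see the raw values such widgets are reading, the same sensors are exposed under /sys/class/hwmon on Linux, together with whatever labels, if any, the drivers provide. The following is only a minimal sketch; as discussed above, which physical sensor each entry corresponds to cannot be known in advance on an arbitrary board.)

#!/usr/bin/env python3
# Minimal sketch: list every temperature and fan input the kernel's hwmon
# subsystem exposes, with whatever label the driver provides.  Unlabelled or
# constant, meaningless entries are the 'spurious' readings described above.

import glob
import os

for hwmon in sorted(glob.glob("/sys/class/hwmon/hwmon*")):
    try:
        chip = open(os.path.join(hwmon, "name")).read().strip()
    except OSError:
        chip = "(unnamed)"
    print(f"== {chip} ({hwmon})")
    inputs = (glob.glob(os.path.join(hwmon, "temp*_input")) +
              glob.glob(os.path.join(hwmon, "fan*_input")))
    for inp in sorted(inputs):
        label_file = inp.replace("_input", "_label")
        label = open(label_file).read().strip() if os.path.exists(label_file) else "(no label)"
        raw = int(open(inp).read().strip())
        if os.path.basename(inp).startswith("temp"):
            print(f"   {os.path.basename(inp):12s} {label:20s} {raw / 1000.0:6.1f} C")
        else:
            print(f"   {os.path.basename(inp):12s} {label:20s} {raw:6d} RPM")
    print()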
