In the corner of my parents’ office is a small computer acting as a file and email server. It’s a little workhorse that’s been going for ages, but late last year it started turning itself off. Mum and Dad would arrive in the morning and find it inexplicably lightless, despite no power cuts.
I tracked it down to the weekly full HD-clone backup, and could at least reproduce the problem: the machine just conked out, with no blue-screen or anything in the Windows event log. This suggested it was hardware-related, and there were a few possibilities, the most likely being one of the hard drives. I tried running full sector scans, but the machine conked out halfway through. I took the drives away to test them, but found no errors. The next most likely cause (I thought) was the power supply. So I replaced that and the next backup promptly reported sector errors on one of the hard drives. So I replaced that too, and the backup completed – yay!
A week later: same problem.
So I swapped out the RAM. No difference. At this point I was starting to think it must be Random Motherboard Crap. Sometimes you get a problem you can’t trace, so you replace the motherboard and CPU, and everything’s ok but for the lingering feeling that maybe you missed something. In this case, I would indeed have missed something if I’d replaced the lot.
Last week I sent an email around to various friends, asking if they had any other thoughts. My friend Ben got back to me the same evening with a list of things to try, one of which was a stress test – did it fall over under high CPU load? I tried Prime95 last night and the machine fell over in minutes – far less time than the backup typically took to take down the system. It was pretty late by this point, so I stopped for the night and while driving home suddenly realised the incredibly obvious possibility.
This evening I ran the old-school Motherboard Monitor, and watched as the stress test took the CPU temperature up to 105 degrees Celsius. 105 degrees! That is a ridiculously high temperature for a non-ancient CPU. So I took a look inside, really really hoping I hadn’t somehow missed the CPU fan having fallen off, or something. No – the fan was in place and still going, but there was a bit of dust clogging the heatsink. I vacuumed it up and re-ran the stress test.
It peaked at 66. A not-crazy amount of dust had increased the temperature by nearly 40 degrees! It survived the stress test without issue, and is currently running a full backup for the first time in months[1].
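In hindsight, the lack of any trace in the event log is exactly what a temperature log would have made up for. Here is a rough sketch of the idea in Python, not what I actually ran: it appends each reading to a file and forces it to disk, so the last line survives a sudden power-off. The function name `log_temperatures`, the `read_temp` callable and the 90 degree warning threshold are all made up for illustration; where the readings come from depends on your platform and sensor tool, so that part is left to you.

```python
import os
import time


def log_temperatures(read_temp, log_path, samples, interval_s=1.0, warn_at_c=90.0):
    """Append timestamped readings to log_path and return the peak reading.

    read_temp is a hypothetical callable returning the CPU temperature in
    degrees Celsius; plug in whatever sensor source your machine offers.
    """
    peak = float("-inf")
    with open(log_path, "a", encoding="utf-8") as log:
        for i in range(samples):
            temp = read_temp()
            peak = max(peak, temp)
            # Mark readings at or above the (assumed) warning threshold.
            flag = " HOT" if temp >= warn_at_c else ""
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"{stamp} {temp:.1f}C{flag}\n")
            # Push the line to disk now, so it survives an abrupt shutdown.
            log.flush()
            os.fsync(log.fileno())
            if i < samples - 1:
                time.sleep(interval_s)
    return peak
```

Left running during a backup or a stress test, the tail of the file shows how hot things were just before the machine died.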
I’m a little annoyed I didn’t think of this. Overheating used to be the go-to problem for random shutdowns, but modern computers run so cool that it’s now pretty uncommon. But it shouldn’t have taken me four months to figure it out. Oh well, at least they’ve got a shiny new power supply.
How come the backup completed that once? Could be chance, but I bet I left the side of the case off, having just installed the new drive. Still didn’t twig, though.
I think Ben was thinking ‘overheating’, but didn’t want to say it so bluntly so I wouldn’t feel bad. He’s subtle that way. Thanks, Ben!
[1] Not quite so bad as it sounds – the important files were being backed up online, but this is far from optimal. The golden rule of backups is to always have two systems, because one of them will always be broken at the critical moment.