Hey y’all! Recently my homeserver (an old laptop) has started crashing every night (after weeks of uptime just working), without anything useful in the logs. Any suggestions about what it might be? (I just started logging battery info to test tonight.)
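Since battery logging came up: here is a minimal sketch of what such a logger could look like, assuming the laptop exposes its battery under `/sys/class/power_supply/BAT*` (the usual Linux location, but not guaranteed on every machine). The function names and the choice of fields are mine, not anything from the original setup. The `append_synced` helper forces each line to disk, which makes it more likely that the last sample before a hard power-off is actually there after reboot.

```python
import os
import time
from pathlib import Path


def read_battery(supply_dir="/sys/class/power_supply"):
    """Return {battery_name: {field: value}} for every BAT* entry found."""
    readings = {}
    for bat in sorted(Path(supply_dir).glob("BAT*")):
        fields = {}
        for name in ("status", "capacity", "voltage_now"):
            try:
                fields[name] = (bat / name).read_text().strip()
            except OSError:
                # Not every driver exposes every attribute; skip missing ones.
                continue
        readings[bat.name] = fields
    return readings


def append_synced(log_path, line):
    """Append one line and fsync it, so it is not left sitting in a cache."""
    with open(log_path, "a") as fh:
        fh.write(line + "\n")
        fh.flush()
        os.fsync(fh.fileno())


if __name__ == "__main__":
    # Print a single timestamped sample; schedule this once a minute.
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    print(stamp, read_battery())
```

Run once a minute from a timer and written with `append_synced`, the last line in the file also gives a rough time of death, which the normal logs apparently don't.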
I don’t know the root of your ills on your server, but I have an interesting story to share (shared by my husband who was an engineer at the company mentioned below).
Back in 1998, the engineers at Be, Inc. (who were developing BeOS, a beloved OS at the time) were experiencing kernel panics right after 7 am, on a specific computer. All of the crashes happened at around the same time, while the computer was running tests all night. It had become a big mystery because they couldn’t find the bug.
It took them days, but they decided to sit around at 7 am to see what was happening. They saw that a single, strong sun ray was entering the room from the window, and was directly hitting the PC’s floppy drive (the PC was not completely closed up with its cover, since it was a test machine). They found that the sun ray would alter some bits in the electronics and whatnot, and would crash the kernel! :o)
I’ve got a fun one to share from my college programming professor. Similar situation, they had a machine that kept locking up, and this was back in the days of huge mainframes the size of rooms. So they call the repair tech from the manufacturer.
So the repair tech shows up to the office, gets the rundown on what’s been going on, then goes out to his car, brings in a huge piece of wood, and just starts wailing on the thing as hard as he can. The whole office was freaking out thinking this guy had lost it. He later explained that the memory was a grid of magnetic coils, and the coils would rust and the rust shavings would fall between the coils below, corrupting the memory bits. So he was shaking them loose by slamming the machine with this piece of wood. Lol wild times.
TIL “wailing on something” = hit it with a stick.
Well, I guess after looking it up, it’s actually ‘whaling on something’.
https://www.merriam-webster.com/grammar/usage-of-whale-wail-wale
Bit o’ the ol’ percussive maintenance.
That’s a great story, thanks! Old electronics were particularly sensitive to light and other EM disturbance 😄
Still are, though most often it’s heat rather than photons from sunlight, since it’s not really necessary to disassemble hardware to that extent these days. And there’s processing power available to retry or do other error handling for any interference. For example, running an unshielded Ethernet cable through a wall next to a power cable, or through a room with heavy machinery, can definitely cause data corruption from EM interference, but it will likely manifest as slowness rather than crashing a whole system. There are still lots of things related to stray energy that cause computers or applications to crash; we’re just so used to buggy software now that it’s rarely noticed. 😁
a friend of mine uses a non modern ryzen that has issues with sleep states. his home server ran fine until the os got an update and managed to idle much lower than before. that made his machine crash and was a really weird error to catch. dunno if this could apply to you at all. just throwing it out there.
Had the same problem. Reason in my case was high ssd temperature (passive case) caused by high load from some jellyfin job.
Looking for overnight jobs might help. He could try disabling them and see if the issue stops, and if it does then re-enable them in a controlled way to determine which one was causing it.
Check the dmesg? Could be a hardware issue. Is it maybe overheating or something?
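One catch with dmesg after a crash: it only shows the kernel ring buffer of the current boot, so whatever happened right before the shutdown is gone. If the journal is persistent, the tail of the previous boot can still be read. A minimal sketch, assuming `journalctl` is available and the journal is stored on disk; the helper names are made up for illustration:

```python
import subprocess


def previous_boot_cmd(lines=100, kernel_only=True):
    """Build a journalctl call for the last lines of the previous boot."""
    cmd = ["journalctl", "--no-pager", "--boot=-1", f"--lines={lines}"]
    if kernel_only:
        # --dmesg restricts the output to kernel messages.
        cmd.append("--dmesg")
    return cmd


def run(cmd):
    """Run a command and return its output, or a note if it cannot be run."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    except OSError as exc:
        return f"{cmd[0]}: could not run ({exc})\n"
    return result.stdout or result.stderr


if __name__ == "__main__":
    print(run(previous_boot_cmd()))
```

If the output just stops mid-activity with no shutdown messages, that points toward sudden power loss or a hard hang rather than a clean shutdown.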
Seems you are using NixOS. Maybe you can try one of those fancy rollback features it has and see if that makes a difference?
worth a try (though somewhat useless given the number of shenanigans i’ve been changing constantly, lol, will be hard to find the correct commit)
Check your cron and systemd timers to see if a regular scheduled job is running at that time.
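For anyone wanting to dump both in one go, here is a rough sketch. It only wraps `systemctl list-timers --all` and `crontab -l`; the `COMMANDS` table and function names are my own, and on some setups (NixOS included) `crontab` may simply not be installed, which the helper reports instead of failing.

```python
import subprocess

# Commands to run; extend this with whatever scheduler the box uses.
COMMANDS = {
    "systemd timers": ["systemctl", "list-timers", "--all", "--no-pager"],
    "user crontab": ["crontab", "-l"],
}


def run(cmd):
    """Run a command and return its output, or a note if it cannot be run."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    except OSError as exc:
        return f"{cmd[0]}: could not run ({exc})"
    return result.stdout or result.stderr


def report(commands=COMMANDS):
    """Return one text section per command, separated by blank lines."""
    sections = []
    for title, cmd in commands.items():
        sections.append(f"== {title} ==\n{run(cmd).rstrip()}")
    return "\n\n".join(sections)


if __name__ == "__main__":
    print(report())
```

The NEXT and LAST columns of the timer list are the ones to compare against the time the machine goes down.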
Maybe look into using the pstore, it can store kernel panics in ACPI or UEFI variables to be read by the next boot. Usually this is accessible at /sys/fs/pstore, but if systemd-pstore is installed then it should be in the journal, but it can also be here: /var/lib/systemd/pstore.
Thanks, I checked some forums, and despite allegedly being enabled by default, pstore doesn’t exist as a folder and /sys/fs/pstore sits empty
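To re-check both locations after the next crash, something like this could do. It is only a sketch: it walks the two paths mentioned above and lists any files it finds, and an empty result just means nothing was recorded (or that the directory is not readable without root).

```python
from pathlib import Path

# The two locations mentioned in this thread.
PSTORE_DIRS = ("/sys/fs/pstore", "/var/lib/systemd/pstore")


def list_pstore(dirs=PSTORE_DIRS):
    """Return (path, size_in_bytes) for every file under the given dirs."""
    found = []
    for directory in dirs:
        try:
            for entry in sorted(Path(directory).rglob("*")):
                if entry.is_file():
                    found.append((str(entry), entry.stat().st_size))
        except OSError:
            # Missing or unreadable directory: nothing to report from it.
            continue
    return found


if __name__ == "__main__":
    records = list_pstore()
    if not records:
        print("no pstore records found")
    for path, size in records:
        print(f"{size:>8}  {path}")
```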
Check more logs - your webserver logs, auth logs, dmesg, journalctl entries, etc.
Define “crashing”
Turning off without leaving logs. When I turn it on (by pressing the power button), it does so normally
Does it have a system log in the BIOS? What’s the battery like, present and healthy or dead/removed? Fans working, not overheating? Does it die if you stress test the CPU/GPU/memory?
I’ve tried stress testing for about an hour: it didn’t die, the fans work, and it does run hot and thermal throttle, but apparently under control. The battery is in and the level is stable; even during stress testing the level doesn’t budge.
Strange. Memtest? I can’t think of anything else off the top of my head, unless you sit up and watch it.
Running memtest to check if it’s a RAM issue might be worth it.
Also could be overheating storage, that can cause weird issues.
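If you want numbers for that, the kernel's hwmon interface is one place to look. A minimal sketch, assuming sensors show up under `/sys/class/hwmon` with `temp*_input` files in millidegrees Celsius. NVMe drives usually appear there as a chip named `nvme`; whether a SATA drive appears depends on the kernel and drivers, so treat that part as an assumption.

```python
from pathlib import Path


def read_temps(hwmon_dir="/sys/class/hwmon"):
    """Return a list of (chip_name, sensor_file, degrees_celsius)."""
    temps = []
    for chip in sorted(Path(hwmon_dir).glob("hwmon*")):
        try:
            chip_name = (chip / "name").read_text().strip()
        except OSError:
            # Fall back to the directory name if the chip has no name file.
            chip_name = chip.name
        for sensor in sorted(chip.glob("temp*_input")):
            try:
                milli_c = int(sensor.read_text().strip())
            except (OSError, ValueError):
                # Some sensors error out or return junk when read; skip them.
                continue
            temps.append((chip_name, sensor.name, milli_c / 1000.0))
    return temps


if __name__ == "__main__":
    for chip_name, sensor, celsius in read_temps():
        print(f"{chip_name:<12} {sensor:<14} {celsius:.1f} C")
```

Logged every minute overnight, this would show whether the drive or CPU is climbing right before the machine goes down.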
Make sure you don’t have a full disk somewhere
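A quick way to check several mount points at once, as a sketch: the path list and the 90% threshold are arbitrary assumptions, so adjust them to the actual layout. The percentage is used/total, which can differ a little from what `df` prints because of reserved blocks.

```python
import shutil


def percent_used(path):
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo filesystems report a size of zero.
        return 0.0
    return 100.0 * usage.used / usage.total


def nearly_full(paths, threshold=90.0):
    """Return (path, percent) for each path at or above the threshold."""
    flagged = []
    for path in paths:
        try:
            pct = percent_used(path)
        except OSError:
            # Path does not exist on this machine; ignore it.
            continue
        if pct >= threshold:
            flagged.append((path, round(pct, 1)))
    return flagged


if __name__ == "__main__":
    # Hypothetical list of mount points; threshold 0 prints all of them.
    for path, pct in nearly_full(["/", "/boot", "/var", "/tmp", "/nix"], threshold=0.0):
        print(f"{path}: {pct}% used")
```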