Fuuuuuuuuck. Ever since I moved those dual-P3 boards into 1U cases I’ve been having shittons of trouble with the one with the PATA HDD. Whenever you put the HDD under any kind of moderate load, the entire system freezes up with something like:
ad0: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=
ad0: FAILURE - READ_DMA48 status=51 error=10 LBA=
g_vfs_done():ad0s1g[READ(offset=48, length=131072)]error = 5
ad0: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=
Which is shit because I want to turn this machine into a fileserver hosted elsewhere, and I can’t have it crashing every time someone needs to fetch a file off of it.
So I’ve been searching around for a solution to this fucking problem. If it’s a bug in the ATA driver, it looks like it was introduced in FreeBSD-5.0. It could very well be a hardware error, though I’ve already swapped out the PSU and RAM (neither of which had any effect on the bug). Later tonight I’m planning on sticking the HDD into another machine and seeing if I can duplicate the issue there (I can’t just swap boards, since my spare motherboard appears to be broken).
Anyway, the most useful post I found on the subject was one from the freebsd-hackers list, which suggested that there might be an off-by-one error in the check that decides when to switch to 48-bit addressing mode.
/* only use 48bit addressing if needed (avoid bugs and overhead) */
- if ((lba > || count > 256) && atadev->param &&
+ if ((lba > || count > 256) && atadev->param &&
atadev->param->support.command2 & ATA_SUPPORT_ADDRESS48) {
/* translate command into 48bit version */
This is from back in 2004, but looking at the 7.0-BETA4 sauce:
suigintou# grep -n *
/usr/src/sys/dev/ata/ata-all.h:309:#define ATA_MAX_28BIT_LBA UL
Changed that to , recompiled and installed the kernel, and the errors still popped up. I’m kind of tempted to set that value fairly low and see what happens (off-by-two error? unlikely)
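To convince myself where an off-by-one could even hide in a check like that, here’s a rough sketch of the boundary logic. This is NOT the driver’s code and not the patch from the list: the names (needs_48bit, naive_check) are made up, and the limits are just the ones you get from the 28-bit command format itself (28 address bits, sector count of at most 256), not necessarily the constants in ata-all.h.

```python
# Limits implied by the 28-bit ATA command format (assumption: these are
# the theoretical limits, not necessarily the values used in the kernel).
SECTORS_28BIT = 1 << 28    # 28-bit LBA can name sectors 0 .. 2**28 - 1
MAX_COUNT_28BIT = 256      # 8-bit count register, where 0 encodes 256


def needs_48bit(lba, count):
    """Conservative check: the whole transfer must fit in 28-bit space."""
    last_sector = lba + count - 1
    return last_sector >= SECTORS_28BIT or count > MAX_COUNT_28BIT


def naive_check(lba, count):
    """Hypothetical buggy variant: only looks at the start, and uses '>'."""
    return lba > SECTORS_28BIT or count > MAX_COUNT_28BIT


if __name__ == "__main__":
    edge = SECTORS_28BIT - 1  # last sector a 28-bit command can name
    for lba, count in [(0, 256), (edge, 1), (edge, 2), (edge + 1, 1)]:
        print(lba, count, needs_48bit(lba, count), naive_check(lba, count))
```

The interesting rows are the last two: a transfer that starts on the last 28-bit sector but runs past it, and one that starts exactly one sector beyond. The naive variant waves both through as 28-bit, which is the kind of thing that would only bite on requests landing right at the boundary.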
The other weirdness is the output from smartctl -a /dev/ad0:
blah blah
ATA Error Count: 7 (device log contains only the most recent five errors)
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
00 e0 Error: ICRC, ABRT at LBA = 0x00000000 = 0
Hrm, I was thinking that the LBA was fucked up, but ICRC (the error code) stands for “interface CRC” error, meaning the data got mangled somewhere between the drive and the controller rather than on the disk itself. Anyway, all 7 errors are exactly the same. The error counter doesn’t seem to increase when the system crashes. Everything else from the smartctl output seems reasonable (no errors, HDD is fairly new and in good condition).
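Since the status and error numbers in these messages are just bitmasks, here’s a quick sketch for decoding them. Assumptions: the values in the kernel message (status=51 error=10) are hex, and the bit names follow the usual ATA register layout (a couple of the names changed between ATA revisions). The decode helper is my own, not something from smartctl or the kernel.

```python
# ATA error register bits, highest bit first (names vary a bit by ATA revision).
ERROR_BITS = [
    (0x80, "ICRC"),  # interface CRC error
    (0x40, "UNC"),   # uncorrectable data
    (0x20, "MC"),    # media changed
    (0x10, "IDNF"),  # ID (sector address) not found
    (0x08, "MCR"),   # media change request
    (0x04, "ABRT"),  # command aborted
    (0x02, "NM"),    # no media (TK0NF in older specs)
    (0x01, "AMNF"),  # address mark not found (obsolete)
]

# ATA status register bits, highest bit first.
STATUS_BITS = [
    (0x80, "BSY"), (0x40, "DRDY"), (0x20, "DF"), (0x10, "DSC"),
    (0x08, "DRQ"), (0x04, "CORR"), (0x02, "IDX"), (0x01, "ERR"),
]


def decode(value, table):
    """Return the names of all bits set in value."""
    return [name for mask, name in table if value & mask]


if __name__ == "__main__":
    print(decode(0x51, STATUS_BITS))  # status from the kernel message
    print(decode(0x10, ERROR_BITS))   # error from the kernel message
    print(decode(0x84, ERROR_BITS))   # what "ICRC, ABRT" would look like
```

If the hex assumption holds, the kernel’s error=10 decodes to IDNF, which is not the same thing as the ICRC, ABRT entries sitting in the SMART log. That would fit with the SMART error counter not moving when the box locks up.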
Anyway, I’m kind of frustrated at this point. Not only am I basically down a spare motherboard, but this fileserver (which I have 2x 500GB Seagates in the mail for) might not be stable by next month, when I plan to have everything racked up and serving…
pffft.
UPDATE: I tried the HDD in another machine, and unsurprisingly, it worked fine. I was about to give up on this issue and try to get that extra motherboard to POST when I had the idea of booting the HDD-issue board with only one CPU. Surprise surprise, it booted with one CPU and is now csup’ing the source tree, something that used to make the system lock up almost immediately.
THE PLOT THICKENS. Thankfully I have like 2084 spare P3s to test with, to see if this is a hardware issue…
UPDATE: nm, it still broke: ad0: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA= etc. It just took a lot longer this time. I guess SMP just aggravates the issue.
UPDATE: Okay, so I wanted to test the CPUs on the dead extra mobo so I was taking the CPUs out of the broken controller one when all of a sudden one of the heatsink stubs on the socket fucking BROKE OFF. This had actually happened before on this board, but it was on the other socket. So I got out my superglue and repaired the second of two broken sockets on the board… sheesh.
So instead, I took the two CPUs from the broken controller board and stuck ‘em into the dead extra motherboard and… IT BOOTED!!! I’ve got it rigged up with the HDD and shit, running csup now as a stress test (which, if it succeeds, I’ll follow up with a make buildworld…). HOORAY.
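For a more repeatable stress test than waiting on csup, something like this would hammer the disk with plain sequential reads. It’s a minimal sketch: read_load is a made-up helper, and the 131072-byte chunk size is just borrowed from the length in the g_vfs_done() error line, not a magic number.

```python
CHUNK = 131072  # same request length as in the g_vfs_done() error line


def read_load(path, chunk=CHUNK):
    """Read a file start to finish in fixed-size chunks; return bytes read."""
    total = 0
    # buffering=0 so Python doesn't add its own buffer layer on top.
    with open(path, "rb", buffering=0) as f:
        while True:
            data = f.read(chunk)
            if not data:
                break
            total += len(data)
    return total
```

Point it at a file bigger than RAM, e.g. read_load("/path/to/some/huge/file") in a loop; otherwise the second pass mostly comes out of the OS cache and never touches the disk.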
tl;dr:
TODAY’S CASUALTIES
- Tyan 2515 Motherboard: IDE Controller is dead.
- Tyan 2515 Motherboard: Broken heatsink stub (repairing).
- 2 Pentium III CPUs: Presumed dead, pending further tests.
UPDATE: shit nm it crashed. It’s either the CPU (since I took the ones from the broken-controller-mobo), FreeBSD, or the HDD. I’ll probably swap the CPUs after I recover from my drenched elation, then try swapping the HDD when the new ones arrive. I have a SCSI drive coming and 2 fat PATA’s, gonna have the SCSI machine running as the master + netbooting the other machines (only one “other” right now) which will all have nice fat HDDs to save shit on. Sigh.