Tiger MPs suck, or Why the hell do all kernels freeze except one?

Barnaby W. Füi

Elite Member
Aug 14, 2001
12,343
0
0
I'm kind of worn out from spending hours dealing with this, and I'm about as frustrated as a person could possibly be. So I'll keep this somewhat short and sweet and add more info later if needed.

Basically, Linux freezes on my Tiger MP. Knoppix freezes, the Gentoo live eval CD freezes, and every single one of my 7 (so far) kernels freezes, except ONE. And go figure, this magical kernel is one I built myself a while back and no longer have the .config for. It's a 2.4.20.

I have tried everything with the other kernels: changing video cards, removing every single piece of hardware possible (except the CD drive, for booting Knoppix), and I've also removed the CD drive for booting the others. I've tried booting with noacpi, noapic, apm=off, and different combinations; nothing. I've tried turning off ACPI in the BIOS; nothing. My latest bright idea was to compile the kernel for only i586 instead of for k7, but it still froze. Oh, and not all of these kernels are custom; I've done lots of the testing on official Debian kernels, and they freeze too.

I'm 99% sure it's the motherboard, and I think it's not necessarily a pure hardware problem, but a hardware problem that somehow can be avoided with the right software (i.e. my magical kernel). This magical kernel has had uptime of something like 40 or 60+ days. If the memory is bad, then I sure would be amazed, since that kernel runs rock solid for that long with NO problems at *all*. It's all ECC memory too. I think I've actually gone past the end of my rope, since I've just decided that I'm going to mess with this for as long as it takes to get it working, just to get revenge on the POS.

Sooo, if anyone has any ideas, I'm all ears...

edit: and my next machine will be all intel, budget and principles be damned ;)
 

Chaotic42

Lifer
Jun 15, 2001
34,852
2,020
126
Check your ATX power connector. It might be getting crispy. There is an operation you can perform, if this is the problem, but it involves soldering.

I have a Tiger MP sitting right here. It just needs the surgery.
 

Barnaby W. Füi

Originally posted by: Sunner
Tried one of the BSD's?

I used to run NetBSD on it, but switched to Debian when I got the second CPU, since NetBSD's SMP wasn't worthy of use at that point. And honestly, I'm not sure if I have the heart to keep using NetBSD at all. Apt is sooooo much more convenient than pkgsrc (or even FreeBSD ports), and I just know Debian well and overall I feel very comfortable with it. I like NetBSD a lot too, but it's just not quite as nimble.

Originally posted by: Chaotic42
Check your ATX power connector. It might be getting crispy. There is an operation you can perform, if this is the problem, but it involves soldering.

I have a Tiger MP sitting right here. It just needs the surgery.

But if it were strictly a hardware problem, why can one kernel run fine for 60 days, while others lock up in minutes?
 

n0cmonkey

Elite Member
Jun 10, 2001
42,936
1
0
BIOS update maybe? That board isn't the greatest, and I think there were several issues with it at one point or another. Mine doesn't like ECC ram. Locks up on boot if I have ECC enabled (yes, I let it sit for hours). I keep meaning to check for a BIOS update to see if that helps. Have you tried a 2.6 kernel?
 

Nothinman

Elite Member
Sep 14, 2001
30,672
0
0
BingBongWongFooey: strange but I've noticed a similar thing happening to me. I have 2.4.21 that I compiled myself many moons ago and it runs great but if I compile a new 2.4.22 kernel bad things happen. Perhaps it's a compiler or linker bug in the versions in sid? I was running a 2.6 kernel and I think it ran ok too, can't remember 100% cause things have been a little hectic here lately.

Originally posted by: CTho9305
These people make a kernel which will work on your board.

Too bad their kernel won't run any of the software we want.
 

drag

Elite Member
Jul 4, 2002
8,708
0
0
Maybe you can try comparing the outputs of dmesg between the working kernel and one that causes lock-ups. Maybe that can give you an idea of what resource you're utilizing differently. Providing of course you can get that far. :(

Maybe try disabling smp support at compile time or disable one of the cpus and see if that can make a difference.

Maybe if you can't get a 2.4 series kernel working or a 2.6, try going back to a 2.2 series.

Maybe try underclocking it...
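
For the dmesg comparison suggested at the top of this post, here is a minimal sketch of one way to diff two saved captures. It assumes each kernel's output was saved to a text file first (for example with `dmesg > good.txt`); the file names and the function names `normalize` and `dmesg_diff` are made up for illustration. It only reports lines that show up in one boot log and not the other, so it won't help if the freeze hits before anything relevant gets logged.

```python
import difflib
import re

# Some kernels prefix each line with a "[   12.345678] " timestamp;
# strip it so identical messages compare equal.
_TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def normalize(lines):
    """Strip timestamps and surrounding whitespace, and drop empty lines."""
    cleaned = []
    for line in lines:
        line = _TIMESTAMP.sub("", line.strip())
        if line:
            cleaned.append(line)
    return cleaned


def dmesg_diff(good_lines, bad_lines):
    """Return lines unique to either capture, prefixed '- ' (good only) or '+ ' (bad only)."""
    delta = difflib.ndiff(normalize(good_lines), normalize(bad_lines))
    return [d for d in delta if d.startswith(("- ", "+ "))]


# Usage (hypothetical file names):
#   with open("good.txt") as g, open("bad.txt") as b:
#       for line in dmesg_diff(g.readlines(), b.readlines()):
#           print(line)
```

Lines that differ only in things like calibration numbers will still show up, so expect some noise to skim past.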
 

sciencewhiz

Diamond Member
Jun 30, 2000
5,885
8
81
Originally posted by: Nothinman
Perhaps it's a compiler or linker bug in the versions in sid?

That's what I was going to suggest. It's still suggested to compile the kernel with gcc 2.95.

Also, have you tried one of the Debian binary kernels? If not, then I'd try kernel-image-2.4-386 (no, it's not SMP nor optimized, but it's a good base to start with).
 

n0cmonkey

Compiling a uniprocessor kernel is an interesting idea. Might help limit the possibilities.
 

Haden

Senior member
Nov 21, 2001
578
0
0
That 2.4.20 magic kernel, you tried it recently I assume (maybe some hw broke after that uptime)?
Maybe try taking one CPU off the mobo; if it works, at least you'll know it's probably an SMP problem.
 

Barnaby W. Füi

Originally posted by: n0cmonkey
BIOS update maybe? That board isn't the greatest, and I think there were several issues with it at one point or another. Mine doesn't like ECC ram. Locks up on boot if I have ECC enabled (yes, I let it sit for hours). I keep meaning to check for a BIOS update to see if that helps. Have you tried a 2.6 kernel?

I have the BIOS updated to the latest, and I couldn't get a 2.6 kernel going since it needs (?) devfs and I'm not running that, so the 2.6 kernel I have just barfs while booting, something related to /dev. ECC is a good idea, I will try messing with the ECC settings and see if it's related to that.

Originally posted by: CTho9305
These people make a kernel which will work on your board. They actually provide a LOT of binaries.

Heh, I've owned two kt133a motherboards and windows (2000) never ran without freezing/BSODs on them, yet they ran linux and netbsd perfectly. ;)

Originally posted by: Nothinman
BingBongWongFooey: strange but I've noticed a similar thing happening to me. I have 2.4.21 that I compiled myself many moons ago and it runs great but if I compile a new 2.4.22 kernel bad things happen. Perhaps it's a compiler or linker bug in the versions in sid? I was running a 2.6 kernel and I think it ran ok too, can't remember 100% cause things have been a little hectic here lately.

That's a good idea too, I can install 2.95 and see if compiling with that makes a difference.

Originally posted by: drag
Maybe you can try comparing the outputs of dmesg between the working kernel and one that causes lock-ups. Maybe that can give you an idea of what resource you're utilizing differently. Providing of course you can get that far.

Maybe try disabling smp support at compile time or disable one of the cpus and see if that can make a difference.

I actually have been comparing dmesges, but I couldn't seem to find anything that stood out to me. Disabling smp is an idea, and it has crossed my mind, but I kinda figured, "why would it matter, since I don't want to run with one cpu"? But I should probably give it a try regardless.


Originally posted by: sciencewhiz
Also, have you tried one of the debian binary kernels? if not, then I'd try kernel-image-2.4-386 (no it's not smp nor optimized, but it's a good base to start with)

Yeah, the kernel I've been using the most to test is from the debian 2.4.21-k7-smp kernel. I did try lowering to i586 (my own custom kernel anyways), but trying just 386 is an idea, as well as trying with an official kernel instead of building one.

Originally posted by: Haden
That 2.4.20 magic kernel, you tried it recently I assume (maybe some hw broke after that uptime)?

Yeah I'm running it right now, this is my main desktop machine so in between testing I end up booting this kernel so I can actually use the thing for a while.
 

Barnaby W. Füi

Mess with ECC settings: did it, still froze
gcc 2.95: still froze
disable smp: still froze
try (official debian) i386-compiled kernel: still froze

:|

edit: Poking around in kern.log and I see that in fact it did NOT compile with gcc 2.95. What the hell do I need to do to tell it to use gcc 2.95? Setting $CC seemed to do nothing, so I just temporarily changed the symlink in /usr/bin myself, yet kern.log has no record of a kernel that was compiled with 2.95. Something interesting though: my one good kernel was compiled with gcc 3.2.3, while all others that I've been testing with thus far were compiled with 3.3.1. Hmm.....
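
A likely reason setting $CC did nothing: a variable assigned inside a Makefile takes precedence over one inherited from the environment, unless make is run with -e. To confirm which compiler actually built a given kernel, the boot banner (the "Linux version ..." line in /proc/version, which normally also shows up near the top of the boot log) names it. The sketch below pulls the kernel and gcc versions out of such a line. The sample banner, its host name, and the function names are hypothetical, and the pattern assumes the usual "(gcc version X.Y.Z ...)" wording.

```python
import re

# Matches e.g. "Linux version 2.4.20 (user@host) (gcc version 3.2.3) #1 SMP ..."
_BANNER = re.compile(r"Linux version (\S+) .*?\(gcc version (\d+(?:\.\d+)+)")


def kernel_and_gcc(banner):
    """Return (kernel_version, gcc_version) from a boot banner, or None if it doesn't match."""
    match = _BANNER.search(banner)
    if match is None:
        return None
    return match.group(1), match.group(2)


def running_kernel_banner(path="/proc/version"):
    """Read the banner of the currently running kernel (Linux only)."""
    with open(path) as handle:
        return handle.read().strip()


# Hypothetical banner, shaped like the one a 2.4 kernel prints.
sample = "Linux version 2.4.20 (root@box) (gcc version 3.2.3) #1 SMP Sat Jun 7 12:00:00 EDT 2003"
print(kernel_and_gcc(sample))  # prints ('2.4.20', '3.2.3')
```

Running `kernel_and_gcc(running_kernel_banner())` right after booting a freshly built kernel is a quick way to check whether the compiler switch took effect.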
 

chsh1ca

Golden Member
Feb 17, 2003
1,179
0
0
Umm, that's odd. I had a network services box up on a Tyan Tiger MP + dual Athlon MP 1800+s and 1GB of ram that had no problems at all. It hosted some LDAP + DNS, and was under quite a bit of load most of the time, and never had a problem with it. It was completely rock solid. One thing though, IME DP systems REQUIRE fantastic power supplies. Any slight variance in the current seems to make them crap themselves. I know this because we had an issue with said box that caused it to crash shortly after boot, and it turned out that the power supply was only putting out ~ 11V on the 12V rail, and ~6.7V on the 7V rail. It wasn't a lock up, and would restart okay, it would just report a kernel panic for little reason. It was running Slack 8.0, with a 2.4.20 kernel (updated to patch for a vuln) and had no problems.

I'd start with the power supply. It could be that some combination of options didn't enable something in the kernel that would use more power (ie: DMA on your drives). Check it out. Also, I'd try compiling a 2.6 kernel to see if that cures your problem. They ARE labeled 'test' kernels, but I've been running 2.6.0-test8 since the day after it was released, and NO problems whatsoever. They improved a lot of thread management and SMP stuff, so I'm thinking it may be that they may have improved something that could help your situation out.
 

Barnaby W. Füi

Originally posted by: chsh1ca
I'd start with the power supply. It could be that some combination of options didn't enable something in the kernel that would use more power (ie: DMA on your drives). Check it out.

I honestly don't see how power could be a problem; Running the cpus at 100% vs. near 0% would make a much bigger difference in power usage, yet I can run them both at 100% with total stability, on my good kernel. Also, removing sound/network cards, and either cdrom or hard drive (not both at the same time, since it's a little hard to boot then ;)) made no difference as far as which kernels crashed.

I might try 2.6, but I'll have to figure out what I need to do with respect to /dev.

edit: found this nugget in #debian:

(/msg dpkg 2.4bug)

extra, extra, read all about it, 2.4bug is look at http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=194287 as it says there, gcc-3.3
cannot compile 2.4.x kernels (x <= 20), apt-get install gcc-3.2 or gcc-2.95 (preferred) then edit the top level Makefile before you
make dep with: perl -pi.bak -e 's/gcc/gcc-2.95/' Makefile, or gcc-3.3 compiles 2.4.21-rc3 and up just fine. The latest
kernel-source-2.4.20 packages are buildable with gcc 3.3
 

Barnaby W. Füi

Originally posted by: chsh1ca
I take it you're using GCC 3.3?

Yep. I have 3.2 and 3.3 (well, and 2.95, now) but it defaults to 3.3.

edit: the 2.95 kernel froze. :(

Going to build a 2.6 kernel and see if I can get it running, after that I think I'm done for today. The crap gets old fast. :|

Awesome, it doesn't even boot. It's funny, this whole thing is one big catch-22. I am constantly tempted to just buy a (NON-tyan) MPX board, but as I keep dumping more time into diagnosing this one, I'm hurting the time/money case towards buying a new board, i.e. I've already spent so much time on this thing, I might as well see it through to the end. Ugggh this blows.
 

drag

Cool. I never really liked devfs a whole lot.

It was nice not to have a half billion different /dev/ files I'll never use, but devfs seemed just to make things too complicated.
 

Chaotic42

I'd still check the power connector. My problems started with weird random things in Linux. It progressively got worse. It's worth a shot anyway.
 

Sunner

Elite Member
Oct 9, 1999
11,641
0
76
Originally posted by: Chaotic42
I'd still check the power connector. My problems started with weird random things in Linux. It progressively got worse. It's worth a shot anyway.



Funny you should mention that, happened to one of my workstations at work a while back.
Crashed every now and then, and I couldn't figure out why for the life of me. It got worse, and eventually, as a last-ditch effort, I checked the PSU cable, and sure enough it wasn't properly connected. It was off by a mere mm or so, but obviously enough.
 

Chaotic42

Originally posted by: Sunner

Funny you should mention that, happened to one of my workstations at work a while back.
Crashed every now and then, and I couldn't figure out why for the life of me. It got worse, and eventually, as a last-ditch effort, I checked the PSU cable, and sure enough it wasn't properly connected. It was off by a mere mm or so, but obviously enough.

Yep. The Tiger MP's connector is just barely able to handle the 5V demands of the system. My problems started when I added my Radeon 9700. Strange, because it has its own connector, but it happened anyway.
 

Barnaby W. Füi

Originally posted by: Chaotic42
Originally posted by: Sunner

Funny you should mention that, happened to one of my workstations at work a while back.
Crashed every now and then, and I couldn't figure out why for the life of me. It got worse, and eventually, as a last-ditch effort, I checked the PSU cable, and sure enough it wasn't properly connected. It was off by a mere mm or so, but obviously enough.

Yep. The Tiger MP's connector is just barely able to handle the 5V demands of the system. My problems started when I added my Radeon 9700. Strange, because it has its own connector, but it happened anyway.

What all is in your system? I don't think mine should be overloaded, I have:

2 duron 1.3's
matrox g400
sb16
nic
barracuda IV
32x burner

But I'll check it out sometime today. I just hate turning my machine off. :p I haven't gotten much done recently because of all of the rebooting :(
 

drag

Can't you just pop the side off of the computer and then use a voltmeter on the power plug? If the voltages are too low then that would be your problem.


Just FYI:

ATX power supply/color codes.

1......11
2......12
3......13
4......14
5......15..clamp side
6......16
7......17
8......18
9......19
10....20

pins 1, 2 and 11 are orange and should = 3.3v

12 is blue and is -12v

3, 5, 7, 13, 15, 16, and 17 are all black and are ground.

4, 6, 19, 20 are red and those are equal to 5v

Pin 8 is gray and is Power Good or p_ok. Don't know what exactly it does, but I could guess.

Pin 9 is purple and is 5v. The only difference is that it is ALWAYS ON. It provides power for wake on lan and stuff like that. So even if your computer is turned off, it always has 5v going thru it. Which is why it's a bad idea to pull cards and stick your fingers and bolts and stuff all over your main board while the power supply switch is on or it's plugged into the wall.

Pin 14 is power-on and is green.

Pin 18 is white and is -5v.

Pin 10 is yellow and is +12v.

</FYI>

I suppose you can plug the leads from the voltmeter into a spare plug (for a HD or something), watch the voltages, try to get it to crash or whatever, and see if it drops below a certain voltage.

Turn the multimeter to volts, make sure it is set to DC operation, and pull out one extra plug. Take the - (ground) probe and jam that into the black wire's part of the plug, then test the other wires (red=+5v, and yellow=+12v).

Put an extra HD in there, run the CD drive, some I/O-intensive tests, and some CPU benchmarking stuff. Try to get as high a load as possible on your computer's power supply; the voltage shouldn't vary by more than 5% from what it should be.

A high-quality power supply shouldn't vary much, if at all, when running different loads.

Of course, be careful, I don't want you to fry your computer!


Or maybe it would be easier to set up the on-board sensors and just log those every 5-10 seconds or so, but that won't be as accurate. :p








 

Barnaby W. Füi

I have a multimeter somewhere but I don't feel like cramming myself into the case while it's running. I remember seeing in the bios that the 5v line was something like 4.7v. But I still don't see how it could really be the problem, if one kernel still runs flawlessly...
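
To put that 4.7v reading next to the 5% rule of thumb mentioned earlier in the thread, here is a small worked check. The function names are made up, the 5% default is taken from the earlier post rather than from any spec sheet, and a BIOS reading remembered as "something like 4.7v" is only approximate, so treat the result as a reason to measure properly, not as a verdict on the power supply.

```python
def deviation_percent(nominal, measured):
    """Signed deviation of a measured voltage from its nominal value, in percent."""
    return (measured - nominal) / nominal * 100.0


def within_tolerance(nominal, measured, tolerance_percent=5.0):
    """True if the measured voltage is within +/- tolerance_percent of nominal."""
    return abs(deviation_percent(nominal, measured)) <= tolerance_percent


# The reading mentioned in the thread: roughly 4.7 V on the 5 V line.
print(round(deviation_percent(5.0, 4.7), 1))   # -6.0
print(within_tolerance(5.0, 4.7))              # False
```

A 5% band around 5 V runs from 4.75 V to 5.25 V, so 4.7 V sits just outside it. That is close enough to the edge that a real measurement under load seems worth the trouble.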