Brand new CentOS 5 install goes to crap

Red Squirrel

No Lifer
May 24, 2003
70,204
13,591
126
www.anyf.ca
I installed CentOS 5 on a Maxtor (OK, maybe that is my problem, but I doubt the drive is bad; it was running fine before this) and have 3 Seagates as a blank RAID 5.

When I booted up for the very first time, it just threw me into a grub> shell, which is completely useless. I can't even boot into the OS. I think it's because the Maxtor is not the primary drive but /boot is being treated as if it were, so when it starts loading the OS it looks on the primary drive, which is part of the RAID array.

How can I get around this?
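For reference when poking at that grub> prompt: GRUB legacy names disks in the order the BIOS reports them, counting from 0, and also counts partitions from 0, so the same Linux partition can get a different GRUB name depending on BIOS drive order. A minimal Python sketch of that mapping, assuming you supply the BIOS order yourself (the two orders below are hypothetical, using the device names that show up later in this thread):

```python
import re


def grub_legacy_name(partition: str, bios_order: list[str]) -> str:
    """Translate a Linux partition name like 'hde1' into GRUB legacy '(hdN,M)'.

    bios_order is the list of whole disks in the order the BIOS presents them
    (an assumption you have to supply; it is not read from the system).
    """
    m = re.fullmatch(r"([a-z]+?)(\d+)", partition)
    if m is None:
        raise ValueError(f"not a partition name: {partition!r}")
    disk, number = m.group(1), int(m.group(2))
    # GRUB legacy counts disks and partitions from 0; Linux counts partitions from 1.
    return f"(hd{bios_order.index(disk)},{number - 1})"


# Hypothetical BIOS orders: onboard disks first vs. add-in card disks first.
print(grub_legacy_name("hde1", ["hda", "hdc", "hde", "hdg"]))  # (hd2,0)
print(grub_legacy_name("hde1", ["hde", "hdg", "hda", "hdc"]))  # (hd0,0)
```

At the grub> prompt itself, GRUB legacy's `find /grub/stage1` (when /boot is its own partition) prints the (hdN,M) name GRUB actually sees for it.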


edit:

Tried booting with an ubuntu CD to see how far I can get. It starts loading then I get this error:


*** glibc detected *** double free or corruption (!prev): 0x0806c0d0 ***
[17179708.812000] drivers/usb/input/hid-core.c: can't resubmit intr, 0000:00:100-1.1.1/input1. status -19.



I'm guessing bad RAM, but that server is too loud (it's right in my room) for me to let it run memtest overnight, so before I do that... could it be something else?
 

QuixoticOne

Golden Member
Nov 4, 2005
1,855
0
0
Usually I don't make too many assumptions about the problem when I can't even get something installed right.

It could be a buggy BIOS you have.
It could be certain BIOS settings you have that it doesn't like.
It could be an issue of compatibility with the kernel drivers and your chipset or HDDs.
It could be some inappropriate / unwise types of user configurations of the kernel or the storage.

Try updating the motherboard BIOS to the latest stable version.

Record your values, then reset BIOS settings to factory defaults, then try to change or enable only very simple / safe / almost sure to work settings and see if that helps.

Be careful with the whole RAID thing. It might be best to install on a non-RAID OS disc, then once that is installed and all online updates are applied, try manually configuring an array and spend a couple of days of spare time intermittently stress-testing it to ensure it is stable and performing well as a data array, without the OS having to boot from it.

Be careful of the UDMA type and SATA transfer rate settings in the BIOS, some drives / cables / controllers get much more flaky at fast DMA rates. Try forcing all UDMA modes to OFF and using slow PIO 4 mode only on all the drives and see what happens.

Investigate whether the MODE of the drive controller / set is better off as AHCI or IDE or JBOD or motherboard chipset based RAID (I'd avoid the latter).

Try disabling ACPI in the BIOS and see if that helps.

Try kernel arguments from grub like
noacpi acpi=off ide=nodma noapic
and see if that helps stability at all.
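In GRUB legacy you can try arguments like these for a single boot by pressing `e` on the menu entry, editing the `kernel` line, and pressing `b` to boot it; to make them stick they go on the `kernel` line in menu.lst. A small Python sketch of that edit, assuming a hypothetical CentOS 5 kernel line; it only appends arguments that aren't already there, and uses `acpi=off` for the ACPI case:

```python
def add_kernel_args(kernel_line: str, extra: list[str]) -> str:
    """Append boot arguments to a GRUB legacy 'kernel' line, skipping duplicates."""
    parts = kernel_line.split()
    if not parts or parts[0] != "kernel":
        raise ValueError("expected a GRUB legacy 'kernel' line")
    for arg in extra:
        if arg not in parts:
            parts.append(arg)
    return " ".join(parts)


# Hypothetical CentOS 5 kernel line; the extra arguments come from the post above.
line = "kernel /vmlinuz-2.6.18-8.el5 ro root=LABEL=/ rhgb quiet"
print(add_kernel_args(line, ["acpi=off", "noapic", "ide=nodma"]))
```

Skipping duplicates keeps the edit safe to repeat while you try different combinations.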

Until you get the most recent kernel on there I'd just assume you could have buggy drivers and hope that simply updating might fix things.

If you really still can't get it figured out after asking around on the CentOS forums, you could always try Fedora 8 on the box and see if it works any better with a comparable configuration. It's related software, but Fedora tends to use newer versions of kernels / drivers / packages. If that works much better, then it's a question of kernel / package versions, kernel boot parameters, kernel configuration options, or similar.

If you have a spare machine, you might even see if you can set up and update CentOS on the other machine, then transfer the discs and the few relevant setting changes over to the target system and see if it'll behave / boot better once it starts from a stable "known good" configuration.

 

QuixoticOne

PS you should check your RAID configuration whether it is using DMRAID (and, if so, what configuration) or /dev/md based RAID.

IMHO DMRAID is "in theory" a better solution, but it seems pretty half-baked in compatibility with various motherboard BIOS and chipset issues. Turning that off and using MD-based RAID may be a pain in the butt but easier to get working, since you'll be more dissociated from the BIOS / chipset drive and RAID stuff.

Sometimes DMRAID even claims your drives when you don't want it to, and that can interfere with starting an MD-based RAID because MD can't get control of the devices.
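A quick way to see whether md owns the disks is /proc/mdstat: arrays assembled by md are listed there, while dmraid sets show up as device-mapper nodes under /dev/mapper instead. The Python sketch below parses mdstat text for the common `mdN : active raidX member[slot] ...` line form only; the sample text is the mdstat output posted later in this thread:

```python
import re


def md_arrays(mdstat_text: str) -> dict:
    """Map each md array in /proc/mdstat text to its state, level and members.

    Only handles the common 'mdN : active raidX dev[slot] ...' form.
    """
    arrays = {}
    for line in mdstat_text.splitlines():
        m = re.match(r"^(md\d+)\s*:\s*(\S+)\s+(raid\d+)\s+(.*)$", line)
        if m:
            name, state, level, rest = m.groups()
            # Members look like 'hdg1[3]' or 'hda1[1](F)'; keep just the device name.
            members = [re.sub(r"\[\d+\].*$", "", tok) for tok in rest.split()]
            arrays[name] = {"state": state, "level": level, "members": members}
    return arrays


sample = """Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 hdg1[3] hda1[1] hdc1[0]
      976767872 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]

unused devices: <none>
"""
print(md_arrays(sample))
# On a real box: md_arrays(open("/proc/mdstat").read())
```

An empty result on a machine where you expected an array is a hint that md never got the disks.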

Also consider enabling serial console output and hooking up a PC over a null modem cable to capture the exact kernel boot log or whatever grub messages you may get. Take off "quiet" or "rhgb" or similar kernel options so you get the most output.
Run through that looking for device names, etc.
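On the GRUB side, GRUB legacy can mirror its own menu to the serial port with `serial --unit=0 --speed=115200` and `terminal serial console` lines in menu.lst. On the kernel side, `console=ttyS0,115200` sends kernel messages to the first serial port; when several `console=` options are given, messages go to all of them and the last one becomes /dev/console. A Python sketch of the kernel-line edit, assuming ttyS0 at 115200 baud and a hypothetical kernel line:

```python
def verbose_serial_kernel_line(kernel_line: str, port: str = "ttyS0", baud: int = 115200) -> str:
    """Drop 'quiet'/'rhgb' and add console= arguments for a serial capture."""
    parts = [p for p in kernel_line.split() if p not in ("quiet", "rhgb")]
    # Kernel messages go to every console= listed; the last one (the serial
    # port here) also becomes /dev/console.
    for arg in ("console=tty0", f"console={port},{baud}"):
        if arg not in parts:
            parts.append(arg)
    return " ".join(parts)


line = "kernel /vmlinuz-2.6.18-8.el5 ro root=LABEL=/ rhgb quiet"
print(verbose_serial_kernel_line(line))
```

Keeping `console=tty0` as well means the local screen still shows the boot messages while the null-modem PC logs them.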

Look at /boot/grub/menu.lst and see what devices it refers to for root.
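There are two different "roots" to check in menu.lst: the `root (hdN,M)` line tells GRUB which partition holds the kernel and initrd, while `root=` on the `kernel` line tells the kernel where the root filesystem is. A Python sketch that pulls both out of each entry; the sample menu.lst is hypothetical:

```python
import re


def boot_roots(menu_lst: str) -> list:
    """Return (title, grub_root, kernel_root) for each GRUB legacy menu entry."""
    entries = []
    title = grub_root = kernel_root = None
    for raw in menu_lst.splitlines():
        line = raw.strip()
        if line.startswith("title "):
            if title is not None:
                entries.append((title, grub_root, kernel_root))
            title, grub_root, kernel_root = line[len("title "):].strip(), None, None
        elif title is not None and line.startswith("root "):
            # GRUB's own root: the partition holding the kernel and initrd.
            grub_root = line.split(None, 1)[1]
        elif title is not None and line.startswith("kernel "):
            # The kernel's root= argument: where the root filesystem lives.
            m = re.search(r"\broot=(\S+)", line)
            kernel_root = m.group(1) if m else None
    if title is not None:
        entries.append((title, grub_root, kernel_root))
    return entries


sample = """default=0
timeout=5
title CentOS (2.6.18-8.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-8.el5 ro root=LABEL=/ rhgb quiet
        initrd /initrd-2.6.18-8.el5.img
"""
print(boot_roots(sample))
```

If `root (hd0,0)` points at a disk that the BIOS now lists first but that is really a RAID member, GRUB won't find its files there.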

Get yourself a Unix LiveCD or whatever, use it to boot, mount the installed system's /boot and /, and see if anything in
/var/log/messages, /var/log/dmesg, /dev, /boot, /, /etc/fstab or /etc/sysconfig looks like it won't work right with respect to device mounting and naming.
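In /etc/fstab, entries that name a raw device such as /dev/hde3 depend on drive naming staying the same, while LABEL= and UUID= entries are looked up by filesystem label or UUID and keep working if the drives get renamed. A Python sketch that flags the raw-device entries; the sample fstab is hypothetical:

```python
import re


def raw_device_entries(fstab_text: str) -> list:
    """Return (device, mount point) pairs that name an hdX/sdX device directly."""
    found = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # LABEL= and UUID= entries are matched by filesystem label/UUID,
        # so they keep working if the disk's device name changes.
        if len(fields) >= 2 and re.match(r"/dev/[hs]d[a-z]", fields[0]):
            found.append((fields[0], fields[1]))
    return found


sample = """LABEL=/         /        ext3   defaults  1 1
LABEL=/boot     /boot    ext3   defaults  1 2
/dev/md0        /data    ext3   defaults  1 2
/dev/hde3       swap     swap   defaults  0 0
proc            /proc    proc   defaults  0 0
"""
print(raw_device_entries(sample))
```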

 

Red Squirrel

Yeah, the OS drive is standalone. I also tried a live CD (Ubuntu) but got the errors I posted. I'm guessing it's a hardware issue, so I'll try updating the BIOS; if that doesn't work I'll have to buy a new motherboard/RAM/CPU combo. This was supposed to be a budget system but I've already put 500 bucks into it LOL. Sadly they're not even SATA drives lol.

Another thing: I have them set to cable select. Could that pose issues? They should all be masters, as each is on its own channel.
 

QuixoticOne

Yeah, cable select can cause problems: it is possible to install the cable backwards (the motherboard connector on the drive) or to plug the drive into the wrong one of the two drive connectors, and both of those can cause problems.

Make sure the BIOS thinks they're all primary masters anyway as a start.

This:
*** glibc detected *** double free or corruption (!prev): 0x0806c0d0 ***
[17179708.812000] drivers/usb/input/hid-core.c: can't resubmit intr, 0000:00:100-1.1.1/input1. status -19.

sounds like a USB driver bug... if you don't need USB or firewire, turn off your USB EHCI/OHCI/USB controller in the BIOS.

Set "Plug and Play OS?" to NO in the BIOS; if it is already NO, try YES just once or twice before returning to NO, just to see if something changes.

I think it is drastic to replace the motherboard. Usually I find there is some workaround to get UNIX working on systems like this. Sometimes it isn't obvious what it is, but it's usually something VERY simple in the end, like adding noapic or whatever to the grub line, or updating the kernel.

You can see if FreeBSD 7.0 likes your setup as another test.

I'd focus on BIOS settings, grub / kernel boot option arguments, and kernel / driver version stuff. I'd bet lunch for a GNU that it's not a hardware problem that isn't soluble in simple soft-configuration.

Does the system work better when you unplug the RAID drives and just install the OS on the OS drive?

I think my last LINUX install problem was relating to USB EHCI/High Speed Hand Off settings in the BIOS....


Red Squirrel

I sort of need USB, as that's how the KVM plugs in, and I've used it before with Fedora Core 7 with no issues. Once I get home I'll try unplugging the RAID drives, then plugging them back in with all of them set to master, and see if that solves the issue.
 

QuixoticOne

Actually thinking about your USB error message and what I said before:
"I think my last LINUX install problem was relating to USB EHCI/High Speed Hand Off settings in the BIOS.... "
...it might be the same problem.

Change the EHCI/UHCI hand-off option in the BIOS, or disable the EHCI / USB 2.0 controller and leave it in USB 1.1 mode. Or plug all your USB stuff into a USB 1.1-only hub so it won't try to use USB 2.0, and see if that helps.


 

Brazen

Diamond Member
Jul 14, 2000
4,259
0
0
What kind of mobo do you have? Sometimes simply googling the motherboard model plus "linux" turns up known compatibility problems.
 

Red Squirrel

Tried booting without the RAID: still screwed. Tried setting the OS drive (non-RAID) to primary master (it was secondary master): still screwed. Tried messing with the cable order: still screwed. Now it won't boot at all, even with everything back the way it was; it stays stuck at "detecting primary master". My guess is the mobo is fubar and I'll have to buy a new one. :/ This really sucks. I'm trying to save money for a house and this "budget server" is starting to cost me more than if I had built one from scratch.

The mobo is a KT600.

Another possible source of issues is the PCI IDE card. I think that at boot time the controllers are in a certain order, but once it gets past the BIOS the order is swapped and the card becomes primary over the mobo IDE ports. Could this be? If so, is there a way to force the card's channels to be the 3rd and 4th IDE channels?

Oh, and there's nothing in the BIOS referring to EHCI/UHCI to change.
 

QuixoticOne

Double check all the drive power/data cabling and seating of the disk controller card etc.
Then go reset the BIOS back to factory defaults, go in to the HDD settings and do AUTO detect on your various IDE channels, make sure it finds the expected drives on the expected ports.

Yes, the settings of the BIOS (if any) on your add-in disk controller may well make a difference as to whether it is probed before or after your MB-attached drive ports. Your MB BIOS may also have settings for whether to use the add-in card's controller BIOS and what drive order to assign.

I doubt the MB is fubar; it is probably just a cruddy BIOS and cruddy OS boot loader / kernel settings interacting badly. Update the BIOS in everything if possible.

Anyway if I understand you correctly you have a kernel panic on boot with USB related problems even without the RAID being an issue. Just unplug the RAID drives and leave only the OS drive and try to get the system installed/booting/updated/working.

Seriously, unplug the KVM and all the USB stuff, turn off the USB controller as much as possible, and use a PS/2 keyboard / mouse to install CentOS and see if you can get past the boot problems.

noapic noacpi ide=nodma
are kernel arguments I'd try for the installer and or installed system, possibly others that can help with your USB issues relating to disabling USB probing / EHCI / HID etc.

If you MUST use USB then plug the KVM and stuff in through a 1.1 only HUB and see if that helps but personally I'd use PS/2 until my kernel and udev and hid stuff was updated.

The order of various drives does tend to get swapped depending on the BIOS boot order and device order and BIOS enable/boot priority options. Don't worry too much about that until it is your main problem which I doubt it is since you can't even get it working without the RAID just with the single OS disc.... So just unplug the 2nd IDE controller and RAID discs and deal with the OS drive.

The RAID stuff in UNIX is sort of intelligent about having devices be swapped around different ports and still being able to assemble the raid... udev tries to keep logical device names consistent even when devices move to different physical/logical ports....
Cross that bridge when you come to it.
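On 2.6-era kernels using the legacy IDE driver, the hdX names follow the IDE interface: hda/hdb are master/slave on ide0, hdc/hdd on ide1, hde/hdf on ide2 and hdg/hdh on ide3. Which physical controller ends up as ide0 versus ide2 depends on probe order, which is the swapping being discussed. A Python sketch of the name-to-position mapping:

```python
def ide_position(name: str) -> tuple:
    """Map a legacy IDE disk name like 'hde' to (interface number, 'master'/'slave')."""
    if len(name) != 3 or not name.startswith("hd") or not ("a" <= name[2] <= "z"):
        raise ValueError(f"not a legacy IDE disk name: {name!r}")
    index = ord(name[2]) - ord("a")
    # Two drives per interface: even index = master, odd index = slave.
    return index // 2, "master" if index % 2 == 0 else "slave"


for disk in ("hda", "hdc", "hde", "hdg"):
    print(disk, ide_position(disk))
```

Applied to the hda, hdc, hde and hdg names in the fdisk listing later in the thread, every drive comes out as a master on its own interface, consistent with one drive per channel.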

Doubt you need to spend money on a new MB if you're willing to fight with that one or try a different OS distribution or wait for a kernel/driver fix or whatever it takes. Chances are you can solve it with a kernel boot argument though and BIOS settings.

On the other hand, I'm sure there are motherboards with better BIOSes and better driver support, where things "just work" with less hassle because CentOS supports their configurations better by default. But that'd be $50 - $100 for a new one and no guarantee it won't have quirks. I should be able to see how well CentOS works on the IP35-E in a couple of days; that's a cheap one, $50 after a damn slow rebate at Newegg last time I checked. It's Socket 775/DDR2; I don't know what your CPU/memory is now.

Did you try NetBSD / FreeBSD?

Anyway rip out everything you don't absolutely need to test with and go for a test install with PS/2 and only the OS drive and see what happens.





Red Squirrel

Managed to get it booting, well, to the crashed grub screen anyway. Trying to reinstall right now to see what happens. If it does it again I'll try Fedora Core 7 instead, and if that does it too I'll try without the card, but I NEED that card, so that won't really be a solution.

The USB error seems to have been a one-time thing; I haven't seen it again yet.
 

Red Squirrel

Woot, managed to get it going. Guess Linux just pulled a Windows for once: a reinstall fixed it.

[root@localhost etc]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 hdg1[3] hda1[1] hdc1[0]
976767872 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]
[==>..................] recovery = 12.4% (60973496/488383936) finish=228.1min speed=31217K/sec

unused devices: <none>
[root@localhost etc]#


FTW!

This may take a while, and is super loud.
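The finish= value in /proc/mdstat is roughly the blocks left divided by the current speed. mdstat counts in 1 KiB blocks and reports speed in K/sec, so for the recovery line above it works out to (488383936 - 60973496) / 31217 ≈ 13692 s ≈ 228 min, in line with finish=228.1min. A Python sketch of that arithmetic:

```python
import re


def recovery_eta_minutes(line: str) -> float:
    """Estimate minutes remaining from a /proc/mdstat recovery/resync line."""
    m = re.search(r"\((\d+)/(\d+)\).*speed=(\d+)K/sec", line)
    if m is None:
        raise ValueError("no recovery progress found")
    done_kb, total_kb, speed_kb_s = map(int, m.groups())
    # Blocks are 1 KiB and speed is K/sec, so this is seconds; convert to minutes.
    return (total_kb - done_kb) / speed_kb_s / 60


line = "[==>..................] recovery = 12.4% (60973496/488383936) finish=228.1min speed=31217K/sec"
print(round(recovery_eta_minutes(line), 1))
```

The kernel's own figure uses its measured rate, so the two can differ by a little.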


The only thing is, I'm scared it may decide to crap out once the RAID is done, so I still have to cross my fingers and reboot once this is complete.


Actually, you know what, I don't remember it taking this long when I created the RAID through setup, so I bet the first time it booted it started rebuilding, I didn't realize it and just shut it off, and that trashed the array and caused all those issues.

I used mdadm this time and did it manually.
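The exact mdadm command wasn't posted. A plausible reconstruction for an array like this one is sketched below in Python as an argument list that is only printed, never run (running it would wipe the members); the member partitions and the 64k chunk are taken from the mdstat output above. The mdadm man page notes that when creating a RAID 5 array, mdadm builds it degraded with one member as a spare and recovers onto it, which would explain the [3/2] [UU_] status and the "recovery" line. The size also checks out: RAID 5 gives (n - 1) members' worth of space, and 2 × 488383936 = 976767872, the block count mdstat shows.

```python
def mdadm_create_cmd(array: str, level: int, members: list[str], chunk_kb: int = 64) -> list[str]:
    """Build (but do not run) an 'mdadm --create' argument list."""
    return ["mdadm", "--create", array, f"--level={level}",
            f"--raid-devices={len(members)}", f"--chunk={chunk_kb}", *members]


def raid5_capacity_kb(member_kb: int, n_members: int) -> int:
    """RAID 5 usable size: one member's worth of space goes to parity."""
    return (n_members - 1) * member_kb


cmd = mdadm_create_cmd("/dev/md0", 5, ["/dev/hda1", "/dev/hdc1", "/dev/hdg1"])
print(" ".join(cmd))
print(raid5_capacity_kb(488383936, 3))  # 976767872, the block count /proc/mdstat reports
```

The per-member size used here is the one from mdstat's recovery line, which is slightly smaller than the fdisk partition size because md reserves room for its superblock.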
 

Red Squirrel

Think I figured out what happened. Check this:


[root@alderaan ~]# fdisk -l

Disk /dev/hda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/hda1 * 1 60801 488384001 fd Linux raid autodetect

Disk /dev/hdc: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/hdc1 * 1 60801 488384001 fd Linux raid autodetect

Disk /dev/hde: 163.9 GB, 163928604672 bytes
255 heads, 63 sectors/track, 19929 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/hde1 * 1 25 200781 83 Linux
/dev/hde2 26 3212 25599577+ 83 Linux
/dev/hde3 3213 4487 10241437+ 82 Linux swap / Solaris
/dev/hde4 4488 19929 124037865 5 Extended
/dev/hde5 4488 19929 124037833+ 83 Linux

Disk /dev/hdg: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/hdg1 * 1 60801 488384001 fd Linux raid autodetect

Disk /dev/md0: 1000.2 GB, 1000210300928 bytes
2 heads, 4 sectors/track, 244191968 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Device Boot Start End Blocks Id System
/dev/md0p1 1 244191968 976767870 83 Linux
[root@alderaan ~]#



What I want it to boot from is hde1, but the RAID drives come first, and they also have the bootable flag. Those are specialized partitions that don't have actual OS data, so when it tries to boot off them instead of hde1 it craps out.

Any way to make those drives not bootable, so it just goes to hde1 first? I'm sure if I reboot this server I'll end up with the same problem until I fix this.
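To list which partitions carry the boot flag, you can filter the fdisk -l output for an asterisk in the Boot column. A Python sketch run against lines from the listing above. Note the flag is only one factor: standard MBR boot code looks for the active partition, but GRUB legacy installed in a disk's MBR does not consult it, so which disk the BIOS tries first matters too.

```python
def bootable_partitions(fdisk_output: str) -> list:
    """Return partitions that 'fdisk -l' shows with the boot (*) flag."""
    found = []
    for line in fdisk_output.splitlines():
        fields = line.split()
        # Partition rows start with the device; a '*' in the next column is the boot flag.
        if len(fields) >= 2 and fields[0].startswith("/dev/") and fields[1] == "*":
            found.append(fields[0])
    return found


sample = """Device Boot Start End Blocks Id System
/dev/hda1 * 1 60801 488384001 fd Linux raid autodetect
/dev/hde1 * 1 25 200781 83 Linux
/dev/hde2 26 3212 25599577+ 83 Linux
/dev/hdg1 * 1 60801 488384001 fd Linux raid autodetect
"""
print(bootable_partitions(sample))
```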

I tried to just toggle the boot flag in fdisk, but I get this whenever I try to alter it (even /dev/md0 gives me this error):

Device Boot Start End Blocks Id System
/dev/hda1 1 60801 488384001 fd Linux raid autodetect

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.

WARNING: Re-reading the partition table failed with error 16: Device or resource busy.
The kernel still uses the old table.
The new table will be used at the next reboot.
Syncing disks.
[root@alderaan ~]#
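Error 16 is EBUSY. The kernel won't re-read a disk's partition table while partitions on it are in use, and each of these disks has a partition held by md0; as fdisk says, the new table was written and takes effect at the next reboot. A Python sketch that lists which whole disks have partitions in an md array, using the mdstat line from earlier in the thread and assuming legacy hdX-style names:

```python
import errno
import re


def disks_held_by_md(mdstat_text: str) -> set:
    """Whole-disk names (legacy 'hdX'-style) with a partition in an md array."""
    disks = set()
    for line in mdstat_text.splitlines():
        if re.match(r"^md\d+\s*:", line):
            # Members look like 'hdg1[3]'; capture the disk part before the partition number.
            disks.update(re.findall(r"\b([a-z]+)\d+\[\d+\]", line))
    return disks


sample = "md0 : active raid5 hdg1[3] hda1[1] hdc1[0]\n"
print(errno.errorcode[16])               # EBUSY on Linux
print(sorted(disks_held_by_md(sample)))  # ['hda', 'hdc', 'hdg']
```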