Converted Linux system to a VM, now it's acting like it has a failing hard drive... but it's a VM.

Red Squirrel

No Lifer
May 24, 2003
67,330
12,096
126
www.anyf.ca
So this is a weird one. I have an FC9 system (I know) whose only job is to run my local mail. I've been wanting to migrate that, but mail is kind of a pain to set up and I just never got around to it. I want to use virtual domains instead of individual user accounts for each mailbox, and procmail does not work with virtual users, so I need to code my own version, which I just have not gotten around to doing. I have a couple of special mailboxes that won't work the way they do now if I go to virtual users, short of coding a custom MTA.

Anyway, so basically ever since I switched, trying to get my mail is BLOODY SLOW. Thunderbird just sits for about half an hour before all the mail finally loads. It's using IMAP.

In the dmesg log I also get a lot of errors like this:

Code:
ata2.00: qc timeout (cmd 0xa0)
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata2.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         cdb 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
         res 51/20:03:00:00:00/00:00:00:00:00/a0 Emask 0x5 (timeout)
ata2.00: status: { DRDY ERR }
ata2: soft resetting link
ata2.00: configured for PIO0
ata2: EH complete
ata2.00: qc timeout (cmd 0xa0)
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata2.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         cdb 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
         res 51/20:03:00:00:00/00:00:00:00:00/a0 Emask 0x5 (timeout)
ata2.00: status: { DRDY ERR }
ata2: soft resetting link
ata2.00: configured for PIO0
ata2: EH complete
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata2.00: cmd a0/00:00:00:08:00/00:00:00:00:00/a0 tag 0 pio 16392 in
         cdb 4a 01 00 00 10 00 00 00  08 00 00 00 00 00 00 00
         res 40/00:03:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
ata2.00: status: { DRDY }
ata2: soft resetting link
ata2.00: configured for PIO0
ata2: EH complete
[root@borg ~]# 
Message from syslogd@borg at Jun  5 12:02:37 ...
 spamd[11064]: spamd: handled cleanup of child pid 11072 due to SIGCHLD


Looks to me like a failing hard drive, but again, this is a VM. The VM disk is stored on a RAID 10 array accessed via NFS. The other VMs are fine. My NFS has always been super slow, though; I've just never been able to figure out why and gave up. I want to eventually revamp storage and switch to iSCSI or something at some point. Is it just that this OS is being affected more severely by the slow disk?
 

TheELF

Diamond Member
Dec 22, 2012
3,973
730
126
Is PIO0 the normal mode for PATA drives? Is the VM set up for PATA? Was the original system?

Did you make the clone with a program that does not mess up the alignment?
 

Red Squirrel

No Lifer
May 24, 2003
67,330
12,096
126
www.anyf.ca
The clone was done in a rather crude way, so maybe that is the issue. I literally dd'ed the entire disk to an image, then booted the fresh VM with a live distro and dd'ed it onto the virtual disk. It did not boot at first; I don't remember what I did, but it involved rebuilding the startup portion of the kernel or something like that. The old server is SATA, and the new one is technically SATA as well. The disk shows up as sda and not hda.
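For reference, a raw clone like that is roughly along these lines (device names and paths here are just illustrative, not the exact commands I ran):

Code:
# on the old physical box: image the whole disk to a file on shared storage
dd if=/dev/sda bs=1M | gzip > /mnt/nas/mailserver.img.gz

# in the new VM, booted from a live distro: write the image onto the virtual disk
gunzip -c /mnt/nas/mailserver.img.gz | dd of=/dev/sda bs=1M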

I probably just need to bite the bullet and completely retire this server, though. The few advanced mailboxes I have I can maybe leave where they are so it can do its thing in the background, and I can migrate the rest to a new mail server.

Even just typing commands in the ssh console is brutally slow. Oddly enough, it was not like this at first; it just got worse with time. Load is constantly at 15+, and it seems the spam filtering really pushes it. I almost wonder if I am in fact better off having a separate physical box for this. I only just noticed now that the load is so high; maybe that's the issue.
 

Red Squirrel

No Lifer
May 24, 2003
67,330
12,096
126
www.anyf.ca
Well, it's definitely some weird disk I/O speed issue. No idea why it's so damn slow, though. It's slower than all the other VMs on the same LUN, so this particular slowness is not just my NFS being slow, even though that IS an issue I have and gave up trying to troubleshoot.

I think it might just be the fact that this is such an old system that something somewhere is mucked up. I just need to migrate mail off it and retire it for good.

Interesting how much the speeds vary based on the parameters, though. It seems it can't deal with big chunks of data at a time.

Code:
[root@borg ~]# dd if=/dev/zero of=testfile.bin bs=100M count=1
1+0 records in
1+0 records out
104857600 bytes (105 MB) copied, 116.876 s, 897 kB/s
[root@borg ~]# 
[root@borg ~]# 
[root@borg ~]# dd if=/dev/zero of=testfile.bin bs=10M count=10
10+0 records in
10+0 records out
104857600 bytes (105 MB) copied, 0.581574 s, 180 MB/s
[root@borg ~]# 
[root@borg ~]# 
[root@borg ~]# dd if=/dev/zero of=testfile.bin bs=10M count=100
100+0 records in
100+0 records out
1048576000 bytes (1.0 GB) copied, 535.325 s, 2.0 MB/s
[root@borg ~]# 
[root@borg ~]# 
[root@borg ~]# dd if=/dev/zero of=testfile.bin bs=10M count=10
10+0 records in
10+0 records out
104857600 bytes (105 MB) copied, 1.05444 s, 99.4 MB/s
[root@borg ~]# 
[root@borg ~]# 
[root@borg ~]# dd if=/dev/zero of=testfile.bin bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.15293 s, 686 MB/s
[root@borg ~]# 
[root@borg ~]# dd if=/dev/zero of=testfile.bin bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 49.5754 s, 21.2 MB/s
[root@borg ~]# 
[root@borg ~]# 
[root@borg ~]# dd if=/dev/zero of=testfile.bin bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 101.341 s, 10.3 MB/s
[root@borg ~]#


To show a comparison, locally on the NAS:

Code:
[root@isengard temp]# dd if=/dev/zero of=testfile.bin bs=10M count=10
10+0 records in
10+0 records out
104857600 bytes (105 MB) copied, 0.215098 s, 487 MB/s
[root@isengard temp]# 
[root@isengard temp]# 
[root@isengard temp]# dd if=/dev/zero of=testfile.bin bs=10M count=100
100+0 records in
100+0 records out
1048576000 bytes (1.0 GB) copied, 3.6055 s, 291 MB/s
[root@isengard temp]# 
[root@isengard temp]# 
[root@isengard temp]# dd if=/dev/zero of=testfile.bin bs=100M count=100
100+0 records in
100+0 records out
10485760000 bytes (10 GB) copied, 37.1309 s, 282 MB/s
[root@isengard temp]# 
[root@isengard temp]# 
[root@isengard temp]# 
[root@isengard temp]# dd if=/dev/zero of=testfile.bin bs=1G count=10
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 40.9538 s, 262 MB/s
[root@isengard temp]#


So yeah, I think I just need to migrate this stuff off to a newer OS and retire this server once and for all.
 

amd6502

Senior member
Apr 21, 2017
971
360
136
That's an interesting concept, and I'd be interested in the details, as it sounds like a good way to keep a system going while migrating to a new Linux install.

So how does one go about converting a bootable OS into a VM?
 

Red Squirrel

No Lifer
May 24, 2003
67,330
12,096
126
www.anyf.ca
I kinda got lucky, was able to just dd the entire disk over to the VM. At first it would not boot, but was able to get it going after googling the errors, don't remember the steps.

But probably why I'm having issues, it's not exactly a clean way of doing it.

I basically did it so I can turn off the physical box since it was running just for this one purpose.
 

ultimatebob

Lifer
Jul 1, 2001
25,135
2,445
126
Did you install VMware Tools on your new VM? You might be using the wrong disk controller drivers, which could be causing the bottleneck.
 

matheusber

Senior member
Jun 12, 2001
380
5
81
but was able to get it going after googling the errors, don't remember the steps.

Hi, I can't tell if you have already fixed it by now, either by figuring it out or by installing another box, but it would be useful to know what you had to do to get it going. This got me curious:

ata2.00: configured for PIO0

I don't know if you already know this, but since no one has mentioned it: PIO is an old disk transfer mode that was used before DMA. My suspicion is that whatever Google told you to do to make it boot also forced it into this slow disk access mode. I would look at that and at the hypervisor's disk configuration.
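For example, you can see what transfer mode libata negotiated right from the guest's kernel log (the exact message wording varies a little between kernel versions):

Code:
# "configured for PIO0" is the red flag; a healthy virtual disk usually negotiates a UDMA mode
dmesg | grep -i "configured for"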

Hope that helps,

matheus
 

Red Squirrel

No Lifer
May 24, 2003
67,330
12,096
126
www.anyf.ca
It was something to do with the grub command. I basically had to make it rebuild the kernel, or something along those lines. I just followed some steps I found when I googled the error. I should have taken notes for the future, but I was kind of just trying anything, as my next step would have been to migrate everything over to a new VM manually.
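If it was the initrd rather than the kernel itself that needed rebuilding, it would have been something along these lines (this is a guess at the kind of step involved, run from a rescue environment chrooted into the installed system):

Code:
# rebuild the initrd so it contains the driver for the new (virtual) disk controller;
# <kernel-version> is the installed kernel, i.e. one of the directories under /lib/modules
mkinitrd -f /boot/initrd-<kernel-version>.img <kernel-version>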

I set up the VM to reboot nightly, and also set dovecot and postfix to restart hourly. It's a hack and a half, but it has improved performance a bit. This server (the OS installation itself) is on its last legs anyway, so I just need to take the time to migrate the mail to a new server/platform. There's a lot of advanced stuff that this mail server does, such as piping mail through some custom applications, which is easy to do when you use procmail and unix users. I want to move to virtual users since they're easier to manage, but that does make per-user rules a bit trickier.
 

Scarpozzi

Lifer
Jun 13, 2000
26,389
1,778
126
The problem with disk I/O bottlenecks is that they drag everything else down. If you can get into your hypervisor's tools, look at what the network interface carrying the NFS traffic is doing. Look for spikes or sustained use that plateaus over time... that would hint at peak usage vs. an issue with the system throttling itself. I only ever used NFS for mounting media, not for running a kernel or processes... or swap.

Do you have a gig link? Can you do etherchannel? 10G?

I'm just thinking the virtual side of the house is being limited by the physical network/disk constraints...if it's not something wrong with the filesystem itself.
 

Red Squirrel

No Lifer
May 24, 2003
67,330
12,096
126
www.anyf.ca
That's the weird part; my NFS is just slow. There's no real bottleneck, or at least it's very hard to see one. It's gig links all the way, things like iperf show good results, the local RAID is good; it's strictly NFS that is slow.

For this particular server it's an OS issue; it's just super old, and the fact that it got transplanted from physical hardware to virtual probably caused weird issues too, but my NFS infrastructure IS very slow. It's something I pretty much gave up on, as I have no idea where else to look. I just need to retire this particular box, get it over with, and migrate stuff to a newer OS/VM.

I eventually want to revamp my whole VM environment and use Proxmox or something more open, with some kind of clustered/iSCSI-based shared storage, and I'm hoping that in that process the problem just goes away. I know my RAID is fine too, since locally I get really good speeds, but NFS is slow.

That said, I did buy a quad gig NIC and a separate 24-port gig switch. At some point, when I'm in a position to completely shut down my network, I will put the quad-port NIC in my file server and route everything through a dedicated switch. The VM servers will have a dedicated NIC too; essentially I'll convert my file server into a SAN. I can also experiment with jumbo frames. I really don't think my network is the bottleneck, but it can't hurt to do this setup anyway.
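Jumbo frames themselves are just an MTU change on every hop; on CentOS 6 that would be something like this (interface name is illustrative, and the switch has to support it too):

Code:
# set it live for testing
ip link set dev eth1 mtu 9000

# make it persistent by adding this line to /etc/sysconfig/network-scripts/ifcfg-eth1
MTU=9000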
 

TheELF

Diamond Member
Dec 22, 2012
3,973
730
126
I don't know if you already know this, but since no one has mentioned it: PIO is an old disk transfer mode that was used before DMA.
If that's the case, then the drive is expected to run slowly.
You could add a new disk controller to the VM, attach a virtual drive, clone the slow drive to the new one, and then detach the old one; with only the compatible controller, the drive should run much faster.
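Roughly something like this, booted from a live ISO so nothing on the old disk is mounted (device names are only examples; the new virtual disk on the new controller shows up as the second drive):

Code:
# clone the old virtual disk onto the new one attached to the other controller
dd if=/dev/sda of=/dev/sdb bs=1M conv=noerror,sync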
 

matheusber

Senior member
Jun 12, 2001
380
5
81
That's the weird part; my NFS is just slow. There's no real bottleneck, or at least it's very hard to see one. It's gig links all the way, things like iperf show good results, the local RAID is good; it's strictly NFS that is slow.

For this particular server it's an OS issue; it's just super old, and the fact that it got transplanted from physical hardware to virtual probably caused weird issues too, but my NFS infrastructure IS very slow. It's something I pretty much gave up on, as I have no idea where else to look. I just need to retire this particular box, get it over with, and migrate stuff to a newer OS/VM.

I eventually want to revamp my whole VM environment and use Proxmox or something more open, with some kind of clustered/iSCSI-based shared storage, and I'm hoping that in that process the problem just goes away. I know my RAID is fine too, since locally I get really good speeds, but NFS is slow.

That said, I did buy a quad gig NIC and a separate 24-port gig switch. At some point, when I'm in a position to completely shut down my network, I will put the quad-port NIC in my file server and route everything through a dedicated switch. The VM servers will have a dedicated NIC too; essentially I'll convert my file server into a SAN. I can also experiment with jumbo frames. I really don't think my network is the bottleneck, but it can't hurt to do this setup anyway.

What version of NFS are you using?

If you can use more than one NIC, you can try a lagg interface (as it's called in my FreeBSD world) and test with more than a single gigabit link. You have nothing to lose, right?
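On the Linux side the equivalent is a bonding interface. A minimal sketch for CentOS 6 could look like this (interface names, IP, and bonding mode are placeholders, and the switch must support whatever mode you pick):

Code:
# /etc/sysconfig/network-scripts/ifcfg-bond0 (all values illustrative)
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.168.1.10
NETMASK=255.255.255.0
BONDING_OPTS="mode=802.3ad miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-eth0 (repeat for each slave NIC)
DEVICE=eth0
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes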

If that's the case then the drive is supposed to run slowly.
You could add a new disk controller to the VM attach a virtual drive and clone the slow drive to the new one and then detach the old one,with only the compatible controller the drive should run much faster.

That is a good idea; you can also try changing the disk controller on the hypervisor.

matheus
 

Red Squirrel

No Lifer
May 24, 2003
67,330
12,096
126
www.anyf.ca
Not sure what version, whatever is default in CentOS 6. I think it's 3?

I eventually need to start looking into upgrading to a newer OS, but migrating something like a file server is a pretty huge feat that involves a lot of downtime, so I typically only move to a newer OS when I'm building a new server that can run side by side with the old one.
 

Scarpozzi

Lifer
Jun 13, 2000
26,389
1,778
126
What version of NFS are you using?

If you can use more than one NIC, you can try a lagg interface (as it's called in my FreeBSD world) and test with more than a single gigabit link. You have nothing to lose, right?



That is a good idea; you can also try changing the disk controller on the hypervisor.

matheus
Those are good ideas because the changes should be easy to undo, if needed.
 

Red Squirrel

No Lifer
May 24, 2003
67,330
12,096
126
www.anyf.ca
I actually had to use a non-default disk controller, as the default gave errors... I can't recall the specifics now. I'll have to play with it again and see if I can clone it as suggested; maybe that will work. My Windows box is off right now so I can't check. I really need to switch to a hypervisor that does not require Windows to manage. :p
 

matheusber

Senior member
Jun 12, 2001
380
5
81
Not sure what version, whatever is default in CentOS 6. I think it's 3?

I eventually need to start looking into upgrading to a newer OS, but migrating something like a file server is a pretty huge feat that involves a lot of downtime, so I typically only move to a newer OS when I'm building a new server that can run side by side with the old one.

I think you can force the NFS version using fstab options. I only vaguely recall when CentOS 6 was current, but I think NFS v4 has been around for a while, even if it's not used as often. You can also try other NFS options; I guess man nfs and man fstab may help a lot.
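A sketch of an fstab entry that pins v3 (server name and paths are made up):

Code:
# /etc/fstab
nas:/export/vms   /mnt/vms   nfs   nfsvers=3,noatime   0 0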
I know about this kind of migration; I have two boxes of my own that need updating, and just thinking about starting gives me the creeps. And one is a remote server, so I can't do it side by side :/

I actually had to use a non-default disk controller, as the default gave errors... I can't recall the specifics now. I'll have to play with it again and see if I can clone it as suggested; maybe that will work. My Windows box is off right now so I can't check. I really need to switch to a hypervisor that does not require Windows to manage. :p

This controller change sounds very much like the culprit, although it may not be. Non-default options can produce some weird outcomes. Well, relying on Windows is really bad, but I have one VM at home that used to run on Ubuntu through VirtualBox and now runs on Windows, and it is fine and stable. I don't have much to complain about. And since it is a VM, you can duplicate the hard disk file and try other scenarios while the official box keeps running.

Well, good luck!

matheus
 

Red Squirrel

No Lifer
May 24, 2003
67,330
12,096
126
www.anyf.ca
So I sorta fixed this I think, and it's been good for at least a few weeks now.

The "fix" is the same fix many Windows server administrators use when they have servers that act slow or have weird issues. Scheduled reboots. :p I have it set to reboot nightly, and I also have dovecot setup to restart hourly. It's a super hack fix, but it actually seems to be ok now. I of course still need to migrate this over to a new OS/VM but at least this fix buys me more time.
 

you2

Diamond Member
Apr 2, 2002
5,694
930
126
What happens if you run the VM off a local disk rather than NFS?

I kinda got lucky, was able to just dd the entire disk over to the VM. At first it would not boot, but was able to get it going after googling the errors, don't remember the steps.

But probably why I'm having issues, it's not exactly a clean way of doing it.

I basically did it so I can turn off the physical box since it was running just for this one purpose.
 

Red Squirrel

No Lifer
May 24, 2003
67,330
12,096
126
www.anyf.ca
Oh, I know it would be faster, but I don't want to use local disks. The VM server does not even have any kind of RAID capability, as that's the NAS's job.
 

Fallen Kell

Diamond Member
Oct 9, 1999
6,035
429
126
Cleaning up a few issues for others who might read this thread in the future:

NFS version 4 is the latest version out there. And yes, it has "slower" performance than version 3 (due to the added overhead and, in some cases, the data integrity/safety features). By default, CentOS 6 (and 5, 7, and 8) has NFS version 4 as the max supported version in its config file, so if the client is also one of those systems it will attempt to use NFS version 4 first when handshaking. Assuming everything is configured properly, it will then use NFS version 4. Personally, I configure my clients to only use version 3 (easiest is the mount option "nfsvers=3").

You also need to do some basic NFS tuning in order to get decent performance, as the defaults are just horrible. Typically I start with the client-side mount options "intr,noatime,rsize=65536,wsize=65536" at a minimum, and depending on the workload (i.e. there isn't more than one system writing to the same file(s) at the same time, and you have a stable network and battery-backed power for the server and network), I might also include "async". Run a few tests such as the ones above with dd creating files, then try tweaking the rsize and wsize values (try doubling/halving them, etc.) and see what runs best with your hardware and network.
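To make that concrete, a client-side fstab entry with those options might look like this (server name and paths are just placeholders):

Code:
# /etc/fstab - add ,async only if you accept the trade-offs above
nas:/export/vms   /mnt/vms   nfs   nfsvers=3,intr,noatime,rsize=65536,wsize=65536   0 0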

As for the VM, I would seriously look into fixing the drive type in the hypervisor for that system to use SATA or similar controller. My personal experience is mostly with using linux KVM, in which I would have it configured as a virtio storage device. Also on virtual machines do not use filesystems such as brtfs (stick to EXT3/4 and XFS, but there are even good reasons to simply use EXT3/4 since they can be shrunk in size, while XFS can not be). Again, if this was on linux, I would look at using tools to clone the disk image to a different file and convert it to RAW format (which in linux is a fully populated disk image (i.e. not thin provisioned), which is roughly a 1:1 mapping of the size of the virtual disk to the physical file size on your storage system and has one of the lowest overheads for virtual disk image files (aside from giving the VM a physical disk or network LUN)).