Help me with a VMware ESX benchmark

spikespiegal

Golden Member
Oct 10, 2005
1,219
9
76
If anybody is running Windows VMs inside ESX 4 (or higher) on decent servers, could you do the following 'crude' benchmark in the guest OS and tell me what you get?

Open MSpaint. Resize the image (Attributes) to 6000x6000 pixels. Save it to local disk and roughly time how long the save takes (until the hourglass goes away).

I'm doing a big Citrix rebuild for a client, and running into substantial performance problems on the guest OS's inside ESX 4, which are being blamed on Citrix and the guest OS's. I have limited access to the host side because their 'certified' VMware engineer doesn't feel there is a problem, or just doesn't care, and is being territorial. All the metrics I'm collecting within the guest OS's with Perfmon and other tools point to huge disk write latencies, but I need something a little more basic as a smoking gun to escalate the issue to a higher level. So, I came up with the MSpaint thing.

On our guest OS's it takes almost a minute to save the file locally. On a 3 GHz P4 XP box it takes 6 seconds. It takes 5 seconds to save the file from the problematic guest OS to a server running on bare iron across gigabit Ethernet.
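For anyone who would rather script the test than click through MSpaint, here is a minimal sketch of the same idea, assuming Python is available in the guest. It writes roughly the same amount of data as a 6000x6000 24-bit bitmap and times it, forcing a flush to disk so the OS cache doesn't hide the write. The function name and the use of a temp file are my own choices for illustration, and it writes zeros, which could flatter the result if the volume compresses data.

```python
import os
import tempfile
import time


def timed_write(path, size_bytes, chunk_bytes=1024 * 1024):
    """Write size_bytes of zeros to path, flush to disk, return elapsed seconds."""
    chunk = b"\0" * chunk_bytes
    remaining = size_bytes
    start = time.perf_counter()
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(remaining, chunk_bytes)
            f.write(chunk[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # push the data out of the OS cache before stopping the clock
    return time.perf_counter() - start


if __name__ == "__main__":
    size = 6000 * 6000 * 3  # roughly the pixel data of a 6000x6000 24-bit bitmap
    fd, path = tempfile.mkstemp(suffix=".bin")
    os.close(fd)
    try:
        elapsed = timed_write(path, size)
        print(f"{size / 1e6:.0f} MB in {elapsed:.2f} s ({size / 1e6 / elapsed:.1f} MB/s)")
    finally:
        os.remove(path)
```

Run it in the guest and on a physical box and compare the two numbers, same as with the MSpaint timing.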
 

GhettoFob

Diamond Member
Apr 27, 2001
6,800
0
76
To figure out where the bottleneck is, you'll need the VMware engineer to run esxtop while you're performing the operation. You can examine CPU/memory/disk/network here to see what kind of performance/latency you're getting. Do you know what the underlying storage is? Local disk/iSCSI/FC/NFS?
 

Crusty

Lifer
Sep 30, 2001
12,684
2
81
i7 920, 12GB ram
4x 1TB WD RE3 drives in RAID 10

~3seconds

I'm using ESXi 4.0
 

spikespiegal

Golden Member
Oct 10, 2005
1,219
9
76
If I may jump ahead here, our engineer isn't going to assist with performance metrics because it's much easier to blame the problem on Windows / Citrix. Given I'm not VMware 'certified' he's not listening, and I'll likely have to take the issue to his less-than-technically-inclined boss. I snuck onto the vSphere console and was able to see high disk utilization corresponding to the guest OS's.

It's an IBM M2 with LSI MegaRaid and local disks. RAID level likely isn't exotic. Memory / CPU utilization across all cores is light. Frankly speaking my old Winframe boxes running on quad P200's smoked this piece of junk and required less dicking around. Vent = off

VMware says that controller is verified in version 4.1, but I'm not sure where to find this, nor do I trust if our version of ESX has been patched.

If Crusty is getting that performance from an i7 chipset I would assume I have a configuration / driver issue between ESX and the LSI, correct?
 

GhettoFob

Diamond Member
Apr 27, 2001
6,800
0
76
That's pretty lame that he won't help out...You can check out whether your hardware is supported or not here. Even non-supported white boxes shouldn't be performing at 1/10th of your expected speeds though.

Since this is with local disks, are you able to check whether write-back cache is enabled? I've seen that make a huge difference in performance. Otherwise, I'd suggest using Iometer to see what kind of disk performance you're seeing within the guest. Then run the same tests on comparable physical hardware to demonstrate the problem.
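If Iometer isn't an option, a rough stand-in for a write-latency test could look like the sketch below. It is not a replacement for Iometer, just an illustration: it issues small writes, each followed by an fsync, and summarizes how long each one took. A controller without write-back cache tends to show up as high per-write latency in this kind of test. The function names, block size, and write count are assumptions picked for the example.

```python
import math
import os
import statistics
import tempfile
import time


def write_latencies_ms(path, count=200, block_bytes=4096):
    """Issue `count` block writes, each followed by fsync; return per-write latency in ms."""
    block = os.urandom(block_bytes)
    latencies = []
    with open(path, "wb") as f:
        for _ in range(count):
            start = time.perf_counter()
            f.write(block)
            f.flush()
            os.fsync(f.fileno())  # wait until the storage stack acknowledges the write
            latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies):
    """Return average, median, 95th percentile (nearest rank) and max latency."""
    ordered = sorted(latencies)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "avg_ms": statistics.mean(ordered),
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_index],
        "max_ms": ordered[-1],
    }


if __name__ == "__main__":
    fd, path = tempfile.mkstemp(suffix=".bin")
    os.close(fd)
    try:
        for name, value in summarize(write_latencies_ms(path)).items():
            print(f"{name}: {value:.2f}")
    finally:
        os.remove(path)
```

Running the same script in the guest and on comparable physical hardware gives two sets of numbers to put side by side.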
 

GhettoFob

Diamond Member
Apr 27, 2001
6,800
0
76
I snuck onto the VSphere console and was able to see high disk utilization corresponding to the guest OS's.

If you're looking at the chart I think you're looking at, that means that there are actual reads/writes coming from within the guest. Do you know what's causing those reads and writes? (Virus scans, backup jobs, etc.)
 

spikespiegal

Golden Member
Oct 10, 2005
1,219
9
76
Actually I've encountered this attitude with quite a few 'engineers' in my corporate contract rounds. "VMware is fine - everything else is broke". In my neck of the woods a lot of senior engineers tend to be the old Novell types who are used to isolating themselves with layers away from end users doing actual work so they don't have to be accountable for it. Bad combination, and I wish these clowns would go back on the unemployment line because we tried to get rid of them back in the early part of the decade.

As for actual I/O, these are Citrix / TS boxes, so most of the file activity is generic Winlogon sequences, profile loads, etc. Nothing out of the ordinary there, and I verified this with FileMon. We're lucky we're still running Office 2003 because 2007 would make the problem 2x as bad. However, when something system level like a print job or profile change gets queued on the disk, the entire box stutters.

The good thing is other than our disk / RAID issue VMware does an outstanding job with Citrix / TS. I have at least 25% more user capacity because I can increase the number of VM's and isolate errant processes.

Good call on write back cache. I'll try to check that.
 

quikah

Diamond Member
Apr 7, 2003
4,199
744
126
What are you saving as? 24-bit bmp is ~100MB, png is ~400KB. Makes a second or 2 difference on my setup, both save in under 5 seconds though.
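The ~100MB figure for the bitmap can be checked with a quick calculation. This is a small sketch assuming an uncompressed BMP with the common 54-byte header and rows padded to a multiple of 4 bytes; the function name is made up for the example. The throughput lines use the rough timings reported in this thread (almost a minute versus ~3 seconds), so treat them as ballpark numbers.

```python
def bmp_size_bytes(width, height, bits_per_pixel=24):
    """Size of an uncompressed BMP: 54-byte header plus rows padded to 4 bytes."""
    row_bytes = ((width * bits_per_pixel + 31) // 32) * 4
    return 54 + row_bytes * height


size = bmp_size_bytes(6000, 6000)
print(size)                     # 108000054
print(round(size / 2**20, 1))   # 103.0 (MiB)

# Implied write throughput for the rough timings mentioned in the thread
print(round(size / 1e6 / 60, 1))  # 1.8 (MB/s at about a minute)
print(round(size / 1e6 / 3, 1))   # 36.0 (MB/s at about 3 seconds)
```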

Do you have any other VMs on that datastore? Same problems?

Might be a hardware problem (bad disk?). Check out the vmkernel logs, I suspect you are getting a boatload of SCSI errors.
 

imagoon

Diamond Member
Feb 19, 2003
5,199
0
0
VMware using LSI gear on local datastores should run at near-native disk speeds. However it depends on what the other guests are doing. If the disks are overloaded or the VMware Tools are out of date, you can see some significant performance issues. Are you running the ParaVirtual driver inside the guest, with the corresponding ParaVirtual adapter in the guest's VMX? What does the Performance tab on that host say? In a lot of cases poor disk performance is caused by overloaded disks. Being all local datastore... I would guess that the odds of that are high, since most servers don't really have enough disk slots to load enough spindles to get decent IOPS from multiple guest OS's.
 

spikespiegal

Golden Member
Oct 10, 2005
1,219
9
76
makes a second or 2 difference on my setup, both save in under 5 seconds though.

Takes almost a minute on this server from all the guest VM's.

Do you have any other VMs on that datastore? Same problems?

Half a dozen guest VMs are running on that box, and they all have the same problem when I run a 'disk write' benchmark inside them. In the vSphere console I see physical disk write latency getting as high as 100ms at times.

However it depends on what the other guests are doing.

No. I've logged onto the box late at night when *nobody* is on them other than my own task, and the MSpaint benchmark result stays the same. Citrix / TS boxes are extremely latency sensitive anyway, so this is a bad combination.

or the VMWare tools is out of date

..and I verify this how? This is a cool product, but getting a decisive driver/version verification is like some secret society club or something. At some point the host OS has to handshake with the underlying hardware and I just can't believe it's plug and play. I've boned up on optimizing VMware with proper Partition Alignment, but I would tend to sort this into the 'tweak' category and not out of the box issues.(??)

Are you running the ParaVirtual driver

Yep - verified that. Also, as I've said above, write times within each guest OS regardless of load on the box are terrible and consistent. If we were hosting a bunch of web servers, database servers or such, nobody would really notice the problem and it would be blamed on something else. Funny thing is, our engineer is going to shift about 30% more users to the boxes this weekend, which will make Monday a fun day. By then I'd better have some rock-hard evidence to show that the guest OS's are not the problem.
 

GhettoFob

Diamond Member
Apr 27, 2001
6,800
0
76
100 ms is pretty bad. How much physical memory does the server have and what's the sum of the memory of the VMs? You might want to check if the VMs are swapping, not within the guest but at the VM level. If they are, then they're writing to a .vswp file on the local disk, which would mean additional writes when memory needs to be paged in/out. You may be able to find this info from the performance charts, but I usually use esxtop (switch to the memory view by pressing m and look at the SWCUR column). This is a good link for a few other things to check for.
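The first question above is simple arithmetic: add up the configured memory of all VMs and compare it with the host's physical memory. A minimal sketch is below. The function name and all figures are hypothetical (six guests matches the thread, the sizes do not come from this server), and it ignores VM overhead, ballooning, and page sharing, so a non-zero result only means host-level swapping is possible, not that it is happening.

```python
def memory_overcommit_mb(host_physical_mb, vm_configured_mb):
    """Return configured VM memory minus host physical memory, floored at zero."""
    return max(0, sum(vm_configured_mb) - host_physical_mb)


# Hypothetical figures, for illustration only (not taken from this server).
host_mb = 32 * 1024
vms_mb = [8 * 1024] * 6  # six guests, 8 GB each

excess = memory_overcommit_mb(host_mb, vms_mb)
print(f"configured: {sum(vms_mb)} MB, physical: {host_mb} MB, excess: {excess} MB")
```

The SWCUR column in esxtop is still the real check; this only tells you whether it is worth looking.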

Are you able to shut down all but one of the VMs and try the same test? Did you check the write-back cache? Can you check /var/log/vmkernel (or /var/log/messages for ESXi) to see if there are any storage related errors or I/O aborts?
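If you can get a copy of the log, a quick way to skim it is a small filter like the sketch below. The keyword list is an assumption for illustration, not a list of actual vmkernel messages, so adjust it to whatever the log really contains. It takes the log path as a command-line argument.

```python
import re
import sys
from collections import Counter

# Hypothetical keywords; these are guesses, not known vmkernel message formats.
PATTERN = re.compile(r"scsi|abort|timeout|i/o error", re.IGNORECASE)


def scan_log(lines):
    """Return (matching lines, Counter of lowercased keyword hits)."""
    hits = []
    counts = Counter()
    for line in lines:
        found = PATTERN.findall(line)
        if found:
            hits.append(line.rstrip("\n"))
            counts.update(k.lower() for k in found)
    return hits, counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python scan_log.py vmkernel-copy.log
    with open(sys.argv[1], errors="replace") as f:
        hits, counts = scan_log(f)
    for keyword, n in counts.most_common():
        print(f"{keyword}: {n}")
    for line in hits[:20]:  # show only the first few matches
        print(line)
```

A large count on its own proves nothing, but it tells you which lines to read first.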

Sadly, this is all stuff your VMware engineer should be checking out....If they haven't done this, they should be fired.