SAN/Virtualization Issue

[DHT]Osiris

Lifer
Dec 15, 2015
17,380
16,661
146
Wasn't quite sure where to put this since AT doesn't have a 'professional IT' section, but storage fits well enough.

Currently working with vendor(s) on this, but my company recently ordered a new SAN from a respected SAN vendor, in the $50k range. Not a budget one but not a $200k rack-o-disks either.

We've had a persistent problem with it for about two to three weeks now, specifically in how it handles thin vs thick provisioned LUNs within VMware, which is preventing us from putting it in production. Basically, with a thin provisioned LUN of arbitrary size on both RAID6 and RAID5, we get about 50% of the read/write speeds we get with thick provisioned. I've never encountered this before; the only thing I've ever been schooled on regarding thin vs thick was that in ye olden dark days, thin provisioned volumes/LUNs/whatever could suffer a moderate degradation in write speed, which makes sense. In my experience that has mostly fallen by the wayside, though.

So anyhow, poor performance with thin provisioned LUNs regardless of RAID pool type, which I primarily attribute to read latency. This was tested via simple methods: initially file copies within a VM to itself, vMotioning (this is a VMware build), some SQL backups, junk like that. I also used IOMeter to dig up info on choke points.

Build is 2x Dell hosts with 2x connections each, connected via 10GbE fiber iSCSI through a fiber switch to the SAN, which has 2x connections as well. Tested with/without multipathing, with/without the fiber switch involved, swapped fiber cables, and fiddled with a handful of ESXi settings with varying results.

Anyone seen abnormally high latency in scenarios vaguely similar to the above? Right now we're banking on some kind of firmware bug wrt interaction with esxi 6, but the vendor engineers haven't popped their heads up yet.
 
Feb 25, 2011
16,992
1,621
126
What kind of reporting does the SAN admin software have? I'd be curious about IOPS load per LUN, etc. How many disks does the SAN have? What kind of firmware's it got? (Is it a branded unit with proprietary firmware, or one of those smaller companies that puts together some FOSS software and validates it for their SAN distro?)

What about your iSCSI network? Is it physically distinct? Just its own VLAN? Own subnet? Routed differently? Jumbo Frames enabled?

Are you talking to the "vendor engineers" from a VAR/reseller, or to the manufacturer's support directly?

With ours, we will see high latency if the IOPS load gets too high, and it is, in fact, possible for a single VM on a single LUN to go bonkers and cause latency warnings for everything else on that physical stripe (so not even just its own LUN.)

We actually have four VMs that we have to quarantine. They do some kind of hadoop thing; except they have to live on their own LUN. Between the four of them, it's very steady, ~1k IOPS, 5-10MB/sec, 24/7, and every VM on the LUN with them gets 35-50ms latency instead of 5-10.

So, personal biases talking, but I'd believe that's more likely than the thick/thin LUN performance thing - I think that might be a coincidence.
 

[DHT]Osiris


Reporting is okay, it's 'enterprisey' so you get what they provide. Most of my determination of read latency issues has come from its own reporting, and vmware's esxtop info, both of which report extremely high latency in read scenarios of the thin lun (50ms-100ms reads) vs thick (10-20ms reads).

IOPs load isn't overtly bad unless I specifically stress it with IOmeter, most of our testing has been with very small workloads on the actual LUN (1-3 VMs). If I crank it with like, 4KB sequential reads it goes very high (as expected), >15,000 IOPs on thick and thin.

34 disk array, though right now it's got 7 hot spares (due to a mis-order of disks), more on the way to alleviate that and bump us up another few TB. Currently set as an 8+1 RAID5, 3x raid groups. Branded unit, probably some mush of proprietary and *nix firmware/software (it's got a *nix interface locked down with a 'commercial' CLI command system).

Oddly, we started working through the vendor directly but spun our wheels with their front line desk support before finally going through the reseller's (much more knowledgeable) engineers and got back-channeled from there.

It's definitely possible it's a weird IOPs issue, but that wouldn't explain the very repeatable and predictable differences between thin and thick LUNs.

Here's an example of data flow from an email I sent the engineers/intermediary. DelayedACK is a setting within VMware's advanced settings that I was playing with (sometimes it's unsupported or causes issues with SANs); default is on, 'altered' is off. 'iSCSI' refers to an advanced setting within ESXi 6.0, specifically 'iSCSI.MaxIOSizeKB', which changes the maximum size of the blocks sent over iSCSI to the SAN (adjustable from 128KB (default) to 512KB, which is what I set it to in the 'altered' sections). Thin/thick refer to the LUN; all tests were done on a thin volume within VMware (no significant differences when tested with thick volumes under any scenario). 'LUN IOPs' at the bottom refers to VMware's round-robin iSCSI functionality. It does a path handoff every 1,000 IOs by default, but you can change it to 1 to have it do a no-kidding round robin per request:
Unaltered DelayedACK/iSCSI, thin provisioned
single thread 10GB - 100-250MB/s, variable. Sub-100MB/s after 20%. Some fluctuation, mostly sub-100MB/s though.
single thread 5GB - 100-175 MB/s, significant sub-100MB/s drops after 70% or so.
multi thread 8GB - 80MB/s avg

Altered DelayedAck/iSCSI, thin provisioned
single thread 10GB - 80MB/s
single thread 5GB - 80MB/s
multi thread 8GB - 80MB/s avg

Unaltered DelayedACK/iSCSI, thick provisioned
single thread 10GB - 150-200MB/s until 80% (8GB), drops to sub-100
single thread 5GB - 175-200MB/s
multi thread 8GB - 536MB/s avg - 620MB/s avg

Altered DelayedAck/iSCSI, thick provisioned
single thread 10GB - 400-500MB/s until 80% (8GB), drops to 150MB/s?
single thread 5GB - 500MB/s
multi thread 8GB - 804MB/s avg

Vmotion between LUNs hovers around 110MB/s, either direction.

Setting LUN IOPs to 1000 (default) instead of 1 (altered) results in a 10-40% loss in speed depending on test parameters.
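
To picture why that round-robin limit could matter, here's a rough Python sketch of the path handoff described above (move to the next path every N IOs). It's a toy model under stated assumptions, not VMware's actual implementation: the function names, the two-path setup, and the idea that a burst of 32 outstanding IOs is issued back to back are all made up for illustration.

```python
from collections import Counter


def path_for_io(io_index: int, n_paths: int, iops_limit: int) -> int:
    """Path used by the io_index-th IO when the policy hands off to the
    next path after every `iops_limit` IOs (toy model)."""
    return (io_index // iops_limit) % n_paths


def paths_in_window(start: int, window: int, n_paths: int, iops_limit: int) -> Counter:
    """Count how many of `window` consecutive IOs land on each path."""
    return Counter(
        path_for_io(i, n_paths, iops_limit) for i in range(start, start + window)
    )


if __name__ == "__main__":
    # Hypothetical burst of 32 outstanding IOs over 2 paths.
    for limit in (1000, 1):
        print(f"limit={limit}: {dict(paths_in_window(0, 32, 2, limit))}")
```

In this model a burst of 32 IOs sits entirely on one path with the default limit of 1,000, and splits evenly across both paths with a limit of 1. Whether that fully accounts for the 10-40% difference here is a guess.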

In this scenario, the default ESXi configuration for delayedack and iscsi.maxiosizekb, along with a thin provisioned LUN (which is what the vendor's software creates by default; you have to use the CLI to create a thick one), provides the worst performance. Some speed can be scraped back by adjusting certain settings, but the largest gain comes from altering all settings and shifting to a thick LUN. We saw similar performance on RAID6 as well.

EDIT: single vs multi from above... single thread copying was done via a simple .iso file copy on the desktop (5GB file and 10GB file), multithread was done via a robocopy /MT:32 of 500 misc sized files, 8GB total size.
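
For anyone wanting to repeat the single-file copy test without eyeballing the Explorer dialog, a minimal Python sketch of timing a copy and reporting MB/s might look like this. The function name is hypothetical, it reports decimal megabytes, and it does nothing to bypass OS caching, so small files will mostly measure RAM.

```python
import os
import shutil
import time


def timed_copy_mb_per_s(src: str, dst: str) -> float:
    """Copy src to dst and return the average rate in MB/s (decimal megabytes).

    This is wall-clock time for the whole copy, so it includes any
    OS-level caching effects rather than excluding them.
    """
    size_bytes = os.path.getsize(src)
    start = time.perf_counter()
    shutil.copyfile(src, dst)
    elapsed = time.perf_counter() - start
    return size_bytes / 1_000_000 / elapsed
```

Running it against a file well beyond the VM's RAM (like the 10GB .iso above) should give a more honest number than a small file.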
 
Feb 25, 2011
16,992
1,621
126
RAID5 is usually terrible for write performance. Do you have any storage tiering options? (Say, write everything to a RAID-10 group and then have the SAN migrate not-frequently-used data to RAID5 later?)

If you're copying files to a (Windows?) server VM, rather than benching the datastores directly from the ESX host, my assumption is that you'll get inconsistent performance due to guest OS RAM caching. :\
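
To put a rough number on the RAID5 write concern, here's a back-of-the-envelope Python sketch using the commonly quoted write penalties (2 back-end IOs per small random write for RAID-10, 4 for RAID5, 6 for RAID6). The 150 IOPS per disk is purely an assumed figure for illustration, and the penalty mainly applies to small random writes; full-stripe sequential writes can largely avoid it.

```python
# Commonly quoted back-end IOs per small random front-end write.
WRITE_PENALTY = {"RAID10": 2, "RAID5": 4, "RAID6": 6}


def effective_iops(raw_iops: float, write_fraction: float, raid_level: str) -> float:
    """Front-end IOPS a RAID group can sustain given its raw back-end IOPS
    and the fraction of the workload that is writes (0.0 to 1.0)."""
    penalty = WRITE_PENALTY[raid_level]
    read_fraction = 1.0 - write_fraction
    return raw_iops / (read_fraction + write_fraction * penalty)


if __name__ == "__main__":
    # Assumption: 27 disks in RAID groups at a hypothetical 150 IOPS each.
    raw = 27 * 150
    for level in ("RAID10", "RAID5", "RAID6"):
        print(level, round(effective_iops(raw, 0.3, level)))
```

This is only a rule of thumb; controller cache and write coalescing on a real array will move the numbers around.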
 

[DHT]Osiris


Nah, raid5 or raid6. This vendor does tiering via additional storage arrays and internal tiering via software (nearline SAS, SAS, and SSD). The only options for the existing disks are the general usage SAS, either raid5 or raid6. For what it's worth though, the raid5 writes are quite good as long as it's not reading from a thin LUN. VMotioning a VM from thick -> thick LUN runs at about twice the speed of thin -> thick or thick -> thin (implying it might not just be read latencies causing the problem).

Yeah, some inconsistencies come up from OS ram caching, hence the 5GB vs 10GB file sizes (10GB always runs out of OS ram before finishing). The multithread copies are done unbuffered so no RAM involved there.. but either way with this, it should be giving identical performance on each.
 

[DHT]Osiris

Because I personally like to see things like this updated, I'll provide some more information with how things have been going on this little project.

We've done more testing at the behest of the SAN vendor's engineer(s), specifically with IOMeter to kinda 'scrape away' OS level stuff (caching, etc). I did tons of testing with different block sizes on thin/thick provisioned LUNs, along with some specific ESXi host setting alterations which proved to affect speeds to an extent. Below is basically the email I sent to the vendor with the data gathered.

As a note, the vmware setting I was referencing in the below is called 'ISCSI.MaxIOSizeKB', and it's a setting found in the advanced settings section of the ESXi host. Near as I can tell, it's an esxi6 setting, there's zero documentation from vmware on it, and it works as expected. Default is 128KB which sends 128KB blocks to the SAN for processing, it can be increased to either 256KB or 512KB. In our specific scenario, we see an increase in speed from that.
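
As a rough illustration of what that setting changes, here's a small Python sketch of how many requests the SAN would see per guest IO. It assumes a larger guest IO is simply chopped into chunks no bigger than the limit, which is my reading of the behavior rather than anything documented; the function name is made up.

```python
import math


def iscsi_requests_per_guest_io(guest_io_kb: int, max_io_size_kb: int) -> int:
    """Number of requests sent to the SAN for one guest IO, assuming the
    host splits it into chunks of at most max_io_size_kb."""
    return math.ceil(guest_io_kb / max_io_size_kb)


if __name__ == "__main__":
    for limit_kb in (128, 512):
        for io_kb in (4, 512, 2048):
            n = iscsi_requests_per_guest_io(io_kb, limit_kb)
            print(f"{io_kb}KB guest IO at {limit_kb}KB limit -> {n} request(s)")
```

Under that assumption a 512KB IO becomes 4 requests at the 128KB default and 1 request at 512KB, while 4KB IOs are unaffected either way.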

In discussion, the vendor engineer steered me away from the unaltered settings since, according to him, 'delayedack on' isn't a configuration the vendor supports (despite there being no documentation to this effect), so he scratched all results from 'unaltered' hosts. He also scratched the 2MB reads/writes on the basis that they're outside the SAN's supported block sizes. I personally disagree with this, as Hyper-V (something we tested prior) happily sends multi-GB block sizes. I can't really find information on that, so I can only assume it's a difference in storage mentality between MS and VMware. Anyhow, he ditched the results for the unaltered host tests, ditched the 2MB results, and narrowed in on the 4KB/512KB reads/writes at 32 queue depth for thin and thick. Discussions are ongoing, but he seems to be leaning toward 'performance as expected', despite my protests that this is absurd to me. As I pointed out to him, there's a dramatic difference between thin and thick LUNs on 512KB SEQ reads at 1 queue depth (30MB/s vs 370MB/s) and at 32 queue depth (175MB/s vs 1600MB/s).

I'll freely state that I'm not a storage admin primarily, but those numbers just look wrong. And going beyond 'artificial testing', the performance is just crap within the VM infrastructure itself.

Testing parameters are as follows:

Tests were performed within a Windows Server 2016 VM, 1GB test file located on the C: drive. All tests are performed using sequential reads/writes. Workers are configured as # of CPUs (the VM itself has 8 cores), but I only ever see a single Dynamo.exe in resource monitor.. I’m assuming it’s functionally multi-core though. All tests were performed for at least 30 seconds, some for longer to normalize the speeds, as some amount of slowdown at the beginning was skewing the numbers lower than they realistically were performing as (specifically with 2MB block sizes). Current configuration is RAID5, 8+1 disk configuration, 4x disk groups. LUNs are thin and thick, both configured identically (15TB) without utilizing the ‘vmware configuration’ part of the CLI, so it’s just iSCSI host -> Block LUN. The thick LUN was configured within CLI, thin within the web GUI.

For the below numbers, SEQ READ and SEQ WRITE are sequential read and write, respectively, OS IO is ‘outstanding IO’s’ which as far as I know is the queue depth. Avg RT is average response time.
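
The three columns hang together in a predictable way, which makes for a handy sanity check. A small Python sketch, assuming MB means decimal megabytes (that seems to match the reported figures) and using Little's law (average response time ≈ outstanding IOs / IOs per second); the function names are just for illustration:

```python
def throughput_mb_per_s(iops: float, block_kb: int) -> float:
    """Throughput in decimal MB/s for a given IO rate and block size in KB."""
    return iops * block_kb * 1024 / 1_000_000


def expected_latency_ms(iops: float, outstanding_ios: int) -> float:
    """Average response time in ms implied by Little's law."""
    return outstanding_ios / iops * 1000


if __name__ == "__main__":
    # 512KB SEQ READ, 32 outstanding IOs, thin LUN, unaltered host: 505 IO/s
    print(round(throughput_mb_per_s(505, 512), 1), "MB/s")
    print(round(expected_latency_ms(505, 32), 1), "ms")
```

For the 505 IO/s thin LUN row in the results, this works out to roughly 265MB/s and 63ms, in line with what IOMeter reported, so the high latency is the flip side of the low IO rate rather than a separate symptom.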

‘Unaltered DelayedAck/ISCSI’ is our host with unchanged (from default) delayedack and iscsi.maxiosizeKB settings (on, and 128KB respectively). ‘Altered’ is on our host with changed settings (off, and 512KB, respectively). I’ve included screenshots showing an example of the proof of change of the IO block sizes within ESXi from within the Unisphere performance monitoring (specifically LUN I/O size, reads and writes).

Of specific note, I’d like to highlight a) the difference between performance numbers on thin and thick LUNs given the same settings, b) the rather drastic latency differences between thin and thick, and c) the increase in write speeds vs read on the thin LUN (from my experience, that’s always switched).

Unaltered DelayedAck/iSCSI, Thin LUN --- Limited to 128KB LUN IO sizes on perf monitoring
4KB SEQ READ, 1 OS IO - 1550 IO/s, 6.33MB/s, .64ms avg RT
4KB SEQ READ, 32 OS IO - 15000 IO/s, 62MB/s, 2.1ms avg RT
512KB SEQ READ, 1 OS IO - 60 IO/s, 31.5MB/s, 16ms avg RT
512KB SEQ READ, 32 OS IO - 505 IO/s, 265MB/s, 63ms avg RT
2MB SEQ READ, 1 OS IO - 25 IO/s, 50MB/s, 43ms avg RT
2MB SEQ READ, 32 OS IO - 120 IO/s, 250MB/s, 260ms avg RT

4KB SEQ WRITE, 1 OS IO - 1100 IO/s, 4.5MB/s, .90ms avg RT
4KB SEQ WRITE, 32 OS IO - 16000 IO/s, 65MB/s, 2ms avg RT
512KB SEQ WRITE, 1 OS IO - 55 IO/s, 30MB/s, 18ms avg RT
512KB SEQ WRITE, 32 OS IO - 450 IO/s, 240MB/s, 70ms avg RT
2MB SEQ WRITE, 1 OS IO - 20 IO/s, 41MB/s, 50ms avg RT
2MB SEQ WRITE, 32 OS IO - 195 IO/s, 400MB/s, 170ms avg RT


Altered DelayedAck/iSCSI, Thin LUN --- Greater than 128KB LUN IO sizes
4KB SEQ READ, 1 OS IO - 1500 IO/s, 6.25MB/s, .65ms avg RT
4KB SEQ READ, 32 OS IO - 16200 IO/s, 66MB/s, 1.9ms avg RT
512KB SEQ READ, 1 OS IO - 23 IO/s, 12.5MB/s, 41ms avg RT
512KB SEQ READ, 32 OS IO - 335 IO/s, 175MB/s, 95ms avg RT
2MB SEQ READ, 1 OS IO - 12 IO/s, 25MB/s, 83ms avg RT
2MB SEQ READ, 32 OS IO - 105 IO/s, 220MB/s, 304ms avg RT

4KB SEQ WRITE, 1 OS IO - 925 IO/s, 3.75 MB/s, 1ms avg RT
4KB SEQ WRITE, 32 OS IO - 16000 IO/s, 65MB/s, 2ms avg RT
512KB SEQ WRITE, 1 OS IO - 35 IO/s, 18.5MB/s, 28ms avg RT
512KB SEQ WRITE, 32 OS IO - 275 IO/s, 145MB/s, 115ms avg RT
2MB SEQ WRITE, 1 OS IO - 20 IO/s, 40 MB/s, 52ms avg RT
2MB SEQ WRITE, 32 OS IO - 145 IO/s, 300MB/s, 220ms avg RT

Unaltered DelayedAck/iSCSI, Thick LUN
4KB SEQ READ, 1 OS IO - 2200 IO/s, 9.00MB/s, .45ms avg RT
4KB SEQ READ, 32 OS IO - 15000 IO/s, 61.5MB/s, 2.1ms avg RT
512KB SEQ READ, 1 OS IO - 700 IO/s, 370MB/s, 1.4ms avg RT
512KB SEQ READ, 32 OS IO - 3300 IO/s, 1770MB/s, 10.3ms avg RT
2MB SEQ READ, 1 OS IO - 440 IO/s, 900MB/s, 2.2ms avg RT
2MB SEQ READ, 32 OS IO - 730 IO/s, 1500MB/s, 43ms avg RT

4KB SEQ WRITE, 1 OS IO - 1270 IO/s, 5.2MB/s, .78ms avg RT
4KB SEQ WRITE, 32 OS IO - 12000 IO/s, 49MB/s, 2.65ms avg RT
512KB SEQ WRITE, 1 OS IO - 525 IO/s, 280MB/s, 1.8ms avg RT
512KB SEQ WRITE, 32 OS IO - 2290 IO/s, 1200MB/s, 14ms avg RT
2MB SEQ WRITE, 1 OS IO - 270 IO/s, 565MB/s, 3.7ms avg RT
2MB SEQ WRITE, 32 OS IO - 540 IO/s, 1140MB/s, 60ms avg RT

Altered DelayedAck/iSCSI, Thick LUN
4KB SEQ READ, 1 OS IO - 2450 IO/s, 10MB/s, .40ms avg RT
4KB SEQ READ, 32 OS IO - 13800 IO/s, 56.25MB/s, 2.3ms avg RT
512KB SEQ READ, 1 OS IO - 600 IO/s, 315 MB/s, 1.67ms avg RT
512KB SEQ READ, 32 OS IO - 3065 IO/s, 1600MB/s, 10.5ms avg RT
2MB SEQ READ, 1 OS IO - 250 IO/s, 525 MB/s, 5.75ms avg RT
2MB SEQ READ, 32 OS IO - 787 IO/s, 1650 MB/s, 40.7ms avg RT

4KB SEQ WRITE, 1 OS IO - 1160 IO/s, 4.75MB/s, .86ms avg RT
4KB SEQ WRITE, 32 OS IO - 13500 IO/s, 55.25MB/s, 2.3ms avg RT
512KB SEQ WRITE, 1 OS IO - 495 IO/s, 259MB/s, 2.02ms avg RT
512KB SEQ WRITE, 32 OS IO - 2700 IO/s, 1415MB/s, 11.82ms avg RT
2MB SEQ WRITE, 1 OS IO - 220 IO/s, 464MB/s, 4.5ms avg RT
2MB SEQ WRITE, 32 OS IO - 675 IO/s, 1415MB/s, 47ms avg RT
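
To boil the 'altered' results down to something easier to compare, here's a short Python sketch that computes the thick-to-thin throughput ratio for a handful of rows. The MB/s values are copied from the altered thin and thick sections above; the dictionary layout and function name are just for illustration.

```python
# (test, outstanding IOs) -> (thin MB/s, thick MB/s), altered host settings
ALTERED_MB_PER_S = {
    ("4KB read", 32): (66.0, 56.25),
    ("512KB read", 1): (12.5, 315.0),
    ("512KB read", 32): (175.0, 1600.0),
    ("512KB write", 1): (18.5, 259.0),
    ("512KB write", 32): (145.0, 1415.0),
}


def thick_over_thin(thin_mb_s: float, thick_mb_s: float) -> float:
    """How many times faster the thick LUN is than the thin LUN."""
    return thick_mb_s / thin_mb_s


if __name__ == "__main__":
    for (test, depth), (thin, thick) in ALTERED_MB_PER_S.items():
        ratio = thick_over_thin(thin, thick)
        print(f"{test}, {depth} OS IO: thick is {ratio:.1f}x thin")
```

The 4KB rows come out close to even, while the 512KB rows show the thick LUN anywhere from roughly 9x to 25x faster, which is the gap I keep pointing at.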