Fibre Channel Abort - could an application/OS cause it?

Jeff7181

Lifer
Aug 21, 2002
18,368
11
81
Every now and then we get a notification about an abort on our fibre channel network. It's rarely more than one, if it's ever been more than one.

I was told this is usually a physical layer issue - data getting corrupted on the wire... or... in the tube, in this case since I'm talking fibre channel.

I'm wondering if anyone has ever seen this caused by an application, driver or OS layer event. I ask because our DBAs occasionally see an alert from SQL Server about being unable to perform IO for 15 seconds. These abort events last nanoseconds... so I find it hard to believe that an abort is the cause. I'm wondering if it could be an application, driver or OS layer issue, with a buffer getting full or something and finally crapping out and sending a corrupt chunk of data to the FC network.
 

Brovane

Diamond Member
Dec 18, 2001
6,390
2,581
136
I have never seen this on our FC networks. Is the error on the FC switch?
 

Jeff7181

Lifer
Aug 21, 2002
18,368
11
81
The abort is being reported by a FC monitoring device we have from Virtual Instruments which is actually a hardware tap... splits off 30% of the light, lets us see what's happening at the physical layer, even digs into the SCSI and FC protocol and gives us data like this, and also real latency rather than just what our SAN reports.
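
For anyone curious what a 30% tap costs in optical budget, here is a rough sketch of the arithmetic. It assumes an ideal 70/30 splitter with no excess loss, which a real tap will not quite match, and the function name is just made up for illustration.

```python
import math

def split_loss_db(fraction):
    """Loss in dB for a path that receives `fraction` of the input optical power."""
    return -10 * math.log10(fraction)

tap_fraction = 0.30  # share of the light diverted to the monitoring device

through_loss = split_loss_db(1 - tap_fraction)  # path that continues to the SAN
tap_loss = split_loss_db(tap_fraction)          # path that feeds the monitor

print(f"through path: {through_loss:.2f} dB, tap path: {tap_loss:.2f} dB")
# prints: through path: 1.55 dB, tap path: 5.23 dB
```

So under that idealized assumption the live link gives up roughly 1.5 dB to the tap, which is worth keeping in mind when judging how much margin is left for dirty connectors.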
 

Jeff7181

Lifer
Aug 21, 2002
18,368
11
81
We have a work request submitted to clean the cable ends and ports during our next maintenance window, as VI has told us that these types of errors are usually caused by dirty optics or kinked cables. I just find it hard to believe that there's been this one abort in the past 6-9 months and it's due to dirty optics AND that it caused a 15 second delay in IO. I'm thinking it sounds more likely that a driver or something at the software layer was having issues for a while and finally crapped the bed and sent malformed data, which was seen as a corrupt bit or something at the physical protocol, so an abort was requested, accepted and then everything moved on.

According to this same monitoring tool, there was no increase in latency at this time, no decrease in latency and no change in IOPS. It was pretty steady at 1-2ms and IIRC 200-300 IOPS on this one link.
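
One way to test whether the aborts and the 15 second IO alerts are even related is to line up their timestamps. Below is a minimal sketch, assuming you can export both sets of timestamps from the monitoring tool and the SQL Server error log. The function name, the 30 second window and the sample timestamps are all hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta

def aborts_near_alerts(abort_times, alert_times, window=timedelta(seconds=30)):
    """For each alert, list the aborts seen in the `window` leading up to it."""
    matches = {}
    for alert in alert_times:
        matches[alert] = [a for a in abort_times if alert - window <= a <= alert]
    return matches

# Hypothetical timestamps, not real data from this environment.
aborts = [datetime(2024, 1, 10, 3, 15, 2)]
alerts = [datetime(2024, 1, 10, 3, 15, 20), datetime(2024, 1, 12, 14, 0, 0)]

for alert, nearby in aborts_near_alerts(aborts, alerts).items():
    print(alert.isoformat(), "->", [a.isoformat() for a in nearby])
```

If most alerts have no abort anywhere near them, that would suggest the two are unrelated. If they line up every time, the abort is at least involved, though that alone would not say which side caused it.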
 

Brovane

Diamond Member
Dec 18, 2001
6,390
2,581
136
Have you checked your HBA driver/firmware versions to see if they are out of date?
 

Jeff7181

Lifer
Aug 21, 2002
18,368
11
81
On the particular server which experienced the abort I was referencing this time, yes, everything is up to date.

I could contact support and begin the troubleshooting process after cleaning the optics and whatnot, but I'm actually more interested in anecdotal experiences - has anyone ever seen an abort on an FC network that was not caused by a physical issue (e.g. dirty optics, kinked cable, etc.)?