Our SAN is handled by a different team, so while I don't have much practical experience w/ SAN, I do have a few questions:
1. I know SAN needs to be lossless, but exactly what happens when a frame/packet is dropped?
Specifics depend on the protocol used, but afaik all protocols will verify receipts and reattempt if there's a problem.
My team manages the DWDM infrastructure, which carries some of the SAN traffic between datacenters.
Once in a while there's an issue w/ one leg of the dark fiber, and we need to shut down that portion of the ring.
All the SAN traffic in flight would be dropped.
What are the negative impacts when this happens?
==========
Lag, mostly. The SANs I'm familiar with all use iSCSI replication for SAN->SAN DR, so TCP/IP handles the send/ack stuff, resends the missing bits, etc. The SANs also talk to each other and make sure that everything is up to date. So the net result, if everything is working the way it's supposed to, is not much of anything at all. At least from an end user perspective. Our SANs don't even notice/alert/log brief outages from things like switch reboots.
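If it helps to picture the "resends the missing bits" part, here's a toy Python sketch of the retry pattern (not real iSCSI or TCP, just the general ack/retry idea): a dropped frame costs extra round trips, not data.

```python
# Toy sketch: why a dropped frame just means lag, not loss.
# The sender keeps retransmitting an unacknowledged block until the ack
# arrives, so everything above it only ever sees extra latency.
import random

def send_over_lossy_link(block_id, drop_rate=0.3):
    """Pretend to transmit one replication block; sometimes the frame is lost."""
    return random.random() > drop_rate  # True = ack came back

def replicate(blocks, drop_rate=0.3):
    attempts = 0
    for block_id in blocks:
        while True:                      # retry until acknowledged
            attempts += 1
            if send_over_lossy_link(block_id, drop_rate):
                break                    # ack received, move on to the next block
    return attempts

if __name__ == "__main__":
    blocks = list(range(100))
    attempts = replicate(blocks, drop_rate=0.3)
    print(f"{len(blocks)} blocks delivered in {attempts} transmissions "
          f"({attempts - len(blocks)} retries = lag, zero data loss)")
```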
Your SAN admins will probably be able to tell you specifics re: failover timer settings, that sort of thing. And they'll be able to tell you what replication protocol they're running.
2. Some servers have FC HBAs w/ two ports.
One goes to SAN-A, and one goes to SAN-B.
How does a server know which port to use?
Is it determined by the OS or the HBA driver, similar to an active-standby NIC team in the Ethernet world?
TIA
==========
Are SAN A and SAN B really different SANs? Or is it a dual-controller system? Because the answer is slightly different depending.
Basically, Fibre Channel devices have something called World Wide Names, or WWNs (like a MAC address for Ethernet).
A SAN controller will map a volume to a specific WWN, and say to that WWN, "This is your LUN." If there's a switch involved, then those WWNs are set up into different zones. (A zone is sort of like a VLAN, except the same WWN can be in multiple zones.) Zones are used both to manage traffic and for access control.
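As a rough Python sketch of how those two layers stack (every WWN, zone name, and LUN here is made up): the switch zoning decides who is allowed to talk, and the controller's LUN masking decides what they get to see.

```python
# Simplified model of zoning + LUN masking. All identifiers are invented.
# Note a WWN can appear in more than one zone, unlike a port in a single VLAN.

# Controller-side masking: which initiator WWN is allowed to see which LUNs.
lun_masking = {
    "10:00:00:00:c9:aa:bb:01": ["LUN0", "LUN1"],   # server A, HBA port 1
    "10:00:00:00:c9:aa:bb:02": ["LUN0", "LUN1"],   # server A, HBA port 2
    "10:00:00:00:c9:cc:dd:01": ["LUN7"],           # server B
}

# Switch-side zoning: which WWNs are allowed to talk to which target ports.
zones = {
    "zone_serverA_ctrl1": {"10:00:00:00:c9:aa:bb:01", "50:00:00:00:11:11:11:11"},
    "zone_serverA_ctrl2": {"10:00:00:00:c9:aa:bb:02", "50:00:00:00:22:22:22:22"},
    "zone_serverB_ctrl1": {"10:00:00:00:c9:cc:dd:01", "50:00:00:00:11:11:11:11"},
}

def visible_luns(initiator_wwn, target_wwn):
    """LUNs the initiator can reach via this target port: it has to share a
    zone with the target AND appear in the controller's masking table."""
    zoned_together = any(
        {initiator_wwn, target_wwn} <= members for members in zones.values()
    )
    if not zoned_together:
        return []
    return lun_masking.get(initiator_wwn, [])

print(visible_luns("10:00:00:00:c9:aa:bb:01", "50:00:00:00:11:11:11:11"))  # ['LUN0', 'LUN1']
print(visible_luns("10:00:00:00:c9:cc:dd:01", "50:00:00:00:22:22:22:22"))  # [] -- not zoned
```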
You could certainly have a server plugged into two completely different SANs and accessing LUNs on both.
In a dual-controller setup, both controllers have access to the same disk enclosures, and know about all of the volumes that the other controller is serving up, and to whom. When the other controller goes down, the remaining controller spends a little while (60 seconds is the default for our gear) mourning the loss of its friend, before taking control and restoring access to the LUNs that were being managed by the other controller. But that also means that both controllers need to have a path to the server!
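Here's a toy Python sketch of that takeover logic, assuming a simple heartbeat timer (the 60 seconds is just the default on our gear, not any kind of standard):

```python
# Toy sketch of dual-controller takeover after a heartbeat timeout.
FAILOVER_TIMEOUT = 60  # seconds of silence before the survivor takes over

class Controller:
    def __init__(self, name, owned_luns):
        self.name = name
        self.owned_luns = set(owned_luns)

    def check_peer(self, peer, seconds_since_peer_heartbeat):
        """Take over the peer's LUNs once it has been silent long enough."""
        if seconds_since_peer_heartbeat < FAILOVER_TIMEOUT:
            return  # still "mourning" -- the peer's LUNs are unreachable for now
        print(f"{self.name}: taking over {sorted(peer.owned_luns)}")
        self.owned_luns |= peer.owned_luns
        peer.owned_luns.clear()

ctrl_a = Controller("ctrl-A", ["LUN0", "LUN1"])
ctrl_b = Controller("ctrl-B", ["LUN2"])

ctrl_b.check_peer(ctrl_a, seconds_since_peer_heartbeat=30)  # nothing happens yet
ctrl_b.check_peer(ctrl_a, seconds_since_peer_heartbeat=61)  # ctrl-B now serves all three
print(ctrl_b.owned_luns)
```

The last line of the narrative still applies: the takeover only helps if the server has a working path to the surviving controller.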
Either way: the HBA on the server will then present the LUN to the server OS as though it were a local volume. The problem is that the HBA is stupid. It gets the "This is your LUN" message from the SAN controller, but it'll get a copy for each possible path (multipathing) to the SAN over the fabric. And it passes all that to the OS.
So an MPIO-capable OS will do some magic to keep track of which paths go to which LUNs, which paths go to the same LUN, and which ones it's going to use. FreeBSD (gmultipath) does it by putting a little metadata label at the very end of the disk. Other OSes may do it differently, I dunno, but it works.
The server can use multiple paths in either an active/active or an active/passive arrangement, depending on how you want to configure it.
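A rough Python sketch of what that MPIO "magic" boils down to (path names, LUN serials, and policy names here are illustrative, not any particular vendor's API): collapse the duplicate paths by the LUN's unique identity, then hand out I/O according to policy.

```python
# Rough sketch of an MPIO layer: group the pile of paths the HBA reports
# by the LUN behind them, then pick paths per active/passive or
# active/active policy. All identifiers are made up.
from collections import defaultdict
from itertools import cycle

# Each path the HBA reports: (path name, unique ID of the LUN it leads to)
reported_paths = [
    ("hba0:ctrl-A:port1", "lun-serial-0001"),
    ("hba0:ctrl-B:port1", "lun-serial-0001"),
    ("hba1:ctrl-A:port2", "lun-serial-0001"),
    ("hba1:ctrl-B:port2", "lun-serial-0002"),
]

# Step 1: collapse duplicate paths into one logical disk per LUN.
luns = defaultdict(list)
for path, lun_id in reported_paths:
    luns[lun_id].append(path)

# Step 2: choose paths according to policy.
def active_passive(paths):
    """Use the first path; the rest sit idle as standby for failover."""
    return paths[0]

def active_active(paths):
    """Round-robin I/O across every available path."""
    return cycle(paths)

for lun_id, paths in luns.items():
    rr = active_active(paths)
    print(f"{lun_id}: one logical disk behind {len(paths)} path(s)")
    print(f"  active/passive primary : {active_passive(paths)}")
    print(f"  active/active rotation : {next(rr)} -> {next(rr)}")
```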
If you are running an OS that is NOT MPIO-capable, then you do some tricks with the SAN controller or the fiber zoning to make sure the server only sees a single path.
But that's a bad idea, because there'd be no failover. So what you'd REALLY do, probably, is make an ESX host and virtualize your non-MPIO OS. VMware can pass through raw LUNs (Raw Device Mapping, or RDM) to a specified guest VM, and the host handles the multipathing.