
Metro Optical Ethernet troubles, please help - FIXED!

randal

Golden Member
OK, please see the drawing here:

http://isis.data102.com/~randal/moedrawing2.gif
Edit!! - updated with much more information
Edit2!! - new diagram

Basically we have a DS3 and are switching to Metro Optical Ethernet, AKA Ethernet over SONET, for performance, cost and multiple-access reasons.

Problem is that traffic from Denver to Colorado Springs is working perfectly, while traffic from Colorado Springs to Denver is very slow, only 3-4 Mbps.

I am doing my testing with iperf and ttcp, testing all possible combinations of the servers listed there. All of the COS servers test to each other (locally) at 90+ Mbps, and all of the Denver servers test to each other at 90+ Mbps. This indicates that the machines -can- put out 100 Mbps reliably.

I have verified that all switchports involved are 100/full. I am taking no errors on any switchports or server interfaces.

Our LEC came out with a sweet tberd-like device and was able to put 100mbps across the link and receive the same (tried loopback + head-to-head) without issue, at MTUs ranging from 1400 to 1522. It all comes back 100% and clean.

I have tried everything, from adjusting TCP receive windows to tweaking MTUs. I have tried bypassing the switch on both sides, one at a time and simultaneously. I am totally at wits' end, and have no idea where to go from here. Any suggestions would be very appreciated.
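Re: the receive-window tuning, a quick bandwidth-delay-product sanity check shows how big the window would need to be to fill the pipe (the 100 Mbps rate and a few-millisecond RTT are assumptions for illustration):

```python
# Rough bandwidth-delay-product check: the minimum TCP window needed
# to keep a link full, and the throughput a too-small window caps you at.

def bdp_bytes(link_bps: float, rtt_s: float) -> float:
    """Minimum in-flight data (bytes) needed to keep the pipe full."""
    return link_bps * rtt_s / 8

def window_limited_bps(window_bytes: float, rtt_s: float) -> float:
    """Max throughput (bits/s) a fixed window allows at a given RTT."""
    return window_bytes * 8 / rtt_s

link = 100e6     # 100 Mbps Metro Ethernet (assumed line rate)
rtt = 0.004      # ~4 ms RTT (assumption)

print(bdp_bytes(link, rtt))            # -> 50000.0 bytes (~50 KB)
print(window_limited_bps(16384, rtt))  # -> 32768000.0 (~33 Mbps on a 16 KB window)
```

Worth noting: even a modest 16 KB window still allows ~33 Mbps at a 4 ms RTT, so a window limit alone is unlikely to explain a 3-4 Mbps ceiling.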
 
It really does sound like a duplex problem. I know you checked, but it's got all the signs of one. You could test clean and still have a duplex mismatch unless you have access to their gear. Maybe ask them and make sure how you are supposed to set it (auto, or forced).

are you tagging anything (dot1q) from your switches going into the metro?

Are the 7200s setup to bridge? If so, check spanning-tree.

I'd look at all devices and record the spanning-tree setting like who's root, what's blocked, what mode they are running, etc.

-edit- one last thing, what interface are they giving you? 100 Base-T? 100 Base-FL?
 
They told me everything should be forced 100/full, so that is how it has been set. I will check and triple check that again. The thing that gets me is that when a loopback plug is put on the DEN side, we test w/ the tberd from COS and we get 100mbps in -both- directions; if there were a duplex issue, I think it would choke on that test? Not sure.

We are not doing any trunking on the link yet, as all of the COS & DEN hosts are on a single VLAN. The only dot1q happening is from the COS switch to the COS c7206b router. That precludes the issue of a > 1518byte packet.

The 7200s right now are not utilizing the MOE; that's just kind of there for illustration. They are routed right now, running iBGP across the DS3 to share the multiple upstreams at each site. One of the internet-bound tests I run is trying to get a DEN host to download a file - it will cross the MOE, hit the COS router, then go out on the internet, just as if it were on the LAN in COS.

I haven't spent a lot of time on the spanning tree, as I didn't think that would be an issue. I know that when they stick a loopback on one end, the other end err-disables, as it should. Even though I don't think it's broken, I'll recheck my stp stuff and make sure it is straightforward.

It is 100 Base-T delivered on cat5. I am using prefab cables everywhere, and have replaced every cable I am able to. LEC Patch panels + xconnects are no-touch, but I made them certify them to 100mbps, and they came through with no issue.
 
only thing I can think of possibly - the 7200s are proxy-arping for hosts on either end, forcing traffic (from a layer-2 perspective, since the router is replying with its MAC for hosts on the other end) through the routers.

But I'm having trouble picturing this in my head - both sites are on the same subnet, but you say you're routing on the 7200s?

so we've eliminated frame sizes from what I'm hearing. maybe run a trace and see if you're getting fragmentation issues.

basically what I'm saying is you've done all you can at layer1 (after you've verified speed/duplex), move up to layer2. Check the arp caches of the hosts to make sure they are using what you believe to be the correct layer2 addresses. check the mac address table of the switches to make sure it's taking the layer2 path you believe it should.
 
The link between the two routers is a routed DS3, not bridged, so I don't see why it would be leaking proxyarp info across it. The servers in Denver are -not- connected to the Denver router; they are on the same VLAN 101 as the Colorado Springs servers, the only gateway for which is c7206b (across the MOE). When I talk to the Denver servers, when they get Internet, whatever, it all goes across the MOE. Right now, essentially the COS LAN is extended via MOE to a remote (denver) location.

(The servers were installed at the Denver location just for testing - they are not normally there. As such, everything is configured so that there is no cleanup router-config-wise when they get pulled)

I checked STP and a couple of the servers were on portfast. I switched that over, and nothing is blocking anywhere. I did the forbidden and turned on CDP everywhere to help with any duplex issues I may be seeing, but nothing has popped up.

I'll start checking MACs here in a sec and let you know.
 
I think a trace would be a really good step. Sometimes if you're scratching your head you just gotta see the trace, ideally on both ends.

Really dumb question (but it could affect your throughput, although doubtful it would bring it down to 3-4 meg) - what are the ping times for 64 and 1500 byte packets?
 
Looks to be serializing fine with a pretty good RTT. As far as a trace, are you talking about a traceroute, a test-station trace, or something else?

anubis# ping -s64 -c3 ra.data102.com && ping -s1450 -c3 ra.data102.com
PING ra.data102.com (38.97.208.21): 64 data bytes
72 bytes from 38.97.208.21: icmp_seq=0 ttl=64 time=2.620 ms
72 bytes from 38.97.208.21: icmp_seq=1 ttl=64 time=2.601 ms
72 bytes from 38.97.208.21: icmp_seq=2 ttl=64 time=2.518 ms

--- ra.data102.com ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 2.518/2.580/2.620/0.044 ms

PING ra.data102.com (38.97.208.21): 1450 data bytes
1458 bytes from 38.97.208.21: icmp_seq=0 ttl=64 time=4.059 ms
1458 bytes from 38.97.208.21: icmp_seq=1 ttl=64 time=4.060 ms
1458 bytes from 38.97.208.21: icmp_seq=2 ttl=64 time=3.976 ms

--- ra.data102.com ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 3.976/4.032/4.060/0.039 ms
anubis#

PING ra.data102.com (38.97.208.21): 1500 data bytes
1508 bytes from 38.97.208.21: icmp_seq=0 ttl=64 time=4.024 ms
1508 bytes from 38.97.208.21: icmp_seq=1 ttl=64 time=4.051 ms
1508 bytes from 38.97.208.21: icmp_seq=2 ttl=64 time=4.096 ms

--- ra.data102.com ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 4.024/4.057/4.096/0.030 ms
anubis#


 
well, we've knocked out latency/serialization.

by trace - packet trace/sniffer. taking a look at tcp to make sure it's running well (no fragmentation, few retransmissions, good windowing, etc)
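For reference, the serialization math behind that conclusion can be sketched like this (the 100 Mbps hop rate and the ~42-byte header overhead are assumptions; note the overhead cancels out of the delta):

```python
# Serialization-delay sanity check for the large-vs-small ping deltas.
# Each store-and-forward hop adds (frame_bits / link_rate) per direction.

def serialization_ms(payload_bytes: int, link_bps: float = 100e6,
                     overhead_bytes: int = 42) -> float:
    """One-way serialization delay in ms for one 100 Mbps hop.
    overhead_bytes (~42: Ethernet + IP + ICMP headers) is an assumption."""
    return (payload_bytes + overhead_bytes) * 8 / link_bps * 1000

# Extra RTT a 1450-byte ping should cost vs a 64-byte ping,
# per store-and-forward hop (both directions):
extra_per_hop = 2 * (serialization_ms(1450) - serialization_ms(64))
print(round(extra_per_hop, 3))   # -> 0.222 ms per hop
```

The observed delta (~4.0 ms vs ~2.6 ms, i.e. ~1.4 ms) is several multiples of that, which would be consistent with a handful of store-and-forward elements in the SONET path rather than anything broken.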
 
I'll have to hook up my laptop + packet sniffer on it and run the same tests and see what I can get. The only problem is that I am seeing the exact same issue between every possible mix - from any server in DEN to any server in COS, it is the exact same situation. That makes me question if it's TCP at all :-/

Also, RE: duplex, I am still scratching my head as I do not have the specifics on auto-negotiation vs. manual settings. On both sides, if I set the switch to auto I get 100/half - which means that the LEC is set to either 100/full, OR they are on auto/auto and our auto-neg isn't working (surprise surprise). If I manually set my side to 100/full, and the LEC is on auto, then they will defer to ... 100/half? *dialing their number now*
 
Autonegotiation works great, IMHO. But the rule is: always force both sides or auto both sides; if not, you will most likely have a duplex mismatch.

But if one side is forced (to half or full) and the other side is auto, the auto side will almost always drop to half duplex - because it is not receiving any autonegotiation information.

But it's normal practice to force critical links like this and I'd be surprised if the LEC has their end auto.
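The rule spidey07 describes can be sketched as a little lookup function (a simplified model of 100BASE-TX behavior, for illustration only):

```python
# Sketch of the duplex-mismatch rule: a side forced to full duplex sends
# no autonegotiation info, so an auto partner parallel-detects the speed
# and falls back to half duplex -> classic mismatch.

def negotiated_duplex(side_a: str, side_b: str) -> tuple[str, str]:
    """Each side is 'auto', 'full', or 'half'. Returns the operating
    duplex of (side_a, side_b)."""
    if side_a == "auto" and side_b == "auto":
        return ("full", "full")      # both advertise and agree on full
    if side_a == "auto":             # b is forced: a sees no negotiation
        return ("half", side_b)      # -> parallel detection, half duplex
    if side_b == "auto":
        return (side_a, "half")
    return (side_a, side_b)          # both forced: whatever was configured

print(negotiated_duplex("auto", "full"))  # -> ('half', 'full'): mismatch
```

This also matches randal's symptom above: his side on auto comes up 100/half, which is exactly what you'd expect against an LEC port forced to 100/full.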
 
That is my experience as well, but my laptop, which is auto/auto, negotiates 100/full when plugged in. Assuming my laptop isn't super awesome, why would it auto to 100/full if the LEC's box is manually 100/full/not-auto? Hmmm.

*waiting on a call back while checking netstat compulsively*
 
OK I am running the tests on the unix machines and it looks like I am getting a LOT of Out of Order when sourcing from Colorado Springs. On the order of ~30%. I'll compile all that, (4 servers) and have you take a look if you don't mind?
 
Well, it looks like it really is an out of order issue, which is then accompanied by retransmits. I would appreciate any additional insight you can give me:

http://isis.data102.com/~randal/moe_ra-seshat.txt
http://isis.data102.com/~randal/moe_ra-anubis.txt
http://isis.data102.com/~randal/moe_thoth-seshat.txt
http://isis.data102.com/~randal/moe_ra-isis.txt

So now that I have a handle on what's wrong, WTF could be causing this? The only thing I can think of is some weird load sharing on the underlying SONET OC-3?? That would make a lot more sense, because then one of the fibers, or the channel on its underlying OC-48, might be broken?
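For what it's worth, the ~30% out-of-order figure above can be computed from a capture with a sketch like this (the sample arrival order is hypothetical):

```python
# Quick out-of-order rate check: given segment sequence numbers in
# arrival order (e.g. pulled from a tcpdump), count segments arriving
# with a lower sequence number than the highest one already seen.

def out_of_order_rate(seqs: list[int]) -> float:
    """Fraction of segments arriving below the highest sequence so far."""
    highest = -1
    ooo = 0
    for s in seqs:
        if s < highest:
            ooo += 1
        else:
            highest = s
    return ooo / len(seqs)

# Hypothetical arrival order showing ~30% reordering:
sample = [1, 2, 4, 3, 5, 7, 6, 8, 10, 9]
print(out_of_order_rate(sample))  # -> 0.3
```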
 
Are you breaking the DS3 down into individual full T1 spans at both ends ? ?
If so the LEC should have set the DS3 Framing as M13
If you are using the entire DS3 pipe at full bandwidth, then you need C-Bit Framing

For what you are trying to do, I think you would want M13 to bring out
a bunch of T1 carriers with 24 DS1 channels on each .. each channel can then
do different jobs .. Voice Trunks, Data Lines, etc

Framing errors will not show up on a loopback test over the T3
Have the LEC check the DS3 Framing Options in the DACS and in the SONET RING

Note: Even though a Cisco Switch or Router can take a DS3 as an Input
internally it will break it down and really should have the DS3 as C-Bit framing
for it to work properly .. it is very possible the LEC has one direction Misframed
This is a very common mistake for inexperienced dacs mapping techs

You did order the same framing type on Both Ends from the LEC right ? ?
 
Hey cool. An interesting question here!

Looking at your diagram, I assume that this is ALL Layer 2 and that there isn't any layer 3 switching or routing going on here - You've just extended the segment between the two LANs.

One thing that doesn't make sense is that you've got your 7200s with a DS3 for backup on a separate subnet. To be honest, I'm not sure how this will work. If the fiber link dies, each 7200 will assume that the whole 38.97.208.x/25 subnet is local and never send traffic across the DS3.

A couple of things to look at.

Out Of Order packets can be a lot of things, but it can often be caused by some kind of odd loop in the mix. Have you tried to totally shut down the 7200's and take them out of the mix? If you're showing a clean circuit across the fiber then something else is getting in the way and the 7200 is really the only device capable of doing so. You might be having something REALLY odd happening where the traffic to DEN from COS is bouncing through the DS3 and back or something.

If you can't take the 7200 out of the mix, here's what I'd do:

Set up a PC with Ethereal and get it on your 3524. Use Cisco's port-spanning feature to capture all the traffic that's actually going out of the switch port that leads to the metro Ethernet, AND capture it on the host that's sending it. Pretty much the same test you did, just adding a PC to watch the metro Ethernet port. See if there are any differences.

Next do the same thing, but span the port off to the 7200. Theoretically, it shouldn't see ANY of the traffic. If it does, something is Not Right Here.

One dumb thing to check - Do you have any kind of weird subnet masks going on? Is everything across the board consistently using a 255.255.255.128? If the hosts in COS were using a different mask it could force their traffic somewhere else, which could cause very, very odd things to happen.
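One quick way to check that mask consistency is with Python's stdlib `ipaddress` module (ra's address comes from the pings above; the second host's address and both masks are examples):

```python
# Verify every host address sits in the same /25 and that its
# configured mask actually matches that network.
import ipaddress

lan = ipaddress.ip_network("38.97.208.0/25")

hosts = {                      # host -> (address, configured mask)
    "ra":     ("38.97.208.21", "255.255.255.128"),
    "host2":  ("38.97.208.22", "255.255.255.128"),  # hypothetical
}

for name, (addr, mask) in hosts.items():
    iface = ipaddress.ip_interface(f"{addr}/{mask}")
    assert iface.network == lan, f"{name} is in {iface.network}, not {lan}"
print("all hosts consistent on", lan)
```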

- G

 
Garion, your initial statement about extending the segment is spot on.

PLEASE SEE THE UPDATED VISIO: http://isis.data102.com/~randal/moedrawing.gif - it has a lot more info on it that will help.

The DS3 is not a backup. The DS3, and its backup (a dark DS3) is currently in use and moving traffic between denver and COS. That link is a standard routed /30. The Metro Ethernet is supposed to replace it. There is currently no routing in place at all for any of the MOE directly. The ONLY place that any router can touch the MOE is on the trunk between the COS cisco 3500-24XL and the COS c7206b - this is simply to enable internet access for the servers in COS. Again, the Denver router has no knowledge of 38.97.208.x (except via bgp over ds3).

I cannot turn off the 7200s, but I can disconnect the trunk that hooks up the servers/MOE to it. From all of my sniffs & mac checking, things that are on 38.97.208.x/25 are staying on net as they should.

I have not tried port spanning, which is a good idea. On the other hand, I have taken the switches out entirely and run from RA (den) straight to my laptop (cos) and had the same throughput, although I have not done that with tcpdump / looking for OoO packets, retransmits, etc., just throughput.

I checked all of the subnets on everything and they are all set appropriately.

I think the updated picture will help a lot. Please continue firing away. It will be very difficult to convince the LEC Layer1/2 guys that Yes, we are getting 100mbps of Ethernet, but it is coming back in the wrong order. I do not know if their test sets are able to sequence ethernet packets with varying patterns and then test that - will follow up with them in the morning.
 
Randal .. I suggest you read my post above
This really sounds like a DS3 Framing issue to me
This is a new DS3 you just installed, right ? ?

I have over 30 years in telecom with Verizon
I have seen many odd problems that are a result
of circuit misoptions .. if you are breaking the DS3
down in DS1 / T1 at both ends, make sure both ends
of each T1 is set for the same framing .. most people
now are using ESF / B8ZS on the T1 pipe
 
Off topic, I know, but just wanted to say - I envy you guys' knowledge of networking. I hope to be this knowledgeable with LAN/WAN networking when I am older (22 in October).
 
I still think the 7200s are playing a role somehow (I know they shouldn't, but it's an option you have to leave open). The only way to be sure is to get a packet trace and look at each layer to make sure it is correct. Or disconnect them and re-run your tests.

There could be redirects going on, proxy arp, or any host of weird layer2/3 things. I bet if you get a sniffer on there a big light bulb is going to turn on.

Do me a favor and "show ip proto" "show ip int brief" "show ip int f0/0" "show ip route" "show run int f0/0" on the connected 7200.

 
randal, have you tried hooking two servers directly to the 100Mb/s Ethernet handoffs (assuming they're copper) and running netperf? That would help determine whether the 7200s are introducing problems, and should be able to help you fight with the telco if that test fails.

Also ensure that the telco's duplex settings are right. I have similar circuits from Verizon, though 10Mb/s, and a recent problem I had with those showed me that their default config on their gear is pretty bogus.
 
Originally posted by: bruceb
Randal .. I suggest you read my post above
This really sounds like a DS3 Framing issue to me
This is a new DS3 you just installed, right ? ?

I have over 30 years in telecom with Verizon
I have seen many odd problems that are a result
of circuit misoptions .. if you are breaking the DS3
down in DS1 / T1 at both ends, make sure both ends
of each T1 is set for the same framing .. most people
now are using ESF / B8ZS on the T1 pipe

Bruce, this is an Ethernet Over SONET product. Basically they stick an OC3 into some Fujitsu box and it spits out ethernet.

100bT-----[Fujitsu box]-OC3----------------------------------OC3-[Fujitsu box]---100bT

We are at the ends connected to the 100bT links. As such, we have zero input on line framing, channelizing, anything. The nice thing about it is that you can get a multiple access technology (ethernet) over 100 miles thanks to SONET.
 
Originally posted by: cmetz
randal, have you tried hooking two servers directly to the 100Mb/s Ethernet handoffs (assuming they're copper) and running netperf? That would help determine whether the 7200s are introducing problems, and should be able to help you fight with the telco if that test fails.

Also ensure that the telco's duplex settings are right. I have similar circuits from Verizon, though 10Mb/s, and a recent problem I had with those showed me that their default config on their gear is pretty bogus.


Yes. I hooked up the MOE delivery directly to the "ra" server in Denver and my laptop here in COS and had the same throughput issues, host to host. I did -not- however, do any looking for OOO packets, retrans, etc. just pure throughput. I will do this again today, host-to-host without any intermediary switches/routers.
 
Originally posted by: spidey07
I still think the 7200s are playing a role somehow (I know they shouldn't, but it's an option you have to leave open). The only way to be sure is to get a packet trace and look at each layer to make sure it is correct. Or disconnect them and re-rerun your tests.

There could be redirects going on, proxy arp, or any host of weird layer2/3 things. I bet if you get a sniffer on there a big light bulb is going to turn on.

Do me a favor and "show ip proto" "show ip int brief" "show ip int f0/0" "show ip route" "show run int f0/0" on the connected 7200.

I'll see about disconnecting the 7200 entirely from 2 of the test servers here in COS and then testing like that. I wish I could give you that output, but that would be a TON of information (bgp route table?), of which I am not sure how much is germane, but here is what I can show:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
cos7206b#show ip proto
Routing Protocol is "bgp 33302"
Outgoing update filter list for all interfaces is not set
Incoming update filter list for all interfaces is not set
IGP synchronization is disabled
Automatic route summarization is disabled
Neighbor(s):
Address FiltIn FiltOut DistIn DistOut Weight RouteMap
38.97.208.126 1 1
38.97.208.130
38.97.208.146 2 2
38.97.208.182
216.84.132.185 3 3
Maximum path: 1
Routing for Networks:
Routing Information Sources:
Gateway Distance Last Update
38.97.208.146 20 1w6d
38.97.208.130 200 00:00:13
216.84.132.185 20 2d01h
38.97.208.126 200 9w1d
Distance: external 20 internal 200 local 200

cos7206b#show ip int fa0/0
FastEthernet0/0 is up, line protocol is up
Internet protocol processing disabled
cos7206b#show run int fa0/0
Building configuration...

Current configuration : 192 bytes
!
interface FastEthernet0/0
description Data102 Internal
no ip address
no ip redirects
no ip unreachables
no ip proxy-arp
load-interval 30
duplex full
end

cos7206b#


As you can see, nothing out of the ordinary
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Although there are multiple `no ip proxy-arp` entries in the config, I agree that isolating things would be best. To that end, I think I'll prep another server really quick and make things purely host-to-host: one server on one end, another server on the other end, no switches or routers involved.
 