Good networking problem for the network engineers

subflava

Senior member
Feb 8, 2001
Hey guys,

I've recently discovered a strange problem on Comcast's network which I thought you guys might like to chew on. What I'm looking for is some good theories as to what the cause might be. This is an ongoing problem, so if anyone can think of some good tests to run, I can run them.

Summary of the problem:

I first noticed the problem a couple of weeks ago while trying to play a game via a direct IP connection with a friend who is also on the Comcast network. The problem also coincides with an IP change on my friend's node. The new IP he received begins with a different octet than what he used to get, which leads me to believe Comcast has been reallocating IP space. Ping tests show massive packet loss to my friend, anywhere from 25% to 75% (keep reading, I promise this will get more interesting). We're located about 5 miles apart physically and our IPs are similar. Mine is currently (until the DHCP lease expires) xx.xx.222.138 and his is xx.xx.3.29 (the first 2 octets are the same), both with a /21 subnet mask (255.255.248.0).

Okay, at this point some of you are probably thinking:

1) Connection problem. Either your friend's or your connection is dropping packets.
Nope. Both our connections are fine to other places on the internet (mostly port traffic). I can ping from outside of Comcast's network to our connections and receive 99%-100% response.

2) Must be some overloaded circuit between us
Doubtful. The problem has been there for 2 weeks and it's been there whenever I've had the chance to test it. Highly doubtful that a circuit is overloaded for that period of time between 2 nodes that close together.

I've also done some ping tests to neighboring subnets and found that I can ping anyone in my subnet (xx.xx.216.0/21) without problems. I can also ping a lot of other nodes on different subnets (xx.xx.144.0/21 for example). I'll have to do some more testing to be sure, but it seems that when the third octet gets down into the 50s (maybe 60s) the problem starts to appear. I haven't done a complete mapping of the entire class B, but the fact that the problem is only with certain subnets is probably significant and could point to a subnet mask problem?
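To make the subnet arithmetic in this post concrete, here is a minimal Python sketch using the standard `ipaddress` module. The first two octets are masked in the post, so `10.20` below is a made-up stand-in and not the real prefix; only the last two octets and the /21 mask come from the post.

```python
import ipaddress

# Hypothetical stand-in for the masked "xx.xx" octets in the post.
PREFIX = "10.20"

def subnet_of(host: str, prefix_len: int = 21) -> ipaddress.IPv4Network:
    """Return the network a host belongs to for the given prefix length."""
    return ipaddress.ip_interface(f"{host}/{prefix_len}").network

mine = subnet_of(f"{PREFIX}.222.138")
friend = subnet_of(f"{PREFIX}.3.29")

print(mine.netmask)  # 255.255.248.0
print(mine)          # 10.20.216.0/21
print(friend)        # 10.20.0.0/21
```

A /21 spans eight consecutive values of the third octet, which is why .222.138 lands in the .216.0/21 subnet and .3.29 lands in .0.0/21.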

I guess the easiest explanation is that there's some misconfiguration in one of the routers between our connections. That's obvious, but what kind of problem could cause these symptoms? If it's a subnet mask or ACL problem, shouldn't all traffic be getting dropped? What's the explanation for the partial drops?
Unfortunately I can't trace the route between our connections as Comcast has blocked it. However, one would assume there can't be more than 1 or 2 hops between us.

Does anyone have any insight into how cable operators set up their networks? Do they use ATM to backhaul the traffic before it is handed off to their IP network? I know Covad does this with their DSL service (or at least they did 2 years ago).

And I guess one thing I should confirm is if this loss happens with all traffic types. I'm currently assuming it happens to all traffic...should probably verify it.
 

Matthias99

Diamond Member
Oct 7, 2003
First off, if there *is* a problem with Comcast's network (which it sounds like there is), nobody on here can do a thing about it. Call up Comcast if you haven't yet, and try to get your hands on someone who actually knows something about the technical side of the network to explain your problem to (it won't be easy, most likely, and they'll have to send a technician out to look at it, which could take a few days).

That said, let's start at the beginning. Your connection runs through the cable wiring that's strung throughout your neighborhood. Your cable modem converts signals in a particular chunk of bandwidth on that line to/from digital Ethernet signals. The cable company basically has a similar device sitting on the far end of the wire, which acts like a router for your subnet (or it may be a dumb box that plugs into a gigabit routing switch for your subnet -- it doesn't really matter). In any case, from that moment on you're in IP land, most likely with a gigabit Fibre Channel backbone (although it could theoretically be ATM, that's almost exclusively used by phone companies).

If packets are not successfully making it to one particular location, there's a routing problem in their network. You said your friend changed subnets -- usually this happens when there are too many customers on one subnet (leading to degraded performance and numerous complaints about slow web surfing), so they run some more wires, plug in another cable modem/router thing on their end, and split the subnet.

Odds are they partially misconfigured the routing tables for whichever piece of equipment your friend is now on (or something between you, if it's more than 1 hop). They likely have static routes configured for other connected Comcast subnets, so that local traffic doesn't get bounced out to the Internet unnecessarily (which both increases your ping time and their bandwidth bill). A mistake there can end up sending packets between you and your friend off on some wild goose chase, and not all of them will make it back. How do your ping *times* look? Are the packets that get through delayed for a long time before arrival? If so, it's probably forwarding your traffic incorrectly to an external gateway (which, understandably confused, probably sends it back or just tosses it). If the packets that get through have a low ping, then it might be a load-balancing scheme gone awry -- it may be trying to send some packets over one interface and some over another (intending them both to get to your friend, but with some of them getting sent the wrong way).

Problems like this can be hard to detect, since certain misconfigurations will only show up if a computer on one subnet tries to access certain things through a particular router interface -- and testing them all every time the topology changes is difficult (if not impossible) and time-consuming. Like you said, everything seems fine unless your two computers are talking to each other, which makes me suspect a minor configuration problem. Odds are there just aren't many people sending traffic back and forth between your subnets, and so Comcast probably doesn't even know anything is wrong.

I'd say to run a traceroute, but you've obviously tried that already. They probably have the internal routers configured to ignore ping requests. I'd get on the phone with them ASAP, as there isn't a whole lot more you can do from your end.
 

Kadarin

Lifer
Nov 23, 2001
It's also possible that it's not a routing problem. Some ideas: One of the routers in the path may have a high cpu problem for some reason. There may be a bad port. There may be bad fiber. There could be a bug in some code somewhere. Who knows?

Report the problem; if they get enough complaints, they just might have someone track it down. It may be slipping under their radar because not that many people need end to end communication within the Comcast network, so that particular path carries little traffic.

Good luck!
 

subflava

Senior member
Feb 8, 2001
First off, if there *is* a problem with Comcast's network (which it sounds like there is), nobody on here can do a thing about it. Call up Comcast if you haven't yet, and try to get your hands on someone who actually knows something about the technical side of the network to explain your problem to (it won't be easy, most likely, and they'll have to send a technician out to look at it, which could take a few days).

Guess I should have said this at the beginning, but I'm not actually asking for help to solve this problem. I've already been in contact with Comcast tech support and am currently trying to solve this issue through them. I just thought the symptoms of this problem were a little unusual and thought people might be interested in it from an academic standpoint. I guess I should also say that I'm a networking professional myself and have decent experience with networks ("biggest" router I've worked with is a Cisco 7513).

In any case, from that moment on you're in IP land, most likely with a gigabit Fibre Channel backbone (although it could theoretically be ATM, that's almost exclusively used by phone companies).

Are you just making an educated guess on this part or do you have knowledge of this? I know from first-hand experience that Covad keeps most of their DSL traffic on an ATM network until it's handed off to their customers (ISPs that is, not end-users). I suppose the reason for doing this is that ATM handles high loads more efficiently and ensures each circuit gets a chance to send traffic under busy conditions. I just wondered if anyone knew how cable companies handle it. Actually, if anyone has a link to the benefits/disadvantages of ATM I'd like to read up on it.

Odds are they partially misconfigured the routing tables for whichever piece of equipment your friend is now on

Right. To me that's the interesting part. What is misconfigured exactly? I guess I mean it in the sense that if I were a network engineer assigned to troubleshoot this problem, what kind of tests would I run? What do the symptoms tell me? Where would I begin to look? I know we obviously don't have access to Comcast's routers, but usually troubleshooting doesn't start with looking at router configs anyways. I'm asking what the symptoms would tell a network engineer to look at.

How do your ping *times* look?
They are very consistently in the 20ms range when I get a response, which is what I've observed to be normal conditions. Highs of about 60ms here and there, so I don't think this is a congestion issue. I've seen congestion conditions before where the pings will range from normal, to 100ms, 500ms, 1000ms and all over the place, including drops. This doesn't seem to be the case here.

Comcast probably doesn't even know anything is wrong.
Agreed. This problem is so local I doubt they know...even after the 3 phone calls I've placed with them :)
I'll keep calling to see if they eventually listen.

Oh, and thanks for reading my long (confusing??) post and responding :)
 

subflava

Senior member
Feb 8, 2001
One of the routers in the path may have a high cpu problem for some reason. There may be a bad port. There may be bad fiber.

Perhaps...but that doesn't really fit what I'm seeing. If it's a bad port/router/fiber, then the problem should be showing up with other locations too. My hunch right now is still that it's an L3 issue...perhaps a QoS or load-balancing (which Matthias99 mentioned) misconfiguration? My experience with those services on a live network is limited though, so I can't formulate a good possible explanation.
 

subflava

Senior member
Feb 8, 2001
Spidey, where the heck are you? :D What, is this one just too simple for you?

Anyways, here's a ping sample:

(about 40 good replies before this)
Reply from xx.xx.3.29: bytes=32 time=23ms TTL=147
Request timed out.
Reply from xx.xx.3.29: bytes=32 time=22ms TTL=147
Reply from xx.xx.3.29: bytes=32 time=35ms TTL=147
Reply from xx.xx.3.29: bytes=32 time=19ms TTL=147
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Reply from xx.xx.3.29: bytes=32 time=21ms TTL=147
Reply from xx.xx.3.29: bytes=32 time=22ms TTL=147
Reply from xx.xx.3.29: bytes=32 time=24ms TTL=147
Reply from xx.xx.3.29: bytes=32 time=19ms TTL=147
Reply from xx.xx.3.29: bytes=32 time=26ms TTL=147
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Reply from xx.xx.3.29: bytes=32 time=32ms TTL=147
Reply from xx.xx.3.29: bytes=32 time=32ms TTL=147
Reply from xx.xx.3.29: bytes=32 time=20ms TTL=147
Reply from xx.xx.3.29: bytes=32 time=30ms TTL=147
Reply from xx.xx.3.29: bytes=32 time=21ms TTL=147
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Reply from xx.xx.3.29: bytes=32 time=20ms TTL=147
Reply from xx.xx.3.29: bytes=32 time=20ms TTL=147
Reply from xx.xx.3.29: bytes=32 time=20ms TTL=147
Reply from xx.xx.3.29: bytes=32 time=21ms TTL=147
Reply from xx.xx.3.29: bytes=32 time=21ms TTL=147
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Reply from xx.xx.3.29: bytes=32 time=17ms TTL=147
Reply from xx.xx.3.29: bytes=32 time=17ms TTL=147
Reply from xx.xx.3.29: bytes=32 time=19ms TTL=147
Reply from xx.xx.3.29: bytes=32 time=22ms TTL=147
Reply from xx.xx.3.29: bytes=32 time=19ms TTL=147

Hmm...I've noticed there are stretches where the packet loss kind of forms a pattern: 5 responses followed by 5 drops. The pattern isn't always there, but there are stretches where it is. Seems like a helpful clue.
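The "5 responses followed by 5 drops" pattern is easier to spot if the ping output is collapsed into runs. The following is a minimal Python sketch, assuming Windows-style ping output like the sample above; the function names are made up for illustration, and the short `sample` is synthetic rather than the full capture.

```python
from itertools import groupby

def classify(line):
    """Map one line of ping output to 'reply', 'timeout', or None."""
    line = line.strip()
    if line.startswith("Reply from"):
        return "reply"
    if line.startswith("Request timed out"):
        return "timeout"
    return None  # headers, blank lines, summaries

def run_lengths(ping_output):
    """Collapse consecutive identical outcomes into (kind, count) pairs."""
    outcomes = [c for c in map(classify, ping_output.splitlines()) if c]
    return [(kind, len(list(group))) for kind, group in groupby(outcomes)]

def loss_percent(runs):
    """Percentage of probes that timed out."""
    total = sum(n for _, n in runs)
    lost = sum(n for kind, n in runs if kind == "timeout")
    return 100.0 * lost / total if total else 0.0

# Synthetic sample: 5 replies, 5 timeouts, 5 replies.
sample = "\n".join(
    ["Reply from xx.xx.3.29: bytes=32 time=21ms TTL=147"] * 5
    + ["Request timed out."] * 5
    + ["Reply from xx.xx.3.29: bytes=32 time=20ms TTL=147"] * 5
)

runs = run_lengths(sample)
print(runs)  # [('reply', 5), ('timeout', 5), ('reply', 5)]
print(round(loss_percent(runs), 1))  # 33.3
```

Feeding a long capture through `run_lengths` shows at a glance whether the runs really cluster around 5, which would point to something periodic rather than random loss.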
 

cmetz

Platinum Member
Nov 13, 2001
subflava, Google for pchar, RTFM on it and run it. All sorts of diagnostic goodness there. I don't know whether their blocking of traceroute will cause problems with this, though. Also try traceroute with both UDP and ICMP modes, and if all those fail, Google for tcptraceroute and try that. Also, ping your local next hop and see what that TTL is - the delta between that and the returned TTL of 147 when pinging your friend should tell you how many hops are between you.
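The TTL arithmetic suggested above can be sketched in a few lines of Python. Two caveats, both assumptions rather than facts about Comcast's network: the delta method only works if the next hop and the far end start from the same initial TTL, and the list of "common" initial TTL values (64, 128, 255) is just the set of defaults most stacks use. The next-hop TTL of 149 below is invented for illustration.

```python
# Assumption: the initial TTL values most operating systems and routers default to.
COMMON_INITIAL_TTLS = (64, 128, 255)

def hops_from_delta(next_hop_ttl, remote_ttl):
    """Hops beyond the next hop, assuming both ends use the same initial TTL."""
    return next_hop_ttl - remote_ttl

def hops_from_initial_guess(observed_ttl, candidates=COMMON_INITIAL_TTLS):
    """Guess the sender's initial TTL as the smallest common default >= observed."""
    initial = min(c for c in candidates if c >= observed_ttl)
    return initial, initial - observed_ttl

# 147 is the TTL seen in the ping sample; 149 is a hypothetical next-hop reading.
print(hops_from_delta(149, 147))     # 2
print(hops_from_initial_guess(147))  # (255, 108)
```

If the initial-TTL guess gives an implausible hop count, as 108 would be for two nodes a few miles apart, that hints the far end does not use one of the common defaults, and the delta against the next hop is the more useful number.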

A typical cable operator network uses GigE with long-reach optics over dark fiber between cable head-ends in a metro area, and PPP-over-SONET WAN links between metro areas and peering points. Within a metro area, a cable operator should have more bandwidth than they know what to do with and therefore no congestion loss.

Unfortunately, what you really need to do is get in touch with someone at Comcast who has real networking clues and the ability to get into the equipment, and those people are well hidden. It's a standard problem in the low-end (home/SOHO) ISP solution market; they aggressively shield the real engineers behind call centers, and it can be incredibly frustrating trying to push problem reports to the people who can actually cause them to get fixed. Two things I can think of to try are that I believe Comcast has a news server with local discussion groups, and there's likely a Comcast forum on DSLReports.com with real Comcast folks lurking.
 

Matthias99

Diamond Member
Oct 7, 2003
In any case, from that moment on you're in IP land, most likely with a gigabit Fibre Channel backbone (although it could theoretically be ATM, that's almost exclusively used by phone companies).


Are you just making an educated guess on this part or do you have knowledge of this? I know from first-hand experience that Covad keeps most of their DSL traffic on an ATM network until it's handed off to their customers (ISPs that is, not end-users). I suppose the reason for doing this is that ATM handles high loads more efficiently and ensures each circuit gets a chance to send traffic under busy conditions. I just wondered if anyone knew how cable companies handle it. Actually, if anyone has a link to the benefits/disadvantages of ATM I'd like to read up on it.

This is an educated guess on my part, but I would be *very* surprised if they were using ATM equipment. AFAIK it's basically extinct except for long-distance trunks maintained by various phone companies. ATM's virtual circuit switching is not terribly useful unless you're running phone connections, as TCP/IP is built around a packet switched network model.

ATM chops up your data into tiny fixed-size cells (53 bytes each: a 5-byte header plus 48 bytes of payload), and also has a bunch of protocols for managing and maintaining end-to-end connections within the ATM network. The advantages are that the cells can be switched entirely in hardware (meaning very low per-hop latency), and it's very easy to maintain QoS restrictions on the flows, since part of the protocol handles bandwidth allocation explicitly. The disadvantage is that cells that small are terrible for carrying IP data -- there's too much overhead (5 bytes of header per 53-byte cell, versus 18 bytes of header and checksum per 1500-byte payload with Ethernet). Circuit switching adds overhead each time a connection is made or broken, since control messages have to get bounced all over the place to make sure all the switches involved have adequate bandwidth open. It was built mainly for telecommunications, at which it excels, but it's just not geared for TCP/IP data transmission. Computer data networks are highly dynamic, with data rates rapidly going from zero to full blast. When they developed ATM, they didn't have readily available fiber optics, and they didn't have millions of high-speed computers all connected to each other. People don't care so much if it takes a second or two to connect their long-distance calls, but everyone will care if your system lags for 1-2 seconds every time you connect to a website.
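The overhead comparison can be worked through numerically. This is a rough Python sketch under stated assumptions: standard 53-byte ATM cells with a 5-byte header, AAL5 framing with an 8-byte trailer and no LLC/SNAP encapsulation, and an Ethernet frame counted as 14 bytes of header plus a 4-byte FCS (preamble and inter-frame gap ignored). The constant and function names are made up for illustration.

```python
import math

ATM_CELL = 53                        # bytes per cell on the wire
ATM_HEADER = 5                       # bytes of header per cell
ATM_PAYLOAD = ATM_CELL - ATM_HEADER  # 48 bytes of payload per cell
AAL5_TRAILER = 8                     # assumption: plain AAL5, no LLC/SNAP
ETH_HEADER = 14                      # destination MAC, source MAC, EtherType
ETH_FCS = 4                          # frame check sequence

def atm_wire_bytes(ip_packet_len):
    """Cells and total wire bytes needed to carry one IP packet over AAL5."""
    cells = math.ceil((ip_packet_len + AAL5_TRAILER) / ATM_PAYLOAD)
    return cells, cells * ATM_CELL

def overhead_fraction(payload, wire):
    """Share of the bytes on the wire that are not payload."""
    return (wire - payload) / wire

cells, atm_wire = atm_wire_bytes(1500)
eth_wire = 1500 + ETH_HEADER + ETH_FCS

print(cells, atm_wire)                              # 32 1696
print(f"{overhead_fraction(1500, atm_wire):.1%}")   # 11.6%
print(f"{overhead_fraction(1500, eth_wire):.1%}")   # 1.2%
```

The padding in the last cell is why the ATM figure for a full-size packet comes out higher than the bare 5/53 header ratio of about 9.4%.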

Odds are they partially misconfigured the routing tables for whichever piece of equipment your friend is now on

Right. To me that's the interesting part. What is misconfigured exactly? I guess I mean it in the sense that if I were a network engineer assigned to troubleshoot this problem, what kind of tests would I run? What do the symptoms tell me? Where would I begin to look? I know we obviously don't have access to Comcast's routers, but usually troubleshooting doesn't start with looking at router configs anyways. I'm asking what the symptoms would tell a network engineer to look at.

The symptoms are that computers on two of your subnets can't reliably talk to one another. The first thing I'd do is to get some computers hooked up to the routers in question and see if they can send packets back and forth (we'll assume this fails, just like with you and your friend). At this point I'd look at what they changed when your friend switched subnets, and double-check everything they did to make sure that a) they picked the right thing to do, and b) they did it correctly.

My guess from these symptoms is that some of the static routing tables on one end or the other are not correct -- it's easy to mistype something, and it might not show up immediately if access to the internet is still OK. I would double-check the routing entries and then try it again if they seemed out of whack. If that didn't work, you have to start looking at what's between your two subnets. It could be just one fiber optic cable -- it could be half a dozen routers and switches in a load-balancing network. If there aren't any more routers, I might start to suspect a bad fiber (especially if one had just been installed). If there were more routers, you'd have to go look at them and see if their routing tables were correct, and make sure their interconnections are good. That's the general process, anyway.
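As an illustration of how one mistyped static route can affect only some subnets, here is a small longest-prefix-match sketch in Python. Everything in the table is hypothetical: the `10.20` prefix stands in for the masked octets, the next-hop names are invented, and the stray /18 entry is a made-up typo, not a known fact about Comcast's configuration. It shows how a fault can be confined to a range of subnets; it does not by itself explain partial loss.

```python
import ipaddress

def lookup(routes, destination):
    """Return the next hop of the most specific route containing destination."""
    dest = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in routes if dest in net]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]

# Hypothetical routing table on a hypothetical metro router.
routes = [
    (ipaddress.ip_network("0.0.0.0/0"), "external-gateway"),
    (ipaddress.ip_network("10.20.216.0/21"), "local-cmts"),
    (ipaddress.ip_network("10.20.0.0/16"), "metro-core"),
    # Hypothetical mistake: a more-specific route pointing the wrong way.
    (ipaddress.ip_network("10.20.0.0/18"), "external-gateway"),
]

print(lookup(routes, "10.20.3.29"))     # external-gateway
print(lookup(routes, "10.20.144.7"))    # metro-core
print(lookup(routes, "10.20.222.138"))  # local-cmts
```

Because the most specific prefix wins, only destinations inside the bad /18 (third octet 0 through 63) take the wrong path, while the rest of the /16 is untouched.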

How do your ping *times* look?
They are very consistently in the 20ms range when I get a response, which is what I've observed to be normal conditions. Highs of about 60ms here and there, so I don't think this is a congestion issue. I've seen congestion conditions before where the pings will range from normal, to 100ms, 500ms, 1000ms and all over the place, including drops. This doesn't seem to be the case here.

Since the pings are OK (20ms sounds about right -- probably 2-3 hops), it means some of the packets are getting through fine, but others are getting mangled or lost. Could be a flaky cable or port (or router!), could be screwed up routing. You're right in that it doesn't look like congestion.
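One way to back up the "doesn't look like congestion" reading is to summarize the round-trip times of the replies that do arrive. A minimal Python sketch, assuming Windows-style `time=23ms` fields; the helper names are invented, and the four sample lines are the first four replies from the ping sample earlier in the thread.

```python
import re
import statistics

# Matches "time=23ms" and also "time<1ms" as printed by Windows ping.
RTT_PATTERN = re.compile(r"time[=<](\d+)ms")

def extract_rtts(ping_output):
    """Pull every round-trip time (in ms) out of raw ping output."""
    return [int(m.group(1)) for m in RTT_PATTERN.finditer(ping_output)]

def summarize(rtts):
    """Basic spread statistics; a wide spread would hint at queueing."""
    return {
        "min": min(rtts),
        "median": statistics.median(rtts),
        "max": max(rtts),
        "spread": max(rtts) - min(rtts),
    }

sample = """\
Reply from xx.xx.3.29: bytes=32 time=23ms TTL=147
Request timed out.
Reply from xx.xx.3.29: bytes=32 time=22ms TTL=147
Reply from xx.xx.3.29: bytes=32 time=35ms TTL=147
Reply from xx.xx.3.29: bytes=32 time=19ms TTL=147
"""

rtts = extract_rtts(sample)
print(rtts)             # [23, 22, 35, 19]
print(summarize(rtts))  # {'min': 19, 'median': 22.5, 'max': 35, 'spread': 16}
```

Congested links tend to show round-trip times climbing into the hundreds of milliseconds before drops start, so a tight spread alongside heavy loss points away from simple queue overflow.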

Comcast probably doesn't even know anything is wrong.
Agreed. This problem is so local I doubt they know...even after the 3 phone calls I've placed with them :)
I'll keep calling to see if they eventually listen.

Oh, and thanks for reading my long (confusing??) post and responding :)

No problem.
 

p0lar

Senior member
Nov 16, 2002
That diagnostic you printed almost spells it out.. it really, REALLY, looks like some kind of equal-cost routing gone sour or misconfigured if I were to hazard a guess (and I am). It reeks of poorly configured static routes - or - a misconfigured load-balanced internal routing mechanism designed for fault-tolerance or increased bandwidth between segments. This is why neither of you will have problems with traffic destined for other locations. FWIW, most cable companies that I've dealt with either use dark fibre or SONET (some even use sub channel SONET, FBoW). You don't see lots of ATM-based stuff unless it is a smaller company or has a remote POP or group of cells that doesn't have a large enough customer base to warrant high bandwidth or the cost of the fibre backhaul.
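The "equal-cost routing gone sour" idea can be illustrated with a toy simulation in Python. This is a sketch under invented assumptions: two equal-cost paths, one of which silently drops everything, and a balancer that sends a fixed number of consecutive packets down each path before switching. Real load balancers may hash per flow or per packet instead, so the parameters here are purely illustrative.

```python
def simulate(num_packets, path_ok, packets_per_turn):
    """Send packets over paths in turn; path_ok[i] says whether path i delivers."""
    outcomes = []
    for seq in range(num_packets):
        path = (seq // packets_per_turn) % len(path_ok)
        outcomes.append(path_ok[path])
    return outcomes

def render(outcomes):
    """'R' for a reply, '.' for a timeout."""
    return "".join("R" if ok else "." for ok in outcomes)

# Path 0 works, path 1 is a black hole (hypothetical).
print(render(simulate(20, [True, False], 5)))  # RRRRR.....RRRRR.....
print(render(simulate(8, [True, False], 1)))   # R.R.R.R.
```

With five packets per turn the output reproduces the 5-good, 5-lost rhythm from the ping sample, while strict per-packet alternation would look quite different, so the burst length itself says something about how the traffic is being split.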

There's a lot of good info in this thread, but a general lack of knowledge about how CMTS systems are set up. Unfortunately, they're not all configured identically (or correctly for that matter) so it's hard to gauge one particular system by the experiences with another.

Pull the config file from your CMTS TFTP server and post it here - that may help as well. (I'm assuming this is DOCSIS 1.0/1.1 since Cox is relatively large)

- p0lar
 

spidey07

No Lifer
Aug 4, 2000
Well, I guess others have covered it, but if your problem is only with their local metro network then there isn't much you can do other than have Comcast fix it.

The repeating pattern of 5 good pings and then 5 timeouts is really unusual. I like cmetz's idea of running different kinds of traceroutes. With the Nachi worm out there pinging away, some ISPs have low-queued or even rate-limited pings.
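To see how ICMP rate limiting could produce a regular on/off pattern, here is a toy fixed-window limiter in Python. The numbers are assumptions picked to match the sample (one ping per second, at most 5 allowed per 10-second window); real routers use various policing schemes, and nothing here is known about Comcast's actual settings.

```python
def fixed_window_limiter(arrival_times, limit, window):
    """Allow at most `limit` packets per `window` seconds; return pass/drop flags."""
    results = []
    window_start = None
    used = 0
    for t in arrival_times:
        # Start a new window once the current one has expired.
        if window_start is None or t - window_start >= window:
            window_start = t
            used = 0
        if used < limit:
            used += 1
            results.append(True)   # forwarded
        else:
            results.append(False)  # dropped by the limiter
    return results

def render(outcomes):
    """'R' for a reply, '.' for a timeout."""
    return "".join("R" if ok else "." for ok in outcomes)

# One ping per second for 20 seconds (hypothetical parameters).
pings = range(20)
print(render(fixed_window_limiter(pings, limit=5, window=10)))  # RRRRR.....RRRRR.....
```

A quick way to tell this apart from a routing fault would be to repeat the test with a non-ICMP probe, such as a TCP-based traceroute: a limiter that only polices ICMP should leave that traffic untouched.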

But it does seem like some kind of routing problem.
 

p0lar

Senior member
Nov 16, 2002
Good answer, Spidey - I had forgotten all about that.. Welchia and Nachi have both run rampant on cable networks, especially with such large broadcast ranges. It's *entirely* possible that they've rate-limited ICMP.

I still see a lot of garbage on my segment from that..
 

exx1976

Member
Nov 13, 2003
cmetz - The network setup you describe is exactly what I was going to say. A friend of mine used to be a head-end tech for Cox Communications in the Phoenix area. I've seen pictures of all of their stuff, and of stuff he wired.. They had THREE OC-48s worth of bandwidth between headends, and it was a folding ring topology.