Poor network response, late collisions, high CPU utilization, and long Ethernet cable runs: are these related?

Santa

Golden Member
Oct 11, 1999
I'm hoping some people can help fill in any gaps in my testing as I try to track down a slow-performance issue.

At first someone suggested that the network to our branch office was slow, but when I reviewed the network traffic logs, utilization was not even 20% most of the time.

I also saw some late collision messages on the router, so I called out some electricians to do length and attenuation tests on the wires running to the core of that branch office.

It turned out there were many links beyond 100 meters (328 feet). At first I thought: OK, late collisions = retransmissions = slower network performance. Great, replace the wires and go.

But before spending lots of money and digging up carpet to recable on this theory, I needed to test it and confirm that the speeds would actually be better.

Log of what was performed...

Network of Branch office =
Switches
2 x Cisco 2900XL switch
1 x Cisco 2950 w/Gigabit
Router
1 x Cisco 4000
Frame Relay

Removed 2 switches
·Still receiving late collisions
·still slow performance
·34-50% CPU

Removed all but 1 workstation + router
·No Late Collisions
·still slow performance
·34-50% CPU

Put back on 12 workstations that had runs <200 feet
·No Late Collisions
·still slow performance
·34-50% CPU

Put back on 12 more workstations that had runs <250 feet
·3 Late Collisions
·still slow performance
·34-50% CPU

Generated ping traffic from 12 Workstations, server, and over WAN
·Late Collision
·Collisions back up to 2-5 per minute
·still slow performance
·38-60% CPU

Took off 6 Machines and had 6 of 18 machines generating packets
·Late Collisions
·Collisions still up to 2-5 per minute
·still slow performance
·38-50% CPU

Took off 6 more Machines and had generated traffic to 12 workstations
·Late Collisions
·Collisions still up to 2-5 per minute
·still slow performance
·38-50% CPU

Took off 6 more Machines and had 6 remaining machines generate traffic
·Late Collisions
·Collisions still up to 2-5 per minute
·still slow performance
·40-60% CPU

Put all machines and both switches back onto the network
·Late Collisions
·Collisions still up to 3-5 per minute
·still slow performance
·46-66% CPU

What concerns me is that this office is the only one with long Ethernet runs and a router whose CPU utilization sits at 38% at idle. It goes up to 44-45% under load and spikes to around 60-70%.

My questions are:
- Will replacing the router and the long wire runs bring the performance hits into check?
- Is 38% CPU utilization at idle normal for a Cisco 4000 with 1 Frame Relay link, EIGRP, and some mainframe controllers hanging off of it, or is it as strange as it looks? It isn't remotely near as high on the 2522 or 3620 we have in other offices.
- Is the Cisco 2522 superior to the 4000? According to the specs, the 2522 has the same type of Motorola 68030 CPU at 20MHz or 25MHz, while the 4000 runs it at 40MHz. Going by processor speed, the 4000 shouldn't show higher utilization, unless its architecture is weaker than the 2522's.
- Does anyone know if the 4000 can do full duplex on the Ethernet NP-1E module we have in this thing? Everywhere it just says it's 10Base-T, with no duplex info. It works both ways, so I am assuming it does full?
- Is there anything else I am missing in terms of tests to see where the bottleneck may lie?

Thanks in advance..
 

spidey07

No Lifer
Aug 4, 2000
hmmmmm....

Where is the poor network response? Is all local LAN traffic slow, or are we talking about WAN traffic?

Late collisions are bad and can be thought of as a drop. They're bad because now the transport layer is responsible for the retransmission, so you're waiting 5 - 10 seconds on that retransmission instead of microseconds for a regular collision. The most common causes of late collisions are cables that are too long or a duplex mismatch, where one side of the link is full and the other half. Always force both sides; never force one side and auto the other. So check and double-check duplex settings on the interswitch links and the router (the 4000 module is half-only, I believe, but you can check on Cisco's site).
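
For reference, a collision counts as "late" when it is detected after the first 512 bit times (64 bytes) of a frame are already on the wire, which is the window the half-duplex length limits are built around. A minimal Python sketch of that rule; the function names are made up for illustration and the bit counts in the usage lines are hypothetical:

```python
SLOT_TIME_BITS = 512  # 64 bytes: collision window for 10 and 100 Mb/s half-duplex Ethernet


def slot_time_us(link_mbps: float) -> float:
    """Length of the collision window in microseconds at a given link speed."""
    return SLOT_TIME_BITS / link_mbps  # 1 Mb/s is 1 bit per microsecond


def is_late_collision(bits_sent: int) -> bool:
    """True if the collision was detected after the slot time had already passed."""
    return bits_sent > SLOT_TIME_BITS


print(slot_time_us(10))        # collision window on a 10 Mb/s link, in microseconds
print(is_late_collision(400))  # hypothetical: still inside the window, an ordinary collision
print(is_late_collision(900))  # hypothetical: past the window, a late collision
```

Anything that stretches the round trip past that window (an over-length run) or makes one side ignore the window entirely (a full-duplex port facing a half-duplex one) can show up this way.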

More on LAN troubleshooting - 3-5 collisions per minute is fine. Once they approach 2% or more of total frames, that's cause for concern.
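
To put the 2% rule of thumb into numbers, here is a small Python sketch. The counter values are hypothetical; on a real device they would come from the interface counters, and the helper names are invented for this example:

```python
def collision_percent(collisions: int, total_frames: int) -> float:
    """Collisions as a percentage of total frames seen on the interface."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    return 100.0 * collisions / total_frames


def worth_worrying(collisions: int, total_frames: int, threshold: float = 2.0) -> bool:
    """Apply the rough 2% rule of thumb to a pair of counters."""
    return collision_percent(collisions, total_frames) >= threshold


# Hypothetical counters: 5 collisions in 10,000 frames, then 300 in 10,000.
print(collision_percent(5, 10_000), worth_worrying(5, 10_000))
print(collision_percent(300, 10_000), worth_worrying(300, 10_000))
```

The point is that a per-minute count only means something relative to how many frames crossed the port in that minute.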

About processor utilization on the router. It shouldn't be that high. What process is using the most? (show proc cpu)

About slow WAN performance. Could be duplex related to the 4000 lan interface. Check your buffers/drops. Check physical layer errors on both sides. Check your MTU and fragmentation settings. If frame relay check fecn/becn and CIR with show frame-relay pvc, show frame-relay map.

And lastly, can you describe the slow network performance? Is there really a performance problem, or do the stupid users just not realize that they're accessing a system over a WAN circuit?

HTH
 

Garion

Platinum Member
Apr 23, 2001
I doubt if that card does full duplex. In fact, if you are seeing reports of collisions, you KNOW it's not in full duplex mode - A FD link doesn't have collisions, so it reports everything as framing or other L2 errors.

I'd try to lock down the port running the router to 10/Half and see if that fixes things. Also, if your cabling job is less-than-ideal (too long, poorly terminated, etc.) you can lock problem ports down to 10/half, too. It's definitely less sensitive to cabling issues, and performance should still be fine, unless you're talking about hardcore engineers or graphic designers. For most places (banks, etc.) 10Mb/s is PLENTY.

I noticed that in ALL your tests, you had slow performance. Have you tried to link a single PC to the router with a crossover cable, to totally eliminate any cabling or switch issues?

When you say slow performance, what do you mean? Slow throughput? Not responsive? Dropping packets?

How about a sniffer test - did it show anything of interest?

- G
 

JustinLerner

Senior member
Mar 15, 2002
Here's a generic response.

You should eliminate connections over the 100 meter cabling limit using whatever appropriate means and solutions you have at your disposal.

CPU and network usage may be related to other things. Maybe some 'helpful' company employees want to contribute 'CPU' processing sharing with SETI or other programs. What about viruses, worms, trojans, spying or other monitoring traffic run amuck?
 

me19562

Senior member
Jun 27, 2001
If the wire drop is no more than 290 ft, plus the patch cables in the comm closet and at the PC, you are not supposed to have wiring problems unless the wiring was not properly terminated. Check the settings and status on each port and PC NIC.
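
A quick way to sanity-check a run against the 100 meter (328 foot) limit mentioned earlier in the thread is to add the patch cables to the drop and convert. A minimal Python sketch, with hypothetical lengths and made-up helper names:

```python
FEET_PER_METER = 3.28084
CHANNEL_LIMIT_M = 100.0  # end-to-end twisted-pair limit: fixed drop plus patch cables


def channel_length_m(drop_ft: float, patch_ft: tuple = ()) -> float:
    """Total channel length in meters: the wall drop plus every patch cable."""
    return (drop_ft + sum(patch_ft)) / FEET_PER_METER


def within_spec(drop_ft: float, patch_ft: tuple = ()) -> bool:
    """True if the whole channel fits inside the 100 m limit."""
    return channel_length_m(drop_ft, patch_ft) <= CHANNEL_LIMIT_M


# Hypothetical runs: a 290 ft drop with two 10 ft patch cables, then a 330 ft drop.
print(round(channel_length_m(290, (10, 10)), 1), within_spec(290, (10, 10)))
print(round(channel_length_m(330, (10,)), 1), within_spec(330, (10,)))
```

A run that passes this check can still misbehave if it is badly terminated, which is why the port and NIC status are worth checking as well.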
 

ScottMac

Moderator, Networking, Elite member
Mar 19, 2001
You don't really mention the server / application architecture, or your security setup. If you're running a batch of Novell servers, and / or AppleTalk, and / or NetBEUI / SNA / OS/2, you're gonna have a lot of broadcasts ... and with broadcasts comes processor utilization.

If you have a lot of access lists on your routers, if you have a large routing table (I'm not seeing a lot of segments, but you didn't say if you were describing your entire network), if the router is doing DHCP and security stuff (IDS / firewall) ... all of these contribute to the processor utilization. So do things like custom / priority queueing and "helper" applications (like DHCP helper).

The late collisions are almost certainly a physical infrastructure issue (assuming it's not a duplex mismatch). Bad termination, and the out-of-spec lengths are your most likely suspects, followed by flakey patch cables (they can go bad, even if you don't move 'em ... kinda like cable rot), flakey panels, ground loops, electrical noise.

Your best bet (at least for that / those runs) would be to hit it with a certification-grade cable tester - many of the newer ones will give you a noise figure on the cable (Fluke fer sher).

Long cable runs are best done with fiber, even if you use a copper --> fiber --> copper converter. They are way less likely to nuke your network or gather and inject noise. Plus they'll bring your segment back into spec.

SO, let us know about "the rest of your network" and maybe something will ring a bell.

Good Luck

Scott
 

Santa

Did some serious testing last night..

When I first posted this message I configured a 2500 router I had lying around and shipped it overnight to their location..

CPU utilization went down dramatically with the new router.. same network and load tested.. CPU load remained around 9-20% but idled around 9%

This seems more right in my eyes, but unfortunately the response times did not improve much, if at all (less than 1 second of improvement).

To give you an idea, the application we are testing this against is an imaging process. Scanned documents reside at HQ, and the entire process, from querying and retrieving to displaying the document on the screen, runs over a Frame Relay line (512/384) and then through a switched network, 100Mbit to the desktop. We are comparing this site's response times to those of two other sites, which both have T1s, one with a 384 CIR and one with a 512 CIR. At first I suspected the WAN, but after monitoring for multiple weeks we didn't see spikes above 200-250k on their line, so we assumed 512/384 is not taxed yet. That, plus the many late collisions and our electricians' reports of many long wire runs, is what worried us.
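
As a rough yardstick for how much of a multi-second response could be plain serialization delay on the WAN, here is a back-of-the-envelope Python sketch. The thread never gives the size of a scanned document, so the 100,000-byte image is purely hypothetical, and reading "512/384" as a 512 kb/s port with a 384 kb/s CIR is an assumption. Protocol overhead, round trips, and server time are ignored:

```python
def transfer_seconds(size_bytes: int, rate_bps: int) -> float:
    """Time to clock size_bytes onto a link running at rate_bps, ignoring all overhead."""
    return size_bytes * 8 / rate_bps


IMAGE_BYTES = 100_000  # hypothetical scanned-document size, not from the thread

for label, rate in (("384 kb/s (assumed CIR)", 384_000), ("512 kb/s (assumed port speed)", 512_000)):
    print(f"{label}: {transfer_seconds(IMAGE_BYTES, rate):.2f} s")
```

If the real images are in that size range, a second or two per document goes to the circuit no matter what the LAN side looks like, which would fit with average utilization staying low while individual retrievals still feel slow.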

I did test the idea that Garion suggested, eliminating the wiring infrastructure altogether with a crossover cable, and made sure the computer's NIC was set to half duplex / 10Mbit to hopefully match the router. The router doesn't have a duplex option (neither the 2500 nor the 4000), so I'm assuming half.

The test gave the same result. Perhaps the overhead is just typical, but why this site and not the other two? I didn't notice abnormally high loads on the WAN circuit at any point in my testing either.

What I do know to be a fact:

1) 1 PC + 1 Router + 1 Switch (same room) = No Collisions, ~5 second response times
2) Add load and multiple computers (even with runs < 250 feet including cable to computer) and response times climb to 6-7 seconds or maybe 8 seconds, with Late Collisions

Server: Windows 2000 with SQL Server and Disk Extender
Application Architecture: Image Retrieval via SQL query and Disk Extender
Security Setup: Windows 2000 Active Directory but no filters or ACLs on the frame relay links

CPU utilization for five seconds: 38%/35%; one minute: 41%; five minutes: 42%
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
1 5328 51869 102 0.00% 0.00% 0.00% 0 Load Meter
2 0 1 0 0.00% 0.00% 0.00% 0 LAPF Input
3 7066764 63589 111133 0.00% 2.65% 2.71% 0 Check heaps
4 0 2 0 0.00% 0.00% 0.00% 0 Pool Manager
5 72764 130266 558 0.00% 0.02% 0.04% 0 Timers
6 31644 26172 1209 0.00% 0.00% 0.00% 0 ARP Input
7 0 1 0 0.00% 0.00% 0.00% 0 SERIAL A'detect
8 619580 455095 1361 0.57% 0.23% 0.16% 0 IP Input
9 23200 41740 555 0.00% 0.00% 0.00% 0 CDP Protocol
10 0 1 0 0.00% 0.00% 0.00% 0 LAPF Output-I
11 211084 263067 802 0.32% 0.25% 0.24% 0 IP Background
12 40048 137465 291 0.00% 0.00% 0.00% 0 TCP Timer
13 352 121 2909 0.08% 0.00% 0.00% 0 TCP Protocols
14 0 1 0 0.00% 0.00% 0.00% 0 Probe Input
15 0 1 0 0.00% 0.00% 0.00% 0 RARP Input
16 4880 2500 1952 0.00% 0.00% 0.00% 0 BOOTP Server
17 13160 4322 3044 0.00% 0.00% 0.00% 0 IP Cache Ager
18 36 4324 8 0.00% 0.00% 0.00% 0 NBF Input
19 0 2 0 0.00% 0.00% 0.00% 0 SPX Input
20 0 4 0 0.00% 0.00% 0.00% 0 DDR Timers
21 0 1 0 0.00% 0.00% 0.00% 0 SNMP ConfCopyProc
22 0 1 0 0.00% 0.00% 0.00% 0 Syslog Traps
23 0 1 0 0.00% 0.00% 0.00% 0 Critical Bkgnd
24 76 303 250 0.00% 0.00% 0.00% 0 Net Background
25 340 6121 55 0.00% 0.00% 0.00% 0 Logger
26 24004 257978 93 0.00% 0.00% 0.00% 0 TTY Background
27 44764 257990 173 0.00% 0.00% 0.00% 0 Per-Second Jobs
28 1039000 257989 4027 0.24% 0.24% 0.24% 0 Net Periodic
29 564 1828 308 0.00% 0.00% 0.00% 0 Net Input
30 7800 51870 150 0.00% 0.00% 0.00% 0 Compute load avgs
31 124828 4322 28881 0.00% 0.03% 0.00% 0 Per-minute Jobs
32 57608 259054 222 0.00% 0.00% 0.00% 0 DLSw Background
33 7664 3101 2471 0.00% 0.00% 0.00% 0 DLSw msg proc
34 9380 40893 229 0.00% 0.00% 0.00% 0 CLS Background
35 0 4 0 0.00% 0.00% 0.00% 0 DLSw Peer Process
36 118552 43670 2714 0.00% 0.00% 0.00% 0 TCP Driver
37 33720 50146 672 0.00% 0.00% 0.00% 0 FR LMI
38 460 4179 110 0.00% 0.00% 0.00% 0 FR ARP
39 27744 2577746 10 0.00% 0.00% 0.00% 0 FR Broadcast Output
40 228 12972 17 0.00% 0.00% 0.00% 0 FR TUNNEL
41 924 22887 40 0.00% 0.00% 0.00% 0 traffic_shape
42 607644 45616074 13 0.00% 0.01% 0.00% 0 SDLC Timer
43 219604 387407 566 0.00% 0.00% 0.00% 0 IP-EIGRP Hello
44 0 1 0 0.00% 0.00% 0.00% 0 LAPF timer-Ack
45 55968 12504 4476 0.00% 0.17% 0.12% 0 IP SNMP
46 4 19 210 0.00% 0.00% 0.00% 0 SNMP Traps
47 6484 4411 1469 0.00% 0.01% 0.00% 0 IP-RT Background
48 148132 169031 876 0.00% 0.03% 0.01% 0 IP-EIGRP Router
49 80 45 1777 4.17% 0.43% 0.09% 2 Virtual Exec

I don't see how it adds up to 40+%
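
One likely reason the per-process column does not add up: in this header format the first five-second figure is total CPU, and the figure after the slash is the share spent at interrupt level (typically packet handling on the interfaces), which the process list does not itemize. A minimal Python sketch that splits the header line pasted above; the function name is made up for illustration:

```python
import re

HEADER = "CPU utilization for five seconds: 38%/35%; one minute: 41%; five minutes: 42%"


def split_five_second_cpu(header: str) -> tuple:
    """Return (total, interrupt_level, process_level) percentages from the header line."""
    match = re.search(r"five seconds:\s*(\d+)%/(\d+)%", header)
    if match is None:
        raise ValueError("not a 'show processes cpu' header line")
    total, interrupt = int(match.group(1)), int(match.group(2))
    return total, interrupt, total - interrupt


total, interrupt, process = split_five_second_cpu(HEADER)
print(f"total {total}%, interrupt level {interrupt}%, left for processes {process}%")
```

Read that way, only the small remainder belongs to the processes listed, so the place to look for the rest is whatever the interfaces are doing.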

So far the verdict seems to be that router CPU utilization and wiring don't impact things that dramatically. I am going to recommend to them that we get a sniffer in there to see if there is any weird network traffic going on behind the scenes.

Since the crossover-cable test to the router didn't yield significantly better response times, would it be safe to say that replacing the router and the wiring in the building wouldn't make a difference either?

I am thinking that the router and wiring replacement may help under load, keeping response times from going up to 7-8 seconds, but I don't think I can get them below 4-5 seconds.

FECN and BECN bits are not showing up, and bandwidth utilization spikes to 250k max but usually stays between 56k and 150k.

The router hosting their Frame Relay is setup very simply to host 1 Circuit with 2 PVCs and about 3-4 Mainframe controllers.

No ACLs, but we do use custom queueing. Here is our queueing strategy:

queue-list 2 protocol ip 1 tcp 2065
queue-list 2 protocol ip 2 tcp telnet
queue-list 2 default 3
queue-list 2 queue 1 byte-count 4500 limit 200
queue-list 2 queue 2 byte-count 3000 limit 200
queue-list 2 queue 3 byte-count 3000 limit 200

The traffic in question would fall under queue 3. TCP 2065 is controller traffic, and telnet is.. well, telnet traffic.
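
With custom queueing the queues are serviced round-robin, each allowed roughly its byte-count per pass, so under sustained congestion the byte-counts act like bandwidth shares. A rough Python sketch using the byte-counts from the queue-list above; the 384 kb/s figure is an assumption (the CIR), and the helper name is made up:

```python
def queue_shares(byte_counts: dict) -> dict:
    """Approximate share of the link each queue gets when every queue has traffic waiting."""
    total = sum(byte_counts.values())
    return {queue: count / total for queue, count in byte_counts.items()}


BYTE_COUNTS = {1: 4500, 2: 3000, 3: 3000}  # from "queue-list 2 queue N byte-count ..."
LINK_BPS = 384_000                          # assumption: throughput limited to the CIR

for queue, share in queue_shares(BYTE_COUNTS).items():
    print(f"queue {queue}: {share:.1%} of the link, about {share * LINK_BPS / 1000:.0f} kb/s")
```

These shares only bite when all three queues are busy at once. When queues 1 and 2 are empty, queue 3 can use the whole link, so this matters mainly if controller and telnet traffic are heavy while images are being retrieved.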

I didn't set this up but it seemed reasonable enough and works fine for our other offices so didn't mess with it.

I am running out of ideas. Beyond sniffing for weird traffic, is there anything else I haven't tried?

Would going from 10Mbit half duplex to 100Mbit full duplex on the router Ethernet link improve speeds much? (Remember the size of the frame relay link.)

I will also try some bandwidth and latency tests here soon and post those results too to verify it isn't bandwidth.. (will use QCheck)
 

Garion

Wait a second.. Where are you seeing late collisions? On the router?!? You shouldn't ever see a late collision on the router, since it's connected to the switch via very short cable. You should ONLY see late collisions between the PC's with long cable runs and the switch, since each port is a separate collision domain and physical layer errors shouldn't be passed from port to port.

Is your switch using cut-through, as opposed to store-and-forward? If so, that might explain it - It would be forwarding on the frame before it's fully received, and if a late collision shows up, it might send that out.

Have you tried to shut down ALL interfaces except the frame relay to see what the router CPU utilization is? Try and re-enable each interface, one at a time, and see what happens.

On that note, what are you doing with the mainframe controllers? Terminating DLSW or running something like STUN can add a pretty heavy load on the router, so the utilization is to be expected.

One last question - the entire 4000-series line simply says "4000" on the front, but there are 4000s, 4500s, and 4700s, with escalating CPU capacity and interface capacity. Which do you have, specifically? Show ver should give you this.

- G



 

ScottMac

Maybe a "show buffers" would help too. How much RAM is in the routers?

And a "show interfaces" on the path through the router (Ethernet & FR), on both routers.

AND (while we're showing ....) maybe a "show queue" (on each queue from whichever interfaces you're implementing the custom queueing on).

I'm gonna bet a nickel that some of the queue 3 traffic is being dropped while the router is servicing queues 1 & 2 ... (make sure you have a representative load).

Good Luck

Scott





 

Santa

This is show version..

Cisco Internetwork Operating System Software
IOS (tm) 4000 Software (C4000-J-M), Version 11.2(26a), RELEASE SOFTWARE (fc1)
Copyright (c) 1986-2001 by cisco Systems, Inc.
Compiled Thu 07-Jun-01 06:42 by leccese
Image text-base: 0x00012000, data-base: 0x00776A0C

ROM: System Bootstrap, Version 4.14(6)[fc3], SOFTWARE

System image file is "c4000-j-mz.112-26a.bin", booted via flash

cisco 4000 (68030) processor (revision 0xA0) with 16384K/4096K bytes of memory.
Processor board ID 5038751

This is show buffers...

Buffer elements:
499 in free list (500 max allowed)
220766771 hits, 0 misses, 0 created

Public buffer pools:
Small buffers, 104 bytes (total 50, permanent 50):
50 in free list (20 min, 150 max allowed)
109345507 hits, 50 misses, 150 trims, 150 created
0 failures (0 no memory)
Middle buffers, 600 bytes (total 25, permanent 25):
23 in free list (10 min, 150 max allowed)
77737 hits, 0 misses, 0 trims, 0 created
0 failures (0 no memory)
Big buffers, 1524 bytes (total 50, permanent 50):
50 in free list (5 min, 150 max allowed)
63716 hits, 0 misses, 0 trims, 0 created
0 failures (0 no memory)
VeryBig buffers, 4520 bytes (total 10, permanent 10):
10 in free list (0 min, 100 max allowed)
31191 hits, 0 misses, 0 trims, 0 created
0 failures (0 no memory)
Large buffers, 5024 bytes (total 0, permanent 0):
0 in free list (0 min, 10 max allowed)
0 hits, 0 misses, 0 trims, 0 created
0 failures (0 no memory)
Huge buffers, 18024 bytes (total 0, permanent 0):
0 in free list (0 min, 4 max allowed)
0 hits, 0 misses, 0 trims, 0 created
0 failures (0 no memory)

Interface buffer pools:
Ethernet0 buffers, 1524 bytes (total 64, permanent 64):
16 in free list (0 min, 64 max allowed)
2351 hits, 360 fallbacks
16 max cache size, 16 in cache
Serial0 buffers, 1524 bytes (total 64, permanent 64):
15 in free list (0 min, 64 max allowed)
599 hits, 82 fallbacks
16 max cache size, 15 in cache
Serial1 buffers, 1524 bytes (total 64, permanent 64):
15 in free list (0 min, 64 max allowed)
417 hits, 0 fallbacks
16 max cache size, 16 in cache
Serial2 buffers, 1524 bytes (total 64, permanent 64):
15 in free list (0 min, 64 max allowed)
15582 hits, 0 fallbacks
16 max cache size, 16 in cache
Serial3 buffers, 1524 bytes (total 64, permanent 64):
15 in free list (0 min, 64 max allowed)
101 hits, 0 fallbacks
16 max cache size, 16 in cache
Serial4 buffers, 1524 bytes (total 64, permanent 64):
15 in free list (0 min, 64 max allowed)
85 hits, 0 fallbacks
16 max cache size, 16 in cache
Serial5 buffers, 1524 bytes (total 64, permanent 64):
15 in free list (0 min, 64 max allowed)
117 hits, 0 fallbacks
16 max cache size, 16 in cache
Serial6 buffers, 1524 bytes (total 64, permanent 64):
15 in free list (0 min, 64 max allowed)
13985 hits, 0 fallbacks
16 max cache size, 16 in cache
Serial7 buffers, 1524 bytes (total 64, permanent 64):
15 in free list (0 min, 64 max allowed)
8006 hits, 0 fallbacks
16 max cache size, 16 in cache
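
One way to read the public pool counters above is as a miss ratio per pool. A small Python sketch that does this on a trimmed copy of the output; treating hits plus misses as the total number of requests is an assumption made for this example, and the function name is made up:

```python
import re

# Trimmed copy of two pools from the "show buffers" output above.
SAMPLE = """\
Small buffers, 104 bytes (total 50, permanent 50):
109345507 hits, 50 misses, 150 trims, 150 created
Middle buffers, 600 bytes (total 25, permanent 25):
77737 hits, 0 misses, 0 trims, 0 created
"""


def miss_ratios(text: str) -> dict:
    """Map each pool name to misses / (hits + misses)."""
    ratios = {}
    pool = None
    for line in text.splitlines():
        header = re.match(r"\s*(\w+) buffers,", line)
        if header:
            pool = header.group(1)  # remember which pool the next counter line belongs to
            continue
        counts = re.match(r"\s*(\d+) hits, (\d+) misses", line)
        if counts and pool:
            hits, misses = int(counts.group(1)), int(counts.group(2))
            requests = hits + misses
            ratios[pool] = misses / requests if requests else 0.0
    return ratios


for name, ratio in miss_ratios(SAMPLE).items():
    print(f"{name}: {ratio:.7%} misses")
```

On these numbers the ratios are tiny and every pool reports 0 failures, so the buffers themselves do not look like the bottleneck.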

This is show queue serial 0...
Input queue: 0/75/0 (size/max/drops); Total output drops: 1
Queueing strategy: weighted fair
Output queue: 0/1000/64/0 (size/max total/threshold/drops)
Conversations 0/40/256 (active/max active/max total)
Reserved Conversations 0/0 (allocated/max allocated)

I am definitely receiving late collisions on my router's Ethernet interface log. We have switched cables and also made sure duplex/speed are matched as best I can. Currently set to half duplex and 10Mbit.

One mystery solved and apparently more information led to the solving of it..
Guess the controllers just don't show up on the utilization charts.. Good call Garion!

I did a shut on serial 1-7 and utilization went down to 15-19% from 38%.. That is more reasonable for this router with just a frame relay only connection I suppose.

We do have 7 controllers hanging off of the router. We only need 3, so I am going to get the admin of that location to consolidate. I had been thinking about that but didn't realize how much CPU underutilized controllers could take up. I am not sure what you mean by "terminating DLSw", but we have these controllers servicing dumb terminals, and I have them set up anchored via a Loopback interface and configured with peers at our mainframe datacenters.

Could having these controllers hooked up cause issues with slow downs?