Mysterious year-long problem with network share: solve this and you are God :)

IronCrown

Junior Member
Apr 23, 2012
12
0
0
This problem has been tormenting us for over a year now. Nobody can find a solution. I have tried many things myself. We also have an external IT service provider who manages our network and servers; three of their employees have spent hours and hours trying to solve the problem, at considerable cost, and it was all for nothing.


And this is what happens:

Applications often just stop working; the window fades out and the cursor turns into the rotating circle indicating that the computer is busy. Sometimes the computer returns to normal after a really long time (several minutes), but usually nothing can be done except killing the application. Often even that is impossible and the PC has to be rebooted. The application is usually Word or Excel, but it can also be Firefox or our email software Tobit David, so I don't think it's an MS Office thing. When the application hangs and you open "My Computer", the contents usually do not load, as if the network share were unavailable. The folder window usually shows the green loading bar indefinitely.

This happens mostly when loading or saving, though occasionally it happens when you just open a folder or sort the contents of one or just look at it. But it ONLY happens when the user is working with files on our network share. It NEVER happens when the user has only files opened that are on local drives. The network share is located on the domain controller.


Now more details:

As I said the problem occurs when the user is working with files on the network share, but it does not happen for all users or for all of them at the same time.

Some of our PCs are not affected at all, they always work fine. These PCs are two older ones that run XP and Vista, respectively. But we also have a couple of DELL laptops running Win 7 or Vista and they also never hang – even when the affected PCs are having the issue. So it's not a general problem with the network share.

The PCs that are affected have all been bought from our IT provider, but over a period of several years. They have different mainboards and NICs. All of them run Win 7, some 32-bit, some 64-bit. Some of them have been with us for years and ran smoothly until about 14-15 months ago when the trouble started.

Sometimes several PCs have the problem at once or in short succession. Other times, one user can work a whole day without problems while another can hardly work at all because every time he tries to save his work the PC hangs. The next day it might be reversed. Some days nobody is having a problem at all. We even had periods of a whole week without the problem, and then it was back worse than ever. There were no regularities, like always the same colleague on vacation when it didn't happen or something like that.

So we have no way to reliably reproduce the error. Many times we were hoping to have found the cure because we did something and the problem did not occur for several days, but then it was back.


More details about the hardware and network:

In the long time we have been having this issue, a lot about our hardware has changed without any noticeable effect on the problem: Our network switch broke and we replaced it. We moved our whole business to another building, so all the cabling is brand new. I replaced several of the NICs in the PCs. We even have a new Lancom router. We also bought some new PCs; all of them are affected, as are several PCs we already had before.

The physical network connection seems to be perfectly fine. For testing we had a ping to the server running on the affected machines. Even when the PC started to hang, the ping continued smoothly with <1ms latency.

When the problem occurs, the server (domain controller) never shows anything unusual. It's not overloaded or unresponsive. Our IT provider checked the event reports and never found anything. For a while the server often ran out of RAM, but we recently upgraded it so it never goes above 40% memory used. We also upgraded the read/write cache which increased performance considerably in normal operation but did not eliminate the hang-ups.

Other short facts:
- the DC is an HP ProLiant running Windows Server 2003
- there's another backup DC (the old server) that runs Windows 2000 Server
- we have about a dozen workstations and about ten laptops
- McAfee is used for security
- Tobit David email server is running on the DC
- several MS SQL databases are also running on it (MSSQL 2008 Express)
- backup of the network share is done with Acronis

I'll greatly appreciate any suggestions. Even if it sounds improbable... we probably tried all the probable things already. Or maybe not.
 

JackMDS

Elite Member
Super Moderator
Oct 25, 1999
29,545
421
126
The phenomenon is happening because the network connection is lost while working through the network. The applications that have remote files open can no longer find those files, so they hang.

There can be many reasons for a network losing its connection. If the IT pros cannot find it on site, the probability that you can solve it through an online forum is not very high.

As a long shot: I do not know what "McAfee is used for security" means.

But if it is installed on the individual computers, try to get rid of it. Do not just disable it; disabling does not prevent it from interfering with the TCP/IP stack. Use Microsoft Security Essentials instead.


:cool:
 

spidey07

No Lifer
Aug 4, 2000
65,469
5
76
This smells very strongly like some kind of DNS/Windows name resolution problem. That's the exact symptom you'll get (computer hangs in open/save/network browsing/displaying mapped drives via Explorer). You'd have to take a complete look at DNS and at every single client's network configuration/bindings.

What is providing DNS for the clients and what DNS servers are listed on the clients? There should only be the internal DNS server(s).

I'm assuming you don't have underlying physical layer problems (cabling) which is always a possibility and the number 1 cause of "network weirdness" and slow performance.

A packet capture of the problem on the client machine would be the best way to find out what it's looking for when the problem occurs.
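Short of a full packet capture, a rough first check is to time name lookups from an affected client and see whether they stall at the moments the applications hang. The following is only a minimal sketch: the function names are made up for illustration, the 1-second threshold is an arbitrary assumption, and it uses whatever resolver the operating system provides, so it cannot tell DNS apart from other name resolution mechanisms.

```python
import socket
import time


def time_lookup(hostname, resolver=socket.getaddrinfo):
    """Resolve hostname once; return (elapsed_seconds, addresses or None)."""
    start = time.perf_counter()
    try:
        infos = resolver(hostname, None)
        # Each entry is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the address string.
        addresses = sorted({info[4][0] for info in infos})
    except OSError:
        # socket.gaierror (lookup failure) is a subclass of OSError.
        addresses = None
    elapsed = time.perf_counter() - start
    return elapsed, addresses


def is_suspicious(elapsed, addresses, slow_threshold=1.0):
    """Flag lookups that failed or took longer than the assumed threshold."""
    return addresses is None or elapsed > slow_threshold
```

Calling `time_lookup()` with the file server's real name every few seconds and writing the result to a log with a timestamp would show whether slow or failed lookups line up with the hangs.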
 

imagoon

Diamond Member
Feb 19, 2003
5,199
0
0
I am going to echo Spidey here. It sounds like a "non AD admin" is maintaining AD. This sounds like DNS misconfiguration on the workstations and possibly the server. Short of shoddy physical layer stuff.

Also get rid of that 2k server...
 

_Rick_

Diamond Member
Apr 20, 2012
3,952
70
91
You should probably use Wireshark (formerly Ethereal) and analyze the events at the packet level.
Preferably on a test server and test client, as logging packets can be a performance drain you may not want on your production network.

Then, once you have a machine that has "lost" CIFS, you will have to examine the packets that came before.
From what we know it could be anything, but it appears to be in the higher layers of the OSI model, given that you have changed cabling and hardware without effect.
Corrupt DNS caches could be one problem, so could some weird authentication issues, or just a temporary loss of connection with Windows failing to resuscitate the link. The latter should not happen (when my server kernel panics and I have to reboot, the Windows client often does reconnect to the Samba share after a while, without logging off).

For additional troubleshooting, I recommend the Samba suite, which allows you to do some quick diagnostics on the command line from any Linux terminal. In some cases that is quicker than clicking through the endless AD interfaces.

Smells like a configuration issue in the "software" stack in any case, if your observations regarding the lower OSI layers are correct. Be sure to verify those observations by running continuous diagnostics on the Ethernet and IP layers.
Without continuous observation, it is impossible to determine what happens in the run-up to the problem. You may even run your diagnostics in a virtual environment, which requires fewer resources. Create VM images of your standard deployed image and of your server image, and run a few instances on some free machine(s). You might even want to go as far as virtualizing your entire current network. That will cost more resources, but you'll be able to run closer to reality than when using bare images.
Set up some scripts that generate network I/O reasonably similar to what your server sees, and let them run. If you can isolate the actual problem on the clients (i.e. share inaccessible), then you can use that as a test condition to mark your packet capture logs or even freeze a VM, if your virtualizer has an accessible API and you are somewhat handy with code.

I'm afraid there is no magic bullet for hard to reproduce problems. You've got to find out how to reproduce them, and sometimes only statistics helps.
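As one possible form of the "test condition" mentioned above, here is a minimal sketch of a probe that tries to list a directory on the share and reports whether that worked, failed, or did not return within a time limit. The function names and the 5-second timeout are assumptions made for illustration, not part of any existing tool.

```python
import os
import threading
import time


def probe_share(path, timeout=5.0):
    """Try to list path; return (status, seconds), status is ok/error/hang."""
    result = {}

    def worker():
        try:
            os.listdir(path)
            result["status"] = "ok"
        except OSError:
            result["status"] = "error"

    start = time.perf_counter()
    # Daemon thread: a listing that never returns will not block interpreter exit.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    elapsed = time.perf_counter() - start
    # If the worker has not stored a status yet, the listing is still blocked.
    return result.get("status", "hang"), elapsed


def log_line(status, elapsed, now=None):
    """Format one timestamped log line (now=None means the current time)."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return f"{stamp} {status} {elapsed:.3f}s"
```

Run in a loop every few seconds against the share's path on an affected client, the timestamps of any "hang" lines give you the points in the packet capture to look at.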
 

SilthDraeth

Platinum Member
Oct 28, 2003
2,635
0
71
Are you sharing from the server using the Distributed File System (DFS), or just going to the directory and sharing it out by applying sharing and folder security permissions?
 

her209

No Lifer
Oct 11, 2000
56,336
11
0
What OS is the server running the file share? What's your CAL licensing model set to on the server?
 

her209

Also check to see if there is a specified limit on the number of users that can connect to the share.
 

IronCrown

Thank you all for the suggestions so far. I'll try them out as far as I can and let you know if it worked.

SilthDraeth said:
Are you sharing from the server using a Distributed File System, or just going to the directory and sharing it out by applying sharing and folder security permissions?
It's just a shared folder on the Windows 2003 Server that is accessed like a regular hard drive by everyone.

her209 said:
What OS is the server running the file share? What's your CAL licensing model set to on the server?
Win 2k3 Server. I've never heard of CAL licensing before so I don't know.
There is no user limit set for the share.

@_Rick_: Your suggestions sound really complicated and like something our supposed network specialist firm should know a lot more about than me. But if all else fails I guess I'll have to get into it...

@spidey07: Your suggestions sound plausible and reasonably simple, so I'll try that first. Any ideas on specific things I should look out for?

My plan for tomorrow is that I'll disable the DHCP server on the DC, check every computer and printer in our offices and set them all to static settings for IP, subnet, gateway and DNS.

Right now both our DCs (the 2k and 2k3 server) act as DNS. I might also add that our network uses 10.0.1.x IP addresses but with a 255.255.255.0 subnet mask. I inherited this setup from my predecessor... but although unusual, it shouldn't cause problems, right?

edit: First thing I noticed is that in the DHCP server settings, there were three entries for DNS: The primary DC, the secondary DC and also the router/gateway. I removed the latter two and also disabled all DHCP functionality on the router.

edit2: I noticed another odd thing: In the TCP/IP settings of the old 2k server, in the WINS tab, there was an entry for an even older server that does not exist anymore. It was removed from the network at least a year before the problems began though, so this is probably not the cause.
 

imagoon

IronCrown, what is supposed to be providing DHCP? The simplest way to handle DHCP on a Windows domain is with a Windows DHCP server. Make sure the DNS servers are set to the domain controllers (they should be running DNS if they are running AD). Then what you are doing is a step in the right direction. Any devices that are hard set, such as the servers, need their DNS settings reviewed and corrected.

In the Windows world DNS has to work in both directions. The server and workstation names really do need to resolve properly. WINS is a whole different ball of wax and I would start with: do you need it? It is NetBIOS name resolution for multiple sites and it can cause its own problems.

10.0.1.0/24 is not odd at all either.
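For anyone who wants to convince themselves, Python's standard `ipaddress` module can show that 10.0.1.x with a 255.255.255.0 mask is just an ordinary /24 inside the private 10.0.0.0/8 range. The host address 10.0.1.42 below is a made-up example.

```python
import ipaddress

# The mask can be given in dotted form; it is normalized to a prefix length.
network = ipaddress.ip_network("10.0.1.0/255.255.255.0")

print(network)                     # 10.0.1.0/24
print(network.is_private)          # True
print(network.num_addresses - 2)   # 254 usable host addresses

# Hypothetical client address, used only to show the membership test.
print(ipaddress.ip_address("10.0.1.42") in network)  # True
```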
 

IronCrown

Yes, the DHCP server is running on the primary DC. WINS is probably not needed and was just there by default I guess.

I have begun to reconfigure all clients with static IP/subnet/gateway/DNS. One was already set up with static data and it was an affected machine, so that wasn't it. Or rather, if that was it, some machines cause the problems for everyone.

The most promising thing yet is probably the old server (that was still used as secondary DNS) having some link to the non-existent even-older server. I have now removed all DNS entries pointing to the old server and disabled the DNS service on the machine altogether.

Since then two hours have passed without the problem, but that doesn't mean anything yet. (Although we were massively hit the hours before, which fills me with a certain sense of probably false hope)

One other thing: One of my colleagues reported that she has been having problems with her printer since... forever. Sometimes she can print, sometimes the printer shows as offline. What happened was that when I checked the old server, I noticed that there was one and only one employee with active connections to it, and the accessed files were related to the printer spooler. Turned out that she (and only she) used a network printer that was installed on the old server (but does get its IP from the DHCP server running on the newer machine). I deleted all printers on the old server. Tomorrow I will reinstall all printers on the primary DC with their new, static IP addresses.
 

imagoon

Keep at it. One of the main parts of IT that gets missed is the maintenance. There is no "ROI" (there is, but it is hard to explain to a lot of managers). Slowly work through the low-hanging fruit and the harder stuff will become apparent or even go away.
 

Zargon

Lifer
Nov 3, 2009
12,218
2
76
Sounds like a full settings audit is the way to go.
 

dawks

Diamond Member
Oct 9, 1999
5,071
2
81
I'll roll with Jack on this one. I had a problem where our payroll staff would report that Excel locks up when working with files on a mapped network drive. Windows Server 2003 has a default timeout of about 15 minutes. If there is no activity on an open file, Windows Server will disconnect. Excel freaks out and crashes. Took me a while to work this one out, but setting a higher timeout fixed it for me.

Windows Server drops unused connections by default after 15 minutes, just to limit resource consumption.

I bumped mine to a few days since my file server load is low relative to resources available.

Run 'net config server' and check 'Idle session time'.

Vista and Win7 may be less likely to be affected since they are running SMB2...

I'd also verify the DNS stuff too I guess.

http://support.microsoft.com/kb/556004
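If you want to check that value on several servers, or record it over time, a small parser for the command's text output is enough. This is a rough sketch: the sample text below was typed by hand to resemble the output and is not a real capture, the function name is made up, and on a non-English Windows the label will be worded differently, so the pattern would need adjusting.

```python
import re


def parse_idle_session_minutes(net_config_output):
    """Return the idle session time in minutes, or None if no such line exists."""
    # Matches a line starting with the label and ending in an integer.
    pattern = re.compile(r"^\s*idle session time.*?(-?\d+)\s*$", re.IGNORECASE)
    for line in net_config_output.splitlines():
        match = pattern.match(line)
        if match:
            return int(match.group(1))
    return None


# Illustrative excerpt only, written by hand; not copied from a real server.
SAMPLE = """\
Maximum open files per session        16384

Idle session time (min)               15
The command completed successfully.
"""

print(parse_idle_session_minutes(SAMPLE))  # 15
```

On the server itself the text could come from `subprocess.run(["net", "config", "server"], capture_output=True, text=True).stdout`. The value is changed with `net config server /autodisconnect:<minutes>`.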
 

kevnich2

Platinum Member
Apr 10, 2004
2,465
8
76
I would caution against disabling DHCP/DNS on your DC; it's there for a reason. You do likely have a misconfiguration in the settings, but that doesn't mean you need to completely disable it. Some sysadmins like static IPs set on the client. I myself prefer DHCP on a DC, as it gives centralized administration of all IPs and makes changes, such as increasing your subnet, easy. I would reconfigure your settings, as it looks like some of the DNS settings are incorrect, but I would recommend keeping DHCP and DNS active on the DC. Otherwise, in a few days you'll likely start noticing things stop working on the client side with regard to network name resolution.
 

Zargon

kevnich2 said:
I would caution against disabling DHCP/DNS on your DC, it's there for a reason.

This.

Going manual to test some stuff is fine, but don't go back to static everywhere. IMO that's going backwards big time.
 

IronCrown

dawks said:
I'll roll with Jack on this one. I had a problem where our payroll staff would report that excel locks up when working with files on a mapped network drive. Windows Server 2003 has a default timeout of like 15 minutes. If there is no activity on an open file, Windows Server will disconnect. Excel freaks out and crashes. Took me a while to work this one out, but setting a higher timeout fixed it for me.

Windows Server drops un-used connections by default after 15 minutes, just to limit resource consumption.
Thanks, but this is something I tried early on; it did not fix the problem.

kevnich2 said:
I would caution against disabling DHCP/DNS on your DC, it's there for a reason.
Yes, I reconsidered that one already. We occasionally have "visitors" (mostly executives coming in for board meetings) who need WLAN access on their own notebooks, so I need DHCP for that. I will however reconfigure all our permanent devices (desktop PCs and printers) with static settings. Our network is after all quite small and not likely to grow a lot.


This morning everything still went smoothly for two colleagues who came in very early. Then an hour later, when two other colleagues started working, the problem returned. I was not there as I came in only at noon (because I'm planning a night shift today to reconfigure everything). The problems were so massive that no one could work at all for about two hours. Even the two people on the computers that were never affected before were affected this time. Either the partial reconfiguration I did yesterday worsened the problem or it was something unrelated with similar symptoms.

The best thing is, when I arrived at the office (my boss had called me about two hours before and told me that no one could work), everything had returned to normal. No trace of the problem. My colleagues actually thought that I had done something remotely that fixed it.

I questioned everyone about what exactly they did shortly before the problem disappeared. One colleague told me that she switched ON a network printer at the time. This printer still gets its IP address via DHCP. I also remember that there was a problem with it about a year ago when some machines refused to connect to that printer for unknown reasons. So my newest theory is that maybe some other machine always tries to find this printer in the network, and when it doesn't find it, everything breaks down. But this is only the latest theory out of a thousand... ;)
 

imagoon

IronCrown said:
Yes, I reconsidered that one already. We occasionally have "visitors" (mostly executives coming in for board meetings) who need WLAN access on their own notebooks, so I need DHCP for that. I will however reconfigure all our permanent devices (desktop PCs and printers) with static settings. Our network is after all quite small and not likely to grow a lot.

Don't waste your time. Set up DHCP correctly from the get-go. Static IPs do not always register correctly in AD's dynamic DNS. Static addressing is also a real PITA on laptops.

What you need:

DC with DNS and DHCP roles installed.

DNS should be AD integrated. Create a reverse zone for the 10.0.1.0/24 range.
On your DC do an IPCONFIG /REGISTERDNS. Repeat on all static IP servers.

DHCP:
Create a scope for your 10.0.1.0/24. Make sure the assignable range does not include your servers.
Set the DNS servers to the ip addresses of your DCs
Set the gateway to your internet gateway.
Set the DNS suffix to your domain's DNS suffix, for example office.blah.com.
Create MAC address based reservations for the printers.
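Two of the points above are easy to sanity-check on paper before touching the server: the name of the reverse lookup zone for the subnet, and whether any statically addressed machine sits inside the range DHCP hands out. The following is a minimal sketch; the function names, the pool boundaries and the server addresses are all made up for illustration.

```python
import ipaddress


def reverse_zone_name(network):
    """Reverse lookup zone name for an IPv4 /24, e.g. 1.0.10.in-addr.arpa."""
    net = ipaddress.ip_network(network)
    if net.version != 4 or net.prefixlen != 24:
        raise ValueError("this sketch only handles IPv4 /24 networks")
    first, second, third, _ = str(net.network_address).split(".")
    # The zone name is the network octets in reverse order.
    return f"{third}.{second}.{first}.in-addr.arpa"


def addresses_inside_pool(pool_start, pool_end, fixed_addresses):
    """Return the fixed addresses that wrongly fall inside the DHCP pool."""
    start = ipaddress.ip_address(pool_start)
    end = ipaddress.ip_address(pool_end)
    return [a for a in fixed_addresses if start <= ipaddress.ip_address(a) <= end]


# All addresses below are hypothetical examples.
print(reverse_zone_name("10.0.1.0/24"))
# prints: 1.0.10.in-addr.arpa
print(addresses_inside_pool("10.0.1.100", "10.0.1.200", ["10.0.1.10", "10.0.1.150"]))
# prints: ['10.0.1.150']
```

An empty list from `addresses_inside_pool()` means none of the listed static addresses can collide with a lease.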

Verify that all DCs running DNS are sharing the AD-integrated DNS zone. Make sure they can all resolve "www.google.com". Check the forwarder tabs; in a small company they should be empty, and if they are not, make sure the DNS servers they point at work.

There is a bit more to it, but at this point start rebooting machines. Make sure they can resolve the DNS names of the AD servers, e.g. myadserver1.office.blah.com.
On reboot they should register in DNS in both the forward and reverse zones.

There may be a bit more but once this set up is solid, we can start looking in the event logs on the servers to see if there are events being generated.

When it comes to Windows domains, the above is by far the most common screw-up that results in all kinds of "odd" issues.

You really may need a pro to walk in and fix it properly depending on how bad it is because people at a forum can't see the entire setup.

Another thing to verify is that the time on the DC is set correctly.

Also verify that a live DC has all the FSMO roles assigned. If an older server was decommissioned incorrectly it will also cause a nightmare like this.

The 2 hour thing screams a misconfiguration at the most basic levels. You need to go through the settings and see what is being handed out.
 

IronCrown

The thing is that those steps have supposedly been done before by the supposedly professional network specialists we pay thousands of bucks a year to service our network. I have to try something different.

Since they continue to fail and my trust in them is near zero now, I'm trying to do the work I should not have to do myself. I am also an IT professional and learned this stuff in my apprenticeship, but my work is usually more general and I have external service providers for specific areas where I lack in-depth knowledge or simply do not have the time to do it all (I am kind of the one-man IT branch in my firm).
 

imagoon

IronCrown said:
The thing is that those steps have supposedly been done before by the supposedly professional network specialists we pay thousands of bucks a year to service our network. I have to try something different.

Since they continue to fail and my trust in them is near zero now, I'm trying to do the work I should not have to do myself. I am also an IT professional and learned this stuff in my apprenticeship, but my work is usually more general and I have external service providers for specific areas where I lack in-depth knowledge or simply do not have the time to do it all (I am kind of the one-man IT branch in my firm).

The "try something different" really should be: "get a new provider."

You don't hit your car engine with a hammer to "try something different" when the mechanic hasn't fixed that noise the last 5 times you took it to him, you go to a new mechanic.

You will make it worse if you mess with things without fully understanding what you are doing.
 

kevnich2

As others have stated, I would look for a new IT contractor/consultant. This requires someone trained in system and network administration. No offense here - but you do not have the training or experience from what you've posted. You may end up making the problem worse with your tweaks and reconfiguration of everything.

Go with the OSI model and start working your way up. Your approach right now has no rhyme or reason to it. If that doesn't make sense to you, put down what you're doing and start calling a few SMB IT consulting companies that have people certified in network troubleshooting to look at it. That's what their job is.

It may take a while for them to see the issue itself and fix it but it seems to be costing you money right now when it happens.

One other thing: turn on every device (printer, computer, laptop, wireless phone, etc.) that your office uses, wait about 5 minutes and see if the problem comes back. I'd say one of your devices or printers is causing this, as it sounds sporadic, possibly through a broadcast storm from one of these devices. That would bring a network to its knees.
 

spidey07

Or a printer thinking it's the master browser for netbios name resolution and a hodge podge of other devices that do or don't have netbios turned on or other protocols loaded. It just screams name resolution.
 

IronCrown

imagoon said:
There is a bit more to it but at this point start rebooting machines. Make sure they can resolve the DNS names of the AD servers. IE myadserver1.office.blah.com
On reboot they should register in DNS in both the forward and reverse zones.
Dynamically configured computers do show up in the forward lookup zone (A record), but not in the reverse lookup zone. In the DHCP server, the options for dynamic updating of DNS are enabled.

Any idea why no entries in the reverse lookup zone are created?

All devices with static IPs (outside of the range of assignable addresses of the DHCP of course) do create entries in both zones after startup and can resolve all names just fine.
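To see at a glance which machines resolve forward but not in reverse, a check along these lines could be run from any client. It is a minimal sketch with a made-up function name, and it relies on the operating system's resolver, which on Windows may answer a reverse lookup by other means than a PTR record, so a hit here is weaker evidence than `nslookup <ip>` run against the DC.

```python
import socket


def check_forward_and_reverse(hostname,
                              forward=socket.gethostbyname,
                              reverse=socket.gethostbyaddr):
    """Return a dict with the forward address and reverse name, None if missing."""
    report = {"hostname": hostname, "address": None, "ptr_name": None}
    try:
        report["address"] = forward(hostname)
    except OSError:
        # socket.gaierror is a subclass of OSError: no forward answer.
        return report
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist).
        report["ptr_name"] = reverse(report["address"])[0]
    except OSError:
        # socket.herror is a subclass of OSError: forward works, reverse does not.
        pass
    return report
```

Running this over the list of workstation names and printing the entries where `address` is set but `ptr_name` is `None` gives the set of machines to look at.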