Win Server Auth Issues

[DHT]Osiris

Lifer
Dec 15, 2015
17,238
16,456
146
All,

I've got a rather insidious issue that I'm hoping some minds here can shed some light on. I'm not 100% sure this is the best forum for it, as it lies somewhere between OS, security, computer help, and highly technical.

We've got a pair of Server 2016 VMs acting as a failover cluster for some file server nodes on our network, and we're having an issue with certain types of authentication. On occasion (a day or two post-restart; I'll get to that later) one or sometimes both of the cluster hosts will pass into a 'failing authentication' state (my words), in which auth requests from service accounts fail against them. Two things are affected: our network monitoring program (Frameflow), which makes WMI queries (I think) for perf data, disk usage, etc., and our backup program (VEEAM), which uses a service account and fails to connect remotely. While a host is in that state I see very consistent security log failures for the service accounts, but only from those two service-account boxes. The result is failed backups from VEEAM and failed monitoring in Frameflow.

All boxes are domain-joined, all on the same domain, and all are Server 2016. The failover cluster operates normally while in the 'failed state', and restarting the affected cluster node VM resolves the issue (usually for <24h or so). Authentication otherwise works fine (VM authing to the domain, users connecting to shares hosted on the node, etc.), and as far as I can tell the only things that cannot reach the node correctly are these two boxes with service accounts. Restarting those two servers does nothing.

Extra bonus: all the security log entries return the same auth failure status code, 0xC0000122, which resolves to the NT status code 'Invalid computer name'. I've <never> heard of it, and apparently neither has the internet. I ran Wireshark on the node itself to capture traffic during a failed authentication attempt. I see the NTLM/NTLMSSP exchange, then traffic initiated from the cluster host to a DC (presumably to authenticate the connecting box/service account), then the DC's response, and finally the error response sent from the cluster host to the failing box (in this specific capture, the VEEAM server). That tells me the DC is probably the one generating the failure code, but that still makes little to no sense in my mind.
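For anyone wanting to sanity-check the code itself, here is a minimal Python sketch that splits an NTSTATUS value into its standard fields (severity, customer flag, facility, code). The helper name `decode_ntstatus` is made up for illustration, and the lookup table is only a handful of well-known logon-related codes, not a complete list.

```python
# Hypothetical helper: break a 32-bit NTSTATUS value into its fields.
# Layout: bits 31-30 severity, bit 29 customer flag,
#         bits 27-16 facility, bits 15-0 code.

# Short, deliberately incomplete table of logon-related status codes.
KNOWN_STATUS = {
    0xC0000122: "STATUS_INVALID_COMPUTER_NAME",
    0xC000006D: "STATUS_LOGON_FAILURE",
    0xC000006A: "STATUS_WRONG_PASSWORD",
    0xC0000064: "STATUS_NO_SUCH_USER",
    0xC000005E: "STATUS_NO_LOGON_SERVERS",
    0xC0000234: "STATUS_ACCOUNT_LOCKED_OUT",
}

SEVERITY = {0: "success", 1: "informational", 2: "warning", 3: "error"}


def decode_ntstatus(value):
    """Return the fields of an NTSTATUS value as a dict."""
    value &= 0xFFFFFFFF  # treat the input as an unsigned 32-bit number
    return {
        "hex": f"0x{value:08X}",
        "name": KNOWN_STATUS.get(value, "unknown (not in this short table)"),
        "severity": SEVERITY[value >> 30],
        "customer_defined": bool((value >> 29) & 1),
        "facility": (value >> 16) & 0x0FFF,
        "code": value & 0xFFFF,
    }


if __name__ == "__main__":
    print(decode_ntstatus(0xC0000122))
```

For 0xC0000122 this reports an error-severity, Microsoft-defined code in facility 0, which is consistent with it being a plain NT status rather than something application-specific.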

Extra extra bonus: I can't log into the DC to monitor logs (campus network, silo'd heavily), and for the same reason I can't log into the firewall between these systems to see if it's manipulating packets in some way. I've got both teams looking at their respective elements. Additional surprise component: the DCs may not be Windows (I'm fairly certain they aren't).

So... Has anyone ever seen that NT status code, has anyone ever seen something like this come up, and/or does anyone have any ideas of what I could look at to fix this?
 

XavierMace

Diamond Member
Apr 20, 2013
4,307
450
126
Cross-forest authentication could be a factor if that's in play. Are both systems on a domain, and if so, the same domain? What OS are the Veeam and Frameflow servers running? I've also seen monitoring systems hang onto outdated system names even after you update them. If Veeam and Frameflow are using the same service account, the first thing I'd do is split them into separate accounts. I'd also confirm Frameflow is using WMI for monitoring. If it's providing "realtime" monitoring with a "normal" polling interval (90 seconds) and it's monitoring CPU, memory, disks, and NIC, then you should be seeing at least 4 login attempts from the Frameflow box every 2 minutes. Do you have access to the domain controller(s) in question? Check for outdated DNS records for your file servers.
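The "at least 4 login attempts every 2 minutes" estimate is simple arithmetic, sketched below in Python. It assumes one logon per monitored counter per poll, which is the assumption behind the estimate above and not confirmed Frameflow behaviour; the function name is hypothetical.

```python
def min_expected_logons(window_s, poll_interval_s, counters):
    """Lower bound on logon attempts that should appear in a time window.

    Assumption (not confirmed for any specific product): each poll
    performs one logon per monitored counter.
    """
    if poll_interval_s <= 0:
        raise ValueError("poll interval must be positive")
    # Any window of this length contains at least this many poll instants.
    full_polls = window_s // poll_interval_s
    return full_polls * counters


if __name__ == "__main__":
    # 4 counters (CPU, memory, disks, NIC), 90 s polling, 2-minute window
    print(min_expected_logons(120, 90, 4))
```

Comparing that lower bound against the number of failures actually logged from the monitoring box in the same window gives a quick hint as to whether every poll is failing or only some of them.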
 

[DHT]Osiris

Interesting. All the systems referenced in the above scenario are in the same domain; however, we *are* utilizing cross-domain auth as part of a transition from DomainA and DomainB into DomainC (DomainC being where these systems already live). Some users and servers are still in DomainA/B.

VEEAM and Frameflow are both on Server 2016. Nothing's been renamed, however; they're all original VMs (rebuilt, not migrated or anything).

They definitely don't use the same service account.

Frameflow is polling every 5m, and no I (unfortunately) don't have access to the DC logs, where I suspect I'd find something... Got a team looking at that side though.

Def isn't DNS, as all the systems are resolvable normally from every angle imaginable when the system passes into a failed state.
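To make the "every angle" DNS check repeatable from each box, something like the following Python sketch could be run while a node is in the failed state. It does a forward lookup, reverses each returned address, and flags mismatches. `check_resolution` is a hypothetical helper and any hostname passed to it is a placeholder; note that `socket.gethostbyname_ex` only covers IPv4.

```python
import socket


def check_resolution(hostname,
                     lookup=socket.gethostbyname_ex,
                     reverse=socket.gethostbyaddr):
    """Forward-resolve a name, reverse each address, report consistency.

    The lookup/reverse parameters exist so the resolver can be swapped
    out (for example with stubs when no network is available).
    """
    result = {"hostname": hostname, "addresses": [],
              "reverse": {}, "consistent": False}
    try:
        _, _, addresses = lookup(hostname)
    except OSError as exc:  # gaierror and herror are OSError subclasses
        result["error"] = str(exc)
        return result

    result["addresses"] = addresses
    for addr in addresses:
        try:
            name, aliases, _ = reverse(addr)
            result["reverse"][addr] = [name, *aliases]
        except OSError:
            result["reverse"][addr] = []  # no PTR record or lookup failed

    # Consistent only if every address maps back to the name we asked for.
    wanted = hostname.lower().rstrip(".")
    result["consistent"] = bool(addresses) and all(
        wanted in (n.lower().rstrip(".") for n in names)
        for names in result["reverse"].values()
    )
    return result
```

Running it with the exact name configured in the backup and monitoring tools, from both of those servers and from the node itself, would show whether all three agree on forward and reverse records at the moment of failure.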

I really think that technet article is the closest thing to a rational explanation I've seen though, as the cluster hosts process auth requests cross-domain for other users (they're general purpose file servers for multiple domains right now, DomainA, B, and C as listed above) so it's possible we're seeing that 'cross-forest timeout -> send auth requests to wrong DC' issue. Curiously, that blog seems to indicate that a fix was in place/on the way for Server2016 (which all the referenced systems are). I wonder if it was intended to end up on the DCs though, rather than the client.
 

[DHT]Osiris

Are Frameflow and Veeam using the FQDNs for your systems?
Yep, should be, I'll triple check it in the AM.

Even if they aren't, though, I'd be surprised if resolution were failing periodically. Our environment doesn't use Dynamic DNS in any way, so resolution should always be pretty static.