- Dec 15, 2015
All,
I've got a rather insidious issue that I'm hoping some minds here can shed some light on. I'm not 100% sure this is the best forum for this particular query, as it sits somewhere between OS, security, computer help, and highly technical.
We've got a pair of Server 2016 VMs acting as a failover cluster for some file server nodes on our network, and we're having an issue with certain types of authentication. On occasion (a day or two post-restart; I'll get to that later) one or sometimes both of the cluster hosts will pass into a 'failing authentication' state (my words), whereupon auth requests from service accounts fail against them. Two things are affected: our network monitoring program (Frameflow), which makes WMI queries (I think) for perf data, disk usage, etc., and our backup program (VEEAM), which uses a service account and fails to connect remotely. While a host is in the 'failing authentication' state I can see very consistent security log failures for the service accounts, but only from these two service account boxes. The result is failed backups from VEEAM and failed monitoring from Frameflow.
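To confirm that the failures really do come only from the two service account boxes, the failed-logon entries can be tallied per account and source machine. This is a minimal sketch in Python, assuming the security log has been exported to CSV first. The column names (`EventId`, `Account`, `Workstation`, `Status`), the account names, and the host names are all hypothetical placeholders, and the sample rows are made up for illustration, not taken from the real logs. Event ID 4625 is the standard Windows failed-logon event.

```python
import csv
import io
from collections import Counter

FAILED_LOGON_EVENT_ID = "4625"  # Windows "An account failed to log on"


def tally_failures(csv_text):
    """Count failed-logon rows per (account, workstation, status)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter()
    for row in reader:
        if row["EventId"].strip() != FAILED_LOGON_EVENT_ID:
            continue  # skip successful logons and unrelated events
        key = (
            row["Account"].strip(),
            row["Workstation"].strip(),
            row["Status"].strip().lower(),  # normalise hex casing
        )
        counts[key] += 1
    return counts


# Made-up sample rows; names and layout are assumptions, not real log data.
SAMPLE = """EventId,Account,Workstation,Status
4625,svc_backup,BACKUP01,0xC0000122
4625,svc_monitor,MON01,0xC0000122
4624,jdoe,DESKTOP7,0x0
4625,svc_backup,BACKUP01,0xC0000122
"""

if __name__ == "__main__":
    for (account, workstation, status), n in tally_failures(SAMPLE).most_common():
        print(n, account, workstation, status)
```

If anything other than the two service account boxes shows up in a tally like this, that would change the picture considerably.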
All boxes are domain joined, all on the same domain, and all are Server 2016. The failover cluster itself operates normally while a node is in the 'failing authentication' state, and restarting the affected cluster node VM resolves the issue (usually for <24h or so). Authentication otherwise works fine (the VM authing to the domain, users connecting to shares hosted on the node, etc.), and as far as I can tell the only things that cannot reach it correctly are these two boxes with service accounts. Restarting those two servers does nothing.
Extra bonus: all the security log entries return the same auth failure status code, 0xC0000122, which resolves to the NT status code 'Invalid computer name'. I've never heard of it, and apparently neither has the internet. I ran a Wireshark capture on the node itself during a failed authentication attempt, and I see the NTLM auth exchange, the NTLMSSP exchange, traffic initiated from the cluster host to a DC (presumably to authenticate the connecting box/service account), a response from the DC, and finally the error response sent from the cluster host to the failing box (in this particular capture, the VEEAM server). That tells me the DC is probably the one generating the failure code, but that still makes little to no sense in my mind.
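For reference, an NTSTATUS value is a 32-bit number with a fixed layout: the top two bits are the severity, bit 29 is the customer flag, bits 16 to 27 are the facility, and the low 16 bits are the code. The sketch below, in Python, splits 0xC0000122 into those fields and looks it up in a small hand-written table. The table holds only a few logon-related codes and is not a complete list; the function and variable names are just for illustration.

```python
# A few well-known logon-related NTSTATUS values (deliberately incomplete).
KNOWN_STATUS = {
    0xC0000064: "STATUS_NO_SUCH_USER",
    0xC000006A: "STATUS_WRONG_PASSWORD",
    0xC000006D: "STATUS_LOGON_FAILURE",
    0xC0000122: "STATUS_INVALID_COMPUTER_NAME",
}

# Index is the 2-bit severity field (bits 30-31).
SEVERITY = ("success", "informational", "warning", "error")


def decode_ntstatus(value):
    """Split a 32-bit NTSTATUS value into its documented bit fields."""
    value &= 0xFFFFFFFF  # treat the input as unsigned 32-bit
    return {
        "severity": SEVERITY[value >> 30],
        "customer": bool(value & 0x20000000),  # bit 29
        "facility": (value >> 16) & 0x0FFF,    # bits 16-27
        "code": value & 0xFFFF,                # bits 0-15
        "name": KNOWN_STATUS.get(value, "unknown"),
    }


if __name__ == "__main__":
    for field, val in decode_ntstatus(0xC0000122).items():
        print(f"{field}: {val}")
```

Decoded this way, the value is an error-severity code in facility 0 with the customer bit clear, so it is a standard system status rather than a vendor-defined one. That does not say which machine generated it.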
Extra extra bonus: I can't log into the DC to check its logs (campus network, heavily siloed), and for the same reason I can't log into the firewall between these systems to see if it's manipulating packets in some way. I've got both teams looking at their respective pieces. Additional surprise component: the DCs may not be Windows (I'm fairly certain they aren't).
So... has anyone ever seen that NT status code, has anyone ever seen something like this come up, and/or does anyone have any ideas about what I could look at to fix this?