SQL Server / Windows OS losing connection to Domain Controllers

SSPI handshake failed with error code 0x80090311, state 14 while establishing a connection with integrated security; the connection has been closed. Reason: AcceptSecurityContext failed. The Windows error code indicates the cause of failure. No authority could be contacted for authentication.

Login failed. The login is from an untrusted domain and cannot be used with Windows authentication

 

A handful of my SQL Servers began losing their connection to the domain controllers after recent Windows patches. The only resolution was to reboot the SQL Server, which obviously incurred downtime. The issue hit two non-production VMs and also a Windows SQL Server cluster. Oddly, both nodes in the cluster were affected simultaneously, even though SQL wasn’t running on the passive node. After some troubleshooting with Microsoft, we identified the issue and I wanted to share it here. A fix is pending, targeted for July.

The issue affects Windows Server 2012 OSes utilizing iSCSI storage and was introduced with KB4012216, a March security roll-up. The system’s pool of ephemeral ports is gradually exhausted; once none are left, the server can no longer open outbound connections to a domain controller, which is why Windows authentication starts to fail. I won’t spend much time on how to isolate the specific data we collected for Microsoft. If you are experiencing this issue after recently applying patches, and you are running Server 2012 with iSCSI, that is probably proof enough.

UPDATE: We also observed this behavior on servers that were not using iSCSI; iSCSI was still enabled and still causing the problem. We also found corresponding Event ID 4227 entries in the System log.

TCP/IP failed to establish an outgoing connection because the selected local endpoint was recently used to connect to the same remote endpoint. This error typically occurs when outgoing connections are opened and closed at a high rate, causing all available local ports to be used and forcing TCP/IP to reuse a local port for an outgoing connection. To minimize the risk of data corruption, the TCP/IP standard requires a minimum time period to elapse between successive connections from a given local endpoint to a given remote endpoint.
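If you want to check whether a server has logged this warning, a minimal PowerShell sketch along these lines should work. The 7-day window and the cap of 20 events are arbitrary choices for illustration, not values from our case with Microsoft.

```powershell
# Look for TCP/IP port-exhaustion warnings (Event ID 4227) in the System log.
# The look-back window and event cap below are illustrative assumptions.
$since = (Get-Date).AddDays(-7)

Get-WinEvent -FilterHashtable @{
    LogName   = 'System'
    Id        = 4227
    StartTime = $since
} -MaxEvents 20 -ErrorAction SilentlyContinue |   # suppress the error raised when nothing matches
    Select-Object -Property TimeCreated, ProviderName, Message |
    Format-List
```

No output simply means no matching events were found in that window; it does not rule out the issue on its own.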

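You can view some details about ports in use with the following commands, the first being a PowerShell command:

    Get-NetTCPConnection | Group-Object -Property State, OwningProcess | Sort-Object Count
    netstat -anoq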
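As a rough sketch of how port usage could be compared against the size of the ephemeral range, the PowerShell below counts the distinct local ports in use inside the dynamic range and lists the busiest owning processes. The range values are the Windows defaults (49152 through 65535) and are an assumption here; confirm the real values on your server with `netsh int ipv4 show dynamicport tcp`. This is not the data collection Microsoft asked us for, just a quick way to watch the trend.

```powershell
# Assumed default dynamic port range; adjust to match the output of:
#   netsh int ipv4 show dynamicport tcp
$rangeStart = 49152
$rangeCount = 16384
$rangeEnd   = $rangeStart + $rangeCount - 1

# All TCP entries whose local port falls inside the dynamic range.
$dynamic = Get-NetTCPConnection |
    Where-Object { $_.LocalPort -ge $rangeStart -and $_.LocalPort -le $rangeEnd }

# Count each local port once, even if it appears in several entries.
$used = @($dynamic | Select-Object -ExpandProperty LocalPort | Sort-Object -Unique).Count
'{0} of {1} dynamic ports in use ({2:P1})' -f $used, $rangeCount, ($used / $rangeCount)

# Five process IDs holding the most entries in the dynamic range.
$dynamic |
    Group-Object -Property OwningProcess |
    Sort-Object -Property Count -Descending |
    Select-Object -First 5 -Property Count, Name
```

Running this every few hours and watching the percentage climb is one way to estimate how long a server has before it needs attention.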

There is no permanent solution, but the following are options for workarounds until a patch is released.

  1. The most obvious would be to uninstall the patches.  We uninstalled all three roll-up patches that we had applied, but Microsoft indicates that the problem was introduced by KB4012216.
  2. You can stop using iSCSI.  Not a viable solution for most.
  3. Increase the number of available TCP ephemeral ports and modify the TCP Time Wait Delay so that the issue takes longer to manifest.  Apply both changes from an elevated command prompt, then restart the server.
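A sketch of what the port-range and Time Wait Delay changes can look like, run from an elevated PowerShell session. The specific numbers are illustrative assumptions, not necessarily the values Microsoft gave us: the port range below starts at 10000, and `TcpTimedWaitDelay` is set to 30 seconds, the lowest value Windows accepts, on the assumption that a shorter wait is what is wanted so closed ports are released sooner.

```powershell
# Show the current dynamic (ephemeral) TCP port range before changing anything.
netsh int ipv4 show dynamicport tcp

# Widen the range to ports 10000-65534 (assumed values; pick a start port that
# does not collide with services listening on fixed ports in that range).
netsh int ipv4 set dynamicport tcp start=10000 num=55535

# Set TcpTimedWaitDelay to 30 seconds (assumed value; valid range is 30-300).
$tcpParams = 'HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters'
New-ItemProperty -Path $tcpParams -Name 'TcpTimedWaitDelay' `
    -PropertyType DWord -Value 30 -Force | Out-Null

# Restart the server afterwards, for example:
# Restart-Computer
```

Neither setting stops the underlying leak; both only buy time between reboots until the patch arrives.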

I hope you are able to use this information to fix any recurring issues you’ve experienced in your environments.  I spent the last 3 or 4 nights rebooting SQL Servers after hours, but not tonight!

 

Brandon has worked in IT for nearly 20 years, and currently serves as a SQL Server DBA for a healthcare company in California. In his spare time, he runs for miles and helps maintain RealCajunRecipes.com. Brandon is a certified SQL Server administrator.

Posted in Active Directory, Security, SQL Server, Windows
3 comments on “SQL Server / Windows OS losing connection to Domain Controllers”
  1. Andy says:

    Any chance you would share your data collection process?
    Although a different scenario to you, I was wondering what data collection points and tools you used to gather the proof for MS.

    We have 2 servers running on ESXi that lock up after just over 24 hours, and a hard power off is required. One runs Exchange and one hosts all the CAD files, so they have different roles. We have a ton of other servers that run similar roles, but only these 2, for different clients, started to fail daily. Troubleshooting has been tough, especially as the clients just want the servers to (preferably) stay online or, if they crash, to be brought back up immediately rather than have the issue diagnosed.
    Occasionally we can ping to/from the affected server and can do a net use to the server, but as soon as we do a dir on the mapped drive, that DOS session will lock up.

    • Brandon Abshire says:

      Sorry for the delay, I was unaware of your comment.

      You can run this to see the number of ports in use:

      Get-NetTCPConnection | Group-Object -Property State, OwningProcess | Sort-Object Count
      netstat -anoq

      But ultimately, Microsoft had us use NotMyFault from Sysinternals to collect a kernel memory dump for their engineers to analyze.

      We later discovered that even if the server didn’t use iSCSI, it was still opening and closing ports and running into this issue.

  2. Ryan says:

    I had this same kind of issue on my WS2012R2 Hyper-V Failover Cluster! Trying this work-around now, which was also recommended by Microsoft, thanks!
