SQL Server / Windows OS losing connection to Domain Controllers
SSPI handshake failed with error code 0x80090311, state 14 while establishing a connection with integrated security; the connection has been closed. Reason: AcceptSecurityContext failed. The Windows error code indicates the cause of failure. No authority could be contacted for authentication.
Login failed. The login is from an untrusted domain and cannot be used with Windows authentication
A handful of my SQL Servers began losing their connection to the domain controllers after recent Windows patches. The only resolution was a reboot of the SQL Server, which obviously incurred downtime. The issue hit two non-production VMs and also a Windows SQL Server cluster. Oddly, both nodes in the cluster were affected simultaneously, even though SQL wasn't running on the passive node. After some troubleshooting with Microsoft, we identified the issue and I wanted to share it here. A fix is pending, targeted for July.
The issue affects Windows Server 2012 systems using iSCSI storage and was introduced with KB4012216, a March security roll-up. Over time, the system's pool of ephemeral ports becomes exhausted. I won't spend too much time showing you how to isolate the specific data we collected for Microsoft. If you are experiencing this issue after recently applying patches, and you are running Server 2012 with iSCSI, that is probably proof enough.
UPDATE: We also observed this behavior on servers that were not actively using iSCSI; iSCSI was still enabled on them and was causing the problem. We also found corresponding Event ID 4227 entries in the System log:
TCP/IP failed to establish an outgoing connection because the selected local endpoint was recently used to connect to the same remote endpoint. This error typically occurs when outgoing connections are opened and closed at a high rate, causing all available local ports to be used and forcing TCP/IP to reuse a local port for an outgoing connection. To minimize the risk of data corruption, the TCP/IP standard requires a minimum time period to elapse between successive connections from a given local endpoint to a given remote endpoint.
You can view some details about ports in use with the following commands, the first being a PowerShell command:
Get-NetTCPConnection | Group-Object -Property State, OwningProcess | Sort-Object Count
netstat -anoq
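If you would rather collect the same kind of tally from saved `netstat -ano` output (for example, to compare snapshots taken a few hours apart), something like the following Python sketch could work. It assumes the usual column layout of `Proto, Local Address, Foreign Address, State, PID`; the function name `count_by_state_and_pid` and the sample text are made up for illustration and are not data from the affected servers.

```python
from collections import Counter


def count_by_state_and_pid(netstat_output: str) -> Counter:
    """Count TCP rows in `netstat -ano` style text, keyed by (state, pid)."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP rows have five columns: TCP <local> <remote> <state> <pid>.
        # Headers and UDP rows (which have no state column) are skipped.
        if len(fields) >= 5 and fields[0] == "TCP":
            state, pid = fields[3], fields[4]
            counts[(state, pid)] += 1
    return counts


# Illustrative sample only (hypothetical addresses and PIDs).
SAMPLE = """\
Active Connections

  Proto  Local Address          Foreign Address        State           PID
  TCP    0.0.0.0:1433           0.0.0.0:0              LISTENING       2212
  TCP    10.0.0.5:50112         10.0.0.9:3260          ESTABLISHED     4
  TCP    10.0.0.5:50113         10.0.0.9:3260          TIME_WAIT       0
  TCP    10.0.0.5:50114         10.0.0.9:3260          TIME_WAIT       0
  UDP    0.0.0.0:123            *:*                                    1020
"""

if __name__ == "__main__":
    tally = count_by_state_and_pid(SAMPLE)
    # Smallest groups first, mirroring the PowerShell sort by Count.
    for (state, pid), n in sorted(tally.items(), key=lambda kv: kv[1]):
        print(f"{n:5d}  {state:<12} pid={pid}")
```

On a real server you would feed it the text captured from `netstat -ano` instead of `SAMPLE`. A group whose count keeps climbing between snapshots is the one to look at.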
There is no permanent solution yet, but the following workarounds are available until a patch is released.
- The most obvious would be to uninstall the patches. We uninstalled all 3 roll-up patches that we applied, but Microsoft indicates that it is part of KB4012216.
- You can stop using iSCSI. Not a viable solution for most.
- Increase the number of available TCP ephemeral ports and reduce the TCP Time Wait Delay, which increases the time it takes for the issue to manifest. Run the following from an elevated command prompt, then restart the server:
netsh int ipv4 set dynamicport tcp start=1025 num=64500
reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v TcpTimedWaitDelay /t REG_DWORD /d 0x0000001E /f
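To make the numbers in those two commands concrete, here is a small Python sketch of the arithmetic. The default dynamic range of 49152 with 16384 ports is the standard Windows default; the helper names are hypothetical, and the "connections per second" figure is only a rough ceiling that assumes every closed connection sits in TIME_WAIT for the full delay. It does not model the port leak itself, which is why this workaround only buys time.

```python
def port_range(start: int, num: int) -> tuple[int, int]:
    """Return (first, last) port for a dynamic range given netsh's start/num."""
    last = start + num - 1
    if start < 1 or last > 65535:
        raise ValueError("range must fit within ports 1..65535")
    return start, last


def max_connects_per_second(num_ports: int, time_wait_seconds: int) -> float:
    """Rough ceiling on new outgoing connections per second to one remote
    endpoint, assuming each closed connection holds its port in TIME_WAIT."""
    return num_ports / time_wait_seconds


DEFAULT_START, DEFAULT_NUM = 49152, 16384  # standard Windows dynamic TCP range
NEW_START, NEW_NUM = 1025, 64500           # values from the netsh command
TIME_WAIT_SECONDS = int("0x0000001E", 16)  # /d value from the reg command

if __name__ == "__main__":
    print("default range:", port_range(DEFAULT_START, DEFAULT_NUM))
    print("new range:    ", port_range(NEW_START, NEW_NUM))
    print("TcpTimedWaitDelay (seconds):", TIME_WAIT_SECONDS)
    print("extra ports gained:", NEW_NUM - DEFAULT_NUM)
    print("rough connects/sec ceiling:",
          max_connects_per_second(NEW_NUM, TIME_WAIT_SECONDS))
```

In other words, the `netsh` command widens the pool from 16384 ports to 64500 (ports 1025 through 65524), and the registry value sets the TIME_WAIT delay to 30 seconds.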
I hope you are able to use this information to fix any recurring issues you've experienced in your environments. I spent the last 3 or 4 nights rebooting SQL Servers after hours, but not tonight!