SQL Backup Process Hanging and Stuck in Kill/Rollback - HyperBac
One of the most important daily tasks for a DBA is verifying that database backup jobs completed successfully. Using Database Mail to send an e-mail when a job fails is something everyone should be doing, but in some cases that is not enough. You can also query the msdb database to find the last time each database was backed up. If you have a large number of servers, consider using PowerShell's SMO capabilities to query a list of servers and compile the information into a database table; you can then query that table for backup dates older than a day or two, depending on your backup schedule. That's a topic for another article.
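Querying msdb directly is straightforward. Here is a minimal sketch that reports the most recent full backup per database; the two-day threshold is an assumption to adjust for your own schedule:

-- Most recent full backup per database, from msdb's backup history.
SELECT d.name AS database_name,
       MAX(b.backup_finish_date) AS last_full_backup
FROM master.sys.databases d
LEFT JOIN msdb.dbo.backupset b
       ON b.database_name = d.name
      AND b.type = 'D' -- 'D' = full backup
WHERE d.name <> 'tempdb' -- tempdb is never backed up
GROUP BY d.name
HAVING MAX(b.backup_finish_date) IS NULL
    OR MAX(b.backup_finish_date) < DATEADD(DAY, -2, GETDATE());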
Using a similar approach, I recently discovered a database that had gone unbacked up for two days over the weekend, and I had not received any e-mail about a failure. In SSMS, I pulled up the list of jobs in Object Explorer Details and noticed that the backup job had been stuck in an executing state since the previous Friday night. The backup file was not actively being written to, and since backups of a given database are serialized, any subsequent attempt to back up that database while the stuck process was still running would fail. Using SSMS, I stopped the executing job (a T-SQL alternative is shown below).
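If you prefer T-SQL over SSMS, msdb's sp_stop_job procedure does the same thing; the job name here is hypothetical:

-- Stop a running SQL Agent job by name (job name is hypothetical).
EXEC msdb.dbo.sp_stop_job @job_name = N'Nightly Full Backup';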
A quick query of the sys.sysprocesses view (SELECT * FROM sys.sysprocesses) showed that the backup process was still running, so I killed it. From the sysprocesses results, find the spid for the process and execute KILL spid, where spid is the actual ID shown. If you prefer SSMS, launch Activity Monitor, find the process in the list, and kill it from there.
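Filtering on the cmd column narrows the results to the backup itself; the spid of 52 in the KILL statement is hypothetical:

-- Find the stuck backup's spid, then kill it.
SELECT spid, cmd, status, waittime
FROM sys.sysprocesses
WHERE cmd LIKE 'BACKUP%';

KILL 52; -- hypothetical spid returned by the query above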
Once I killed the process, it went into a rollback state, which is expected. I'm not 100% sure what a backup transaction has to roll back, but it sure will try. Take note of the waittime column in sys.sysprocesses: the value may increase initially, but at some point it should peak and start decreasing. You can track the progress of the rollback by executing the following command:
KILL spid WITH STATUSONLY
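The output looks something like the following (the spid and figures here are illustrative):

SPID 52: transaction rollback in progress. Estimated rollback completion: 16%. Estimated time remaining: 300 seconds.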
The result stated that the rollback was 16% complete, so I decided to wait and see how long it would take. My waittime value continued to increase. I went about my business and checked back later; after many hours, the rollback was still at 16% and the waittime was STILL increasing. I knew something was wrong. My first quick-fix idea was to restart the SQL Server service, which actually would have fixed the problem, but since the system is in production, I wanted to find a better way. A restart also would not have helped me figure out why the problem was occurring in the first place.
Eventually, I realized that this server was using HyperBac, a really cool compression tool that intercepts a native backup and compresses it on the fly. I decided to stop the HyperBac service to see if that would help, but the service got stuck in the "Stopping" state. I checked the log file in the HyperBac folder under Program Files and saw a line stating that it was attempting to close out the very backup that had been stuck for three days.
I gave the service plenty of time to stop, but it never did. I eventually had to End Task the HyperBac process a few times. Once the process ended, the stuck SQL backup process disappeared.
I cannot say definitively whether HyperBac caused the SQL backup process to hang or the SQL process caused HyperBac to hang. Of the hundreds of backups that run every night, I have hit this issue three times, all on large databases (~70 GB) backed up through HyperBac to a network drive. Perhaps a network blip during the backup is to blame for triggering the issue. Whatever the cause, I opened a ticket with HyperBac and was told that version 4.2.x fixes this situation; they responded quickly with an answer, so thumbs up on customer support. I plan to upgrade to 4.2.x soon, but server reboots are required. If the issue happens again before then, it is easy enough to fix.