SQL Server: Block-Level Replication Failures Causing Suspect Database
Recently, I worked with a client who uses block-level replication software (think XOSoft/ArcServ) for their clustered SQL Server databases. During proactive testing one night, replication was broken and the failover server was brought up. For the first time in years, the procedure failed: the primary application database came up as SUSPECT. Examining Event Viewer revealed the following error:
Source: MSSQLServer  Category: 2  Type: Error  Event ID: 824
SQL Server detected a logical consistency-based I/O error: incorrect pageid (expected 1:9718172; actual 0:0). It occurred during a read of page (1:9718172) in database ID 3 at offset 0x00000f6b000000 in file 'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\myDatabase.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
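For anyone following along at home, the checks the 824 error points you toward look roughly like this. This is only a sketch; the database name is taken from the .mdf path in the error above, and everything else is standard T-SQL:

-- Confirm what state SQL Server thinks the database is in (SUSPECT, ONLINE, etc.)
SELECT name, state_desc
FROM sys.databases
WHERE name = 'myDatabase';

-- Run the full consistency check the error message asks for.
-- NO_INFOMSGS limits the output to actual errors.
DBCC CHECKDB ('myDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- Note: if the database really is SUSPECT, CHECKDB will not run until the
-- database is switched to EMERGENCY mode first:
-- ALTER DATABASE myDatabase SET EMERGENCY;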
We backed up the primary server's application database, restored the backup to the failover server, and the database came up with no problem. We restarted replication, waited for it to complete, stopped replication, brought up the SQL Server engine, and again the database was in SUSPECT mode. We then deleted the .MDF and .LDF files and replicated them from scratch. Once replication was complete, we brought up the SQL Server engine on the failover machine and the database came up just fine. We restarted replication, stopped it, brought up SQL Server, and again the database was in SUSPECT mode.
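The backup-and-restore step above was nothing exotic; it looked something like the following sketch. The backup path is an assumption for illustration, and only the database name comes from the error message:

-- On the primary: take a full backup of the application database.
BACKUP DATABASE myDatabase
TO DISK = 'D:\Backups\myDatabase.bak'
WITH INIT, CHECKSUM;

-- On the failover server: restore the backup over the suspect copy.
-- WITH REPLACE overwrites the existing (suspect) database files.
RESTORE DATABASE myDatabase
FROM DISK = 'D:\Backups\myDatabase.bak'
WITH REPLACE, RECOVERY;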
We rinsed and repeated these steps with a few variations, including uninstalling all recent updates and the HIDS software. No luck.
Finally, the client's disk administrator detached the failover cluster's LUN, formatted it, reattached it, re-replicated the database, and boom! That seemed to do the trick. Several stops, starts, and replication breaks later, the database still comes up healthy. Because we had been able to get the database to come up healthy repeatedly (after the restore and after the initial replication), I, for one, am surprised this was the solution, but the issue made for a stressful week, and I'm glad a fix was finally found.