3.5 hours down to 6.5 minutes…
Recently I went through a project to get Zerto Replication up and running for an Emergency Dispatch customer who was moving away from RecoverPoint and SRM in an effort to simplify and consolidate their DR runbooks.
As part of this project, we created multiple VPGs to match up with their software solutions, protecting around 5TB of total VM space. The smaller VPGs consisted of small groupings of VMs, most of which ranged between 250 and 500GB of provisioned storage. The fifth VPG was a large one, consisting of a heavily utilized production SQL Server and a reporting SQL Server, with around 3.2TB of provisioned storage.
Environment
The customer's VMware environment consists of 4 ESXi hosts, a VNX 5300 array, and a pair of Nexus 9Ks at each site. Connecting the sites together is a private 300Mb circuit that handles the replication traffic as well as VoIP traffic.
The makeup of the VPGs was pretty standard, with recovery volumes, failover IPs, etc. all configured.
Failover
After replication fully synchronized between the primary and DR data centers, we performed a Live Failover of each of the VPGs. The four smaller VPGs each failed over successfully within minutes, with the IPs being reconfigured as expected.
When we finally went to fail over the large VPG, the failover itself went smoothly and the VMs were brought online on the DR side, but we then suffered through almost 3 more hours where the Zerto status for the VPG was 'Promoting'. During that time the VMs were accessible but horribly slow. So slow that users were told to go back to paper for data capture.
Troubleshooting
After the failover was complete, we commenced troubleshooting why the promotion took so long to complete for this single VPG.
After some emails back and forth with a former co-worker who now works for Zerto (thanks Wes!), he pointed me to a Zerto best practices document for protecting Microsoft SQL Server. Going through this document had me go back and look at the configuration of the SQL Servers that were being protected. The Zerto document highlighted using temp data disks specifically for the Windows page file and the SQL TempDB files.
Looking into the configuration of the SQL Servers, I noticed that the primary SQL Server VM was configured with a separate vmdk for the page file and an additional vmdk provisioned for the SQL TempDB files, and confirmed in SQL that the TempDB files were indeed located on that vmdk.
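If you want to verify the same thing in your own environment, a quick query against sys.master_files shows where the TempDB files physically live. Below is a minimal sketch using Python and pyodbc; the server name, driver version, and authentication are placeholders, not the customer's actual configuration.

```python
# Check where the TempDB files actually live on disk.
# The sys.master_files query is standard T-SQL; the connection
# details below are placeholders for your own environment.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=sql01;"          # hypothetical server name
    "Trusted_Connection=yes;"
)

rows = conn.cursor().execute(
    """
    SELECT name, physical_name
    FROM sys.master_files
    WHERE database_id = DB_ID('tempdb');
    """
).fetchall()

for name, path in rows:
    print(f"{name}: {path}")

# If every physical_name sits on the dedicated TempDB drive,
# the files are on the separate vmdk as intended.
```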
In addition, we looked at the VPG journal history, which was set to the default 24 hours. That meant the VPG was storing a large amount of history in the journal, not only for the SQL Server data files but also for the TempDB files. Add in the reporting SQL Server and the amount of replicated data grew even further.
After going through the Zerto SQL best practices document, it started to make sense why the VPG promotion took so long. We had a very large amount of data being retained in the journal history: every transaction that hit the TempDB files had to be placed in the journal, and upon failover, when TempDB is recreated, all of that journal history had to be pushed back into the VM's vmdk to bring it back into synchronization.
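To put rough numbers on why this hurts: with the default 24 hours of journal history, even a modest TempDB change rate adds up fast. The figures below are illustrative assumptions, not measurements from this environment.

```python
# Back-of-the-envelope journal math. Both inputs are assumed,
# illustrative values, not measurements from this environment.
tempdb_change_rate_gb_per_hour = 20   # assumed churn on a busy SQL Server
journal_history_hours = 24            # the Zerto default this VPG used

journal_tempdb_gb = tempdb_change_rate_gb_per_hour * journal_history_hours
print(f"TempDB writes held in the journal: ~{journal_tempdb_gb} GB")

# Roughly 480 GB of journaled TempDB churn that promotion has to push
# back into the recovery vmdk, even though SQL Server throws TempDB
# away and recreates it on every restart anyway.
```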
Solution
We switched the page file and SQL TempDB vmdks in the Zerto VPG to be treated as temp data disks. Since both the page file and SQL TempDB contents are recreated upon reboot, using the temp data disk option allows Zerto to stop tracking any data for those disks in the journal after the initial synchronization.
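For reference, the same change can in principle be scripted against the ZVM REST API rather than clicked through the UI. The sketch below is a rough outline only: the endpoint paths, the placeholder identifiers, and the IsSwap volume flag are my assumptions modeled on Zerto's v1 vpgSettings API and may differ by ZVR version. We made the actual change through the Zerto UI by editing the VPG's volume settings.

```python
# Hedged sketch of flagging volumes as temp data disks via the ZVM
# REST API. Endpoint paths, identifiers, and the "IsSwap" flag are
# assumptions and may differ by ZVR version; verify against the API
# docs for your release before using anything like this.
import requests

ZVM = "https://zvm01:9669/v1"            # hypothetical ZVM address
HDRS = {"x-zerto-session": "<token>"}    # session token from the session endpoint

# 1. Open an editable settings object for the existing VPG.
settings_id = requests.post(
    f"{ZVM}/vpgSettings",
    json={"VpgIdentifier": "<vpg-id>"},  # the large SQL VPG's identifier
    headers=HDRS, verify=False,
).json()

# 2. Fetch the VM's volume settings and flag the page file and TempDB
#    vmdks as temp data disks (assumed to be the IsSwap boolean).
vm_url = f"{ZVM}/vpgSettings/{settings_id}/vms/<vm-id>"
vm = requests.get(vm_url, headers=HDRS, verify=False).json()
for vol in vm["Volumes"]:
    # Assumed device IDs for the page file and TempDB vmdks.
    if vol["VolumeIdentifier"] in ("scsi:0:1", "scsi:0:2"):
        vol["IsSwap"] = True
requests.put(vm_url, json=vm, headers=HDRS, verify=False)

# 3. Commit the edited settings back to the VPG.
requests.post(f"{ZVM}/vpgSettings/{settings_id}/commit",
              headers=HDRS, verify=False)
```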
After ensuring that the VPG was fully synchronized, we performed a failover from the DR facility back to the primary facility. Each of the four small VPGs was failed back over, and then we prepared for the final large VPG. Due to the nature of this customer's business, it's critical to have fairly precise estimates on downtime windows, and having just been through a very lengthy initial failover, we couldn't guarantee how long this one would take.
Once we kicked off the final VPG failover, the process moved smoothly up to the Promoting stage, and then moved right through it. This failover attempt went from 3.5 hours down to 6.5 minutes in total. 6.5 minutes!
Needless to say, the customer's expectations of the Zerto solution were met, and the solution proved its worth with the ease of the failover.
Takeaways
This was a great scenario where going back to basics and best practices made all the difference in the world. Vendors put out best practices documentation for a reason, and in this case I was happy to pick up some extra knowledge on replicating SQL Server, as it wasn't something I had historically done within Zerto. Great results from this project: moving away from RecoverPoint and SRM over to Zerto, and getting the customer a DR replication solution that now allows them to do DR testing on a regular schedule with full confidence.