Nutanix Recovery Plans: Orchestrating DR Failover

Overview
📖 Disaster Recovery in 2025 Series - Part 5. This post is part of my comprehensive disaster recovery series. New to the series? Start with the Complete Guide Overview to see what's coming, or catch up with Part 1 - Why DR Matters, Part 2 - Modern Disaster Recovery, Part 3 - Nutanix DR Overview, and Part 4 - Protection Policies.
In my previous post on Protection Policies, I explored how Nutanix ensures your data is replicated and protected at your disaster recovery site through automated, policy-driven snapshot and replication mechanisms. But having your data safely replicated is only half the battle. When disaster strikes, you need more than just data: you need a plan to bring your applications back online in the right order, with the right network configuration, and in a way that minimizes downtime.
This is where Nutanix Recovery Plans come in. Think of them as your automated DR runbook, orchestrating the complex choreography of failover to ensure your business-critical applications come back online smoothly and in the correct sequence.
What is a Recovery Plan?
A Recovery Plan in Nutanix is a comprehensive disaster recovery orchestration framework that defines exactly how your infrastructure should behave during a failover event. While Protection Policies handle the "what" and "when" of data replication, Recovery Plans handle the "how" of bringing everything back online.
At its core, a Recovery Plan specifies:
- Which VMs or services should be recovered
- In what order they should be powered on
- How network connectivity should be mapped from primary to recovery site
- How to test your DR readiness without impacting production
Recovery Plans transform what could be a chaotic, manual failover process into a predictable, automated workflow that you can test and validate before you ever need it in anger.
The Foundation: Protection Policies and Recovery Plans Working Together
Before you can create a Recovery Plan, you must have a Protection Policy in place. This is a fundamental prerequisite in the Nutanix DR architecture, and for good reason: there's no point in having a recovery plan if the data doesn't exist at the recovery site.
Here's how they work together:
- Protection Policies continuously replicate VM snapshots from your primary site to your recovery location
- Recovery Plans reference these protected VMs and orchestrate how they're restored and powered on
- During failover, the Recovery Plan uses the most recent snapshot from the Protection Policy to bring VMs online at the recovery site
This separation of duties is elegant and practical. Protection Policies focus on data durability and availability, while Recovery Plans focus on application recovery orchestration.
Power-On Sequencing: The Heart of Recovery Plans
One of the most important features of a Recovery Plan is power-on sequencing. Not all applications are created equal, and many have strict dependencies. Your web servers can't authenticate users if Active Directory isn't running. Your application servers can't function if the database is still booting.
Recovery Plans organize VMs into stages numbered from 0 to N, where Stage 0 always starts first.
Recommended Staging Strategy
Stage 0: Foundation Services
Your infrastructure services should always be in Stage 0. This typically includes:
- Domain Controllers (Active Directory)
- DNS Servers
- DHCP Servers
- Network infrastructure VMs
These are the services that everything else depends on. Without DNS and AD, most modern applications simply won't function.
Stage 1: Core Application Infrastructure
Services that depend on Stage 0 but are themselves dependencies for other applications:
- Database servers
- Authentication services
- Monitoring and logging systems
- Certificate authorities
Stage 2-N: Application Tiers and End-User Services
Organize these based on your application dependencies:
- Web servers and load balancers
- Application servers
- File servers
- Desktop VDI pools
- Less critical workloads
How Staging Works During Failover
During a failover event, Nutanix follows this sequence:
- Powers on all VMs in Stage 0
- Waits for all Stage 0 VMs to complete boot
- Proceeds to Stage 1, powers on those VMs
- Waits for Stage 1 completion
- Continues through each stage in order
This staged approach ensures that services are available when their dependent applications need them. There's no guesswork, no manual intervention required—just a predictable, repeatable process.
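To make the sequencing concrete, here is a minimal Python sketch of the ordering logic described above. It is a simulation only, not Nutanix code: the VM names, the `VM_STAGES` dictionary, and both function names are assumptions made up for illustration, and it models "wait for completion" as a simple log entry rather than a real boot check.

```python
from collections import defaultdict

# Hypothetical inventory: VM name -> stage number (names are illustrative only).
VM_STAGES = {
    "dc01": 0,
    "dns01": 0,
    "sql01": 1,
    "web01": 2,
    "app01": 2,
}


def build_power_on_order(vm_stages):
    """Group VMs by stage and return the stages in ascending order."""
    stages = defaultdict(list)
    for vm, stage in vm_stages.items():
        stages[stage].append(vm)
    # Sort stage numbers so Stage 0 always runs first; sort names for stable output.
    return [(stage, sorted(stages[stage])) for stage in sorted(stages)]


def simulate_failover(vm_stages):
    """Return the ordered list of actions a staged failover would take."""
    actions = []
    for stage, vms in build_power_on_order(vm_stages):
        for vm in vms:
            actions.append(f"stage {stage}: power on {vm}")
        # The next stage does not start until this one is reported complete.
        actions.append(f"stage {stage}: wait for completion")
    return actions
```

Calling `simulate_failover(VM_STAGES)` lists the Stage 0 power-on actions first, then a wait, then Stage 1, and so on. Writing your staging down in a structure like this before building the plan in Prism Central is one way to review dependencies with application owners.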
Network Mapping: Bridging Two Worlds
One of the trickiest aspects of disaster recovery is network configuration. Your primary site and recovery site likely have different network architectures, VLANs, and IP schemes. Recovery Plans solve this problem with network mapping.
Production Network Mapping
When you create a Recovery Plan, you define how networks at your primary site map to networks at your recovery site. For example:
- Primary VLAN 100 (Production Web) → Recovery VLAN 200 (DR Web)
- Primary VLAN 101 (Production DB) → Recovery VLAN 201 (DR DB)
- Primary VLAN 102 (Management) → Recovery VLAN 202 (DR Management)
During failover, VMs automatically connect to the appropriate recovery site networks. This mapping can be straightforward (same network names, different VLANs) or complex (completely different network topology at the DR site).
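As a rough illustration, the example mapping above can be written as a lookup table. This is a planning aid sketched in Python, not how Nutanix stores the mapping internally; `PRODUCTION_NETWORK_MAP` and `map_vm_networks` are hypothetical names.

```python
# Mapping taken from the example above: primary VLAN ID -> recovery VLAN ID.
PRODUCTION_NETWORK_MAP = {100: 200, 101: 201, 102: 202}


def map_vm_networks(vm_vlans, network_map):
    """Translate a VM's primary-site VLANs to recovery-site VLANs.

    Raises KeyError naming any VLAN without a mapping, so gaps surface
    during planning rather than during failover.
    """
    missing = [vlan for vlan in vm_vlans if vlan not in network_map]
    if missing:
        raise KeyError(f"no recovery mapping for VLAN(s): {missing}")
    return [network_map[vlan] for vlan in vm_vlans]
```

A VM attached to VLANs 100 and 102 would land on 200 and 202, while a VM on an unmapped VLAN raises an error. Running a check like this against your VM inventory can reveal networks that were never added to the plan.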
Test Network Mapping: The Isolated Sandbox
Here's where Recovery Plans truly shine: test failover capability. You can perform a complete test of your DR plan without impacting production workloads or production network spaces.
When you perform a test failover, you specify test networks that are completely isolated from production. This is critical: your test networks must be Layer 2 only networks with no gateway or routing configured. This isolation ensures:
- Zero impact to production: Test VMs cannot communicate with production systems
- No IP conflicts: Even if test VMs retain production IP addresses, they cannot conflict with live systems
- Complete isolation: Without Layer 3 routing, test failovers are confined to their own broadcast domain
- Safe testing: You can power on entire application stacks without risk to operations
When configuring test networks, ensure you:
- Create dedicated VLANs specifically for DR testing (e.g., VLAN 900-999 for test failover)
- Configure as Layer 2 only with no default gateway, no routing, and no connection to production networks
- Verify isolation before first use by attempting to ping production resources from a test VM; every ping should fail
- Document the test network mapping in your DR runbooks for consistency
With properly isolated test networks, you can:
- Validate that VMs power on correctly
- Verify boot sequencing works as designed
- Test application functionality in the recovery environment
- Identify configuration issues before they matter
- Perform full DR drills without change control windows
All of this happens in a completely isolated network segment, meaning zero risk to your production environment. Once testing is complete, you simply clean up the test environment, and your production systems remain untouched.
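The isolation checklist above lends itself to a simple pre-flight check. The sketch below assumes you keep your own record of each test network; the `DrTestNetwork` class, its fields, and `isolation_problems` are hypothetical and do not correspond to any Nutanix API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DrTestNetwork:
    # Hypothetical record of a DR test network; field names are illustrative.
    name: str
    vlan_id: int
    gateway: Optional[str] = None  # None means no default gateway is configured
    routed: bool = False           # True if any Layer 3 routing is attached


def isolation_problems(network, production_vlans):
    """Return the reasons this network is not safe for test failover."""
    problems = []
    if network.gateway is not None:
        problems.append("has a default gateway")
    if network.routed:
        problems.append("has routing configured")
    if network.vlan_id in production_vlans:
        problems.append("reuses a production VLAN ID")
    return problems
```

An empty list means the record passes all three checks. This only validates what is documented, so it complements rather than replaces the ping test from an actual test VM.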
Network Reconfiguration Limitations
While Recovery Plans handle network mapping elegantly, there are important limitations to understand:
IP Address Changes During Failover
Recovery Plans can change the network VLANs that VMs connect to, but they do not automatically reconfigure IP addresses inside the guest operating system. This means:
- VMs retain their original IP addresses when they fail over
- If your recovery site uses a different IP scheme, manual reconfiguration or scripting is required
- DHCP-based VMs will receive new addresses automatically if DHCP is available at the recovery site
- Static IP VMs require either IP reuse at the recovery site or manual/scripted reconfiguration
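The four cases above reduce to a small decision function. The following Python sketch uses the standard `ipaddress` module; the function name, its return labels, and the sample addresses are assumptions for illustration only.

```python
import ipaddress


def ip_action_after_failover(vm_ip, uses_dhcp, recovery_subnet, dhcp_at_recovery=True):
    """Classify what has to happen to a VM's IP address after failover."""
    if uses_dhcp:
        # DHCP clients pick up a new lease only if a DHCP server is reachable.
        return "dhcp-renew" if dhcp_at_recovery else "needs-dhcp-server"
    address = ipaddress.ip_address(vm_ip)
    network = ipaddress.ip_network(recovery_subnet)
    if address in network:
        # The static IP is valid on the recovery network, so it can be reused.
        return "reuse-static"
    # The static IP does not fit the recovery subnet: manual or scripted change.
    return "reconfigure-static"
```

For example, a static VM at 10.0.100.15 failing over to a recovery subnet of 10.1.200.0/24 is classified as `reconfigure-static`. Running this across an inventory gives a quick count of how many VMs will need scripted or manual attention.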
Static IP Mapping Considerations
For environments where recovery site networks already have IP addresses in use that conflict with production VMs:
- You cannot have the same IP address active on both sites simultaneously
- Test failovers require isolated networks to avoid IP conflicts
- Production failovers assume the primary site is down and the IP space is available
- In-guest scripting (covered in Part 8) can automate IP reconfiguration during failover
Failback Network Complexity
Failback operations face similar network constraints. When returning to the primary site, VMs need their original network configuration restored, which may require:
- Coordination to ensure IP addresses aren't duplicated during transition
- Temporary network isolation during the failback process
- Scripted reconfiguration to return to original network settings
Selecting VMs and Categories for Recovery
Recovery Plans offer flexible options for determining which VMs to protect and recover:
Individual VM Selection
You can explicitly add specific VMs to a Recovery Plan. This approach works well for:
- Small, well-defined application stacks
- High-value systems that need dedicated recovery plans
- Environments with minimal change
Category-Based Selection
For more dynamic environments, you can add VMs based on Nutanix Categories (essentially tags). For example:
- Category: App=ERP includes all ERP-related VMs
- Category: Tier=Web includes all web tier VMs
- Category: Criticality=Tier1 includes all business-critical systems
The category-based approach is more scalable and maintainable. When you add a new web server to production and tag it with Tier=Web, it's automatically included in the appropriate Recovery Plan, with no manual updates required.
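Here is a small Python sketch of why category-based selection scales: membership is computed from tags at the time you ask, not stored as a fixed list. The `INVENTORY` data and `vms_in_category` function are hypothetical, and real category lookups happen inside Prism Central rather than in a script like this.

```python
# Hypothetical inventory: VM name -> category key/value pairs.
INVENTORY = {
    "erp-db01": {"App": "ERP", "Criticality": "Tier1"},
    "erp-web01": {"App": "ERP", "Tier": "Web"},
    "intranet01": {"Tier": "Web"},
}


def vms_in_category(inventory, key, value):
    """Return the sorted names of VMs tagged with key=value."""
    return sorted(
        name
        for name, categories in inventory.items()
        if categories.get(key) == value
    )
```

Adding a new entry tagged `Tier=Web` to the inventory changes the result of `vms_in_category(INVENTORY, "Tier", "Web")` without touching the selection logic, which mirrors how a newly tagged VM joins a category-based plan.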
Critical Limitation: Mixing Replication Schedules
⚠️ Important: When selecting VMs for a Recovery Plan, you must be careful about mixing different replication schedule types. Do not include entities protected with synchronous replication in the same recovery plan as entities protected with asynchronous or near-sync replication. Doing so will cause the recovery to fail.
The rules are:
- ✅ Allowed: Asynchronous + Near-Sync in the same Recovery Plan
- ❌ Not Allowed: Synchronous + Asynchronous in the same Recovery Plan
- ❌ Not Allowed: Synchronous + Near-Sync in the same Recovery Plan
- ❌ Not Allowed: Synchronous + Asynchronous + Near-Sync in the same Recovery Plan
Why This Matters: Synchronous replication operates fundamentally differently from async and near-sync. Synchronous replication maintains continuous consistency between sites, while async and near-sync work with point-in-time snapshots. These incompatible recovery mechanisms cannot be orchestrated together in a single recovery operation.
Best Practice: Create separate Recovery Plans for synchronous-replicated workloads and async/near-sync workloads. If your Metro Availability (synchronous) workloads need to fail over together with async workloads, you'll need to execute multiple Recovery Plans in sequence.
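The mixing rule is easy to check before you build a plan. This Python sketch encodes only the rule stated above; the `"sync"`, `"async"`, and `"nearsync"` labels and the function name are my own shorthand, not Nutanix identifiers.

```python
def replication_mix_error(plan_entities):
    """Return an error string if sync is mixed with async/near-sync, else None.

    plan_entities maps an entity name to one of the strings
    "sync", "async", or "nearsync" (labels chosen for this sketch).
    """
    allowed = {"sync", "async", "nearsync"}
    unknown = sorted(set(plan_entities.values()) - allowed)
    if unknown:
        raise ValueError(f"unknown replication type(s): {unknown}")
    sync_vms = sorted(n for n, t in plan_entities.items() if t == "sync")
    other_vms = sorted(n for n, t in plan_entities.items() if t != "sync")
    if sync_vms and other_vms:
        return (
            f"synchronous entities {sync_vms} cannot share a plan with "
            f"async/near-sync entities {other_vms}"
        )
    return None
```

A plan containing only async and near-sync entities returns `None`, as does a plan containing only synchronous entities. Any plan that combines synchronous entities with either of the other types returns a message naming the offenders.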
Creating and Managing Recovery Plans
The process of creating a Recovery Plan follows a logical workflow:
Prerequisites
- Availability Zones configured between primary and recovery sites
- Protection Policy created and actively replicating data
- Network information for both sites documented
- Application dependencies mapped and understood
Configuration Steps
- Define the recovery location (on-premises secondary datacenter or cloud)
- Select VMs or categories to include in the plan
- Organize VMs into stages based on dependencies
- Map production networks to recovery site networks
- Configure test networks for isolated failover testing
- Validate the configuration to ensure all dependencies are met
Ongoing Management
Recovery Plans aren't "set it and forget it" artifacts. They require regular attention:
- Validation: Periodically review that all protected VMs are still included
- Testing: Perform test failovers quarterly (at minimum) to verify functionality
- Updates: Adjust staging and network mapping as your infrastructure evolves
- Documentation: Keep runbooks updated with Recovery Plan changes
Executing Recovery Plans
When it's time to use your Recovery Plan—whether for testing, planned migration, or an actual disaster—the execution process follows the orchestration you've defined. Recovery Plans can be executed in different modes depending on your scenario, and we'll dive deep into managing planned versus unplanned failovers in Part 7 of this series.
For now, understand that Recovery Plans execute the power-on sequence you've defined, apply the network mappings you've configured, and bring your applications online at the recovery site in a controlled, predictable manner.
Failover Execution Modes: Automatic vs Manual
Recovery Plans support two execution modes, each suited to different DR architectures and business requirements:
Manual Execution Mode
Manual mode requires human intervention to initiate failover operations. This is the default and most common configuration.
- When to use: Most Async and Near-Sync replication scenarios
- How it works: An administrator must explicitly trigger the failover operation through Prism Central
- Benefits: Provides control and verification before critical operations execute
- Use cases: Planned maintenance migrations, controlled disaster response, environments where verification is required before failover
Automatic Execution Mode
Automatic mode enables Recovery Plans to execute without human intervention when specific conditions are met. This requires Synchronous Replication with an external Witness.
- When to use: Synchronous replication with Metro Availability configured
- How it works: When the Witness detects primary site failure, failover initiates automatically
- Requirements: Synchronous replication between sites, an external Witness cluster (third location or cloud-based), and network connectivity from the Witness to both sites
- Benefits: Minimizes RTO by eliminating human response time during disasters
- Use cases: Mission-critical workloads requiring near-zero RTO, 24/7 operations without on-call staff, regulatory requirements for automated failover
The choice between automatic and manual execution fundamentally impacts your Recovery Time Objective (RTO). Automatic execution can reduce RTO from hours (time to detect, respond, and execute) to minutes (detection and automated execution only), but requires the infrastructure investment of Synchronous replication and a Witness.
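If you are deciding whether automatic mode is even an option for a workload, the requirements listed above can be expressed as a short checklist function. This is a sketch under the assumptions stated in this post; the function and parameter names are hypothetical.

```python
def automatic_failover_blockers(
    replication, has_witness, witness_reaches_primary, witness_reaches_recovery
):
    """Return the unmet requirements for automatic execution mode."""
    blockers = []
    if replication != "sync":
        blockers.append("replication is not synchronous")
    if not has_witness:
        blockers.append("no external Witness")
    elif not (witness_reaches_primary and witness_reaches_recovery):
        # Connectivity only matters once a Witness actually exists.
        blockers.append("Witness cannot reach both sites")
    return blockers
```

An empty result means the stated prerequisites are met. Anything else means the workload stays in manual mode until the listed gaps are closed.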
In-Guest Script Execution
Beyond power-on sequencing and network mapping, Recovery Plans can execute custom scripts inside guest VMs during failover operations. This capability enables advanced automation for application-specific configuration.
Common Script Use Cases
- Network reconfiguration (IP address changes, DNS updates, gateway modifications)
- Application service restart or reinitialization
- Database recovery procedures
- Load balancer registration/deregistration
- Monitoring system notifications
Script Execution Timing
Scripts can be executed at different stages of the recovery process:
- Pre-failover: Run before VMs power on (rare, usually for infrastructure preparation)
- Post-failover: Run after VMs boot (most common, for application configuration)
- Per-stage: Execute scripts at specific power-on stages for coordinated configuration
Limitations and Considerations
- Scripts require guest OS credentials stored in Prism Central
- Script execution adds time to your overall Recovery Time Objective
- Failed scripts can block recovery operations or leave applications misconfigured
- Testing script execution is critical during test failover operations
We'll explore in-guest scripting in much greater detail in Part 8, including practical examples for DNS management, IP reconfiguration, and application-specific automation. For now, understand that Recovery Plans provide the framework for these advanced automation scenarios, extending beyond simple VM power-on to complete application recovery orchestration.
Best Practices for Recovery Plans
Based on real-world implementations, here are key best practices:
- Start Simple - Don't try to build the perfect Recovery Plan on day one. Start with your most critical application, test it thoroughly, then expand.
- Test Regularly - A Recovery Plan that's never tested is a plan that will fail when you need it. Schedule quarterly test failovers at minimum.
- Use Categories for Scale - Individual VM selection works for 10 VMs. For 100 or 1,000 VMs, categories provide the scalability you need.
- Document Dependencies - Your Recovery Plan is only as good as your understanding of application dependencies. Invest time in dependency mapping.
- Align with Business Continuity Plans - Your Recovery Plans should directly support your business continuity objectives. Map stages to Recovery Time Objectives (RTOs).
- Monitor and Alert - Enable monitoring for Protection Policy replication status. A Recovery Plan is useless if recent data isn't available at the recovery site.
- Plan for Network Differences - Recovery sites often have different network architectures. Design your network mapping strategy before you need it.
The Complete DR Picture
Nutanix Recovery Plans are the operational complement to Protection Policies. Where Protection Policies provide the data foundation, Recovery Plans provide the orchestration layer that turns replicated data into running applications.
Together, they form a comprehensive disaster recovery solution:
- Protection Policies ensure data is where it needs to be
- Recovery Plans ensure applications come online in the right order, on the right networks, at the right time
The beauty of the Nutanix approach is that it takes what used to require complex scripting, manual runbooks, and crossed fingers, and transforms it into a tested, validated, automated process.
What's Next in the DR Series
In this post, we've explored how Recovery Plans orchestrate the failover process. In the next installment, we'll dive into testing and validation strategies and how to build confidence in your DR capabilities without risking production systems.
Whether you're protecting a handful of critical applications or orchestrating the recovery of an entire datacenter, Nutanix Recovery Plans provide the automation and reliability that modern disaster recovery demands.
Your data is protected. Your recovery plan is automated. Your business continuity is secured.