Nutanix Recovery Plans: Orchestrating DR Failover

Oct 4, 2025 · 13 min read · Tags: Disaster Recovery, Recovery Plan, Data Protection, Business Continuity

Overview

📖 Disaster Recovery in 2025 Series - Part 5

This post is part of my comprehensive disaster recovery series. New to the series? Start with the Complete Guide Overview to see what's coming, or catch up with Part 1 - Why DR Matters, Part 2 - Modern Disaster Recovery, Part 3 - Nutanix DR Overview, and Part 4 - Protection Policies.

In my previous post on Protection Policies, I explored how Nutanix ensures your data is replicated and protected at your disaster recovery site through automated, policy-driven snapshot and replication mechanisms. But having your data safely replicated is only half the battle. When disaster strikes, you need more than just data. You need a plan to bring your applications back online in the right order, with the right network configuration, and in a way that minimizes downtime.

This is where Nutanix Recovery Plans come in. Think of them as your automated DR runbook, orchestrating the complex choreography of failover to ensure your business-critical applications come back online smoothly and in the correct sequence.

What is a Recovery Plan?

A Recovery Plan in Nutanix is a comprehensive disaster recovery orchestration framework that defines exactly how your infrastructure should behave during a failover event. While Protection Policies handle the "what" and "when" of data replication, Recovery Plans handle the "how" of bringing everything back online.

At its core, a Recovery Plan specifies:

Which VMs or services should be recovered
In what order they should be powered on
How network connectivity should be mapped from primary to recovery site
How to test your DR readiness without impacting production

Recovery Plans transform what could be a chaotic, manual failover process into a predictable, automated workflow that you can test and validate before you ever need it in anger.

The Foundation: Protection Policies and Recovery Plans Working Together

Before you can create a Recovery Plan, you must have a Protection Policy in place. This is a fundamental prerequisite in the Nutanix DR architecture, and for good reason: there's no point in having a recovery plan if the data doesn't exist at the recovery site.

Here's how they work together:

Protection Policies continuously replicate VM snapshots from your primary site to your recovery location
Recovery Plans reference these protected VMs and orchestrate how they're restored and powered on
During failover, the Recovery Plan uses the most recent snapshot from the Protection Policy to bring VMs online at the recovery site

This separation of duties is elegant and practical. Protection Policies focus on data durability and availability, while Recovery Plans focus on application recovery orchestration.

Power-On Sequencing: The Heart of Recovery Plans

One of the most important features of a Recovery Plan is power-on sequencing. Not all applications are created equal, and many have strict dependencies. Your web servers can't authenticate users if Active Directory isn't running. Your application servers can't function if the database is still booting.

Recovery Plans organize VMs into stages numbered from 0 to N, where Stage 0 always starts first.

Recommended Staging Strategy

Stage 0: Foundation Services

Your infrastructure services should always be in Stage 0. This typically includes:

Domain Controllers (Active Directory)
DNS Servers
DHCP Servers
Network infrastructure VMs

These are the services that everything else depends on. Without DNS and AD, most modern applications simply won't function.

Stage 1: Core Application Infrastructure

Services that depend on Stage 0 but are themselves dependencies for other applications:

Database servers
Authentication services
Monitoring and logging systems
Certificate authorities

Stage 2-N: Application Tiers and End-User Services

Organize these based on your application dependencies:

Web servers and load balancers
Application servers
File servers
Desktop VDI pools
Less critical workloads

How Staging Works During Failover

During a failover event, Nutanix follows this sequence:

Powers on all VMs in Stage 0
Waits for all Stage 0 VMs to complete boot
Proceeds to Stage 1, powers on those VMs
Waits for Stage 1 completion
Continues through each stage in order

This staged approach ensures that services are available when their dependent applications need them. There's no guesswork, no manual intervention required—just a predictable, repeatable process.
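To make the sequence concrete, here is a minimal Python sketch of staged power-on. It is a simulation, not Nutanix code: the VM names, the `power_on` callback, and the `wait_until_ready` callback are all hypothetical stand-ins, and how "ready" is decided for a stage is left to the caller.

```python
from collections import defaultdict


def build_power_on_order(vm_stages):
    """Group VMs by stage number and return the stages in ascending order."""
    stages = defaultdict(list)
    for vm, stage in vm_stages.items():
        stages[stage].append(vm)
    return [(stage, sorted(stages[stage])) for stage in sorted(stages)]


def run_failover(vm_stages, power_on, wait_until_ready):
    """Power on every VM in a stage, then wait before starting the next stage."""
    for stage, vms in build_power_on_order(vm_stages):
        for vm in vms:
            power_on(vm)
        wait_until_ready(stage, vms)


# Hypothetical VM names and stage assignments, for illustration only.
vm_stages = {"dc01": 0, "dns01": 0, "sql01": 1, "web01": 2, "app01": 2}

# Record the order of events instead of touching real infrastructure.
log = []
run_failover(
    vm_stages,
    power_on=lambda vm: log.append(f"on:{vm}"),
    wait_until_ready=lambda stage, vms: log.append(f"ready:stage{stage}"),
)
```

The point of the sketch is the shape of the loop: nothing in Stage 1 is touched until the wait for Stage 0 returns.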

Network Mapping: Bridging Two Worlds

One of the trickiest aspects of disaster recovery is network configuration. Your primary site and recovery site likely have different network architectures, VLANs, and IP schemes. Recovery Plans solve this problem with network mapping.

Production Network Mapping

When you create a Recovery Plan, you define how networks at your primary site map to networks at your recovery site. For example:

Primary VLAN 100 (Production Web) → Recovery VLAN 200 (DR Web)
Primary VLAN 101 (Production DB) → Recovery VLAN 201 (DR DB)
Primary VLAN 102 (Management) → Recovery VLAN 202 (DR Management)

During failover, VMs automatically connect to the appropriate recovery site networks. This mapping can be straightforward (same network names, different VLANs) or complex (completely different network topology at the DR site).
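The mapping above is essentially a lookup table. The sketch below models it as a plain dictionary and flags any VM whose primary VLAN has no recovery counterpart, which is the kind of gap worth catching before a failover. The VM names and the `map_vm_networks` helper are hypothetical, not part of any Nutanix tooling.

```python
# Primary VLAN ID -> recovery VLAN ID, taken from the example mapping above.
NETWORK_MAP = {100: 200, 101: 201, 102: 202}


def map_vm_networks(vm_vlans, network_map):
    """Return the recovery VLAN per VM, plus the VMs whose VLAN is unmapped."""
    mapped, unmapped = {}, []
    for vm, vlan in vm_vlans.items():
        if vlan in network_map:
            mapped[vm] = network_map[vlan]
        else:
            unmapped.append(vm)
    return mapped, sorted(unmapped)


# Hypothetical VMs and their primary-site VLANs.
vm_vlans = {"web01": 100, "sql01": 101, "legacy01": 150}

mapped, unmapped = map_vm_networks(vm_vlans, NETWORK_MAP)
```

Here `legacy01` sits on VLAN 150, which has no entry in the map, so it ends up in `unmapped` rather than silently landing on the wrong network.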

Test Network Mapping: The Isolated Sandbox

Here's where Recovery Plans truly shine: test failover capability. You can perform a complete test of your DR plan without impacting production workloads or production network spaces.

When you perform a test failover, you specify test networks that are completely isolated from production. This is critical: your test networks must be Layer 2 only networks with no gateway or routing configured. This isolation ensures:

Zero impact to production: Test VMs cannot communicate with production systems
No IP conflicts: Even if test VMs retain production IP addresses, they cannot conflict with live systems
Complete isolation: Without Layer 3 routing, test failovers are confined to their own broadcast domain
Safe testing: You can power on entire application stacks without risk to operations

When configuring test networks, ensure you:

Create dedicated VLANs specifically for DR testing (e.g., VLAN 900-999 for test failover)
Configure as Layer 2 only with no default gateway, no routing, and no connection to production networks
Verify isolation before first use by attempting to ping production resources from a test VM and confirming that every attempt fails
Document the test network mapping in your DR runbooks for consistency
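The checklist above lends itself to a small pre-flight check. The following is a sketch under stated assumptions: the `IsolatedNetwork` record and its fields are hypothetical (you would populate them from your own network inventory), and the 900-999 range simply mirrors the example VLAN range mentioned above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IsolatedNetwork:
    """Hypothetical description of a candidate test-failover network."""
    name: str
    vlan_id: int
    gateway: Optional[str] = None  # None means no default gateway (Layer 2 only)
    routed: bool = False


def isolation_problems(net, production_vlans, test_vlan_range=range(900, 1000)):
    """List the reasons a network is not safe to use for test failover."""
    problems = []
    if net.gateway is not None:
        problems.append("gateway configured")
    if net.routed:
        problems.append("routing enabled")
    if net.vlan_id in production_vlans:
        problems.append("VLAN shared with production")
    if net.vlan_id not in test_vlan_range:
        problems.append("VLAN outside reserved test range")
    return problems


production_vlans = {100, 101, 102}
good = IsolatedNetwork("dr-test-web", 900)
bad = IsolatedNetwork("dr-test-db", 101, gateway="10.1.101.1")
```

An empty list from `isolation_problems` means the network passed these checks. It does not replace the ping test from inside a test VM, which validates what the switches actually do rather than what the inventory says.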

With properly isolated test networks, you can:

Validate that VMs power on correctly
Verify boot sequencing works as designed
Test application functionality in the recovery environment
Identify configuration issues before they matter
Perform full DR drills without change control windows

All of this happens in a completely isolated network segment, meaning zero risk to your production environment. Once testing is complete, you simply clean up the test environment, and your production systems remain untouched.

Network Reconfiguration Limitations

While Recovery Plans handle network mapping elegantly, there are important limitations to understand:

IP Address Changes During Failover

Recovery Plans can change the network VLANs that VMs connect to, but they do not automatically reconfigure IP addresses inside the guest operating system. This means:

VMs keep their in-guest IP configuration unchanged when they fail over
If your recovery site uses a different IP scheme, manual reconfiguration or scripting is required
DHCP-based VMs will receive new addresses automatically if DHCP is available at the recovery site
Static IP VMs require either IP reuse at the recovery site or manual/scripted reconfiguration
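One way to turn these rules into a planning aid is to classify each VM ahead of time. The sketch below uses Python's standard `ipaddress` module to decide whether a static address would still be valid in the recovery subnet. The function name, the addressing labels, and the sample subnets are assumptions made for illustration.

```python
import ipaddress


def failover_ip_action(addressing, current_ip, recovery_subnet):
    """Classify what has to happen to a VM's IP settings after failover."""
    if addressing == "dhcp":
        return "lease from recovery-site DHCP"
    address = ipaddress.ip_address(current_ip)
    if address in ipaddress.ip_network(recovery_subnet):
        return "keep static IP"
    return "reconfigure manually or by script"


# Hypothetical examples: the recovery site either reuses or changes the subnet.
same_scheme = failover_ip_action("static", "10.1.100.15", "10.1.100.0/24")
new_scheme = failover_ip_action("static", "10.1.100.15", "10.2.100.0/24")
dhcp_vm = failover_ip_action("dhcp", "10.1.100.50", "10.2.100.0/24")
```

Running this across your VM inventory gives you a list of machines that need scripted or manual reconfiguration before you discover it mid-failover.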

Static IP Mapping Considerations

For environments where recovery site networks already have IP addresses in use that conflict with production VMs:

You cannot have the same IP address active on both sites simultaneously
Test failovers require isolated networks to avoid IP conflicts
Production failovers assume the primary site is down and the IP space is available
In-guest scripting (covered in Part 8) can automate IP reconfiguration during failover

Failback Network Complexity

Failback operations face similar network constraints. When returning to the primary site, VMs need their original network configuration restored, which may require:

Coordination to ensure IP addresses aren't duplicated during transition
Temporary network isolation during the failback process
Scripted reconfiguration to return to original network settings

Selecting VMs and Categories for Recovery

Recovery Plans offer flexible options for determining which VMs to protect and recover:

Individual VM Selection

You can explicitly add specific VMs to a Recovery Plan. This approach works well for:

Small, well-defined application stacks
High-value systems that need dedicated recovery plans
Environments with minimal change

Category-Based Selection

For more dynamic environments, you can add VMs based on Nutanix Categories (essentially tags). For example:

Category: App=ERP includes all ERP-related VMs
Category: Tier=Web includes all web tier VMs
Category: Criticality=Tier1 includes all business-critical systems

The category-based approach is more scalable and maintainable. When you add a new web server to production and tag it with Tier=Web, it's automatically included in the appropriate Recovery Plan—no manual updates required.
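Conceptually, category-based selection is a filter over key/value tags. This short sketch shows the idea with an in-memory inventory; the VM names and the `select_by_category` helper are hypothetical, and the real selection happens inside Prism Central rather than in your own code.

```python
def select_by_category(vms, key, value):
    """Return the names of VMs tagged with the given category key and value."""
    return sorted(name for name, cats in vms.items() if cats.get(key) == value)


# Hypothetical inventory: VM name -> category key/value pairs.
vms = {
    "web01": {"Tier": "Web", "App": "ERP"},
    "web02": {"Tier": "Web"},
    "sql01": {"Tier": "DB", "App": "ERP"},
}

before = select_by_category(vms, "Tier", "Web")

# Tagging a new VM is all it takes for it to show up in the selection.
vms["web03"] = {"Tier": "Web"}
after = select_by_category(vms, "Tier", "Web")
```

Nothing about the selection rule changed between `before` and `after`; only the inventory did, which is why this approach keeps up with a changing environment.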

Critical Limitation: Mixing Replication Schedules

⚠️ Important: When selecting VMs for a Recovery Plan, you must be careful about mixing different replication schedule types. Do not include entities protected with synchronous replication in the same recovery plan as entities protected with asynchronous or near-sync replication. Doing so will cause the recovery to fail.

The rules are:

✅ Allowed: Asynchronous + Near-Sync in the same Recovery Plan
❌ Not Allowed: Synchronous + Asynchronous in the same Recovery Plan
❌ Not Allowed: Synchronous + Near-Sync in the same Recovery Plan
❌ Not Allowed: Synchronous + Asynchronous + Near-Sync in the same Recovery Plan

Why This Matters: Synchronous replication operates fundamentally differently from async and near-sync. Synchronous replication maintains continuous consistency between sites, while async and near-sync work with point-in-time snapshots. These incompatible recovery mechanisms cannot be orchestrated together in a single recovery operation.

Best Practice: Create separate Recovery Plans for synchronous-replicated workloads and async/near-sync workloads. If your Metro Availability (synchronous) workloads need to fail over together with async workloads, you'll need to execute multiple Recovery Plans in sequence.
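The rules above reduce to one check: synchronous replication cannot share a plan with anything else. Here is a minimal sketch of that check plus a helper that splits a mixed set of VMs into two candidate plans. The labels `"sync"`, `"async"`, and `"nearsync"` and both function names are my own shorthand, not Nutanix identifiers.

```python
KNOWN_TYPES = {"sync", "async", "nearsync"}


def is_valid_mix(schedule_types):
    """True if these replication types may share a single Recovery Plan."""
    types = set(schedule_types)
    unknown = types - KNOWN_TYPES
    if unknown:
        raise ValueError(f"unknown replication types: {sorted(unknown)}")
    # Synchronous entities must not be combined with any other type.
    return not ("sync" in types and len(types) > 1)


def split_into_plans(vm_types):
    """Separate VMs into a synchronous plan and an async/near-sync plan."""
    sync_plan = sorted(vm for vm, t in vm_types.items() if t == "sync")
    other_plan = sorted(vm for vm, t in vm_types.items() if t != "sync")
    return sync_plan, other_plan


# Hypothetical VMs and the replication type protecting each one.
vm_types = {"sql01": "sync", "web01": "async", "app01": "nearsync"}
sync_plan, other_plan = split_into_plans(vm_types)
```

A check like this is cheap to run whenever category membership changes, since a newly tagged VM can quietly introduce a forbidden mix.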

Creating and Managing Recovery Plans

The process of creating a Recovery Plan follows a logical workflow:

Prerequisites

Availability Zones configured between primary and recovery sites
Protection Policy created and actively replicating data
Network information for both sites documented
Application dependencies mapped and understood

Configuration Steps

Define the recovery location (on-premises secondary datacenter or cloud)
Select VMs or categories to include in the plan
Organize VMs into stages based on dependencies
Map production networks to recovery site networks
Configure test networks for isolated failover testing
Validate the configuration to ensure all dependencies are met

Ongoing Management

Recovery Plans aren't "set it and forget it" artifacts. They require regular attention:

Validation: Periodically review that all protected VMs are still included
Testing: Perform test failovers quarterly (at minimum) to verify functionality
Updates: Adjust staging and network mapping as your infrastructure evolves
Documentation: Keep runbooks updated with Recovery Plan changes

Executing Recovery Plans

When it's time to use your Recovery Plan—whether for testing, planned migration, or an actual disaster—the execution process follows the orchestration you've defined. Recovery Plans can be executed in different modes depending on your scenario, and we'll dive deep into managing planned versus unplanned failovers in Part 7 of this series.

For now, understand that Recovery Plans execute the power-on sequence you've defined, apply the network mappings you've configured, and bring your applications online at the recovery site in a controlled, predictable manner.

Failover Execution Modes: Automatic vs Manual

Recovery Plans support two execution modes, each suited to different DR architectures and business requirements:

Manual Execution Mode

Manual mode requires human intervention to initiate failover operations. This is the default and most common configuration.

When to use: Most Async and Near-Sync replication scenarios
How it works: An administrator must explicitly trigger the failover operation through Prism Central
Benefits: Provides control and verification before critical operations execute
Use cases: Planned maintenance migrations, controlled disaster response, environments where verification is required before failover

Automatic Execution Mode

Automatic mode enables Recovery Plans to execute without human intervention when specific conditions are met. This requires Synchronous Replication with an external Witness.

When to use: Synchronous replication with Metro Availability configured
How it works: When the Witness detects primary site failure, failover initiates automatically
Requirements:
- Synchronous replication between sites
- External Witness cluster (third location or cloud-based)
- Network connectivity from Witness to both sites
Benefits: Minimizes RTO by eliminating human response time during disasters
Use cases: Mission-critical workloads requiring near-zero RTO, 24/7 operations without on-call staff, regulatory requirements for automated failover

The choice between automatic and manual execution fundamentally impacts your Recovery Time Objective (RTO). Automatic execution can reduce RTO from hours (time to detect, respond, and execute) to minutes (detection and automated execution only), but requires the infrastructure investment of Synchronous replication and a Witness.
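A back-of-the-envelope model shows where the difference comes from. The figures below are purely hypothetical placeholders, not measurements or Nutanix numbers; substitute the detection, response, and execution times from your own environment.

```python
def estimate_rto_minutes(detect, respond, execute):
    """Sum the three phases that make up time-to-recovery, in minutes."""
    return detect + respond + execute


# Hypothetical figures for illustration only.
# Manual: monitoring alerts, someone is paged and decides, then the plan runs.
manual_rto = estimate_rto_minutes(detect=15, respond=90, execute=30)

# Automatic: the Witness detects the failure and no human response is needed.
automatic_rto = estimate_rto_minutes(detect=2, respond=0, execute=30)

saved = manual_rto - automatic_rto
```

In this model the execution time is the same in both modes. The saving comes almost entirely from removing the human response phase, which is also the hardest phase to predict.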

In-Guest Script Execution

Beyond power-on sequencing and network mapping, Recovery Plans can execute custom scripts inside guest VMs during failover operations. This capability enables advanced automation for application-specific configuration.

Common Script Use Cases

Network reconfiguration (IP address changes, DNS updates, gateway modifications)
Application service restart or reinitialization
Database recovery procedures
Load balancer registration/deregistration
Monitoring system notifications

Script Execution Timing

Scripts can be executed at different stages of the recovery process:

Pre-failover: Run before VMs power on (rare, usually for infrastructure preparation)
Post-failover: Run after VMs boot (most common, for application configuration)
Per-stage: Execute scripts at specific power-on stages for coordinated configuration

Limitations and Considerations

Scripts require guest OS credentials stored in Prism Central
Script execution adds time to your overall Recovery Time Objective
Failed scripts can block recovery operations or leave applications misconfigured
Testing script execution is critical during test failover operations

We'll explore in-guest scripting in much greater detail in Part 8, including practical examples for DNS management, IP reconfiguration, and application-specific automation. For now, understand that Recovery Plans provide the framework for these advanced automation scenarios, extending beyond simple VM power-on to complete application recovery orchestration.

Best Practices for Recovery Plans

Based on real-world implementations, here are key best practices:

Start Simple - Don't try to build the perfect Recovery Plan on day one. Start with your most critical application, test it thoroughly, then expand.
Test Regularly - A Recovery Plan that's never tested is a plan that will fail when you need it. Schedule quarterly test failovers at minimum.
Use Categories for Scale - Individual VM selection works for 10 VMs. For 100 or 1,000 VMs, categories provide the scalability you need.
Document Dependencies - Your Recovery Plan is only as good as your understanding of application dependencies. Invest time in dependency mapping.
Align with Business Continuity Plans - Your Recovery Plans should directly support your business continuity objectives. Map stages to Recovery Time Objectives (RTOs).
Monitor and Alert - Enable monitoring for Protection Policy replication status. A Recovery Plan is useless if recent data isn't available at the recovery site.
Plan for Network Differences - Recovery sites often have different network architectures. Design your network mapping strategy before you need it.

The Complete DR Picture

Nutanix Recovery Plans are the operational complement to Protection Policies. Where Protection Policies provide the data foundation, Recovery Plans provide the orchestration layer that turns replicated data into running applications.

Together, they form a comprehensive disaster recovery solution:

Protection Policies ensure data is where it needs to be
Recovery Plans ensure applications come online in the right order, on the right networks, at the right time

The beauty of the Nutanix approach is that it takes what used to require complex scripting, manual runbooks, and crossed fingers, and transforms it into a tested, validated, automated process.

What's Next in the DR Series

In this post, we've explored how Recovery Plans orchestrate the failover process. In the next installment, we'll dive into testing and validation strategies and how to build confidence in your DR capabilities without risking production systems.

Whether you're protecting a handful of critical applications or orchestrating the recovery of an entire datacenter, Nutanix Recovery Plans provide the automation and reliability that modern disaster recovery demands.

Your data is protected. Your recovery plan is automated. Your business continuity is secured.