Nutanix DR: Planned vs. Unplanned Failovers

Overview

📖 Disaster Recovery in 2025 Series - Part 7 This post is part of my comprehensive disaster recovery series. New to the series? Start with the Complete Guide Overview to see what's coming, or catch up with Part 1 - Why DR Matters, Part 2 - Modern Disaster Recovery, Part 3 - Nutanix DR Overview, Part 4 - Protection Policies, Part 5 - Recovery Plans, and Part 6 - DR Testing Best Practices.

In my previous post on Recovery Plans, I explored how Nutanix orchestrates the failover process with automated boot sequencing and network mapping. But not all failovers are created equal. The approach you take and the automation Nutanix provides differs depending on whether you're executing a planned migration or responding to an actual disaster.

Understanding the distinction between planned and unplanned failovers is critical for setting the right expectations around data loss, recovery time, and operational procedures. Let's dive into how Nutanix DR handles each scenario.

The Fundamental Difference

The core distinction between planned and unplanned failovers comes down to one simple question: Is the primary site still accessible?

Planned Failover assumes your primary site is operational and you have time to orchestrate a clean migration. This might be for datacenter maintenance, infrastructure upgrades, or disaster avoidance (like an approaching hurricane).

Unplanned Failover assumes your primary site is down or unreachable due to an actual disaster; power failure, natural disaster, cyberattack, or other catastrophic event. You're working from the last replicated snapshot, and time is critical.

This fundamental difference drives everything else about how these operations execute.


Planned Failover: Zero Data Loss Migration

When you initiate a planned failover with Nutanix DR, you're telling the system "I have time to do this right." Nutanix automates the entire orchestration through Prism Central.

The Nutanix Planned Failover Process

Let's take a quick look at the sequence of activities that are executed during a Planned Failover with Nutanix DR:

Source VMs Gracefully Powered Down

The first step in the planned failover process is to gracefully shut down the guest VMs running at the primary site. Nutanix leverages standard operating system shutdown procedures to ensure a clean and orderly closure of all applications and services.

  • Operating System Integration: Nutanix communicates with the guest operating systems using tools like Nutanix Guest Tools (NGT) or standard OS commands to initiate the shutdown process.
  • Application Consistency: This graceful shutdown allows applications to properly close files, commit transactions, and finalize any pending operations. This is crucial for maintaining data consistency and preventing data corruption.
  • Minimized Data Loss: By allowing applications to shut down cleanly, the risk of data loss or inconsistencies is significantly reduced. This ensures that the migrated workloads start up in a stable and reliable state at the recovery site.

Final Snapshot Created

After the successful shutdown of the guest VMs, Nutanix creates a final recovery point, also known as a snapshot. This snapshot captures the absolute latest state of the workloads, ensuring that no data is lost during the failover process.

  • Point-in-Time Recovery: The final snapshot provides a precise point-in-time image of the VMs' disks and memory (if configured for memory snapshots).
  • Zero Data Loss: This snapshot guarantees that every bit of data up to the moment of shutdown is preserved. This is a critical aspect of planned failover, as it ensures that the recovery site has an exact replica of the primary site's data.
  • Data Integrity: The snapshot is created in a consistent state, reflecting the clean shutdown of the applications and operating systems. This ensures that the recovered VMs are in a usable and reliable state.

Final Replication

Once the final snapshot is created, it is replicated to the recovery site. Nutanix uses the Protection Policy settings configured for the VMs to determine how the replication is performed.

  • Protection Policy Settings: The Protection Policy defines parameters such as compression, deduplication, and encryption for the replication process. These settings optimize the transfer of data while ensuring security and efficiency.
  • Compression and Deduplication: Compression reduces the amount of data that needs to be transferred, while deduplication eliminates redundant data blocks. These techniques minimize bandwidth usage and accelerate the replication process.

Source VMs Deleted

After the successful replication of the final snapshot to the recovery site, the original VMs are removed from the primary cluster. This step is crucial to prevent split-brain scenarios, where the same VM could accidentally run in both locations simultaneously.

  • Split-Brain Prevention: Deleting the source VMs ensures that there is only one active instance of each VM at any given time. This eliminates the risk of data inconsistencies and conflicts that can arise from running the same VM in multiple locations.
  • Automatic Cleanup: The deletion process is automated as part of the planned failover workflow, reducing the risk of human error and ensuring that the primary site is properly cleaned up.
  • Resource Optimization: Removing the VMs from the primary site frees up resources that can be used for other purposes, such as running test workloads or performing maintenance tasks.

Recovery Site Activation

The final step in the planned failover process is to activate the VMs at the recovery site. This involves creating the VMs from the final snapshot and powering them on according to the Recovery Plan's boot sequencing.

  • VM Creation: Nutanix creates new VMs at the recovery site using the data from the final snapshot. These VMs are configured with the same settings as the original VMs, including CPU, memory, and disk configurations.
  • Boot Sequencing: The Recovery Plan defines the order in which the VMs are powered on. This is important for ensuring that applications and services start up in the correct sequence, with dependencies properly resolved. The boot sequence is typically organized into stages (Stage 0, Stage 1, etc.), with each stage containing a group of VMs that need to be started in a specific order.
  • Network Mapping: Network mappings are automatically applied to the VMs, ensuring that they are connected to the correct networks at the recovery site. This allows the VMs to communicate with each other and with external resources.
  • In-Guest Script Execution: Any configured in-guest scripts are executed after the VMs are powered on. These scripts can be used to perform tasks such as updating DNS records, configuring application settings, or running custom initialization routines.
  • Verification: After the VMs are activated, it is important to verify that they are functioning correctly and that all applications and services are running as expected. This may involve performing functional testing, checking logs, and monitoring performance metrics.

Executing Planned Failover in Prism Central

The actual execution workflow is straightforward:

  1. Connect to Prism Central at the recovery site
  2. Navigate to Data Protection > Recovery Plans
  3. Select your Recovery Plan and click Actions > Failover
  4. Choose Planned Failover as the failover type
  5. Type "Failover" in the confirmation field and click Failover

Nutanix displays real-time progress through each stage. VM shutdown status, snapshot creation, replication progress, and recovery site activation. The Last Failover Status column shows "Succeeded" when complete.

Key Characteristics of Planned Failover

  • Zero Data Loss Guaranteed - RPO is effectively zero because the final snapshot includes all data up to shutdown. This isn't estimated—it's guaranteed.
  • Clean Application State - Applications shut down properly using guest OS procedures, reducing corruption risk and ensuring consistency.
  • Time Investment Required - The graceful shutdown, final snapshot, and replication adds time. Duration depends on final change size and available bandwidth.
  • Requires Primary Site Access - The primary cluster must be online and accessible. If the site is down, you must use unplanned failover instead.
  • Automatic Source Cleanup - Original VMs are automatically removed, preventing accidental dual operation.

When to Use Planned Failover

Planned failover is ideal for:

  • Scheduled datacenter maintenance windows
  • Infrastructure migrations between sites
  • Evacuating ahead of predicted disasters (hurricanes, planned power outages)
  • Compliance scenarios requiring documented zero-data-loss migration
  • DR testing with complete data validation

Unplanned Failover: Speed Over Perfection

Unplanned failover operates under very different assumptions. Your primary site is down. You need applications running now. There's no time for graceful shutdowns or final snapshots. Nutanix prioritizes speed and availability.

The Nutanix Unplanned Failover Process

Let's take a quick look at the sequence of activities that are executed during a Unplanned Failover with Nutanix DR:

Latest Available Snapshot Identified

The first step in the Nutanix DR process is to identify the most recent replicated recovery point at the DR site. This recovery point represents a snapshot of the VMs and their data at a specific point in time. The age of this snapshot, and therefore the potential data loss window, is determined by your Protection Policy Recovery Point Objective (RPO).

  • Near-Sync Replication: For applications requiring the lowest possible RPO, Nutanix offers Near-Sync replication, which can achieve RPOs as low as 1 minute. This means that in the event of a disaster, the maximum data loss would be limited to one minute's worth of changes.
  • Asynchronous Replication: For less critical applications, asynchronous replication is a common choice. This typically involves replicating data on an hourly or daily basis, resulting in RPOs of 1 hour or more.

The system automatically identifies the most recent consistent snapshot that has been successfully replicated to the DR site. This ensures that the recovered VMs are based on the most up-to-date data available.

Recovery Point Cloned

Once the latest available snapshot is identified, Nutanix clones it using its efficient redirect-on-write technology. This cloning process is crucial for minimizing recovery time.

  • Redirect-on-Write: Instead of creating a full copy of the snapshot data, redirect-on-write creates a lightweight clone that initially shares the same data blocks as the original snapshot. Any subsequent writes to the cloned VMs are redirected to new blocks, leaving the original snapshot intact.
  • Instant VM Creation: This approach allows VMs to be created almost instantly, without the need to wait for a full data copy. The VMs use metadata to reference the data stored in the snapshot. This significantly reduces the time required to bring the VMs back online.

VMs Created and Powered On

With the snapshot cloned, the next step is to create and power on the guest VMs at the recovery site. This process is guided by a Recovery Plan, which defines the order in which VMs should be started and any necessary network configurations.

  • Boot Sequencing: The Recovery Plan defines the order in which the VMs are powered on. This is important for ensuring that applications and services start up in the correct sequence, with dependencies properly resolved. The boot sequence is typically organized into stages (Stage 0, Stage 1, etc.), with each stage containing a group of VMs that need to be started in a specific order.
  • Network Mapping: Network mappings are automatically applied to the VMs, ensuring that they are connected to the correct networks at the recovery site. This allows the VMs to communicate with each other and with external resources.
  • In-Guest Script Execution: Any configured in-guest scripts are executed after the VMs are powered on. These scripts can be used to perform tasks such as updating DNS records, configuring application settings, or running custom initialization routines.
  • Verification: After the VMs are activated, it is important to verify that they are functioning correctly and that all applications and services are running as expected. This may involve performing functional testing, checking logs, and monitoring performance metrics.
  • Independent Operation: Importantly, this entire process occurs independently of the primary site. No communication with the primary site is required or attempted. This is critical in a disaster scenario where the primary site is inaccessible.

Accept Data Loss Window

A critical aspect of any disaster recovery strategy is understanding the potential for data loss. In the event of a disaster, any changes made to the VMs after the last successful replication to the DR site will be permanently lost.

  • RPO as the Limiting Factor: The amount of data loss is directly determined by your RPO. If your RPO is 1 minute, the maximum data loss will be 1 minute's worth of changes. If your RPO is 1 hour, the maximum data loss will be 1 hour's worth of changes.
  • Unavoidable Loss: This data loss is unavoidable when the primary site is completely inaccessible. However, by carefully selecting an appropriate RPO for each application, you can minimize the potential impact of data loss.

Executing Unplanned Failover in Prism Central

The execution is similar to planned failover, but faster:

  1. Connect to Prism Central at the recovery site
  2. Navigate to Data Protection > Recovery Plans
  3. Select your Recovery Plan and click Actions > Failover
  4. Choose Unplanned Failover as the failover type
  5. Type "Failover" in the confirmation field and click Failover

Nutanix immediately begins activating VMs at the recovery site without attempting to contact the primary cluster. Progress is tracked in real-time through the Recovery Plan dashboard.

Key Characteristics of Unplanned Failover

  • Bounded Data Loss - Maximum data loss equals your Protection Policy RPO. If you replicate every 5 minutes (Near-Sync), you lose at most 5 minutes. If you replicate hourly (Async), you lose at most 1 hour.
  • Fast Execution - No waiting for shutdowns or final replication. Recovery starts immediately from the last snapshot.
  • No Primary Site Dependency - Works even when the primary site is completely unreachable, offline, or destroyed.
  • Crash-Consistent Recovery - VMs restore from snapshots that captured a point-in-time state. Application-aware VSS snapshots (Windows) or pre/post-scripts help ensure application consistency.
  • Post-Recovery Protection - After unplanned failover, create a new Protection Policy to take local snapshots of the now-production VMs at the recovery site.

When to Use Unplanned Failover

Unplanned failover is designed for:

  • Actual disaster scenarios (site failure, power loss, natural disaster)
  • Network connectivity loss to the primary site
  • Primary cluster failures that prevent graceful shutdown
  • Time-critical situations where speed matters more than the last few minutes of data
  • Testing realistic disaster response and measuring actual RTO

Recovery Point Selection Limitations

An important distinction between planned and unplanned failover is recovery point selection:

Planned Failover

  • You cannot select a past recovery point
  • The failover always uses the final snapshot created during the planned failover process
  • This ensures zero data loss and the most current state
  • There's no option to "go back in time" to an earlier snapshot

Unplanned Failover

  • You can select from available recovery points
  • Nutanix presents a list of replicated snapshots at the recovery site
  • You can choose a specific point-in-time if needed (for example, before a ransomware attack)
  • This flexibility is critical for disaster scenarios where the most recent data may be corrupted

This difference reflects the use cases: planned failover assumes you want the latest, cleanest state. Unplanned failover acknowledges that sometimes the most recent snapshot isn't the one you want to recover—especially in cyberattack or corruption scenarios.

The RPO Reality Check

Unplanned failover forces you to live with your Protection Policy RPO. If you configured hourly replication to save on bandwidth and storage, you accept up to 1 hour of data loss during a real disaster. If that's unacceptable, you need Near-Sync (1-15 minute RPO) or Synchronous replication (zero RPO).

This is why testing unplanned failover is critical—it validates whether your RPO settings actually align with business requirements when you lose data for real.


Cross-Cluster Live Migration: An Alternative to Planned Failover

For environments where even planned downtime is unacceptable, Nutanix offers cross-cluster live migration as an alternative to traditional planned failover. While technically a planned failover activity, it operates differently enough to warrant its own discussion.

How It Works

  • VMs migrate while remaining powered on and accessible
  • Storage replication happens in real-time during migration
  • Network connectivity transitions seamlessly to the destination cluster
  • Users experience minimal to no service interruption

The Trade-Off: Time

Cross-cluster live migration is significantly more time-consuming than planned failover:

  • Large VMs with active workloads can take hours to migrate
  • Network bandwidth becomes a critical constraint
  • Migration duration is proportional to VM size and change rate
  • Multiple VMs migrating simultaneously compound the time requirement

When to Consider Live Migration

  • Zero-downtime requirements for mission-critical applications
  • Datacenter evacuations where services must remain online
  • Infrastructure migrations during business hours
  • Applications that cannot tolerate even brief outages

For most DR scenarios, the planned failover approach (with graceful shutdown) is faster and more efficient. Live migration is a premium option reserved for workloads where availability truly cannot be interrupted.


Nutanix-Specific Considerations

Beyond the basic planned vs unplanned distinction, Nutanix DR includes several technical considerations that affect failover behavior.

Replication Type Impact

Your Protection Policy replication type influences failover characteristics:

Asynchronous Replication (1-24 hour RPO)

  • Planned failover waits for final snapshot replication (could add significant time)
  • Unplanned failover uses last scheduled recovery point (data loss up to RPO)
  • Best for workloads tolerant of hours of data loss

Near-Synchronous Replication (1-15 minute RPO)

  • Planned failover final snapshot typically replicates quickly
  • Unplanned failover loses at most 15 minutes of data
  • Balances data protection with bandwidth efficiency

Synchronous Replication (Zero RPO)

  • Planned failover still follows graceful shutdown process
  • Unplanned failover with witness enables automatic failover (Metro Availability)
  • Both modes achieve zero data loss, but unplanned is faster

Witness and Automatic Failover

For Synchronous replication with Metro Availability, Nutanix supports automatic unplanned failover:

  • External witness cluster monitors site availability
  • When primary site fails, witness triggers automatic unplanned failover
  • Recovery Plans execute automatically without human intervention
  • Minimizes RTO by eliminating human detection and response time

This is the only scenario where failover can execute automatically. All other configurations require manual initiation through Prism Central.

Post-Failover State Management

After any failover (planned or unplanned), important state changes occur:

Recovery Plan Status

  • The Recovery Plan enters "Failover" state at the recovery site
  • VMs are now "active" at the recovery location
  • The primary site (if accessible) shows entities as "inactive"

Protection Policy Behavior

  • Existing Protection Policy from primary to recovery site becomes inactive
  • You should create a new Protection Policy at the recovery site for local snapshots
  • If the primary site comes back online, configure reverse replication for eventual failback

Network Configuration

  • Production network mappings apply automatically
  • VMs connect to recovery site networks as defined in the Recovery Plan
  • In-guest IP addresses remain unchanged (unless modified by guest scripts)
  • DNS and routing may require manual updates depending on your architecture

Test Failover: The Third Option

While not a "real" failover, Nutanix also supports Test Failover:

  • Uses metadata to clone the latest snapshot
  • Creates VMs on isolated test networks
  • Allows validation without affecting production or DR readiness
  • Cleanup is simple—just delete test VMs when done

Test failover uses the same snapshot cloning technology as unplanned failover but operates in complete isolation. This lets you validate both planned and unplanned scenarios without risk.

Choosing the Right Approach

The decision between planned and unplanned failover isn't always obvious. Here's guidance:

Use Planned Failover When:

  • Performing scheduled datacenter maintenance
  • Migrating infrastructure between sites
  • Evacuating ahead of predicted disasters (severe weather forecasts)
  • Testing full DR capabilities with zero data loss
  • Compliance requires zero-data-loss migration validation

Use Unplanned Failover When:

  • Primary site is completely down or unreachable
  • Network connectivity to primary site is lost
  • Time is critical and you can't wait for graceful shutdown
  • Testing realistic disaster response and measuring actual RTO
  • Validating whether your RPO settings are actually acceptable

Impact on Recovery Objectives

These two approaches directly impact your DR metrics:

Recovery Point Objective (RPO)

  • Planned Failover: Zero data loss (RPO = 0)
  • Unplanned Failover: Data loss up to Protection Policy RPO

Recovery Time Objective (RTO)

  • Planned Failover: Longer due to shutdown and final replication
  • Unplanned Failover: Faster, immediate activation from last snapshot

Understanding this trade-off helps set realistic expectations. Planned failover optimizes for data integrity. Unplanned failover optimizes for speed.

What About Failback?

Both planned and unplanned failovers can be reversed through failback operations, returning workloads to the original primary site once it's restored. The failback process mirrors the failover approach:

  • Planned Failback follows the same graceful process as planned failover
  • Unplanned Failback uses the last available snapshot if the current recovery site fails

We'll explore failback procedures, considerations, and best practices in detail in Part 8 of this series.

The Power of Choice

What makes Nutanix DR powerful is that you have both options available through the same Recovery Plan. You don't need separate runbooks or procedures. The Recovery Plan orchestration remains the same, only the execution mode changes based on your circumstances.

This flexibility means you can:

  • Test with planned failover to validate zero data loss migration
  • Practice unplanned failover to measure realistic disaster response
  • Choose the appropriate mode when the actual need arises

Your Protection Policies define the data availability foundation. Your Recovery Plans define the orchestration. And the failover mode you select determines the balance between speed and data integrity.

What's Next in the DR Series

In this post, we've explored the critical differences between planned and unplanned failover operations. In the next installment, we'll dive into failback operations and DR testing strategies and how to return to normal operations and how to validate your DR capabilities without impacting production.

Understanding when and how to use each failover type is essential for effective disaster recovery. The right choice depends on your circumstances, but having both options automated and ready to execute gives you the flexibility to respond appropriately whether you're dealing with planned maintenance or an actual catastrophe.