As in years past, as the year comes to a close, I take some time to clean up the As-Built documentation templates I maintain for deployments. As part of this activity, I always come across sections that have fallen behind the current feature set, or areas that could use some additional content.
This week, I noticed that my documentation about placing hosts into maintenance mode with Nutanix, whether running AHV or ESXi, is a bit outdated and could use some touching up.
Coming out of that, I thought it might be nice to discuss the different maintenance mode operations available with Nutanix and get a little hypervisor-specific about the options available.
I’ll write a 3-part series discussing the different scenarios for maintenance mode, detailing the process along with some known issues and limitations across the different hypervisors.
Today’s post will focus on maintenance mode and why it’s important. It will also cover maintenance mode operations with the Nutanix native hypervisor, AHV (Acropolis Hypervisor).
- Part 2 will cover ESXi.
- Part 3 will cover upgrading ESXi through Prism.
The series will focus on AOS 6.5+, which is, at the time of this writing (12/23/23), the latest LTS release from Nutanix.
For this blog series, I’ll use a 4-node NX-1065 cluster imaged with the specific hypervisor for that post. For the ESXi post, I will use a pre-existing vCenter instance for cluster management.
Node Maintenance
It’s Important, right?
First, let’s discuss when maintenance mode is used and why it’s an important piece of lifecycle maintenance – or care and feeding of a platform.
Nutanix, in their documentation, details Node Maintenance pretty well:
You are required to gracefully place a node into the maintenance mode or non-operational state for reasons such as making changes to the network configuration of a node, performing manual firmware upgrades or replacements, performing CVM maintenance, or any other maintenance operations.
So, when do we use node maintenance? You might have picked up from the quote that it’s useful when we’re making changes to a node (network, etc.), replacing hardware, or performing general maintenance. Here are the activities, both automated and manual, where I think of maintenance mode:
- AOS or LCM operations (automated through LCM)
- Node Hardware replacement (PSU, NIC, etc.)
- Physically moving hardware
Maintenance mode is also useful when we just want to take a node out of normal VM operations for troubleshooting purposes.
Maintenance Mode – Where do we start?
Placing hosts into maintenance mode differs slightly based on the hypervisor in use. Starting with AOS 6.1.2, Nutanix gives you a choice of how to enter maintenance mode, either through the CLI or through the Prism Web Console. So, let’s do a brief check on our options per hypervisor:
- AHV: CLI or Web Console
- ESXi: CLI or Web Console. There’s also an option for Maintenance mode through vCenter or the ESXi host directly; we’ll cover that in Part 2.
- Hyper-V: CLI ONLY
Note: You must exit maintenance mode using the same method you used to enter it. For example, if you use the CLI to put the node into maintenance mode, you must use the CLI to take it out; the same applies to the web console.
What happens when we place a host into maintenance mode?
When placing a node into maintenance mode, we need to remember that only one node at a time can be in maintenance mode per cluster.
When a host is placed in maintenance mode, depending on the state of the VM, certain VM operations will take place.
For guest VMs, the outcome or availability of the VMs during maintenance mode operations (the scenarios are the same for both AHV and ESXi) depends on how the VMs are classified:
- High Availability VMs: VMs will be migrated to alternate hosts.
- Pinned/RF1 VMs: These VMs remain powered off for as long as the host is in maintenance mode.
After exiting maintenance mode, all Pinned/RF1 guest VMs are powered on, while the live-migrated VMs are automatically rebalanced across the cluster.
Note: As I’ll cover in the ESXi post, depending on your vSphere licensing, workloads may not be automatically migrated or rebalanced across the cluster.
Node Maintenance Mode with AHV
Part 1 in this series will cover Node maintenance with AHV. When running a Nutanix cluster based on AHV, we can place a node into maintenance mode or remove a node from maintenance using CLI commands within any Controller VM or from Prism Element.
Nutanix recommends using the Web Console to place nodes into maintenance mode; however, the CLI option lends itself to scripting and automation workflows.
Note: Regardless of the method to place the node into maintenance mode (CLI or GUI), the AHV host is not automatically shut down.
Maintenance Mode using Web Console
Entering Maintenance Mode
The preferred method with AHV to place a host into maintenance mode is from Prism Element. This capability is relatively new, arriving around AOS release 6.1.2. When using Prism Element to place a host into maintenance mode, the following tasks are performed:
- The AHV host initiates entering maintenance mode.
- VMs are either migrated or powered off.
- The AHV host enters maintenance mode.
- The CVM enters maintenance mode.
- The CVM powers down.
So, let’s unpack the steps that go into entering maintenance mode. From Prism Element, once you select the Enter Maintenance Mode option:
- The host will live migrate any non-pinned VMs to alternate hosts and evaluate any Pinned or RF1 VMs that need to be powered off.
- The CVM enters maintenance mode, informing the other CVMs that it will not participate in the cluster, and finally shuts down.
I will place the node LABNTNX01 into maintenance mode from the Web Console. Once the host is in maintenance mode, we can validate that it entered maintenance mode successfully through both Prism Element and the CLI.
Let’s take a look at Prism Element first. We should see that the host is in maintenance mode AND that the CVM is powered down.
Hovering over the host details, we see the host showing as under maintenance.
On the VM table, we can also see that the CVM is powered down.
From the CLI, we can also validate this. Issuing the command `cluster status` from any CVM, we should see that the CVM shows as Down while the others show as Up. We can also validate this through NCLI, using the command `ncli host ls`. In this case, we expect the Under Maintenance Mode field to show a value of true.
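As a quick reference, here’s a minimal sketch of those validation commands, run from any CVM other than the one that’s down:

```shell
# Overall cluster state; the CVM on the host under maintenance shows as Down
cluster status

# Per-host details; the target host should show Under Maintenance Mode as true
ncli host ls
```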
Exiting Maintenance Mode
Now that we’ve placed the node LABNTNX01 into maintenance mode via Prism Element, let’s evaluate the steps taken when removing the node from maintenance mode. The following tasks are performed:
- The CVM is powered on.
- The CVM is taken out of maintenance mode.
- The AHV host is taken out of maintenance mode.
So, let’s unpack the steps that go into exiting maintenance mode. From Prism Element, once you select the Exit Maintenance Mode option:
- The CVM will be powered back on, exit maintenance mode, and rejoin the cluster. Data Resiliency will return once the CVM service is fully restored, and the cluster returns to a stable state.
- The AHV host will exit maintenance mode.
- The cluster will rebalance non-pinned VMs and evaluate any pinned or RF1 VMs that need to be powered on.
Now that we’ve completed our maintenance operations, let’s remove the node from maintenance mode. We do this by selecting the host and then Exit Maintenance Mode, which initiates the process of removing the host from maintenance mode. Just like entering maintenance mode, we can validate that the node successfully exited maintenance mode through both Prism Element and the CLI.
Let’s take a look at Prism Element first. We should see that the host is no longer under maintenance AND that the CVM is powered back on.
Hovering over the host details, we see that the host is no longer showing as under maintenance.
Viewing the VM table, we can see that the CVM is powered back online, and our data resiliency status is OK.
From the CLI, we can also validate this. Issuing the command `cluster status` from any CVM, we should see that all of our CVMs now show a status of Up. We can also validate this through NCLI, using the command `ncli host ls`. In this case, we expect the Under Maintenance Mode field to show a value of false.
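If you’d rather not re-run the command by hand while the CVM services restart, a simple polling loop works; this is purely a convenience sketch, not a Nutanix-documented procedure:

```shell
# Poll cluster status every 30 seconds until no CVM reports Down
while cluster status | grep -qi "down"; do
  sleep 30
done
echo "All CVMs report Up"
```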
Maintenance Mode using CLI
Entering Maintenance Mode
While the preferred method of placing a host into maintenance mode is via Prism Element, doing so via the CLI is also viable and provides more flexibility in how the node is placed into maintenance mode.
Placing the host into maintenance mode via CLI is a 2-step process, first putting the AHV host into maintenance mode and then the CVM into maintenance mode. A noticeable difference with the CLI method is that the CVM is not automatically shut down, as it is when doing so from the GUI.
One of the more interesting functions of AHV is that you can place the AHV host under maintenance while keeping the CVM up and running and still operational in the cluster. This option is only available from the CLI, not within Prism Element. I find this useful for short-term scenarios where I want to remove workloads from a running node while keeping the CVM participating in the cluster.
So, let’s unpack the steps to enter maintenance mode through the CLI.
- The AHV host will live migrate any non-pinned VMs to alternate hosts and evaluate any Pinned or RF1 VMs that need to be powered off.
- The AHV host enters maintenance mode.
- Optional: The CVM enters maintenance mode, informing the other CVMs that it will not participate in the cluster.
- Optional: The CVM and host can be shut down.
Let’s look at putting the host LABNTNX02 into maintenance mode via CLI.
- SSH into a CVM in the cluster, NOT on the host you want to put into maintenance mode.
- Let’s check the status of our hosts via the CLI using the command `acli host.list`, to ensure none are currently in maintenance mode. Sure, we could do this through Prism Element, but that’s not the point here, is it?
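For reference, a minimal sketch of this pre-check; the CVM address is a placeholder, so use any CVM except the one on the target host:

```shell
# SSH to a CVM that is NOT on the host going into maintenance (default user: nutanix)
ssh nutanix@<cvm-ip-address>

# List all hosts and confirm none are already in a maintenance node state
acli host.list
```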
All the hosts look good, so let’s place LABNTNX02 into maintenance mode via CLI.
Issue the command below from a CVM, substituting the specifics for the host being addressed.

```shell
acli host.enter_maintenance_mode hypervisor-IP-address [wait="{ true | false }" ] [non_migratable_vm_action="{ acpi_shutdown | block }" ]
```
We can validate the host is in maintenance mode using the command `acli host.get 10.10.120.22`. We can see that host LABNTNX02 now has a Node State of EnteredMaintenanceMode and a Schedulable status of False. Note that I did not include the wait and non_migratable_vm_action options in my statement, as I knew I didn’t have any workloads that needed them.
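For example, here’s the bare invocation I used for LABNTNX02, plus a variant exercising the optional flags from the syntax above:

```shell
# Minimal form: live migrate VMs off and enter maintenance mode
acli host.enter_maintenance_mode 10.10.120.22

# With options: block until complete, ACPI-shutdown any non-migratable VMs
acli host.enter_maintenance_mode 10.10.120.22 wait=true non_migratable_vm_action=acpi_shutdown

# Confirm: Node State should be EnteredMaintenanceMode, Schedulable should be False
acli host.get 10.10.120.22
```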
If we issue the command `ncli host ls`, we can see that the CVM is NOT in maintenance mode, only the host. This can be helpful if we want to isolate the host from any workloads but keep the CVM online.
Let’s go ahead and place the CVM into maintenance mode as well. You’re still SSH’d into a CVM, right?
Let’s get the CVM ID for host LABNTNX02, using the command `ncli host list`. The CVM ID will be the number AFTER the Id field; in our case, it’s 7.
We can see that this CVM isn’t in maintenance mode, so let’s put it into maintenance mode.
Since we have the host ID, enter the command `ncli host edit id=7 enable-maintenance-mode=true`.
After waiting a few minutes, we can validate that the CVM is now in maintenance mode, using the command we used earlier.
Note: When placing the CVM into maintenance mode via the CLI, if you issue the command `cluster status`, the CVM we manually put into maintenance mode does not appear at all, whereas when we used Prism Element, it showed as Down.
As I pointed out, the CVM also shuts down automatically when using the Web Console to enter maintenance mode. When using the CLI, it does not shut down automatically. If we needed to power down this host, we could now power off the CVM and the host; the cluster would remain stable.
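Putting it all together, the full CLI enter sequence looks like this as a sketch; the IP and CVM ID are from my lab, so yours will differ:

```shell
# 1. Migrate VMs off and place the AHV host into maintenance mode
acli host.enter_maintenance_mode 10.10.120.22 wait=true

# 2. Find the CVM ID for the host (the number after the Id field)
ncli host list

# 3. Place the CVM into maintenance mode (id=7 in my lab)
ncli host edit id=7 enable-maintenance-mode=true

# 4. Optional: power off the CVM and then the host if physical work requires it
```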
Exiting Maintenance Mode
Now that we’ve placed the node LABNTNX02 and the associated CVM into maintenance mode via CLI, let’s evaluate the steps taken when removing the node from maintenance mode. The following tasks are performed:
- The CVM is taken out of maintenance mode.
- The AHV host is taken out of maintenance mode.
So, let’s unpack the steps to exit maintenance mode from CLI:
- The CVM will exit maintenance mode and rejoin the cluster (after being powered back on, if it was shut down). Data Resiliency will return once the CVM services are fully restored and the cluster returns to a stable state.
- The AHV host will exit maintenance mode.
- The cluster will rebalance non-pinned VMs and evaluate any pinned or RF1 VMs that need to be powered on.
Now that we’ve completed our maintenance operations, let’s go ahead and remove the CVM and AHV host from maintenance mode; once done, we can validate that the node successfully exited maintenance mode through both the Web Console and the CLI. You’re still SSH’d into a CVM, right?
To remove the CVM from maintenance mode, we use the same commands we used to enter maintenance mode, just with the value toggled.
Let’s get the CVM ID for host LABNTNX02, using the command `ncli host list`. The CVM ID will be the number AFTER the Id field; in our case, it’s 7.
We can see that this CVM is in maintenance mode, so let’s exit it from maintenance mode.
Since we have the host ID, enter the command `ncli host edit id=7 enable-maintenance-mode=false`. After waiting a few minutes, we can validate that the CVM has exited maintenance mode, using the command we used in step 1.
Issuing the command `cluster status`, we can see that the CVM has rejoined the cluster, and our Data Resiliency status in Prism Element should now return to OK.
Our final step is to remove the AHV host from maintenance mode, using the command `acli host.exit_maintenance_mode 10.10.120.22`. We can validate that our AHV host has exited maintenance mode, and we’re back in full operations!
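And the matching exit sequence, again as a sketch using my lab’s values:

```shell
# 1. Take the CVM out of maintenance mode (id=7 in my lab)
ncli host edit id=7 enable-maintenance-mode=false

# 2. Confirm the CVM has rejoined the cluster and services are Up
cluster status

# 3. Take the AHV host out of maintenance mode
acli host.exit_maintenance_mode 10.10.120.22

# 4. Confirm the node is schedulable again
acli host.get 10.10.120.22
```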
Wrap-up
As we’ve seen, AHV makes it very easy to perform maintenance operations on cluster nodes, at both the hypervisor and CVM levels. Offering both a GUI and a CLI lets administrators use the method that works best for them while achieving the same results.
Thanks for reading, and stay tuned for Part 2 and Part 3, where I walk through the same process for the ESXi platform and ESXi upgrades using Nutanix LCM.