From Outage to Insight: Resilient Public Sector IT
Overview
Last week I had the opportunity to attend CentralSquare Engage 2026, and as always, it was a reminder of why this space matters. Representing eGroup at this event is something I genuinely look forward to. We have been supporting Public Safety and Public Administration agencies for over ten years, and conferences like this are where you get to have the real conversations, the ones that go beyond sales cycles and get into the operational realities these teams face every day.
This year I had the chance to present a session titled From Outage to Insight: How Public Sector Agencies Build Resilient Systems. The session started with a number that tends to get people's attention: 5.26 minutes.
That is the total allowable unplanned downtime per year to achieve 99.999% uptime, what most people know as five nines. It sounds like an achievable goal until you consider what it actually means in practice. Those five and a quarter minutes are not a monthly budget or a weekly allowance. They are the entire year. A single unexpected reboot, a network hiccup during a patching window, or a storage event that takes a few minutes to self-heal can consume that entire window in one shot. For agencies running mission-critical workloads, five nines is not just a number on a spec sheet; it is a commitment that demands architectural decisions at every layer of the stack, and a budget conversation to match.
| Availability | Uptime % | Downtime / Year | Downtime / Month |
|---|---|---|---|
| Two 9s | 99% | 3.65 days | 7.31 hours |
| Three 9s | 99.9% | 8.77 hours | 43.83 minutes |
| Four 9s | 99.99% | 52.60 minutes | 4.38 minutes |
| Five 9s | 99.999% | 5.26 minutes | 26.3 seconds |
The jump from four nines to five nines looks modest on paper, a single decimal place. In practice, it is the difference between just under an hour of acceptable downtime per year and under six minutes. That gap is where architecture either earns its keep or falls short.
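The downtime figures in the table fall directly out of the availability percentage. A quick sketch of the arithmetic (using a 365.25-day year, so the numbers shift very slightly under other calendar assumptions):

```python
# Downtime budget implied by an availability target.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes

def downtime_per_year_minutes(availability_pct: float) -> float:
    """Minutes of allowable unplanned downtime per year for a given availability %."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR

for label, pct in [("two 9s", 99.0), ("three 9s", 99.9),
                   ("four 9s", 99.99), ("five 9s", 99.999)]:
    minutes = downtime_per_year_minutes(pct)
    print(f"{label:>9}: {minutes:10,.2f} min/year ({minutes / 12:8.2f} min/month)")
```

Five nines comes out to about 5.26 minutes per year, or roughly 26 seconds per month, which is the whole point of leading with that number.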
Resilience Is Not Unique to Public Sector
Before diving into the specifics of public sector environments, it is worth stepping back for a moment. The challenges around resilience are not unique to government agencies. Healthcare organizations face the same fundamental pressures: a hospital that loses access to its EMR system or clinical applications is not just dealing with an IT outage, it is dealing with a patient safety event. Financial institutions operate under similar constraints, where a payment processing failure or trading system outage carries regulatory and reputational consequences that extend well beyond the cost of the downtime itself.
What connects all of these verticals is a simple truth: planning, preparation, and deliberate action are what separate a recoverable situation from a crisis. The specific workloads differ, the compliance requirements differ, and the tolerance for downtime differs. But the underlying discipline required to build resilient systems is consistent across all of them. Public sector is not special because resilience matters there. It is notable because the consequences of getting it wrong are often visible to the public in ways that a back-office financial system failure is not.
What Downtime Looks Like in Public Sector
In commercial environments, downtime is largely a revenue and productivity problem. In public sector, the stakes tend to be more immediate and more visible. A delayed 911 dispatch, inaccessible court records, or a disrupted permitting system can have real consequences for the people those agencies serve. That framing tends to sharpen the conversation when discussing infrastructure investment with leadership, because the impact is not just internal.
The honest answer to "how do you build resilient infrastructure" is that it depends. "Resilience" means something very different to a 911 dispatch center than it does to a courts administration office or a general government agency. The systems they protect, the budgets available, and the tolerance for downtime vary significantly. That gap between a one-size-fits-all solution and actual operational reality is exactly where most resilience planning breaks down.
A Framework Worth Following
The core of the session was a four-step framework: Identify, Prioritize, Design, and Validate. Rather than trying to protect everything equally, the goal is to start by understanding which failures actually matter. Tier 1 systems like CAD, 911 dispatch, and records management have a very short leash when it comes to acceptable downtime. Tier 2 systems like email and GIS can tolerate more. Getting that tiering right before designing anything else is what makes the rest of the plan coherent. If you are looking for a deeper dive into why disaster recovery matters and how to frame these conversations with leadership, the disaster recovery series at mikedent.io/disaster-recovery is a good starting point.
From there, the conversation around Recovery Point Objectives and Recovery Time Objectives becomes a lot more productive. When you know which systems are Tier 1, you can have an honest discussion about whether synchronous replication is worth the cost, or whether near-zero async replication provides the protection you actually need for a given workload. The goal is not to maximize resilience across the board; it is to match the right level of protection to each workload tier. I covered the differences between asynchronous, near-synchronous, and synchronous protection policies in detail as part of the disaster recovery series, including how each aligns to different RPO requirements.
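To make the tiering-to-protection mapping concrete, here is a minimal sketch of the idea. The tier names, RPO/RTO thresholds, and workload assignments below are hypothetical illustrations, not targets from the session; every agency has to set its own numbers.

```python
# Illustrative workload tiering model. All thresholds and assignments are
# hypothetical examples of the mapping, not prescriptive values.
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    rpo_seconds: int   # maximum acceptable data loss
    rto_seconds: int   # maximum acceptable time to recover
    replication: str   # protection policy capable of meeting the RPO

TIERS = {
    "tier1": Tier("Tier 1", rpo_seconds=0,     rto_seconds=300,
                  replication="synchronous"),
    "tier2": Tier("Tier 2", rpo_seconds=900,   rto_seconds=3600,
                  replication="near-synchronous"),
    "tier3": Tier("Tier 3", rpo_seconds=86400, rto_seconds=14400,
                  replication="asynchronous"),
}

WORKLOADS = {"cad-dispatch": "tier1", "records-mgmt": "tier1",
             "email": "tier2", "gis": "tier2", "file-archive": "tier3"}

def policy_for(workload: str) -> str:
    """Report which protection policy a workload's tier calls for."""
    tier = TIERS[WORKLOADS[workload]]
    return f"{workload}: {tier.replication} (RPO <= {tier.rpo_seconds}s)"
```

The value of writing it down this way is that the cost conversation becomes explicit: synchronous replication is only paid for where the tier actually demands a zero RPO.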
Once the protection strategy is in place, the next question is what happens when you actually need to fail over. Recovery plans provide the orchestration layer: boot sequencing, network mapping, and repeatable automation that removes guesswork from the process. But a recovery plan that has never been executed is just a theory. Regular DR testing is what turns that theory into confidence, and it can be done non-disruptively so there is no excuse to skip it.
One thing that came up throughout the session was the human element, which often gets treated as an afterthought in infrastructure planning. Documented runbooks, cross-trained staff, sustainable on-call rotations, and regular DR drills are not soft topics. They are operational necessities. A backup you have never tested is a backup you do not have. A DR plan that only one person understands is a risk, not a safety net.
Common Pitfalls
The session also touched on where resilience planning tends to fall apart in practice. Treating all workloads the same is one of the more common and costly mistakes. When everything is considered equally critical, investment gets spread thin and the systems that genuinely cannot afford downtime end up under-protected.
Skipping application-level high availability is another one. Infrastructure resilience keeps the platform running, but it does not guarantee that the application running on top of it will recover cleanly. Domain services and databases have their own clustering and failover mechanisms, and applications built with native high availability or designed to run stateless behind a load balancer add yet another layer of resilience above the infrastructure. These application-level considerations exist above the hypervisor layer and can make the difference between a brief disruption and a prolonged outage.
Finally, the shared responsibility model in cloud environments continues to create confusion. Cloud providers keep their infrastructure running. That is not the same as ensuring your data is protected, your workloads are recoverable, or that your RPO and RTO targets are actually being met. Many organizations do not fully appreciate that gap until they need to recover from something.
The Cloud Is Not a Resilience Strategy
It is worth spending a moment on the assumption that moving to the cloud solves the resilience problem. It does not. Whether you are running workloads in a colocation facility, relying on a SaaS provider for a critical business function, or deploying infrastructure in a hyperscaler like AWS or Azure, you are still exposed to outages that are outside of your control.
Hyperscaler region outages are not hypothetical. AWS, Azure, and Google Cloud have all experienced significant regional disruptions that took services offline for hours. If your architecture assumes a single region will always be available, your resilience plan has a gap. Multi-region or hybrid designs add cost and complexity, but for Tier 1 workloads, they are often the only way to meet the recovery targets that actually matter.
SaaS applications introduce a different kind of risk. When a critical platform like your CRM, HR system, or even your email provider goes down, you have no ability to fix it. You are entirely dependent on the provider's response. That does not mean you should avoid SaaS, but it does mean your resilience planning needs to account for what happens when a service you do not control becomes unavailable. What is the fallback? How long can you operate without it? Have you actually tested that scenario?
Colocation adds yet another dimension. You may own the hardware, but you are relying on someone else's power, cooling, and network connectivity. An internet circuit failure, a cooling system malfunction, or a provider-side maintenance window can all take your environment offline regardless of how well your infrastructure is designed.
The common thread is that "the cloud" in any of its forms is someone else's infrastructure. That does not make it unreliable, but it does mean that resilience planning cannot stop at the application and platform layer. It has to extend to the connectivity, the provider dependencies, and the failure scenarios that live outside your four walls.
Building on a Foundation That Supports the Goal
Over the years, eGroup has deliberately standardized on Nutanix as the infrastructure foundation for the resilient solutions we design and deploy. That decision did not happen by accident. When you are working toward availability targets like five nines, the platform you build on has to earn its place in the design. Nutanix delivers built-in redundancy at the storage and compute layer, self-healing architecture that handles common failure scenarios without manual intervention, and non-disruptive rolling upgrades that do not force you to choose between staying current and staying online.

The distributed storage fabric means that a drive or node failure is absorbed transparently, with data rebuilt automatically across the remaining resources. Prism Central provides a single management plane for policy-driven protection, replication monitoring, and recovery plan orchestration across sites. For organizations that need to extend their resilience posture into the cloud, Nutanix Cloud Clusters (NC2) provide a consistent operating model across on-premises and public cloud environments, keeping DR architecturally familiar rather than introducing an entirely new platform to manage during a crisis.
But a platform is only as strong as what surrounds it. Resilience does not live in a single layer.
Network resiliency is one of the most overlooked components of the stack. Redundant top-of-rack switches, link aggregation, and diverse uplink paths are table stakes for any environment targeting high availability. A single switch failure or a lost uplink should be a logged event, not a service interruption. For organizations with multiple sites, diverse WAN paths and SD-WAN failover ensure that replication traffic and management connectivity remain intact even when a circuit goes down. The network is the connective tissue between every other layer of the design, and if it is not resilient, nothing built on top of it can be either.
Data protection has to go beyond snapshots on a schedule. Application-consistent backups, tested recovery processes, and retention policies that align with both operational and compliance requirements are the baseline. But in the current threat landscape, immutability has become equally critical. Immutable snapshots and write-once storage ensure that backup data cannot be encrypted, modified, or deleted by ransomware or a compromised administrative account. This is not a nice-to-have; it is the difference between a recoverable ransomware event and a catastrophic one. When your backup data is the last line of defense, it has to be protected from the same threats you are defending the production environment against.
For the agencies and organizations we work with, Nutanix is a critical piece of the puzzle, but it is one piece. A resilient Nutanix cluster sitting behind a single switch, protected by mutable backups, and connected over a single WAN circuit is still fragile. The real value comes from treating every layer of the stack as part of the same resilience conversation. When the platform, network, data protection, and application layers are all designed around the same availability goals, the result is an architecture where no single failure can take the whole thing down. That is the goal, and it takes deliberate effort across every layer to get there.
The Throughline
Whether you are building resilience for a 911 center, a hospital network, or a financial services firm, the throughline is the same. Understand what matters most, design protection around real risks rather than theoretical ones, and validate that your plan actually works before you need it. The organizations that come out of outages with minimal impact are rarely the ones with the most sophisticated technology. They are the ones that did the planning work ahead of time.
Much of this post focuses on industries that operate under significant oversight and compliance requirements, and for good reason: the consequences of failure in those environments are immediate and public. But the absence of a regulatory mandate does not mean the absence of risk. A manufacturing company that loses its ERP system for a day, a logistics firm that cannot access its routing platform, or a professional services organization that loses access to client data: these are all real impacts with real costs. Revenue stops, customers lose confidence, and recovery becomes more expensive the longer it takes. Planning and preparation should not take a backseat just because no auditor is asking to see your DR plan. Every organization has systems it cannot afford to lose and timelines it cannot afford to miss. The discipline of identifying those systems, designing protection around them, and validating that protection regularly applies regardless of industry. Compliance may force the conversation, but operational reality is what makes it necessary.
It was a good week at Engage. One of the best parts of these events is reconnecting with people I have worked with over the years, folks who have taken their planning and preparation to the next level and are now operating with real confidence in their resilience posture. Those conversations are a reminder that this work pays off. Equally valuable were the new conversations with agencies and organizations just starting to think through these challenges. The awareness is there. What they are often looking for is a structured way to approach it, one that accounts for their actual budget and operational constraints rather than an idealized architecture that assumes unlimited resources.
If you attended Engage and want to continue the conversation, or if you are working through resilience planning at your organization, I would love to hear from you. You can connect with me on LinkedIn or reach me directly at mike@mikedent.io. You can also find more detail on DR strategies and architecture in the disaster recovery series over at mikedent.io/disaster-recovery.