
Cloud outages are a bit like printer jams: nobody plans for them, everyone swears they “rarely happen,” and somehow they always show up at the worst possible time—like during a product launch, a payroll run, or the quarterly “we promise the board we’re stable now” presentation.
That is why I’m glad Microsoft is leaning hard into a message many infrastructure teams have been trying to tattoo onto architecture diagrams for a decade: disruption isn’t an edge case. It’s part of the operating environment. In the Microsoft Azure Blog post “Azure IaaS: Keep critical applications running with built-in resiliency at scale”, Igal Figlin (CVP, Azure Compute PM) makes the case that resilient outcomes come from combining Azure’s platform capabilities across compute, storage, and networking—and then actually operating them like you mean it.
This article is an expanded, independently reported deep dive based on that RSS item and the original post. I’ll summarize what Microsoft is saying, connect it to the broader reliability engineering reality, and (politely) translate it into decisions you can make on a Tuesday afternoon when you’re staring at a migration spreadsheet and wondering how much chaos your budget can tolerate.
Original source: Microsoft Azure Blog, “Azure IaaS: Keep critical applications running with built-in resiliency at scale,” by Igal Figlin, published April 1, 2026.
Resiliency at scale: a mindset shift, not a checkbox
The Azure post leads with a simple but important reframing: don’t ask if disruption will happen; ask how your application behaves when it does. That sounds obvious until you’ve watched an organization spend months “lifting and shifting” a legacy system into the cloud, only to recreate the same single points of failure they had on-prem—just with better coffee and a more modern portal UI.
In practice, resiliency is not one feature. It is the combined effect of:
- Isolation (fault domains, availability zones, region separation)
- Redundancy (multiple instances, replicated data, multiple endpoints)
- Failover design (how traffic and dependencies switch during trouble)
- Recovery design (backup, restore, disaster recovery orchestration)
- Operations (testing, drills, observability, and continuous improvement)
Microsoft’s theme also explicitly includes shared responsibility: Azure provides the building blocks, but customers must assemble them into an architecture that matches their business needs and risk tolerance. If you’ve ever assumed a cloud provider will “just handle it,” you already know how that story ends: with a post-incident review where someone says “we didn’t configure that.”
Compute resiliency: availability starts with placement
Compute is usually where outages become visible first. If your web tier collapses, customers don’t care that your storage is perfectly redundant—they care that they can’t log in.
Microsoft highlights two key ideas for Azure IaaS compute resiliency:
- Don’t put all your eggs in one basket, whether that basket is a single piece of infrastructure, an update domain, a fault domain, or a zone.
- Automate healthy capacity so losing instances doesn’t mean losing the service.
Virtual Machine Scale Sets: scaling and availability, with less hand-holding
Azure Virtual Machine Scale Sets (VMSS) are Microsoft’s go-to mechanism for running a fleet of VMs—front-end tiers, app tiers, batch workers—while letting the platform handle deployment and lifecycle management at scale. In the Azure Well-Architected guidance for VMs and scale sets, Microsoft emphasizes that scale sets have autoscale capabilities and can distribute load across multiple VMs and availability zones. That distribution is the part that matters when the world gets weird.
The key point is that VMSS is not only about scaling out when traffic spikes. It’s also about staying upright when instances disappear, because instances will disappear. Sometimes politely (planned maintenance). Sometimes dramatically (hardware failure). Sometimes in ways that trigger incident bridges and existential dread.
When you’re designing for resiliency with VMSS, consider these patterns:
- Stateless front ends (or mostly stateless): easiest to spread across zones and replace quickly.
- Application tiers with session state: push state into a resilient store (cache/database) so compute can be disposable.
- Batch/work queues: run multiple workers across zones; use retry policies and idempotent job processing.
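To make that last bullet concrete, here is a minimal Python sketch of an idempotent worker with retries. The job store, `work_done` list, and function names are hypothetical illustrations, not an Azure API; the point is that a job delivered twice (after a retry or an instance loss) must be safe to reprocess.

```python
import time

# Hypothetical illustration of idempotent job processing with retries.
# In production the processed-set would live in a durable store
# (cache/database), not in process memory.

processed = set()   # job IDs we have already completed
work_done = []      # side effects, recorded once per job

def handle_job(job_id: str, payload: dict) -> None:
    if job_id in processed:      # idempotency check: duplicate delivery is a no-op
        return
    work_done.append(job_id)     # ...the actual work happens here...
    processed.add(job_id)        # record completion only after the work succeeds

def process_with_retry(job_id: str, payload: dict, attempts: int = 3) -> bool:
    delay = 0.01
    for _ in range(attempts):
        try:
            handle_job(job_id, payload)
            return True
        except Exception:
            time.sleep(delay)    # back off before retrying
            delay *= 2           # exponential backoff
    return False                 # retries exhausted: dead-letter the job
```

Because `handle_job` is idempotent, neither the retry loop nor a second worker picking up the same message can double-apply the work.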
Availability Zones: datacenter-level isolation inside a region
Availability Zones exist because “a region” is not a single building. Zones provide datacenter-level isolation, with independent power, cooling, and networking. The Azure post reinforces the classic design goal: if one zone has trouble, instances in another zone keep serving the workload.
In Azure reliability documentation, Microsoft also describes how fault domains work within zones and regions. The practical takeaway is: zone-aware design lowers the blast radius. It doesn’t remove risk; it contains it.
One caveat worth stating plainly: multi-zone architecture is not free. It can increase:
- Cost (more instances, cross-zone data transfer depending on service)
- Complexity (health probing, failover behavior, state management)
- Latency sensitivity (especially for chatty east-west traffic)
But if you’re running revenue-generating or safety-critical systems, “it costs more” is not a rebuttal. It’s an input into a business decision: pay for redundancy now, or pay for downtime later—possibly with interest.
Storage resiliency: your data is the application’s long memory
Compute is replaceable. Data is not. (Or if your data is replaceable, please call me; I want to interview your compliance team.)
Microsoft’s Azure IaaS post highlights storage redundancy options and why they map directly to recovery objectives. That’s the right framing. Storage choices are not just “performance and capacity.” They define what you can realistically promise during and after an incident.
Azure Storage redundancy models (and what they’re actually good for)
Azure Storage offers multiple redundancy models. Microsoft’s documentation describes (among others):
- Locally redundant storage (LRS): replicates data synchronously three times within a single datacenter.
- Zone-redundant storage (ZRS): replicates data synchronously across three availability zones in a region.
- Geo-redundant storage (GRS) and read-access geo-redundant storage (RA-GRS): replicate data to a secondary region (asynchronously), with RA-GRS allowing reads from the secondary.
For disaster recovery planning, Microsoft’s guidance also notes that geo redundancy copies data asynchronously to a secondary geographic region. That async detail matters, because it implies potential data loss depending on timing (your RPO), even if your durability story is excellent.
A useful way to think about this is to align redundancy with failure scope:
- Rack/server problems → LRS often covers the basics (within a datacenter).
- Datacenter/zone issues → ZRS helps keep data available in-region.
- Regional issues → GRS/RA-GRS (and broader DR patterns) become relevant.
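That alignment is simple enough to write down as a decision helper. The function below is an illustrative sketch of the mapping above, not official sizing guidance; the failure-scope labels are my own shorthand.

```python
# Toy decision helper: map the failure scope you need to survive (and
# whether you need to read from the secondary region) to an Azure
# Storage redundancy option. Labels are illustrative shorthand.

def pick_redundancy(failure_scope: str, read_from_secondary: bool = False) -> str:
    if failure_scope == "rack":
        return "LRS"     # covers server/rack-level problems within a datacenter
    if failure_scope == "zone":
        return "ZRS"     # keeps data available through a zonal outage
    if failure_scope == "region":
        # Geo redundancy replicates asynchronously, so expect an RPO gap.
        return "RA-GRS" if read_from_secondary else "GRS"
    raise ValueError(f"unknown failure scope: {failure_scope}")
```

The useful part is not the code; it is that the decision becomes explicit and reviewable instead of defaulting to whatever the portal suggested.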
Also: redundancy is not backup. Replication won’t save you from accidental deletion, ransomware encryption, or a bad deployment that “helpfully” wipes a table. You need backup and recovery processes for that.
Managed disks, snapshots, Azure Backup, and Azure Site Recovery
The Azure post calls out managed disks and VM-based workload recovery mechanisms like snapshots, Azure Backup, and Azure Site Recovery (ASR). This is where the conversation gets concrete: these tools influence your Recovery Point Objective (RPO) and Recovery Time Objective (RTO).
Microsoft Learn documentation for Azure Site Recovery is explicit that Site Recovery replicates data from one region to another, and provides tutorials for moving VMs to another Azure region via ASR. This is the foundation for classic region-to-region disaster recovery of IaaS workloads. In older Azure blog coverage, Microsoft also announced support for disaster recovery of zone-pinned VMs to another region using ASR, underscoring that zonal deployments and regional DR can coexist if you design for it.
For organizations still heavily VM-centric (because of vendor appliances, legacy stacks, regulated workloads, or “it works and we’re afraid to touch it”), ASR remains one of the most straightforward DR building blocks in Azure. It’s not magic—you still must test failovers, validate networking, handle DNS and identity dependencies, and avoid assuming your secondary region has limitless capacity at your exact worst moment. But it’s a strong baseline.
Networking resiliency: being “up” doesn’t matter if nobody can reach you
Networking is where infrastructure teams learn humility. Your VMs might be fine, your storage might be fine, and yet users are staring at timeouts because traffic isn’t reaching the healthy instances. Microsoft’s post frames networking correctly: resiliency means maintaining reachability and routing around failure.
Azure’s key traffic-management services mentioned include:
- Azure Load Balancer for distributing traffic across instances
- Application Gateway for Layer 7 routing and web app scenarios
- Traffic Manager for DNS-based routing across endpoints
- Azure Front Door for global traffic steering and failover at the edge
The distinction between these services matters less than the design principle: you need a plan for how clients find healthy endpoints when some endpoints become unhealthy. In mature architectures, the plan is layered: local load balancing within a zone/region, plus global distribution and failover across regions. That often means combining a regional L4/L7 balancer with global routing (Front Door or Traffic Manager) depending on protocol and application behavior.
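The design principle itself fits in a few lines. This is a sketch of the global-routing idea (probe endpoints, steer to the first healthy one), with stand-in URLs and a pluggable probe; it is not Front Door's actual health model.

```python
# Sketch of layered traffic steering: a global router probes regional
# endpoints in priority order and returns the first healthy one.
# The URLs and probe function are hypothetical stand-ins.

def pick_endpoint(endpoints, probe):
    """Return the first endpoint whose health probe passes."""
    for ep in endpoints:
        if probe(ep):
            return ep
    raise RuntimeError("no healthy endpoints")  # time to page someone

# Priority-ordered regional entry points (primary first).
regions = ["https://eastus.example.com", "https://centralus.example.com"]
```

In a real deployment the probe is an HTTP health check against an endpoint that reflects application health, not just "the VM answered TCP."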
A practical example: the “two-tier” failover story
Here’s a simple scenario many teams can relate to:
- You run a customer-facing web app on VMSS across three zones in East US.
- Application Gateway handles HTTPS termination and routes to instances.
- You maintain a warm standby in Central US with replicated data via ASR and geo-redundant storage where appropriate.
- Azure Front Door sits in front, probing both regions and failing over if East US becomes unavailable (or if your own health model says it’s unhealthy).
In a zonal incident, you may not need region failover at all—the app can continue in-region. In a regional incident, you may accept a controlled failover to the secondary region, knowing you may have some RPO gap if replication was asynchronous. None of this is theoretical; it’s the difference between “customers experienced a brief performance blip” and “we were trending on social media for all the wrong reasons.”
Not every workload needs the same armor plating (and that’s okay)
One of the strongest parts of the Azure post is the acknowledgement that resiliency is not one-size-fits-all. A stateless tier can recover by replacement. A stateful tier may require replication, backup, and carefully tested failover paths. And “mission-critical” workloads typically require tighter targets and more rigorous operational discipline.
This sounds obvious, yet many organizations still apply either:
- Overkill (gold-plating every dev/test workload with multi-region DR and then wondering why cloud spend looks like a phone number), or
- Underkill (treating a revenue system like a hobby project and calling it “acceptable risk”).
A better approach is to classify workloads by business impact and map each class to patterns and guardrails. For example:
- Tier 0 (mission critical): multi-zone minimum, tested DR, clear RTO/RPO, aggressive monitoring, documented runbooks.
- Tier 1 (important): zonal where possible, backups, some DR capabilities, periodic drills.
- Tier 2 (standard): single-region, redundancy where it’s cheap, restore-based recovery may be acceptable.
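A tiering scheme only works if it is enforceable, so it helps to express the guardrails as data that a pipeline can check. The tier contents below mirror the list above; the guardrail names and the `meets_tier` helper are hypothetical, not an Azure policy API.

```python
# Workload tiers mapped to required guardrails (illustrative names).
GUARDRAILS = {
    0: {"multi_zone": True, "tested_dr": True, "rto_rpo_defined": True, "runbooks": True},
    1: {"multi_zone": True, "backups": True, "periodic_drills": True},
    2: {"backups": True},
}

def meets_tier(deployment: dict, tier: int) -> list:
    """Return the guardrails this deployment is missing for its tier."""
    missing = []
    for guardrail, required in GUARDRAILS[tier].items():
        if required and not deployment.get(guardrail):
            missing.append(guardrail)
    return missing
```

Run this in CI against a manifest per workload and "we forgot the DR drill for a Tier 0 system" becomes a failed check instead of a surprise.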
If that feels too “process-y,” remember: the alternative is letting outages define your architecture for you, which is a bold strategy. Not a good one, but bold.
Migrations: the best time to fix resiliency debt (because you’re already breaking things)
Microsoft argues that migration is a golden moment to rebuild resiliency rather than reproduce old patterns. I agree, and not only because it’s the kind of statement that looks great in a blog post. It’s because migration projects already involve:
- Touching networking and identity
- Rebuilding environments
- Changing backup strategies
- Rewriting runbooks
- Relearning failure modes
If you don’t use that disruption to reduce single points of failure, you’ll end up scheduling a “Phase 2 Resiliency Program” later—also known as “the thing we do after the next outage.”
Case study callout: Carne Group and “resiliency as a migration outcome”
The Azure post references Carne Group using Azure Site Recovery and Terraform-based landing zones to streamline cutover and improve recovery readiness, including a quote from Stéphane Bebrone about being able to rebuild a duplicate site in another region and recover within roughly a day in a worst-case scenario.
There are two broader lessons here:
- Infrastructure as code (IaC) is a resiliency tool. Repeatable deployments reduce configuration drift and speed recovery.
- DR is not only about technology. It’s about the ability to execute a plan under pressure. IaC helps you execute.
In my experience, the “we can rebuild the environment quickly” capability is one of the most underappreciated forms of resilience. Teams obsess over active-active designs (sometimes correctly), but forget that fast, reliable environment recreation can reduce the recovery timeline drastically for many applications—especially internal systems where a short outage is tolerable but a multi-day rebuild is not.
Operating resiliency: the part you can’t buy with a SKU
After deployment, architectures degrade. Not because engineers are careless, but because systems evolve: new dependencies, new endpoints, new data flows, new teams, new “temporary” exceptions that become permanent. The Azure post explicitly calls out the need for ongoing validation through testing, drills, fault simulations, and observability.
This aligns with what the broader reliability community has been saying for years: resilience is an operational practice. You don’t know your real RTO until you test it. You don’t know your blast radius until you simulate failure. And you don’t know if your alerts are useful until you have an incident at 3 a.m. and you learn which alarms are merely decorative.
Reliability guidance and mission-critical design patterns
Microsoft has been expanding reliability guidance in Microsoft Learn, including mission-critical workload design resources. The mission-critical application design guidance emphasizes planning for failures, scalability considerations, and using patterns that increase availability and recoverability.
Even if you never build a formal “mission-critical” reference architecture, these documents are helpful because they force you to confront uncomfortable questions like:
- What is the maximum tolerable data loss for this system?
- How quickly must it return to service?
- What dependencies make recovery slower than we admit in meetings?
- Which components are single points of failure today?
“Resiliency in Azure” (preview): tooling for assessment and validation
One interesting nugget from the Azure blog post is a reference to “Resiliency in Azure” (a GitHub-based initiative/tooling) shown in preview at Microsoft Ignite, with a public preview planned for Microsoft Build 2026. That suggests Microsoft is trying to operationalize resiliency reviews and validation in a more structured way—something customers have wanted, because “best practices” are great until you have 800 subscriptions, 1,200 resource groups, and a naming convention that stopped being enforced during the last reorg.
I’ll be watching how that project evolves. If it becomes a practical way to assess posture, generate actionable recommendations, and track progress over time (without turning into a compliance checkbox engine), it could be genuinely useful—especially for organizations running significant IaaS footprints.
Resiliency patterns for Azure IaaS: what to do on Monday morning
Let’s turn philosophy into decisions. Below are pragmatic patterns you can apply depending on workload type and constraints.
Pattern 1: Zonal scale-out for stateless tiers (the “keep serving” baseline)
Best for: web front ends, API gateways, stateless microservices, worker fleets
- Use VMSS across availability zones.
- Use health probes and autoscale policies (and test them).
- Keep state outside the VM (cache/database/queue), so instance loss isn’t a catastrophe.
- Use regional load balancing plus a global entry point if you have multi-region needs.
Pattern 2: Stateful workloads with explicit data durability choices
Best for: databases on VMs, file services, legacy apps that write to disk
- Choose Azure Storage redundancy (LRS/ZRS/GRS/RA-GRS) aligned to failure scope and RPO needs.
- Use Azure Backup for restore-based recovery and retention needs.
- Use snapshots for quick point-in-time rollback (but manage them; snapshots without lifecycle are how you accidentally build a museum).
- Consider ASR for region-to-region recovery if the workload remains VM-centric.
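On the museum problem: snapshot lifecycle is just a retention cutoff applied consistently. This sketch prunes by age using hypothetical snapshot records; with real managed disks you would list snapshots via the Azure SDK or CLI and apply the same cutoff.

```python
from datetime import datetime, timedelta, timezone

# Minimal snapshot-retention sketch: anything older than the retention
# window is flagged for deletion. Snapshot records are hypothetical dicts.

def snapshots_to_delete(snapshots, retention_days=14, now=None):
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    return [s["name"] for s in snapshots if s["created"] < cutoff]
```

Fourteen days is a placeholder; the right window comes from your restore requirements (and, for some workloads, your auditors).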
Pattern 3: Region-to-region DR for critical VM workloads (when you can’t PaaS your way out)
Best for: enterprise apps with vendor requirements, regulated workloads, specialized appliances
- Use Azure Site Recovery to replicate to a secondary region.
- Define and test failover/failback procedures.
- Document dependency mapping (identity, DNS, secrets, external APIs, on-prem connectivity).
- Run periodic DR drills and measure actual RTO/RPO.
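Measuring a drill is mostly arithmetic on three timestamps: RTO is outage start to service restored, RPO is the gap between the last replicated write and the failure. The timestamps below are illustrative drill records, not real incident data.

```python
from datetime import datetime, timedelta

def measure_drill(outage_start, service_restored, last_replicated_write):
    """RTO: outage start -> restored. RPO: last replicated write -> outage start."""
    rto = service_restored - outage_start
    rpo = outage_start - last_replicated_write
    return rto, rpo

# Illustrative drill record.
rto, rpo = measure_drill(
    datetime(2026, 3, 14, 3, 0),   # outage declared
    datetime(2026, 3, 14, 3, 42),  # service restored in the secondary region
    datetime(2026, 3, 14, 2, 55),  # last write known to have replicated
)
```

If the measured numbers exceed the targets you committed to, that gap is the drill's real output; fix the plan, not the spreadsheet.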
Pattern 4: “Recovery by redeployment” with IaC (fast rebuild beats fragile failover)
Best for: internal apps, standardized environments, modern CI/CD shops
- Use Terraform/Bicep to define landing zones and core infrastructure.
- Automate environment creation in a secondary region.
- Store configuration in version control; treat drift as a defect.
- Combine with backups/replication for data layers.
This is especially powerful when your app tier can be rebuilt quickly and the main recovery challenge is data and configuration, not compute.
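"Treat drift as a defect" reduces to diffing declared state against deployed state and failing loudly on mismatch. Real pipelines do this with `terraform plan` or Bicep what-if; the dicts here are hypothetical resource settings, shown only to make the principle concrete.

```python
# Diff declared configuration against what's actually deployed.
# A non-empty result should fail the pipeline, not generate a TODO.

def find_drift(declared: dict, actual: dict) -> dict:
    drift = {}
    for key, want in declared.items():
        have = actual.get(key)
        if have != want:
            drift[key] = {"declared": want, "actual": have}
    return drift
```

The recovery payoff is the same check run in reverse: if the declaration is the source of truth, rebuilding in a secondary region is "apply the declaration," not "reconstruct from memory."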
Industry context: why cloud resiliency is getting louder in 2026
Microsoft is not publishing resiliency guidance in a vacuum. The industry is dealing with several converging trends:
- More distributed architectures (microservices, event-driven systems, multi-region user bases)
- AI-driven demand spikes (new workloads that can be bursty and expensive)
- Regulatory scrutiny (operational resilience requirements in finance and other sectors)
- Higher user expectations (downtime tolerance continues to shrink)
At the same time, organizations are confronting a hard truth: cloud concentration risk is now a board-level conversation in many industries. Microsoft has published perspectives on concentration risk and cloud resilience, arguing that proper mitigations and exit planning can reduce residual risk and that cloud availability has improved over time. Whether you agree with every conclusion, the fact that these documents exist is a signal: customers are asking pointed questions about how outages, regional incidents, and dependency failures are managed.
Where Azure IaaS fits versus PaaS (and why “just use PaaS” isn’t always helpful)
If you spend any time on architecture Twitter or in cloud-native circles, you’ll hear: “Don’t run VMs. Use managed services.” That advice is directionally correct—but incomplete.
Azure IaaS remains central because:
- Some workloads can’t be modernized quickly (or ever, due to vendor constraints).
- Some customers need OS-level control.
- Some industries require specific configurations and audit controls.
- Migration reality is messy; IaaS is often the bridge to modernization.
What Microsoft is effectively saying in its Azure IaaS series is: even if you’re running IaaS, you can still build a trusted, resilient platform foundation—if you use the right building blocks and operate them with discipline.
Common failure modes (and how Azure’s building blocks help)
Let’s name the monsters. Here are common classes of failure and what Azure IaaS features are designed to mitigate:
- Single VM failure: use multiple instances (VMSS), load balancing, and health probes.
- Rack/hardware domain issues: spread across fault domains; avoid single-instance designs.
- Planned maintenance disruption: use multiple instances and update-domain-aware patterns; design for rolling updates.
- Zonal disruption: deploy across availability zones; use ZRS where applicable.
- Regional disruption: use geo-redundant patterns; ASR for VM replication; global routing and failover.
- Human error: backups, versioned deployments, IaC, and guardrails (policy).
- Dependency failure: multi-endpoint design, timeouts, retries, circuit breakers, and realistic chaos testing.
Notice that only some of these are “Azure problems.” Many are application and operations problems that happen to be revealed by infrastructure events. Resilience is the art of making sure those revelations happen in test environments instead of in production.
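One of those application-side defenses, the circuit breaker from the dependency-failure bullet, fits in a short sketch: after enough consecutive failures the circuit opens and calls fail fast instead of piling onto a sick dependency. Thresholds and names here are illustrative; this is the pattern, not a production library.

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; fail fast for `cooldown` seconds."""

    def __init__(self, threshold=3, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None   # cooldown elapsed: half-open, allow one try
            self.failures = 0
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0           # success resets the failure count
        return result
```

Pair it with timeouts on the underlying call; a breaker that waits forever for each failure opens far too late to help.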
What I’d ask a team before calling an Azure IaaS workload “resilient”
If you want a quick readiness checklist—one that doesn’t pretend a single metric solves everything—here are the questions I’d ask:
- Do you know your RTO and RPO in minutes (not vibes)?
- Is the application deployed across availability zones where feasible?
- Can you lose an instance or a zone without paging the entire company?
- Is traffic steering automated (health probes, failover routing) and tested?
- Is your data redundancy choice explicit and aligned to business needs?
- Are backups tested by restore, not assumed?
- Do you run DR drills with measured outcomes?
- Is infrastructure defined as code and reproducible?
- Do you have observability that tells you impact (not just CPU graphs)?
If half of these produce uncomfortable silence, you don’t have a resiliency problem—you have a visibility problem. The good news is that visibility problems are solvable, and Azure’s documentation and tooling ecosystem is steadily getting better at supporting that journey.
Bottom line: resilient outcomes come from combining features—and practicing
Microsoft’s April 1, 2026 post is not a giant product launch. It’s more important than that. It’s a reminder that Azure IaaS resiliency is an architectural outcome, not a marketing adjective.
If you take one thing away, make it this: build for disruption as a normal operating condition. Use availability zones and VMSS to contain compute failures. Pick storage redundancy based on failure scope and recovery objectives. Design traffic steering so users reach healthy endpoints. Use ASR, backup, and IaC to make recovery predictable. And then test everything—because the cloud will eventually test it for you, and it will not be gentle.
Sources
- Microsoft Azure Blog: “Azure IaaS: Keep critical applications running with built-in resiliency at scale” (Igal Figlin, April 1, 2026)
- Microsoft Learn: “Architecture Best Practices for Azure Virtual Machines and Scale Sets” (Azure Well-Architected Framework)
- Microsoft Learn: “Data redundancy – Azure Storage”
- Microsoft Learn: “Azure storage disaster recovery planning and failover”
- Microsoft Learn: “Move Azure VMs to a different Azure region with Azure Site Recovery”
- Microsoft Learn: “Replicate Azure VMs running in a proximity placement group – Azure Site Recovery”
- Microsoft Azure Blog: “Disaster recovery of zone pinned Azure Virtual Machines to another region”
- Microsoft Learn: “Application design of mission-critical workloads on Azure”
- Microsoft: “Concentration Risk – Perspectives from Microsoft” (PDF)
- GitHub: “Resiliency in Azure” (preview project referenced by Microsoft)
Bas Dorland, Technology Journalist & Founder of dorland.org