Azure Application Gateway: Common Mistakes & The Right WAF Rollout (Healthcare-Ready)

Azure Application Gateway with WAF v2 is the front door for most SaaS workloads I review. It is also the single most consistently mis-deployed component in those reviews. The pattern repeats across organisations of every size:

CloudOps stands up Application Gateway with WAF in Prevention mode on day one to "tick the security box."
Within days, the WAF blocks legitimate traffic — JSON bodies, file uploads, customer SSO callbacks, the marketing team's redirect chains.
Engineers respond by flipping WAF into Detection mode "temporarily."
Six months later, Detection is still on. The platform has compliance signage that says WAF, and protection that is purely cosmetic.

I have walked into too many environments where this is the steady state. The fix is not a product — it is a disciplined rollout. This article catalogs the mistakes I see most often and lays out the playbook I use to take SaaS apps from no WAF to full Prevention without breaking real users.

Who this is for: solution architects responsible for the front door of a SaaS platform, and especially those of us working in healthcare, where WAF misconfiguration is not just a security gap but a HIPAA and HITRUST control failure. The general playbook applies to every workload; the healthcare addendum (§8) is mandatory reading if any byte of ePHI flows through the gateway.

At a glance: 2–4 wk Detection baseline · 95%+ false positives tuned · PHI in WAF logs · 6 yr HIPAA log retention

What's inside

The 10 mistakes I see most often
Reference architecture — WAF in front of SaaS
The rollout playbook: Detection → tune → Prevention
Tuning the false positives without weakening the WAF
Risk register — threat, blast radius, mitigation
Monitoring & alerting that actually works
Terraform baseline (App Gateway + WAF Policy)
Healthcare addendum — HIPAA · HITRUST · FHIR (mandatory)
Architect's checklist
Further reading — official documentation

1. The 10 Mistakes I See Most Often

Before the design pattern, the anti-patterns. If you recognise three or more of these in your environment, your WAF is not doing what your compliance team thinks it is doing.

Prevention mode on day one with no traffic baseline — guaranteed to block legitimate users and force a panicked rollback.
Detection mode left on indefinitely after the rollback "for now." There is no protection — only logs no one reads.
Disabling entire OWASP rule groups (e.g. all of RuleGroup 942 SQL Injection) instead of excluding the specific noisy rule.
Wildcard exclusions like RequestArgNames = * that effectively turn off whole inspection categories for every request.
No diagnostic logs sent to Log Analytics — you cannot tune what you cannot see.
WAF Policy attached at the gateway level only, with no per-listener or per-URI overrides, so tuning for one app weakens protection for all of them.
SKU sprawl — using Standard_v2 (no WAF) in lower environments, WAF_v2 in production. The tuning never transfers.
Backend HTTP and no end-to-end TLS — TLS terminates at the gateway and traffic flows unencrypted to the backend over the private subnet.
Public IP with no NSG / no private endpoint backend — bypassing the gateway is one misconfigured DNS record away.
No autoscaling, fixed two instances — first marketing campaign fills the gateway and customers see 502s.

2. Reference Architecture — WAF in Front of SaaS

The right architecture is single-ingress, private backends, end-to-end TLS, and one WAF Policy per application surface. Diagram first, then the playbook for getting there safely.

[Figure: one public IP, one WAF, multiple private backends. Diagnostics to Sentinel; certificates from Key Vault.]

3. The Rollout Playbook — Detection → Tune → Prevention

This is the four-phase rollout I use every time. It is boring, it works, and it keeps both the security team and the application owners on side.

Phase 1 · Week 0

Deploy in Detection

WAF v2 enabled, OWASP CRS 3.2, mode = Detection. Diagnostic logs to Log Analytics from minute one.

Phase 2 · Weeks 1–3

Baseline the traffic

Daily KQL review of AzureDiagnostics. Cluster matches by rule ID, host, URI. Catch every real-traffic anomaly.

Phase 3 · Weeks 3–4

Tune surgically

Add scoped exclusions per rule ID + selector (never wildcard, never disable groups). Re-test in a staging gateway.

Phase 4 · Week 4+

Flip to Prevention

Per-listener cutover, smallest blast radius first. Alert on every block for 7 days. Then steady-state.
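The per-listener cutover in Phase 4 can be expressed in Terraform by giving each listener its own WAF Policy and driving the mode from a variable. This is a minimal sketch, not a complete gateway: the `listener_waf_mode` variable, the two listener keys (`marketing`, `api`), and the listener, port, and host names are all hypothetical placeholders, and `var.rg` / `var.location` are assumed to be the same inputs used in the baseline later in this article.

```hcl
# Hypothetical input: one entry per listener, flipped one at a time.
variable "listener_waf_mode" {
  type = map(string)
  default = {
    marketing = "Prevention" # smallest blast radius, cut over first
    api       = "Detection"  # still baselining
  }
}

# One WAF Policy per listener, so each can change mode independently.
resource "azurerm_web_application_firewall_policy" "per_listener" {
  for_each            = var.listener_waf_mode
  name                = "wafp-${each.key}"
  resource_group_name = var.rg
  location            = var.location

  policy_settings {
    enabled = true
    mode    = each.value
  }

  managed_rules {
    managed_rule_set {
      type    = "OWASP"
      version = "3.2"
    }
  }
}

# Fragment: belongs inside the azurerm_application_gateway resource.
# The listener-level policy takes precedence over the gateway-level one.
http_listener {
  name                           = "lsn-api"         # placeholder
  frontend_ip_configuration_name = "feip-public"     # placeholder
  frontend_port_name             = "port-443"        # placeholder
  protocol                       = "Https"
  ssl_certificate_name           = "tls-cert"
  host_name                      = "api.example.com" # placeholder
  firewall_policy_id             = azurerm_web_application_firewall_policy.per_listener["api"].id
}
```

With this shape, moving `api` to Prevention is a one-line change to the variable, which makes the cutover reviewable and easy to revert for a single listener.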

💡 Architect's Tip

"Whitelist" is the wrong mental model. You are excluding specific WAF rules for specific request attributes on specific URIs — never permitting traffic. Frame every exclusion request to the security team that way and the conversation gets a lot easier.

4. Tuning Without Weakening the WAF

Most WAF horror stories trace back to a single bad habit: when something gets blocked, the team disables the whole rule group. That converts a precision instrument into a placebo. The right pattern is targeted exclusions.

KQL to baseline blocks before you tune

// Top WAF rule IDs triggering, by host + URI
// "Detected" is what managed rules log in Detection mode; "Blocked" / "Matched" appear in Prevention mode
AzureDiagnostics
| where ResourceType == "APPLICATIONGATEWAYS" and Category == "ApplicationGatewayFirewallLog"
| where action_s in ("Detected", "Matched", "Blocked")
| summarize hits = count() by ruleId_s, hostname_s, requestUri_s, action_s
| order by hits desc
| take 50

Scoped exclusion (right way)

Exclude rule 942100 only for the JSON field filter on the search API — not everywhere:

# Terraform: per-rule, per-selector exclusion
managed_rules {
  exclusion {
    match_variable          = "RequestArgNames"
    selector                = "filter"
    selector_match_operator = "Equals"
    excluded_rule_set {
      type    = "OWASP"
      version = "3.2"
      rule_group {
        rule_group_name = "REQUEST-942-APPLICATION-ATTACK-SQLI"
        excluded_rules  = ["942100"]
      }
    }
  }
}

Anti-pattern (do not ship this)

# Disables the entire SQLi rule group for every request - do NOT do this
disabled_rule_group {
  rule_group_name = "REQUEST-942-APPLICATION-ATTACK-SQLI"
}

5. Risk Register — Threat, Blast Radius, Mitigation

Every WAF mistake maps to a specific risk. This is the table I bring to architecture reviews:

Threat / mistake	Blast radius	Mitigation
Detection left on indefinitely	WAF is a logger, not a control. SQLi, XSS, RFI reach the app.	Time-boxed rollout with a Prevention date in the change record
Whole rule group disabled	Entire OWASP category off — silent over-permission	Scoped exclusions per rule ID + selector; weekly review
Day-one Prevention, no baseline	Legitimate traffic blocked; emergency rollback erodes security mandate	Mandatory 2–4 week Detection baseline gate
Backend reachable directly	Attacker bypasses the WAF	Private Endpoint backends; NSG deny from internet; policy enforce
TLS terminates at gateway only	Plaintext between gateway and backend	End-to-end TLS, backend HTTPS settings with pinned root CA
No diagnostics	No tuning possible; incident forensics blind	Diagnostic setting to LA on day one; 90-day retention minimum
Fixed instance count	Capacity saturation under marketing spike; 502s to customers	Autoscale 2 → 10 minimum on WAF_v2
PHI in WAF diagnostic logs (URI / body capture)	HIPAA breach: ePHI written to Log Analytics and downstream Sentinel/SIEM with broader access scope than the application database	Disable `request_body_check` capture on PHI endpoints OR scrub at ingestion; segregate WAF workspace; RBAC + customer-managed keys
No geo-filter on US-only healthcare SaaS	Expands HIPAA threat surface; non-US recon traffic inflates WAF noise and complicates BAA scope	WAF Policy custom rule using the `GeoMatch` operator: allow only contracted geographies; alert on denied geos

6. Monitoring & Alerting That Actually Works

Detection mode is useless without someone watching. Set these alerts on day one — Detection or Prevention:

Spike in matched rules for any single rule ID (>3x rolling 24h baseline) — early signal of a real attack or a deploy that introduced false positives.
Blocked requests above threshold per backend — anything above ~0.1% of total traffic in steady state deserves investigation.
Backend health probe failures — these are routinely mistaken for WAF blocks during incidents; correlate the two.
Certificate expiry < 30 days — alert from Key Vault, not from production failure.
Gateway autoscale ceiling hit — your sizing assumption was wrong; investigate before the next traffic peak.

Wire all of these into Microsoft Sentinel or your existing SIEM. WAF logs without correlation are noise; correlated with sign-in logs, NSG flow logs, and backend app logs, they are gold.
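The first alert in the list (a single rule ID firing at more than 3x its baseline) could be codified roughly as follows. This is a sketch under assumptions, not a tuned detection: `var.log_analytics_workspace_id` and `var.action_group_id` are hypothetical inputs, the baseline is simplified to the hourly average over the previous 23 hours of the one-day query window, and the severity and frequency are illustrative.

```hcl
resource "azurerm_monitor_scheduled_query_rules_alert_v2" "waf_rule_spike" {
  name                 = "alert-waf-rule-spike-${var.app}"
  resource_group_name  = var.rg
  location             = var.location
  scopes               = [var.log_analytics_workspace_id] # assumed input
  severity             = 2
  evaluation_frequency = "PT1H" # run every hour
  window_duration      = "P1D"  # query the last 24 hours of WAF logs

  criteria {
    # Returns one row per rule ID whose last-hour hit count exceeds
    # 3x its hourly average over the preceding 23 hours.
    query = <<-KQL
      AzureDiagnostics
      | where ResourceType == "APPLICATIONGATEWAYS" and Category == "ApplicationGatewayFirewallLog"
      | summarize last_hour = countif(TimeGenerated > ago(1h)),
                  prior     = countif(TimeGenerated <= ago(1h))
                by ruleId_s
      | extend hourly_baseline = prior / 23.0
      | where hourly_baseline > 0 and last_hour > 3 * hourly_baseline
    KQL

    time_aggregation_method = "Count"
    operator                = "GreaterThan"
    threshold               = 0 # any returned row fires the alert
  }

  action {
    action_groups = [var.action_group_id] # assumed input
  }
}
```

Note that the `hourly_baseline > 0` guard means a rule ID with no prior hits never fires this alert; brand-new rule IDs are better caught by the daily triage described in the next section.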

Continuous tuning lifecycle — review, exclude, delete

A WAF is not a “deploy once” control. Traffic shifts, applications ship new endpoints, ruleset versions change, and attackers probe what worked yesterday. Without a continuous loop the exclusions you added in week three quietly become permanent over-permissions — and that is precisely where attackers find the gap. The discipline I run on every healthcare gateway:

Daily · SOC

Triage blocks & matches

Sentinel workbook: top rule IDs, new host/URI combinations, geo anomalies. Anything new gets a ticket the same day.

Weekly · AppOps

Tune false positives

Add scoped exclusions in IaC (rule ID + selector + URI). Every exclusion ships with a review date and an owner.

Monthly · SecOps

Hunt & rule-version diff

Threat-hunt across 30 days of WAF logs. Diff managed-ruleset version notes and update custom rules to cover newly published CVEs.

Quarterly · Architect

Delete stale exclusions

Any exclusion past its review date is removed by default. If the false positive comes back, it gets re-added with fresh evidence. No silent permanent allowances.

⚠️ Stale exclusions are an attack surface

Every exclusion is, by definition, a hole in the managed ruleset. An exclusion added two years ago for an endpoint that no longer exists is a free pass for an attacker who finds a way to reach a matching URI or argument name. Treat the exclusion list like firewall rules — reviewed quarterly, deleted aggressively, owned by a named engineer, and tracked in Git so every change is auditable.
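One way to make "removed by default" mechanical is to carry the owner and review date alongside each exclusion in IaC and filter out anything past its date at plan time. The sketch below assumes Terraform 1.5 or later (for `plantimestamp()`; `timecmp()` needs 1.3 or later); the `waf_exclusions` structure, the owner name, and the date are hypothetical.

```hcl
locals {
  # Hypothetical register: every exclusion carries an owner and a review date.
  waf_exclusions = [
    {
      rule_group     = "REQUEST-942-APPLICATION-ATTACK-SQLI"
      rule_id        = "942100"
      match_variable = "RequestArgNames"
      selector       = "filter"
      owner          = "search-team"          # hypothetical owner
      review_by      = "2026-03-31T00:00:00Z" # RFC 3339, hypothetical date
    },
  ]

  # Keep only exclusions whose review date is still in the future.
  active_exclusions = [
    for e in local.waf_exclusions : e
    if timecmp(plantimestamp(), e.review_by) < 0
  ]
}

# Fragment: belongs inside azurerm_web_application_firewall_policy.
managed_rules {
  dynamic "exclusion" {
    for_each = local.active_exclusions
    content {
      match_variable          = exclusion.value.match_variable
      selector                = exclusion.value.selector
      selector_match_operator = "Equals"
      excluded_rule_set {
        type    = "OWASP"
        version = "3.2"
        rule_group {
          rule_group_name = exclusion.value.rule_group
          excluded_rules  = [exclusion.value.rule_id]
        }
      }
    }
  }

  managed_rule_set {
    type    = "OWASP"
    version = "3.2"
  }
}
```

The trade-off is deliberate: once a review date passes, the next plan shows the exclusion being removed, so keeping it requires a commit that updates `review_by` with fresh evidence.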

KQL to flag rule / host / URI combinations that fired earlier in the 90-day window but have been silent for more than 60 days. Cross-check the results against the exclusion list: an exclusion whose traffic pattern no longer appears is a strong candidate for deletion. Treat it as an indirect signal, since a suppressed rule produces no log entry of its own:

// Rule / host / URI combinations seen in the last 90 days whose most recent hit is over 60 days old
AzureDiagnostics
| where ResourceType == "APPLICATIONGATEWAYS" and Category == "ApplicationGatewayFirewallLog"
| where TimeGenerated > ago(90d)
| summarize last_hit = max(TimeGenerated), hits = count() by ruleId_s, hostname_s, requestUri_s
| where last_hit < ago(60d)
| order by last_hit asc

7. Terraform Baseline — App Gateway + WAF Policy

Deploy the same baseline to every environment. Tuning that happens only in production never gets validated, and exclusions that live only in the portal disappear at the next redeploy.

# WAF Policy: start in Detection, flip to Prevention via variable
resource "azurerm_web_application_firewall_policy" "this" {
  name                = "wafp-${var.app}"
  resource_group_name = var.rg
  location            = var.location

  policy_settings {
    enabled                     = true
    mode                        = var.waf_mode    # "Detection" until baseline complete
    request_body_check          = true
    file_upload_limit_in_mb     = 100
    max_request_body_size_in_kb = 128
  }

  managed_rules {
    managed_rule_set {
      type    = "OWASP"
      version = "3.2"
    }
  }
}

# Application Gateway WAF v2 with autoscale + Key Vault cert
resource "azurerm_application_gateway" "this" {
  name                = "agw-${var.app}"
  resource_group_name = var.rg
  location            = var.location
  firewall_policy_id  = azurerm_web_application_firewall_policy.this.id

  sku {
    name = "WAF_v2"
    tier = "WAF_v2"
  }
  autoscale_configuration {
    min_capacity = 2
    max_capacity = 10
  }

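  ssl_certificate {
    name                = "tls-cert"
    # Versionless secret ID, so the gateway picks up rotated certificate versions from Key Vault
    key_vault_secret_id = azurerm_key_vault_certificate.tls.versionless_secret_id
  }
  # ... managed identity for Key Vault access, listeners, backend pools, end-to-end TLS settings ...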
}
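The baseline is only tunable if the firewall log is flowing, so the diagnostic setting belongs in the same module. A minimal sketch, assuming a hypothetical `var.log_analytics_workspace_id` input and a recent azurerm provider that uses the `enabled_log` block:

```hcl
resource "azurerm_monitor_diagnostic_setting" "agw" {
  name                       = "diag-agw-${var.app}"
  target_resource_id         = azurerm_application_gateway.this.id
  log_analytics_workspace_id = var.log_analytics_workspace_id # assumed input

  # Access log: every request, needed to compute block rates against total traffic.
  enabled_log {
    category = "ApplicationGatewayAccessLog"
  }

  # Firewall log: rule matches, the source for the KQL queries in this article.
  enabled_log {
    category = "ApplicationGatewayFirewallLog"
  }
}
```

With the default destination type the entries land in the `AzureDiagnostics` table, which is what the queries in this article assume.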

Pair this with an Azure Policy assignment at the management group that denies creation of Application Gateways without an attached WAF Policy. That single guardrail prevents 90% of the day-one mistakes from ever shipping.
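As a starting point, the guardrail can be assigned from Terraform using a built-in definition. The sketch below assumes the built-in policy with the display name shown, assumes that it accepts an `effect` parameter with a `Deny` value, and uses a hypothetical `var.management_group_id` input. Verify that the built-in's condition matches your definition of "WAF Policy attached" before relying on it, and write a custom definition if it does not.

```hcl
# Look up the built-in definition by display name (assumed to exist in your cloud).
data "azurerm_policy_definition" "agw_waf_required" {
  display_name = "Web Application Firewall (WAF) should be enabled for Application Gateway"
}

resource "azurerm_management_group_policy_assignment" "agw_waf_required" {
  name                 = "deny-agw-without-waf" # assignment names at this scope are length-limited
  management_group_id  = var.management_group_id # assumed input
  policy_definition_id = data.azurerm_policy_definition.agw_waf_required.id

  # Switch the built-in from its default effect to Deny.
  parameters = jsonencode({
    effect = { value = "Deny" }
  })
}
```

Running the assignment in Audit first and switching to Deny once existing gateways are compliant mirrors the Detection-to-Prevention discipline used for the WAF itself.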

8. Healthcare Addendum — HIPAA · HITRUST · FHIR (Mandatory)

If your gateway terminates traffic for a healthcare SaaS — a payer portal, a provider app, a claims API, a FHIR endpoint — every architectural decision above gets a compliance dimension on top of the security one. Skipping this section in a healthcare design review is, in my experience, the single fastest way to fail a HITRUST audit on a perfectly good Azure platform.

⚠️ The PHI-in-logs gotcha

Application Gateway WAF logs the request URI in full and, when request_body_check is enabled, can capture the matched portion of the request body. On a healthcare API where the URI carries identifiers (/Patient/12345, /claims/MRN/...) or the body carries ePHI, those values land in Log Analytics, Sentinel, and any downstream SIEM. That is an unintended PHI store with its own access model. Plan for it before you turn WAF on, not after the auditor finds it.
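One mitigation to evaluate is the WAF Policy's log scrubbing feature, which masks selected request attributes before they are written to the firewall log. The sketch below assumes an azurerm provider version that exposes the `log_scrubbing` block under `policy_settings`, and the selectors (`mrn`, `patientId`) are hypothetical field names. It does not address identifiers embedded in the URI path, which still need a separate control such as restricted workspace access or scrubbing at ingestion.

```hcl
# Fragment: belongs inside azurerm_web_application_firewall_policy.
policy_settings {
  enabled            = true
  mode               = var.waf_mode
  request_body_check = true

  log_scrubbing {
    enabled = true

    # Mask a query-string / form argument that carries an identifier.
    rule {
      enabled                 = true
      match_variable          = "RequestArgNames"
      selector_match_operator = "Equals"
      selector                = "mrn" # hypothetical argument name
    }

    # Mask a JSON body field that carries an identifier.
    rule {
      enabled                 = true
      match_variable          = "RequestJSONArgNames"
      selector_match_operator = "Equals"
      selector                = "patientId" # hypothetical field name
    }
  }
}
```

Scrubbing reduces what reaches the log; it is not a reason to relax the access model on the workspace that receives it.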

HIPAA §164.312 — how the gateway satisfies each technical safeguard

HIPAA control	Gateway / WAF implementation
§164.312(a)(1) Access Control	Private Endpoint backends; NSG deny-from-internet; the gateway is the only ingress; whichever WAF fronts the app (Front Door or App Gateway), no path bypasses it
§164.312(a)(2)(iv) Encryption & Decryption	TLS 1.2+ at the listener; end-to-end TLS to backend; cipher suite policy restricted (no TLS 1.0/1.1, no legacy ciphers)
§164.312(b) Audit Controls	Diagnostic settings → Log Analytics → Sentinel; 6-year retention minimum; immutable archive tier after 90 days
§164.312(c)(1) Integrity	WAF Prevention mode enforces request integrity (no tampered headers/bodies passed to backend); OWASP CRS 3.2 + bot manager
§164.312(d) Person / Entity Authentication	Gateway enforces mTLS / OAuth at the listener where required; AAD-backed SSO for admin plane; never anonymous to backend
§164.312(e)(1) Transmission Security	End-to-end TLS — terminating at the gateway and sending plaintext to a private subnet still fails this control; pin backend root CA
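The encryption and transmission rows translate into three blocks on the gateway resource. This is a hedged sketch: the certificate path, backend host name, and block names are placeholders, and the predefined policy name is one that, to my knowledge, enforces TLS 1.2 as the minimum. Confirm the current policy list before pinning it.

```hcl
# Fragments: all three blocks belong inside the azurerm_application_gateway resource.

# Listener side: predefined policy with a TLS 1.2 floor and a restricted cipher list.
ssl_policy {
  policy_type = "Predefined"
  policy_name = "AppGwSslPolicy20220101"
}

# Backend side: the root CA the gateway should trust when re-encrypting.
trusted_root_certificate {
  name = "backend-root-ca"
  data = filebase64("${path.module}/certs/backend-root-ca.cer") # placeholder path
}

# Re-encrypt to the backend over HTTPS and validate against the pinned root CA.
backend_http_settings {
  name                           = "bhs-api-https"
  port                           = 443
  protocol                       = "Https"
  cookie_based_affinity          = "Disabled"
  request_timeout                = 30
  host_name                      = "api.internal.example.com" # placeholder
  trusted_root_certificate_names = ["backend-root-ca"]
}
```

Backends that present a certificate from a well-known public CA may not need the explicit root; pinning matters most for private or internal CAs.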

HITRUST CSF — the controls auditors actually look at

01.j Network Access Control — gateway is the documented ingress; backend cannot be reached directly (proven by an explicit deny rule, not by absence).
09.s On-line Transactions — WAF inspects and protects every online PHI transaction in Prevention mode; Detection-only is a finding.
09.aa Audit Logging — WAF + backend logs correlated in Sentinel; retention ≥ 6 years for HIPAA, with documented review cadence.
10.f Cryptographic Key Management — TLS certificate sourced from Key Vault with auto-rotation and HSM-backed keys; no certs on the gateway file system.
10.m Control of Technical Vulnerabilities — managed ruleset version pinned and reviewed quarterly; CVE-driven custom rules added between releases.

FHIR & claims API tuning patterns

These are the exclusions I add early in every healthcare WAF rollout — they show up in the baseline within hours and they are safe when scoped tightly:

FHIR _search with complex query parameters — exclude rule 942100 (SQLi) for RequestArgNames selectors filter, _filter, composite on /fhir/*/_search only.
Bulk export job IDs — exclude 920420 (request content type) for the application/fhir+ndjson path; do not disable the rule globally.
X12 / EDI claim payloads — large bodies trigger 920170 / 920270; raise max_request_body_size_in_kb only for the claims listener, not platform-wide.
OAuth callback URLs from payer / provider IdPs — exclude 931130 (RFI) for RequestArgNames=redirect_uri, scoped to the /oauth/callback URI.
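The FHIR search exclusion can be kept off the rest of the platform by putting it in its own WAF Policy and attaching that policy to a path rule. A sketch under assumptions: pool, settings, and map names are placeholders, and because path rules (as far as I know) only accept a trailing wildcard, the scope here is `/fhir/*` rather than `/fhir/*/_search`.

```hcl
# Dedicated policy for the FHIR surface: the exclusion lives here and nowhere else.
resource "azurerm_web_application_firewall_policy" "fhir" {
  name                = "wafp-${var.app}-fhir"
  resource_group_name = var.rg
  location            = var.location

  policy_settings {
    enabled = true
    mode    = var.waf_mode
  }

  managed_rules {
    exclusion {
      match_variable          = "RequestArgNames"
      selector                = "_filter"
      selector_match_operator = "Equals"
      excluded_rule_set {
        type    = "OWASP"
        version = "3.2"
        rule_group {
          rule_group_name = "REQUEST-942-APPLICATION-ATTACK-SQLI"
          excluded_rules  = ["942100"]
        }
      }
    }

    managed_rule_set {
      type    = "OWASP"
      version = "3.2"
    }
  }
}

# Fragment: belongs inside the azurerm_application_gateway resource.
url_path_map {
  name                               = "upm-api"       # placeholder
  default_backend_address_pool_name  = "pool-api"      # placeholder
  default_backend_http_settings_name = "bhs-api-https" # placeholder

  path_rule {
    name                       = "fhir"
    paths                      = ["/fhir/*"]
    backend_address_pool_name  = "pool-fhir"     # placeholder
    backend_http_settings_name = "bhs-api-https" # placeholder
    firewall_policy_id         = azurerm_web_application_firewall_policy.fhir.id
  }
}
```

Requests outside `/fhir/*` continue to be evaluated by the listener-level or gateway-level policy with no exclusion applied.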

Geo-restriction is a compliance lever, not just security

For US-only healthcare SaaS the WAF policy should explicitly allow only the contracted geographies. It shrinks the noise the SOC tunes against, and it reduces the BAA threat surface you must defend. A short custom rule:

# Block every request whose source geography is not in match_values (US only here; add other contracted country codes as needed)
custom_rules {
  name      = "geo-allow-us-only"
  priority  = 10
  rule_type = "MatchRule"
  action    = "Block"
  match_conditions {
    match_variables { variable_name = "RemoteAddr" }
    operator           = "GeoMatch"
    negation_condition = true
    match_values       = ["US"]
  }
}

🏥 Healthcare design rule

For any listener carrying ePHI: Prevention mode is the only acceptable steady state, end-to-end TLS is non-negotiable, the WAF Log Analytics workspace is treated as a regulated data store (CMK, private link, RBAC, 6-year retention), and every exclusion has an expiry date that re-validates against current FHIR / X12 traffic.
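The "regulated data store" part of this rule can be started in Terraform with a dedicated workspace and table-level long-term retention. This is a sketch with assumptions: names are placeholders, 2191 days is used as the six-year value, and customer-managed keys and Private Link are left out because they need additional resources (for example a dedicated cluster and a private link scope).

```hcl
# Dedicated workspace for WAF diagnostics, closed to public ingestion and query.
resource "azurerm_log_analytics_workspace" "waf" {
  name                       = "log-waf-${var.app}"
  resource_group_name        = var.rg
  location                   = var.location
  sku                        = "PerGB2018"
  retention_in_days          = 90
  internet_ingestion_enabled = false
  internet_query_enabled     = false
}

# Keep 90 days interactive, then long-term retention up to roughly six years.
resource "azurerm_log_analytics_workspace_table" "waf_diagnostics" {
  workspace_id            = azurerm_log_analytics_workspace.waf.id
  name                    = "AzureDiagnostics"
  retention_in_days       = 90
  total_retention_in_days = 2191
}
```

Disabling public ingestion and query only works once a private link path to the workspace exists, so sequence the two changes accordingly.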

9. Architect's Checklist

WAF v2 (not Standard_v2). OWASP CRS 3.2 attached.
Mode = Detection at launch. Calendar entry for the Prevention cutover.
Diagnostic logs to Log Analytics — day one, every environment.
Per-rule, per-selector exclusions in IaC. No disabled rule groups.
Private Endpoint backends. Public access disabled by policy.
End-to-end TLS. Cert from Key Vault, auto-rotation enabled.
Autoscale 2 → 10 minimum. Alert on ceiling reached.
Sentinel rules for block-spike, rule-spike, backend health, cert expiry.
Azure Policy: deny App Gateway without WAF Policy attached.
Quarterly WAF tuning review. Exclusions expire if not re-validated.

Healthcare workloads — add these to the checklist:

HIPAA §164.312 mapping documented per listener (access, encryption, audit, integrity, authentication, transmission).
WAF diagnostic workspace treated as PHI: CMK-encrypted, Private Link, RBAC-locked, 6-year retention.
PHI scrubbing or body-capture suppression on endpoints whose URI or body carries identifiers.
Geo-allow list (custom rule) restricted to contracted countries; alert on denied geos.
FHIR / X12 exclusions are scoped per-URI, per-selector, with expiry dates.
TLS 1.2+ minimum, modern cipher policy; TLS 1.0/1.1 explicitly denied.
BAA scope diagram includes the gateway, the WAF policy, the diagnostic workspace, and Key Vault.
HITRUST controls 01.j, 09.s, 09.aa, 10.f, 10.m have named evidence owners.

10. Further Reading — Official Documentation

Everything in this article is grounded in current Microsoft guidance. These are the primary references I keep open during a WAF review — follow them for the authoritative configuration details and the latest version notes:

Application Gateway & WAF — core

Application Gateway overview — service model, SKUs, listeners, backend pools.
WAF on Application Gateway — overview — Detection vs Prevention, modes, policy model.
Configure WAF rules and exclusions — the authoritative reference for per-rule, per-selector exclusions.
OWASP CRS rule groups & rules — every managed rule ID and its purpose (3.2, 3.1, 3.0).
Custom WAF rules — syntax for geo, IP, rate-limit, match-condition custom rules.
WAF best practices — Microsoft's own deployment guidance; aligns with the Detection → Prevention pattern above.

Operations — logging, monitoring, troubleshooting

Diagnostic settings & logs — enabling access, performance, and firewall logs to Log Analytics.
WAF metrics & alerts — metrics to alert on for the monitoring section above.
WAF troubleshooting — the official runbook for diagnosing false positives before you add an exclusion.
Request size & body limits — essential reading before tuning for X12 / FHIR bulk payloads.

Security & TLS

TLS / SSL on Application Gateway — listener-side cipher policy.
End-to-end TLS — step-by-step for backend re-encryption.
Key Vault certificates with Application Gateway — the documented pattern for the Terraform baseline above.
Microsoft Sentinel data connectors — the WAF connector for correlated detection.

Compliance — HIPAA & HITRUST

Azure HIPAA / HITECH offering — Microsoft's published mapping of Azure services to HIPAA Security Rule controls.
Azure HITRUST CSF offering — HITRUST scope and Microsoft's shared-responsibility statement.
Healthcare landing zone reference architecture — broader pattern the WAF sits inside.

Infrastructure-as-Code

Terraform azurerm_web_application_firewall_policy — every argument in the policy snippet above.
Terraform azurerm_application_gateway — listeners, backend settings, autoscale, Key Vault certs.

Key Takeaway

Application Gateway WAF is not a checkbox — it is an operational discipline. Stand it up in Detection, baseline real traffic for weeks, tune surgically, then flip to Prevention with a date you committed to in advance. For healthcare that discipline becomes a compliance obligation: Prevention is the only acceptable steady state for ePHI listeners, every exclusion is scoped and expires, every diagnostic byte is treated as PHI, and HIPAA §164.312 / HITRUST evidence is generated by the architecture — not added by the audit team six months later. Do that and you ship a WAF that protects the application and the BAA.