Azure Application Gateway with WAF v2 is the front door for most SaaS workloads I review. It is also the single most consistently mis-deployed component in those reviews. The pattern repeats across organisations of every size:
- CloudOps stands up Application Gateway with WAF in Prevention mode on day one to "tick the security box."
- Within days, the WAF blocks legitimate traffic — JSON bodies, file uploads, customer SSO callbacks, the marketing team's redirect chains.
- Engineers respond by flipping WAF into Detection mode "temporarily."
- Six months later, Detection is still on. The platform has compliance signage that says WAF, and protection that is purely cosmetic.
I have walked into too many environments where this is the steady state. The fix is not a product — it is a disciplined rollout. This article catalogs the mistakes I see most often and lays out the playbook I use to take SaaS apps from no WAF to full Prevention without breaking real users.
Who this is for: solution architects responsible for the front door of a SaaS platform, and especially those of us working on healthcare, where WAF misconfiguration is not just a security gap but a HIPAA and HITRUST control failure. The general playbook applies to every workload; the healthcare addendum (§8) is mandatory reading if any byte of ePHI flows through the gateway.
- The 10 mistakes I see most often
- Reference architecture — WAF in front of SaaS
- The rollout playbook: Detection → tune → Prevention
- Tuning the false positives without weakening the WAF
- Risk register — threat, blast radius, mitigation
- Monitoring & alerting that actually works
- Terraform baseline (App Gateway + WAF Policy)
- Healthcare addendum — HIPAA · HITRUST · FHIR (mandatory)
- Architect's checklist
- Further reading — official documentation
1. The 10 Mistakes I See Most Often
Before the design pattern, the anti-patterns. If you recognise three or more of these in your environment, your WAF is not doing what your compliance team thinks it is doing.
- Prevention mode on day one with no traffic baseline — guaranteed to block legitimate users and force a panicked rollback.
- Detection mode left on indefinitely after the rollback "for now." There is no protection — only logs no one reads.
- Disabling entire OWASP rule groups (e.g. all of RuleGroup 942 SQL Injection) instead of excluding the specific noisy rule.
- Wildcard exclusions like RequestArgNames=* that effectively turn off whole inspection categories for every request.
- No diagnostic logs sent to Log Analytics: you cannot tune what you cannot see.
- WAF Policy attached at the gateway level only, with no per-listener or per-URI overrides, so tuning for one app weakens protection for all of them.
- SKU sprawl — using Standard_v2 (no WAF) in lower environments, WAF_v2 in production. The tuning never transfers.
- Backend HTTP and no end-to-end TLS — TLS terminates at the gateway and traffic flows unencrypted to the backend over the private subnet.
- Public IP with no NSG / no private endpoint backend — bypassing the gateway is one misconfigured DNS record away.
- No autoscaling, fixed two instances — first marketing campaign fills the gateway and customers see 502s.
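The missing-diagnostics mistake is the cheapest one to close in code. A minimal sketch, assuming the gateway is the `azurerm_application_gateway.this` resource used later in this article and that `var.app`, `var.rg`, and `var.location` exist; the workspace name is hypothetical:

```hcl
# Sketch only: names are illustrative, not taken from a real deployment.
resource "azurerm_log_analytics_workspace" "waf" {
  name                = "law-waf-${var.app}"
  resource_group_name = var.rg
  location            = var.location
  sku                 = "PerGB2018"
  retention_in_days   = 90
}

resource "azurerm_monitor_diagnostic_setting" "agw" {
  name                       = "diag-agw-${var.app}"
  target_resource_id         = azurerm_application_gateway.this.id
  log_analytics_workspace_id = azurerm_log_analytics_workspace.waf.id

  # Firewall log: rule matches and blocks, the input for all tuning.
  enabled_log {
    category = "ApplicationGatewayFirewallLog"
  }

  # Access log: total request volume, needed to express blocks as a ratio.
  enabled_log {
    category = "ApplicationGatewayAccessLog"
  }
}
```

With no destination type set, the logs land in the shared AzureDiagnostics table, which is what the KQL in this article queries.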
2. Reference Architecture — WAF in Front of SaaS
The right architecture is single-ingress, private backends, end-to-end TLS, and one WAF Policy per application surface. Diagram first, then the playbook for getting there safely.
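One way to make "private backends" provable is an explicit deny on the backend subnet. The following is a sketch under assumptions: `var.appgw_subnet_cidr` and `var.backend_subnet_id` are hypothetical inputs, and the backend is assumed to listen on 443 only.

```hcl
# Sketch: only the gateway subnet may reach the backend subnet.
resource "azurerm_network_security_group" "backend" {
  name                = "nsg-backend-${var.app}"
  resource_group_name = var.rg
  location            = var.location

  security_rule {
    name                       = "allow-appgw-https"
    priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "443"
    source_address_prefix      = var.appgw_subnet_cidr # hypothetical variable
    destination_address_prefix = "*"
  }

  # Explicit deny, so the control is evidenced by a rule rather than by absence.
  security_rule {
    name                       = "deny-all-other-inbound"
    priority                   = 4096
    direction                  = "Inbound"
    access                     = "Deny"
    protocol                   = "*"
    source_port_range          = "*"
    destination_port_range     = "*"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }
}

resource "azurerm_subnet_network_security_group_association" "backend" {
  subnet_id                 = var.backend_subnet_id # hypothetical variable
  network_security_group_id = azurerm_network_security_group.backend.id
}
```

The catch-all deny also blocks other in-VNet callers, so add allow rules for anything legitimate (load balancer probes, management hosts) before applying it. For PaaS backends behind Private Endpoints, disabling public network access on the service itself is the equivalent control.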
3. The Rollout Playbook — Detection → Tune → Prevention
This is the four-phase rollout I use every time. It is boring, it works, and it keeps both the security team and the application owners on side.
[Figure: four-phase rollout diagram. Surviving text from the baseline phase: query AzureDiagnostics; cluster matches by rule ID, host, and URI; catch every real-traffic anomaly.]
"Whitelist" is the wrong mental model. You are excluding specific WAF rules for specific request attributes on specific URIs, never permitting traffic. Frame every exclusion request to the security team that way and the conversation gets a lot easier.
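The mode switch and the committed cutover date can both live in code. This is a sketch, not a prescribed pattern: `prevention_cutover_date` is a hypothetical variable, and the `check` block and `plantimestamp()` assume Terraform 1.5 or later.

```hcl
variable "waf_mode" {
  description = "WAF policy mode. Stays on Detection until the baseline is signed off."
  type        = string
  default     = "Detection"

  validation {
    condition     = contains(["Detection", "Prevention"], var.waf_mode)
    error_message = "waf_mode must be \"Detection\" or \"Prevention\"."
  }
}

# Hypothetical: the Prevention date from the change record, in RFC 3339 form.
variable "prevention_cutover_date" {
  description = "Committed date for the Prevention cutover, e.g. 2026-01-15T00:00:00Z."
  type        = string
}

# Flags every plan that still runs Detection after the committed date.
check "prevention_cutover_not_missed" {
  assert {
    condition = (
      var.waf_mode == "Prevention" ||
      timecmp(plantimestamp(), var.prevention_cutover_date) < 0
    )
    error_message = "WAF is still in Detection mode past the committed Prevention cutover date."
  }
}
```

A failed `check` surfaces as a warning rather than blocking the apply, so it nags without breaking an emergency deploy. Moving the same condition into a `precondition` on the policy resource would make it a hard stop.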
4. Tuning Without Weakening the WAF
Most WAF horror stories trace back to a single bad habit: when something gets blocked, the team disables the whole rule group. That converts a precision instrument into a placebo. The right pattern is targeted exclusions.
KQL to baseline blocks before you tune
// Top WAF rule IDs triggering, by host + URI
AzureDiagnostics
| where ResourceType == "APPLICATIONGATEWAYS" and Category == "ApplicationGatewayFirewallLog"
| where action_s in ("Matched","Blocked")
| summarize hits = count() by ruleId_s, hostname_s, requestUri_s, action_s
| order by hits desc
| take 50
Scoped exclusion (right way)
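Exclude rule 942100 only for the JSON field filter on the search API, not everywhere. The exclusion block narrows by rule ID and selector; the URI scope comes from attaching the policy that carries it to the search API's path rule or listener rather than to the whole gateway: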
```hcl
# Terraform: per-rule, per-selector exclusion
managed_rules {
  exclusion {
    match_variable          = "RequestArgNames"
    selector                = "filter"
    selector_match_operator = "Equals"
    excluded_rule_set {
      type    = "OWASP"
      version = "3.2"
      rule_group {
        rule_group_name = "REQUEST-942-APPLICATION-ATTACK-SQLI"
        excluded_rules  = ["942100"]
      }
    }
  }
}
```
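To keep an exclusion like this confined to one URI, the policy that carries it can be attached to a path rule instead of the gateway. A sketch, assuming a second policy resource named `azurerm_web_application_firewall_policy.search` holds the exclusion; the pool, settings, and path names are hypothetical:

```hcl
# Fragment: goes inside resource "azurerm_application_gateway" "this" { ... }
# and needs a request_routing_rule of type "PathBasedRouting" that references
# this map through url_path_map_name.
url_path_map {
  name                               = "upm-${var.app}"
  default_backend_address_pool_name  = "pool-app"  # hypothetical
  default_backend_http_settings_name = "https-app" # hypothetical

  path_rule {
    name                       = "search-api"
    paths                      = ["/api/search/*"] # hypothetical path
    backend_address_pool_name  = "pool-app"
    backend_http_settings_name = "https-app"

    # Only requests matching this path are evaluated with the tuned policy;
    # everything else keeps the stricter gateway-level policy.
    firewall_policy_id = azurerm_web_application_firewall_policy.search.id
  }
}
```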
Anti-pattern (do not ship this)
```hcl
# Disables the entire SQLi rule group for every request - do NOT do this
# (legacy waf_configuration syntax on azurerm_application_gateway)
disabled_rule_group {
  rule_group_name = "REQUEST-942-APPLICATION-ATTACK-SQLI"
}
```
5. Risk Register — Threat, Blast Radius, Mitigation
Every WAF mistake maps to a specific risk. This is the table I bring to architecture reviews:
| Threat / mistake | Blast radius | Mitigation |
|---|---|---|
| Detection left on indefinitely | WAF is a logger, not a control. SQLi, XSS, RFI reach the app. | Time-boxed rollout with a Prevention date in the change record |
| Whole rule group disabled | Entire OWASP category off — silent over-permission | Scoped exclusions per rule ID + selector; weekly review |
| Day-one Prevention, no baseline | Legitimate traffic blocked; emergency rollback erodes security mandate | Mandatory 2–4 week Detection baseline gate |
| Backend reachable directly | Attacker bypasses the WAF | Private Endpoint backends; NSG deny from internet; policy enforce |
| TLS terminates at gateway only | Plaintext between gateway and backend | End-to-end TLS, backend HTTPS settings with pinned root CA |
| No diagnostics | No tuning possible; incident forensics blind | Diagnostic setting to LA on day one; 90-day retention minimum |
| Fixed instance count | Capacity saturation under marketing spike; 502s to customers | Autoscale 2 → 10 minimum on WAF_v2 |
| PHI in WAF diagnostic logs (URI / body capture) | HIPAA breach: ePHI written to Log Analytics and downstream Sentinel/SIEM with broader access scope than the application database | Disable request_body_check capture on PHI endpoints OR scrub at ingestion; segregate WAF workspace; RBAC + customer-managed keys |
| No geo-filter on US-only healthcare SaaS | Expands HIPAA threat surface; non-US recon traffic inflates WAF noise and complicates BAA scope | WAF Policy geo_match custom rule — allow only contracted geographies; alert on denied geos |
6. Monitoring & Alerting That Actually Works
Detection mode is useless without someone watching. Set these alerts on day one — Detection or Prevention:
- Spike in matched rules for any single rule ID (>3x rolling 24h baseline) — early signal of a real attack or a deploy that introduced false positives.
- Blocked requests above threshold per backend — anything above ~0.1% of total traffic in steady state deserves investigation.
- Backend health probe failures — these are routinely mistaken for WAF blocks during incidents; correlate the two.
- Certificate expiry < 30 days — alert from Key Vault, not from production failure.
- Gateway autoscale ceiling hit — your sizing assumption was wrong; investigate before the next traffic peak.
Wire all of these into Microsoft Sentinel or your existing SIEM. WAF logs without correlation are noise; correlated with sign-in logs, NSG flow logs, and backend app logs, they are gold.
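The rule-spike alert from the list above can be expressed as a log search alert. This is a sketch under assumptions: `var.waf_workspace_id` and `var.soc_action_group_id` are hypothetical inputs, and the baseline is simplified to the average hourly hit count over the previous 24 hours.

```hcl
resource "azurerm_monitor_scheduled_query_rules_alert_v2" "waf_rule_spike" {
  name                = "alert-waf-rule-spike-${var.app}"
  resource_group_name = var.rg
  location            = var.location
  description         = "A WAF rule ID is firing at more than 3x its 24h hourly average."
  severity            = 2
  scopes              = [var.waf_workspace_id] # hypothetical variable

  evaluation_frequency = "PT1H"
  window_duration      = "PT1H"
  # The query reads 25 hours of data, which is longer than the window.
  query_time_range_override = "P2D"

  criteria {
    query = <<-KQL
      let waf = AzureDiagnostics
        | where ResourceType == "APPLICATIONGATEWAYS" and Category == "ApplicationGatewayFirewallLog";
      let baseline = waf
        | where TimeGenerated between (ago(25h) .. ago(1h))
        | summarize baseline_per_hour = count() / 24.0 by ruleId_s;
      waf
        | where TimeGenerated > ago(1h)
        | summarize recent_hits = count() by ruleId_s
        | join kind=inner baseline on ruleId_s
        | where recent_hits > 3 * baseline_per_hour
    KQL

    # Fire when the query returns at least one spiking rule ID.
    time_aggregation_method = "Count"
    operator                = "GreaterThan"
    threshold               = 0
  }

  action {
    action_groups = [var.soc_action_group_id] # hypothetical variable
  }
}
```

The inner join means a rule ID with no hits in the baseline window never alerts, so a rule that fires for the very first time needs its own check.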
Continuous tuning lifecycle — review, exclude, delete
A WAF is not a “deploy once” control. Traffic shifts, applications ship new endpoints, ruleset versions change, and attackers probe what worked yesterday. Without a continuous loop the exclusions you added in week three quietly become permanent over-permissions — and that is precisely where attackers find the gap. The discipline I run on every healthcare gateway:
Every exclusion is, by definition, a hole in the managed ruleset. An exclusion added two years ago for an endpoint that no longer exists is a free pass for an attacker who finds a way to reach a matching URI or argument name. Treat the exclusion list like firewall rules — reviewed quarterly, deleted aggressively, owned by a named engineer, and tracked in Git so every change is auditable.
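Ownership and expiry can be made part of the Terraform itself. A minimal sketch, assuming Terraform 1.5 or later; the register structure, owner address, and expiry date are hypothetical:

```hcl
locals {
  # Hypothetical register: one entry per exclusion, each with an owner and an expiry.
  waf_exclusions = {
    search_filter = {
      selector   = "filter"
      rule_group = "REQUEST-942-APPLICATION-ATTACK-SQLI"
      rule_ids   = ["942100"]
      owner      = "search-team@example.com"
      expires    = "2026-03-31T00:00:00Z"
    }
  }

  # Names of exclusions whose expiry is already in the past at plan time.
  expired_exclusions = [
    for name, e in local.waf_exclusions : name
    if timecmp(plantimestamp(), e.expires) > 0
  ]
}

check "waf_exclusions_current" {
  assert {
    condition     = length(local.expired_exclusions) == 0
    error_message = "Expired WAF exclusions need re-validation or deletion: ${join(", ", local.expired_exclusions)}"
  }
}

# Fragment: goes inside managed_rules { ... } of the WAF policy.
dynamic "exclusion" {
  for_each = local.waf_exclusions
  content {
    match_variable          = "RequestArgNames"
    selector                = exclusion.value.selector
    selector_match_operator = "Equals"
    excluded_rule_set {
      type    = "OWASP"
      version = "3.2"
      rule_group {
        rule_group_name = exclusion.value.rule_group
        excluded_rules  = exclusion.value.rule_ids
      }
    }
  }
}
```

Because the register sits in Git next to the policy, every added, extended, or deleted exclusion shows up in a reviewable diff with a named owner.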
KQL to surface rule and URI combinations that matched inside the 90-day window but have been silent for more than 60 days. Tuning tied to them is a strong candidate for review and deletion:
```kusto
// Rule/URI pairs seen in the last 90 days whose most recent hit is more than 60 days old
AzureDiagnostics
| where ResourceType == "APPLICATIONGATEWAYS" and Category == "ApplicationGatewayFirewallLog"
| where TimeGenerated > ago(90d)
| summarize last_hit = max(TimeGenerated), hits = count() by ruleId_s, hostname_s, requestUri_s
| where datetime_diff('day', now(), last_hit) > 60
| order by last_hit asc
```
7. Terraform Baseline — App Gateway + WAF Policy
Deploy the same baseline to every environment. Tuning that happens only in production never gets validated, and exclusions that live only in the portal disappear at the next redeploy.
```hcl
# WAF Policy: start in Detection, flip to Prevention via variable
resource "azurerm_web_application_firewall_policy" "this" {
  name                = "wafp-${var.app}"
  resource_group_name = var.rg
  location            = var.location

  policy_settings {
    enabled                     = true
    mode                        = var.waf_mode # "Detection" until baseline complete
    request_body_check          = true
    file_upload_limit_in_mb     = 100
    max_request_body_size_in_kb = 128
  }

  managed_rules {
    managed_rule_set {
      type    = "OWASP"
      version = "3.2"
    }
  }
}

# Application Gateway WAF v2 with autoscale + Key Vault cert
resource "azurerm_application_gateway" "this" {
  name                = "agw-${var.app}"
  resource_group_name = var.rg
  location            = var.location
  firewall_policy_id  = azurerm_web_application_firewall_policy.this.id

  sku {
    name = "WAF_v2"
    tier = "WAF_v2"
  }

  autoscale_configuration {
    min_capacity = 2
    max_capacity = 10
  }

  ssl_certificate {
    name = "tls-cert"
    # secret_id pins one certificate version; versionless_secret_id lets the
    # gateway follow Key Vault rotation.
    key_vault_secret_id = azurerm_key_vault_certificate.tls.secret_id
  }

  # ... listeners, backend pools, end-to-end TLS settings ...
}
```
Pair this with an Azure Policy assignment at the management group that denies creation of Application Gateways without an attached WAF Policy. In my experience that single guardrail keeps the bulk of the day-one mistakes from ever shipping.
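A sketch of that guardrail as a custom policy. The property alias `Microsoft.Network/applicationGateways/firewallPolicy.id` is an assumption to verify against the current alias list before use, and `var.management_group_id` is a hypothetical input:

```hcl
resource "azurerm_policy_definition" "agw_requires_waf_policy" {
  name                = "deny-agw-without-waf-policy"
  policy_type         = "Custom"
  mode                = "All"
  display_name        = "Application Gateways must have a WAF Policy attached"
  management_group_id = var.management_group_id # hypothetical variable

  policy_rule = jsonencode({
    "if" = {
      allOf = [
        {
          field  = "type"
          equals = "Microsoft.Network/applicationGateways"
        },
        {
          # Assumed alias: confirm it resolves in your tenant.
          field  = "Microsoft.Network/applicationGateways/firewallPolicy.id"
          exists = "false"
        }
      ]
    }
    "then" = {
      effect = "deny"
    }
  })
}

resource "azurerm_management_group_policy_assignment" "agw_requires_waf_policy" {
  name                 = "deny-agw-no-waf" # assignment names are length-limited
  management_group_id  = var.management_group_id
  policy_definition_id = azurerm_policy_definition.agw_requires_waf_policy.id
}
```

Running the definition with an `audit` effect first shows which existing gateways would be denied before the rule starts blocking deployments.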
8. Healthcare Addendum — HIPAA · HITRUST · FHIR (Mandatory)
If your gateway terminates traffic for a healthcare SaaS — a payer portal, a provider app, a claims API, a FHIR endpoint — every architectural decision above gets a compliance dimension on top of the security one. Skipping this section in a healthcare design review is, in my experience, the single fastest way to fail a HITRUST audit on a perfectly good Azure platform.
Application Gateway WAF logs the request URI in full and, when request_body_check is enabled, can capture the matched portion of the request body. On a healthcare API where the URI carries identifiers (/Patient/12345, /claims/MRN/...) or the body carries ePHI, those values land in Log Analytics, Sentinel, and any downstream SIEM. That is an unintended PHI store with its own access model. Plan for it before you turn WAF on, not after the auditor finds it.
HIPAA §164.312 — how the gateway satisfies each technical safeguard
| HIPAA control | Gateway / WAF implementation |
|---|---|
| §164.312(a)(1) Access Control | Private Endpoint backends; NSG deny-from-internet; gateway is the only ingress; whichever WAF tier fronts the app (Front Door or Application Gateway) cannot be bypassed |
| §164.312(a)(2)(iv) Encryption & Decryption | TLS 1.2+ at the listener; end-to-end TLS to backend; cipher suite policy restricted (no TLS 1.0/1.1, no legacy ciphers) |
| §164.312(b) Audit Controls | Diagnostic settings → Log Analytics → Sentinel; 6-year retention minimum; immutable archive tier after 90 days |
| §164.312(c)(1) Integrity | WAF Prevention mode enforces request integrity (no tampered headers/bodies passed to backend); OWASP CRS 3.2 + bot manager |
| §164.312(d) Person / Entity Authentication | Gateway enforces mTLS / OAuth at the listener where required; Microsoft Entra ID (formerly Azure AD) SSO for the admin plane; never anonymous to backend |
| §164.312(e)(1) Transmission Security | End-to-end TLS — terminating at the gateway and sending plaintext to a private subnet still fails this control; pin backend root CA |
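The encryption and transmission rows translate into three blocks on the gateway. A sketch, assuming the backend certificate chains to a private root CA; the certificate path and block names are hypothetical:

```hcl
# Fragment: goes inside resource "azurerm_application_gateway" "this" { ... }

# Listener side: predefined policy with a TLS 1.2 floor and modern ciphers.
ssl_policy {
  policy_type = "Predefined"
  policy_name = "AppGwSslPolicy20220101"
}

# Root CA the gateway trusts when it re-encrypts to the backend.
trusted_root_certificate {
  name = "backend-root-ca"
  data = filebase64("${path.module}/certs/backend-root-ca.cer") # hypothetical path
}

# Backend side: HTTPS only, validated against the trusted root above.
backend_http_settings {
  name                                = "https-app"
  protocol                            = "Https"
  port                                = 443
  cookie_based_affinity               = "Disabled"
  request_timeout                     = 30
  pick_host_name_from_backend_address = true
  trusted_root_certificate_names      = ["backend-root-ca"]
}
```

If the backend presents a certificate from a well-known public CA, the `trusted_root_certificate` block can usually be dropped, at the cost of the explicit pinning the table asks for.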
HITRUST CSF — the controls auditors actually look at
- 01.j Network Access Control — gateway is the documented ingress; backend cannot be reached directly (proven by an explicit deny rule, not by absence).
- 09.s On-line Transactions — WAF inspects and protects every online PHI transaction in Prevention mode; Detection-only is a finding.
- 09.aa Audit Logging — WAF + backend logs correlated in Sentinel; retention ≥ 6 years for HIPAA, with documented review cadence.
- 10.f Cryptographic Key Management — TLS certificate sourced from Key Vault with auto-rotation and HSM-backed keys; no certs on the gateway file system.
- 10.m Control of Technical Vulnerabilities — managed ruleset version pinned and reviewed quarterly; CVE-driven custom rules added between releases.
FHIR & claims API tuning patterns
These are the exclusions I add early in every healthcare WAF rollout — they show up in the baseline within hours and they are safe when scoped tightly:
- FHIR _search with complex query parameters: exclude rule 942100 (SQLi) for RequestArgNames selectors filter, _filter, composite on /fhir/*/_search only.
- Bulk export job IDs: exclude 920420 (request content type) for the application/fhir+ndjson path; do not disable the rule globally.
- X12 / EDI claim payloads: large bodies trigger 920170 / 920270; raise max_request_body_size_in_kb only for the claims listener, not platform-wide.
- OAuth callback URLs from payer / provider IdPs: exclude 931130 (RFI) for RequestArgNames=redirect_uri, scoped to the /oauth/callback URI.
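The claims-listener pattern above can be sketched as a dedicated policy attached at the listener. The body limit, host name, and frontend names are hypothetical; size the limit from the X12 payloads observed in the baseline.

```hcl
# Dedicated policy for the claims surface only.
resource "azurerm_web_application_firewall_policy" "claims" {
  name                = "wafp-${var.app}-claims"
  resource_group_name = var.rg
  location            = var.location

  policy_settings {
    enabled                     = true
    mode                        = var.waf_mode
    request_body_check          = true
    file_upload_limit_in_mb     = 100
    max_request_body_size_in_kb = 1024 # illustrative value, not a recommendation
  }

  managed_rules {
    managed_rule_set {
      type    = "OWASP"
      version = "3.2"
    }
  }
}

# Fragment: goes inside resource "azurerm_application_gateway" "this" { ... }
http_listener {
  name                           = "listener-claims"
  frontend_ip_configuration_name = "feip-public"        # hypothetical
  frontend_port_name             = "port-443"           # hypothetical
  protocol                       = "Https"
  host_name                      = "claims.example.com" # hypothetical
  ssl_certificate_name           = "tls-cert"

  # Listener-level policy overrides the gateway-level one for this host only.
  firewall_policy_id = azurerm_web_application_firewall_policy.claims.id
}
```

Every other listener keeps the 128 KB limit from the baseline policy, so the larger body allowance never becomes a platform-wide setting.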
Geo-restriction is a compliance lever, not just security
For US-only healthcare SaaS the WAF policy should explicitly allow only the contracted geographies. It shrinks the noise the SOC tunes against, and it reduces the BAA threat surface you must defend. A short custom rule:
```hcl
# Fragment: goes inside the WAF policy resource.
# Blocks any request that does not geo-match the allow list (US only here;
# add other contracted country codes to match_values).
custom_rules {
  name      = "geo-allow-us-only"
  priority  = 10
  rule_type = "MatchRule"
  action    = "Block"

  match_conditions {
    match_variables {
      variable_name = "RemoteAddr"
    }
    operator           = "GeoMatch"
    negation_condition = true
    match_values       = ["US"]
  }
}
```
For any listener carrying ePHI: Prevention mode is the only acceptable steady state, end-to-end TLS is non-negotiable, the WAF Log Analytics workspace is treated as a regulated data store (CMK, private link, RBAC, 6-year retention), and every exclusion has an expiry date that re-validates against current FHIR / X12 traffic.
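Treating the workspace as a regulated store might look like the following sketch. It covers network isolation, long-term retention, and RBAC only; CMK needs a dedicated Log Analytics cluster and is left out. The resource names and `var.soc_phi_readers_group_object_id` are hypothetical, and `total_retention_in_days` assumes a recent azurerm provider.

```hcl
resource "azurerm_log_analytics_workspace" "waf_phi" {
  name                = "law-waf-phi-${var.app}"
  resource_group_name = var.rg
  location            = var.location
  sku                 = "PerGB2018"
  retention_in_days   = 90 # interactive retention

  # Queries and ingestion go through Private Link, not the public endpoints.
  internet_ingestion_enabled = false
  internet_query_enabled     = false
}

# Keep the firewall log table for six years in total (90 days hot, rest archived).
resource "azurerm_log_analytics_workspace_table" "azure_diagnostics" {
  workspace_id            = azurerm_log_analytics_workspace.waf_phi.id
  name                    = "AzureDiagnostics"
  retention_in_days       = 90
  total_retention_in_days = 2190
}

# Read access limited to one named group rather than inherited broadly.
resource "azurerm_role_assignment" "waf_log_readers" {
  scope                = azurerm_log_analytics_workspace.waf_phi.id
  role_definition_name = "Log Analytics Reader"
  principal_id         = var.soc_phi_readers_group_object_id # hypothetical variable
}
```

Disabling the public endpoints only helps once an Azure Monitor Private Link Scope is in place for the analysts who query the workspace, so plan the two together.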
9. Architect's Checklist
- WAF v2 (not Standard_v2). OWASP CRS 3.2 attached.
- Mode = Detection at launch. Calendar entry for the Prevention cutover.
- Diagnostic logs to Log Analytics — day one, every environment.
- Per-rule, per-selector exclusions in IaC. No disabled rule groups.
- Private Endpoint backends. Public access disabled by policy.
- End-to-end TLS. Cert from Key Vault, auto-rotation enabled.
- Autoscale 2 → 10 minimum. Alert on ceiling reached.
- Sentinel rules for block-spike, rule-spike, backend health, cert expiry.
- Azure Policy: deny App Gateway without WAF Policy attached.
- Quarterly WAF tuning review. Exclusions expire if not re-validated.
Healthcare workloads — add these to the checklist:
- HIPAA §164.312 mapping documented per listener (access, encryption, audit, integrity, transmission).
- WAF diagnostic workspace treated as PHI: CMK-encrypted, Private Link, RBAC-locked, 6-year retention.
- PHI scrubbing or body-capture suppression on endpoints whose URI or body carries identifiers.
- Geo-allow list (custom rule) restricted to contracted countries; alert on denied geos.
- FHIR / X12 exclusions are scoped per-URI, per-selector, with expiry dates.
- TLS 1.2+ minimum, modern cipher policy; TLS 1.0/1.1 explicitly denied.
- BAA scope diagram includes the gateway, the WAF policy, the diagnostic workspace, and Key Vault.
- HITRUST controls 01.j, 09.s, 09.aa, 10.f, 10.m have named evidence owners.
10. Further Reading — Official Documentation
Everything in this article is grounded in current Microsoft guidance. These are the primary references I keep open during a WAF review — follow them for the authoritative configuration details and the latest version notes:
Application Gateway & WAF — core
- Application Gateway overview — service model, SKUs, listeners, backend pools.
- WAF on Application Gateway — overview — Detection vs Prevention, modes, policy model.
- Configure WAF rules and exclusions — the authoritative reference for per-rule, per-selector exclusions.
- OWASP CRS rule groups & rules — every managed rule ID and its purpose (3.2, 3.1, 3.0).
- Custom WAF rules — syntax for geo, IP, rate-limit, match-condition custom rules.
- WAF best practices — Microsoft's own deployment guidance; aligns with the Detection → Prevention pattern above.
Operations — logging, monitoring, troubleshooting
- Diagnostic settings & logs — enabling access, performance, and firewall logs to Log Analytics.
- WAF metrics & alerts — metrics to alert on for the monitoring section above.
- WAF troubleshooting — the official runbook for diagnosing false positives before you add an exclusion.
- Request size & body limits — essential reading before tuning for X12 / FHIR bulk payloads.
Security & TLS
- TLS / SSL on Application Gateway — listener-side cipher policy.
- End-to-end TLS — step-by-step for backend re-encryption.
- Key Vault certificates with Application Gateway — the documented pattern for the Terraform baseline above.
- Microsoft Sentinel data connectors — the WAF connector for correlated detection.
Compliance — HIPAA & HITRUST
- Azure HIPAA / HITECH offering — Microsoft's published mapping of Azure services to HIPAA Security Rule controls.
- Azure HITRUST CSF offering — HITRUST scope and Microsoft's shared-responsibility statement.
- Healthcare landing zone reference architecture — broader pattern the WAF sits inside.
Infrastructure-as-Code
- Terraform azurerm_web_application_firewall_policy: every argument in the policy snippet above.
- Terraform azurerm_application_gateway: listeners, backend settings, autoscale, Key Vault certs.
Key Takeaway
Application Gateway WAF is not a checkbox — it is an operational discipline. Stand it up in Detection, baseline real traffic for weeks, tune surgically, then flip to Prevention with a date you committed to in advance. For healthcare that discipline becomes a compliance obligation: Prevention is the only acceptable steady state for ePHI listeners, every exclusion is scoped and expires, every diagnostic byte is treated as PHI, and HIPAA §164.312 / HITRUST evidence is generated by the architecture — not added by the audit team six months later. Do that and you ship a WAF that protects the application and the BAA.