SAML: The Undead Protocol That Keeps Haunting Me

SAML is the undead protocol of enterprise identity: you think you’ve moved on, and then a “simple SSO integration” drags you back into XML, metadata, and a pile of edge cases.

My first serious run-in was as a university sysadmin, deploying Shibboleth-backed SSO for Mattermost and a couple of in-house apps. I assumed SAML would eventually lose to OIDC on ergonomics. Fast forward to 2024: OIDC is strong, but SAML is still very much alive in enterprise land.

Why SAML breaks (and sometimes becomes a security problem)

If you want the mental model, it’s this: SAML relies on XML signatures (XML-DSig), and XML-DSig relies on canonicalization and precise element selection. That combination is brittle in ways that JSON and JWT simply aren’t. JWT isn’t automatically safe; it’s just a simpler representation with fewer parser edge cases than XML-DSig.

The failure modes I keep seeing fall into a few archetypes:

1) Validate one thing, use another

This is the classic shape behind signature wrapping issues and “element selection” mistakes: you validate a signature that covers some element, then you consume identity attributes from a different element.

Signature wrapping bugs are usually element-selection and reference-resolution mistakes in the application; round-trip instability is a separate (but related) representation-mismatch hazard that can make those mistakes exploitable in surprising ways.

This is also where the Mattermost 2020/2021 disclosure clicked for me. XML round-trip instability (as in Go’s encoding/xml and, later, other parsers) means that parsing and re-serializing a document can change its structure. Any pipeline that transforms or serializes XML between “validation” and “use” can therefore end up trusting a different element than the one it checked.

That’s not just an annoying bug. If your SAML pipeline validates after a transformation/round-trip step, while other code reads from an earlier representation, you can end up validating one structure and authenticating with data from another. That’s identity confusion, and in the worst case, auth bypass.

(Mattermost write-ups: coordinated disclosure and securing XML implementations across the web.)
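A minimal sketch of the "use exactly what you validated" rule, in Python with only the standard library. It assumes a separate signature verifier has already run and reported the ID of the element it covered; `signed_id` and the function name are hypothetical, and this code does not verify signatures itself. The point is only the selection logic: refuse anything with more than one Assertion, and read NameID from the signed one and nowhere else.

```python
import xml.etree.ElementTree as ET

SAML_NS = "urn:oasis:names:tc:SAML:2.0:assertion"


def name_id_from_signed_assertion(response_xml: str, signed_id: str) -> str:
    """Return NameID only from the assertion whose ID the signature covered.

    signed_id is assumed to come from a signature verifier that has already
    validated the signature and reports which element ID it referenced.
    """
    root = ET.fromstring(response_xml)
    # Count every Assertion anywhere in the tree, including wrapped copies.
    assertions = list(root.iter(f"{{{SAML_NS}}}Assertion"))
    if len(assertions) != 1:
        raise ValueError(f"expected exactly 1 Assertion, found {len(assertions)}")
    assertion = assertions[0]
    if assertion.get("ID") != signed_id:
        raise ValueError("the only Assertion is not the signed element")
    # Look for NameID only under this assertion's own Subject.
    name_ids = assertion.findall(f"{{{SAML_NS}}}Subject/{{{SAML_NS}}}NameID")
    if len(name_ids) != 1 or not (name_ids[0].text or "").strip():
        raise ValueError("expected exactly one non-empty NameID")
    return name_ids[0].text.strip()
```

Ideally the verifier hands back the validated element itself so you never re-parse at all; this sketch is the fallback shape when it only gives you an ID.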

2) Trust isn’t pinned tightly enough

SAML has a lot of fields that must line up with what you expect:

Issuer / EntityID you trust
Audience you’re willing to accept
Destination / Recipient / ACS URL you expect
InResponseTo that matches a request you initiated (or, if you accept unsolicited responses, a real replay story)

If you accept values you didn’t pin, or you trust unsigned fields, you get “works in test, weird in prod” at best and security issues at worst.
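The pinning above can be sketched as a plain comparison against values the SP holds as configuration. This is an illustration, not any library's API: `SpExpectations`, `check_alignment`, and the keys of the `fields` dict are all hypothetical names, and `fields` is assumed to have been extracted from the signed part of the message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpExpectations:
    idp_entity_id: str  # the Issuer we trust
    sp_entity_id: str   # our own EntityID, expected as an Audience
    acs_url: str        # where responses must be addressed


def check_alignment(fields: dict, expected: SpExpectations, pending_request_ids: set) -> list:
    """Return a list of mismatches; an empty list means everything lines up."""
    problems = []
    if fields.get("issuer") != expected.idp_entity_id:
        problems.append("Issuer is not the pinned IdP EntityID")
    if expected.sp_entity_id not in fields.get("audiences", []):
        problems.append("Audience does not include this SP")
    for key in ("destination", "recipient"):
        if fields.get(key) != expected.acs_url:
            problems.append(f"{key} does not match the ACS URL")
    if fields.get("in_response_to") not in pending_request_ids:
        problems.append("InResponseTo does not match a request we sent")
    return problems
```

Returning every mismatch instead of failing on the first one is a deliberate choice here: when an integration is misconfigured, it is usually misconfigured in more than one field.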

3) Time, replay, and metadata drift

Operationally, the most common pain is boring:

NotBefore / NotOnOrAfter + clock skew
replay handling (InResponseTo, assertion IDs)
IdP metadata drift and certificate rotation

Half of “SAML is broken” incidents are “the IdP rotated signing cert” or “the metadata changed and nobody told the SP”.
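For the time and replay part, here is a rough Python sketch of a validity-window check with explicit skew, plus a tiny replay cache keyed on assertion ID. The two-minute skew is an assumption to tune per deployment, the names are hypothetical, and the cache is in-memory and single-process only; a real SP with several instances would need shared storage.

```python
from datetime import datetime, timedelta, timezone

ALLOWED_SKEW = timedelta(minutes=2)  # assumption: tune per deployment


def parse_saml_instant(value: str) -> datetime:
    """Parse a UTC SAML timestamp such as 2024-05-01T12:00:00Z."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        raise ValueError("timestamp has no timezone; refusing to guess")
    return parsed


def within_validity(not_before: str, not_on_or_after: str, now: datetime) -> bool:
    """True if now falls inside [NotBefore, NotOnOrAfter), widened by the skew."""
    start = parse_saml_instant(not_before) - ALLOWED_SKEW
    end = parse_saml_instant(not_on_or_after) + ALLOWED_SKEW
    return start <= now < end


class ReplayCache:
    """Remember assertion IDs until they expire; in-memory, single-process only."""

    def __init__(self):
        self._seen = {}

    def first_use(self, assertion_id: str, expires: datetime, now: datetime) -> bool:
        # Drop entries whose validity window (plus skew) has passed.
        self._seen = {k: v for k, v in self._seen.items() if v + ALLOWED_SKEW > now}
        if assertion_id in self._seen:
            return False
        self._seen[assertion_id] = expires
        return True
```

Passing `now` in explicitly, rather than calling the clock inside the functions, makes the skew behaviour easy to test and easy to log.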

Why debugging feels awful

SAML is still the default in many enterprise IdP integrations, especially legacy SaaS and older IdP configurations. Debugging it remains an exercise in patience.

The typical failure loop looks like this:

User attempts login.
The IdP decides something is wrong (missing group, missing attribute, wrong app assignment).
The IdP returns a generic SAML failure status, or logs the real reason only in the IdP event logs.
The SP discards the response (bad signature, wrong Audience, wrong Destination, cert mismatch) or treats any failure as “start login again”.
The user gets a silent redirect loop until someone turns on debug logs in the right place.

In other words: the reason often exists, but not where the user can see it.

How to survive debugging it (a checklist)

The few things that repeatedly save time:

Log correlation IDs: Response ID, Assertion ID, InResponseTo, RelayState, and whatever session/request ID your app uses.
Decode first, then reason: base64-decode, inflate the payload if it arrived via the HTTP-Redirect binding (which DEFLATE-compresses it), pretty-print the XML, then inspect fields.
Check the “alignment” fields: Issuer, Audience, Destination/Recipient/ACS URL, InResponseTo.
Check time and replay: NotBefore / NotOnOrAfter, acceptable skew, and a real replay story.
Pin trust to metadata: validate against IdP metadata, and handle certificate rotation deliberately (refresh metadata on a schedule, alert on cert/key drift, stage changes).
Record SP expectations as code: EntityID, ACS URL, expected binding(s), required signatures, and what claims you require.
Be explicit about what shapes you accept: signed assertion vs signed response, signature placement, and which bindings you enable (only enable the ones you actually use).
Reject weak crypto: reject SHA-1-based signature and digest algorithms, require SHA-256 or stronger, and make mismatches obvious in logs.
Reject ambiguity: reject messages with multiple candidates for the identity-bearing element (multiple Assertions / multiple Subject/NameID paths) unless you have deterministic selection rules.
Harden the XML parser: disable DTDs/external entities (no XXE), and avoid “helpful” features that fetch URLs.
Treat RelayState as hostile: don’t treat it as an arbitrary URL; bind it to your session/login attempt and allowlist any redirects.
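The "decode first" step from the checklist is small enough to sketch in Python with the standard library. The function names are mine, and the inputs are assumed to be the raw `SAMLResponse` form field (HTTP-POST) or the raw `SAMLRequest`/`SAMLResponse` query value (HTTP-Redirect). This only decodes for inspection; it validates nothing.

```python
import base64
import zlib
from urllib.parse import unquote


def decode_post_binding(saml_response: str) -> str:
    """HTTP-POST binding: the form field is plain base64 of the XML."""
    return base64.b64decode(saml_response).decode("utf-8")


def decode_redirect_binding(query_value: str, already_url_decoded: bool = False) -> str:
    """HTTP-Redirect binding: URL-encoded, then base64, then raw DEFLATE."""
    raw = query_value if already_url_decoded else unquote(query_value)
    compressed = base64.b64decode(raw)
    # wbits=-15 selects a raw DEFLATE stream with no zlib header or checksum.
    return zlib.decompress(compressed, -15).decode("utf-8")
```

If `decode_redirect_binding` fails with a zlib error, the usual cause is that a framework already URL-decoded the value, or that the message was actually sent via POST and was never compressed.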

If you need to escalate to an IdP admin (or file an incident ticket), collect:

IdP metadata version (or last-updated timestamp) and signing cert fingerprint
SP EntityID and ACS URL you expect
Response ID, Assertion ID, InResponseTo
Issuer, Audience, Destination/Recipient
Exact timestamps + timezone + observed clock skew
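For the signing cert fingerprint, a short Python sketch: it hashes the DER bytes taken from the base64 text of an `X509Certificate` element in the IdP metadata. The function name is hypothetical, and I am assuming you have already pulled that element's text out of the metadata. The output is uppercase colon-separated hex, which should be easy to compare by eye against what an IdP admin reads off their console.

```python
import base64
import hashlib


def cert_fingerprint_sha256(x509_b64: str) -> str:
    """SHA-256 fingerprint of the DER certificate in a metadata X509Certificate element."""
    # Metadata usually wraps the base64 text across lines; strip all whitespace first.
    der = base64.b64decode("".join(x509_b64.split()))
    digest = hashlib.sha256(der).hexdigest().upper()
    # Group the hex digest into colon-separated byte pairs.
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))
```

Agree with the IdP admin on the hash algorithm up front; comparing a SHA-1 fingerprint against a SHA-256 one is a surprisingly common way to lose an hour.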

Don’t log raw assertions in production

It’s tempting to “just dump the SAML response” while you debug. Don’t do that in production:

Assertions often contain PII (emails, names), group membership, and sometimes opaque identifiers you don’t want in logs.
Many orgs retain logs for a long time, and log access is often wider than you think.

What I capture instead (enough to debug, low leakage):

Response ID / Assertion ID / InResponseTo
Issuer / Audience / Destination / ACS URL
signing cert fingerprint (from metadata) + which metadata version was used
timestamp fields + observed clock skew
RelayState and your internal request/session correlation ID
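One way to keep yourself honest about that list is an allowlist: build the log record from named fields only, so NameID and attributes cannot leak by accident. A small Python sketch, where `SAFE_KEYS`, `safe_log_record`, and the key names are all assumptions of mine rather than anything standard:

```python
import json

# Only these keys are ever written to logs; everything else is dropped.
SAFE_KEYS = (
    "response_id", "assertion_id", "in_response_to",
    "issuer", "audience", "destination", "acs_url",
    "cert_fingerprint", "metadata_version",
    "not_before", "not_on_or_after", "observed_skew_seconds",
    "relay_state", "request_id",
)


def safe_log_record(fields: dict) -> str:
    """Serialize only allowlisted fields as one JSON log line."""
    record = {key: fields[key] for key in SAFE_KEYS if key in fields}
    return json.dumps(record, sort_keys=True)
```

An allowlist fails closed: when someone adds a new attribute to the parsed response, it stays out of the logs until somebody deliberately adds it to `SAFE_KEYS`.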

The worst debugging trick: turning off verification

When SAML integrations go sideways, you’ll eventually find a doc thread or config snippet that says “just disable signature verification and see if it works.” And yeah, you can often make the happy-path login appear to work that way.

That’s also the point where SAML stops being “enterprise SSO” and becomes “trust whatever a browser posts.”

Sometimes the misconfiguration is subtle (and the docs are sloppy about it): you might be disabling response signature validation but still requiring signed assertions, or vice versa. Or you might be turning off encryption. The nuance matters for debugging, but the punchline is the same: you’re removing the security properties that make SAML worth doing in the first place.

If you need it as a temporary diagnostic:

do it only in a locked-down sandbox
timebox it (minutes, not days)
log loudly that verification is off
revert before real users touch it

In production, treat “unsigned assertions accepted” as a sev-0 misconfiguration, not a convenience.
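If your SP exposes a "skip verification" switch at all, one option is a startup guard that refuses to run with it in production and logs loudly everywhere else. This is a hypothetical sketch: the `environment` and `verify_signatures` settings are assumptions about how your app is configured, not flags from any particular SAML library.

```python
import logging

log = logging.getLogger("saml")


def enforce_verification_policy(environment: str, verify_signatures: bool) -> None:
    """Fail startup if verification is off in production; warn loudly elsewhere."""
    if verify_signatures:
        return
    if environment == "production":
        raise RuntimeError("refusing to start: SAML signature verification is disabled")
    log.critical("SAML signature verification is OFF in %s; sandbox use only", environment)
```

Calling this once at startup, before the ACS endpoint is registered, means the dangerous configuration never serves a single request in production.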

If you’re debugging in a browser flow, also check cookie attributes and callback URLs. If you’re debugging an API client, watch for libraries trying to “help” with redirects.

I built a small SAML Response Analyzer to make the first steps less painful: it runs locally in your browser (no uploads), decodes SAML responses and query bindings, and surfaces common fields (Issuer, Destination, Audience, NameID, attributes) alongside the decoded XML.

Why SAML persists

SAML persists for a few reasons:

It’s embedded in procurement checklists and old SaaS.
IdP admins already have SAML playbooks, templates, and muscle memory.
Some vendors never built OIDC properly (or only for their “new” product line).

If you get to choose, prefer OIDC. If you don’t, treat SAML like a crypto protocol, not an auth checkbox.

Update (2026-02-09): Acceptance

I’ve stopped waiting for SAML to die.

If an enterprise integration needs SAML, I’ll implement SAML. I don’t have to like it, but I do want it to be correct, testable, and boring to operate.