Why Email Deliverability Monitoring Breaks at Scale

Deliverability monitoring works well when you have one ESP, one domain, and one team responsible for email. Most guides are written for exactly that scenario.

But most organizations sending at scale do not operate that scenario. They have multiple ESPs serving different use cases. They have multiple brands or business units with separate domains. They have transactional email running through infrastructure that marketing teams never see. And they have deliverability signals scattered across systems that were never designed to talk to each other.

This article explains the specific points at which standard deliverability monitoring breaks as organizations grow, and why the tooling most teams rely on was not built for the environment they are actually operating in.

Here is a representative scenario: a B2C retailer runs transactional email through SendGrid and marketing campaigns through Klaviyo. A Microsoft throttling problem develops in the Klaviyo stream over four days. The SendGrid dashboard remains clean throughout. The Klaviyo dashboard shows a modest bounce rate increase but nothing that triggers a threshold alert. It is only when the team exports data from both systems and compares them manually that the pattern becomes clear: 4xx deferral rates at Microsoft have been climbing for days, concentrated on the Klaviyo sending subdomain. By the time the investigation concludes, four days of marketing traffic have been delivered with significant latency to Microsoft inboxes, and the reputation signal has already moved.


The single-ESP assumption

Virtually every deliverability monitoring tool and best practice guide assumes a single sending environment. One ESP. One account. One set of IPs. One dashboard.

That assumption holds for small senders. It breaks for mid-to-large organizations in predictable ways.

A retail company might send transactional email (order confirmations, shipping notifications, and password resets) through one ESP while running marketing campaigns through another. A SaaS company might have product notification email running through SendGrid while its marketing team operates in Brevo and its sales team sends sequences through a separate outbound platform. A financial services company might route different product lines through different ESPs for compliance or contractual reasons.

In each of these cases, deliverability problems do not respect the boundary between systems. A reputation problem that starts in one stream can affect shared IP pools, shared domain reputation, or shared authentication infrastructure. A change to sending patterns in the marketing platform can affect how mailbox providers evaluate traffic from the transactional platform, because both share the same organizational domain.

Monitoring each ESP independently misses these cross-system effects entirely.


The schema fragmentation problem

When an email fails to deliver, every ESP records that event. But they record it differently.

SendGrid uses its own event schema. A hard bounce in SendGrid is a JSON object with specific field names, a specific structure for the SMTP response code, and specific metadata fields. Brevo uses a different schema. Mailgun uses a different schema. Klaviyo uses a different schema. PowerMTA, KumoMTA, and GreenArrow, the major on-premise MTAs, write event data to logs and accounting files in formats that are incompatible with cloud ESP webhook payloads.

This is not a minor formatting difference. The same underlying event — a 550 5.1.1 rejection from Microsoft's mail servers — will look structurally different depending on which system generated the record. Comparing bounce rates across ESPs requires normalizing those records into a common schema first. That normalization is manual work, and it is not a one-time task. Every time an ESP updates its event format, the normalization breaks.
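As a concrete illustration of what that normalization layer involves, here is a minimal sketch that maps bounce payloads from two ESPs onto one common record. The field names are approximations for illustration; real webhook payloads vary by ESP and by ESP version, which is exactly why this layer keeps breaking.

```python
# Sketch of normalizing bounce events from two ESP webhook formats into one
# common record. Field names are illustrative, not authoritative: real
# payloads differ by ESP and change across ESP versions.

def normalize_bounce(source: str, payload: dict) -> dict:
    """Map an ESP-specific bounce payload onto a common schema."""
    if source == "sendgrid":
        return {
            "source": "sendgrid",
            "recipient": payload["email"],
            "timestamp": payload["timestamp"],     # Unix epoch seconds
            "smtp_status": payload.get("status"),  # e.g. "5.1.1"
            "reason": payload.get("reason"),
        }
    if source == "mailgun":
        delivery = payload.get("delivery-status", {})
        return {
            "source": "mailgun",
            "recipient": payload["recipient"],
            "timestamp": payload["timestamp"],
            "smtp_status": delivery.get("code"),
            "reason": delivery.get("message"),
        }
    raise ValueError(f"no normalizer for source {source!r}")
```

Every ESP added to the stack means another branch like these, and every upstream schema change means revisiting the corresponding branch.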

For organizations running two or three ESPs, this means building and maintaining custom ETL pipelines. For organizations running more, it means either investing heavily in data engineering or accepting that cross-ESP analysis will always be delayed, partial, and error-prone.


The reporting delay problem

Deliverability problems move fast. A reputation shift at Gmail can affect inbox placement within hours. A new blocklist entry can appear and start rejecting traffic within minutes. An IP that tips into a spam trap cluster can go from clean to problematic in a single sending session.

The tools most teams use to monitor reputation do not operate at that speed.

Google Postmaster Tools reports domain reputation and spam rates with a delay of one to two days, and only for domains you have individually registered and verified in the tool. A subdomain that has not been explicitly added to Postmaster Tools will not appear in the reporting. A reputation event that happens on Tuesday afternoon may not be visible until Thursday; by then, a team managing significant volume has sent hundreds of thousands of additional messages into a deteriorating environment. More significantly, Google has announced that domain reputation will be removed from Postmaster Tools in the near future, which eliminates one of the primary signals the tool currently provides.

Microsoft's SNDS (Smart Network Data Services) reports IP reputation and spam trap data, but on its own schedule and with its own latency. SNDS data is available the following day, in a format that requires separate access and separate interpretation from ESP event data.

ESP dashboards show delivery status in real time, but only for traffic within that ESP. They do not show reputation trends. They do not show what is happening at other providers. They do not correlate a bounce spike with the reputation signal that caused it.

The result is that the most current information available to a deliverability team is the ESP bounce rate, which shows outcomes without causes, while the tools that show causes operate on a one-to-two day delay. Diagnosing what is happening in real time requires triangulating between systems that are never synchronized.


The MTA and ESP divide

Organizations sending at the highest volumes often operate a combination of cloud ESPs and on-premise or dedicated MTAs. PowerMTA, GreenArrow, and similar platforms handle high-volume sending with direct control over IP management, retry behavior, and throttling policy. They generate detailed SMTP-level event data: every connection attempt, every response code, every retry, every deferral, every delivery.

That data is operationally valuable. It is also structurally incompatible with cloud ESP event streams.

PowerMTA writes accounting files in a columnar log format. GreenArrow writes to a PostgreSQL database or pushes delivery events to endpoints. Neither produces events in a format that can be directly compared to SendGrid webhook payloads or Klaviyo event exports without significant transformation.
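To make the transformation cost concrete, here is a sketch of parsing a PowerMTA-style accounting file into the same kind of normalized record an ESP webhook would produce. The column layout below is an assumption for illustration: real accounting files have a site-configurable column set, so any parser must be written against the specific configuration in use.

```python
import csv
import io

# Hypothetical PowerMTA-style accounting layout. Real accounting files have a
# site-configurable column set; this five-column layout is an assumption.
ACCT_FIELDS = ["type", "timeLogged", "rcpt", "dsnStatus", "dlvDestinationIp"]

# Record-type codes are also illustrative: 'd' delivered, 'b' bounced,
# 't' transient failure (deferral).
EVENT_MAP = {"d": "delivered", "b": "bounced", "t": "deferred"}

def parse_accounting(text: str):
    """Yield per-recipient delivery records from a columnar accounting file."""
    for row in csv.DictReader(io.StringIO(text), fieldnames=ACCT_FIELDS):
        yield {
            "source": "pmta",
            "event": EVENT_MAP.get(row["type"], "other"),
            "recipient": row["rcpt"],
            "smtp_status": row["dsnStatus"],
        }
```

Only after this step can MTA deferrals and ESP bounces even appear in the same query.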

For organizations using both cloud ESPs and on-premise MTAs, this means that SMTP-level intelligence from the MTA and delivery event data from the ESP exist in separate systems with no automatic connection. A deferral pattern building in the MTA logs is invisible to anyone watching only the ESP dashboard, and vice versa.


The multi-domain complexity problem

As sending volume grows, most organizations adopt a subdomain strategy to isolate reputation across different sending streams. Transactional email sends from one subdomain. Marketing sends from another. Re-engagement campaigns send from a third. Each subdomain builds its own reputation independently.

This isolation is valuable, but it multiplies the monitoring surface. Instead of watching one domain, a team is watching three, five, or ten. Postmaster Tools provides separate views for each sending domain. SNDS provides separate views for each sending IP. Blocklist monitors need to check each domain and IP combination independently.
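The per-IP blocklist checks are mechanical but numerous. As a sketch of the mechanics (most IP-based DNSBLs answer a DNS query for the reversed-octet form of the IP under the blocklist zone), something like the following has to run for every IP against every zone a team cares about. The zone name here is illustrative, and commercial blocklists typically have their own query volume and usage terms.

```python
import socket

def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the reversed-octet lookup name used by IPv4 DNS blocklists."""
    return ".".join(reversed(ip.split("."))) + "." + zone

def is_listed(ip: str, zone: str) -> bool:
    """True if the IP resolves in the given DNSBL zone, i.e. is listed.
    A listing typically resolves to a 127.0.0.x answer; NXDOMAIN means clean."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False
```

With ten IPs and five zones, that is fifty lookups per monitoring pass, and the pass has to run continuously to be useful.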

The aggregated picture — what is actually happening across all of this infrastructure simultaneously — does not exist in any standard tool. It has to be assembled manually, pulling data from each source and combining it into a view that no individual tool provides.

For teams managing this manually, the practical result is that monitoring is always incomplete. There is simply not enough time to check every domain, every IP, every source, every day. Monitoring becomes sampling: checking the most important streams and hoping the ones not checked are behaving normally.


What breaks first

When monitoring is manual and fragmented, the problems that get caught are the ones that are obvious. A hard bounce rate that spikes from 0.5% to 15% in an ESP dashboard is hard to miss. A soft bounce rate that drifts from 3% to 6% over two weeks is easy to miss, especially if it is happening in a subdomain that is checked less frequently.

The soft, gradual problems are consistently the most expensive. A domain reputation that degrades slowly does not trigger a threshold alert. It just results in progressively worse inbox placement over weeks, until a sender notices that engagement metrics have been declining without an obvious cause. By that point, the reputation damage has been accumulating for weeks and the recovery process takes longer than the initial degradation did.
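One way to catch this class of problem is to alert on relative drift rather than an absolute threshold: compare a recent window against a trailing baseline. The sketch below is a minimal version of that idea; the window sizes and the 1.5x ratio are arbitrary illustrative parameters, not recommended values.

```python
from statistics import mean

def drift_alert(daily_rates, baseline_days=14, recent_days=3, ratio=1.5):
    """Flag gradual deterioration a fixed threshold would miss: the mean
    soft-bounce rate over the last few days versus a trailing baseline.
    daily_rates is a chronological list of daily rates (e.g. 0.03 = 3%)."""
    if len(daily_rates) < baseline_days + recent_days:
        return False  # not enough history to establish a baseline
    baseline = mean(daily_rates[-(baseline_days + recent_days):-recent_days])
    recent = mean(daily_rates[-recent_days:])
    return baseline > 0 and recent / baseline >= ratio
```

A 3% rate drifting to 6% never crosses a typical 10% bounce threshold, but it doubles against its own baseline, which is the signal that matters.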

Deferral patterns are another category that standard monitoring handles poorly. A deferral is a temporary rejection: the receiving server accepted the connection but asked the sending server to try again later. Deferrals are normal up to a point. At elevated rates, they indicate that a receiving domain is throttling traffic, which is an early signal of reputation deterioration. But deferrals require MTA-level visibility to observe. They are not consistently surfaced in ESP dashboards, and they require pattern analysis across time to distinguish normal transient deferrals from systematic throttling.

By the time a throttling problem is visible in delivery rates, the underlying reputation issue has already progressed significantly.
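The pattern analysis described above can be sketched simply: group deferrals by receiving domain and by hour, and flag domains whose deferral share stays elevated across several consecutive reporting periods. The rate and duration cutoffs below are illustrative assumptions, not operational recommendations.

```python
from collections import defaultdict

def throttling_domains(events, min_rate=0.2, min_hours=4):
    """Flag receiving domains whose 4xx deferral share is elevated in at
    least `min_hours` hourly buckets: systematic throttling, not noise.
    Each event is (hour, receiving_domain, outcome), with outcome in
    {"delivered", "deferred", "bounced"}."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # domain -> hour -> [deferred, total]
    for hour, domain, outcome in events:
        cell = counts[domain][hour]
        cell[1] += 1
        if outcome == "deferred":
            cell[0] += 1
    flagged = []
    for domain, hours in counts.items():
        elevated = sum(1 for deferred, total in hours.values()
                       if total and deferred / total >= min_rate)
        if elevated >= min_hours:
            flagged.append(domain)
    return flagged
```

A single hour of 50% deferrals at one provider stays quiet; five consecutive hours of it surfaces, which is roughly the distinction between transient congestion and throttling.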


What a monitoring architecture for scale actually requires

The gaps described above are not gaps in any individual tool. Google Postmaster Tools does what it was designed to do. SNDS does what it was designed to do. ESP dashboards do what they were designed to do. The gap is structural: none of these tools was designed to work together, and none of them was designed to provide the cross-source, continuous correlation that organizations operating at scale actually need.

An architecture that addresses these gaps has three core requirements.

First, a unified data layer. All events from all sending systems — ESP webhooks, MTA logs, mailbox provider telemetry — need to be ingested into a single data store with a normalized schema. This is the prerequisite for every other capability. Without it, cross-source analysis requires rebuilding the normalization layer every time a query is needed.
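At its smallest, the unified data layer is just one table with one schema that every source writes into. The sketch below uses SQLite purely for illustration; a production store would differ, but the structural point is the same: once events land here, cross-source queries need no per-ESP normalization at query time.

```python
import sqlite3

# Minimal sketch of a unified event store: one table, one normalized schema,
# regardless of which system produced the event. SQLite is illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        source TEXT,            -- 'sendgrid', 'klaviyo', 'pmta', 'postmaster', ...
        event_type TEXT,        -- 'delivered', 'bounced', 'deferred', 'reputation'
        recipient_domain TEXT,
        smtp_status TEXT,
        ts INTEGER              -- Unix epoch seconds
    )
""")

def ingest(rows):
    """Append normalized events from any source into the shared store."""
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()

ingest([
    ("sendgrid", "bounced", "outlook.com", "5.1.1", 1700000000),
    ("pmta", "deferred", "outlook.com", "4.7.0", 1700000100),
])

# A single query now spans ESP webhooks and MTA logs alike.
rows = conn.execute(
    "SELECT source, COUNT(*) FROM events WHERE recipient_domain = ? GROUP BY source",
    ("outlook.com",),
).fetchall()
```

The query at the end is the payoff: one statement sees Microsoft-bound trouble across both the cloud ESP and the on-premise MTA.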

Second, continuous monitoring rather than on-demand querying. Problems that develop gradually are only visible to a system that is watching continuously. A monitoring system that requires a human to initiate a query will only find the problems that the human thought to look for. A system that monitors continuously and surfaces deviations autonomously can find problems that no one thought to look for.

Third, cross-source correlation rather than single-source alerting. An alert that fires when a bounce rate crosses a threshold in one ESP is better than no alert. An alert that fires when a bounce rate increase in one ESP correlates with a reputation shift in Postmaster Tools and a deferral increase in the MTA logs is more valuable, because it arrives with context that makes diagnosis faster and remediation more targeted.
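In miniature, cross-source correlation can be expressed as a time-window join over per-source anomalies: fire only when anomalies from distinct sources land close together. The window size and source count below are illustrative parameters.

```python
def correlated_alert(signals, window=6 * 3600, min_sources=2):
    """Return a group of anomalies from at least `min_sources` distinct
    sources that fall within one time window, or None if no such group
    exists. Each signal is (source, ts, description), ts in epoch seconds."""
    signals = sorted(signals, key=lambda s: s[1])
    for i, (_, start, _) in enumerate(signals):
        group = [s for s in signals[i:] if s[1] - start <= window]
        if len({s[0] for s in group}) >= min_sources:
            return group  # a correlated incident, with cross-source context
    return None
```

The returned group is the context the single-source alert lacks: the bounce spike, the reputation shift, and the deferral increase arrive as one incident instead of three unrelated pages.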

This type of architecture — one that provides unified ingestion, continuous monitoring, and cross-source correlation — is what separates a monitoring tool from a deliverability intelligence system. The distinction matters at scale because scale is precisely where the limitations of individual monitoring tools become operationally significant.

Practitioners in the space are increasingly describing this class of system as an Agentic Email Intelligence Platform, a term that reflects both the autonomous monitoring behavior and the intelligence layer that makes cross-source correlation actionable. The concept is explained in detail in our piece on what an Agentic Email Intelligence Platform is.


Continue reading

This article is part of a five-part series on email deliverability intelligence.


Frequently asked questions

How do you monitor email deliverability across multiple ESPs?

Monitoring across multiple ESPs requires a normalized data layer that ingests events from each ESP in their native format and converts them into a common schema for analysis. Without that layer, cross-ESP analysis means pulling exports from each system separately and combining them manually — a process that is slow, error-prone, and does not scale. The alternative is a dedicated intelligence layer that handles ingestion and normalization automatically and provides a unified view across all sending infrastructure.

Why does deliverability monitoring get harder as sending volume increases?

Volume itself is not the primary challenge. The challenge is the operational complexity that typically accompanies volume growth: multiple ESPs, multiple domains, multiple sending streams, and multiple teams. Each addition multiplies the monitoring surface without providing tools to observe the cross-system effects. A problem that originates in one stream can affect reputation in another, and standard monitoring tools do not surface those relationships automatically.

What is the difference between delivery rate and deliverability?

Delivery rate measures whether a message was accepted by the receiving mail server. A message can have a 99% delivery rate and still be landing in spam folders at every major mailbox provider. Deliverability measures inbox placement, which is a function of reputation, engagement history, authentication, and content — none of which are directly visible in ESP delivery rate metrics. Delivery rate is a necessary metric but not a sufficient one for understanding actual deliverability performance.

What is an Agentic Email Intelligence Platform?

An Agentic Email Intelligence Platform, or AEIP, is a class of system built specifically for the multi-ESP, multi-domain, multi-MTA environment that standard monitoring tools were not designed for. It ingests events from all sending systems continuously, normalizes them into a unified schema, and applies automated correlation and anomaly detection across the combined data. The defining characteristic is that it surfaces findings without requiring a user to initiate a query: when a pattern develops across two or more sources simultaneously, the system identifies it and escalates it, whether or not anyone was looking at that combination of signals at that moment.
