Unified Ad Event Logging Across Systems

Designed a unified logging architecture at Meta that reduced ad incident debugging time by 80% across all surfaces using a global correlation ID system.

Kotlin · Android · Logging · Observability · Meta · System Design · Debugging

Unified Logging

Company: Meta
Surface: Cross-surface (Reels, Feed, Threads)
Role: Android Software Engineer
Impact Area: Observability, Incident Debugging, Log Accuracy


The Problem

At Meta scale, a single ad event touches many systems simultaneously. An impression event generates log entries in the client-side analytics pipeline, the server delivery log, the ad ranking decision log, the billing system, and the measurement and attribution infrastructure. When something goes wrong, whether a reported billing discrepancy, a sampled log anomaly, an impression count mismatch between client and server, or a conversion event without a traceable delivery record, diagnosing the issue requires joining evidence from all of these systems.

Before Unified Logging, that process was painful, slow, and often inconclusive.

Each system had its own internal ID scheme for ad events. The client generated client-side event IDs based on its local session state. The server generated delivery IDs based on the ranking response. The billing system generated billing transaction IDs at charge time. These IDs were not correlated with each other. Joining them required multi-hop table joins across systems with different sampling rates, different schema conventions, and latency differences that meant the same event could appear in different tables minutes apart.

Worst of all, the sampling inconsistency made many join-based analyses fundamentally unreliable. A log table sampled at 10% and a billing table sampled at 1% do not produce a valid join unless the sampling decisions for the same event happen to align, which, with independent sampling, they rarely do.

Debugging a single production incident required hours of log archaeology. Even then, the conclusion was often uncertain because the cross-system evidence could not be reliably correlated.


The Solution: One ID, Everywhere

Unified Logging solves this with a single intervention: generate one globally unique identifier per ad event, once, at the server side at the moment the ad is selected for delivery, and propagate that identifier to every system that touches the event from that moment forward.

The identifier travels in the ad response payload to the client. The client reads it and attaches it to every client-side log entry generated for that ad event: delivery logs, impression logs, engagement logs, and any derived events. The server attaches it to the delivery log, the ranking decision log, and the billing event at charge time.

Every log table, across every system, now shares a single join key for each ad event. Any investigation starts with that key and can query any table directly, without multi-hop ID resolution and without cross-system schema translation.
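The flow described above can be sketched in a few lines of Kotlin. This is an illustrative sketch only: the type and function names are hypothetical, not Meta's actual APIs.

```kotlin
import java.util.UUID

// Hypothetical response payload: the ID is minted once, server-side,
// at the moment the ad is selected for delivery.
data class AdResponse(val adId: String, val unifiedLogId: String)

fun selectAdForDelivery(adId: String): AdResponse =
    AdResponse(adId = adId, unifiedLogId = UUID.randomUUID().toString())

// Client side: every log entry for this ad event carries the same key,
// read directly from the response payload.
data class LogEntry(val event: String, val unifiedLogId: String)

fun logClientEvent(event: String, response: AdResponse): LogEntry =
    LogEntry(event = event, unifiedLogId = response.unifiedLogId)

fun main() {
    val response = selectAdForDelivery("ad_123")
    val impression = logClientEvent("impression", response)
    val click = logClientEvent("click", response)
    // Server logs, impression logs, and engagement logs all share one join key.
    check(impression.unifiedLogId == response.unifiedLogId)
    check(click.unifiedLogId == response.unifiedLogId)
}
```

The key property is that the client never generates or mutates the identifier; it only copies it forward.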


Why Server-Side ID Generation Matters

The decision to generate the ID on the server at ranking decision time, rather than on the client, was deliberate and important.

If the ID were generated on the client, the server-side logs (delivery log, ranking log, billing log) would not have the ID until the client sent it back in a subsequent request, creating a timing gap and a dependency on client-side uptime and connectivity for server-side log completeness.

Generating the ID server-side at the moment the ad is selected means it is available to all server-side systems immediately, without any client involvement. The ID flows downstream to the client as part of the ad response, ensuring the client has it for all subsequent client-side logging without any round-trip dependency.

This single-source-of-truth generation model is what makes the correlation reliable across all systems and all network conditions.


Sampling Consistency

A second significant benefit of Unified Logging is sampling consistency across log tables.

Before Unified Logging, each log table applied its own independent sampling logic. The same ad event might be sampled into the client impression table at a 10% rate, into the server delivery table at a 5% rate, and into the billing table at a 1% rate, with no coordination between the sampling decisions. Any cross-table analysis on sampled data had systematic bias because the sample populations across tables were not the same.

With a shared unique identifier, sampling can be made deterministic based on the identifier value itself, using a hash-based approach. The sampling decision for an event is derived from the hash of its unified_log_id, and the same hash function is applied across all tables. The result is that if an event is included in the sample for one table, it is included in the sample for all tables at the same effective rate.

This makes sampled log analysis reliably accurate in a way it could not be before. Cross-table joins on sampled data now produce consistent, unbiased results because the sample is the same sample everywhere.
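A minimal Kotlin sketch of the deterministic-sampling idea, assuming a percent-based rate and the JVM's `String.hashCode` as the hash function (the real system presumably uses something stronger; the point is only that the decision is a pure function of the ID):

```kotlin
// Deterministic, ID-keyed sampling: the same ID hashes to the same
// bucket in every table, so samples at different rates are nested
// rather than independent.
fun isSampled(unifiedLogId: String, ratePercent: Int): Boolean {
    // Int.mod returns a non-negative result for a positive modulus.
    val bucket = unifiedLogId.hashCode().mod(100)
    return bucket < ratePercent
}

fun main() {
    val id = "e7a1c9d4-0000-4000-8000-000000000000" // example ID
    // If an event falls in a 1% billing sample, it necessarily also
    // falls in a 10% impression sample: buckets 0..0 are a subset of 0..9.
    if (isSampled(id, 1)) check(isSampled(id, 10))
    // The decision is stable: re-evaluating it never flips the answer.
    check(isSampled(id, 10) == isSampled(id, 10))
}
```

Because every table applies the same function to the same key, a 1% sample is always a strict subset of a 10% sample, which is what makes cross-table joins on sampled data valid.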


Cross-Team Implementation

Implementing Unified Logging required coordination across multiple engineering teams at Meta: the ads ranking team (server-side ID generation), the client logging infrastructure team (ID propagation and attachment in client-side analytics), and the billing pipeline team (ID ingestion and storage in billing event tables).

I drove the cross-team alignment on three key decisions:

ID format. UUID v4 (128-bit) was selected for its global uniqueness guarantee, its availability in both client and server environments without shared state, and its compatibility with existing log schema field types across all target tables.

Field name convention. Using the same field name (unified_log_id) across all tables was a non-negotiable requirement for the feature to deliver its value. Schema divergence where different tables used different field names for the same identifier would have reintroduced the join complexity the feature was designed to eliminate.

Propagation contract. Every event derived from an original ad delivery event must carry the same unified_log_id as the original event. Click events derived from an impression must carry the impression's ID. Conversion events attributed to a click must carry the same ID. This chain of custody for the identifier is what enables full lifecycle tracing across the entire funnel.
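The propagation contract can be expressed as a simple invariant: derived events copy the parent's ID and never mint a new one. A hedged Kotlin sketch, with hypothetical names:

```kotlin
// Illustrative event type; in practice each event kind has its own schema,
// but all of them carry the same unified_log_id field.
data class AdEvent(val type: String, val unifiedLogId: String)

// Deriving an event copies the parent's ID verbatim; it is never re-minted.
fun deriveEvent(parent: AdEvent, type: String): AdEvent =
    AdEvent(type = type, unifiedLogId = parent.unifiedLogId)

fun main() {
    val delivery = AdEvent("delivery", "uuid-abc")
    val impression = deriveEvent(delivery, "impression")
    val click = deriveEvent(impression, "click")
    val conversion = deriveEvent(click, "conversion")
    // The entire funnel shares one join key: the chain of custody holds.
    check(conversion.unifiedLogId == delivery.unifiedLogId)
}
```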


Impact on Debugging Workflows

The practical impact on incident debugging was immediate and dramatic.

Before Unified Logging, a production incident involving an ads delivery anomaly typically required 2 to 4 hours of investigation: identifying which tables might have relevant information, building multi-hop joins across incompatible ID schemes, filtering for the relevant time window, and managing the uncertainty introduced by inconsistent sampling across tables.

After Unified Logging, the same investigation starts with a single lookup: retrieve the unified_log_id for the suspect event and query any table directly using that key. The full event lifecycle, from ranking decision through delivery through impression through billing through attribution, is visible in a single joined query. Investigation time collapsed to under 30 minutes for most incident types.

Unified Logging also enabled a new class of automated monitoring that was impossible before: cross-table consistency checks that flag events where a billing record exists without a corresponding client impression record, or where a client impression fires without a matching server delivery record. These automated checks catch anomalies in real time, before they escalate to user-reported incidents.
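At its core, such a consistency check is a set difference over each table's unified_log_id values. A simplified Kotlin sketch, assuming the per-table ID sets have already been extracted (table names and shapes are illustrative):

```kotlin
// Results of one consistency pass over three hypothetical tables.
data class Anomalies(
    val billedWithoutImpression: Set<String>,
    val impressionWithoutDelivery: Set<String>,
)

fun checkConsistency(
    deliveryIds: Set<String>,
    impressionIds: Set<String>,
    billingIds: Set<String>,
): Anomalies = Anomalies(
    // Billing record exists, but no client impression was ever logged.
    billedWithoutImpression = billingIds - impressionIds,
    // Client impression fired without a matching server delivery record.
    impressionWithoutDelivery = impressionIds - deliveryIds,
)

fun main() {
    val anomalies = checkConsistency(
        deliveryIds = setOf("a", "b"),
        impressionIds = setOf("a", "b", "c"), // "c" has no delivery record
        billingIds = setOf("a", "d"),         // "d" was billed, never shown
    )
    check(anomalies.impressionWithoutDelivery == setOf("c"))
    check(anomalies.billedWithoutImpression == setOf("d"))
}
```

None of this is expressible without a shared key: before Unified Logging there was no column on which to take these set differences.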


Outcome

Unified Logging is the observability foundation that the rest of the ads delivery system depends on. It makes debugging fast, makes sampled analysis reliable, and makes automated anomaly detection possible.

The engineering principle behind it is simple but often deferred: observability infrastructure is not optional and is not something to build when you need it. When you need it, you are already in the middle of a production incident. Build the shared correlation layer before the incidents arrive. The cost of building it proactively is small. The cost of not having it, measured in hours of investigation time per incident across a team of engineers, compounds every week it does not exist.

At the scale Meta operates, that compound cost is significant. Unified Logging eliminated it.