Home
← back to home

BART Observability

Real-time telemetry for transit data pipelines.

tech stack

OpenTelemetryGCPPub/SubDataflowPython

The Problem

Distributed data pipelines processing real-time BART (San Francisco Transit) data were prone to "silent failures" where data would drop between ingestion and storage without clear logs.

What I Built

Engineered a comprehensive observability layer using OpenTelemetry to trace every transit event from the initial API poll through GCP Pub/Sub and Dataflow transformations into BigQuery.

Architecture & Approach

Implemented custom OpenTelemetry instrumentation in Python. Spans are propagated through Pub/Sub headers, allowing Cloud Trace to stitch together a single timeline for a message as it moves through various managed services.

Impact & Results

Identified a 5% data loss bottleneck in the Pub/Sub-to-Dataflow connector.

Reduced mean-time-to-discovery (MTTD) for pipeline errors from hours to minutes.

Created a reusable observability pattern for enterprise GCP data lakes.

Key Decisions & Tradeoffs

Traded off a small amount of pipeline latency (~15ms) for high-fidelity tracing that was deemed essential for production reliability.