BART Observability
Real-time telemetry for transit data pipelines.
Tech Stack
OpenTelemetry · GCP Pub/Sub · Dataflow · Python
The Problem
Distributed data pipelines processing real-time BART (Bay Area Rapid Transit) data were prone to silent failures: records dropped between ingestion and storage without leaving any clear log trail.
What I Built
Engineered a comprehensive observability layer using OpenTelemetry to trace every transit event from the initial API poll through GCP Pub/Sub and Dataflow transformations into BigQuery.
Architecture & Approach
Implemented custom OpenTelemetry instrumentation in Python. Trace context is propagated through Pub/Sub message attributes, allowing Cloud Trace to stitch together a single timeline for a message as it moves through the various managed services.
Impact & Results
Identified a 5% data loss bottleneck in the Pub/Sub-to-Dataflow connector.
Reduced mean-time-to-discovery (MTTD) for pipeline errors from hours to minutes.
Created a reusable observability pattern for enterprise GCP data lakes.
Key Decisions & Tradeoffs
Traded off a small amount of pipeline latency (~15ms) for high-fidelity tracing that was deemed essential for production reliability.
