Energy Data Lake
Unifying fragmented power grid data on GCP.
Tech Stack
GCP · Dataproc · BigQuery · Vertex AI · Python
The Problem
Power grid data in the US is highly fragmented across 9 different ISO systems, each with its own schemas, update frequencies, and API quirks, making unified analytics impractical without a dedicated integration layer.
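The core of that integration problem is schema reconciliation. A minimal sketch of the idea, with hypothetical field names (each real ISO API differs in its actual fields and formats):

```python
# Sketch: normalizing ISO-specific price records into one unified schema.
# Field names below are illustrative, not the real ISO API fields.

# Per-ISO mapping from source field names to the unified schema.
FIELD_MAPS = {
    "CAISO": {"ts": "interval_start", "price": "lmp_usd_mwh", "node": "node_id"},
    "ERCOT": {"ts": "DeliveryDate", "price": "SettlementPointPrice", "node": "SettlementPoint"},
}

def normalize(iso: str, record: dict) -> dict:
    """Map a raw ISO record onto the unified (iso, ts, price, node) schema."""
    m = FIELD_MAPS[iso]
    return {
        "iso": iso,
        "ts": record[m["ts"]],
        "price_usd_mwh": float(record[m["price"]]),
        "node": record[m["node"]],
    }

raw = {"DeliveryDate": "2024-06-01T00:00:00Z",
       "SettlementPointPrice": "23.75",
       "SettlementPoint": "HB_NORTH"}
print(normalize("ERCOT", raw)["price_usd_mwh"])  # 23.75
```

Once every source is mapped through a table like this, downstream pipelines only ever see one schema.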
What I Built
Architected a real-time Energy Data Lake on GCP that ingests, cleans, and standardizes data from all major US grid operators. The platform provides a single source of truth for grid status, pricing, and carbon intensity.
Architecture & Approach
Used GCS for raw landing, Cloud Functions for localized ingestion, and Dataproc (Spark) for medallion-architecture pipelines (Bronze/Silver/Gold) that transform raw JSON into optimized BigQuery tables.
Impact & Results
Successfully integrated data from all 9 US ISOs into a single SQL interface.
Processed 50GB+ of daily grid data with 99.9% ingestion uptime.
Enabled Vertex AI-driven pricing forecasts with a 15% improvement in accuracy.
Key Decisions & Tradeoffs
Chose Dataproc over Dataflow for the transformation layer to leverage existing Spark-based cleaning libraries and better handle batched historical re-processing.
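Historical re-processing is naturally batch-shaped: a backfill window is split into fixed-size date batches, each of which can be submitted as one Spark job. A hedged sketch of that driver logic (function and parameter names are assumptions, not the production code):

```python
# Sketch: split a historical backfill window into fixed-size batches,
# each suitable for submission as a single Dataproc/Spark job.
from datetime import date, timedelta

def backfill_batches(start: date, end: date, days_per_batch: int = 7):
    """Yield (batch_start, batch_end) half-open windows covering [start, end)."""
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(days=days_per_batch), end)
        yield cur, nxt
        cur = nxt

batches = list(backfill_batches(date(2024, 1, 1), date(2024, 1, 20), days_per_batch=7))
print(len(batches))  # 3 batches: Jan 1-8, Jan 8-15, Jan 15-20
```

Keeping batch boundaries explicit like this makes re-runs idempotent: any failed window can be resubmitted in isolation.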
