
Energy Data Lake

Unifying fragmented power grid data on GCP.

tech stack

GCP · Dataproc · BigQuery · Vertex AI · Python

The Problem

Power grid data in the US is fragmented across 9 independent system operators (ISOs), each with its own schemas, update frequencies, and API quirks, making unified analytics effectively impossible.

What I Built

Architected a real-time Energy Data Lake on GCP that ingests, cleans, and standardizes data from all major US grid operators. The platform provides a single source of truth for grid status, pricing, and carbon intensity.
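One way to picture that "single source of truth" is as a canonical record type that every source converges on. A minimal sketch; the field names here are illustrative, not the platform's actual schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class GridObservation:
    """One standardized observation, regardless of source ISO.

    All field names are hypothetical stand-ins for the real schema.
    """
    iso: str                 # e.g. "CAISO", "PJM"
    region: str              # ISO-specific zone/node name, kept as-is
    ts_utc: datetime         # all timestamps normalized to UTC
    lmp_usd_mwh: float       # locational marginal price
    carbon_g_co2_kwh: float  # carbon intensity

obs = GridObservation(
    iso="CAISO",
    region="SP15",
    ts_utc=datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc),
    lmp_usd_mwh=42.5,
    carbon_g_co2_kwh=210.0,
)
```

Whatever the concrete schema, pinning down units (USD/MWh, gCO2/kWh) and a single timezone up front is what makes cross-ISO comparison possible at all.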

Architecture & Approach

Used GCS for raw landing, Cloud Functions for per-source ingestion, and Dataproc (Spark) for medallion-architecture pipelines (Bronze/Silver/Gold) that transform raw JSON into optimized BigQuery tables.
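The Silver step in a medallion pipeline like this boils down to per-source field mapping plus type and timezone normalization. A pure-Python sketch of that idea (the real pipelines run on Spark, and the raw ISO field names below are invented for illustration):

```python
from datetime import datetime, timezone

# Hypothetical per-ISO mappings: raw field name -> canonical field name.
FIELD_MAPS = {
    "CAISO": {"INTERVALSTARTTIME_GMT": "ts_utc",
              "LMP_PRC": "lmp_usd_mwh",
              "NODE": "region"},
    "PJM":   {"datetime_beginning_utc": "ts_utc",
              "total_lmp_rt": "lmp_usd_mwh",
              "pnode_name": "region"},
}

def to_silver(iso: str, raw: dict) -> dict:
    """Map one raw (Bronze) record into the canonical Silver schema."""
    rec = {"iso": iso}
    for src, dst in FIELD_MAPS[iso].items():
        rec[dst] = raw[src]
    # Normalize timestamp strings to timezone-aware UTC datetimes
    # and price strings to floats.
    rec["ts_utc"] = datetime.fromisoformat(rec["ts_utc"]).replace(tzinfo=timezone.utc)
    rec["lmp_usd_mwh"] = float(rec["lmp_usd_mwh"])
    return rec

row = to_silver("PJM", {
    "datetime_beginning_utc": "2024-01-01T08:00:00",
    "total_lmp_rt": "31.2",
    "pnode_name": "WESTERN HUB",
})
```

In Spark the same mapping would typically be expressed as per-source `select`/`withColumn` expressions, but the shape of the work is identical: rename, retype, and normalize before anything lands in Gold.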

Impact & Results

Successfully integrated data from all 9 US ISOs into a single SQL interface.

Processes 50GB+ of daily grid data with 99.9% ingestion uptime.

Enabled Vertex AI-driven pricing forecasts with a 15% improvement in accuracy.
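The payoff of the single SQL interface is that cross-ISO questions collapse into one query. A local stand-in using sqlite purely as a demo (the platform itself queries BigQuery; the table and column names below are invented):

```python
import sqlite3

# In-memory stand-in for the unified Gold table.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE gold_observations (
        iso TEXT, region TEXT, ts_utc TEXT,
        lmp_usd_mwh REAL, carbon_g_co2_kwh REAL
    )
""")
con.executemany(
    "INSERT INTO gold_observations VALUES (?, ?, ?, ?, ?)",
    [
        ("CAISO", "SP15",        "2024-01-01T08:00:00Z", 42.5, 210.0),
        ("PJM",   "WESTERN HUB", "2024-01-01T08:00:00Z", 31.2, 390.0),
        ("ERCOT", "HB_NORTH",    "2024-01-01T08:00:00Z", 28.9, 340.0),
    ],
)

# One query over every ISO at once -- the point of the unified schema.
rows = con.execute(
    "SELECT iso, AVG(lmp_usd_mwh) AS avg_lmp "
    "FROM gold_observations GROUP BY iso ORDER BY avg_lmp DESC"
).fetchall()
```

Before unification, answering the same question meant nine separate API clients, nine schemas, and hand-rolled joins.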

Key Decisions & Tradeoffs

Chose Dataproc over Dataflow for the transformation layer to reuse existing Spark-based cleaning libraries and to better handle batch re-processing of historical data.