Blog
Articles I’ve written on Medium.

Git-Aware .env Diff Tool using Go
🔍 The Problem: Invisible .env Drift

We’ve all been there. You check out a new branch, deploy to staging, and suddenly nothing works. Logs point to missing API keys or unexpected ports. After digging, you realize someone changed a .env file on another branch. The change went unnoticed because .env files usually aren’t tracked rigorously and most teams don’t diff them during reviews.

This happened to me one too many times. So I built goenvdiff: a CLI tool that compares .env files across Git branches or commits and shows what’s added, removed, or changed. But that was just the beginning.

⚙️ MVP: A Basic Diff Tool

The first version of goenvdiff was intentionally simple:

- Written in Go
- Used git show to pull .env files from refs
- Parsed them with godotenv
- Showed a colorized diff
- Supported --json for pipelines
- Powered by Cobra CLI

A typical usage:

```
goenvdiff --from main --to feature/login --path .env
```

Output:

```
+ API_KEY added (abc123)
- DEBUG removed (was true)
~ PORT changed from 8080 to 9090
```

Useful? Yes. Production-ready? Not quite.

❌ The Limitations

While the first version worked, it wasn’t ready for everyday team use:

- Only worked with one .env file at a time
- No support for .env.production, .env.test, etc.
- Couldn’t compare the working directory vs Git history
- No awareness of secrets drift
- Not usable inside CI or GitHub workflows
- No output formatting for Markdown or HTML

The idea was good, but it needed a serious upgrade to be dev-ready.

🧪 From Toy to Tool: Making goenvdiff Actually Useful

I broke the evolution down into four product-focused phases.

Phase 1: Real Developer Use

- Multi-file support: .env.* globs
- Local vs Git diff: compare uncommitted vs committed
- Secret drift detection: flag SECRET, API_KEY, etc.
- Better output context: show commit hashes and timestamps

Phase 2: Workflow Integration

- Pre-commit hook: prevent sensitive drift before commit
- CI validation: use in GitHub Actions to block unsafe merges

```yaml
- name: Env Diff
  run: |
    goenvdiff --from main --to HEAD --json --path .env > diff.json
    # jq -e exits non-zero when nothing matches, so the step fails only if API_KEY drifted
    jq -e '.[] | select(.Key=="API_KEY")' diff.json && exit 1 || exit 0
```

Phase 3: Output Polish

- Markdown export: for GitHub PRs
- HTML export: for CI dashboards
- Custom color themes: light/dark modes

Phase 4: Advanced Diffs

- Semantic changes: type-aware diffing
- Explain mode: suggest impacted systems or configs

🔬 Architecture & Flow

```
  Git commit ------> read .env file ------> parse key/values
  another Git ref -> read .env file ------> parse key/values
                                                  |
                                                  v
                          diff key/value pairs (added / removed / modified)
                                                  |
                                                  v
                          print output / export JSON / Markdown
```
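Under the hood, the core of the flow above is just two steps: read a file at a Git ref and diff two maps. Here is a minimal Go sketch of that idea, assuming godotenv for parsing; the helper names readEnvAtRef and diffEnv are illustrative, not the actual goenvdiff API, and the real tool layers Cobra, color output, and JSON export on top.

```go
package main

import (
	"fmt"
	"os/exec"

	"github.com/joho/godotenv"
)

// readEnvAtRef shells out to `git show ref:path` and parses the result
// into a key/value map using godotenv.
func readEnvAtRef(ref, path string) (map[string]string, error) {
	out, err := exec.Command("git", "show", ref+":"+path).Output()
	if err != nil {
		return nil, fmt.Errorf("git show %s:%s: %w", ref, path, err)
	}
	return godotenv.Unmarshal(string(out))
}

// diffEnv compares two env maps and prints added, changed, and removed keys.
func diffEnv(from, to map[string]string) {
	for k, v := range to {
		old, ok := from[k]
		switch {
		case !ok:
			fmt.Printf("+ %s added (%s)\n", k, v)
		case old != v:
			fmt.Printf("~ %s changed from %s to %s\n", k, old, v)
		}
	}
	for k, v := range from {
		if _, ok := to[k]; !ok {
			fmt.Printf("- %s removed (was %s)\n", k, v)
		}
	}
}

func main() {
	from, err := readEnvAtRef("main", ".env")
	if err != nil {
		panic(err)
	}
	to, err := readEnvAtRef("feature/login", ".env")
	if err != nil {
		panic(err)
	}
	diffEnv(from, to)
}
```

Run inside a repo that has both refs and it prints the same three-symbol output shown earlier.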
🎓 Lessons Learned

- Go was the right choice: fast, static binaries, easy CLI tools
- git show over go-git: simpler and more reliable for small tools
- Engineers love clean diffs: color-coded, commit-aware changes help catch real bugs
- CI integration matters: a tool becomes useful when it can break the build for the right reasons

🚀 What’s Next

- --match ".env*" support for multiple files
- Markdown/HTML export
- Severity tagging for high-risk env changes
- Homebrew tap for one-line installs

📚 Try It Out

```
go install github.com/ashishsalunkhe/goenvdiff@latest
```

Or clone it:

```
git clone https://github.com/ashishsalunkhe/goenvdiff.git
cd goenvdiff
go build -o goenvdiff
```

Try it:

```
goenvdiff --from main --to feature/login --path .env
```

👋 Final Thoughts

If you’ve ever been burned by unseen .env changes, you’ll get why this tool exists. But building a tool is one thing. Making it actually useful (something a dev team wants to install, use in CI, and trust with secrets) takes iteration, feedback, and a shift from “it works” to “it integrates.”

I’d love feedback, contributions, or just a GitHub star if you find it helpful.

Repo: github.com/ashishsalunkhe/goenvdiff
#developer-productivity #developer-tools #platform-engineering #go-language #git
Monitoring Microservices on EKS with OpenTelemetry, Prometheus, and Grafana: A Student’s Guide
Photo by Alex Kulikov on Unsplash

Kubernetes enables you to orchestrate complex, distributed systems composed of containerized microservices. While powerful, this abstraction makes it harder to answer basic operational questions: Is my service healthy? Which pod is consuming excess memory? Why is latency spiking during deployments? Traditional logging and ad hoc metrics fall short in dynamic, autoscaled environments. That’s where observability shines.

Observability isn’t just about tools; it’s about gaining insight into your system’s state through metrics, logs, and traces. In this blog, I walk through my journey of building a full-stack observability platform using OpenTelemetry, Prometheus, and Grafana, all running on Amazon EKS.

Phase 1: Environment and Initial Application Setup

Docker Deployment

To begin, I launched a dedicated EC2 instance (t3.large with a 16 GB EBS volume) and installed both Docker and Docker Compose. I then cloned my GitHub repository containing the OpenTelemetry demo and used the docker-compose.yml file to bring the application online. To confirm everything was working correctly, I ran:

```
docker ps
docker-compose logs
```

These commands confirmed that all services were running as expected. Once I verified the application was functional and accessible via its defined endpoints, I cleaned up the environment by terminating the instance.

Kubernetes Setup

The next step involved provisioning an EKS cluster. I created a separate EC2 instance to act as the EKS client and attached an IAM role with sufficient permissions (EksAllAccess, IamLimitedAccess, AWSCloudFormationFullAccess, and AmazonEC2FullAccess). Using eksctl, I deployed the cluster based on a predefined configuration file:

```
eksctl create cluster -f eks-cluster-deployment.yaml
```

I deployed the OpenTelemetry demo application to a namespace named otel-demo:

```
kubectl apply --namespace otel-demo -f opentelemetry-demo.yaml
```

I verified the health of all pods and services using kubectl get all -n otel-demo and reviewed logs from essential components like the frontend proxy. Since the frontend proxy service was originally exposed as a ClusterIP, I updated it to a LoadBalancer type to make the application accessible externally.
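One way to make that ClusterIP-to-LoadBalancer switch without editing the full manifest is a one-line patch. This is a sketch rather than the exact command from the project, and the service name frontend-proxy is an assumption; the demo manifest or chart may name it differently.

```
# Assumes the demo's frontend proxy service is named "frontend-proxy" in the otel-demo namespace
kubectl patch svc frontend-proxy -n otel-demo -p '{"spec": {"type": "LoadBalancer"}}'

# Confirm an external address was provisioned
kubectl get svc frontend-proxy -n otel-demo
```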
Phase 2: YAML Splitting and Modular Deployment

To streamline resource management, I split the monolithic YAML file into smaller files categorized by resource type: Deployments, Services, ConfigMaps, and Secrets. This organization enabled targeted deployments and simplified error tracking. I applied the configuration recursively after first setting up the namespace:

```
kubectl apply -f namespace.yaml
kubectl apply -f ./open-telemetry --recursive --namespace otel-demo
```

This approach offered multiple advantages:

- Independent configuration for each microservice
- Easier debugging and faster rollback
- Safe, parallel development and deployment
- Enhanced scalability and reliability

Key architectural components included:

- Namespace YAML: provided logical isolation for grouped resources
- Telemetry stack: OpenSearch, Jaeger, OpenTelemetry Collector, Prometheus, and Grafana
- Web application services: a full suite of microservices, including backend and frontend services, along with Kafka for messaging

Phase 3: Integrating Helm for Deployment

To further simplify the deployment and configuration process, I leveraged Helm, a Kubernetes package manager. I added the OpenTelemetry Helm chart repository, updated it, and created a new namespace for the Helm-managed deployment:

```
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
kubectl create namespace otel-demo-helm
helm install otel-demo open-telemetry/opentelemetry-demo -n otel-demo-helm
```

Helm’s templating system allowed me to simulate changes by editing the values.yaml file. For instance, increasing the number of replicas was as simple as modifying the following:

```
replicaCount: 3
```

Then applying the update:

```
helm upgrade otel-demo open-telemetry/opentelemetry-demo -f values.yaml -n otel-demo-helm
```

In case of deployment issues, Helm provided an easy rollback mechanism:

```
helm rollback otel-demo -n otel-demo-helm
```

Observability in Action: Grafana Dashboards and Alerting

Once the observability stack was operational, I used port forwarding to access the Grafana dashboard from my local machine:

```
kubectl port-forward svc/grafana 3000:3000 -n otel-demo-helm
```

Inside Grafana, I added Prometheus as the primary data source. I then imported prebuilt dashboards and created several custom panels using PromQL queries such as:

```
sum(rate(container_cpu_usage_seconds_total[1m])) by (pod)
sum(container_memory_working_set_bytes) by (node)
```

These visualizations offered insights into pod-level and node-level resource usage, network throughput, and overall system health.

Alerting Setup

To detect anomalies in real time, I integrated Alertmanager with Prometheus. I configured rules to send alerts for critical conditions such as:

- More than 3 pod restarts within a 10-minute window
- CPU usage exceeding 80% over a 5-minute period

These alerts were routed through AWS SNS to trigger email notifications, ensuring that I would be informed even during off-hours.
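As a rough illustration of those two conditions, here is what the Prometheus rules could look like. This is a hedged sketch, not my exact production config: it assumes kube-state-metrics and cAdvisor metrics are being scraped, and the CPU expression uses 0.8 cores as a stand-in for "80%", which you would normally express relative to the pod's CPU limit.

```yaml
groups:
  - name: otel-demo-alerts
    rules:
      - alert: PodRestartingTooOften
        # kube-state-metrics counter; fires if a container restarts more than 3 times in 10 minutes
        expr: increase(kube_pod_container_status_restarts_total{namespace="otel-demo-helm"}[10m]) > 3
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} restarted more than 3 times in the last 10 minutes"
      - alert: HighPodCpuUsage
        # cAdvisor counter; sustained usage above ~0.8 cores for 5 minutes
        expr: sum(rate(container_cpu_usage_seconds_total{namespace="otel-demo-helm"}[5m])) by (pod) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} has sustained high CPU usage for 5 minutes"
```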
Common Pitfalls and Fixes

Throughout the project, several challenges emerged that tested the robustness of my deployment and my understanding of Kubernetes internals.

Persistent Volume Claims (PVCs) Remaining in Pending State

Initially, Prometheus and other components relying on persistent storage failed to start because their associated PVCs remained stuck in a Pending state. This issue was traced back to the EKS cluster not having the Amazon EBS CSI (Container Storage Interface) driver enabled. Without it, Kubernetes was unable to dynamically provision EBS volumes for PVCs. To resolve this, I enabled the CSI driver by associating an IAM OIDC provider with the EKS cluster and creating a service account with the necessary IAM permissions. I then installed the aws-ebs-csi-driver as a cluster add-on. After completing this setup, the PVCs successfully bound to dynamically created volumes, allowing the applications to start.

Incomplete Metric Exposure from Nodes

Although the OpenTelemetry Collector successfully gathered metrics from application services, it lacked visibility into cluster-level metrics such as node CPU, memory, and disk usage. This limited the effectiveness of the Grafana dashboards. To address this, I deployed a standalone Prometheus server with node exporters configured to scrape metrics from the Kubernetes nodes directly. I also ensured the Prometheus configuration included appropriate ServiceMonitor resources and discovery rules. With these additions, I was able to visualize granular infrastructure metrics in Grafana.

Helm Deployment Errors and Misconfigurations

During multiple Helm chart upgrades, I encountered errors that resulted in application downtime. These issues typically stemmed from invalid configurations in the values.yaml file or from version mismatches in the Helm chart. I mitigated this by maintaining a version-controlled values.yaml file, using helm diff to preview changes, and setting up helm history and helm rollback workflows. In cases where a deployment failed, I used kubectl describe and pod logs to trace the root cause and then restored the last known good configuration using Helm’s rollback feature. This workflow ensured minimal disruption during updates.

Final Thoughts and Lessons Learned

This hands-on project reinforced the importance of modular configuration and observability in production-grade systems. Some key takeaways:

- Structuring YAML files by resource type simplifies maintenance and rollback
- Helm significantly reduces the complexity of deploying and managing multi-component applications
- Observability should be integrated from day one, not added as an afterthought
- Effective alerting mechanisms prevent minor issues from escalating into outages

The combination of OpenTelemetry, Prometheus, and Grafana provided a powerful toolkit for monitoring, tracing, and visualizing microservices performance across the EKS cluster.

Conclusion

What began as a Docker Compose experiment evolved into a robust, production-like observability platform on Kubernetes. By leveraging AWS EKS, Helm, and the OpenTelemetry ecosystem, I was able to design, deploy, and monitor a multi-service application with real-time insights and automated alerting. This journey underscored the necessity of observability in cloud-native systems and gave me the confidence to scale, debug, and maintain modern infrastructure using industry best practices.

GitHub Repository: EKS-Open-Telemetry
#eks #prometheus #opentelemetry #kubernetes #grafana
Observability in Motion: OpenTelemetry + GCP for Real-Time Data Engineering with BART API
Photo by Cedric Letsch on Unsplash

In the world of real-time data engineering, streaming pipelines are often treated as black boxes: you see data in and data out, but have little visibility into what happens in between. This lack of observability becomes critical when working with high-throughput, time-sensitive data sources, where even small lags or failures can ripple downstream and erode trust in your platform.

In this post, I’ll walk through how I instrumented a real-time data pipeline on Google Cloud Platform using Bay Area Rapid Transit (BART) API data and brought observability to life using OpenTelemetry, Cloud Monitoring (Prometheus), Jaeger, and Grafana, all wired together through the OpenTelemetry Collector.

🚆 Why BART?

The Bay Area Rapid Transit system provides real-time train data via public APIs, including estimated arrival times, service alerts, and station statuses. This makes it a great candidate for:

- Real-time ingestion and processing
- Latency-sensitive applications (e.g., live dashboards, alerts)
- A testbed for showcasing streaming pipeline observability

🔧 Architecture Overview

Here’s the high-level setup I built for this pipeline:

GCP Architecture

🔍 Step 1: Ingest Real-Time Data from the BART API with a Cloud Function

I used the BART “Estimated Departures” (etd) endpoint to collect live train data every 10 seconds using a scheduled Cloud Function.

```python
import requests, json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("your-project-id", "bart-etd")

def fetch_bart_etd(request):
    res = requests.get("http://api.bart.gov/api/etd.aspx?cmd=etd&orig=ALL&key=YOUR_API_KEY&json=y")
    for station in res.json()['root']['station']:
        data = json.dumps(station).encode("utf-8")
        publisher.publish(topic_path, data=data, station=station['abbr'])
    return "Published!"
```

Trigger this Cloud Function using Cloud Scheduler every 10 seconds.

⚙️ Step 2: Stream Processing with Dataflow (Apache Beam)

Next, I used Google Cloud Dataflow (which runs Apache Beam) to process the Pub/Sub stream. The job performs the following:

- Parses nested JSON
- Filters out missing/invalid records
- Adds metadata (e.g., a processing timestamp)
- Writes to both BigQuery (for analytics) and Cloud Storage (as backup)

I also instrumented the Beam job with OpenTelemetry:

- Add the OpenTelemetry Java agent to your Beam pipeline
- Set export configurations for OTLP:

```
--jvm_flags=-javaagent:/path/opentelemetry-javaagent.jar \
--otel.exporter.otlp.endpoint=http://otel-collector:4317 \
--otel.service.name=bart-dataflow
```

This lets your Dataflow job automatically emit traces and metrics!
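For readers who want to see what that processing step could look like in code, here is a minimal Python sketch of a Beam pipeline with the same shape: parse, filter, stamp, and write to BigQuery. It is illustrative only; the project ID, dataset/table names, and schema are assumptions, and the Cloud Storage backup branch plus the error handling of the real job are omitted for brevity.

```python
import json
from datetime import datetime, timezone

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_station(message: bytes):
    """Flatten one Pub/Sub message (a JSON-encoded station record) into rows."""
    station = json.loads(message.decode("utf-8"))
    for etd in station.get("etd", []):
        for estimate in etd.get("estimate", []):
            yield {
                "station": station.get("abbr"),
                "destination": etd.get("destination"),
                "minutes": estimate.get("minutes"),
                "platform": estimate.get("platform"),
            }

def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic="projects/your-project-id/topics/bart-etd")
            | "ParseJson" >> beam.FlatMap(parse_station)
            | "DropInvalid" >> beam.Filter(lambda row: row["minutes"] not in (None, ""))
            | "AddProcessingTime" >> beam.Map(
                lambda row: {**row, "processed_at": datetime.now(timezone.utc).isoformat()}
            )
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "your-project-id:bart_dataset.etd_estimates",
                schema="station:STRING,destination:STRING,minutes:STRING,platform:STRING,processed_at:STRING",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )

if __name__ == "__main__":
    run()
```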
📦 Step 3: Export Telemetry with the OpenTelemetry Collector on Cloud Run

I deployed the OpenTelemetry Collector as a lightweight service on Cloud Run, configured to receive OTLP signals from Dataflow and forward them to GCP-native tools. Here’s a trimmed-down otel-collector-config.yaml:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  googlecloud:
    project: your-project-id
  prometheus:
    endpoint: "0.0.0.0:8889"
  jaeger:
    endpoint: "jaeger:14250"
    insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [googlecloud, jaeger]
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```

Use the googlecloud exporter for integration with Cloud Trace, or forward to Jaeger for flexibility.

📈 Visualize with Grafana, Cloud Monitoring & Jaeger

Grafana dashboards:

- Ingestion lag from Cloud Pub/Sub
- Throughput of messages into BigQuery
- Latency from Pub/Sub to the storage sink

Jaeger/Cloud Trace:

- View end-to-end traces for each record
- Analyze stage-by-stage latency across the Cloud Function, Pub/Sub, and Dataflow

This visibility drastically reduces the mean time to debug (MTTD) and helps you quickly identify slow stages or stuck workers.

🧪 Step 4: Alerting with Prometheus + Alertmanager

Need to know if no BART data has arrived in the last 60 seconds?

```
rate(pubsub_ingestion_count{topic="bart-etd"}[1m]) == 0
```

Trigger Slack or email alerts using Alertmanager or GCP’s built-in alerting policies.

🚀 Key Takeaways

- OpenTelemetry works seamlessly with GCP’s streaming stack
- Observability isn’t just for SREs; it’s critical for data reliability engineering
- With a few configuration steps, you get deep insight into how each component behaves in real time

🌐 Repo & Resources

- 🔗 BART API Documentation
- 🔗 OpenTelemetry Collector Configs
- 🔗 Google Cloud Dataflow
- 🔗 GitHub Repo (Demo Pipeline)
#google-cloud-platform #observability #bay-area #data-engineering #opentelemetry
Building an Urban Mobility Data Platform: Addressing Last-Mile Connectivity in the DMV Region
Photo by Maria Oswalt on Unsplash

The project aimed to bridge last-mile connectivity gaps in the DMV region by building a low-latency, geospatially aware, multi-source analytics platform that integrates open transportation data, shared mobility trends, and socio-demographic context. This blog provides a step-by-step technical breakdown of how we implemented each data engineering layer, from ingestion through modeling to visualization, with precise descriptions of decisions, edge cases, and low-level configurations.

Architecture

The architecture follows a modular, serverless design built entirely on AWS. It supports both batch and real-time ingestion using Lambda functions and Glue Python Shell jobs, with raw data stored in Amazon S3 following a structured, source-partitioned layout. PySpark-based Glue ETL jobs handle normalization, geospatial enrichment, and schema alignment before writing Parquet outputs to a partitioned processed zone. Athena powers analytical querying with support for spatial joins, while QuickSight dashboards provide interactive visualizations. Observability is built in from the ground up using CloudWatch, SNS alerts, and daily cost reports via a custom Lambda scraping AWS Cost Explorer. The design ensures scalability, traceability, and low operational overhead.

Raw Data Ingestion: Source-by-Source Deep Dive

We began by identifying the four main data sources, each requiring a distinct ingestion strategy.

1. Capital Bikeshare Trip Data (CSV)

- Source: Capital Bikeshare Data Portal
- Frequency: Monthly
- Ingestion: Lambda function written in Python 3.9, deployed using the AWS SAM CLI (a sketch of this ingestion Lambda appears after the source list).
- The script used requests.get() with streaming enabled (stream=True) to avoid memory overload during large file downloads (typically ~300 MB).
- Validated the CSV header structure using Python’s built-in csv module before uploading to S3.
- Created the S3 prefix s3://lastmile/raw/capitalbikeshare/year=YYYY/month=MM/ for time-based partitioning.
- Attached metadata for lineage: Content-MD5, ETag, Ingest-Timestamp, and Source-URL as part of the PutObject call.

2. WMATA Ridership Data (CSV)

- Source: WMATA Open Data
- Frequency: Daily
- Ingestion: Glue Python Shell job written in Python 3.6.
- Utilized the tenacity library for retries with backoff in the presence of 5xx errors.
- Raw files stored with the naming convention station_ridership_YYYYMMDD.csv.
- Generated audit trail logs using the logging module and sent them to a CloudWatch Log Group.

3. Transit App API (JSON)

- Source: Live feeds with stops, predictions, and vehicle locations.
- Frequency: Every 4 hours
- Ingestion: Lambda function used urllib3 for persistent connection pooling.
- Parsed nested JSON with schema drift using pydantic to validate and coerce fields.
- Compressed responses with gzip and wrote to S3 using put_object(Body=buffered_stream) to reduce I/O time.
- Partitioned by api_name, ts_hour, and region.

4. U.S. Census Bureau (ACS 5-Year Estimates)

- Source: Census Data API
- Frequency: Static
- Ingestion: Glue Spark job invoked a Python function via mapPartitions to call the REST API and process the response line by line.
- Used .repartition(20) before the API call to parallelize queries per FIPS region.
- Wrote processed data to s3://lastmile/raw/census/acs_5yr/ as newline-delimited JSON.
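Here is a rough sketch of what that Capital Bikeshare ingestion Lambda could look like. The bucket name, key layout, and metadata keys mirror the description above, but the handler, source URL, and header check are hypothetical, not the project’s exact implementation.

```python
import base64
import csv
import hashlib
from datetime import datetime, timezone

import boto3
import requests

s3 = boto3.client("s3")
BUCKET = "lastmile"
# Hypothetical monthly export URL; the real Lambda derives this from its trigger event
SOURCE_URL = "https://s3.amazonaws.com/capitalbikeshare-data/202401-capitalbikeshare-tripdata.csv"

def handler(event, context):
    tmp_path = "/tmp/tripdata.csv"

    # Stream the ~300 MB download to disk instead of holding it all in memory
    with requests.get(SOURCE_URL, stream=True, timeout=120) as resp:
        resp.raise_for_status()
        with open(tmp_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)

    # Validate the CSV header before uploading (column name is illustrative)
    with open(tmp_path, newline="") as f:
        header = next(csv.reader(f))
    if "ride_id" not in header:
        raise ValueError(f"Unexpected CSV header: {header}")

    # Compute Content-MD5 for lineage / integrity metadata
    md5 = hashlib.md5()
    with open(tmp_path, "rb") as f:
        for block in iter(lambda: f.read(1024 * 1024), b""):
            md5.update(block)
    content_md5 = base64.b64encode(md5.digest()).decode("utf-8")

    now = datetime.now(timezone.utc)
    key = f"raw/capitalbikeshare/year={now:%Y}/month={now:%m}/tripdata.csv"

    with open(tmp_path, "rb") as f:
        s3.put_object(
            Bucket=BUCKET,
            Key=key,
            Body=f,
            ContentMD5=content_md5,
            Metadata={
                "ingest-timestamp": now.isoformat(),
                "source-url": SOURCE_URL,
            },
        )
    return {"uploaded": f"s3://{BUCKET}/{key}"}
```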
Schema Evolution and Storage in S3

Our goal was to support versioned schema tracking, raw lineage, and a query-efficient structure. We implemented a zone-based layout:

```
s3://lastmile/
├── raw/                # Lineage-preserving, unparsed
│   ├── source=.../     # One folder per dataset
│   └── ...
└── processed/          # Query-optimized outputs
    ├── domain=.../     # One folder per analytical domain
    └── ...
```

All processed outputs used Parquet with Snappy compression. Partition strategy: region, year, month, with enforced data typing. JSON schema definitions were added to a separate s3://lastmile/schemas/ folder for cross-checks and downstream tooling.

ETL Logic and Data Transformations (AWS Glue)

Each transformation pipeline was unit-tested locally with PySpark and deployed via Glue 3.0. Example config:

- Job bookmark enabled for deduplication
- --enable-continuous-cloudwatch-log for detailed step logs
- Memory: 6 DPUs

Bikeshare Trip Normalization

```python
from pyspark.sql.functions import col, concat_ws, round, sha2

trip_df = spark.read.option("header", True).csv("s3://lastmile/raw/capitalbikeshare/*.csv")
trip_df = trip_df.withColumn("trip_id", sha2(concat_ws("-", col("start_time"), col("end_time")), 256))
trip_df = trip_df.withColumn("duration_min", round(col("duration") / 60, 2))
trip_df = trip_df.dropna(subset=["start_station", "end_station"])
```

WMATA Station Aggregation

```python
wmata_df = spark.read.option("header", True).csv("s3://lastmile/raw/wmata/*.csv")
wmata_df = wmata_df.withColumn("weekday_normalized", col("avg_weekday") / col("station_capacity"))
```

Real-Time Stop Coordinates Normalization

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def round_point(lat, lon):
    return f"{round(float(lat), 4)}|{round(float(lon), 4)}"

api_df = api_df.withColumn("location_key", round_point("stop_lat", "stop_lon"))
```

Data Modeling and Spatial Indexing

To analyze multimodal proximity relationships, we designed a hybrid star schema:

- Facts: fact_trip_metrics, fact_station_load
- Dimensions: dim_metro, dim_bike_station, dim_nearby_stops, dim_demographics

Spatial joins were executed via Athena SQL:

```sql
SELECT a.station_id, b.metro_id
FROM dim_bike_station a
JOIN dim_metro b
  ON ST_Distance(ST_Point(a.lon, a.lat), ST_Point(b.lon, b.lat))
```

Building a Real-Time Energy Data Lake on GCP: Lessons from Integrating 9 ISO Grid Systems
In today’s rapidly evolving energy landscape, having real-time access to grid performance data is no longer a luxury; it’s a necessity. While system operators like CAISO, ERCOT, and PJM publish data on demand, fuel mix, and prices, integrating these fragmented sources into a unified, analytics-ready system presents a unique engineering challenge.

In this blog, I’ll share how I designed and implemented an Energy Data Lake using GridStatus.io and Google Cloud Platform, covering:

- Multi-ISO data ingestion
- Scalable transformations with Dataproc & Spark
- BigQuery-based analytics
- Serverless orchestration
- Exposing APIs and dashboards
- ML integrations with Vertex AI

If you’re a data engineer, an energy analyst, or simply curious about how to stitch together messy utility data into something beautiful, read on.

🔍 The Problem

Every ISO (Independent System Operator) has its own formats, APIs, and cadence. ERCOT publishes load forecasts every 15 minutes; PJM delivers real-time LMPs across nodes; ISONE provides daily forecasts. Yet for decision-makers (energy traders, data scientists, policymakers), what’s needed is a single pane of glass: a harmonized source of truth with reliable, near-real-time updates.

🧠 Design Principles

When I began architecting this solution, I followed three guiding principles:

- Cloud-native and cost-efficient: use managed services wherever possible.
- Modular: each ISO ingestion pipeline should be loosely coupled.
- Scalable: support backfills, forecast modeling, and BI tool integration.

🏗️ Architecture Overview

Architecture for Data Pipelines

- Ingestion: Python + GridStatus APIs in Cloud Functions → Cloud Storage
- Transformation: PySpark on Dataproc → BigQuery
- Orchestration: Cloud Scheduler + Pub/Sub
- Lineage: Data Catalog
- Monitoring: Cloud Logging & Monitoring
- Delivery: Looker Studio, Vertex AI, and custom APIs

🌐 Ingestion: Taming the ISO Chaos

Each ISO provides distinct data types: load, forecasts, fuel mix, prices. I used the gridstatus Python package (open source) and custom Cloud Functions to extract data and store it as raw CSVs in Cloud Storage. Example for ERCOT:

```python
from gridstatus import Ercot

ercot = Ercot()
load_df = ercot.get_load(date="today")
load_df.to_csv("/tmp/ercot_load.csv")
```

Cloud Functions then pushed these to:

```
gs://my-energy-raw-data/ercot/load/ercot_load_2025-04-14.csv
```

This pattern was repeated across CAISO, MISO, NYISO, PJM, and more.
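Stitched together, the Cloud Function for one ISO might look roughly like the sketch below. The bucket name and file-naming pattern follow the convention above, but the entry point and overall shape are an assumed illustration rather than the exact function from the repo.

```python
from datetime import date

from google.cloud import storage
from gridstatus import Ercot

BUCKET = "my-energy-raw-data"

def ingest_ercot_load(request):
    """HTTP-triggered Cloud Function: pull today's ERCOT load and land it in GCS."""
    ercot = Ercot()
    load_df = ercot.get_load(date="today")

    local_path = "/tmp/ercot_load.csv"
    load_df.to_csv(local_path, index=False)

    # Write to the raw zone using the date-stamped naming convention
    blob_path = f"ercot/load/ercot_load_{date.today().isoformat()}.csv"
    storage.Client().bucket(BUCKET).blob(blob_path).upload_from_filename(local_path)

    return f"Uploaded gs://{BUCKET}/{blob_path}"
```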
🔄 Transformation: Spark-Powered Cleanups

Once the raw files were in Cloud Storage, I used PySpark on Cloud Dataproc to:

- Clean and standardize schemas
- Merge daily/hourly files
- Enrich with weather data where applicable
- Load into BigQuery

```python
from pyspark.sql.functions import col

df = spark.read.csv("gs://my-energy-raw-data/ercot/load_latest/*.csv", header=True)
df = df.withColumn("Load", col("Load").cast("double"))
df.write.format("bigquery").option("table", "energy_data_lake.ercot_load_latest").mode("append").save()
```

Dataproc workflows were defined using Terraform, enabling repeatable jobs that spin up clusters, process data, and tear down, all on a schedule.

⏰ Orchestration: Serverless & Reliable

Cloud Scheduler triggered ingestion and transformation jobs using Pub/Sub. Each ISO had a different refresh frequency: ERCOT every 15 minutes, PJM hourly, ISONE daily. Failures were logged to Cloud Monitoring, with email/SMS alerts via custom metrics. Example scheduler trigger:

```
gcloud scheduler jobs create pubsub ingest-ercot-load \
  --schedule="0 * * * *" --topic=ingest-ercot-load-topic --message-body="trigger"
```

🧾 Data Governance & Metadata

Each BigQuery table was linked in Data Catalog with metadata such as:

- Source API (e.g., ERCOT get_load)
- Raw file location
- Frequency of updates

This enabled full lineage tracking: from the GridStatus API → Cloud Function → GCS → Dataproc → BigQuery.

📈 Visualization & ML

Once in BigQuery, the data powered:

- Looker Studio dashboards: fuel mix vs load over time, forecast vs actual, price heatmaps.
- Vertex AI pipelines: forecasting ERCOT load using time-series models.
- BigQuery ML: fast experiments in SQL, including anomaly detection and regression.
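As a flavor of those BigQuery ML experiments, a time-series forecasting model over the merged ERCOT load table can be trained in a few lines of SQL. The table and column names here (ercot_merged.ercot_fm_load_merged, interval_start, load) come from the queries later in this post, but the model itself is a hedged example, not the exact one used in the project.

```sql
-- Train an ARIMA_PLUS model on historical ERCOT system load
CREATE OR REPLACE MODEL energy_data_lake.ercot_load_arima
OPTIONS(
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'interval_start',
  time_series_data_col = 'load'
) AS
SELECT interval_start, load
FROM ercot_merged.ercot_fm_load_merged;

-- Forecast the next 24 intervals with an 80% confidence level
SELECT forecast_timestamp, forecast_value,
       prediction_interval_lower_bound, prediction_interval_upper_bound
FROM ML.FORECAST(MODEL energy_data_lake.ercot_load_arima,
                 STRUCT(24 AS horizon, 0.8 AS confidence_level));
```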
🚀 Future Work

The current architecture works well for batch and near-real-time data, but future enhancements might include:

- Streaming ingestion with Dataflow for true real-time pipelines
- Kafka connectors for ISO feeds
- Multi-region replication for disaster recovery
- Anomaly detection alerts using ML models on recent load data

📊 Business Intelligence & Insights: Telling Stories With Energy Data

Once our pipeline processed the ISO grid data and stored it in BigQuery, we built dashboards in Looker Studio to empower stakeholders like analysts, researchers, and planners with actionable insights. The goal was simple: transform raw grid telemetry into stories of consumption, generation, and behavior.

🔹 1. Energy Generation vs Load Consumption

Using merged data from ERCOT, we created a dashboard comparing generation sources like solar, wind, nuclear, and natural gas with total system load.

🔍 Observation: Solar peaks at mid-day, wind is erratic, and natural gas acts as the baseload balancer. Load surges during the early morning and late evening, aligning with residential usage patterns.

📈 Business Query: How does energy consumption vary throughout the day?

```sql
SELECT EXTRACT(HOUR FROM interval_start) AS hour,
       AVG(load) AS average_load
FROM ercot_merged.ercot_fm_load_merged
GROUP BY hour
ORDER BY hour;
```

🔹 2. Load Forecast Accuracy Across Regions

We visualized 3-day load forecasts across five ERCOT regions: Houston, North, South, West, and the system total.

🔍 Observation: Daily load patterns show clear peaks, emphasizing the need for accurate forecasting to prevent under- or over-provisioning. Regional trends varied: West and North consistently led peak loads, while South trailed.

📈 Business Query: What is the average energy consumption per month?

```sql
SELECT EXTRACT(MONTH FROM interval_start) AS month,
       AVG(load) AS average_load
FROM ercot_merged.ercot_fm_load_merged
GROUP BY month
ORDER BY month;
```

🔹 3. Energy Mix Breakdown

📊 Energy source composition helps planners assess grid reliability, carbon impact, and dependency on renewables.

📈 Business Query: What is the percentage contribution of each energy source to the grid?

```sql
SELECT ROUND(SUM(solar) / SUM(...) * 100, 2) AS solar_percent,
       ROUND(SUM(wind) / SUM(...) * 100, 2) AS wind_percent,
       ROUND(SUM(nuclear) / SUM(...) * 100, 2) AS nuclear_percent,
       ...
FROM ercot_merged.ercot_fm_load_merged;
```

🔹 4. Weather vs Price Dynamics

To explore whether weather influences grid pricing, we merged ERCOT SPP prices with temperature, humidity, and wind speed.

🔍 Observation: While a clear correlation was not visible over a single day, spikes in SPP prices often coincided with sharp drops in wind speed or with heatwaves, hinting at grid strain.

📈 Business Query: How do weather conditions affect electricity prices?

```sql
SELECT ROUND(AVG(SPP), 2) AS avg_price,
       Temperature,
       Humidity,
       Wind_Speed
FROM ercot_merged.ercot_spp_weather_merged
GROUP BY Temperature, Humidity, Wind_Speed
ORDER BY avg_price DESC;
```

🧠 Takeaway

These BI explorations turned our raw ISO data into narratives of consumption, forecasting precision, and environmental sensitivity. Integrating BigQuery with Looker Studio provided low-latency, self-serve dashboards accessible to analysts across roles. As our pipeline evolves, we plan to:

- Add alerts for anomalous forecast deviations
- Create ML-powered forecasting comparisons
- Embed BI dashboards within internal product portals

🙌 Final Thoughts

This project began with one question: Can we unify the messy energy grid data landscape into a clean, analytics-ready lakehouse? The answer is yes, with the right design, serverless services, and tooling. If you’re working on energy data, grid analytics, or large-scale ingestion pipelines, I’d love to hear from you. Reach out on LinkedIn or check out the GitHub repo.

📚 Appendix

- GridStatus.io Documentation
- Terraform Config Samples
- BigQuery Schema Files
- Looker Studio Dashboards
#energy-data-analytics #data-engineering #google-cloud-platform #infrastructure
Choosing Between MIM and MSIS at UMD: A Detailed Comparison
McKeldin Mall at the University of Maryland

Many students have reached out to me with questions about the Master of Information Management (MIM) and the Master of Science in Information Systems (MSIS) programs at the University of Maryland (UMD). Having been admitted to both, I wanted to share my perspective to help prospective students make an informed decision.

Similarities in Job Outcomes

Both MIM and MSIS prepare students for similar job roles, including:

- Data Scientist
- Data Analyst
- Data Engineer
- Business Analyst
- Product Manager
- Business Intelligence (BI) Engineer
- Software Development Engineer (SDE)
- DevOps Engineer (cross-department coursework)

If you have prior development or coding experience, neither program will feel overwhelmingly technical. However, MSIS is significantly more fast-paced and demanding, whereas MIM is more flexible and relaxed.

MSIS: Fast-Paced and Intensive

- Rigorous curriculum: The MSIS program is hectic, with weekly assignments, quizzes, exams, and tests. Students have limited time for on-campus jobs, though some manage to balance both.
- Fixed course structure: While the curriculum is similar to MIM, the sequence of courses differs.
- Limited tuition remission opportunities: If you are part of the Smith School, you won’t receive tuition remission for on-campus jobs. Additionally, preference for Teaching Assistant (TA), Grader, and Research Assistant (RA) roles at the iSchool is lower. However, department-specific Graduate Assistant (GA) positions can be found through UMD’s eJobs portal.
- Batch size: ~180 students
- Return on Investment (RoI): The program is expensive with a rigid curriculum, making its RoI questionable.
- Credit distribution: 30 credits over 3 semesters (16 months / 1.5 years). Semester-wise breakdown: 13–10–7.

MIM: Flexible and Balanced

- Relaxed first semester: Courses are not rigorous, and there are no exams, only capstone projects and weekly assignments.
- Flexible curriculum: Allows time to develop skills, search for internships, attend networking events, and take on-campus jobs.
- Strong on-campus job opportunities: TA positions offer significant financial benefits, including:
  - Full tuition remission for the semester (provided you secure the position and it is renewed each semester)
  - A fixed monthly stipend (~$2,000 post-tax)
  - State medical insurance (with only ~$47 deducted from the stipend per month)
- Out-of-pocket costs: only student organization and graduate fees (~$1,500 per semester)
- Potential for net positive earnings
- Batch size: ~30 students
- Thesis track available: Helpful if you are considering a PhD in the future.
- Credit distribution: 36 credits over 4 semesters (2 years). Semester-wise breakdown: 3–3–3–3.

Disclaimer: Tuition remission and GA positions are not guaranteed and depend on availability and renewal each semester.

Why I Chose MIM Over MSIS

Since the course outcomes and job opportunities for both programs are nearly identical, I found MIM to be the better option for several reasons:

- Lower tuition costs and better financial aid options
- More flexibility in course selection and pacing
- The opportunity to secure TA/RA positions early, which significantly reduced my expenses
- No restrictions on iSchool students working as an RA at the Smith School or other departments
- Participation in the Smith School’s consulting fellowship ensured I didn’t miss out on important technical experiences

Final Thoughts

Both programs have their pros and cons. If you prefer a structured, intensive learning experience and can handle a fast-paced environment, MSIS might be for you.
However, if you value flexibility, work opportunities, and a better financial outlook, MIM could be the smarter choice. I hope this comparison helps you in your decision-making process. Feel free to reach out if you have more questions!
#ms-in-information-systems #information-systems #graduate-school #ms-in-us #university-of-maryland