We Lost Millions of Records: Here's How I Caught It, Fixed It, and Built a Monitoring System in One

by Ekaterina Adamchik, June 11th, 2025

Too Long; Didn't Read

A data platform team found that chunks of data never made it into their platform. The parser in the GCS → Snowflake handoff was skipping records silently due to a malformed schema and lack of validation. The team built a monitoring system using Python and Snowflake.


When you're the support engineer, the analyst, and the firefighter all in one person.

The Context: Big Data, Real Clients, Real Problems

I work in a data platform team supporting clients. Our stack includes Snowflake and BigQuery, and the product exposes millions of data points daily: customer metrics, store-level analytics, historical trends, and more.


So, when someone said, “Why is my dashboard empty?” it wasn’t just a UI bug. Something was broken deep in the data pipeline.


Spoiler: entire chunks of data never made it into our platform.

Step 1: Investigating the Black Hole

The first clue came from a product delivery team: some accounts had incomplete datasets. We quickly traced the flow:

  • Raw data entered MySQL
  • It moved on to Google Cloud Storage (GCS) as expected
  • But it never made it into Snowflake


No alarms. No alerts. Just... silent failure.


The parser in the GCS → Snowflake handoff was skipping records silently due to a malformed schema and lack of validation.
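In hindsight, this kind of silent drop is easy to catch with a source-vs-target reconciliation. Here is a minimal sketch, assuming a MySQL source table called raw_events and the Snowflake analytics_table used in the steps below; the table names, credentials, and the pymysql driver are my assumptions, not the original pipeline:

# Illustrative reconciliation sketch (not the original pipeline): compare daily
# row counts between the MySQL source and the Snowflake target.
# Table, column, and credential names below are assumptions.
import pandas as pd
import snowflake.connector
from sqlalchemy import create_engine

# MySQL source counts per day (requires the pymysql driver)
mysql_engine = create_engine("mysql+pymysql://user:password@mysql-host/source_db")
source_counts = pd.read_sql(
    "SELECT DATE(event_date) AS day, COUNT(*) AS source_rows "
    "FROM raw_events GROUP BY DATE(event_date)",
    mysql_engine,
)

# Snowflake target counts per day
sf_conn = snowflake.connector.connect(
    user="user", password="password", account="account_id",
    warehouse="warehouse", database="db", schema="schema",
)
target_counts = pd.read_sql(
    "SELECT TO_DATE(event_date) AS day, COUNT(*) AS target_rows "
    "FROM analytics_table GROUP BY TO_DATE(event_date)",
    sf_conn,
)
target_counts.columns = [c.lower() for c in target_counts.columns]

# Any day where the target falls short of the source is a silently dropped chunk
diff = source_counts.merge(target_counts, on="day", how="left").fillna(0)
missing = diff[diff["target_rows"] < diff["source_rows"]]
print(missing[["day", "source_rows", "target_rows"]])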

Step 2: Build a Monitoring System (In a Week)

I decided to build a simple but powerful monitoring pipeline using Python and Snowflake. There was no time to wait for a long-term product fix; this had to work now.

Step 2.1: Connect to Snowflake via Python

# Connect to Snowflake with the Python connector (credentials are placeholders)
import snowflake.connector
import pandas as pd

conn = snowflake.connector.connect(
    user='user',
    password='password',
    account='account_id',
    warehouse='warehouse',
    database='db',
    schema='schema'
)

# Example query: latest event date per customer and module
query = """
SELECT customer_id, module, MAX(event_date) AS last_date
FROM analytics_table
GROUP BY customer_id, module
"""
df = pd.read_sql(query, conn)

# Snowflake returns unquoted column names in uppercase; normalize them
# so the rest of the script can use lowercase names
df.columns = [c.lower() for c in df.columns]


Step 2.2: Detect Delays with Pandas

df["lag_days"] = (pd.Timestamp.today() - pd.to_datetime(df["last_date"])).dt.days
df["status"] = df["lag_days"].apply(lambda x: "valid" if x <= 1 else "delayed")


Now I had a clear view of which modules were fresh and which were weeks behind.
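To make that view actionable, it helps to surface the worst offenders first. A quick follow-up sketch on the same df:

# List the most delayed customer/module pairs first
delayed = df[df["status"] == "delayed"].sort_values("lag_days", ascending=False)
print(delayed[["customer_id", "module", "lag_days"]].head(20))

# How many customers are affected per module
print(delayed.groupby("module")["customer_id"].nunique())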

Step 2.3: Store the Results in a New Monitoring Table

from sqlalchemy import create_engine

# SQLAlchemy engine via the snowflake-sqlalchemy dialect (placeholder credentials)
engine = create_engine(
    'snowflake://user:password@account/db/schema?warehouse=warehouse'
)

# Overwrite the monitoring table with the latest snapshot
df.to_sql("monitoring_status", con=engine, if_exists="replace", index=False)
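One design note: if_exists="replace" keeps only the latest snapshot, which is all the dashboard strictly needs. If you also want to chart how delays evolve between runs, a small variation (my assumption, not part of the original setup) is to stamp each run and append to a history table:

# Assumption: keep a run history instead of overwriting the snapshot table
df["snapshot_ts"] = pd.Timestamp.now(tz="UTC")
df.to_sql("monitoring_status_history", con=engine, if_exists="append", index=False)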

Step 3: Show the Pain Visually

Using Looker Studio, I connected directly to the monitoring table.


I built a dashboard that showed:

  • % of modules delayed per customer
  • Time since last successful sync
  • New issues as they appeared


This helped leadership see the gap and prioritize resources.
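Looker Studio only renders what the monitoring table exposes, so each tile is a simple aggregation over monitoring_status. A rough sketch of the kind of query behind the "% of modules delayed per customer" tile, run here through the same Snowflake connection (COUNT_IF is Snowflake SQL):

# Share of delayed modules and the worst lag per customer
pct_query = """
SELECT
    customer_id,
    COUNT_IF(status = 'delayed') / COUNT(*) * 100 AS pct_modules_delayed,
    MAX(lag_days) AS worst_lag_days
FROM monitoring_status
GROUP BY customer_id
ORDER BY pct_modules_delayed DESC
"""
pct_df = pd.read_sql(pct_query, conn)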

Step 4: Fix the Root Cause

Meanwhile, we fixed the broken parser and restarted all affected pipelines. We also retroactively reprocessed all the lost data, and the monitoring confirmed that the recovery was complete.
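The completeness check itself was nothing fancy: conceptually it boils down to re-running the freshness query after the backfill and confirming nothing is still flagged. A minimal sketch against the refreshed monitoring table:

# After refreshing monitoring_status post-backfill, nothing should remain delayed
still_delayed = pd.read_sql(
    "SELECT customer_id, module, lag_days FROM monitoring_status WHERE status = 'delayed'",
    conn,
)
assert still_delayed.empty, f"{len(still_delayed)} customer/module pair(s) still behind"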

Outcomes

  • Recovered 100% of lost data
  • Created a real-time, always-on monitoring dashboard
  • Reduced future blind spots by enabling automatic validation
  • Prevented client escalations before they even noticed

Final Thoughts

Sometimes, all you need is a smart support engineer, a Python script, and a clear view of what’s wrong.


If your data platform has millions of moving parts — assume something is broken. Then go build something to catch it.
