Data Cleaning: The Dirty Work That Actually Matters

Feb 19, 2026 · 8 min read

Data Cleaning · Data Engineering · Data Quality

We all know the stats: ~80% of our time goes into cleaning data instead of doing the modeling / dashboarding / ML part everyone posts about on LinkedIn. What most people don’t say out loud is that bad cleaning doesn’t just waste time — it quietly poisons every downstream decision, report, model, and Slack message that comes after it.

So here’s my battle-tested, mildly irritated list of the things I still catch myself (and others) screwing up — even after years in the trenches.

1. You Didn't Actually Look at the Damn Data First

I still see people jump straight to .fillna(0) or drop_duplicates() without ever running df.describe(), df.nunique(), or even df.sample(20). That’s like performing surgery with your eyes closed.

Rule #1 in 2026 remains unchanged: profile before you touch. Missing rate per column, weird min/max values, cardinality explosions, date columns that are secretly strings — these betrayals reveal themselves in the first 90 seconds if you bother to look.
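A 90-second profile can be a single helper. This is a sketch, not a prescription — `profile` and the sample frame below are made up for illustration:

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Quick first look: dtype, missing rate, cardinality, and one example value per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": df.isna().mean().round(3) * 100,
        "n_unique": df.nunique(),
        "example": df.apply(lambda s: s.dropna().iloc[0] if s.notna().any() else None),
    })

df = pd.DataFrame({
    "age": [34, None, 29],
    "signup": ["2026-01-03", "2026-01-04", "not a date"],  # a date column that's secretly strings
})
print(profile(df))
```

One glance at the output and the float column with a missing value and the "date" column with three distinct strings both out themselves.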

2. Treating Every Missing Value Like It's the Same Problem

NaN is not a personality trait — it’s context.

  • Sensor died at 3 a.m. → probably should forward-fill
  • User didn’t want to answer income → imputing the median is defensible
  • Entire cohort from one marketing channel has missing conversion → that’s a systemic data collection bug, not a statistical problem

Blind mean/median imputation or — god forbid — dropping 35% of your rows is still one of the fastest ways to turn gold into garbage.
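The three cases above call for three different fixes. A minimal sketch with made-up sensor and survey data — the point is that forward-fill happens per sensor, and median imputation leaves an audit flag behind:

```python
import pandas as pd

readings = pd.DataFrame({
    "sensor": ["A", "A", "A", "B", "B"],
    "temp":   [21.0, None, 22.0, 18.5, None],
})

# Sensor dropped a reading → carry the last known value forward, per sensor
readings["temp"] = readings.groupby("sensor")["temp"].ffill()

survey = pd.DataFrame({"income": [50_000, None, 62_000, None]})

# Respondent skipped the question → median is defensible,
# but keep a flag so downstream models and reports can see what was imputed
survey["income_imputed"] = survey["income"].isna()
survey["income"] = survey["income"].fillna(survey["income"].median())
```

The systemic-bug case deliberately has no snippet: you fix that in the collection pipeline, not in pandas.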

3. Duplicates That Aren't Obvious Duplicates

drop_duplicates() catches exact matches. It does not catch:

  • “john.doe@gmail.com” vs “johndoe@gmail.com”
  • “iPhone 14 Pro” vs “iphone 14 pro max 256gb”
  • Same transaction logged twice because the payment webhook retried

Fuzzy deduplication is no longer an optional luxury — it’s hygiene.
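You don't even need a library to start. Here's a stdlib-only sketch — normalize aggressively, then compare with `difflib` — with a made-up threshold you should tune on your own data:

```python
import re
from difflib import SequenceMatcher

def normalize(s: str) -> str:
    """Lowercase and strip everything but letters, digits, and '@' so near-duplicates collide."""
    return re.sub(r"[^a-z0-9@]", "", s.lower())

def looks_like_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two records as duplicates if their normalized forms are identical or very close."""
    a, b = normalize(a), normalize(b)
    return a == b or SequenceMatcher(None, a, b).ratio() >= threshold

print(looks_like_duplicate("john.doe@gmail.com", "johndoe@gmail.com"))   # True
print(looks_like_duplicate("iPhone 14 Pro", "Samsung Galaxy S24"))       # False
```

At real scale, pairwise comparison is O(n²); blocking on a cheap key first, or a dedicated library like rapidfuzz, is the usual next step.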

4. Outliers You're Not Allowed to Remove

A customer spent $87,432 in one transaction? Statistically an outlier. Business-wise: it’s the CEO buying 400 licenses on the last day of the quarter. Moral: run your outlier detection, but never auto-remove without showing someone who actually talks to customers or looks at invoices.
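So the rule is flag, don't delete. A small IQR-based sketch (the order amounts are invented, and 1.5×IQR is just the conventional default):

```python
import pandas as pd

def flag_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Mark values outside [Q1 - k*IQR, Q3 + k*IQR]. Flags, never drops."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

orders = pd.DataFrame({"amount": [120, 95, 110, 130, 87_432]})
orders["amount_outlier"] = flag_outliers_iqr(orders["amount"])

# Hand this slice to someone who talks to customers before touching it
print(orders[orders["amount_outlier"]])
```

The $87,432 row gets flagged and routed to a human; it never gets silently dropped by a script.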

5. “It Looked Like a Number Until It Didn't”

  • “1,234.56” (comma thousand separator)
  • “€ 1 250”
  • “N/A”, “-”, “ — ”, “null”, “NULL”, empty string, whitespace-only string

All of these love to hide inside numeric columns and turn your mean() into NaN without warning.

# Strip everything that isn't a digit, dot, or minus sign, then coerce;
# leftovers like "N/A", "-", or "" become NaN explicitly instead of
# silently poisoning your aggregates
df['amount'] = pd.to_numeric(
    df['amount']
      .astype(str)
      .str.replace(r'[^\d.-]', '', regex=True),
    errors='coerce'
)

6. Category Hell

“male”, “M”, “Male ”, “m”, “MAN”, “Male?”, “m/”

You know the drill. Pick one canonical form early and map everything else to it. Bonus points if you store the mapping in a git-controlled YAML file so the business can argue about “preferred gender options” without breaking your pipeline next quarter.
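In code, the canonical mapping is just a dict applied after aggressive normalization. The mapping below is a hypothetical stand-in for the YAML file you'd actually version:

```python
import pandas as pd

# Hypothetical canonical mapping — in practice, load this from the
# git-controlled YAML file so the business can edit it without touching code
GENDER_MAP = {
    "m": "male", "man": "male", "male": "male",
    "f": "female", "woman": "female", "female": "female",
}

raw = pd.Series(["male", "M", "Male ", "m", "MAN", "Male?", "m/"])

cleaned = (
    raw.str.strip()
       .str.lower()
       .str.replace(r"[^a-z]", "", regex=True)  # drop stray punctuation like "?" and "/"
       .map(GENDER_MAP)                          # anything unmapped becomes NaN — loud, not silent
)
```

The `.map()` at the end is deliberate: an unexpected new value surfaces as NaN in your quality checks instead of quietly becoming its own category.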

7. Impossible Values Sneaking Past You

  • Age = -3
  • Birth date in 2027
  • Order status = “shipped” but delivery date is null and payment status = “pending”
  • Email without @

Write the stupid assertions. Future-you will buy present-you coffee for it.

8. The Join That Quietly Multiplies Rows

You merged on customer_id but forgot there are duplicate customer_id values on the right table. Suddenly your 10,000-row table becomes 1.2 million rows and nobody notices until the revenue number looks cursed.

before = len(df)
df = df.merge(right, on='key', how='left')
print(f"Rows went from {before:,} → {len(df):,}")

If that ratio looks weird, investigate immediately.
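Better still, let pandas police the join for you: `merge` takes a `validate` argument that raises the moment the key relationship isn't what you claimed (data below is invented):

```python
import pandas as pd

left = pd.DataFrame({"customer_id": [1, 2, 3], "revenue": [100, 200, 300]})
right = pd.DataFrame({"customer_id": [1, 1, 2], "segment": ["a", "b", "c"]})  # duplicate key!

# validate='many_to_one' makes pandas raise immediately if the right side
# has duplicate keys, instead of silently multiplying your rows
try:
    left.merge(right, on="customer_id", how="left", validate="many_to_one")
except pd.errors.MergeError as e:
    print(f"Caught before the revenue number got cursed: {e}")
```

`indicator=True` is the companion trick: it adds a `_merge` column telling you which rows actually matched.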

9. “I'll Remember Why I Did This”

No. You won’t.

  • write comments next to non-obvious imputations / mappings / filters
  • commit the cleaning notebook / script with meaningful messages
  • or better — turn recurring transformations into named, tested, versioned functions

Six months later when compliance asks “why is revenue 4.3% higher than last year?”, you want to be the person who can grep the git history, not the person refreshing their memory from three Notion pages.
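That last bullet — named, tested, versioned functions — can be as small as this. The rule and the docstring are hypothetical, but the shape is the point: the *why* lives next to the *what*, in git:

```python
import pandas as pd

def drop_test_accounts(df: pd.DataFrame) -> pd.DataFrame:
    """Remove internal test accounts (hypothetical rule: emails on our own domain).

    Rationale: test orders inflate revenue; agreed with finance. See ticket in history.
    """
    return df[~df["email"].str.endswith("@ourcompany.example", na=False)]
```

When compliance asks, `git log -p` on this one function answers the question.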

10. Forgetting That Cleaning Is Production Code

  • parameterize it
  • add logging
  • add data-quality checks that fail loudly
  • run it in a scheduled, reproducible way

Because trust me: the day your boss says “the number is wrong and the board meeting is in 45 minutes” is not the day you want to be debugging pandas indexing by hand.
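What "fail loudly" looks like in practice — one cleaning step with logging and a hard quality gate. The 5% threshold and column name are placeholders; pick yours deliberately:

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("cleaning")

def clean(df: pd.DataFrame, max_null_pct: float = 0.05) -> pd.DataFrame:
    """Drop rows with null amounts, log how many, and abort if the rate is suspicious."""
    before = len(df)
    df = df.dropna(subset=["amount"])
    log.info("dropped %d rows with null amount", before - len(df))

    null_pct = 1 - len(df) / before
    if null_pct > max_null_pct:
        # A broken upstream feed should stop the pipeline,
        # not produce a quietly wrong number for the board meeting
        raise ValueError(f"{null_pct:.1%} of rows had null amount (limit {max_null_pct:.0%})")
    return df
```

Parameterized, logged, and it refuses to hand a corrupted number downstream — the other three bullets follow the same pattern.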

Data cleaning is like doing the dishes: nobody films TikToks about it, nobody gives you claps for it, but if you skip it the whole kitchen eventually smells terrible and nobody wants to eat.

So do it deliberately, do it early, and do it like someone might actually have to defend your choices in a meeting later.

Abdelhamid SAIDI | Data Engineer