Data Cleaning: The Dirty Work That Actually Matters

Feb 19, 2026 · 8 min read

Data Cleaning · Data Engineering · Data Quality

We all know the stats: ~80% of our time goes into cleaning data instead of doing the modeling / dashboarding / ML part everyone posts about on LinkedIn. What most people don’t say out loud is that bad cleaning doesn’t just waste time — it quietly poisons every downstream decision, report, model, and Slack message that comes after it.

So here’s my battle-tested, mildly irritated list of the things I still catch myself (and others) screwing up — even after years in the trenches.

1. You Didn't Actually Look at the Damn Data First

I still see people jump straight to .fillna(0) or drop_duplicates() without ever running df.describe(), df.nunique(), or even df.sample(20). That’s like performing surgery with your eyes closed.

Rule #1 in 2026 remains unchanged: profile before you touch. Missing rate per column, weird min/max values, cardinality explosions, date columns that are secretly strings — these betrayals reveal themselves in the first 90 seconds if you bother to look.
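A 90-second profile can be a single helper. This is a sketch, not a prescription — `profile` and the sample frame below are made up for illustration:

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Quick first look: dtype, missing rate, cardinality, and one example value per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": df.isna().mean().round(3) * 100,
        "n_unique": df.nunique(),
        "example": df.apply(lambda s: s.dropna().iloc[0] if s.notna().any() else None),
    })

df = pd.DataFrame({
    "age": [34, None, 29],
    "signup": ["2026-01-03", "2026-01-04", "not a date"],  # a date column that's secretly strings
})
print(profile(df))
```

One glance at the output and the float column with a missing value and the "date" column with three distinct strings both out themselves.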

2. Treating Every Missing Value Like It's the Same Problem

NaN is not a personality trait — it’s context.

  • Sensor died at 3 a.m. → probably should forward-fill
  • User didn’t want to answer income → imputing the median is defensible
  • Entire cohort from one marketing channel has missing conversion → that’s a systemic data collection bug, not a statistical problem

Blind mean/median imputation or — god forbid — dropping 35% of your rows is still one of the fastest ways to turn gold into garbage.
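The three cases above call for three different fixes. A minimal sketch with made-up sensor and survey data — the point is that forward-fill happens per sensor, and median imputation leaves an audit flag behind:

```python
import pandas as pd

readings = pd.DataFrame({
    "sensor": ["A", "A", "A", "B", "B"],
    "temp":   [21.0, None, 22.0, 18.5, None],
})

# Sensor dropped a reading → carry the last known value forward, per sensor
readings["temp"] = readings.groupby("sensor")["temp"].ffill()

survey = pd.DataFrame({"income": [50_000, None, 62_000, None]})

# Respondent skipped the question → median is defensible,
# but keep a flag so downstream models and reports can see what was imputed
survey["income_imputed"] = survey["income"].isna()
survey["income"] = survey["income"].fillna(survey["income"].median())
```

The systemic-bug case deliberately has no snippet: you fix that in the collection pipeline, not in pandas.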

3. Duplicates That Aren't Obvious Duplicates

drop_duplicates() catches exact matches. It does not catch:

  • “john.doe@gmail.com” vs “johndoe@gmail.com”
  • “iPhone 14 Pro” vs “iphone 14 pro max 256gb”
  • Same transaction logged twice because the payment webhook retried

Fuzzy deduplication is no longer an optional luxury — it’s hygiene.
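You don't even need a library to start. Here's a stdlib-only sketch — normalize aggressively, then compare with `difflib` — with a made-up threshold you should tune on your own data:

```python
import re
from difflib import SequenceMatcher

def normalize(s: str) -> str:
    """Lowercase and strip everything but letters, digits, and '@' so near-duplicates collide."""
    return re.sub(r"[^a-z0-9@]", "", s.lower())

def looks_like_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two records as duplicates if their normalized forms are identical or very close."""
    a, b = normalize(a), normalize(b)
    return a == b or SequenceMatcher(None, a, b).ratio() >= threshold

print(looks_like_duplicate("john.doe@gmail.com", "johndoe@gmail.com"))   # True
print(looks_like_duplicate("iPhone 14 Pro", "Samsung Galaxy S24"))       # False
```

At real scale, pairwise comparison is O(n²); blocking on a cheap key first, or a dedicated library like rapidfuzz, is the usual next step.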

4. Outliers You're Not Allowed to Remove

A customer spent $87,432 in one transaction? Statistically an outlier. Business-wise: it’s the CEO buying 400 licenses on the last day of the quarter. Moral: run your outlier detection, but never auto-remove without showing someone who actually talks to customers or looks at invoices.
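So the rule is flag, don't delete. A small IQR-based sketch (the order amounts are invented, and 1.5×IQR is just the conventional default):

```python
import pandas as pd

def flag_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Mark values outside [Q1 - k*IQR, Q3 + k*IQR]. Flags, never drops."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

orders = pd.DataFrame({"amount": [120, 95, 110, 130, 87_432]})
orders["amount_outlier"] = flag_outliers_iqr(orders["amount"])

# Hand this slice to someone who talks to customers before touching it
print(orders[orders["amount_outlier"]])
```

The $87,432 row gets flagged and routed to a human; it never gets silently dropped by a script.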

5. “It Looked Like a Number Until It Didn't”

  • “1,234.56” (comma thousand separator)
  • “€ 1 250”
  • “N/A”, “-”, “ — ”, “null”, “NULL”, empty string, whitespace-only string

All of these love to hide inside numeric columns and turn your mean() into NaN without warning.

# Strip everything that isn't a digit, dot, or minus sign, then coerce;
# leftovers like "N/A", "-", or "" become NaN explicitly instead of
# silently poisoning your aggregates
df['amount'] = pd.to_numeric(
    df['amount']
      .astype(str)
      .str.replace(r'[^\d.-]', '', regex=True),
    errors='coerce'
)

6. Category Hell

“male”, “M”, “Male ”, “m”, “MAN”, “Male?”, “m/”

You know the drill. Pick one canonical form early and map everything else to it. Bonus points if you store the mapping in a git-controlled YAML file so the business can argue about “preferred gender options” without breaking your pipeline next quarter.
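In code, the canonical mapping is just a dict applied after aggressive normalization. The mapping below is a hypothetical stand-in for the YAML file you'd actually version:

```python
import pandas as pd

# Hypothetical canonical mapping — in practice, load this from the
# git-controlled YAML file so the business can edit it without touching code
GENDER_MAP = {
    "m": "male", "man": "male", "male": "male",
    "f": "female", "woman": "female", "female": "female",
}

raw = pd.Series(["male", "M", "Male ", "m", "MAN", "Male?", "m/"])

cleaned = (
    raw.str.strip()
       .str.lower()
       .str.replace(r"[^a-z]", "", regex=True)  # drop stray punctuation like "?" and "/"
       .map(GENDER_MAP)                          # anything unmapped becomes NaN — loud, not silent
)
```

The `.map()` at the end is deliberate: an unexpected new value surfaces as NaN in your quality checks instead of quietly becoming its own category.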

7. Impossible Values Sneaking Past You

  • Age = -3
  • Birth date in 2027
  • Order status = “shipped” but delivery date is null and payment status = “pending”
  • Email without @

Write the stupid assertions. Future-you will buy present-you coffee for it.

8. The Join That Quietly Multiplies Rows

You merged on customer_id but forgot there are duplicate customer_id values on the right table. Suddenly your 10,000-row table becomes 1.2 million rows and nobody notices until the revenue number looks cursed.

before = len(df)
df = df.merge(right, on='key', how='left')
print(f"Rows went from {before:,} → {len(df):,}")

If that ratio looks weird, investigate immediately.
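Better still, let pandas police the join for you: `merge` takes a `validate` argument that raises the moment the key relationship isn't what you claimed (data below is invented):

```python
import pandas as pd

left = pd.DataFrame({"customer_id": [1, 2, 3], "revenue": [100, 200, 300]})
right = pd.DataFrame({"customer_id": [1, 1, 2], "segment": ["a", "b", "c"]})  # duplicate key!

# validate='many_to_one' makes pandas raise immediately if the right side
# has duplicate keys, instead of silently multiplying your rows
try:
    left.merge(right, on="customer_id", how="left", validate="many_to_one")
except pd.errors.MergeError as e:
    print(f"Caught before the revenue number got cursed: {e}")
```

`indicator=True` is the companion trick: it adds a `_merge` column telling you which rows actually matched.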

9. “I'll Remember Why I Did This”

No. You won’t.

  • write comments next to non-obvious imputations / mappings / filters
  • commit the cleaning notebook / script with meaningful messages
  • or better — turn recurring transformations into named, tested, versioned functions

Six months later when compliance asks “why is revenue 4.3% higher than last year?”, you want to be the person who can grep the git history, not the person refreshing their memory from three Notion pages.
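That last bullet — named, tested, versioned functions — can be as small as this. The rule and the docstring are hypothetical, but the shape is the point: the *why* lives next to the *what*, in git:

```python
import pandas as pd

def drop_test_accounts(df: pd.DataFrame) -> pd.DataFrame:
    """Remove internal test accounts (hypothetical rule: emails on our own domain).

    Rationale: test orders inflate revenue; agreed with finance. See ticket in history.
    """
    return df[~df["email"].str.endswith("@ourcompany.example", na=False)]
```

When compliance asks, `git log -p` on this one function answers the question.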

10. Forgetting That Cleaning Is Production Code

  • parameterize it
  • add logging
  • add data-quality checks that fail loudly
  • run it in a scheduled, reproducible way

Because trust me: the day your boss says “the number is wrong and the board meeting is in 45 minutes” is not the day you want to be debugging pandas indexing by hand.
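What "fail loudly" looks like in practice — one cleaning step with logging and a hard quality gate. The 5% threshold and column name are placeholders; pick yours deliberately:

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("cleaning")

def clean(df: pd.DataFrame, max_null_pct: float = 0.05) -> pd.DataFrame:
    """Drop rows with null amounts, log how many, and abort if the rate is suspicious."""
    before = len(df)
    df = df.dropna(subset=["amount"])
    log.info("dropped %d rows with null amount", before - len(df))

    null_pct = 1 - len(df) / before
    if null_pct > max_null_pct:
        # A broken upstream feed should stop the pipeline,
        # not produce a quietly wrong number for the board meeting
        raise ValueError(f"{null_pct:.1%} of rows had null amount (limit {max_null_pct:.0%})")
    return df
```

Parameterized, logged, and it refuses to hand a corrupted number downstream — the other three bullets follow the same pattern.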

Data cleaning is like doing the dishes: nobody films TikToks about it, nobody gives you claps for it, but if you skip it the whole kitchen eventually smells terrible and nobody wants to eat.

So do it deliberately, do it early, and do it like someone might actually have to defend your choices in a meeting later.

Abdelhamid SAIDI | Data Engineer