Designing My First Production-Style ELT Architecture
Introduction
As a data engineering student passionate about building real-world systems, I wanted to move beyond tutorials and design something closer to what modern data teams actually use in production.
Instead of focusing only on SQL transformations, my goal was to understand:
- how cloud data warehouses fit into ELT architectures;
- how transformation layers are structured;
- how orchestration connects everything together;
- and how data quality is enforced automatically.
This led me to design and implement a complete ELT pipeline architecture using Snowflake, dbt, and Apache Airflow. In this article, I’ll focus on the architecture and design decisions behind the pipeline rather than line-by-line code.
Tools Used and Their Roles
Snowflake — The Data Platform
Snowflake acts as the central data layer: it stores raw source data, executes transformation workloads, hosts analytics-ready tables, and scales compute independently from storage. It serves as the pipeline's single source of truth.
dbt — The Transformation and Modeling Layer
dbt is responsible for organizing transformations into modular models, building transformation DAGs, managing dependencies between datasets, running automated data tests, and enforcing analytics engineering best practices.
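The dependency management dbt performs can be illustrated with a tiny stdlib sketch: each model declares which models it reads from (the role `ref()` plays in real dbt SQL), and a topological sort produces a valid run order. The model names below are hypothetical, not taken from the actual project.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical model graph: each key depends on the models in its set,
# mirroring how dbt's ref() calls define edges between models.
models = {
    "stg_orders": {"raw_orders"},
    "stg_customers": {"raw_customers"},
    "fct_orders": {"stg_orders", "stg_customers"},
}

# dbt resolves a graph like this and executes models in dependency order.
run_order = list(TopologicalSorter(models).static_order())
print(run_order)
```

This is exactly why a change to a staging model automatically reruns everything downstream of it and nothing else.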
Apache Airflow — The Orchestration Engine
Airflow sits at the top of the architecture. Its role is to schedule pipeline execution, trigger dbt jobs, monitor task status, handle retries and failures, and provide observability through its web UI. It turns the pipeline into a fully automated production workflow.
The Goal: Replicating a Real-World ELT Architecture
The objective of this project was not just to transform data — it was to design a pipeline that reflects real production environments: cloud-native, modular, testable, scalable, and observable. This meant separating responsibilities across layers instead of creating a single monolithic pipeline.
ETL vs ELT: Architectural Perspective
Traditionally, pipelines followed Extract → Transform → Load (ETL). In cloud-first architectures, this has shifted to Extract → Load → Transform (ELT): raw data is loaded into the warehouse first and transformed there, because cloud storage is cheap, compute is elastic, and warehouse SQL engines are highly optimized.
High-Level Pipeline Architecture
- Source Layer — Raw TPCH data stored inside Snowflake.
- Staging Layer — Standardized and cleaned representations of source data.
- Transformation & Mart Layer — Business logic, aggregations, and fact tables.
- Orchestration Layer — Airflow coordinating execution.
Data Flow Design
Source Layer — Raw Data Preservation
The pipeline uses Snowflake's built-in TPCH sample dataset as the source. At this stage no business logic is applied — data remains untouched and raw records are preserved. This design allows reprocessing if business logic changes and provides historical traceability.
Staging Layer — Data Standardization
The staging layer is where structure is introduced: rename cryptic source column names, normalize schemas, apply light cleaning, and prepare data for downstream modeling. This layer acts as a contract boundary between raw ingestion and business logic.
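In the real project this standardization lives in dbt staging models written in SQL; the Python sketch below only illustrates the idea of a rename contract. The `C_`-prefixed names come from TPCH's customer table, but the target names and the fallback-lowercasing rule are my own illustrative choices.

```python
# Illustrative mapping from TPCH's cryptic column names to readable ones.
COLUMN_MAP = {
    "C_CUSTKEY": "customer_key",
    "C_NAME": "customer_name",
    "C_MKTSEGMENT": "market_segment",
}

def standardize(record: dict) -> dict:
    """Rename known columns; lowercase anything not in the mapping."""
    return {COLUMN_MAP.get(k, k.lower()): v for k, v in record.items()}

raw = {"C_CUSTKEY": 42, "C_NAME": "Customer#000000042", "C_ACCTBAL": 711.56}
print(standardize(raw))
```

Downstream models then only ever reference the clean names, which is what makes the staging layer a contract boundary.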
Transformation Layer — Business Logic
This layer contains the core modeling logic: joins, derived metrics, business rules, and intermediate reusable datasets. Logic is split into modular transformation steps, which improves debugging, reusability, and lineage. dbt automatically builds the dependency graph between models.
Analytics Layer — Consumption-Ready Data
The final layer produces fact tables and aggregated marts optimized for BI tools, dashboards, and reporting workloads — designed for performance, simplicity, and stability.
Orchestration Architecture with Airflow
While dbt handles transformations, it does not handle scheduling. Airflow becomes the control plane of the architecture, enabling time-based scheduling, failure handling, monitoring, and task dependency management.
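The failure-handling part of that control plane can be sketched in plain Python. This is a toy stand-in for Airflow's real per-task `retries` and `retry_delay` parameters, not Airflow code; the flaky task below simulates a transient warehouse error.

```python
import time

def run_with_retries(task, retries=2, retry_delay=0.0):
    """Toy version of Airflow-style retry handling: rerun a failing task
    up to `retries` extra times before surfacing the failure."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # out of retries: let the failure propagate
            time.sleep(retry_delay)

attempts = {"n": 0}
def flaky_dbt_run():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient warehouse error")
    return "success"

print(run_with_retries(flaky_dbt_run, retries=2))  # succeeds on third attempt
```

In Airflow the same behavior is declared per task rather than coded by hand, and failed runs stay visible in the UI for inspection.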
Integrating dbt with Astronomer Cosmos
Instead of manually building DAG tasks, I used Astronomer Cosmos to convert dbt projects directly into Airflow DAGs. This provides automatic mapping of dbt model dependencies, clean DAG structure, and minimal boilerplate code.
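A Cosmos DAG file ends up looking roughly like the fragment below. This is a configuration sketch based on Cosmos's documented `DbtDag` interface: the project path, profile names, schedule, and `dag_id` are all placeholders, not values from my project, and newer Airflow versions use `schedule` instead of `schedule_interval`.

```python
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig

dag = DbtDag(
    dag_id="dbt_snowflake_pipeline",
    project_config=ProjectConfig("/usr/local/airflow/dbt/my_project"),
    profile_config=ProfileConfig(
        profile_name="snowflake_profile",
        target_name="dev",
        profiles_yml_filepath="/usr/local/airflow/dbt/profiles.yml",
    ),
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
```

Cosmos reads the dbt project's own dependency graph and renders one Airflow task per model, so the DAG stays in sync with the models without any manual task wiring.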
Data Quality Architecture
A production pipeline is useless without trust. This architecture includes automated data validation layers: structural tests (primary key uniqueness, null constraints, referential integrity) and business logic tests (invalid values, impossible dates, logical inconsistencies). These tests act as quality gates.
If data breaks business rules, the pipeline surfaces errors before analytics teams consume incorrect results.
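The quality-gate idea can be sketched as plain assertions over rows — a stdlib stand-in for dbt's built-in `unique` and `not_null` schema tests, with an illustrative column name rather than one from the project.

```python
def check_unique_not_null(rows, key):
    """Toy equivalents of dbt's `unique` and `not_null` tests on one column.
    Returns a list of failure messages; an empty list means the gate passes."""
    failures = []
    values = [r.get(key) for r in rows]
    if any(v is None for v in values):
        failures.append(f"not_null failed on {key}")
    non_null = [v for v in values if v is not None]
    if len(non_null) != len(set(non_null)):
        failures.append(f"unique failed on {key}")
    return failures

orders = [{"order_key": 1}, {"order_key": 2}, {"order_key": 2}, {"order_key": None}]
print(check_unique_not_null(orders, "order_key"))
```

In the real pipeline these checks are declared in dbt YAML and run by Airflow after each build, failing the DAG run before any mart is published.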