As someone passionate about data pipelines, I enjoy taking raw, real-world datasets and turning them into structured, analyzable assets. This project is an end-to-end data engineering walkthrough using the 2023 NYC Yellow Taxi trip records — a classic public dataset that simulates ride-hailing analytics similar to Uber.
Inspired by community tutorials (especially Darshil Parmar's clear videos and notebooks), I built the pipeline using modern tools: Google Cloud Platform (GCP), Python, Mage.ai for orchestration, BigQuery for storage, and Looker Studio for visualization.
Project Architecture
- Raw 2023 Yellow Taxi Parquet files (NYC TLC)
- Google Cloud Storage (GCS) – raw landing zone
- Mage.ai ETL pipeline (running on Compute Engine VM) – load, transform, model
- Google BigQuery – star schema (dims + fact) + analytics view
- Looker Studio – interactive dashboard with KPIs, charts, and maps
Technologies Used
- Language: Python (pandas for transformations)
- Pipeline Tool: Mage.ai
- Cloud Storage: Google Cloud Storage
- Compute: Google Compute Engine (VM to host Mage)
- Warehouse: Google BigQuery
- BI/Visualization: Looker Studio
Dataset
2023 NYC Yellow Taxi Trip Records (monthly Parquet files). The full year spans roughly 30–40 million trips across 12 monthly files.
Data Dictionary (key columns)
- tpep_pickup_datetime, tpep_dropoff_datetime
- PULocationID, DOLocationID (taxi zones)
- passenger_count, trip_distance
- fare_amount, tip_amount, tolls_amount, total_amount
- congestion_surcharge, airport_fee, payment_type, RatecodeID
Data Modeling: Star Schema
Fact table `fact_trips` with measures (trip_distance, fare_amount, tip_amount, total_amount) and foreign keys to dimensions. Dimensions include `dim_datetime`, `dim_payment_type`, `dim_rate_code`, and location dims.
Preparation in Python
- Drop duplicates, reset index, create synthetic `trip_id`
- Convert datetimes and extract hour/day/month/year into `dim_datetime`
- Build small lookup tables for payment_type and rate_code
- Join and produce the fact table ready for export to BigQuery
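The steps above can be sketched in pandas. The column names come from the data dictionary; the helper name `build_star_schema`, the index-based surrogate keys, and the trimmed payment-type lookup are illustrative choices, not the exact code used in the pipeline.

```python
import pandas as pd

def build_star_schema(df: pd.DataFrame) -> dict:
    """Illustrative sketch of the preparation steps above."""
    # Drop duplicates, reset index, create a synthetic trip_id
    df = df.drop_duplicates().reset_index(drop=True)
    df["trip_id"] = df.index

    # dim_datetime: parse pickup time and extract hour/day/month/year
    pickup = pd.to_datetime(df["tpep_pickup_datetime"])
    dim_datetime = pd.DataFrame({
        "datetime_id": df.index,
        "tpep_pickup_datetime": pickup,
        "pickup_hour": pickup.dt.hour,
        "pickup_day": pickup.dt.day,
        "pickup_month": pickup.dt.month,
        "pickup_year": pickup.dt.year,
    })

    # Small lookup dim from the TLC payment_type code list (first four codes)
    dim_payment_type = pd.DataFrame({
        "payment_type": [1, 2, 3, 4],
        "payment_type_name": ["Credit card", "Cash", "No charge", "Dispute"],
    })

    # fact_trips: measures plus foreign keys into the dimensions
    fact_trips = df[[
        "trip_id", "PULocationID", "DOLocationID", "passenger_count",
        "trip_distance", "fare_amount", "tip_amount", "total_amount",
        "payment_type",
    ]].assign(datetime_id=df.index)

    return {
        "dim_datetime": dim_datetime,
        "dim_payment_type": dim_payment_type,
        "fact_trips": fact_trips,
    }
```

Returning a dict of DataFrames keeps the transform step compatible with a single Mage exporter block that loops over the tables.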
Pipeline Steps
GCS Setup
Create a bucket (e.g., nyc-yellow-taxi-2023-raw) and upload monthly Parquet files. Use IAM for production access control.
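A minimal upload sketch, assuming the TLC monthly naming pattern (`yellow_tripdata_2023-MM.parquet`), the example bucket name above, and the `google-cloud-storage` client with ambient credentials; the `raw/` prefix and `local_dir` layout are hypothetical.

```python
from pathlib import Path

BUCKET = "nyc-yellow-taxi-2023-raw"  # example bucket name from above

# TLC publishes one Parquet file per month with this naming pattern
monthly_files = [f"yellow_tripdata_2023-{m:02d}.parquet" for m in range(1, 13)]

def upload_to_gcs(local_dir: str, bucket_name: str = BUCKET) -> None:
    """Upload the 12 monthly files to a raw/ prefix in the bucket."""
    from google.cloud import storage  # requires google-cloud-storage + credentials
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    for name in monthly_files:
        blob = bucket.blob(f"raw/{name}")
        blob.upload_from_filename(str(Path(local_dir) / name))
```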
VM & Mage Installation
Provision an e2-standard-4 VM, install Docker and Mage, run `mage start nyc_taxi_2023`, and open the Mage UI port in the VM firewall (default: TCP 6789).
Mage Blocks
- Loader: read Parquet from GCS into pandas DataFrame
- Transformer: clean data, build dimensions & fact (return dict of DataFrames)
- Exporter: write tables to BigQuery using a service account
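Sketched as plain functions below; in Mage each would live in its own block file carrying the corresponding `@data_loader`, `@transformer`, or `@data_exporter` decorator. The bucket path mirrors the GCS setup above; the thin `transform` body and the `pandas-gbq` export via `to_gbq` are illustrative assumptions.

```python
import pandas as pd

def load_from_gcs(month: int) -> pd.DataFrame:
    """Loader block: pandas reads gs:// paths directly when gcsfs is installed."""
    path = f"gs://nyc-yellow-taxi-2023-raw/raw/yellow_tripdata_2023-{month:02d}.parquet"
    return pd.read_parquet(path)

def transform(df: pd.DataFrame) -> dict:
    """Transformer block: clean data and return a dict of DataFrames."""
    df = df.drop_duplicates().reset_index(drop=True)
    # ...build the dimensions and fact table as in "Preparation in Python"...
    return {"fact_trips": df}

def export_to_bigquery(tables: dict, project: str, dataset: str) -> None:
    """Exporter block: to_gbq (pandas-gbq) authenticates via the
    service account configured in the environment."""
    for name, frame in tables.items():
        frame.to_gbq(f"{dataset}.{name}", project_id=project, if_exists="replace")
```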
BigQuery
Create a multi-region dataset, partition tables on pickup datetime, and populate dims + fact from Mage exports. Optionally create analytical views for common queries.
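Table creation with partitioning can be scripted with the `google-cloud-bigquery` client; the sketch below shows day partitioning on the pickup timestamp. The abbreviated schema is hypothetical, and the function assumes the package is installed with application-default credentials available.

```python
def create_fact_table(project: str, dataset: str) -> None:
    """Create fact_trips partitioned by pickup day (sketch)."""
    from google.cloud import bigquery  # requires google-cloud-bigquery + credentials
    client = bigquery.Client(project=project)

    # Minimal illustrative schema; the real table carries all fact columns
    schema = [
        bigquery.SchemaField("trip_id", "INT64"),
        bigquery.SchemaField("tpep_pickup_datetime", "TIMESTAMP"),
        bigquery.SchemaField("fare_amount", "FLOAT64"),
    ]
    table = bigquery.Table(f"{project}.{dataset}.fact_trips", schema=schema)
    # Day-partition on the pickup timestamp so date-filtered queries
    # scan only the matching partitions
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="tpep_pickup_datetime",
    )
    client.create_table(table, exists_ok=True)
```

With this in place, any query that filters on `tpep_pickup_datetime` prunes partitions automatically, which is where the cost savings in the tips below come from.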
Looker Studio
Connect to BigQuery and build visuals: scorecards (total trips, revenue), bar charts (tips by payment type), time heatmaps, and geo maps for pickup/dropoff hotspots.
Key Learnings & Tips
- Mage makes it easy to move from notebook prototypes to a production pipeline
- Partition BigQuery tables on `tpep_pickup_datetime` to reduce costs
- Service account permissions and firewall rules are common blockers
- Parquet is preferable to CSV for scale and performance
“Partitioning and small, well-scoped transformation steps save both time and money.”


