End-to-End Data Engineering Project: Analyzing 2023 Uber Trips on Google Cloud Platform

Sep 30, 2025 · 4 min read
Data Engineering · GCP · BigQuery · Mage · ETL

As someone passionate about data pipelines, I enjoy taking raw, real-world datasets and turning them into structured, analyzable assets. This project is an end-to-end data engineering walkthrough using the 2023 NYC Yellow Taxi trip records — a classic public dataset that simulates ride-hailing analytics similar to Uber.

Inspired by community tutorials (especially Darshil Parmar's clear videos and notebooks), I built the pipeline using modern tools: Google Cloud Platform (GCP), Python, Mage.ai for orchestration, BigQuery for storage, and Looker Studio for visualization.

Project Architecture

Project architecture diagram
  • Raw 2023 Yellow Taxi Parquet files (NYC TLC)
  • Google Cloud Storage (GCS) – raw landing zone
  • Mage.ai ETL pipeline (running on Compute Engine VM) – load, transform, model
  • Google BigQuery – star schema (dims + fact) + analytics view
  • Looker Studio – interactive dashboard with KPIs, charts, and maps

Technologies Used

  • Language: Python (pandas for transformations)
  • Pipeline Tool: Mage.ai
  • Cloud Storage: Google Cloud Storage
  • Compute: Google Compute Engine (VM to host Mage)
  • Warehouse: Google BigQuery
  • BI/Visualization: Looker Studio

Dataset

2023 NYC Yellow Taxi Trip Records, published by the NYC TLC as monthly Parquet files. The full year covers roughly 30–40 million trips across 12 months.

Data Dictionary (key columns)

  • tpep_pickup_datetime, tpep_dropoff_datetime
  • PULocationID, DOLocationID (taxi zones)
  • passenger_count, trip_distance
  • fare_amount, tip_amount, tolls_amount, total_amount
  • congestion_surcharge, airport_fee, payment_type, RatecodeID

Data Modeling: Star Schema

Fact table `fact_trips` with measures (trip_distance, fare_amount, tip_amount, total_amount) and foreign keys to dimensions. Dimensions include `dim_datetime`, `dim_payment_type`, `dim_rate_code`, and location dims.

Star schema diagram

Preparation in Python

  • Drop duplicates, reset index, create synthetic `trip_id`
  • Convert datetimes and extract hour/day/month/year into `dim_datetime`
  • Build small lookup tables for payment_type and rate_code
  • Join and produce the fact table ready for export to BigQuery
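The steps above can be sketched in pandas. This is a minimal, illustrative version: column names follow the TLC schema, but the function name, the surrogate-key scheme, and the one-row-per-trip `dim_datetime` are assumptions of this sketch, not the exact project code.

```python
import pandas as pd

def build_model(df: pd.DataFrame) -> dict:
    """Turn raw trip records into a small star schema (sketch)."""
    # Drop duplicates, reset index, create a synthetic surrogate key
    df = df.drop_duplicates().reset_index(drop=True)
    df["trip_id"] = df.index

    # dim_datetime: extract hour/day/month/year from the pickup timestamp
    dt = pd.to_datetime(df["tpep_pickup_datetime"])
    dim_datetime = pd.DataFrame({
        "datetime_id": df["trip_id"],
        "tpep_pickup_datetime": dt,
        "pickup_hour": dt.dt.hour,
        "pickup_day": dt.dt.day,
        "pickup_month": dt.dt.month,
        "pickup_year": dt.dt.year,
    })

    # Small lookup dim built from the distinct codes present in the data
    dim_payment_type = (df[["payment_type"]].drop_duplicates()
                          .reset_index(drop=True))
    dim_payment_type["payment_type_id"] = dim_payment_type.index

    # Fact table: measures plus foreign keys to the dimensions
    code_to_id = dim_payment_type.set_index("payment_type")["payment_type_id"]
    fact_trips = df.assign(
        payment_type_id=df["payment_type"].map(code_to_id),
        datetime_id=df["trip_id"],
    )[["trip_id", "datetime_id", "payment_type_id", "trip_distance",
       "fare_amount", "tip_amount", "total_amount"]]

    return {"dim_datetime": dim_datetime,
            "dim_payment_type": dim_payment_type,
            "fact_trips": fact_trips}
```

The same pattern extends to `dim_rate_code` and the location dims: build the lookup, map the raw code to the surrogate key, keep only keys and measures in the fact.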

Pipeline Steps

GCS Setup

Create a bucket (e.g., nyc-yellow-taxi-2023-raw) and upload monthly Parquet files. Use IAM for production access control.
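From the CLI, this step looks roughly like the following (region and object prefix are placeholders; adjust to your project):

```shell
# Create the raw landing bucket in a region close to the VM
gsutil mb -l us-east1 gs://nyc-yellow-taxi-2023-raw

# Upload the monthly Parquet files downloaded from the NYC TLC site
gsutil cp yellow_tripdata_2023-*.parquet gs://nyc-yellow-taxi-2023-raw/raw/
```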

VM & Mage Installation

Provision an e2-standard-4 VM, install Docker and Mage, run `mage start nyc_taxi_2023`, and expose the Mage UI (example: TCP 6789).

Mage Blocks

  • Loader: read Parquet from GCS into pandas DataFrame
  • Transformer: clean data, build dimensions & fact (return dict of DataFrames)
  • Exporter: write tables to BigQuery using a service account
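Stripped of the GCS/BigQuery I/O and Mage's decorators, the three blocks have roughly this shape (all names are illustrative; the loader is stubbed with an inline frame where the real block would call `pd.read_parquet` on a `gs://` path):

```python
import pandas as pd

# Loader block: in the real pipeline this reads Parquet from GCS
def load_data() -> pd.DataFrame:
    return pd.DataFrame({
        "tpep_pickup_datetime": ["2023-01-01 08:15:00"],
        "trip_distance": [2.5],
        "fare_amount": [12.0],
    })

# Transformer block: clean the data and return a dict of DataFrames,
# one entry per target table in the star schema
def transform(df: pd.DataFrame) -> dict:
    df = df.drop_duplicates().reset_index(drop=True)
    df["trip_id"] = df.index
    return {"fact_trips": df}

# Exporter block: in Mage this writes each table to BigQuery using
# service-account credentials; here it just reports row counts
def export(tables: dict) -> dict:
    return {name: len(frame) for name, frame in tables.items()}

counts = export(transform(load_data()))
```

The dict-of-DataFrames contract between transformer and exporter is what lets one exporter block write every dim and the fact in a single pass.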

BigQuery

Create a multi-region dataset, partition tables on pickup datetime, and populate dims + fact from Mage exports. Optionally create analytical views for common queries.
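As a hedged sketch, the partitioned fact table's DDL might look like this (the dataset name `nyc_taxi_2023` and the exact column list are assumptions based on the schema above):

```sql
CREATE TABLE IF NOT EXISTS nyc_taxi_2023.fact_trips (
  trip_id INT64,
  datetime_id INT64,
  payment_type_id INT64,
  trip_distance FLOAT64,
  fare_amount FLOAT64,
  tip_amount FLOAT64,
  total_amount FLOAT64,
  tpep_pickup_datetime TIMESTAMP
)
PARTITION BY DATE(tpep_pickup_datetime);
```

Queries that filter on the pickup date then scan only the matching daily partitions instead of the full year.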

Looker Studio

Connect to BigQuery and build visuals: scorecards (total trips, revenue), bar charts (tips by payment type), time heatmaps, and geo maps for pickup/dropoff hotspots.

Looker Studio dashboard example

Key Learnings & Tips

  • Mage simplifies moving from notebook to production pipeline
  • Partition BigQuery tables on `tpep_pickup_datetime` to reduce costs
  • Service account permissions and firewall rules are common blockers
  • Parquet is preferable to CSV for scale and performance

Partitioning and small, well-scoped transformation steps save both time and money.

Abdelhamid SAIDI | Data Engineer