End-to-End Data Engineering Project: Analyzing 2023 Uber Trips on Google Cloud Platform

Sep 30, 2025 · 4 min read
Data Engineering · GCP · BigQuery · Mage · ETL

As someone passionate about data pipelines, I enjoy taking raw, real-world datasets and turning them into structured, analyzable assets. This project is an end-to-end data engineering walkthrough using the 2023 NYC Yellow Taxi trip records — a classic public dataset that simulates ride-hailing analytics similar to Uber.

Inspired by community tutorials (especially Darshil Parmar's clear videos and notebooks), I built the pipeline using modern tools: Google Cloud Platform (GCP), Python, Mage.ai for orchestration, BigQuery for storage, and Looker Studio for visualization.

Project Architecture

Project architecture diagram
  • Raw 2023 Yellow Taxi Parquet files (NYC TLC)
  • Google Cloud Storage (GCS) – raw landing zone
  • Mage.ai ETL pipeline (running on Compute Engine VM) – load, transform, model
  • Google BigQuery – star schema (dims + fact) + analytics view
  • Looker Studio – interactive dashboard with KPIs, charts, and maps

Technologies Used

  • Language: Python (pandas for transformations)
  • Pipeline Tool: Mage.ai
  • Cloud Storage: Google Cloud Storage
  • Compute: Google Compute Engine (VM to host Mage)
  • Warehouse: Google BigQuery
  • BI/Visualization: Looker Studio

Dataset

2023 NYC Yellow Taxi Trip Records (monthly Parquet files). 2023 covers ~30–40 million trips across 12 months.

Data Dictionary (key columns)

  • tpep_pickup_datetime, tpep_dropoff_datetime
  • PULocationID, DOLocationID (taxi zones)
  • passenger_count, trip_distance
  • fare_amount, tip_amount, tolls_amount, total_amount
  • congestion_surcharge, airport_fee, payment_type, RatecodeID

Data Modeling: Star Schema

Fact table `fact_trips` with measures (trip_distance, fare_amount, tip_amount, total_amount) and foreign keys to dimensions. Dimensions include `dim_datetime`, `dim_payment_type`, `dim_rate_code`, and location dims.

Star schema diagram

Preparation in Python

  • Drop duplicates, reset index, create synthetic `trip_id`
  • Convert datetimes and extract hour/day/month/year into `dim_datetime`
  • Build small lookup tables for payment_type and rate_code
  • Join and produce the fact table ready for export to BigQuery
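The steps above can be condensed into one transformation function. This is a sketch on a subset of columns; the function name `build_star_schema` and the exact lookup values are illustrative (the real payment-type and rate-code lists come from the TLC data dictionary):

```python
import pandas as pd

# Static lookups, abbreviated from the TLC data dictionary.
PAYMENT_TYPES = {1: "Credit card", 2: "Cash", 3: "No charge", 4: "Dispute"}
RATE_CODES = {1: "Standard rate", 2: "JFK", 3: "Newark", 5: "Negotiated fare"}

def build_star_schema(df: pd.DataFrame) -> dict:
    """Turn raw trip records into dimension and fact tables."""
    df = df.drop_duplicates().reset_index(drop=True)
    df["trip_id"] = df.index  # synthetic surrogate key

    # dim_datetime: derived calendar fields from the pickup timestamp.
    dt = pd.to_datetime(df["tpep_pickup_datetime"])
    dim_datetime = pd.DataFrame({
        "datetime_id": df["trip_id"],
        "pickup_hour": dt.dt.hour,
        "pickup_day": dt.dt.day,
        "pickup_month": dt.dt.month,
        "pickup_year": dt.dt.year,
    })

    # Small lookup dimensions.
    dim_payment_type = pd.DataFrame(
        {"payment_type": list(PAYMENT_TYPES),
         "payment_type_name": list(PAYMENT_TYPES.values())}
    )
    dim_rate_code = pd.DataFrame(
        {"RatecodeID": list(RATE_CODES),
         "rate_code_name": list(RATE_CODES.values())}
    )

    # Fact table: measures plus foreign keys to the dimensions.
    fact_trips = df[[
        "trip_id", "payment_type", "RatecodeID",
        "PULocationID", "DOLocationID",
        "trip_distance", "fare_amount", "tip_amount", "total_amount",
    ]].assign(datetime_id=df["trip_id"])

    return {
        "dim_datetime": dim_datetime,
        "dim_payment_type": dim_payment_type,
        "dim_rate_code": dim_rate_code,
        "fact_trips": fact_trips,
    }
```

Returning a dict of DataFrames keeps the export step generic: the exporter can iterate over table names without knowing the schema.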

Pipeline Steps

GCS Setup

Create a bucket (e.g., nyc-yellow-taxi-2023-raw) and upload monthly Parquet files. Use IAM for production access control.
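The TLC publishes one Parquet file per month, so the landing zone ends up holding twelve objects. A small helper (name and `raw/` prefix are illustrative; the `yellow_tripdata_YYYY-MM.parquet` naming matches the TLC downloads) makes the expected layout explicit:

```python
def monthly_object_uris(bucket: str, year: int = 2023) -> list:
    """GCS URIs for the twelve monthly TLC Parquet files."""
    return [
        f"gs://{bucket}/raw/yellow_tripdata_{year}-{month:02d}.parquet"
        for month in range(1, 13)
    ]
```

The files themselves can be uploaded with `gsutil cp` or the google-cloud-storage client.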

VM & Mage Installation

Provision an e2-standard-4 VM, install Docker and Mage, run `mage start nyc_taxi_2023`, and expose the Mage UI through a firewall rule on its default port (TCP 6789).

Mage Blocks

  • Loader: read Parquet from GCS into pandas DataFrame
  • Transformer: clean data, build dimensions & fact (return dict of DataFrames)
  • Exporter: write tables to BigQuery using a service account
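In Mage these three blocks carry `@data_loader`, `@transformer`, and `@data_exporter` decorators; the sketch below shows them as plain functions instead, with illustrative cleaning filters and placeholder project/dataset names:

```python
import pandas as pd

def load_from_gcs(uri: str) -> pd.DataFrame:
    """Loader block: read one monthly Parquet file from GCS
    (pandas resolves gs:// paths via gcsfs)."""
    return pd.read_parquet(uri)

def transform(df: pd.DataFrame) -> dict:
    """Transformer block: basic cleaning; the full version also
    builds the dimension and fact tables."""
    df = df.drop_duplicates().reset_index(drop=True)
    df = df[df["trip_distance"] > 0]   # drop zero-distance records
    df = df[df["total_amount"] >= 0]   # drop negative totals
    return {"trips": df}

def export_to_bigquery(tables: dict, project: str, dataset: str) -> None:
    """Exporter block: write each table to BigQuery (here via
    pandas-gbq; Mage also ships its own BigQuery exporter)."""
    for name, table in tables.items():
        table.to_gbq(f"{dataset}.{name}", project_id=project, if_exists="replace")
```

The exporter authenticates with the service account key configured in Mage's `io_config.yaml`.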

BigQuery

Create a multi-region dataset, partition tables on pickup datetime, and populate dims + fact from Mage exports. Optionally create analytical views for common queries.
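A partitioned fact table and one analytics view might look like this in BigQuery DDL (project and dataset names are placeholders):

```sql
-- Partition the fact table on pickup date to cut scan costs.
CREATE TABLE `my-project.nyc_taxi_2023.fact_trips` (
  trip_id INT64,
  datetime_id INT64,
  payment_type INT64,
  RatecodeID INT64,
  PULocationID INT64,
  DOLocationID INT64,
  trip_distance FLOAT64,
  fare_amount NUMERIC,
  tip_amount NUMERIC,
  total_amount NUMERIC,
  tpep_pickup_datetime TIMESTAMP
)
PARTITION BY DATE(tpep_pickup_datetime);

-- Example analytics view: tips by payment type.
CREATE VIEW `my-project.nyc_taxi_2023.v_tips_by_payment` AS
SELECT p.payment_type_name,
       SUM(f.tip_amount) AS total_tips,
       COUNT(*)          AS trips
FROM `my-project.nyc_taxi_2023.fact_trips` f
JOIN `my-project.nyc_taxi_2023.dim_payment_type` p USING (payment_type)
GROUP BY p.payment_type_name;
```

Queries from Looker Studio that filter on a date range then only scan the matching partitions.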

Looker Studio

Connect to BigQuery and build visuals: scorecards (total trips, revenue), bar charts (tips by payment type), time heatmaps, and geo maps for pickup/dropoff hotspots.

Looker Studio dashboard example

Key Learnings & Tips

  • Mage simplifies moving from notebook to production pipeline
  • Partition BigQuery tables on `tpep_pickup_datetime` to reduce costs
  • Service account permissions and firewall rules are common blockers
  • Parquet is preferable to CSV for scale and performance

Partitioning and small, well-scoped transformation steps save both time and money.

Abdelhamid SAIDI | Data Engineer