As someone passionate about data pipelines, I enjoy taking raw, real-world datasets and turning them into structured, analyzable assets. This project is an end-to-end data engineering walkthrough using the 2023 NYC Yellow Taxi trip records — a classic public dataset that simulates ride-hailing analytics similar to Uber.
Inspired by community tutorials (especially Darshil Parmar's clear videos and notebooks), I built the pipeline using modern tools: Google Cloud Platform (GCP), Python, Mage.ai for orchestration, BigQuery for storage, and Looker Studio for visualization.
Project Architecture
- Raw 2023 Yellow Taxi Parquet files (NYC TLC)
- Google Cloud Storage (GCS) – raw landing zone
- Mage.ai ETL pipeline (running on Compute Engine VM) – load, transform, model
- Google BigQuery – star schema (dims + fact) + analytics view
- Looker Studio – interactive dashboard with KPIs, charts, and maps
Technologies Used
- Language: Python (pandas for transformations)
- Pipeline Tool: Mage.ai
- Cloud Storage: Google Cloud Storage
- Compute: Google Compute Engine (VM to host Mage)
- Warehouse: Google BigQuery
- BI/Visualization: Looker Studio
Dataset
2023 NYC Yellow Taxi Trip Records (monthly Parquet files). The full year spans roughly 30–40 million trips across 12 monthly files.
Data Dictionary (key columns)
- tpep_pickup_datetime, tpep_dropoff_datetime
- PULocationID, DOLocationID (taxi zones)
- passenger_count, trip_distance
- fare_amount, tip_amount, tolls_amount, total_amount
- congestion_surcharge, airport_fee, payment_type, RatecodeID
Data Modeling: Star Schema
Fact table `fact_trips` with measures (trip_distance, fare_amount, tip_amount, total_amount) and foreign keys to dimensions. Dimensions include `dim_datetime`, `dim_payment_type`, `dim_rate_code`, and location dims.
Preparation in Python
- Drop duplicates, reset index, create synthetic `trip_id`
- Convert datetimes and extract hour/day/month/year into `dim_datetime`
- Build small lookup tables for payment_type and rate_code
- Join and produce the fact table ready for export to BigQuery
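The steps above can be sketched in pandas. The column names come from the data dictionary; the helper name `build_star_schema`, the index-based surrogate keys, and the trimmed payment-type lookup are illustrative choices, not the exact code used in the pipeline.

```python
import pandas as pd

def build_star_schema(df: pd.DataFrame) -> dict:
    """Illustrative sketch of the preparation steps above."""
    # Drop duplicates, reset index, create a synthetic trip_id
    df = df.drop_duplicates().reset_index(drop=True)
    df["trip_id"] = df.index

    # dim_datetime: parse pickup time and extract hour/day/month/year
    pickup = pd.to_datetime(df["tpep_pickup_datetime"])
    dim_datetime = pd.DataFrame({
        "datetime_id": df.index,
        "tpep_pickup_datetime": pickup,
        "pickup_hour": pickup.dt.hour,
        "pickup_day": pickup.dt.day,
        "pickup_month": pickup.dt.month,
        "pickup_year": pickup.dt.year,
    })

    # Small lookup dim from the TLC payment_type code list (first four codes)
    dim_payment_type = pd.DataFrame({
        "payment_type": [1, 2, 3, 4],
        "payment_type_name": ["Credit card", "Cash", "No charge", "Dispute"],
    })

    # fact_trips: measures plus foreign keys into the dimensions
    fact_trips = df[[
        "trip_id", "PULocationID", "DOLocationID", "passenger_count",
        "trip_distance", "fare_amount", "tip_amount", "total_amount",
        "payment_type",
    ]].assign(datetime_id=df.index)

    return {
        "dim_datetime": dim_datetime,
        "dim_payment_type": dim_payment_type,
        "fact_trips": fact_trips,
    }
```

Returning a dict of DataFrames keeps the transform step compatible with a single Mage exporter block that loops over the tables.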
Pipeline Steps
GCS Setup
Create a bucket (e.g., nyc-yellow-taxi-2023-raw) and upload monthly Parquet files. Use IAM for production access control.
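A minimal upload sketch, assuming the TLC monthly naming pattern (`yellow_tripdata_2023-MM.parquet`), the example bucket name above, and the `google-cloud-storage` client with ambient credentials; the `raw/` prefix and `local_dir` layout are hypothetical.

```python
from pathlib import Path

BUCKET = "nyc-yellow-taxi-2023-raw"  # example bucket name from above

# TLC publishes one Parquet file per month with this naming pattern
monthly_files = [f"yellow_tripdata_2023-{m:02d}.parquet" for m in range(1, 13)]

def upload_to_gcs(local_dir: str, bucket_name: str = BUCKET) -> None:
    """Upload the 12 monthly files to a raw/ prefix in the bucket."""
    from google.cloud import storage  # requires google-cloud-storage + credentials
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    for name in monthly_files:
        blob = bucket.blob(f"raw/{name}")
        blob.upload_from_filename(str(Path(local_dir) / name))
```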
VM & Mage Installation
Provision an e2-standard-4 VM, install Docker and Mage, run `mage start nyc_taxi_2023`, and open the Mage UI port in the VM firewall (default: TCP 6789).
Mage Blocks
- Loader: read Parquet from GCS into pandas DataFrame
- Transformer: clean data, build dimensions & fact (return dict of DataFrames)
- Exporter: write tables to BigQuery using a service account
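Sketched as plain functions below; in Mage each would live in its own block file carrying the corresponding `@data_loader`, `@transformer`, or `@data_exporter` decorator. The bucket path mirrors the GCS setup above; the thin `transform` body and the `pandas-gbq` export via `to_gbq` are illustrative assumptions.

```python
import pandas as pd

def load_from_gcs(month: int) -> pd.DataFrame:
    """Loader block: pandas reads gs:// paths directly when gcsfs is installed."""
    path = f"gs://nyc-yellow-taxi-2023-raw/raw/yellow_tripdata_2023-{month:02d}.parquet"
    return pd.read_parquet(path)

def transform(df: pd.DataFrame) -> dict:
    """Transformer block: clean data and return a dict of DataFrames."""
    df = df.drop_duplicates().reset_index(drop=True)
    # ...build the dimensions and fact table as in "Preparation in Python"...
    return {"fact_trips": df}

def export_to_bigquery(tables: dict, project: str, dataset: str) -> None:
    """Exporter block: to_gbq (pandas-gbq) authenticates via the
    service account configured in the environment."""
    for name, frame in tables.items():
        frame.to_gbq(f"{dataset}.{name}", project_id=project, if_exists="replace")
```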
BigQuery
Create a multi-region dataset, partition tables on pickup datetime, and populate dims + fact from Mage exports. Optionally create analytical views for common queries.
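Table creation with partitioning can be scripted with the `google-cloud-bigquery` client; the sketch below shows day partitioning on the pickup timestamp. The abbreviated schema is hypothetical, and the function assumes the package is installed with application-default credentials available.

```python
def create_fact_table(project: str, dataset: str) -> None:
    """Create fact_trips partitioned by pickup day (sketch)."""
    from google.cloud import bigquery  # requires google-cloud-bigquery + credentials
    client = bigquery.Client(project=project)

    # Minimal illustrative schema; the real table carries all fact columns
    schema = [
        bigquery.SchemaField("trip_id", "INT64"),
        bigquery.SchemaField("tpep_pickup_datetime", "TIMESTAMP"),
        bigquery.SchemaField("fare_amount", "FLOAT64"),
    ]
    table = bigquery.Table(f"{project}.{dataset}.fact_trips", schema=schema)
    # Day-partition on the pickup timestamp so date-filtered queries
    # scan only the matching partitions
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="tpep_pickup_datetime",
    )
    client.create_table(table, exists_ok=True)
```

With this in place, any query that filters on `tpep_pickup_datetime` prunes partitions automatically, which is where the cost savings in the tips below come from.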
Looker Studio
Connect to BigQuery and build visuals: scorecards (total trips, revenue), bar charts (tips by payment type), time heatmaps, and geo maps for pickup/dropoff hotspots.
Key Learnings & Tips
- Mage makes it easy to move from notebook prototypes to a production pipeline
- Partition BigQuery tables on `tpep_pickup_datetime` to reduce costs
- Service account permissions and firewall rules are common blockers
- Parquet is preferable to CSV for scale and performance
“Partitioning and small, well-scoped transformation steps save both time and money.”


