SF Taxi Trip Analysis

Trip trace from SFO to Sacramento (Google Maps)

In this project, I worked on GPS traces of taxis in San Francisco. There are several datasets provided.

The first task is to explore the small dataset and compute some statistics on it. This dataset describes each trip without its intermediate segments, and each line has the following format:

<taxi-id> <start date (POSIX)> <start pos (lat)> <start pos (long)> <end date (POSIX)> <end pos (lat)> <end pos (long)>
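As an illustration of this record layout, here is a minimal parsing sketch. It assumes the fields are separated by whitespace and that the POSIX dates are given in seconds; the `Trip` type, the `parse_trip` helper and the sample line are hypothetical and not part of the dataset.

```python
from typing import NamedTuple


class Trip(NamedTuple):
    taxi_id: str
    start_ts: int      # POSIX timestamp, assumed to be in seconds
    start_lat: float
    start_lon: float
    end_ts: int
    end_lat: float
    end_lon: float


def parse_trip(line: str) -> Trip:
    """Parse one whitespace-separated trip record (assumed layout)."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    return Trip(
        taxi_id=fields[0],
        start_ts=int(float(fields[1])),
        start_lat=float(fields[2]),
        start_lon=float(fields[3]),
        end_ts=int(float(fields[4])),
        end_lat=float(fields[5]),
        end_lon=float(fields[6]),
    )


# Made-up sample line, for illustration only.
sample = "42 1211018404 37.75134 -122.39488 1211018999 37.61500 -122.38900"
trip = parse_trip(sample)
```

Keeping the record as a named tuple makes the later steps (distance, grouping by date) easier to read than indexing into a raw list of strings.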

Here we assume that the distance of a trip is the distance between its start and end points (as given by their latitude and longitude). To compute the geographical distance between two coordinates, we used a simple flat-surface formula. It gives a reasonable approximation for this dataset because the distances are not too large, but such formulas are not always appropriate for larger distances.
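The text above does not say which flat-surface formula was used. One common choice is the equirectangular approximation, sketched below as an assumption rather than as the exact formula from the project. The function name and the Earth radius constant are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant


def flat_distance_km(lat1, lon1, lat2, lon2):
    """Equirectangular (flat-surface) distance between two points, in km.

    Longitude differences are scaled by the cosine of the mean latitude,
    then the two offsets are combined with Pythagoras.
    """
    lat1_r, lat2_r = math.radians(lat1), math.radians(lat2)
    x = math.radians(lon2 - lon1) * math.cos((lat1_r + lat2_r) / 2)
    y = lat2_r - lat1_r
    return EARTH_RADIUS_KM * math.hypot(x, y)


# One degree of latitude is roughly 111 km.
one_degree = flat_distance_km(37.0, -122.0, 38.0, -122.0)
```

For city-scale trips the error against a great-circle formula is small, which is why the approximation is acceptable here.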

All of these tasks were done with PySpark's RDD API. Several queries, such as the longest trip and the busiest date, were computed this way, and the distribution of trip distances was also plotted.
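To make these queries concrete, here is a plain-Python stand-in for the pattern. It does not use PySpark itself, and the trip tuples are made up. With an RDD, the same logic would typically be expressed with `map`, `reduceByKey` and `max`; that mapping is my assumption about how the queries could be written, not a copy of the project code.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical trips as (taxi_id, start_ts, distance_km).
trips = [
    ("a", 1211018404, 12.4),
    ("b", 1211022000, 3.1),
    ("a", 1211108400, 27.9),
]


def day_of(ts):
    """Calendar day (UTC) of a POSIX timestamp, as an ISO string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()


# Longest trip: the record with the largest distance.
longest = max(trips, key=lambda t: t[2])

# Busiest date: count trips per start day and keep the most frequent one.
trips_per_day = Counter(day_of(t[1]) for t in trips)
busiest_day, busiest_count = trips_per_day.most_common(1)[0]

# Distance distribution: number of trips per 5 km bucket (bucket width assumed).
BUCKET_KM = 5
histogram = Counter(int(t[2] // BUCKET_KM) * BUCKET_KM for t in trips)
```

The bucket counts in `histogram` are what a distance-distribution plot would show on its vertical axis.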

Next, we consider this scenario:

“A significant number of taxi rides pass through the San Francisco airport. Assume that taxi companies have to pay for an expensive license for this airport access. A company may then be interested in knowing exactly how much they earn from these airport rides, to know whether paying the license is actually worth it.”

The goal of this part of the project is to use Hadoop to estimate the revenue coming from airport rides based on the GPS tracking data. The estimate should be as accurate as possible, so all of the data must be considered; sampling is not an option.

This consists of two steps, which we describe next:

  • reconstructing trips from segments
  • computing the revenue obtained from these trips

First, we are required to reconstruct complete trips from the ride segments. The .segments files contain the complete GPS tracks decomposed into segments. A segment is simply a pair of geographical coordinates. The sampling rate is generally 1 minute, although there can be larger gaps. We designed a Map/Reduce application to reconstruct the trips.
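The design itself is not detailed here, so the following is only a sketch of one possible Map/Reduce decomposition, written in plain Python rather than against the Hadoop API. It assumes each segment record carries a taxi id plus a timestamped start and end point, and that a new trip starts whenever the gap between consecutive segments exceeds a threshold. The record layout, the `MAX_GAP_S` value and all function names are hypothetical.

```python
from collections import defaultdict

MAX_GAP_S = 180  # hypothetical: a gap longer than this starts a new trip


def map_segment(seg):
    """Map step: key each segment by taxi id.

    seg = (taxi_id, start_ts, start_lat, start_lon, end_ts, end_lat, end_lon)
    """
    return seg[0], seg[1:]


def reduce_taxi(taxi_id, segments):
    """Reduce step: order one taxi's segments by time and cut them into trips."""
    trips, current = [], []
    for seg in sorted(segments):  # tuples sort by start_ts first
        previous_end_ts = current[-1][3] if current else None
        if current and seg[0] - previous_end_ts > MAX_GAP_S:
            trips.append(current)
            current = []
        current.append(seg)
    if current:
        trips.append(current)
    return trips


def reconstruct(segments):
    """Local stand-in for the shuffle: group by key, then reduce each group."""
    groups = defaultdict(list)
    for seg in segments:
        key, value = map_segment(seg)
        groups[key].append(value)
    return {taxi: reduce_taxi(taxi, segs) for taxi, segs in groups.items()}
```

Keying by taxi id sends all segments of one vehicle to the same reducer, which can then sort them by time. If the real segment files mark whether the taxi is occupied, that flag would be a better trip boundary than a time gap.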

Second, we use the output of the previous component to compute the total revenue obtained from airport trips. We consider airport rides to be those that pass through a circle centered on the airport with a radius of 1 km. The airport is located at 37.62131° N, 122.37896° W (latitude 37.62131, longitude -122.37896). To calculate trip revenue, we used a simple formula that combines a starting fee of $3.50 with an additional $1.71 per kilometer.
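The airport test and the fare formula can be sketched as follows. The fee constants, the radius and the airport coordinates come from the text above. The rest is assumed: a trip is represented as a list of (lat, lon) points, its length is the sum of the distances between consecutive points, and the distance uses an equirectangular approximation. The function names are illustrative.

```python
import math

AIRPORT = (37.62131, -122.37896)  # (lat, lon) from the text
AIRPORT_RADIUS_KM = 1.0
BASE_FEE = 3.5       # starting fee in dollars
FEE_PER_KM = 1.71    # dollars per kilometer
EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def flat_distance_km(lat1, lon1, lat2, lon2):
    """Equirectangular (flat-surface) distance in km; an assumed formula."""
    lat1_r, lat2_r = math.radians(lat1), math.radians(lat2)
    x = math.radians(lon2 - lon1) * math.cos((lat1_r + lat2_r) / 2)
    return EARTH_RADIUS_KM * math.hypot(x, lat2_r - lat1_r)


def passes_airport(points):
    """True if any sampled point lies within the airport circle."""
    return any(
        flat_distance_km(lat, lon, *AIRPORT) <= AIRPORT_RADIUS_KM
        for lat, lon in points
    )


def trip_length_km(points):
    """Sum of distances between consecutive points (assumed definition)."""
    return sum(flat_distance_km(*a, *b) for a, b in zip(points, points[1:]))


def trip_revenue(points):
    """Starting fee plus the per-kilometer fee."""
    return BASE_FEE + FEE_PER_KM * trip_length_km(points)


def airport_revenue(trips):
    """Total revenue over the trips that pass through the airport circle."""
    return sum(trip_revenue(p) for p in trips if passes_airport(p))
```

Checking only the sampled points is a simplification: a trip that crosses the circle between two samples would be missed, so a stricter version would test the distance from the airport to each line segment instead.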

We reported the total revenue that has been earned from airport trips, and also made a plot that shows the evolution of this revenue over time.

Feiyang Tang
Ph.D. Candidate in Machine Learning

Data Enthusiast, ENFJ-T. Travelling, hiking and crime series lover. Multilingual.