MTA Turnstile Data Analysis for Marketing Campaign

Here’s my first project at Metis, we were assigned to use New York MTA turnstile data ( to come up with our own background story for Exploratory Data Analysis.

My Setup:

Our client Sephora wants to run a promotion, the duration will be 2 weeks. They want us to find the 10 busiest stations in New York City. Due to limited budget, only the top 5 will have advertisement boards. For the 2nd 5 busiest stations, they want to have their employees at the stations to hand out flyers & samples(2 days/week).

Goal — More people know about the sale → Increase total guest count

Task 1 — Find the 10 busiest stations

Task 2 — 2nd 5 busiest stations, find 2 days of the week with the most traffic

Data Collecting & Data Cleaning:

I used 13 weeks of the most recent MTA data(12/26/2020–3/26/2021) as it’s an upcoming promotion. Firstly, I used SQLite to create a database & table then loaded combined data into the database. Secondly, I used Sqlalchemy & Pandas to read the data from the database into python. Thirdly, in a Pandas DataFrame, I cleaned my data through the process below:

Step 1: create a new ‘DATETIME’ column

Step 2: dropped duplicates based on ‘unique turnstile’(a combination of ‘C/A’, ‘UNIT’, ‘SCP’, ‘STATION’ defines a unique turnstile) & ‘DATETIME’

Step 3: select ‘REGULAR’ rows for ‘DESC’ column to remove bad data

Step 4: as entries is cumulative data, I use today’s max entries-previous max entries to get the ‘DAILY_ENTRIES’, then drop null value

Step 5: check wildness of the data

Here’s a part of the process of my data cleaning, we can see there are 205 turnstiles that are strange (today’s entries are fewer than yesterday’s). I wrote a function to make the data valid for further analysis.

Conclusion & Visualization:

After I get the clean data, I group by ‘STATION’, sum up the ‘Daily_Entries’ to get the top 10 busiest stations. Then, we will place billboards for the first five.

For our 2nd task, before we check further, I want to know how far the stations(2nd 5 stations) are to the closest Sephora retail stores. Because we want it to be close enough to save the employees time, as well as have a higher chance for people to stop by the store after the knowing the sale info. I set 15 minutes’ walk distance as the boundary and I noticed the “125 ST” Station(45 minutes’ walk/16 minutes’ drive) seems too far from the retail store, so it may not be efficient. So I would skip this station and find the next closest station to the retail store, then I check out the distance for the next 5 busiest station, and I found “42 ST-PORT AUTH” which is 3 minutes’ walk distance away, it’s something that we are looking for, so I will use “42 ST-PORT AUTH” instead.

I selected the 2nd 5 busiest station(with one replacement) and created a new column called ‘DAY_OF_WEEK’, then group by ‘STATION’ & ‘DAY_OF_WEEK’ , sum up ‘DAILY_ENTRIES’ order by total entries, then get the top 2 for each group.

From the graph, we can see that most of the stations with more traffic on weekdays than weekends, except the purple line which is a bit different.

This is the conclusion for Task 2:


86 ST: Wed & Thu

59 ST COLUMBUS: Mon & Fri

42 ST-PORT AUTH: Tue & Wed

23 ST: Mon & Tue

I’ve also plotted out the side by side chart as shown below.

Future Work:

We can ask our clients for their analysis results of the customer characteristics, and behaviors. Segment market based on gender, age group, income level, etc…, use New York Census Data along with our MTA-data to find out which several area/district will be our marketing targets, then we can have marketing campaign around the area to improve the conversion rate.

Notebook Github link:

Appendix: Q&A

Q: Why do the top 5 stations get billboards while the 2nd group of 5 gets live in person crews? Not sure which is a more cost-effective marketing strategy?

A:Let’s go back to our goal, we want more people to know about the sale, when people see the sale info from the billboard, their friends & families may also get this info through word of mouth. The in person campaign cannot reach this effectiveness. Also, the billboard will be there 24 hours per day, which the human crew can’t. Thirdly, the in person crew is individual interactions, so there’s not as much reach. Because there’s only specific amount interactions they can do, they may get the same amount of reach at a less busy station. So, the most cost-effective marketing strategy is to put the billboard at the top 5 busiest stations and give the 2nd group of 5 other marketing campaigns.

Q: Why do you only choose the entries, not the entries & exits?

A:In the majority of cases, people will have a round trip, from home to office, then come back home, so I made an assumption that it’s the same group getting in and getting out from the station.

Data Scientist at Metis LinkedIn-