Every day, roughly five million people ride the New York City subway system, which spans 665 miles across the city's five boroughs. The system currently comprises 473 subway stations. I will use data from the Foursquare API to learn more about each of these stations: specifically, I will pull venue data to understand which types of establishments are prominent around each station. From this information, the traffic pattern of each station can be estimated. For example, a station surrounded primarily by university venues will be busier on weekdays than on weekends, whereas an area with many entertainment venues will likely see a traffic spike on weekends. This information will be useful to city planners, because the classification of the subway stations provides a good estimate of human traffic within a radius of each station. I will also use k-means clustering to see whether the resulting clusters correlate with the venue categories around each station.
Import libraries
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import folium # library for creating maps
import json # reading a JSON file
from pandas.io.json import json_normalize # transforming a JSON file into a pandas data frame
import seaborn as sns # library for data visualization
import matplotlib.pyplot as plt # library for basic plotting
from matplotlib import cm, colors # creating colormaps for map markers
from IPython.display import Image, display # viewing multiple images in one cell
from sklearn.preprocessing import MinMaxScaler # normalizing data
from sklearn.metrics import silhouette_score # choosing the best number of clusters for k-means
from sklearn.cluster import KMeans # calculating k-means clusters
from collections import Counter # counting occurrences of stations in each cluster
Load dataset containing subway stations and their coordinates
station_coordinates_df = pd.read_csv('NYC Subway Station Coordinates.csv')
station_coordinates_df.head()
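The steps below assume each entry in the Coordinates column is a single 'latitude,longitude' string. As a quick sanity check (a minimal sketch, not part of the original pipeline), every entry should split into two floats within valid ranges:
# Hypothetical sanity check: each Coordinates entry should parse into a valid lat/long pair
for coordinates in station_coordinates_df['Coordinates']:
    lat, long = (float(x) for x in coordinates.split(','))
    assert -90 <= lat <= 90 and -180 <= long <= 180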
Create a map showing the NYC subway stations and the venue search radius for each one
NYC_subway_map = folium.Map(width=1200, height=675, location=[40.7320, -73.9301], zoom_start=13)
for station, coordinates in zip(station_coordinates_df['Station Name'], station_coordinates_df['Coordinates']):
    latlong = [float(x) for x in coordinates.split(',')]
    label = folium.Popup(str(station))
    # Add a marker for each station, with the station name as a popup.
    folium.Circle(latlong,
                  radius=20,
                  popup=label,
                  color='black',
                  fill=True).add_to(NYC_subway_map)
    # Add the venue search radius around each station.
    folium.Circle(latlong,
                  radius=200,
                  color='gray',
                  fill=True,
                  fill_color='gray',
                  fill_opacity=0.4).add_to(NYC_subway_map)
NYC_subway_map
Define Foursquare API credentials and version
credentials = json.load(open('MY_CREDENTIALS.json'))
CLIENT_ID = credentials['MY_CLIENT_ID']
CLIENT_SECRET = credentials['MY_CLIENT_SECRET']
VERSION = credentials['MY_VERSION']
LIMIT = 50
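For reference, MY_CREDENTIALS.json is assumed to hold the three keys read above (the key names mirror the lookups; the values are placeholders for your own Foursquare credentials). A minimal guard catches a malformed file early:
# Hypothetical check: fail fast if the credentials file is missing a required key
required_keys = ['MY_CLIENT_ID', 'MY_CLIENT_SECRET', 'MY_VERSION']
missing = [key for key in required_keys if key not in credentials]
assert not missing, 'MY_CREDENTIALS.json is missing: {}'.format(missing)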
Make a GET request to get the main categories of venues along with their category IDs
categories_url = 'https://api.foursquare.com/v2/venues/categories?client_id={}&client_secret={}&v={}'.format(
    CLIENT_ID, CLIENT_SECRET, VERSION)
results = requests.get(categories_url).json()
categories_list = []
def print_categories(categories, level=0, max_level=0):
    """Print venue categories down to max_level levels deep and record (name, id) pairs."""
    if level > max_level:
        return
    prefix = '-' * level
    for category in categories:
        print(prefix + category['name'])
        print_categories(category['categories'], level + 1, max_level)
        categories_list.append((category['name'], category['id'])) # Add category name and ID to list
print_categories(results['response']['categories'], 0, 0)
Make a GET request to get the count of nearby venues for each category, within a radius of each station
def get_venues_count(latlong, radius, categoryId):
    venues_url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={}&radius={}&categoryId={}'.format(
        CLIENT_ID, CLIENT_SECRET, VERSION, latlong, radius, categoryId)
    return requests.get(venues_url).json()['response']['totalResults']
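This function will be called once per station per category (several thousand requests), so Foursquare rate limits or transient network errors can interrupt the loop. A minimal sketch of a retry wrapper (get_venues_count_safe is hypothetical, not part of the original notebook):
import time

def get_venues_count_safe(latlong, radius, categoryId, retries=3):
    # Hypothetical wrapper: retry failed requests with exponential backoff (1s, 2s, 4s)
    for attempt in range(retries):
        try:
            return get_venues_count(latlong, radius, categoryId)
        except (requests.exceptions.RequestException, KeyError):
            time.sleep(2 ** attempt)
    return 0 # give up and record zero venues for this station/category pair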
Create a new data frame to store the venue categories
station_venues_df = station_coordinates_df.copy()
for category in categories_list:
    station_venues_df[category[0]] = 0
Request the count of venues for each station, then store the results in a CSV file
for idx, row in station_venues_df.iterrows():
    for category in categories_list:
        station_venues_df.loc[idx, category[0]] = get_venues_count(row['Coordinates'],
                                                                   radius=200,
                                                                   categoryId=category[1])
station_venues_df.to_csv('station_venues.csv')
Read the CSV file into a data frame
station_venues_df = pd.read_csv('station_venues.csv', index_col=0)
station_venues_df.head()
Find which stations have the most venues around them
station_venues_sum_df = station_venues_df.copy()
station_venues_sum_df['Sum'] = station_venues_sum_df.sum(axis=1, numeric_only=True)
for idx, row in enumerate(station_venues_sum_df.nlargest(20, 'Sum').iterrows()):
    print("{}. {}:\tNumber of Venues = {}".format(idx + 1, row[1]['Station Name'], row[1]['Sum']).expandtabs(39))
Create a boxplot and swarmplot to show the count of each venue category
plt.figure(figsize=(20, 10))
sns.boxplot(data=station_venues_df, showfliers=False)
ax = sns.swarmplot(data=station_venues_df, zorder=.5)
ax.set_ylabel('Venue Count', fontsize=15)
ax.set_xlabel('Venue Category', fontsize=15)
ax.tick_params(labelsize=10)
plt.xticks(rotation=45, ha='right')
# Make the boxplot fill colors transparent
for patch in ax.artists:
    r, g, b, a = patch.get_facecolor()
    patch.set_facecolor((r, g, b, .3))
plt.show()
Drop the Event category from the data frame and the category list, because it has very little data
station_venues_df.drop('Event', axis=1, inplace=True)
categories_list = [x for x in categories_list if x[0] != 'Event']
Function that sorts the venues of each station (row) in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[2:] # skip the Station Name and Coordinates columns
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]
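As a quick illustration with toy data (not from the dataset), the first two entries of a row are metadata and the rest are counts, so the function returns the category names ordered by count:
# Hypothetical example row: two metadata fields followed by category counts
toy_row = pd.Series({'Station Name': 'Example St', 'Coordinates': '40.73,-73.93',
                     'Food': 12, 'Residence': 3, 'Shop & Service': 25})
print(return_most_common_venues(toy_row, 2)) # ['Shop & Service' 'Food']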
Create new data frame that has the top 5 categories by count for each station
station_venues_sorted_df = station_coordinates_df.copy()
num_top_venues = 5
indicators = ['st', 'nd', 'rd']
columns = []
for idx in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(idx + 1, indicators[idx]))
    except IndexError: # indicators only covers 1st-3rd; fall back to 'th'
        columns.append('{}th Most Common Venue'.format(idx + 1))
for column in columns:
    station_venues_sorted_df[column] = 0
station_venues_sorted_df.head()
for idx in np.arange(station_venues_df.shape[0]):
    station_venues_sorted_df.iloc[idx, 2:] = return_most_common_venues(station_venues_df.iloc[idx, :], num_top_venues)
station_venues_sorted_df.head()
Function that creates a map for the given rank of a venue category count
def create_venue_cat_map(venue_rank_str):
    venue_cat_map = folium.Map(width=1200, height=675, location=[40.7320, -73.9301], zoom_start=11)
    # Create a color map, then assign colors to venue category names.
    cmap = cm.get_cmap('hsv')
    color_indices = np.linspace(0, 1, 11)
    rgb_colors = [colors.to_hex(list(rgba)) for rgba in [cmap(idx) for idx in color_indices]]
    color_dict = {k: v for (k, v) in zip(station_venues_df.columns[2:], rgb_colors)}
    for coordinates, venue_cat in zip(station_venues_sorted_df['Coordinates'], station_venues_sorted_df[venue_rank_str]):
        latlong = [float(x) for x in coordinates.split(',')]
        label = folium.Popup(str(venue_cat))
        # Add a marker for each station, colored by its venue category.
        folium.CircleMarker(latlong,
                            radius=5,
                            popup=label,
                            color=color_dict[venue_cat],
                            fill=True,
                            fill_color=color_dict[venue_cat],
                            fill_opacity=0.6).add_to(venue_cat_map)
    return venue_cat_map
Loop to create a map for each tier of venue category rank
map_names = ["venue_rank_{}_map".format(x) for x in range(1, 6)]
map_names_dict = {}
for map_name, column in zip(map_names, station_venues_sorted_df.columns[2:]):
    map_names_dict[map_name] = create_venue_cat_map(column)
Display every created map
legend = Image(filename='venue_cat_legend.JPG', width=800) # Legend made in Excel and saved as JPG
display('1st Most Common Categories')
display(legend)
display(map_names_dict['venue_rank_1_map'])
display('2nd Most Common Categories')
display(legend)
display(map_names_dict['venue_rank_2_map'])
display('3rd Most Common Categories')
display(legend)
display(map_names_dict['venue_rank_3_map'])
display('4th Most Common Categories')
display(legend)
display(map_names_dict['venue_rank_4_map'])
display('5th Most Common Categories')
display(legend)
display(map_names_dict['venue_rank_5_map'])
Scale the data to values between 0 and 1
X = station_venues_df.values[:,2:]
cluster_dataset = MinMaxScaler().fit_transform(X)
cluster_df = pd.DataFrame(cluster_dataset)
cluster_df.columns = [category[0] for category in categories_list]
cluster_df.head()
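MinMaxScaler maps each category column independently to [0, 1] via (x - min) / (max - min), so categories with large absolute counts (such as Food) do not dominate the Euclidean distances that k-means relies on. A toy example (made-up values) shows the mapping:
# Hypothetical single-column example: counts 0, 10, 40 map to 0.0, 0.25, 1.0
toy_counts = np.array([[0.], [10.], [40.]])
print(MinMaxScaler().fit_transform(toy_counts).ravel()) # [0.   0.25 1.  ]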
Visualize the scaled data
plt.figure(figsize=(20, 10))
sns.boxplot(data=cluster_df, showfliers=False)
ax = sns.swarmplot(data=cluster_df, zorder=.5)
ax.set_ylabel('Venue Count (Relative)', fontsize=15)
ax.set_xlabel('Venue Category', fontsize=15)
ax.tick_params(labelsize=10)
plt.xticks(rotation=45, ha='right')
# Make the boxplot fill colors transparent
for patch in ax.artists:
    r, g, b, a = patch.get_facecolor()
    patch.set_facecolor((r, g, b, .3))
plt.show()
Calculate the optimal number of clusters using the silhouette method
sil = []
K_sil = range(2, 20)
for k in K_sil:
    kmeans = KMeans(n_clusters=k, random_state=0).fit(cluster_df) # fixed seed for reproducibility
    labels = kmeans.labels_
    sil.append(silhouette_score(cluster_df, labels, metric='euclidean'))
plt.plot(K_sil, sil, 'bx-')
plt.xlabel('k')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Method For Optimal k')
plt.show()
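Rather than only eyeballing the curve, the k with the highest silhouette score can be read off directly (a one-line check against the scores computed above; neighboring k values are still worth inspecting on the plot):
best_k = K_sil[int(np.argmax(sil))]
print('k with the highest silhouette score:', best_k)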
The optimal number of clusters appears to be 6
kclusters = 6
kmeans = KMeans(init='k-means++', n_clusters=kclusters, random_state=0).fit(cluster_df)
# print how many stations are in each cluster
print(Counter(kmeans.labels_))
Add the corresponding cluster to each station in the data frame
# Re-insert the cluster labels safely if the cell is re-run
if 'Cluster Label' in station_venues_sorted_df.columns:
    station_venues_sorted_df.drop('Cluster Label', axis=1, inplace=True)
station_venues_sorted_df.insert(0, 'Cluster Label', kmeans.labels_)
station_venues_sorted_df.head()
cluster_map = folium.Map(width=1200, height=675, location=[40.7320, -73.9301], zoom_start=11)
legend = Image(filename='cluster_map_legend.JPG', width=650) # legend made in Excel and saved as JPG
# set color scheme for the clusters
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]
for cluster, station, coordinates in zip(station_venues_sorted_df['Cluster Label'],
                                         station_venues_sorted_df['Station Name'],
                                         station_venues_sorted_df['Coordinates']):
    label = folium.Popup(str(station) + ' - Cluster ' + str(cluster), parse_html=True)
    latlong = [float(x) for x in coordinates.split(',')]
    folium.CircleMarker(
        latlong,
        radius=5,
        popup=label,
        color=rainbow[cluster], # cluster labels run 0..5, so index the color list directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(cluster_map)
display(legend)
display(cluster_map)
Analyze each cluster by counting how often each venue category appears as the 1st, 2nd, and 3rd most common
# 1st Most Common Venue, 2nd Most Common Venue, 3rd Most Common Venue
required_column_indices = [3, 4, 5]
required_columns = [list(station_venues_sorted_df.columns.values)[idx] for idx in required_column_indices]
cluster_names = ["cluster_{}".format(x) for x in range(6)]
cluster_names_dict = {}
for idx in range(6):
    cluster_names_dict[cluster_names[idx]] = station_venues_sorted_df.loc[station_venues_sorted_df['Cluster Label'] == idx, station_venues_sorted_df.columns[1:12]]
    for column in required_columns:
        print("\t\tCLUSTER {}".format(idx))
        print(cluster_names_dict[cluster_names[idx]][column].value_counts(ascending=False))
        print("-" * 41)
    print("\n\n\n\n")
After relating venue categories to each subway station, some meaningful inferences can be made. In particular, the first most common venue category around each station gives city planners a usable signal for any project that depends on human traffic in New York City: by associating categories with the times of day and week when they draw crowds, the data can support congestion predictions. A natural next step is to retrieve venue sub-categories and join them to each station. This would spread the counts across many more variables, leaving fewer data points per variable, but in return provide more specific information. The biggest open challenge is weighting the ranks: how much more significant is the first most common category than the second, the second than the third, and so on.
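One simple, hypothetical starting point for those magnitudes is a fixed decay of weight over rank, to be tuned against observed ridership data (the halving factor below is illustrative only):
# Hypothetical rank weights: each rank counts half as much as the one before it
rank_weights = {rank: 0.5 ** (rank - 1) for rank in range(1, 6)}
print(rank_weights) # {1: 1.0, 2: 0.5, 3: 0.25, 4: 0.125, 5: 0.0625}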
Turning to the k-means clustering: using Euclidean distance as the metric and a k-value of six, three clusters of interest emerged. Looking at the first and second most common venue categories for all three, each consisted almost exclusively of Shop & Service and Professional & Other Places venues. Since these three clusters are also centered geographically on Lower Manhattan, they are good candidates for further analysis.