In this post, I present an exploratory analysis of the Bay Area bike sharing data, which I downloaded from http://www.bayareabikeshare.com/open-data. The data is presented in three different files, which give information about: all trips that were made, the weather for each date and city, and the geolocations of the stations.
I load the first chunks of data from the different files and rename the columns.
import zipfile
import pandas as pd

# one archive holds all three CSV files
z = zipfile.ZipFile('babs_open_data_year_1.zip')

# 'Date', 'Max_Temperature_F', 'Events', 'zip'
weather = pd.read_csv(z.open('201402_babs_open_data/201402_weather_data.csv'), index_col=False, usecols=[0,1,21,23])
# 'station_id', 'lat', 'long', 'landmark'
stations = pd.read_csv(z.open('201402_babs_open_data/201402_station_data.csv'), parse_dates=True, usecols=[0,2,3,5])
# 'Start Date', 'Start Terminal', 'Subscription Type'
trips = pd.read_csv(z.open('201402_babs_open_data/201402_trip_data.csv'), parse_dates=True, usecols=[1,2,4,9])
trips = trips.rename(columns={"Start Terminal": "station_id", "Start Date": "date_time", "Subscription Type": "subsc_type"})
trips.date_time = trips.date_time.apply(pd.to_datetime)
trips.loc[:,"Date"] = trips.loc[:,'date_time'].dt.date.apply(pd.to_datetime)
weather.Date = weather.Date.apply(pd.to_datetime)
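As an aside, the element-wise `.apply(pd.to_datetime)` calls above work, but `pd.to_datetime` can also parse a whole column in one vectorized call, which is typically much faster. A minimal sketch on made-up timestamps:

```python
import pandas as pd

# Illustrative: one vectorized pd.to_datetime call vs. element-wise .apply
s = pd.Series(["2014-02-01 08:15:00", "2014-02-01 17:40:00"])
fast = pd.to_datetime(s)        # parses the whole column at once
slow = s.apply(pd.to_datetime)  # converts one value at a time
print((fast == slow).all())     # both approaches give the same timestamps
```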
Before merging, we notice that the weather data is given per city, with cities coded by zip code, while the 'zip' column of 'stations' holds the zip code of each station. To match station ids with the weather data, one has to establish a correspondence between station zip codes and city zip codes. One option is to measure the distance between zip codes directly. That operation needs an internet connection and has a complexity of O(M×N), where M is the number of unique city zip codes and N is the number of unique station zip codes.
Another option is to translate the city zip codes to coordinates once and then measure the distance between coordinates. This approach is faster, as the function measuring the distance between two coordinates does not need an internet connection.
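To illustrate that distance between two coordinates is pure arithmetic with no network access, here is a minimal haversine sketch (an alternative to geopy's distance functions); the San Francisco and San Jose coordinates are just illustrative round figures:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points in kilometers.
    # Pure arithmetic: no internet connection needed, unlike geocoding.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

# straight-line San Francisco -> San Jose, roughly 67-68 km
print(round(haversine_km(37.7749, -122.4194, 37.3382, -121.8863), 1))
```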
from geopy.geocoders import Nominatim
geolocator = Nominatim()
# Needs an internet connection to work!
zcit = pd.Series(weather.zip.unique())

def zip2coor(x):
    # Translate a zip code into coordinates.
    # Instead of applying the function to every row, it is faster
    # to apply it to the unique zip codes and then merge.
    # The 'CA:' prefix restricts the lookup to California;
    # without it the geocoder fails on some zip codes.
    loca = geolocator.geocode('CA:' + str(x), timeout=20)
    return pd.Series({'lat_cit': loca.latitude, 'long_cit': loca.longitude, 'zip': int(x)})

zipcoor = zcit.apply(zip2coor)
# make zip int instead of float
zipcoor.zip = zipcoor.zip.apply(int)
weather = weather.merge(zipcoor, on='zip')  # add city coordinates to weather
Now the 'weather' data frame has two more columns, 'lat_cit' and 'long_cit', which contain the coordinates of the cities. I create the same columns in the 'stations' data frame by finding, for each station, the city whose coordinates are closest.
# the distance function does not use the Internet
from geopy.distance import vincenty

def get_cit_coor(x):
    # TODO: pass zipcoor as an argument; for now it is a global from the previous cell
    c_stat = (x[0], x[1])
    mdist = 1e3
    for i in xrange(len(zipcoor)):  # len() counts rows; DataFrame.size is rows*columns
        c_city = (zipcoor.lat_cit[i], zipcoor.long_cit[i])
        dist = vincenty(c_city, c_stat).kilometers
        if dist < mdist:
            mdist = dist
            coor = c_city
    return pd.Series({'lat': x[0], 'long': x[1], 'lat_cit': coor[0], 'long_cit': coor[1]})

stations = stations.merge(stations.loc[:, ['lat', 'long']].apply(get_cit_coor, axis=1), on=['lat', 'long'])
del zipcoor
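The per-station loop over cities can also be vectorized with numpy broadcasting, which avoids the Python-level loop entirely. A sketch on made-up coordinates, using a flat squared-difference metric (sufficient for picking the *nearest* city at Bay Area scales, though not for true distances):

```python
import numpy as np

# Illustrative coordinates (lat, lon); not the real station/city data
stations_xy = np.array([[37.33, -121.89], [37.79, -122.40], [37.49, -122.20]])
cities_xy = np.array([[37.34, -121.89],   # San Jose
                      [37.77, -122.42]])  # San Francisco

# (n_stations, 1, 2) - (1, n_cities, 2) -> (n_stations, n_cities, 2)
diff = stations_xy[:, None, :] - cities_xy[None, :, :]
# index of the nearest city for every station
nearest = (diff ** 2).sum(axis=2).argmin(axis=1)
print(list(nearest))  # -> [0, 1, 0]
```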
Both 'stations' and 'weather' now have 'lat_cit' and 'long_cit' columns. 'trips' can be merged with 'stations' on 'station_id', and the result with 'weather' on the date and city coordinates.
result = trips.merge(stations, on = 'station_id')
result = result.merge(weather, on = ['Date','lat_cit','long_cit'])
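One thing to watch with these merges: pandas defaults to an inner join, so rows without a match are silently dropped. The `indicator` option makes such losses visible; a sketch on toy data (the column values are made up):

```python
import pandas as pd

# Illustrative frames: station 7 has no entry in the stations table
trips_demo = pd.DataFrame({"station_id": [1, 2, 7], "Duration": [400, 900, 350]})
stations_demo = pd.DataFrame({"station_id": [1, 2],
                              "landmark": ["San Jose", "San Francisco"]})

# how='left' keeps unmatched rows; indicator=True labels where each row came from
checked = trips_demo.merge(stations_demo, on="station_id", how="left", indicator=True)
print(checked["_merge"].value_counts().to_dict())
# station 7 shows up as 'left_only': it would vanish in a plain inner merge
```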
Some minor changes to the data.
# use .loc to avoid chained-assignment pitfalls
result.loc[result.Events == 'rain', 'Events'] = 'Rain'
result.loc[pd.isnull(result.Events), 'Events'] = 'Sun'

def Far2Cels(x):
    # Fahrenheit to Celsius
    return (x - 32.) / 1.8

result['day_of_week'] = result['Date'].dt.weekday_name
result.Max_Temperature_F = result.Max_Temperature_F.apply(Far2Cels)
del trips
del weather
del stations
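The conversion formula can be sanity-checked at the two fixed points of the temperature scale (restated here as a standalone function so the snippet is self-contained); note that after this step the 'Max_Temperature_F' column actually holds Celsius values despite its name:

```python
def far2cels(x):
    # Fahrenheit to Celsius, same formula as Far2Cels above
    return (x - 32.0) / 1.8

# sanity checks at the fixed points of the scale
print(far2cels(32.0))   # freezing point of water -> 0.0
print(far2cels(212.0))  # boiling point of water -> 100.0
```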
The duration of the trips is distributed in the following way.
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.style.use('fivethirtyeight')
result.Duration[result.Duration<2000].hist()
plt.xlabel("seconds")
plt.ylabel("# of trips in bin")
#TODO: make a pie from it
print "normal trips", result.Duration[(result.Duration > 300) & (result.Duration < 2000)].count()
print "short trips", result.Duration[result.Duration<300].count()
print "long trips", result.Duration[result.Duration>100000].count()
print "all trips", result.Duration.count()
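The same bucketing can be done in one pass with `pd.cut`, which also makes explicit that the prints above leave the 2000–100000 s band uncounted. A sketch on made-up durations (seconds), using the thresholds from the counts above:

```python
import pandas as pd

# Illustrative durations in seconds; thresholds follow the counts printed above
durations = pd.Series([120, 450, 2500, 90, 600, 150000])
labels = pd.cut(durations,
                bins=[0, 300, 2000, 100000, float("inf")],
                labels=["short", "normal", "uncounted", "long"])
print(labels.value_counts().to_dict())
```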
For the analysis we take the trips that last more than five minutes. The other trips can be considered test rides and are not relevant for further analysis.
result = result[result.Duration>300]
There are four different weather events possible: Rain, Fog, Fog-Rain and Sun. Since Fog-Rain is a really rare event in California, I drop it rather than counting it as Rain.
EvDa = result.groupby(['Events','Date']).size()
EvDa.groupby(level=0).size().plot.pie(autopct='%.1f', figsize=(6, 6))
plt.ylabel("# of Days")
# drop Fog-Rain
result=result.drop(result.loc[result.Events=='Fog-Rain'].index)
del EvDa
#TODO: Split on subscribers and Customers
EvDa = result.groupby(['Events','Date'],sort=False).size()
Days = EvDa.groupby(level=0,sort=False).size()
Ev = result.groupby('Events',sort=False).size()
Ev.div(Days).plot.bar()
plt.ylabel("# of rides per day")
del EvDa
del Ev
People in California prefer to ride in sunny weather and do not like to ride in the rain.
result.groupby(['subsc_type','landmark'],sort=False).date_time.count().unstack('subsc_type').plot(kind='bar',stacked=True,figsize=(8,6))
plt.ylabel("Total # of trips")
Also, the data consists mostly of trips from San Francisco; the other cities contribute only a small share. TODO: consider restricting the analysis to SF only.
We plot the maximum temperature per day together with the # of trips per date. A certain correlation between temperature and the total # of rides is visible.
result.groupby('Date').size().plot(figsize=(8, 8), label='# of trips')
plt.legend()
plt.ylabel("Total # of trips")
result.groupby('Date').Max_Temperature_F.mean().plot(secondary_y=True,label='Temperature')
plt.legend()
plt.ylabel("degrees Celsius")
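The correlation seen in the plot can be quantified with Pearson's r via `Series.corr`. A sketch on made-up daily aggregates (the numbers below are illustrative, not the real data):

```python
import pandas as pd

# Illustrative daily aggregates: warmer days with more trips
daily = pd.DataFrame({
    "max_temp_c": [12, 14, 15, 18, 21, 22, 25],
    "n_trips":    [310, 340, 360, 420, 480, 470, 530],
})

# Pearson correlation between daily max temperature and trip count
r = daily["max_temp_c"].corr(daily["n_trips"])
print(round(r, 2))  # close to 1: strong positive correlation on this toy data
```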
Depending on the day of the week, Subscribers and Customers use the bicycles differently: Subscribers prefer to ride on weekdays, Customers on the weekend.
result.groupby(['subsc_type','day_of_week'],sort=False).date_time.count().unstack('subsc_type').plot(kind='bar',stacked=True,figsize=(8,6))
plt.legend(loc='best')
The same split can be seen in the distribution of rides over the course of the day.
result.loc[:,'hours'] = result.date_time.dt.hour
result.groupby(['subsc_type','hours']).date_time.count().unstack('subsc_type').plot()
plt.ylabel("Total # of trips")
plt.legend(loc=2)
One can deduce that Customers use the bicycles for rides during the day, while Subscribers ride during commute hours, on the way to work and back home, as well as around lunch time.