Thursday, February 12, 2015

Analyzing the NYC Subway Dataset

Section 0. PreparationΒΆ

import matplotlib
%matplotlib inline
import matplotlib.dates as mdates
#%matplotlib osx
import pandas
import numpy
from ggplot import *
import statsmodels.api as sm
import statsmodels
import scipy
import matplotlib.pyplot as plt
import math
import seaborn as sns
from datetime import datetime


df = pandas.read_csv('turnstile_data_master_with_weather.csv')
non_rain = df['ENTRIESn_hourly'][df['rain'] == 0]
rain = df['ENTRIESn_hourly'][df['rain'] == 1]

Before doing anything I want to have a look at data.

df.describe()
Unnamed: 0 Hour ENTRIESn_hourly EXITSn_hourly maxpressurei maxdewpti mindewpti minpressurei meandewpti meanpressurei fog rain meanwindspdi mintempi meantempi maxtempi precipi thunder
count 131951.000000 131951.000000 131951.000000 131951.000000 131951.000000 131951.000000 131951.000000 131951.000000 131951.000000 131951.000000 131951.000000 131951.000000 131951.000000 131951.000000 131951.000000 131951.000000 131951.000000 131951
mean 65975.000000 10.896158 1095.348478 886.890838 30.031894 57.241302 48.259013 29.892714 52.703526 29.965077 0.167100 0.334245 5.543065 56.169775 64.269729 71.769968 0.172276 0
std 38091.117022 6.892084 2337.015421 2008.604886 0.125689 8.770891 11.305312 0.146384 9.943590 0.130461 0.373066 0.471728 1.982441 6.338875 6.568289 7.627218 0.429005 0
min 0.000000 0.000000 0.000000 0.000000 29.740000 39.000000 22.000000 29.540000 31.000000 29.640000 0.000000 0.000000 1.000000 46.000000 55.000000 58.000000 0.000000 0
25% 32987.500000 5.000000 39.000000 32.000000 29.960000 50.000000 38.000000 29.840000 45.000000 29.910000 0.000000 0.000000 5.000000 52.000000 60.000000 65.000000 0.000000 0
50% 65975.000000 12.000000 279.000000 232.000000 30.030000 57.000000 51.000000 29.910000 54.000000 29.960000 0.000000 0.000000 5.000000 54.000000 63.000000 71.000000 0.000000 0
75% 98962.500000 17.000000 1109.000000 847.000000 30.100000 64.000000 55.000000 29.970000 60.000000 30.050000 0.000000 1.000000 6.000000 60.000000 68.000000 78.000000 0.100000 0
max 131950.000000 23.000000 51839.000000 45249.000000 30.310000 70.000000 66.000000 30.230000 68.000000 30.270000 1.000000 1.000000 12.000000 70.000000 78.000000 86.000000 2.180000 0

Section 1. Statistical TestΒΆ

The first question which I asked myself was: Is there any significant difference between number of people entering subway in rainy and non rainy days. As shown below, data for this particular question isn't normally distributed. So it's suitable to use non-parametric test. I'm going to use Mann–Whitney U test. Null hypothesis for our two-sided test will be P(x>y)=0.5, where x - subway crowd in rainy days, y - subway crowd in non rainy days. At first let's check our assumptions about non-Gaussian data using Shapiro-Wilk test:

scipy.stats.shapiro(rain)
(0.4715914726257324, 0.0)
scipy.stats.shapiro(non_rain)
(0.47661787271499634, 0.0)

Personaly, I'm very sceptic and always want to confirm results using another approach. For second normality test I'll use scipy.stats.mstats.normaltest based on D'Agostino and Pearson's test (more information you can find in docs.scipy.org. The method returns p-value for 2-sided chi-quared probability for the hypothesis test.

scipy.stats.mstats.normaltest(rain)
(49379.024966302793, 0.0)
scipy.stats.mstats.normaltest(non_rain)
(96614.17551121171, 0.0)

As we can see from Shapiro-Wilk and D'Agostino|Pearson test data is definitely non normal distributed due to extremely low p-value. Now let's calculate statistics and p-value for Mann Whitney U test (wikipedia.org) and it's python implementation scipy.stats:

t, p_value = scipy.stats.mannwhitneyu(rain, non_rain)
2 * p_value
0.049999825586979442

We multiplied p-value by two in Mann Whitney U test, because scipy returns p-value for one-sided test. P-value is less than 0.05 - our significant level, so we should reject the null hypothesis that the distributions being compared are identical. Let examine this data more and look at means and medians.

"rain mean = {0}, non-rain mean = {1}".format(numpy.mean(rain), numpy.mean(non_rain))
'rain mean = 1105.44637675, non-rain mean = 1090.27878015'
"rain median = {0}, non-rain median = {1}".format(numpy.median(rain), numpy.median(non_rain))
'rain median = 282.0, non-rain median = 278.0'

So it seems that statistical tests are right. In my opinion in shiny days people can walk on the street to work/university. But in rany days it's quite not comfortable, so maybe here is one of the differences. Also I should notice that I used significance level 5%.

Section 2. Linear RegressionΒΆ

In my investigation of given dataset I was trying to predict how many people will enter subway using different predictors. I've used gradient descent and ols methods for this problem. I want to show prediction using ols method. Our response variable will be 'ENTRIESn_hourly'. Our predictiors will be 'EXITSn_hourly', 'rain', 'fog'. Let's notice that rain and fog is our dummy variables. I decided to use fog and rain because I thought that when it is very foggy or rainy outside people might decide to use the subway more often. Also I decided to use hour variable because it increases R-squared value and maxtempi & meandewpti for their high variance. 'EXITSn_hourly' present in model due to my believings that number of entries in subway is extremely correlated with number of exits.

y = df['ENTRIESn_hourly']
x = df[['EXITSn_hourly' ,'rain', 'fog', 'meantempi', 'Hour','meandewpti']]
x = sm.add_constant(x)
model = sm.OLS(y,x)
results = model.fit()
results.summary()
OLS Regression Results
Dep. Variable: ENTRIESn_hourly R-squared: 0.557
Model: OLS Adj. R-squared: 0.557
Method: Least Squares F-statistic: 2.770e+04
Date: Thu, 29 Jan 2015 Prob (F-statistic): 0.00
Time: 19:18:45 Log-Likelihood: -1.1569e+06
No. Observations: 131951 AIC: 2.314e+06
Df Residuals: 131944 BIC: 2.314e+06
Df Model: 6
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
const 324.5246 46.355 7.001 0.000 233.670 415.379
EXITSn_hourly 0.8549 0.002 395.953 0.000 0.851 0.859
rain -8.7396 12.559 -0.696 0.486 -33.355 15.875
fog 44.6593 13.107 3.407 0.001 18.970 70.348
meantempi -2.6079 1.020 -2.556 0.011 -4.608 -0.608
Hour 19.7716 0.629 31.430 0.000 18.539 21.005
meandewpti -0.7543 0.728 -1.036 0.300 -2.182 0.673
Omnibus: 109690.964 Durbin-Watson: 1.504
Prob(Omnibus): 0.000 Jarque-Bera (JB): 25879157.660
Skew: 3.164 Prob(JB): 0.00
Kurtosis: 71.316 Cond. No. 2.39e+04

From our summary model we can see that rain variable is significant for our prediction due to 95% confidence interval contain 0 so we can surely say that rain variable is not so important in current variables configuration. In summary of our model we can observe mathematical expectation of coefficients of dummy or non-dummy variables, standard deviation, confidence interval and more different useful statistics.

Coefficient of determination of our model is equal to 0.558. I want to remember that coefficient of determination has the following meaning - this is the percentage of variance that is explained by our model. We can see that our R-squared is too low to produce accurate predictions.

Section 3. VisualizationΒΆ

Let's plot our ols prediction results. Let y-axis be 'ENTRIESn_hourly' and x-axis be 'EXITSn_hourly'. Plotting will be done using matplotlib.

fig, ax = plt.subplots()
statsmodels.graphics.regressionplots.plot_fit(results, 'EXITSn_hourly', ax=ax)
ax.set_ylabel("ENTRIESn_hourly")
ax.set_xlabel("EXITSn_hourly")
ax.set_title("Linear Regression")
plt.axis([0, 60000, 0, 60000])
fig1 = matplotlib.pyplot.gcf()
fig1.set_size_inches(10.5,10.5)
plt.show()

As we can see from plot, we haven't got good model to predict ENTRIESn_hourly on each subway station. Let's create visualization which contains two histograms: one of ENTRIESn_hourly for rainy days and one of ENTRIESn_hourly for non-rainy days. For this visualization i've used ggplot.

df_t = df.copy(True)
df_t['rain'][df_t['rain'] == 0] = 'no rain'
df_t['rain'][df_t['rain'] == 1] = 'rain'
df_t.describe()
ggplot(df_t, aes(x = 'ENTRIESn_hourly')) + geom_histogram(binwidth = 200) + facet_wrap('rain') + xlim(0,5000) +\
ggtitle("Hitograms of ENTRIESn_hourly by rain") + ylab('Frequency')

Let's add more plots and look at subway daily entries/exits with time.

df1 = df[['ENTRIESn_hourly', 'DATEn']].groupby('DATEn').sum()
df2 = df[['EXITSn_hourly', 'DATEn']].groupby('DATEn').sum()
df1.index.name = 'Date'
df1.reset_index(inplace=True)
df2.index.name = 'Date'
df2.reset_index(inplace=True)
df1.columns = ['Date', 'NumberOfPeople']
df2.columns = ['Date', 'NumberOfPeople']
df1['type'] = 'ENTRIESn_hourly'
df2['type'] = 'EXITSn_hourly'
total = pandas.concat([df1,df2])
total['Date'] = pandas.to_datetime(total['Date'])
start = min(total['Date'])
end = max(total['Date'])
gg = ggplot(total, aes(x='Date', y='NumberOfPeople', color = 'type')) + geom_point() + geom_line() +\
ggtitle("Daily entries and exits") + ylim(0, 6500000)
print gg

Surprisingly, but I my thoughts are very different from this plot. I was thinking that summary number of Entries & Exits in subway should be near same value, but from plot we can see big difference between this two variables. Difference is explained by the reason that some subway station in NY doesn't have turnstile on exits. Also it's very common that less people use subway in weekends. Notes: unfortunately ggplot can't natively scale date axis - https://github.com/yhat/ggplot/issues/312. There are some bugs with limits.

The last one - let's create histogram plot of ENTRIESn_hourly depending on weather i.e. fog and rain As in class we will create fog-rain variable at first.

def format_key(fog, rain):
    return '{}fog-{}rain'.format(
        '' if fog else 'no',
        '' if rain else 'no')

fograin = []
for index, row in df.iterrows():
    fograin.append(format_key(row['fog'], row['rain']))
df['fograin'] = pandas.Series(fograin)
ggplot(df, aes(x='ENTRIESn_hourly', fill='fograin', color='fograin')) + \
    geom_histogram(alpha=0.6, binwidth = 200) + xlim(0,5000) + \
    xlab("Number of people") + ylab("Frequency") + ggtitle("Histogram of Entries hourly by type of weather")

From plot we can observe that mode of number of peoples in subway stay near the same. Also we can notice that our data is highly right skewed distributed.

Section 3.5. Prediction of future daily entries using linear regression.ΒΆ

At first I should notice that I can't use EXITSn_hourly for future predictions, because i simply don't know this values. So i can use only weather values which i can obtain from meteorological station. In section 3 i notice that data is seems to very seasonal, so it might be a right way to add to linear regression some features to catch seasonal (daily) changings. So i'm going to add seasonal pattern using Fourier terms.

#Read again data
df = pandas.read_csv('turnstile_data_master_with_weather.csv')
#define season in seconds (1 week)
season = 7 * 24 * 60 * 60

df1 = df[['ENTRIESn_hourly', 'DATEn']].groupby('DATEn').sum()
df1.index.name = 'Date'
df1.reset_index(inplace=True)
df1.columns = ['Date', 'NumberOfPeople']
df2 = df[['meanwindspdi', 'mintempi', 'meantempi', 'maxtempi','maxpressurei', 'maxdewpti', 'mindewpti', 'minpressurei', 'meandewpti', 'meanpressurei', 'DATEn']].groupby('DATEn').mean()
df2.reset_index(inplace=True)
##concat df1 and df2
main = pandas.concat([df1,df2], 1)
main = main.drop('DATEn', 1)
main['Date'] = pandas.to_datetime(main['Date'])
main['UnixTime'] = main['Date'].astype(numpy.int64) // 10**9
#function to add fourier terms to data frame, n - number of terms
def add(df, n, col_time_name):
    for i in range(1,n+1):
        col_sin_name = "sin" + str(i)
        col_cos_name = "cos" + str(i)
        sin = lambda x: math.sin(2*math.pi*i*x/season)
        cos = lambda x: math.cos(2*math.pi*i*x/season)

        df[col_sin_name] = df[col_time_name].apply(sin)
        df[col_cos_name] = df[col_time_name].apply(cos)
    
    return df

main = add(main, 2, 'UnixTime')
cols = list(main.columns.values)
cols_x = [x for x in cols if x not in ['NumberOfPeople', 'Date', 'UnixTime']]
y = main['NumberOfPeople']
x = main[cols_x]
x = sm.add_constant(x)
model = sm.OLS(y,x)
results = model.fit()
results.summary()
OLS Regression Results
Dep. Variable: NumberOfPeople R-squared: 0.951
Model: OLS Adj. R-squared: 0.905
Method: Least Squares F-statistic: 20.64
Date: Thu, 29 Jan 2015 Prob (F-statistic): 2.77e-07
Time: 19:19:17 Log-Likelihood: -420.42
No. Observations: 30 AIC: 870.8
Df Residuals: 15 BIC: 891.9
Df Model: 14
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
const -1.54e+07 2.48e+07 -0.622 0.543 -6.82e+07 3.74e+07
meanwindspdi -8130.3111 5.44e+04 -0.149 0.883 -1.24e+05 1.08e+05
mintempi -2.794e+05 1.94e+05 -1.443 0.169 -6.92e+05 1.33e+05
meantempi 6.312e+05 3.57e+05 1.766 0.098 -1.31e+05 1.39e+06
maxtempi -3.549e+05 1.76e+05 -2.012 0.063 -7.31e+05 2.12e+04
maxpressurei 1.054e+07 4.11e+06 2.563 0.022 1.77e+06 1.93e+07
maxdewpti 1.342e+05 7.44e+04 1.804 0.091 -2.44e+04 2.93e+05
mindewpti 7.91e+04 7.49e+04 1.057 0.307 -8.05e+04 2.39e+05
minpressurei -1.229e+06 2.63e+06 -0.467 0.647 -6.84e+06 4.38e+06
meandewpti -2.253e+05 1.39e+05 -1.618 0.126 -5.22e+05 7.14e+04
meanpressurei -8.627e+06 4.91e+06 -1.756 0.099 -1.91e+07 1.84e+06
sin1 -9.192e+05 1.5e+05 -6.119 0.000 -1.24e+06 -5.99e+05
cos1 1.529e+06 1.62e+05 9.460 0.000 1.18e+06 1.87e+06
sin2 5.624e+05 1.44e+05 3.909 0.001 2.56e+05 8.69e+05
cos2 -1.866e+05 1.33e+05 -1.406 0.180 -4.7e+05 9.64e+04
Omnibus: 9.581 Durbin-Watson: 1.863
Prob(Omnibus): 0.008 Jarque-Bera (JB): 8.179
Skew: -1.038 Prob(JB): 0.0167
Kurtosis: 4.494 Cond. No. 5.04e+04

Let's create visualization.

pr = pandas.Series(results.predict())
#Construct data with fitted values and real values
final1 = pandas.DataFrame(dict(values = pr, date = main['Date']))
final2 = pandas.DataFrame(dict(values = main['NumberOfPeople'], date = main['Date']))
final1['type'] = 'fitted values'
final2['type'] = 'real values'
final = pandas.concat([final1, final2])

#crete visualization
gg = ggplot(final, aes(x='date', y='values', color = 'type')) + geom_point() + geom_line() +\
ggtitle("Fitting daily entries using Linear Regression with Fourier terms") + ylim(0, 6600000)
print gg

As we can see after adding seasonal pattern we highly increase R-squared up to 0.95 using only 4 Fourier terms.

Section 4. ConclusionΒΆ

In this first project we partially investigate NYC Subway Dataset. I was trying to answer the following question: "Do more people ride the NYC subway when it is raining or when it is not raining?". As part of this investigation I've obtain two samples (rainy and non rainy days). I needed to check zero hypothesis that means of two samples are equal. At first i need to choose statistical test to do this task correctly. I've checked both samples on normality test. Shapiro-Wilk test rejected zero hypothesis about data normality for both samples. So I've choosen Mann-Whitney U test or Wilcoxon rank test, because it is non-parametric test (no assumptions about distributions). For our two samples Mann-Whitney U test has rejected hypothesis. So looking at mean and median for rainy and not days (both greater for rainy days meandif = 15, mediandif = 4), using results of statistical test we can conclude that distribution of the number of entries statistically different between rainy and non rainy days with significance level 5%. Linear Regression using only features from dataset gave not bad and no good. Results can be as start point, but model is bad for accurate predictions. There is still need more investigation to make right decision, but anyway I believe in non-parametric approach more. I've build ols model with not so good r-square equals to 0.557. But it can be only initial solution before more deeper diving in this dataset. But there are some issues with Linear Regression if we plan to use it for prediction. We'll definitely exclude regressors as 'EXITSn_hourly', because we arn't able to get this values. Looking at we can see that daily entries have seasonal pattern, so I've included Fourier terms to catch this fact. I've obtained good R-squared 0.95 and this model can be used for predictions. Also i want to notice that seasonal linear regression also includes variables like daily minimal temperature, maximum temperature and etc. This values i can get from meteorological station.

Section 5. ReflectionΒΆ

I've found next potential shortcomings of dataset we use: 1. Daily entries aren't equal to daily exits. Is this a bug? Or maybe I just don't have clear understanding of EXITSn_hourly. I think that entries is how many people enter subway, exits - number of people exiting subway. I'm sure this variables should be daily equal. 2. Just one month of data. This implies low variablity of tempreture and other interesting features, so they might be not so important in May, but can be extremely useful in winter.

Some notes about application of Linear Regression to our dataset: 1. It's clear to see from figure that our data is seasonal. So we might include seasonal features to predict ENTRIESn_hourly. But classic linear regression doesn't take seasonality into account. 2. I've include seasonality pattern in Linear Regression using Fourier terms. I've obtained high R-squared equal to 0.96. Model was built withou EXITSn_hourly, so we can use it to predict future values of daily entries in NYC subway.

Last notes about statistical tests: 1. In my opinion for any test I should use at least two different approaches to calculate p-value for zero hypothesis. But actually i didn't find fully another approach of non parametric test for my task. Tests on data normality is more developed and i found a good amount of approaches for using in different situations.

P.S. All source code and ipython notebook can be found on github.

List of Resources.ΒΆ

  1. Scipy - Normal test. Url: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.mstats.normaltest.html
  2. Wikipedia - Mann-Whitney U test. Url: http://en.wikipedia.org/wiki/Mann–Whitney_U_test
  3. Scipy - Mann-Whitney U test. Url: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.mannwhitneyu.html
  4. StatsModels - Regression. Url: http://statsmodels.sourceforge.net/0.5.0/generated/statsmodels.regression.linear_model.OLS.html
  5. Date Axis Limits GGplot Issue. Url: https://github.com/yhat/ggplot/issues/312
  6. StatsModels - OLS. Url: http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/ols.html
  7. Python GGplot - GGplot documentation. Url: http://ggplot.yhathq.com/docs/index.html

No comments:

Post a Comment