Section 0. Preparation¶

import matplotlib
%matplotlib inline
import matplotlib.dates as mdates
#%matplotlib osx
import pandas
import numpy
from ggplot import *
import statsmodels.api as sm
import statsmodels
import scipy
import matplotlib.pyplot as plt
import math
import seaborn as sns
from datetime import datetime


df = pandas.read_csv('turnstile_data_master_with_weather.csv')
non_rain = df['ENTRIESn_hourly'][df['rain'] == 0]
rain = df['ENTRIESn_hourly'][df['rain'] == 1]

Before doing anything I want to have a look at data.

df.describe()

	Unnamed: 0	Hour	ENTRIESn_hourly	EXITSn_hourly	maxpressurei	maxdewpti	mindewpti	minpressurei	meandewpti	meanpressurei	fog	rain	meanwindspdi	mintempi	meantempi	maxtempi	precipi	thunder
count	131951.000000	131951.000000	131951.000000	131951.000000	131951.000000	131951.000000	131951.000000	131951.000000	131951.000000	131951.000000	131951.000000	131951.000000	131951.000000	131951.000000	131951.000000	131951.000000	131951.000000	131951
mean	65975.000000	10.896158	1095.348478	886.890838	30.031894	57.241302	48.259013	29.892714	52.703526	29.965077	0.167100	0.334245	5.543065	56.169775	64.269729	71.769968	0.172276	0
std	38091.117022	6.892084	2337.015421	2008.604886	0.125689	8.770891	11.305312	0.146384	9.943590	0.130461	0.373066	0.471728	1.982441	6.338875	6.568289	7.627218	0.429005	0
min	0.000000	0.000000	0.000000	0.000000	29.740000	39.000000	22.000000	29.540000	31.000000	29.640000	0.000000	0.000000	1.000000	46.000000	55.000000	58.000000	0.000000	0
25%	32987.500000	5.000000	39.000000	32.000000	29.960000	50.000000	38.000000	29.840000	45.000000	29.910000	0.000000	0.000000	5.000000	52.000000	60.000000	65.000000	0.000000	0
50%	65975.000000	12.000000	279.000000	232.000000	30.030000	57.000000	51.000000	29.910000	54.000000	29.960000	0.000000	0.000000	5.000000	54.000000	63.000000	71.000000	0.000000	0
75%	98962.500000	17.000000	1109.000000	847.000000	30.100000	64.000000	55.000000	29.970000	60.000000	30.050000	0.000000	1.000000	6.000000	60.000000	68.000000	78.000000	0.100000	0
max	131950.000000	23.000000	51839.000000	45249.000000	30.310000	70.000000	66.000000	30.230000	68.000000	30.270000	1.000000	1.000000	12.000000	70.000000	78.000000	86.000000	2.180000	0

Section 1. Statistical Test¶

The first question which I asked myself was: Is there any significant difference between number of people entering subway in rainy and non rainy days. As shown below, data for this particular question isn't normally distributed. So it's suitable to use non-parametric test. I'm going to use Mann–Whitney U test. Null hypothesis for our two-sided test will be P(x>y)=0.5, where x - subway crowd in rainy days, y - subway crowd in non rainy days. At first let's check our assumptions about non-Gaussian data using Shapiro-Wilk test:

scipy.stats.shapiro(rain)

(0.4715914726257324, 0.0)

scipy.stats.shapiro(non_rain)

(0.47661787271499634, 0.0)

Personaly, I'm very sceptic and always want to confirm results using another approach. For second normality test I'll use scipy.stats.mstats.normaltest based on D'Agostino and Pearson's test (more information you can find in docs.scipy.org. The method returns p-value for 2-sided chi-quared probability for the hypothesis test.

scipy.stats.mstats.normaltest(rain)

(49379.024966302793, 0.0)

scipy.stats.mstats.normaltest(non_rain)

(96614.17551121171, 0.0)

As we can see from Shapiro-Wilk and D'Agostino|Pearson test data is definitely non normal distributed due to extremely low p-value. Now let's calculate statistics and p-value for Mann Whitney U test (wikipedia.org) and it's python implementation scipy.stats:

t, p_value = scipy.stats.mannwhitneyu(rain, non_rain)
2 * p_value

0.049999825586979442

We multiplied p-value by two in Mann Whitney U test, because scipy returns p-value for one-sided test. P-value is less than 0.05 - our significant level, so we should reject the null hypothesis that the distributions being compared are identical. Let examine this data more and look at means and medians.

"rain mean = {0}, non-rain mean = {1}".format(numpy.mean(rain), numpy.mean(non_rain))

'rain mean = 1105.44637675, non-rain mean = 1090.27878015'

"rain median = {0}, non-rain median = {1}".format(numpy.median(rain), numpy.median(non_rain))

'rain median = 282.0, non-rain median = 278.0'

So it seems that statistical tests are right. In my opinion in shiny days people can walk on the street to work/university. But in rany days it's quite not comfortable, so maybe here is one of the differences. Also I should notice that I used significance level 5%.

Section 2. Linear Regression¶

In my investigation of given dataset I was trying to predict how many people will enter subway using different predictors. I've used gradient descent and ols methods for this problem. I want to show prediction using ols method. Our response variable will be 'ENTRIESn_hourly'. Our predictiors will be 'EXITSn_hourly', 'rain', 'fog'. Let's notice that rain and fog is our dummy variables. I decided to use fog and rain because I thought that when it is very foggy or rainy outside people might decide to use the subway more often. Also I decided to use hour variable because it increases R-squared value and maxtempi & meandewpti for their high variance. 'EXITSn_hourly' present in model due to my believings that number of entries in subway is extremely correlated with number of exits.

y = df['ENTRIESn_hourly']
x = df[['EXITSn_hourly' ,'rain', 'fog', 'meantempi', 'Hour','meandewpti']]
x = sm.add_constant(x)
model = sm.OLS(y,x)
results = model.fit()
results.summary()

OLS Regression Results
Dep. Variable:	ENTRIESn_hourly	R-squared:	0.557
Model:	OLS	Adj. R-squared:	0.557
Method:	Least Squares	F-statistic:	2.770e+04
Date:	Thu, 29 Jan 2015	Prob (F-statistic):	0.00
Time:	19:18:45	Log-Likelihood:	-1.1569e+06
No. Observations:	131951	AIC:	2.314e+06
Df Residuals:	131944	BIC:	2.314e+06
Df Model:	6
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[95.0% Conf. Int.]
const	324.5246	46.355	7.001	0.000	233.670 415.379
EXITSn_hourly	0.8549	0.002	395.953	0.000	0.851 0.859
rain	-8.7396	12.559	-0.696	0.486	-33.355 15.875
fog	44.6593	13.107	3.407	0.001	18.970 70.348
meantempi	-2.6079	1.020	-2.556	0.011	-4.608 -0.608
Hour	19.7716	0.629	31.430	0.000	18.539 21.005
meandewpti	-0.7543	0.728	-1.036	0.300	-2.182 0.673

Omnibus:	109690.964	Durbin-Watson:	1.504
Prob(Omnibus):	0.000	Jarque-Bera (JB):	25879157.660
Skew:	3.164	Prob(JB):	0.00
Kurtosis:	71.316	Cond. No.	2.39e+04

From our summary model we can see that rain variable is significant for our prediction due to 95% confidence interval contain 0 so we can surely say that rain variable is not so important in current variables configuration. In summary of our model we can observe mathematical expectation of coefficients of dummy or non-dummy variables, standard deviation, confidence interval and more different useful statistics.

Coefficient of determination of our model is equal to 0.558. I want to remember that coefficient of determination has the following meaning - this is the percentage of variance that is explained by our model. We can see that our R-squared is too low to produce accurate predictions.

Section 3. Visualization¶

Let's plot our ols prediction results. Let y-axis be 'ENTRIESn_hourly' and x-axis be 'EXITSn_hourly'. Plotting will be done using matplotlib.

fig, ax = plt.subplots()
statsmodels.graphics.regressionplots.plot_fit(results, 'EXITSn_hourly', ax=ax)
ax.set_ylabel("ENTRIESn_hourly")
ax.set_xlabel("EXITSn_hourly")
ax.set_title("Linear Regression")
plt.axis([0, 60000, 0, 60000])
fig1 = matplotlib.pyplot.gcf()
fig1.set_size_inches(10.5,10.5)
plt.show()

As we can see from plot, we haven't got good model to predict ENTRIESn_hourly on each subway station. Let's create visualization which contains two histograms: one of ENTRIESn_hourly for rainy days and one of ENTRIESn_hourly for non-rainy days. For this visualization i've used ggplot.

df_t = df.copy(True)
df_t['rain'][df_t['rain'] == 0] = 'no rain'
df_t['rain'][df_t['rain'] == 1] = 'rain'
df_t.describe()
ggplot(df_t, aes(x = 'ENTRIESn_hourly')) + geom_histogram(binwidth = 200) + facet_wrap('rain') + xlim(0,5000) +\
ggtitle("Hitograms of ENTRIESn_hourly by rain") + ylab('Frequency')

Let's add more plots and look at subway daily entries/exits with time.

df1 = df[['ENTRIESn_hourly', 'DATEn']].groupby('DATEn').sum()
df2 = df[['EXITSn_hourly', 'DATEn']].groupby('DATEn').sum()
df1.index.name = 'Date'
df1.reset_index(inplace=True)
df2.index.name = 'Date'
df2.reset_index(inplace=True)
df1.columns = ['Date', 'NumberOfPeople']
df2.columns = ['Date', 'NumberOfPeople']
df1['type'] = 'ENTRIESn_hourly'
df2['type'] = 'EXITSn_hourly'
total = pandas.concat([df1,df2])
total['Date'] = pandas.to_datetime(total['Date'])
start = min(total['Date'])
end = max(total['Date'])
gg = ggplot(total, aes(x='Date', y='NumberOfPeople', color = 'type')) + geom_point() + geom_line() +\
ggtitle("Daily entries and exits") + ylim(0, 6500000)
print gg

Surprisingly, but I my thoughts are very different from this plot. I was thinking that summary number of Entries & Exits in subway should be near same value, but from plot we can see big difference between this two variables. Difference is explained by the reason that some subway station in NY doesn't have turnstile on exits. Also it's very common that less people use subway in weekends. Notes: unfortunately ggplot can't natively scale date axis - https://github.com/yhat/ggplot/issues/312. There are some bugs with limits.

The last one - let's create histogram plot of ENTRIESn_hourly depending on weather i.e. fog and rain As in class we will create fog-rain variable at first.

def format_key(fog, rain):
    return '{}fog-{}rain'.format(
        '' if fog else 'no',
        '' if rain else 'no')

fograin = []
for index, row in df.iterrows():
    fograin.append(format_key(row['fog'], row['rain']))
df['fograin'] = pandas.Series(fograin)

ggplot(df, aes(x='ENTRIESn_hourly', fill='fograin', color='fograin')) + \
    geom_histogram(alpha=0.6, binwidth = 200) + xlim(0,5000) + \
    xlab("Number of people") + ylab("Frequency") + ggtitle("Histogram of Entries hourly by type of weather")

From plot we can observe that mode of number of peoples in subway stay near the same. Also we can notice that our data is highly right skewed distributed.

Section 3.5. Prediction of future daily entries using linear regression.¶

At first I should notice that I can't use EXITSn_hourly for future predictions, because i simply don't know this values. So i can use only weather values which i can obtain from meteorological station. In section 3 i notice that data is seems to very seasonal, so it might be a right way to add to linear regression some features to catch seasonal (daily) changings. So i'm going to add seasonal pattern using Fourier terms.

#Read again data
df = pandas.read_csv('turnstile_data_master_with_weather.csv')
#define season in seconds (1 week)
season = 7 * 24 * 60 * 60

df1 = df[['ENTRIESn_hourly', 'DATEn']].groupby('DATEn').sum()
df1.index.name = 'Date'
df1.reset_index(inplace=True)
df1.columns = ['Date', 'NumberOfPeople']
df2 = df[['meanwindspdi', 'mintempi', 'meantempi', 'maxtempi','maxpressurei', 'maxdewpti', 'mindewpti', 'minpressurei', 'meandewpti', 'meanpressurei', 'DATEn']].groupby('DATEn').mean()
df2.reset_index(inplace=True)
##concat df1 and df2
main = pandas.concat([df1,df2], 1)
main = main.drop('DATEn', 1)
main['Date'] = pandas.to_datetime(main['Date'])
main['UnixTime'] = main['Date'].astype(numpy.int64) // 10**9
#function to add fourier terms to data frame, n - number of terms
def add(df, n, col_time_name):
    for i in range(1,n+1):
        col_sin_name = "sin" + str(i)
        col_cos_name = "cos" + str(i)
        sin = lambda x: math.sin(2*math.pi*i*x/season)
        cos = lambda x: math.cos(2*math.pi*i*x/season)

        df[col_sin_name] = df[col_time_name].apply(sin)
        df[col_cos_name] = df[col_time_name].apply(cos)
    
    return df

main = add(main, 2, 'UnixTime')
cols = list(main.columns.values)
cols_x = [x for x in cols if x not in ['NumberOfPeople', 'Date', 'UnixTime']]
y = main['NumberOfPeople']
x = main[cols_x]
x = sm.add_constant(x)
model = sm.OLS(y,x)
results = model.fit()
results.summary()

OLS Regression Results
Dep. Variable:	NumberOfPeople	R-squared:	0.951
Model:	OLS	Adj. R-squared:	0.905
Method:	Least Squares	F-statistic:	20.64
Date:	Thu, 29 Jan 2015	Prob (F-statistic):	2.77e-07
Time:	19:19:17	Log-Likelihood:	-420.42
No. Observations:	30	AIC:	870.8
Df Residuals:	15	BIC:	891.9
Df Model:	14
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[95.0% Conf. Int.]
const	-1.54e+07	2.48e+07	-0.622	0.543	-6.82e+07 3.74e+07
meanwindspdi	-8130.3111	5.44e+04	-0.149	0.883	-1.24e+05 1.08e+05
mintempi	-2.794e+05	1.94e+05	-1.443	0.169	-6.92e+05 1.33e+05
meantempi	6.312e+05	3.57e+05	1.766	0.098	-1.31e+05 1.39e+06
maxtempi	-3.549e+05	1.76e+05	-2.012	0.063	-7.31e+05 2.12e+04
maxpressurei	1.054e+07	4.11e+06	2.563	0.022	1.77e+06 1.93e+07
maxdewpti	1.342e+05	7.44e+04	1.804	0.091	-2.44e+04 2.93e+05
mindewpti	7.91e+04	7.49e+04	1.057	0.307	-8.05e+04 2.39e+05
minpressurei	-1.229e+06	2.63e+06	-0.467	0.647	-6.84e+06 4.38e+06
meandewpti	-2.253e+05	1.39e+05	-1.618	0.126	-5.22e+05 7.14e+04
meanpressurei	-8.627e+06	4.91e+06	-1.756	0.099	-1.91e+07 1.84e+06
sin1	-9.192e+05	1.5e+05	-6.119	0.000	-1.24e+06 -5.99e+05
cos1	1.529e+06	1.62e+05	9.460	0.000	1.18e+06 1.87e+06
sin2	5.624e+05	1.44e+05	3.909	0.001	2.56e+05 8.69e+05
cos2	-1.866e+05	1.33e+05	-1.406	0.180	-4.7e+05 9.64e+04

Omnibus:	9.581	Durbin-Watson:	1.863
Prob(Omnibus):	0.008	Jarque-Bera (JB):	8.179
Skew:	-1.038	Prob(JB):	0.0167
Kurtosis:	4.494	Cond. No.	5.04e+04

Let's create visualization.

pr = pandas.Series(results.predict())
#Construct data with fitted values and real values
final1 = pandas.DataFrame(dict(values = pr, date = main['Date']))
final2 = pandas.DataFrame(dict(values = main['NumberOfPeople'], date = main['Date']))
final1['type'] = 'fitted values'
final2['type'] = 'real values'
final = pandas.concat([final1, final2])

#crete visualization
gg = ggplot(final, aes(x='date', y='values', color = 'type')) + geom_point() + geom_line() +\
ggtitle("Fitting daily entries using Linear Regression with Fourier terms") + ylim(0, 6600000)
print gg

As we can see after adding seasonal pattern we highly increase R-squared up to 0.95 using only 4 Fourier terms.

Section 4. Conclusion¶

In this first project we partially investigate NYC Subway Dataset. I was trying to answer the following question: "Do more people ride the NYC subway when it is raining or when it is not raining?". As part of this investigation I've obtain two samples (rainy and non rainy days). I needed to check zero hypothesis that means of two samples are equal. At first i need to choose statistical test to do this task correctly. I've checked both samples on normality test. Shapiro-Wilk test rejected zero hypothesis about data normality for both samples. So I've choosen Mann-Whitney U test or Wilcoxon rank test, because it is non-parametric test (no assumptions about distributions). For our two samples Mann-Whitney U test has rejected hypothesis. So looking at mean and median for rainy and not days (both greater for rainy days meandif = 15, mediandif = 4), using results of statistical test we can conclude that distribution of the number of entries statistically different between rainy and non rainy days with significance level 5%. Linear Regression using only features from dataset gave not bad and no good. Results can be as start point, but model is bad for accurate predictions. There is still need more investigation to make right decision, but anyway I believe in non-parametric approach more. I've build ols model with not so good r-square equals to 0.557. But it can be only initial solution before more deeper diving in this dataset. But there are some issues with Linear Regression if we plan to use it for prediction. We'll definitely exclude regressors as 'EXITSn_hourly', because we arn't able to get this values. Looking at we can see that daily entries have seasonal pattern, so I've included Fourier terms to catch this fact. I've obtained good R-squared 0.95 and this model can be used for predictions. Also i want to notice that seasonal linear regression also includes variables like daily minimal temperature, maximum temperature and etc. This values i can get from meteorological station.

Section 5. Reflection¶

I've found next potential shortcomings of dataset we use: 1. Daily entries aren't equal to daily exits. Is this a bug? Or maybe I just don't have clear understanding of EXITSn_hourly. I think that entries is how many people enter subway, exits - number of people exiting subway. I'm sure this variables should be daily equal. 2. Just one month of data. This implies low variablity of tempreture and other interesting features, so they might be not so important in May, but can be extremely useful in winter.

Some notes about application of Linear Regression to our dataset: 1. It's clear to see from figure that our data is seasonal. So we might include seasonal features to predict ENTRIESn_hourly. But classic linear regression doesn't take seasonality into account. 2. I've include seasonality pattern in Linear Regression using Fourier terms. I've obtained high R-squared equal to 0.96. Model was built withou EXITSn_hourly, so we can use it to predict future values of daily entries in NYC subway.

Last notes about statistical tests: 1. In my opinion for any test I should use at least two different approaches to calculate p-value for zero hypothesis. But actually i didn't find fully another approach of non parametric test for my task. Tests on data normality is more developed and i found a good amount of approaches for using in different situations.

P.S. All source code and ipython notebook can be found on github.

List of Resources.¶

Scipy - Normal test. Url: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.mstats.normaltest.html
Wikipedia - Mann-Whitney U test. Url: http://en.wikipedia.org/wiki/Mann–Whitney_U_test
Scipy - Mann-Whitney U test. Url: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.mannwhitneyu.html
StatsModels - Regression. Url: http://statsmodels.sourceforge.net/0.5.0/generated/statsmodels.regression.linear_model.OLS.html
Date Axis Limits GGplot Issue. Url: https://github.com/yhat/ggplot/issues/312
StatsModels - OLS. Url: http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/ols.html
Python GGplot - GGplot documentation. Url: http://ggplot.yhathq.com/docs/index.html

Oleksii Renov

Thursday, February 12, 2015

Analyzing the NYC Subway Dataset