Saturday, February 28, 2015

Comparing Search Algorithms for Pacman Game

Here is a short comparison of four search algorithms for the Pacman game:

  1. Depth First Search
  2. Breadth First Search
  3. Uniform Cost Search
  4. A-star Search with a consistent heuristic
1a. Depth First Search and Big Maze.
Statistics:
  • Path total cost = 210
  • Time = 0.0 seconds
  • Nodes Expanded = 390
1b. Depth First Search and Medium Maze.
Statistics:
  • Path total cost = 130
  • Time = 0.0 seconds
  • Nodes Expanded = 146

2a. Breadth First Search and Big Maze.
Statistics:
  • Path total cost = 210
  • Time = 0.0 seconds
  • Nodes Expanded = 620

2b. Breadth First Search and Medium Maze.
Statistics:
  • Path total cost = 68
  • Time = 0.0 seconds
  • Nodes Expanded = 269

3a. Uniform Cost Search and Big Maze.
Statistics:
  • Path total cost = 210
  • Time = 0.0 seconds
  • Nodes Expanded = 620

3b. Uniform Cost Search and Medium Maze.
Statistics:
  • Path total cost = 68
  • Time = 0.0 seconds
  • Nodes Expanded = 269
4a. A-star and Big Maze. (Manhattan Distance)
Statistics:
  • Path total cost = 210
  • Time = 0.0 seconds
  • Nodes Expanded = 549
4b. A-star and Medium Maze. (Manhattan Distance)
Statistics:
  • Path total cost = 68
  • Time = 0.0 seconds
  • Nodes Expanded = 221
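
For reference, all four searches above can be written as one generic graph search that differs only in how the frontier is ordered: a stack gives DFS, a FIFO queue gives BFS, a priority queue ordered by path cost g gives UCS, and ordering by g plus a consistent heuristic h gives A-star. Below is a minimal sketch of A-star with the Manhattan-distance heuristic; it is my own illustration with a hypothetical successors(state) interface, not the actual Pacman project code.

import heapq

def manhattan(p, q):
    # Manhattan distance between two grid positions (x, y).
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def a_star(start, goal, successors):
    # Generic A* graph search. successors(state) is assumed to yield
    # (next_state, step_cost) pairs. With a zero heuristic this behaves
    # like Uniform Cost Search.
    frontier = [(manhattan(start, goal), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        f, g, state, path = heapq.heappop(frontier)
        if state == goal:
            return path, g                     # optimal path and its total cost
        for nxt, cost in successors(state):
            g2 = g + cost
            if g2 < best_g.get(nxt, float('inf')):
                best_g[nxt] = g2
                heapq.heappush(frontier,
                               (g2 + manhattan(nxt, goal), g2, nxt, path + [nxt]))
    return None, float('inf')

The numbers above fit this picture: with the Manhattan heuristic, A-star finds the same optimal cost as BFS and UCS (210 on the big maze) while expanding fewer nodes (549 vs. 620).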

Bonus:

Eating all the dots using A-star: 4110 nodes expanded with total cost 60, compared with roughly 16,000 nodes for UCS.



All sources can be found on GitHub.

Tuesday, February 17, 2015

Sentence Clusterization

Briefly: I'm going to show you clustering of sentences. The main point is that you can describe the clusters using this approach. This article is a starting point for more complicated sentence clustering or market segmentation (think about this). It's also a nice illustration of how hard clustering can easily be transformed into soft clustering.

Load data files

Data is from ics.uci.edu/ml

Opinion Reviews.

data.battery <- read.table("data/battery-life_amazon_kindle.txt.data", sep = "\n", stringsAsFactors = FALSE, strip.white = TRUE)
data.windows7 <- read.table("data/features_windows7.txt.data", sep = "\n", stringsAsFactors = FALSE, strip.white = TRUE)
## Warning in scan(file, what, nmax, sep, dec, quote, skip, nlines,
## na.strings, : EOF within quoted string
data.keyboard <- read.table("data/keyboard_netbook_1005ha.txt.data", sep="\n", stringsAsFactors = FALSE, strip.white = TRUE)
data.honda <- read.table("data/performance_honda_accord_2008.txt.data", sep = "\n", stringsAsFactors = FALSE, strip.white = TRUE)
data.ipod <- read.table("data/video_ipod_nano_8gb.txt.data", sep = "\n", stringsAsFactors = FALSE, strip.white = TRUE)
I will try to do a cluster analysis of these sentences without any knowledge of the topics. As a first step, let's merge all these data sets and trim the sentences.
library(gdata)
sentences <- c()
for (i in 1:dim(data.battery)[1]) {sentences <- c(sentences, trim(data.battery[i,1]))}
for (i in 1:dim(data.windows7)[1]){sentences <- c(sentences, trim(data.windows7[i,1]))}
for (i in 1:dim(data.keyboard)[1]){sentences <- c(sentences, trim(data.keyboard[i,1]))}
for (i in 1:dim(data.honda)[1]) {sentences <- c(sentences, trim(data.honda[i,1]))}
for (i in 1:dim(data.ipod)[1]) {sentences <- c(sentences, trim(data.ipod[i,1]))}
sentences[1:10]
##  [1] "After I plugged it in to my USB hub on my computer to charge the battery the charging cord design is very clever !"                                                                                                                                                                                                        
##  [2] "After you have paged tru a 500, page book one, page, at, a, time to get from Chapter 2 to Chapter 15, see how excited you are about a low battery and all the time it took to get there !"                                                                                                                                 
##  [3] "NO USER REPLACEABLE BATTERY, , Unless you buy the extended warranty for $65 ."                                                                                                                                                                                                                                             
##  [4] "After 1 year you pay $80 plus shipping to send the device to Amazon and have the Kindle REPLACED, not the battery changed out   ."                                                                                                                                                                                         
##  [5] "The fact that Kindle 2 has no SD card capability and the battery is not user, serviceable is not an issue with me ."                                                                                                                                                                                                       
##  [6] "Things like the buttons that made it easy to accidentally turn pages  the separate cursor on the side that could only select lines and was sometimes hard to see  the occasionally awkward menus  the case which practically forced you to remove it to use it and sometimes pulled the battery door off ."                
##  [7] "The issue with the battery door opening is thus solved, but Amazon went further, eliminating the door altogether and wrapping the back with sleek stainless steel ."                                                                                                                                                       
##  [8] "Frankly, I never used either the card slot or changed the battery on my Kindle 1 but I liked that they were there and I miss them on the Kindle 2, even though, I have to admit, I dont actually need them .\n Its also easy to charge the Kindle in the car if you have a battery charger with a USB port   ."            
##  [9] "You cant carry an extra battery ,  though with the extended battery life and extra charging options its almost a non, issue ,  and you cant replace the battery because of the iPod, like fixed backing .\n For one thing, theres no charge except battery power no pun intended !"                                        
## [10] "Before purchasing, I was obsessed with the reviews and predictions I found online and reading about some of the critiques such as the thick border, the lack of touchscreen, lack of battery SD slot, lack of a back light, awkward difficult keyboard layout, minimally faster page flipping, and the super, high price ."
Now we have our data. Let's use a simple tf-idf approach first. All the needed libraries:
library(Matrix)
library(gamlr)
library(parallel)
library(distrom)
library(textir)
library(NLP)
library(tm)
library(SnowballC)
I'm going to use the tm package. For more information, visit the tm package site.
corpus <- VCorpus(VectorSource(sentences))
corpus <- tm_map(corpus,
                     content_transformer(function(x) iconv(x, to='UTF-8-MAC', sub='byte')),
                     mc.cores=1)
corpus <- tm_map(corpus, content_transformer(removePunctuation), lazy = TRUE)
my.stopwords <- c(stopwords('english'), "the", "great", "use")
corpus <- tm_map(corpus, removeWords, c(stopwords('english'), "the", "great"))
corpus <- tm_map(corpus, removeNumbers)
corpus <- Corpus(VectorSource(corpus))

dtm <- DocumentTermMatrix(corpus, control=list(minWordLength=4, minDocFreq=4))
dtm
## <<DocumentTermMatrix (documents: 245, terms: 1492)>>
## Non-/sparse entries: 4171/361369
## Sparsity           : 99%
## Maximal term length: 14
## Weighting          : term frequency (tf)
Export the document-term matrix to a data frame and transpose it, so that distances are computed between terms.
df <- as.data.frame(inspect(dtm))
m <- t(as.matrix(df))
d <- dist(m)
Initial clustering.
hr <- hclust(d, method = "complete", members=NULL)
plot(hr)
Yeah, impossible to understand. Let's work on this problem.
plot(hr, hang = -1)
rect.hclust(hr, 10)
I won't dwell on why this is, and will pick the number of clusters somewhat arbitrarily. Let it be 10.
set.seed(1235)
df <- df[,!colnames(df)%in%my.stopwords]
cl = kmeans(df, centers = 10, nstart = 50, iter.max = 100)
res = cl$centers
discr = apply(cl$centers, 2, sd)
r = sort(discr, decreasing = TRUE, index = TRUE)$ix[1:10]
print(r)
##  [1]  902  439 1372  643  105  169  180 1418  681  990
barplot(cl$centers[,r],beside=TRUE, col=rainbow(10))
Problems: actually we've run the k-means algorithm only once, and it's very unstable on sparse data, so the right way is to run it multiple times and then combine the results. I should also note that k-means assumes our clusters are spheres; if domain knowledge tells us this isn't the case, we need to choose another algorithm (EM, Gaussian mixtures, c-means, weighted k-means, etc.).

Let's fix the k-means stability problem.

# Average the cluster centers over 50 k-means runs: match each new run's
# centers to the nearest running-average center before averaging them in.
cl = kmeans(df, centers = 10, nstart = 50, iter.max = 100)
centers = cl$centers
for(i in 1:50){
  cl.new = kmeans(df, centers = 10, nstart = 50, iter.max = 100)
  nearest = apply(cl.new$centers, 1,
                  function(x) which.min(rowSums(sweep(centers, 2, x)^2)))
  centers[nearest, ] = (centers[nearest, ] + cl.new$centers) / 2
}
cl$centers = centers
discr = apply(cl$centers, 2, sd)
r = sort(discr, decreasing = TRUE, index = TRUE)$ix[1:10]
barplot(cl$centers[,r],beside=TRUE, col=rainbow(10))
We see the results: 'ipod' now shows up. This improves our results a lot. Conclusions:

Orange cluster interested in ['camera', 'video', 'ipod']

Cyan cluster interested in ['performance', 'car']

Light green cluster interested in ['camera', 'video'], but less than orange cluster

Dark Blue cluster interested in ['features', 'windows']

Blue cluster interested in ['battery', 'keyboard', 'life']

Purple cluster interested in ['battery', 'life']

Yellow cluster interested in ['keyboard']

This is a 100% unsupervised approach to clustering this type of data, and we obtain a good segmentation of our sentences into topics. It can be improved a lot by using stemming for large datasets. This approach can help you understand your data from another side. Welcome!
sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] SnowballC_0.5.1 tm_0.6          NLP_0.1-6       textir_2.0-2   
## [5] distrom_0.3-1   gamlr_1.12-1    Matrix_1.1-5    gdata_2.13.3   
## [9] knitr_1.9      
## 
## loaded via a namespace (and not attached):
## [1] evaluate_0.5.5  formatR_1.0     grid_3.1.2      gtools_3.4.1   
## [5] highr_0.4       lattice_0.20-29 slam_0.1-32     stringr_0.6.2  
## [9] tools_3.1.2

Thursday, February 12, 2015

Analyzing the NYC Subway Dataset

Section 0. Preparation

import matplotlib
%matplotlib inline
import matplotlib.dates as mdates
#%matplotlib osx
import pandas
import numpy
from ggplot import *
import statsmodels.api as sm
import statsmodels
import scipy
import matplotlib.pyplot as plt
import math
import seaborn as sns
from datetime import datetime


df = pandas.read_csv('turnstile_data_master_with_weather.csv')
non_rain = df['ENTRIESn_hourly'][df['rain'] == 0]
rain = df['ENTRIESn_hourly'][df['rain'] == 1]

Before doing anything, I want to have a look at the data.

df.describe().transpose()
                   count          mean           std      min      25%      50%      75%       max
Unnamed: 0        131951  65975.000000  38091.117022     0.00  32987.5  65975.0  98962.5  131950.0
Hour              131951     10.896158      6.892084     0.00      5.0     12.0     17.0      23.0
ENTRIESn_hourly   131951   1095.348478   2337.015421     0.00     39.0    279.0   1109.0   51839.0
EXITSn_hourly     131951    886.890838   2008.604886     0.00     32.0    232.0    847.0   45249.0
maxpressurei      131951     30.031894      0.125689    29.74    29.96    30.03    30.10     30.31
maxdewpti         131951     57.241302      8.770891    39.00     50.0     57.0     64.0      70.0
mindewpti         131951     48.259013     11.305312    22.00     38.0     51.0     55.0      66.0
minpressurei      131951     29.892714      0.146384    29.54    29.84    29.91    29.97     30.23
meandewpti        131951     52.703526      9.943590    31.00     45.0     54.0     60.0      68.0
meanpressurei     131951     29.965077      0.130461    29.64    29.91    29.96    30.05     30.27
fog               131951      0.167100      0.373066     0.00      0.0      0.0      0.0       1.0
rain              131951      0.334245      0.471728     0.00      0.0      0.0      1.0       1.0
meanwindspdi      131951      5.543065      1.982441     1.00      5.0      5.0      6.0      12.0
mintempi          131951     56.169775      6.338875    46.00     52.0     54.0     60.0      70.0
meantempi         131951     64.269729      6.568289    55.00     60.0     63.0     68.0      78.0
maxtempi          131951     71.769968      7.627218    58.00     65.0     71.0     78.0      86.0
precipi           131951      0.172276      0.429005     0.00      0.0      0.0      0.1      2.18
thunder           131951      0.000000      0.000000     0.00      0.0      0.0      0.0       0.0

Section 1. Statistical Test

The first question I asked myself was: is there any significant difference between the number of people entering the subway on rainy and non-rainy days? As shown below, the data for this particular question isn't normally distributed, so it's suitable to use a non-parametric test. I'm going to use the Mann-Whitney U test. The null hypothesis for our two-sided test will be P(x > y) = 0.5, where x is the subway crowd on rainy days and y is the subway crowd on non-rainy days. First, let's check our assumption about non-Gaussian data using the Shapiro-Wilk test:

scipy.stats.shapiro(rain)
(0.4715914726257324, 0.0)
scipy.stats.shapiro(non_rain)
(0.47661787271499634, 0.0)

Personally, I'm very skeptical and always want to confirm results using another approach. For a second normality test I'll use scipy.stats.mstats.normaltest, based on D'Agostino and Pearson's test (more information can be found at docs.scipy.org). The method returns the p-value for a two-sided chi-squared probability for the hypothesis test.

scipy.stats.mstats.normaltest(rain)
(49379.024966302793, 0.0)
scipy.stats.mstats.normaltest(non_rain)
(96614.17551121171, 0.0)

As we can see from the Shapiro-Wilk and D'Agostino-Pearson tests, the data is definitely not normally distributed, given the extremely low p-values. Now let's calculate the statistic and p-value for the Mann-Whitney U test (wikipedia.org) using its Python implementation in scipy.stats:

t, p_value = scipy.stats.mannwhitneyu(rain, non_rain)
2 * p_value
0.049999825586979442

We multiplied the p-value by two for the Mann-Whitney U test because scipy returns the p-value for a one-sided test. The p-value is less than 0.05, our significance level, so we reject the null hypothesis that the distributions being compared are identical. Let's examine this data further and look at the means and medians.

"rain mean = {0}, non-rain mean = {1}".format(numpy.mean(rain), numpy.mean(non_rain))
'rain mean = 1105.44637675, non-rain mean = 1090.27878015'
"rain median = {0}, non-rain median = {1}".format(numpy.median(rain), numpy.median(non_rain))
'rain median = 282.0, non-rain median = 278.0'

So it seems the statistical tests are right. In my opinion, on sunny days people can walk down the street to work or university, but on rainy days that's not very comfortable, so maybe that is one source of the difference. I should also note that I used a significance level of 5%.

Section 2. Linear Regression

In my investigation of the given dataset I was trying to predict how many people will enter the subway using different predictors. I used gradient descent and OLS methods for this problem; here I want to show the prediction using the OLS method. Our response variable will be 'ENTRIESn_hourly'. Our predictors will be 'EXITSn_hourly', 'rain', and 'fog'. Note that rain and fog are dummy variables. I decided to use fog and rain because I thought that when it is very foggy or rainy outside, people might decide to use the subway more often. I also decided to use the Hour variable because it increases the R-squared value, and meantempi & meandewpti for their high variance. 'EXITSn_hourly' is present in the model because I believe the number of entries into the subway is strongly correlated with the number of exits.

y = df['ENTRIESn_hourly']
x = df[['EXITSn_hourly' ,'rain', 'fog', 'meantempi', 'Hour','meandewpti']]
x = sm.add_constant(x)
model = sm.OLS(y,x)
results = model.fit()
results.summary()
OLS Regression Results
Dep. Variable: ENTRIESn_hourly R-squared: 0.557
Model: OLS Adj. R-squared: 0.557
Method: Least Squares F-statistic: 2.770e+04
Date: Thu, 29 Jan 2015 Prob (F-statistic): 0.00
Time: 19:18:45 Log-Likelihood: -1.1569e+06
No. Observations: 131951 AIC: 2.314e+06
Df Residuals: 131944 BIC: 2.314e+06
Df Model: 6
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
const 324.5246 46.355 7.001 0.000 233.670 415.379
EXITSn_hourly 0.8549 0.002 395.953 0.000 0.851 0.859
rain -8.7396 12.559 -0.696 0.486 -33.355 15.875
fog 44.6593 13.107 3.407 0.001 18.970 70.348
meantempi -2.6079 1.020 -2.556 0.011 -4.608 -0.608
Hour 19.7716 0.629 31.430 0.000 18.539 21.005
meandewpti -0.7543 0.728 -1.036 0.300 -2.182 0.673
Omnibus: 109690.964 Durbin-Watson: 1.504
Prob(Omnibus): 0.000 Jarque-Bera (JB): 25879157.660
Skew: 3.164 Prob(JB): 0.00
Kurtosis: 71.316 Cond. No. 2.39e+04

From the model summary we can see that the rain variable is not significant for our prediction: its 95% confidence interval contains 0, so we can say that rain is not so important in the current variable configuration. In the model summary we can also observe the estimated coefficients of the dummy and non-dummy variables, their standard errors, confidence intervals, and various other useful statistics.
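
As a quick sanity check, the same numbers can be read straight from the fitted statsmodels results object (a small sketch using the results variable from the code above):

# Confidence interval and p-value for the rain coefficient: the interval
# straddles zero and the p-value is well above 0.05.
print(results.conf_int().loc['rain'])   # roughly [-33.36, 15.88]
print(results.pvalues['rain'])          # roughly 0.486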

The coefficient of determination of our model is equal to 0.557. Recall that the coefficient of determination has the following meaning: it is the share of variance that is explained by our model. We can see that our R-squared is too low to produce accurate predictions.
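
To make that definition concrete, R-squared can be recomputed by hand from the fitted values (a sketch, again reusing y and results from the code above):

# R-squared = 1 - (residual sum of squares) / (total sum of squares),
# i.e. the share of the variance in y that the model explains.
ss_res = ((y - results.fittedvalues) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
print(1 - ss_res / ss_tot)   # should match results.rsquared, about 0.557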

Section 3. Visualization

Let's plot our OLS prediction results. Let the y-axis be 'ENTRIESn_hourly' and the x-axis be 'EXITSn_hourly'. The plotting will be done using matplotlib.

fig, ax = plt.subplots()
statsmodels.graphics.regressionplots.plot_fit(results, 'EXITSn_hourly', ax=ax)
ax.set_ylabel("ENTRIESn_hourly")
ax.set_xlabel("EXITSn_hourly")
ax.set_title("Linear Regression")
plt.axis([0, 60000, 0, 60000])
fig1 = matplotlib.pyplot.gcf()
fig1.set_size_inches(10.5,10.5)
plt.show()

As we can see from the plot, we don't have a good model for predicting ENTRIESn_hourly at each subway station. Let's create a visualization that contains two histograms: one of ENTRIESn_hourly for rainy days and one of ENTRIESn_hourly for non-rainy days. For this visualization I've used ggplot.

df_t = df.copy(True)
df_t['rain'][df_t['rain'] == 0] = 'no rain'
df_t['rain'][df_t['rain'] == 1] = 'rain'
df_t.describe()
ggplot(df_t, aes(x = 'ENTRIESn_hourly')) + geom_histogram(binwidth = 200) + facet_wrap('rain') + xlim(0,5000) +\
ggtitle("Histograms of ENTRIESn_hourly by rain") + ylab('Frequency')

Let's add more plots and look at subway daily entries/exits with time.

df1 = df[['ENTRIESn_hourly', 'DATEn']].groupby('DATEn').sum()
df2 = df[['EXITSn_hourly', 'DATEn']].groupby('DATEn').sum()
df1.index.name = 'Date'
df1.reset_index(inplace=True)
df2.index.name = 'Date'
df2.reset_index(inplace=True)
df1.columns = ['Date', 'NumberOfPeople']
df2.columns = ['Date', 'NumberOfPeople']
df1['type'] = 'ENTRIESn_hourly'
df2['type'] = 'EXITSn_hourly'
total = pandas.concat([df1,df2])
total['Date'] = pandas.to_datetime(total['Date'])
start = min(total['Date'])
end = max(total['Date'])
gg = ggplot(total, aes(x='Date', y='NumberOfPeople', color = 'type')) + geom_point() + geom_line() +\
ggtitle("Daily entries and exits") + ylim(0, 6500000)
print gg

Surprisingly, my expectations were very different from this plot. I thought the total numbers of entries and exits in the subway should be nearly the same, but the plot shows a big difference between these two variables. The difference is explained by the fact that some subway stations in NY don't have turnstiles at the exits. It's also very common that fewer people use the subway on weekends. Notes: unfortunately, ggplot can't natively scale the date axis (https://github.com/yhat/ggplot/issues/312), and there are some bugs with limits.
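
The weekend effect mentioned above is easy to check directly. Here is a small sketch (my own illustrative check, not part of the original analysis) comparing mean daily entries on weekends and weekdays:

# Group hourly entries into daily totals and compare weekends vs. weekdays.
daily = df[['ENTRIESn_hourly', 'DATEn']].groupby('DATEn').sum().reset_index()
daily['DATEn'] = pandas.to_datetime(daily['DATEn'])
weekend = daily['DATEn'].apply(lambda d: d.weekday() >= 5)
print(daily[weekend]['ENTRIESn_hourly'].mean())    # mean daily entries on weekends
print(daily[~weekend]['ENTRIESn_hourly'].mean())   # mean daily entries on weekdays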

Finally, let's create a histogram of ENTRIESn_hourly depending on the weather, i.e. fog and rain. As in class, we will create a fog-rain variable first.

def format_key(fog, rain):
    return '{}fog-{}rain'.format(
        '' if fog else 'no',
        '' if rain else 'no')

fograin = []
for index, row in df.iterrows():
    fograin.append(format_key(row['fog'], row['rain']))
df['fograin'] = pandas.Series(fograin)
ggplot(df, aes(x='ENTRIESn_hourly', fill='fograin', color='fograin')) + \
    geom_histogram(alpha=0.6, binwidth = 200) + xlim(0,5000) + \
    xlab("Number of people") + ylab("Frequency") + ggtitle("Histogram of Entries hourly by type of weather")

From the plot we can observe that the mode of the number of people in the subway stays about the same. We can also notice that our data is highly right-skewed.
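
That skew can also be quantified with a single number; a quick sketch (the exact value isn't reported here, but a value far above zero indicates a heavy right tail):

# Sample skewness of hourly entries; strongly positive means right-skewed.
print(scipy.stats.skew(df['ENTRIESn_hourly']))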

Section 3.5. Prediction of future daily entries using linear regression.

First, I should note that I can't use EXITSn_hourly for future predictions, because I simply don't know these values. So I can only use weather values, which I can obtain from a meteorological station. In Section 3 I noticed that the data seems to be very seasonal, so it might be right to add some features to the linear regression to capture the seasonal (weekly) changes. So I'm going to add a seasonal pattern using Fourier terms.
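
Concretely, the idea is to approximate the weekly pattern with a truncated Fourier series on top of the weather regressors. With period T equal to one week in seconds (matching the season variable in the code below) and K sine/cosine pairs, the model is roughly

NumberOfPeople(t) ≈ β·weather(t) + Σ_{k=1..K} [ a_k·sin(2πkt/T) + b_k·cos(2πkt/T) ],

and the code below uses K = 2.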

#Read again data
df = pandas.read_csv('turnstile_data_master_with_weather.csv')
#define season in seconds (1 week)
season = 7 * 24 * 60 * 60

df1 = df[['ENTRIESn_hourly', 'DATEn']].groupby('DATEn').sum()
df1.index.name = 'Date'
df1.reset_index(inplace=True)
df1.columns = ['Date', 'NumberOfPeople']
df2 = df[['meanwindspdi', 'mintempi', 'meantempi', 'maxtempi','maxpressurei', 'maxdewpti', 'mindewpti', 'minpressurei', 'meandewpti', 'meanpressurei', 'DATEn']].groupby('DATEn').mean()
df2.reset_index(inplace=True)
##concat df1 and df2
main = pandas.concat([df1,df2], 1)
main = main.drop('DATEn', 1)
main['Date'] = pandas.to_datetime(main['Date'])
main['UnixTime'] = main['Date'].astype(numpy.int64) // 10**9
#function to add fourier terms to data frame, n - number of terms
def add(df, n, col_time_name):
    for i in range(1,n+1):
        col_sin_name = "sin" + str(i)
        col_cos_name = "cos" + str(i)
        sin = lambda x: math.sin(2*math.pi*i*x/season)
        cos = lambda x: math.cos(2*math.pi*i*x/season)

        df[col_sin_name] = df[col_time_name].apply(sin)
        df[col_cos_name] = df[col_time_name].apply(cos)
    
    return df

main = add(main, 2, 'UnixTime')
cols = list(main.columns.values)
cols_x = [x for x in cols if x not in ['NumberOfPeople', 'Date', 'UnixTime']]
y = main['NumberOfPeople']
x = main[cols_x]
x = sm.add_constant(x)
model = sm.OLS(y,x)
results = model.fit()
results.summary()
OLS Regression Results
Dep. Variable: NumberOfPeople R-squared: 0.951
Model: OLS Adj. R-squared: 0.905
Method: Least Squares F-statistic: 20.64
Date: Thu, 29 Jan 2015 Prob (F-statistic): 2.77e-07
Time: 19:19:17 Log-Likelihood: -420.42
No. Observations: 30 AIC: 870.8
Df Residuals: 15 BIC: 891.9
Df Model: 14
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
const -1.54e+07 2.48e+07 -0.622 0.543 -6.82e+07 3.74e+07
meanwindspdi -8130.3111 5.44e+04 -0.149 0.883 -1.24e+05 1.08e+05
mintempi -2.794e+05 1.94e+05 -1.443 0.169 -6.92e+05 1.33e+05
meantempi 6.312e+05 3.57e+05 1.766 0.098 -1.31e+05 1.39e+06
maxtempi -3.549e+05 1.76e+05 -2.012 0.063 -7.31e+05 2.12e+04
maxpressurei 1.054e+07 4.11e+06 2.563 0.022 1.77e+06 1.93e+07
maxdewpti 1.342e+05 7.44e+04 1.804 0.091 -2.44e+04 2.93e+05
mindewpti 7.91e+04 7.49e+04 1.057 0.307 -8.05e+04 2.39e+05
minpressurei -1.229e+06 2.63e+06 -0.467 0.647 -6.84e+06 4.38e+06
meandewpti -2.253e+05 1.39e+05 -1.618 0.126 -5.22e+05 7.14e+04
meanpressurei -8.627e+06 4.91e+06 -1.756 0.099 -1.91e+07 1.84e+06
sin1 -9.192e+05 1.5e+05 -6.119 0.000 -1.24e+06 -5.99e+05
cos1 1.529e+06 1.62e+05 9.460 0.000 1.18e+06 1.87e+06
sin2 5.624e+05 1.44e+05 3.909 0.001 2.56e+05 8.69e+05
cos2 -1.866e+05 1.33e+05 -1.406 0.180 -4.7e+05 9.64e+04
Omnibus: 9.581 Durbin-Watson: 1.863
Prob(Omnibus): 0.008 Jarque-Bera (JB): 8.179
Skew: -1.038 Prob(JB): 0.0167
Kurtosis: 4.494 Cond. No. 5.04e+04

Let's create a visualization.

pr = pandas.Series(results.predict())
#Construct data with fitted values and real values
final1 = pandas.DataFrame(dict(values = pr, date = main['Date']))
final2 = pandas.DataFrame(dict(values = main['NumberOfPeople'], date = main['Date']))
final1['type'] = 'fitted values'
final2['type'] = 'real values'
final = pandas.concat([final1, final2])

#create visualization
gg = ggplot(final, aes(x='date', y='values', color = 'type')) + geom_point() + geom_line() +\
ggtitle("Fitting daily entries using Linear Regression with Fourier terms") + ylim(0, 6600000)
print gg

As we can see, after adding the seasonal pattern we increased the R-squared to 0.95 using only 4 Fourier terms.

Section 4. Conclusion

In this first project we partially investigated the NYC Subway Dataset. I was trying to answer the following question: "Do more people ride the NYC subway when it is raining or when it is not raining?" As part of this investigation I obtained two samples (rainy and non-rainy days) and needed to check the null hypothesis that the two samples come from the same distribution. First I needed to choose a statistical test to do this task correctly. I checked both samples for normality; the Shapiro-Wilk test rejected the null hypothesis of normality for both samples. So I chose the Mann-Whitney U test (also known as the Wilcoxon rank-sum test), because it is a non-parametric test with no assumptions about the distributions. For our two samples the Mann-Whitney U test rejected the null hypothesis. Looking at the mean and median for rainy and non-rainy days (both greater for rainy days: mean difference about 15, median difference 4) together with the results of the statistical test, we can conclude that the distribution of the number of entries is statistically different between rainy and non-rainy days at a 5% significance level.

Linear regression using only features from the dataset gave results that were neither bad nor good; they can serve as a starting point, but the model is too weak for accurate predictions. More investigation is still needed to make the right decision, and in any case I trust the non-parametric approach more. I built an OLS model with a not-so-good R-squared of 0.557; it can only be an initial solution before diving deeper into this dataset. There are also some issues with linear regression if we plan to use it for prediction: we would definitely have to exclude regressors such as 'EXITSn_hourly', because we aren't able to obtain these values in advance. Looking at the plot, we can see that daily entries have a seasonal pattern, so I included Fourier terms to capture this. I obtained a good R-squared of 0.95, and this model can be used for predictions. I also want to note that the seasonal linear regression includes variables such as daily minimum temperature, maximum temperature, etc.; these values I can get from a meteorological station.

Section 5. Reflection

I've found the following potential shortcomings of the dataset we used:

  1. Daily entries aren't equal to daily exits. Is this a bug? Or maybe I just don't have a clear understanding of EXITSn_hourly. I think entries is how many people enter the subway and exits is the number of people exiting it, and I'm sure these variables should be roughly equal on a daily basis.
  2. Just one month of data. This implies low variability of temperature and other interesting features, so they might not be very important in May but could be extremely useful in winter.

Some notes about the application of linear regression to our dataset:

  1. It's clear from the figure that our data is seasonal, so we should include seasonal features to predict ENTRIESn_hourly; classic linear regression doesn't take seasonality into account on its own.
  2. I included the seasonality pattern in the linear regression using Fourier terms and obtained a high R-squared of about 0.95. The model was built without EXITSn_hourly, so we can use it to predict future values of daily entries in the NYC subway.

Last notes about statistical tests: in my opinion, for any test I should use at least two different approaches to calculate the p-value for the null hypothesis. In practice, though, I didn't find a genuinely different non-parametric test for my task. Tests for data normality are better developed, and I found a good number of approaches suitable for different situations.

P.S. All source code and the IPython notebook can be found on GitHub.

List of Resources.

  1. Scipy - Normal test. Url: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.mstats.normaltest.html
  2. Wikipedia - Mann-Whitney U test. Url: http://en.wikipedia.org/wiki/Mann–Whitney_U_test
  3. Scipy - Mann-Whitney U test. Url: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.mannwhitneyu.html
  4. StatsModels - Regression. Url: http://statsmodels.sourceforge.net/0.5.0/generated/statsmodels.regression.linear_model.OLS.html
  5. Date Axis Limits GGplot Issue. Url: https://github.com/yhat/ggplot/issues/312
  6. StatsModels - OLS. Url: http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/ols.html
  7. Python GGplot - GGplot documentation. Url: http://ggplot.yhathq.com/docs/index.html