Tuesday, February 17, 2015

Sentence Clustering

Briefly: I'm going to show you how to cluster sentences. The main benefit is that you can describe the resulting clusters with this approach. This article is a starting point for more sophisticated sentence clustering or market segmentation (think about this). It is also a nice illustration of how hard clustering can easily be transformed into soft clustering.

Load data files

The data is from ics.uci.edu/ml

Opinion Reviews.

data.battery <- read.table("data/battery-life_amazon_kindle.txt.data", sep = "\n", stringsAsFactors = FALSE, strip.white = TRUE)
data.windows7 <- read.table("data/features_windows7.txt.data", sep = "\n", stringsAsFactors = FALSE, strip.white = TRUE)
## Warning in scan(file, what, nmax, sep, dec, quote, skip, nlines,
## na.strings, : EOF within quoted string
data.keyboard <- read.table("data/keyboard_netbook_1005ha.txt.data", sep="\n", stringsAsFactors = FALSE, strip.white = TRUE)
data.honda <- read.table("data/performance_honda_accord_2008.txt.data", sep = "\n", stringsAsFactors = FALSE, strip.white = TRUE)
data.ipod <- read.table("data/video_ipod_nano_8gb.txt.data", sep = "\n", stringsAsFactors = FALSE, strip.white = TRUE)
(The warning above came from the windows7 file: read.table hit an unmatched quote character, so passing quote = "" would avoid it, at the cost of reading the lines slightly differently.) I will try to do a cluster analysis of these sentences without any knowledge of their topics. As a first step, let's merge all the data sets and trim the sentences.
library(gdata)  # for trim()
sentences <- c(trim(data.battery[, 1]),
               trim(data.windows7[, 1]),
               trim(data.keyboard[, 1]),
               trim(data.honda[, 1]),
               trim(data.ipod[, 1]))
sentences[1:10]
##  [1] "After I plugged it in to my USB hub on my computer to charge the battery the charging cord design is very clever !"                                                                                                                                                                                                        
##  [2] "After you have paged tru a 500, page book one, page, at, a, time to get from Chapter 2 to Chapter 15, see how excited you are about a low battery and all the time it took to get there !"                                                                                                                                 
##  [3] "NO USER REPLACEABLE BATTERY, , Unless you buy the extended warranty for $65 ."                                                                                                                                                                                                                                             
##  [4] "After 1 year you pay $80 plus shipping to send the device to Amazon and have the Kindle REPLACED, not the battery changed out   ."                                                                                                                                                                                         
##  [5] "The fact that Kindle 2 has no SD card capability and the battery is not user, serviceable is not an issue with me ."                                                                                                                                                                                                       
##  [6] "Things like the buttons that made it easy to accidentally turn pages  the separate cursor on the side that could only select lines and was sometimes hard to see  the occasionally awkward menus  the case which practically forced you to remove it to use it and sometimes pulled the battery door off ."                
##  [7] "The issue with the battery door opening is thus solved, but Amazon went further, eliminating the door altogether and wrapping the back with sleek stainless steel ."                                                                                                                                                       
##  [8] "Frankly, I never used either the card slot or changed the battery on my Kindle 1 but I liked that they were there and I miss them on the Kindle 2, even though, I have to admit, I dont actually need them .\n Its also easy to charge the Kindle in the car if you have a battery charger with a USB port   ."            
##  [9] "You cant carry an extra battery ,  though with the extended battery life and extra charging options its almost a non, issue ,  and you cant replace the battery because of the iPod, like fixed backing .\n For one thing, theres no charge except battery power no pun intended !"                                        
## [10] "Before purchasing, I was obsessed with the reviews and predictions I found online and reading about some of the critiques such as the thick border, the lack of touchscreen, lack of battery SD slot, lack of a back light, awkward difficult keyboard layout, minimally faster page flipping, and the super, high price ."
Now we have our data. Let's start with a simple term-frequency (bag-of-words) representation. All the needed libraries:
library(Matrix)
library(gamlr)
library(parallel)
library(distrom)
library(textir)
library(NLP)
library(tm)
library(SnowballC)
I'm going to use the tm package. For more information, visit the tm package site.
corpus <- VCorpus(VectorSource(sentences))
corpus <- tm_map(corpus,
                     content_transformer(function(x) iconv(x, to='UTF-8-MAC', sub='byte')),
                     mc.cores=1)
corpus <- tm_map(corpus, content_transformer(removePunctuation), lazy = TRUE)
my.stopwords <- c(stopwords('english'), "the", "great", "use")
corpus <- tm_map(corpus, removeWords, my.stopwords)
corpus <- tm_map(corpus, removeNumbers)
corpus <- Corpus(VectorSource(corpus))

dtm <- DocumentTermMatrix(corpus, control=list(minWordLength=4, minDocFreq=4))
dtm
## <<DocumentTermMatrix (documents: 245, terms: 1492)>>
## Non-/sparse entries: 4171/361369
## Sparsity           : 99%
## Maximal term length: 14
## Weighting          : term frequency (tf)
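By the way, the weighting above is plain term frequency, not tf-idf. If you want tf-idf, tm's weightTfIdf can be plugged into the same call (a quick sketch; dtm.tfidf is just an illustrative name, and the rest of the post keeps the plain tf matrix):
dtm.tfidf <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))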
Export to a data frame and transpose the matrix (after the transpose the rows are terms, so the distances and the dendrogram below are over terms rather than sentences).
df <- as.data.frame(as.matrix(dtm))
m <- t(as.matrix(df))
d <- dist(m)
Initial clustering.
hr <- hclust(d, method = "complete", members=NULL)
plot(hr)
Yeah, impossible to read. Let's work on this problem.
plot(hr, hang = -1)
rect.hclust(hr, 10)
I don't want to agonise over why the dendrogram looks like this, so I'll just pick the number of clusters somewhat arbitrarily. Let it be 10.
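If you want something less arbitrary than guessing, a quick elbow plot of the total within-cluster sum of squares is one option (a minimal sketch; the range 2:15, nstart = 20 and the name wss are just illustrative choices):
# Total within-cluster sum of squares for k = 2..15; look for the "elbow" where it stops dropping quickly.
wss <- sapply(2:15, function(k) kmeans(df, centers = k, nstart = 20, iter.max = 100)$tot.withinss)
plot(2:15, wss, type = "b", xlab = "number of clusters k", ylab = "total within-cluster SS")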
set.seed(1235)
df <- df[,!colnames(df)%in%my.stopwords]
cl = kmeans(df, centers = 10, nstart = 50, iter.max = 100)
res = cl$centers
discr = apply(cl$centers, 2, sd)
r = sort(discr, decreasing = TRUE, index.return = TRUE)$ix[1:10]
print(r)
##  [1]  902  439 1372  643  105  169  180 1418  681  990
barplot(cl$centers[,r],beside=TRUE, col=rainbow(10))
Problems: we actually ran the k-means algorithm only once, and it is very unstable on sparse data. So the right way is to run it multiple times and then combine the clusters. I should also note that k-means assumes the clusters are spheres. If domain knowledge tells us that is not the case, we need to choose another algorithm (EM, Gaussian mixtures, c-means, weighted k-means, etc.).
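To make the hard-to-soft idea from the intro concrete, here is a minimal sketch that turns the hard k-means result above into soft memberships; the names (d2, memberships) and the exp(-distance) weighting are just illustrative choices:
x <- as.matrix(df)
# Euclidean distance from every sentence to every k-means center.
d2 <- sapply(1:nrow(cl$centers), function(j) rowSums(sweep(x, 2, cl$centers[j, ])^2))
# Convert distances into membership weights; each row now sums to 1.
memberships <- exp(-sqrt(d2))
memberships <- memberships / rowSums(memberships)
round(head(memberships), 2)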

Let's fix the k-means stability problem.

# Stabilise k-means by running it repeatedly and averaging matched centers:
# each accumulated center is averaged with the closest center from the new run.
centers <- kmeans(df, centers = 10, nstart = 50, iter.max = 100)$centers
for (i in 1:50) {
  cl <- kmeans(df, centers = 10, nstart = 50, iter.max = 100)
  for (j in 1:nrow(centers)) {
    dists <- colSums((t(cl$centers) - centers[j, ])^2)
    centers[j, ] <- (centers[j, ] + cl$centers[which.min(dists), ]) / 2
  }
}
cl <- kmeans(df, centers = centers, iter.max = 100)  # final run seeded with the averaged centers
discr = apply(cl$centers, 2, sd)
r = sort(discr, decreasing = TRUE, index.return = TRUE)$ix[1:10]
barplot(cl$centers[,r],beside=TRUE, col=rainbow(10))
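Before reading the barplot colour by colour, it also helps to print the highest-weighted terms of each center directly (a small sketch; top.terms is just an illustrative name):
# For every cluster center, list its five highest-weighted terms.
top.terms <- apply(cl$centers, 1, function(cen) names(sort(cen, decreasing = TRUE))[1:5])
top.terms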
We see the results: ipod is now added, which improves them a lot. Conclusions:

Orange cluster interested in ['camera', 'video', 'ipod']

Cyan cluster interested in ['performance', 'car']

Light green cluster interested in ['camera', 'video'], but less so than the orange cluster

Dark Blue cluster interested in ['features', 'windows']

Blue cluster interested in ['battery', 'keyboard', 'life']

Purple cluster interested in ['battery', 'life']

Yellow cluster interested in ['keyboard']

This is a 100% unsupervised approach to clustering this type of data, and we obtain a good segmentation of our sentences into topics. It can be improved a lot by using stemming on larger datasets. This approach can help you understand your data from another side. Welcome!
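For example, stemming could be slotted into the preprocessing step above, right before building the document-term matrix, using the stemmer from the already-loaded SnowballC package (a minimal sketch; it would of course change the term counts and clusters reported here):
# Map word variants ("batteries", "battery") onto a common stem before counting.
corpus <- tm_map(corpus, stemDocument)  # then rebuild the DocumentTermMatrix as before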
sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] SnowballC_0.5.1 tm_0.6          NLP_0.1-6       textir_2.0-2   
## [5] distrom_0.3-1   gamlr_1.12-1    Matrix_1.1-5    gdata_2.13.3   
## [9] knitr_1.9      
## 
## loaded via a namespace (and not attached):
## [1] evaluate_0.5.5  formatR_1.0     grid_3.1.2      gtools_3.4.1   
## [5] highr_0.4       lattice_0.20-29 slam_0.1-32     stringr_0.6.2  
## [9] tools_3.1.2
