Monday, June 17, 2013

(English) word-clouds in R

The R wordcloud package can be used to generate static images similar to tag-clouds. These are a fun way to visualize document contents, as demonstrated on the R Data Mining website and at the One R Tip A Day site.

Running the sample code from these examples on any real English prose results in lists of words that are far from satisfactory, even when using a stemmer. English is a difficult language to parse, especially when the source is nontechnical writing or, worse, a transcript. In this particular case, an entirely accurate parsing of English isn't necessary; the wordcloud generation only has to be intelligent enough to not make the viewer snort in derision.

To begin with, use the R Text Mining package to load a directory of documents to be analyzed:
library(tm)
wc_corpus <- Corpus(DirSource('/tmp/wc_documents'))

This creates a Corpus containing all files in the directory supplied to DirSource. The files are assumed to be in plaintext; for different formats, use the Corpus readerControl argument:
wc_corpus <- Corpus(DirSource('/tmp/wc_documents'), readerControl=list(reader=readPDF))

If the text is already loaded in R, then a VectorSource can of course be used:
wc_corpus <- Corpus(VectorSource(data_string))

Next, the text in the Corpus must be normalized. This involves the following steps:
  1. convert all text to lowercase
  2. expand all contractions
  3. remove all punctuation
  4. remove all "noise words"
The last step requires detecting what are known as "stop words": words in a language that carry little information (articles, prepositions, and extremely common words). Note that in most text processing, a fifth step would be added to stem the words in the Corpus; in generating word clouds, this produces undesirable output, as the stemmed words tend to be roots that are not recognizable as actual English words.

The following code performs these steps:
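# tm 0.6 and later: wrap non-tm functions such as tolower and fix_contractions in content_transformer()
wc_corpus <- tm_map(wc_corpus, tolower)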
# fix_contractions is defined later in the article
wc_corpus <- tm_map(wc_corpus, fix_contractions)
wc_corpus <- tm_map(wc_corpus, removePunctuation)
wc_corpus <- tm_map(wc_corpus, removeWords, stopwords('english'))
# Not executed: stem the words in the corpus
# wc_corpus <- tm_map(wc_corpus, stemDocument)

This code makes use of the tm_map function, which invokes a function for every document in the Corpus.

A support function is required to expand contractions in the Corpus. Note that this step must be performed before punctuation is removed; once the apostrophes are gone, contractions are much harder to detect.

The purpose of the fix_contractions function is to expand all contractions to their "formal English" equivalents: don't to do not, we'll to we will, etc. The following function uses gsub to perform this expansion, except in the case of 's, which may be either a possessive or a contraction of "is" and is therefore simply removed.

fix_contractions <- function(doc) {
   # "won't" is a special case as it does not expand to "wo not"
   doc <- gsub("won't", "will not", doc)
   doc <- gsub("n't", " not", doc)
   doc <- gsub("'ll", " will", doc)
   doc <- gsub("'re", " are", doc)
   doc <- gsub("'ve", " have", doc)
   doc <- gsub("'m", " am", doc)
   # 's could be is or possessive: it has no expansion
   doc <- gsub("'s", "", doc) 
   return(doc)
}
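
The order of the substitutions matters, which is easiest to see with a worked example. The following is a minimal sketch of the same rules in Ruby (the language used for the command-line wrapper in the sentiment analysis post below). The constant CONTRACTION_RULES and the sample sentence are illustrative assumptions and are not part of the original R code.

```ruby
# Illustrative Ruby port of the fix_contractions idea (not the original R code).
# Rules are applied in order: "won't" must be handled before the general
# "n't" rule, or it would become "wo not".
CONTRACTION_RULES = [
  [/won't/, 'will not'],
  [/n't/,   ' not'],
  [/'ll/,   ' will'],
  [/'re/,   ' are'],
  [/'ve/,   ' have'],
  [/'m/,    ' am'],
  [/'s/,    ''] # possessive or "is": no safe expansion, so drop it
].freeze

def fix_contractions(text)
  CONTRACTION_RULES.reduce(text) do |doc, (pattern, replacement)|
    doc.gsub(pattern, replacement)
  end
end

# Hypothetical sample input, already lowercased as in the pipeline above.
puts fix_contractions("we won't go, they don't care, and it's john's")
```

With the rules in this order the sample becomes "we will not go, they do not care, and it john". Swapping the first two rules would turn "won't" into "wo not", which is why the special case comes first.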

The Corpus has now been normalized, and can be used to generate a list of words along with counts of their occurrence. First, a term-document matrix (a TermDocumentMatrix object) is created; next, a Word-Frequency Vector is generated by summing each row of that matrix. Each element in the vector is the number of occurrences for a specific word, and the name of the element is the word itself (use names(v) to verify this).

# keep words of three or more characters (older tm versions used minWordLength = 3)
td_mtx <- TermDocumentMatrix(wc_corpus, control = list(wordLengths = c(3, Inf)))
v <- sort(rowSums(as.matrix(td_mtx)), decreasing=TRUE)

At this point, the vector is a list of all words in the Corpus, along with their frequency counts. This can be cleaned up by removing obvious plurals (dog, dogs; address, addresses; etc.), and adding their occurrence count to the singular case.

This doesn't have to be completely accurate (it's only a wordcloud, after all), and it is not necessary to convert plural words to singular if there is no singular form present. The following function will check each word in the Word-Frequency Vector to see if a plural form of that word (specifically, the word followed by s or es) exists in the Vector as well. If so, the frequency count for the plural form is added to the frequency count for the singular form, and the plural form is removed from the Vector.

aggregate.plurals <- function (v) {
    # add the count for plural to the count for singular, then drop plural
    aggr_fn <- function(v, singular, plural) {
        if (! is.na(v[plural])) {
            v[singular] <- v[singular] + v[plural]
            v <- v[-which(names(v) == plural)]
        }
        return(v)
    }
    for (n in names(v)) {
        # skip words that have already been merged into their singular form
        if (is.na(v[n])) next
        n_pl <- paste(n, 's', sep='')
        v <- aggr_fn(v, n, n_pl)
        n_pl <- paste(n, 'es', sep='')
        v <- aggr_fn(v, n, n_pl)
    }
    return(v)
}

The function is applied to the Word-Frequency Vector as follows:
v <- aggregate.plurals(v)
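
To make the merging behaviour concrete, here is a small sketch of the same idea in Ruby, using a Hash of word counts in place of the named R vector. The method name aggregate_plurals and the sample counts are assumptions made for illustration. Like the R function, it only looks for the word followed by s or es, so it is a heuristic and not a real pluralization check.

```ruby
# Fold "word + s" and "word + es" counts into the singular entry.
# Input and output are Hashes of word => count (illustrative sketch only).
def aggregate_plurals(freq)
  merged = freq.dup
  freq.each_key do |word|
    next unless merged.key?(word) # already folded into a singular form
    ["#{word}s", "#{word}es"].each do |plural|
      next unless merged.key?(plural)
      merged[word] += merged.delete(plural)
    end
  end
  merged
end

# Hypothetical word counts.
counts = { 'dog' => 3, 'dogs' => 2, 'address' => 1, 'addresses' => 4, 'glasses' => 5 }
p aggregate_plurals(counts)
```

In this sample, dogs is folded into dog and addresses into address (each ending with a count of 5), while glasses is left alone because no singular form "glass" or "glasse" is present.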

All that remains is to create a dataframe of the word frequencies, and supply that to the wordcloud function in order to generate the wordcloud image:
df <- data.frame(word=names(v), freq=v)
library(wordcloud)
wordcloud(df$word, df$freq, min.freq=3)

To save the wordcloud to a file instead of displaying it, open a file-based graphics device before calling wordcloud. An example for PNG output:
png(file='wordcloud.png', bg='transparent')
wordcloud(df$word, df$freq, min.freq=3)
dev.off()

The techniques used previously to create a standalone sentiment analysis command-line utility can be used in this case as well.

Friday, June 14, 2013

Git Trick: Preview before pull

A little out of sync with your teammates? Not sure if that next git pull is going to send you into a half-hour of merging?

Use git-fetch and git-diff to see what evils await you:


git fetch
git diff origin/master


As usual, difftool can be used to launch a preferred diff utility (*cough*meld*cough*).


git difftool origin/master


To see just what files have changed, use the --stat option:


git diff --stat origin/master

...or --dirstat to see what directories have changed:


git diff --dirstat origin/master


With any luck, everything is more or less in sync and you can proceed with your usual git pull.

For those looking for something to add to their .bashrc:

alias git-dry-run='git fetch && git diff --stat origin/master'

Thursday, June 13, 2013

Quick-and-dirty Sentiment Analysis in Ruby + R


Sentiment analysis is a hot topic these days, and it is easy to see why. The idea that one could mine a bunch of Twitter drivel in order to guesstimate the popularity of a topic, company or celebrity must have induced seizures in marketing departments across the globe.

All the more so because, given the right tools, it's not all that hard.


The R Text Mining package (tm) can be used to perform rather painless sentiment analysis on choice topics.

The Web Mining plugin (tm.plugin.webmining) can be used to query a search engine and build a corpus of the documents in the results:

library(tm.plugin.webmining)
corpus <- WebCorpus(YahooNewsSource('drones'))

The corpus is a standard tm corpus object, meaning it can be passed to other tm plugins without a problem.


One of the more interesting plugins that can be fed a corpus object is the Sentiment Analysis plugin (tm.plugin.sentiment):

library(tm.plugin.sentiment)
corpus <- score(corpus)
sent_scores <- meta(corpus)


The score() method performs sentiment analysis on the corpus, and stores the results in the metadata of the corpus R object. Examining the output of the meta() call will display these scores:


summary(sent_scores)
     MetaID     polarity         subjectivity     pos_refs_per_ref  neg_refs_per_ref 
 Min.   :0   Min.   :-0.33333   Min.   :0.02934   Min.   :0.01956   Min.   :0.00978  
 1st Qu.:0   1st Qu.:-0.05263   1st Qu.:0.04889   1st Qu.:0.02667   1st Qu.:0.02266  
 Median :0   Median : 0.06926   Median :0.06767   Median :0.03009   Median :0.02755  
 Mean   :0   Mean   : 0.04789   Mean   :0.06462   Mean   :0.03343   Mean   :0.03118  
 3rd Qu.:0   3rd Qu.: 0.15862   3rd Qu.:0.07579   3rd Qu.:0.03981   3rd Qu.:0.03526  
 Max.   :0   Max.   : 0.37778   Max.   :0.10145   Max.   :0.06280   Max.   :0.05839  
             NA's   : 2.00000   NA's   :2.00000   NA's   :2.00000   NA's   :2.00000  
 senti_diffs_per_ref
 Min.   :-0.029197  
 1st Qu.:-0.002451  
 Median : 0.003501  
 Mean   : 0.002248  
 3rd Qu.: 0.009440  
 Max.   : 0.026814  
 NA's   : 2.000000 


These sentiment scores are based on the Lydia/TextMap system, and are explained in the TextMap paper as well as in the tm.plugin.sentiment presentation:

  • polarity ((p - n) / (p + n)) : difference of positive and negative sentiment references / total number of sentiment references
  • subjectivity ((p + n) / N) : total number of sentiment references / total number of references
  • pos_refs_per_ref (p / N) : total number of positive sentiment references / total number of references
  • neg_refs_per_ref (n / N) : total number of negative sentiment references / total number of references
  • senti_diffs_per_ref ((p - n) / N) : difference of positive and negative sentiment references / total number of references

Here p and n are the numbers of positive and negative sentiment references, and N is the total number of references. The pos_refs_per_ref and neg_refs_per_ref metrics are the rates at which positive and negative references occur in the corpus, respectively (i.e., "x out of n textual references were positive/negative"). The polarity metric is used to determine the bias (positive or negative) of the text, while the subjectivity metric is used to determine the rate at which biased (i.e. positive or negative) references occur in the text.

The remaining metric, senti_diffs_per_ref, is the product of polarity and subjectivity, since (p - n)/(p + n) × (p + n)/N = (p - n)/N. It measures the bias of the text in relation to the size of the text (actually, the number of references in the text) as a whole. This is likely to be what most people expect the output of a sentiment analysis to be, but it may also be useful to compute the ratio of pos_refs_per_ref to neg_refs_per_ref.
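
The five metrics can be worked through by hand with a few lines of Ruby. This is only a sketch of the formulas listed above, not the code used by tm.plugin.sentiment. The method name sentiment_metrics, the sample counts, and the choice to return nil for polarity when there are no sentiment references are all assumptions.

```ruby
# pos = positive references (p), neg = negative references (n),
# total = all references (N). Assumes total is greater than zero.
def sentiment_metrics(pos, neg, total)
  sentiment = pos + neg
  {
    # polarity is undefined when there are no sentiment references;
    # returning nil here is a choice made for this sketch
    polarity:            sentiment.zero? ? nil : (pos - neg).to_f / sentiment,
    subjectivity:        sentiment.to_f / total,
    pos_refs_per_ref:    pos.to_f / total,
    neg_refs_per_ref:    neg.to_f / total,
    senti_diffs_per_ref: (pos - neg).to_f / total
  }
end

# Hypothetical document: 6 positive and 4 negative references out of 200.
m = sentiment_metrics(6, 4, 200)
m.each { |name, value| puts format('%-20s %.3f', name, value) }
```

For these counts the polarity is 0.2 and the subjectivity is 0.05. Multiplying the two gives 0.01, which is the senti_diffs_per_ref value of (6 - 4) / 200.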


Having some R code to perform sentiment analysis is all well and good, but it doesn't make for a decent command-line utility. For that, it is useful to call R from within Ruby. The rsruby gem can be used to do this.

# load the rsruby gem and initialize R
require 'rsruby'
ENV['R_HOME'] ||= '/usr/lib/R'
r = RSRuby.instance

# load TM libraries
r.eval_R("suppressMessages(library('tm.plugin.webmining'))")
r.eval_R("suppressMessages(library('tm.plugin.sentiment'))")

# perform search and sentiment analysis
r.eval_R("corpus <- WebCorpus(YahooNewsSource('drones'))")
r.eval_R('corpus <- score(corpus)')

# output results
scores = r.eval_R('meta(corpus)')
puts scores.inspect


The output of the last eval_R command is a Hash corresponding to the sent_scores dataframe in the R code.
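
Once the scores are in a Ruby Hash, summarizing them is straightforward. The sketch below assumes the Hash maps each column name to an Array with one value per document, and that missing values (the NA's in the R summary) arrive as nil or NaN. The exact representation depends on how rsruby converts the dataframe, so the sample data here is a hypothetical stand-in and not real output.

```ruby
# Hypothetical stand-in for the Hash returned by r.eval_R('meta(corpus)').
scores = {
  'polarity'     => [0.25, -0.5, nil, 0.75],
  'subjectivity' => [0.05, 0.10, nil, 0.15]
}

# Drop missing values, whether they were converted to nil or NaN.
def present_values(values)
  values.reject { |v| v.nil? || (v.respond_to?(:nan?) && v.nan?) }
end

def mean(values)
  vals = present_values(values)
  vals.empty? ? nil : vals.sum / vals.size.to_f
end

def median(values)
  vals = present_values(values).sort
  return nil if vals.empty?
  mid = vals.size / 2
  vals.size.odd? ? vals[mid] : (vals[mid - 1] + vals[mid]) / 2.0
end

# Print one pipe-delimited row per metric: name|mean|median
scores.each do |column, values|
  puts [column, mean(values), median(values)].join('|')
end
```

Filtering out missing values before averaging mirrors what summary() does in R, where documents without any scored references are reported separately as NA's.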


Naturally, in order for this to be anything but a throwaway script, there has to be some decent command line parsing, maybe an option to aggregate or summarize the results, and of course some sort of output formatting.

As usual, the source code for such a utility has been uploaded to GitHub: https://github.com/mkfs/sentiment-analysis

Usage: sentiment_for_symbol.rb TERM [...]
Perform sentiment analysis on a web query for keyword

Google Engines:
    -b, --google-blog                Include Google Blog search
    -f, --google-finance             Include Google Finance search
    -n, --google-news                Include Google News search
Yahoo Engines:
    -F, --yahoo-finance              Include Yahoo Finance search
    -I, --yahoo-inplay               Include Yahoo InPlay search
    -N, --yahoo-news                 Include Yahoo News search
Summary Options:
    -m, --median                     Calculate median
    -M, --mean                       Calculate mean
Output Options:
    -p, --pipe-delim                 Print pipe-delimited table output
    -r, --raw                        Serialize output as a Hash, not an Array
Misc Options:
    -h, --help                       Show help screen
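
A usage screen like the one above is the kind of thing Ruby's standard OptionParser library produces. The following sketch handles a subset of the flags shown. It is an illustration written for this article and is not the code in the repository; the method name parse_options and the layout of the options Hash are assumptions.

```ruby
require 'optparse'

# Parse a subset of the flags from the usage screen (illustrative only).
# Returns the options Hash and the remaining arguments (the search terms).
def parse_options(argv)
  options = { engines: [], summary: nil, pipe: false }
  parser = OptionParser.new do |opts|
    opts.banner = 'Usage: sentiment_for_symbol.rb TERM [...]'
    opts.on('-n', '--google-news', 'Include Google News search') { options[:engines] << :google_news }
    opts.on('-N', '--yahoo-news', 'Include Yahoo News search') { options[:engines] << :yahoo_news }
    opts.on('-m', '--median', 'Calculate median') { options[:summary] = :median }
    opts.on('-M', '--mean', 'Calculate mean') { options[:summary] = :mean }
    opts.on('-p', '--pipe-delim', 'Print pipe-delimited table output') { options[:pipe] = true }
  end
  terms = parser.parse(argv) # non-option arguments are left over as terms
  [options, terms]
end

# Equivalent to running: sentiment_for_symbol.rb -N -M -p drones
options, terms = parse_options(%w[-N -M -p drones])
puts "engines=#{options[:engines].join(',')} summary=#{options[:summary]} terms=#{terms.join(',')}"
```

Short options are case-sensitive, which is what allows -n and -N (or -m and -M) to select different behaviour.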