Saturday, February 15, 2014

Sending email from R using Gmail

As mentioned previously, the sendmailR package generally works well for sending an email from within R. When an ISP blocks traffic on port 25, however, sendmailR cannot be used unless a local mailserver is configured to act as a relay. This means that sendmailR cannot reliably send email on arbitrary machines and across arbitrary network connections.

There is gmailR, of course, but it requires rJython, and using Python from R via Java is just too many levels of indirection for something as simple as sending an email.

Instead, it should be possible to use Curl to send an email by directly connecting to Gmail.

The RCurl package should be able to provide this. According to the SSL SMTP example, an email (subject + body) can be uploaded with code (converted from C) such as the following:

library(RCurl)

rmail.rcurl.read.function <- function(x) return(x)

gmail.rcurl.send <- function( username, password, to.email, subject, body ) {
  email.data <- paste(
    paste('Subject:', subject),
    '', body, '', sep="\r\n")
  
  curl <- getCurlHandle()
  
  curlSetOpt( .opts=list(
    "useragent"='Mozilla 5.0',
    "use.ssl"=3, # AKA CURLUSESSL_ALL 
    "username"=username,
    "password"=password,
    "readdata"=email.data,
    "mail.from"=username,
    "mail.rcpt"=to.email,
    "readfunction"=rmail.rcurl.read.function,
    "upload"=TRUE
  ), curl=curl )

  getURL('smtp://smtp.gmail.com:587', curl=curl, verbose=TRUE)
}

Unfortunately, this crashes R -- it appears to be a bug in RCurl, possibly due to incomplete SMTP support (a call to curlVersion() shows that SMTP and SMTPS are supported protocols, but listCurlOptions() does not include mail.from or mail.rcpt).
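Those two checks can be reproduced directly; this is just a quick sanity-check sketch, and the exact output will vary with the installed libcurl and RCurl versions:

library(RCurl)

# libcurl itself reports SMTP and SMTPS among its supported protocols...
curlVersion()$protocols

# ...but the SMTP-specific options are not exposed by the RCurl bindings
c("mail.from", "mail.rcpt") %in% listCurlOptions()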

Instead, Curl must be called directly via system. Sure, this is ugly, and yes, system should never be used, but it was RCurl that drove us to this. Remember to sanitize your inputs (in this case, the email addresses and the password)!
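A minimal validation sketch (the helper name below is hypothetical, not part of the original code) is to reject anything that does not look like a plain address before it is interpolated into the shell command:

# hypothetical helper: refuse values containing shell metacharacters by
# insisting on a plain local-part@domain form
stopifnot.clean.email <- function(x) {
  if (!grepl("^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+$", x)) {
    stop("refusing to pass suspicious address to the shell: ", x)
  }
  invisible(x)
}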

gmail.curl.send <- function( username, password, to.email, subject, body ) {
  email.data <- paste(
    paste('Subject:', subject),
    '', body, '', sep="\r\n")
  
  # IMPORTANT: username, password, and to.email must be cleaned!
  cmd <- paste('curl -n --ssl-reqd --mail-from "<',
               username,
               '>" --mail-rcpt "<',
               to.email,
               '>" --url ',
               'smtp://smtp.gmail.com:587',
               ' --user "', username, ':', password,
               '" -T -', sep='')
  system(cmd, input=email.data)
}

This works correctly, as one would expect: it's hard to go wrong with a simple shell call to Curl. Note the use of the input parameter in the system call: this writes the email contents to a temporary file that is redirected to Curl's standard input, and the -T - flag tells Curl to upload whatever arrives on stdin to Gmail.
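A hypothetical invocation looks like the following (the addresses and password are placeholders, not values from the original post):

gmail.curl.send("me@gmail.com", "my-app-password",
                "you@example.com", "Test from R",
                "Hello from gmail.curl.send()")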

Friday, February 7, 2014

Disasm Wordcloud

In a recent discussion of possible applications of R to binary analysis, the usual visualizations (byte entropy, size of basic blocks, number of times a function is called during a trace, etc.) came to mind. Past experiments with tm.plugin.webmining, however, also raised the following question: why not use the R text-mining packages to generate a wordcloud from a disassembled binary?

Why not, indeed.

The objdump disassembler can be used to generate a list of terms from a binary file. The template Ruby code for generating a list of terms is a simple wrapper around objdump:

# generate a space-delimited string of terms occurring in target at 'path'
terms = `objdump -DRTgrstx '#{path}'`.lines.inject([]) { |arr, line|
  # ...extract terms from line and append to arr...
  arr
}.join(" ")  

The R code for generating wordclouds has been covered before. The code for the disassembly terms can be simpler, since the terms have already been extracted from the raw text (the disassembly):

library('tm')
library('wordcloud')

# term occurrences must be in variable "terms"
corpus <- Corpus(VectorSource(terms))
tdm <- TermDocumentMatrix(corpus)
vec <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
df <- data.frame(word=names(vec), freq=vec)

# output file path must be in variable "img_path"
png(file=img_path)
# minimum frequency should be higher than 1 if there are many terms
wordcloud(df$word, df$freq, min.freq=1) 
dev.off()
                             
The most interesting terms in a binary are the library functions that are invoked. The following regex will extract the symbol name from call instructions:

terms = `objdump -DRTgrstx '#{path}'`.lines.inject([]) { |arr, line|
  arr << $1 if line =~ /<([_[:alnum:]]+)(@[[:alnum:]]+)?>\s*$/
  arr
}  

When run on /usr/bin/xterm, this generates the following wordcloud:

The other obvious terms in a binary are the instruction mnemonics. The following regex will extract the instruction mnemonics from an objdump disassembly:

terms = `objdump -DRTgrstx '#{path}'`.lines.inject([]) { |arr, line|
  arr << $1 if line =~ /^\s*[[:xdigit:]]+:[[:xdigit:]\s]+\s+([[:alnum:]]+)\s*/
  arr
}  

When run on /usr/bin/xterm, this generates the following wordcloud:       

Of course, there is always the possibility of generating a wordcloud from the ASCII strings in a binary. The following Ruby code is a crude attempt at creating a terms string from the output of the strings command:

terms = `strings '#{path}'`.gsub(/[[:punct:]]/, '').lines.to_a.join(' ')

When run on /usr/bin/xterm, this generates the following wordcloud:     

Not as nice as the others, but some pre-processing of the strings output would clear that up.
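One possible pre-processing pass, sketched in R under the assumption that the space-delimited terms string has already been handed over from Ruby (none of this is from the original post), is to drop short or non-alphabetic strings before building the corpus:

# keep only reasonably long, purely alphabetic strings so that junk
# strings do not dominate the wordcloud
words <- unlist(strsplit(terms, "\\s+"))
words <- words[nchar(words) >= 4 & grepl("^[[:alpha:]]+$", words)]
terms <- paste(words, collapse=" ")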

There is, of course, a github for the code. Note that the implementation is in Ruby, using the rsruby gem to interface with R.

Wednesday, February 5, 2014

Finding stock symbols by industry in R

The quantmod package is fantastic, but it has one shortcoming: there is no facility for retrieving information about a specific industry (e.g., "is the entire industry on a downward trend, or just this company?").

Yahoo Finance provides this information via its CSV API; this means it should be easy to retrieve from within R. Details about the API have been provided by the C# yahoo-finance-managed project.

The first step is to get a list of all possible sectors. This is a straightforward Curl download and CSV parse:

library(RCurl)
get.sectors <- function() {
  url <- 'http://biz.yahoo.com/p/csv/s_conameu.csv'
  csv <- rawToChar(getURLContent(url, binary=TRUE))
  df <- read.csv(textConnection(csv))
  # sector ID is its index in this alphabetical list
  df$ID <- 1:nrow(df)
  return(df)
}

Note the use of textConnection() to parse an in-memory string instead of an on-disk file. The binary=TRUE flag causes Curl to return a "raw" object which is converted to a character vector by the rawToChar() call; this is necessary because the CSV file ends with a NULL byte.

The next step is to fetch a list of the industries in each sector. At first, this seems to be straightforward:

get.sector.industries <- function( sector ) {
  url <- paste('http://biz.yahoo.com/p/csv', 
               paste(as.integer(sector), 'conameu.csv', sep=''), 
               sep='/')
  csv <- rawToChar(getURLContent(url, binary=TRUE))
  df <- read.csv(textConnection(csv))
  
  # fix broken Industry names
  df[,'Industry'] <- gsub(' +', ' ', df[,'Industry'])
  
  # default (incorrect) ID column
  df$ID <- (sector * 100) + 1:nrow(df)
    
  df$Sector <- sector
  return(df)
}

Unfortunately, there is one problem: the industry IDs are not based on the index value. In fact, there does not seem to be a way to obtain the industry IDs using the Yahoo Finance API, which appears to be a pretty egregious oversight.

Yahoo Finance provides an alphabetical list of industries in all sectors; the URL for each industry entry contains its ID. This means that the page can be parsed in order to build a list of industries and their IDs.

The code is a little hairy, involving a couple of XPath queries to extract the URLs and their descriptions:

library(XML)
get.industry.ids <- function() {
  html <- htmlParse('http://biz.yahoo.com/ic/ind_index_alpha.htm')

  # extract description from A tags
  html.names <- as.vector(xpathSApply(html, "//td/a/font", xmlValue))
  # extract URL from A tags
  html.urls <- as.vector(xpathSApply(html, "//td/a/font/../@href"))
  
  if (length(html.names) != length(html.urls)) {
    warning(paste("Got", length(html.names), "names but", 
                  length(html.urls), "URLs"))
  }

  html.names <- gsub("\n", " ", html.names)
  html.urls <- gsub("http://biz.yahoo.com/ic/([0-9]+).html", "\\1", html.urls)
  
  df <- data.frame(Name=character(length(html.urls)), 
                   ID=numeric(length(html.urls)), stringsAsFactors=FALSE)
  for (i in 1:length(html.urls)) {
    url = html.urls[i]
    val = suppressWarnings(as.numeric(url))
    if (! is.na(val) ) {
      df[i,'Name'] = html.names[i]
      df[i,'ID'] = val
    }
  }
  return(df)
}

In this function, htmlParse() was used to download the web page instead of Curl. This is necessary because the webpage contains one or more non-trailing NULL bytes; rawToChar() can only strip trailing NULL bytes. The parser in htmlParse() is able to handle the NULL bytes just fine.

With this function, the IDs of industries can be set as follows:

df <- get.sector.industries( sector.id )
id.df <- get.industry.ids()
for (i in 1:nrow(id.df)) {
  name <- id.df[i, 'Name']
  if (nrow(df[df$Industry == name,]) > 0) {
    df[df$Industry == name, 'ID'] <- id.df[i, 'ID']
  }
}

It is now possible to build a dataframe that contains the industries of all of the sectors:

df.sectors <- get.sectors()
id.df <- get.industry.ids()
df.industries <- NULL
for (sector.id in df.sectors$ID) {
  df <- get.sector.industries(sector.id)
  # replace the default IDs with the ones scraped from the alphabetical index
  for (i in 1:nrow(id.df)) {
    name <- id.df[i, 'Name']
    if (nrow(df[df$Industry == name,]) > 0) {
      df[df$Industry == name, 'ID'] <- id.df[i, 'ID']
    }
  }
  if (is.null(df.industries)) {
    df.industries <- df
  } else {
    df.industries <- rbind(df.industries, df)
  }
}

This list is probably not going to change much, so the dataframe can be stored for reuse in an .RData object.
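For example (a sketch; the file name is arbitrary):

# cache the industry table so the scraping only has to be redone occasionally
save(df.industries, file='industries.RData')

# ...later, in a fresh session...
load('industries.RData')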

The final step, getting the stock symbols for a specific industry, is much more straightforward:

get.industry.symbols <- function(id) {
  url <- paste('http://biz.yahoo.com/p/csv', 
                 paste(as.integer(id), 'conameu.csv', sep=''), 
                 sep='/')
  csv <- rawToChar(getURLContent(url, binary=TRUE))
  df <- read.csv(textConnection(csv))
  return(df)
}
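As a hypothetical usage example (the industry name is only an illustration, and the columns of the returned dataframe are whatever Yahoo puts in the CSV):

row <- df.industries[df.industries$Industry == 'Gold', ]
symbols.df <- get.industry.symbols(row$ID[1])
head(symbols.df)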

As usual, there is a github for the code.

One final note: the sector and industry data is also available via the FinViz API. Yahoo Finance was selected for this project in order to be compatible with the quantmod data.

Sunday, February 2, 2014

Daily stock symbol reports with R

This is a simple R script that uses the quantmod package to look up stock symbols on Yahoo Finance, and the sendmailR package to send an email alert if the latest stock price ("Last" in the Yahoo report) is below a "buy" threshold or above a "sell" threshold.

The input file format is tab-delimited with three columns: Symbol, BuyAt, SellAt. There is no need for a header row. For example:
  AAPL        300     750
  BA           65      90
  ...

The function that does all the work is symbol.report. This reads the input file containing buy and sell thresholds, performs a Yahoo Finance query on all symbols in the file, and generates a dataframe with the details (BuyAt, SellAt, Open, Close, Last, etc) of every symbol whose latest price is either below the buy threshold, or above the sell threshold.

library(quantmod)
symbol.report <- function(filename, header=FALSE, sep = "\t") {
  watch.df <- read.delim(filename, header=header, sep=sep)
  colnames(watch.df) <- c('Symbol', 'BuyAt', 'SellAt')

  quote.df <- getQuote(paste(watch.df$Symbol, collapse=';'))
  quote.df$Symbol <- rownames(quote.df)

  df <- merge(watch.df, quote.df)

  df[(df$Last <= df$BuyAt) | (df$SellAt > 0 & df$Last >= df$SellAt), ]
}  

The symbol.report function is invoked by symbol.alert, which will send an email to the provided address if the dataframe returned by symbol.report is not empty. If an email address is not provided, the dataframe will be printed to STDOUT.

library(sendmailR)
symbol.alert <- function(filename, email=NULL, verbose=FALSE) {
  df <- symbol.report(filename)

  if (nrow(df) == 0) {
    return(df)
  }

  if ( is.null(email) ) {
    print(df)
  } else {
    sendmail(# from: fake email address
             paste('<', "r.script@nospam.org", '>', sep=''),
             # to: provided email address
             paste('<', email, '>', sep=''),
             # subject
             "SYMBOL ALERT",
             # body
             capture.output(print(df, row.names=FALSE)),
             # SMTP server (gmail)
             control=list(smtpServer='ASPMX.L.GOOGLE.COM'),
             verbose=verbose)
  } 
  return(df)
}

A few things to note here:
  * a fake email address is used as the From address, allowing easy filtering of these emails
  * the SMTP server used is the GMail server, which may not be appropriate for some users

This function can be called from a shell script in a cron job, invoking R with the --vanilla option:

  R --vanilla -e "source('/home/me/symbol.alert.R'); symbol.alert('/home/me/monitored_symbols.dat', 'me@gmail.com')"

And again, there is a github for the code.