Using R with Apache Spark

To use R with Spark, we need the SparkR package. Below are the steps to build, install, and use the SparkR package on a Windows system.

Building SparkR in Windows

  • Make sure that R (version 3.1 or later) is installed and that the path to its bin folder is added to the system PATH variable
    e.g. C:\R\R-3.1.3\bin\x64
  • Download Rtools from the link below
    http://cran.r-project.org/bin/windows/Rtools/
  • Run the installer and select the components to install

[Screenshot: Rtools setup, component selection]

  • Finish the installation and add the paths below to the system PATH variable
    C:\Rtools\gcc492_64\bin;C:\Rtools\bin
  • Download and install the JDK. Create a new system variable named JAVA_HOME and set its value to the JDK folder.
    e.g. C:\Java\jdk1.7.0_75
  • Download Maven from the link below.
    http://maven.apache.org/download.cgi
    Extract it and add the path of its bin directory to the system PATH variable.
    e.g. C:\Maven\bin
  • Download the SparkR source from the link below
    http://amplab-extras.github.io/SparkR-pkg/
  • Extract the zip and rename the extracted folder to SparkRSource.
  • Open cmd, change directory to SparkRSource, and run the install-dev batch file.

[Screenshot: building SparkR]
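Before running install-dev, it can help to confirm that the tools from the steps above are actually reachable. An error such as ‘R.exe’ is not recognized as an internal or external command means that R's bin folder is not on PATH. The snippet below is a minimal sketch in base R, not part of SparkR. The tool names (R, gcc, mvn) are assumptions taken from the steps above, and the check for R itself is only indicative, because a running R session may see a different PATH than a freshly opened cmd window.

```r
# Commands the build is assumed to need on PATH (names taken from the steps above).
required_tools <- c("R", "gcc", "mvn")

# Sys.which() returns "" for every command it cannot find on PATH.
tool_paths <- Sys.which(required_tools)
missing_tools <- names(tool_paths)[tool_paths == ""]

# JAVA_HOME should be set and should point at an existing folder.
java_home <- Sys.getenv("JAVA_HOME")
java_home_ok <- nzchar(java_home) && isTRUE(file.info(java_home)$isdir)

if (length(missing_tools) > 0) {
  cat("Not found on PATH:", paste(missing_tools, collapse = ", "), "\n")
} else {
  cat("All required tools were found on PATH\n")
}
cat("JAVA_HOME points to an existing folder:", java_home_ok, "\n")
```

If a tool is reported as missing, add its bin folder to the system PATH variable and open a new cmd window so that the change takes effect.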

  • If the build finishes successfully, you will see messages like the ones below, and the output is stored in C:\SparkRSource\lib\SparkR

[Screenshot: SparkR build successful]

  • To use SparkR, copy the SparkR folder from C:\SparkRSource\lib to C:\R\R-3.1.3\library.
  • You can then load and use the SparkR library as shown below
library(SparkR)
sc <- sparkR.init(master="local","RSparkApp")
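The copy step can also be done from R itself. The following is a rough sketch using only base R. The helper name install_built_package is hypothetical (it is not part of SparkR), and the paths in the commented calls are the example paths from the steps above, so adjust them to your own installation.

```r
# Hypothetical helper: copy a built package folder into an R library folder
# and report whether the package folder is present there afterwards.
install_built_package <- function(pkg_dir, lib_dir) {
  stopifnot(file.exists(pkg_dir), file.exists(lib_dir))
  # With recursive = TRUE the folder itself, contents included, is copied into lib_dir.
  file.copy(pkg_dir, lib_dir, recursive = TRUE)
  basename(pkg_dir) %in% list.files(lib_dir)
}

# Example paths from the steps above (assumed, adjust as needed):
# install_built_package("C:/SparkRSource/lib/SparkR", "C:/R/R-3.1.3/library")

# Alternative that leaves the R installation untouched: load the package
# directly from the build output folder through the lib.loc argument.
# library(SparkR, lib.loc = "C:/SparkRSource/lib")
```

Loading with lib.loc avoids writing into the R installation folder, which may require administrator rights on some systems.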

Running a Spark WordCount program in R

Start RGui and enter the code below.

library(SparkR)       # load the SparkR package
sc <- sparkR.init(master="local","RWordCount")     # start the Spark context
lines <- textFile(sc,"c:/spark-1.3.0/changes.txt")     # read the file as lines
words <- flatMap(lines,function(line){strsplit(line, " ")[[1]]})     # split lines into words
wordCount <- lapply(words, function(word) { list(word, 1L) })     # pair each word with 1
counts <- reduceByKey(wordCount, "+", 2L)     # sum the counts per word, using 2 partitions
output <- collect(counts)     # bring the results back to the R session
for (wordcount in output) {
  cat(wordcount[[1]], ": ", wordcount[[2]], "\n")
}

The output will look like the following.

[Screenshot: Spark word count output in R]
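To see what each stage of the word count computes, the same steps can be mimicked in plain R without Spark. This is only an illustrative sketch: the two input lines are made up, and the loop stands in for what reduceByKey does with "+", it is not how SparkR implements it.

```r
# Made-up sample input standing in for the lines of a text file.
lines <- c("to be or not to be", "to see or not to see")

# flatMap stage: split every line on single spaces and flatten into one vector.
words <- unlist(lapply(lines, function(line) strsplit(line, " ")[[1]]))

# lapply stage: pair each word with a count of 1.
pairs <- lapply(words, function(word) list(word, 1L))

# reduceByKey stage: add up the counts of all pairs that share the same word.
counts <- list()
for (pair in pairs) {
  key <- pair[[1]]
  previous <- if (is.null(counts[[key]])) 0L else counts[[key]]
  counts[[key]] <- previous + pair[[2]]
}

# Print one "word : count" line per distinct word.
for (word in names(counts)) {
  cat(word, ": ", counts[[word]], "\n")
}
```

With this input the word "to" is counted four times and every other word twice. Note that splitting on a single space yields empty strings wherever a line contains consecutive spaces, so a real text file may produce an entry for the empty word.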

One thought on “Using R with Apache Spark”

  1. Hi,
    Below is the error I get when I run install-dev from the command line:

    ‘R.exe’ is not recognized as an internal or external command, operable program or batch file.

    Please let me know the fix.

    SH
