To use R with Spark, we need the SparkR package. Below are the steps to build, install, and use the SparkR package on a Windows system.
Building SparkR in Windows
- Make sure that you have installed R (version >3.1) and that the path to its bin directory is added to the system PATH variable.
e.g. C:\R\R-3.1.3\bin\x64
- Download Rtools from the link below.
http://cran.r-project.org/bin/windows/Rtools/
- Select the components to install.
- Finish the installation and add the paths below to the system PATH variable.
C:\Rtools\gcc492_64\bin;C:\Rtools\bin
- Download and install the JDK. Create a new system variable named JAVA_HOME and set its value to the JDK folder.
e.g. C:\Java\jdk1.7.0_75
- Download and install Maven from the link below.
http://maven.apache.org/download.cgi
Extract it and include the bin directory path in the system PATH variable.
e.g. C:\Maven\bin
- Download the SparkR source from the link below.
http://amplab-extras.github.io/SparkR-pkg/
- Extract the zip and rename the extracted folder to SparkRSource.
- Open cmd and change directory to SparkRSource. Run the install-dev batch file.
- If the build finishes successfully, its output is stored in C:\SparkRSource\lib\SparkR.
- To use SparkR, copy the SparkR folder from C:\SparkRSource\lib to C:\R\R-3.1.3\library.
- You can then load and use the SparkR library as shown below.
library(SparkR)
sc <- sparkR.init(master="local", "RSparkApp")
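If install-dev fails or a command is not recognized, it can help to confirm the PATH and JAVA_HOME settings from the steps above. The snippet below is a rough sketch of such a check from an R session. The tool names (R, gcc, make, mvn) are my assumption about what the build needs on PATH; they are not taken from the SparkR build scripts.

```r
# Tools assumed (not confirmed from the SparkR scripts) to be needed on PATH.
tools <- c("R", "gcc", "make", "mvn")

# Sys.which() returns the full path of each tool, or "" if it is not found.
found <- Sys.which(tools)
for (tool in tools) {
  status <- if (nzchar(found[[tool]])) found[[tool]] else "NOT FOUND on PATH"
  cat(sprintf("%-5s %s\n", tool, status))
}

# JAVA_HOME should point at the JDK folder, e.g. C:\Java\jdk1.7.0_75
java_home <- Sys.getenv("JAVA_HOME")
if (nzchar(java_home) && file.exists(java_home)) {
  cat("JAVA_HOME =", java_home, "\n")
} else {
  cat("JAVA_HOME is unset or does not point to an existing folder\n")
}
```

Any tool reported as not found usually means the matching folder is missing from the system PATH variable. Remember to open a new cmd window or R session after changing PATH so the new value is picked up.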
Running a Spark WordCount program in R
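Start RGui and enter the code below.
library(SparkR) # load the SparkR package
sc <- sparkR.init(master="local", "RWordCount") # start the Spark context
# read the input file as a distributed collection of lines
lines <- textFile(sc, "c:/spark-1.3.0/changes.txt")
# split every line on spaces and flatten the result into single words
words <- flatMap(lines, function(line) { strsplit(line, " ")[[1]] })
# pair each word with an initial count of 1
wordCount <- lapply(words, function(word) { list(word, 1L) })
# add up the counts per word, using 2 partitions
counts <- reduceByKey(wordCount, "+", 2L)
# bring the results back to the local R session and print them
output <- collect(counts)
for (wordcount in output) {
  cat(wordcount[[1]], ": ", wordcount[[2]], "\n")
}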
The output lists each word followed by its count, one word per line.
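To see what each step of the pipeline does without a Spark installation, the same word count can be sketched in plain R. This is only a local illustration using base R functions, not how SparkR implements these operations, and the two sample lines are made up for the example.

```r
# Made-up sample input standing in for the lines of the text file.
lines <- c("spark makes big data simple", "big data with spark and R")

# flatMap step: split each line on spaces and flatten into one vector
words <- unlist(strsplit(lines, " "))

# map step: pair each word with the count 1L
pairs <- lapply(words, function(word) list(word, 1L))

# reduceByKey step: group the counts by word and sum each group
keys <- vapply(pairs, function(p) p[[1]], character(1))
values <- vapply(pairs, function(p) p[[2]], integer(1))
counts <- vapply(split(values, keys), sum, integer(1))

# print in the same "word : count" style as the SparkR loop
for (word in names(counts)) {
  cat(word, ": ", counts[[word]], "\n")
}
```

Here split() plays the role of grouping by key and sum() is the reduce function, which corresponds to passing "+" to reduceByKey in the SparkR version.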
Sanjay
11 Jun 2015
Hi,
Below is the error I get when I run install-dev from the command line:
'R.exe' is not recognized as an internal or external command, operable program or batch file.
Please let me know the fix.
SH