To develop Apache Spark applications in IPython or Python Tools for Visual Studio, we need to set the PYTHONPATH environment variable so that it includes the library paths Spark requires.
Setting PYTHONPATH for Spark
- Open System Properties and, on the Advanced tab, click Environment Variables.
- Create a new system variable and name it PYTHONPATH.
- Add the paths below to the value field, separated by semicolons (here c:\spark-1.3.0 is the directory where Spark is installed):
c:\spark-1.3.0\bin
c:\spark-1.3.0\python
c:\spark-1.3.0\python\lib\py4j-0.8.2.1-src.zip
- Create another system variable and name it SPARK_HOME. Set its value to the Spark installation directory:
c:\spark-1.3.0
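Once the variables are saved, a quick sanity check from a newly started Python session can confirm they look right. The following is only a minimal sketch: the helper name `missing_spark_entries` is hypothetical (it is not part of Spark), and it only inspects the variable text; it does not verify that the paths exist on disk.

```python
import os


def missing_spark_entries(environ, sep=os.pathsep):
    """Return a list of problems found in SPARK_HOME / PYTHONPATH (empty if none).

    `environ` is any mapping of variable names to values, e.g. os.environ.
    `sep` is the PYTHONPATH separator (a semicolon on Windows).
    """
    problems = []
    if not environ.get("SPARK_HOME", "").strip():
        problems.append("SPARK_HOME is not set")

    # Split PYTHONPATH, then normalise case, slashes and trailing separators.
    raw = environ.get("PYTHONPATH", "").split(sep)
    entries = [e.strip().lower().replace("/", "\\").rstrip("\\") for e in raw if e.strip()]

    # The Spark "python" directory is what makes `import pyspark` work.
    if not any(e.endswith("\\python") for e in entries):
        problems.append("PYTHONPATH has no <spark>\\python entry")
    # The py4j source zip is needed by pyspark to talk to the JVM.
    if not any("py4j" in e and e.endswith(".zip") for e in entries):
        problems.append("PYTHONPATH has no py4j source zip entry")
    return problems


if __name__ == "__main__":
    for problem in missing_spark_entries(os.environ):
        print(problem)
```

If nothing is printed, both variables contain the expected kinds of entries. Remember that a shell or IDE opened before the variables were changed will still see the old values.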
Standalone Spark program in IPython
Run the IPython shell or IPython Notebook and type the code below. It counts the words in a file and runs as a standalone Spark program.
[code language="python"]
from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="PythonWordCount")
lines = sc.textFile("c:/spark-1.3.0/CHANGES.txt")  # path to a text file in the local file system
# Split each line into words, pair every word with 1, then sum the 1s per word
counts = lines.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
output = counts.collect()
for (word, count) in output:
    print("%s: %i" % (word, count))
sc.stop()
[/code]
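To see what each stage of the word count does, here is a plain-Python sketch of the same flatMap / map / reduceByKey steps on an in-memory list. It is only an illustration: it does not use Spark, and the function name `word_count` is made up for this example.

```python
from operator import add


def word_count(lines):
    """Mimic flatMap -> map -> reduceByKey(add) on a plain list of strings."""
    # flatMap: split every line on a single space and flatten into one list
    words = [word for line in lines for word in line.split(' ')]
    # map: pair each word with the count 1
    pairs = [(word, 1) for word in words]
    # reduceByKey(add): combine the counts of identical words
    counts = {}
    for word, n in pairs:
        counts[word] = add(counts[word], n) if word in counts else n
    return counts


if __name__ == "__main__":
    for word, count in word_count(["a b a", "b c"]).items():
        print("%s: %i" % (word, count))
```

Note that splitting on a single space, as the Spark program does, produces empty strings wherever a line contains consecutive spaces, and those empty strings get counted like any other word.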
Standalone Spark program in Visual Studio
To develop Python programs in Visual Studio, you need to install Python Tools for Visual Studio. It can be downloaded from:
https://pytools.codeplex.com
Follow the steps below to run a standalone Spark program in Visual Studio:
- Create a new Python Application in Visual Studio.
- In Solution Explorer, right-click Search Path and choose Add PYTHONPATH to Search Path.
- Type the code given above and run it. If you are using Python Interactive, you need to reset it first.
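If an interpreter that is already running cannot be restarted conveniently, the Spark paths can also be added to `sys.path` at runtime before importing `pyspark`. This is a minimal sketch, not something the Spark distribution provides: `add_spark_to_sys_path` is a hypothetical helper, and it assumes the directory layout described in the PYTHONPATH section (a `python` folder with a py4j source zip under `python\lib`).

```python
import glob
import os
import sys


def add_spark_to_sys_path(spark_home):
    """Put Spark's Python sources on sys.path; return the paths newly added."""
    python_dir = os.path.join(spark_home, "python")
    # The py4j version in the file name differs between Spark releases,
    # so look it up with a pattern instead of hard-coding it.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))

    added = []
    for path in [python_dir] + py4j_zips:
        if path not in sys.path:
            sys.path.insert(0, path)
            added.append(path)
    return added
```

A typical call would be `add_spark_to_sys_path(os.environ["SPARK_HOME"])`, placed before the `from pyspark import SparkContext` line.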