Using IPython and Visual Studio with Apache Spark

To develop Apache Spark applications in IPython and in Python Tools for Visual Studio, set the PYTHONPATH environment variable so that it includes the library paths Spark requires.

Setting PYTHONPATH for Spark

  • Go to System Properties and, on the Advanced tab, click Environment Variables.
  • Create a new system variable named PYTHONPATH.
  • Add the paths below to the value field, separated by semicolons (here c:\spark-1.3.0 is the directory where Spark is installed):
    c:\spark-1.3.0\bin
    c:\spark-1.3.0\python
    c:\spark-1.3.0\python\lib\py4j-0.8.2.1-src.zip
  • Create another system variable named SPARK_HOME and set its value to the Spark installation directory:
    c:\spark-1.3.0
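
As a quick sanity check of the two variables described above, the following sketch builds the three expected PYTHONPATH entries from a Spark directory and reports any that are missing. The helper names spark_python_paths and missing_from_pythonpath are hypothetical, the py4j file name is the one that ships with Spark 1.3.0 and will differ for other versions, and the comparison assumes Windows-style, case-insensitive paths.

```python
import ntpath  # Windows path rules, usable on any platform
import os


def spark_python_paths(spark_home, py4j_zip="py4j-0.8.2.1-src.zip"):
    """Return the three PYTHONPATH entries listed above for a Spark directory."""
    return [
        ntpath.join(spark_home, "bin"),
        ntpath.join(spark_home, "python"),
        ntpath.join(spark_home, "python", "lib", py4j_zip),
    ]


def missing_from_pythonpath(spark_home, pythonpath):
    """List the expected entries that are absent from a PYTHONPATH string."""
    # Entries are separated by semicolons; compare case-insensitively.
    present = set(p.strip().lower() for p in pythonpath.split(";") if p.strip())
    return [p for p in spark_python_paths(spark_home) if p.lower() not in present]


if __name__ == "__main__":
    # Fall back to the example directory used in this post (an assumption).
    home = os.environ.get("SPARK_HOME", r"c:\spark-1.3.0")
    print(missing_from_pythonpath(home, os.environ.get("PYTHONPATH", "")))
```

An empty list means all three entries are present. Open a new command prompt or restart IPython after changing system variables, since already running processes keep the old values.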

Standalone Spark program in IPython

Run the IPython shell or IPython Notebook and type the code below. It counts the words in a file using Spark in standalone mode.
[code language="python"]
from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="PythonWordCount")

# Path to a text file in the local file system
lines = sc.textFile("c:/spark-1.3.0/CHANGES.txt")

# Split each line into words, pair each word with 1, then sum the counts per word
counts = lines.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)

output = counts.collect()

for (word, count) in output:
    print("%s: %i" % (word, count))

sc.stop()
[/code]
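
To see what the flatMap, map and reduceByKey steps compute without starting Spark, here is a plain-Python sketch of the same pipeline applied to an in-memory list of lines. It is not Spark code: the word_counts function and the sample data are made up for illustration, and everything runs in a single process.

```python
from collections import defaultdict
from operator import add


def word_counts(lines):
    # flatMap: split every line on single spaces and flatten into one list
    words = [w for line in lines for w in line.split(' ')]
    # map: pair each word with the count 1
    pairs = [(w, 1) for w in words]
    # reduceByKey(add): combine the counts of identical words
    totals = defaultdict(int)
    for word, n in pairs:
        totals[word] = add(totals[word], n)
    return dict(totals)


sample = ["spark is fast", "spark is fun"]
for word, count in sorted(word_counts(sample).items()):
    print("%s: %i" % (word, count))
```

Note that split(' ') produces an empty string between two consecutive spaces, so the empty string can show up as a "word" in the output, both here and in the Spark version.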

Standalone Spark program in Visual Studio

To develop Python programs in Visual Studio, you need to install Python Tools for Visual Studio, which can be downloaded from:
https://pytools.codeplex.com

Follow the steps below to run a standalone Spark program in Visual Studio:

  • Create a new Python Application project in Visual Studio.
  • In Solution Explorer, right-click Search Path and choose Add PYTHONPATH to Search Path.

[Figure: Add PYTHONPATH to Search Path in Visual Studio]

  • Type the code given above and run it. If you are using the Python Interactive window, reset it first so that it picks up the updated search path.

[Figure: Python Spark word count program in Visual Studio]
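
If the program fails with an import error, a small check like the one below can help confirm whether the search path took effect. This is only a sketch: can_import is a hypothetical helper, and it assumes the modules to look for are pyspark and py4j, which the PYTHONPATH entries above are meant to provide.

```python
def can_import(module_name):
    """Return True if module_name can be imported with the current search path."""
    try:
        __import__(module_name)
        return True
    except ImportError:
        return False


for name in ("pyspark", "py4j"):
    print("%s importable: %s" % (name, can_import(name)))
```

If either line reports False in Python Interactive, reset the window and run the check again before looking further.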
