Data Analysis Using Zeppelin on Windows

Apache Zeppelin is a web-based, multipurpose notebook. It supports interactive data analysis as well as data ingestion, data discovery, data visualization, and collaboration. In this post I explore some basic data analysis using Zeppelin and Spark.

To run Apache Spark and Zeppelin on Windows, you first need to download and install the Sparklet on your Windows system.

Here I am using the sales data (SampleData.csv) for my Data Analysis, which was also used in my previous Data Visualization blog post.

Below are the steps I followed, each with its code sample.
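
The load step below builds its DataFrame from a SampleData case class that is not shown in this post. Here is a minimal sketch of what it could look like, assuming seven String fields. Only OrderDate, Region, Item and Total are column names that appear in this post; Rep, Units, UnitCost, the column order, and the helper names parseLine and toSampleData are placeholders for illustration.

```scala
// Hypothetical record type for one row of SampleData.csv.
// Only OrderDate, Region, Item and Total are named in the post;
// Rep, Units, UnitCost and the column order are assumptions.
// All fields are String so the seven-argument constructor call
// in the load step compiles unchanged.
case class SampleData(
  OrderDate: String,
  Region: String,
  Rep: String,
  Item: String,
  Units: String,
  UnitCost: String,
  Total: String
)

// Split one CSV line into trimmed fields, the same way the load step does.
def parseLine(line: String): Array[String] =
  line.split(",").map(_.trim)

// Build a SampleData from an already-split row of seven fields.
def toSampleData(p: Array[String]): SampleData =
  SampleData(p(0), p(1), p(2), p(3), p(4), p(5), p(6))
```

In a Zeppelin note, the case class would need to be defined in a paragraph that runs before the one that calls toDF().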

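  1. Load Data File

    // Read the CSV file as plain text, one line per record
    val csv = sc.textFile("C:/data/SampleData.csv")
    // Split each line on commas and trim whitespace around every field
    val headerAndRows = csv.map(line => line.split(",").map(_.trim))
    // The first line holds the column names
    val header = headerAndRows.first
    // Drop the header: skip rows whose first field equals the first header field
    val data = headerAndRows.filter(_(0) != header(0))
    // Map each row to the SampleData case class (seven fields, which must be
    // defined before this code runs), then convert the result to a DataFrame
    val sampleData = data.map(p =>
      SampleData(p(0), p(1), p(2), p(3), p(4), p(5), p(6))
    ).toDF()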
  2. Show the content of the DataFrame
    sampleData.show()
  3. Count the number of orders
    sampleData.count
  4. Select only one column, e.g. the “Item” column
    sampleData.select("Item").show()
  5. Access a column, e.g. the “OrderDate” column
    sampleData("OrderDate")
    // select a single column
    sampleData.select(sampleData("OrderDate")).show()
    // select multiple columns, e.g. the “OrderDate” and “Total” columns
    sampleData.select(sampleData("OrderDate"), sampleData("Total")).show()
  6. Round a column value, e.g. the “Total” column
    sampleData.select(sampleData("OrderDate"), round(sampleData("Total"))).show()
  7. Filter on a column value, e.g. “Total” greater than 1000
    sampleData.filter(sampleData("Total") > 1000).show()
  8. Count the number of orders by region, using the “Region” column
    sampleData.groupBy("Region").count().show()

  9. Register the DataFrame as a table
    sampleData.registerTempTable("sales")
  10. Data visualization techniques
    %sql
    SELECT * FROM sales

    sales bar chart


    %sql
    SELECT Region, Item, Total FROM sales

    total item sales on region


    %sql
    SELECT Region, round(sum(Total)) AS RegionalTotal FROM sales
    GROUP BY Region

    total item sales by region


    %sql
    SELECT Region, Item, round(sum(Total)) AS RegionalTotal FROM sales
    GROUP BY Region, Item
    ORDER BY Region

    total item order by region
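
To make explicit what the last GROUP BY query computes, here is a plain Scala sketch of the same aggregation on an in-memory list, with no Spark required. The Sale type and the regionalTotals name are hypothetical and keep only the three columns the query uses. The sketch also sorts by item within a region, which the SQL query does not specify, purely to make the output order deterministic.

```scala
// Hypothetical, simplified row: only the columns the query touches.
final case class Sale(region: String, item: String, total: Double)

// Mirrors: SELECT Region, Item, round(sum(Total)) AS RegionalTotal
//          FROM sales GROUP BY Region, Item ORDER BY Region
def regionalTotals(sales: Seq[Sale]): Seq[(String, String, Long)] =
  sales
    // GROUP BY Region, Item
    .groupBy(s => (s.region, s.item))
    // round(sum(Total)) for each group
    .map { case ((region, item), rows) =>
      (region, item, math.round(rows.map(_.total).sum))
    }
    .toSeq
    // ORDER BY Region (item added as a tie-breaker, an assumption)
    .sortBy { case (region, item, _) => (region, item) }
```

Each tuple in the result corresponds to one row of the query output: region, item, and the rounded total for that pair.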

I will keep exploring more analysis with Zeppelin and Spark in a Windows environment. Stay tuned!

This Post Has 2 Comments

  1. Thank you for sharing the article with us. Please keep on updating with more useful information…

  2. Hello.

    Where can we download the latest version Zeppelin 0.6.1 and Spark 2.0 for Windows?
