Data Analysis Using Zeppelin on Windows


Apache Zeppelin is a web-based, multipurpose notebook. It supports interactive data analysis along with data ingestion, data discovery, data visualization, and collaboration. In this post I explore some basic data analysis using Zeppelin and Spark.

To run Apache Spark and Zeppelin on Windows, you need to download and install the Sparklet on your Windows system.

Here I am using the sales data (SampleData.csv) for my Data Analysis, which was also used in my previous Data Visualization blog post.

Below are the steps I followed, with sample code for each.

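  1. Load Data File
    // Read the CSV file as an RDD of lines
    val csv = sc.textFile("C:/data/SampleData.csv")
    // Split each line on commas and trim whitespace around every field
    val headerAndRows = csv.map(line => line.split(",").map(_.trim))
    // The first row holds the column names
    val header = headerAndRows.first
    // Drop the header row, keeping only the data rows
    val data = headerAndRows.filter(_(0) != header(0))
    // SampleData is a case class with one field per CSV column (seven in all);
    // its definition is not shown in this post
    val sampleData = data.map(p => SampleData(
      p(0), p(1), p(2), p(3), p(4), p(5), p(6)
    )).toDF()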
  2. Show the content of the DataFrame
    sampleData.show()
  3. Count the number of orders
    sampleData.count
  4. Select only one column, e.g. the “Item” column
    sampleData.select("Item").show()
  5. Access a column, e.g. the “OrderDate” column
    sampleData("OrderDate")
    // select that column
    sampleData.select(sampleData("OrderDate")).show()
    // select multiple columns, e.g. the “OrderDate” and “Total” columns
    sampleData.select(sampleData("OrderDate"), sampleData("Total")).show()
  6. Round the values of a column, e.g. the “Total” column
    // round comes from org.apache.spark.sql.functions; import it if it is not already in scope
    sampleData.select(sampleData("OrderDate"), round(sampleData("Total"))).show()
  7. Filter on a column value, e.g. “Total” greater than 1000
    sampleData.filter(sampleData("Total") > 1000).show()
  8. Count the number of orders by region, using the “Region” column
    sampleData.groupBy("Region").count().show()
  9. Register the DataFrame as a table
    // registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView replaces it there
    sampleData.registerTempTable("sales")
  10. Data Visualization technique
    %sql
    SELECT * FROM sales

    [Figure: sales bar chart]

    %sql
    SELECT Region, Item, Total FROM sales

    [Figure: total item sales on region]

    %sql
    SELECT Region, round(sum(Total)) AS RegionalTotal FROM sales
    GROUP BY Region

    [Figure: total item sales by region]

    %sql
    SELECT Region, Item, round(sum(Total)) AS RegionalTotal FROM sales
    GROUP BY Region, Item
    ORDER BY Region

    [Figure: total item order by region]
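
The load step relies on a SampleData case class that this post does not show. The following is a minimal sketch of what such a class and a typed row conversion might look like, written with plain Scala collections so it can be tried without Spark. Only OrderDate, Region, Item and Total are named in the post; the field names Rep, Units and UnitCost, the column order, the helper names splitLine and toSampleData, and the sample rows are all assumptions made for illustration.

```scala
// Hypothetical schema: Rep, Units and UnitCost are assumed names, and the
// column order is assumed as well. Only OrderDate, Region, Item and Total
// appear in the post.
case class SampleData(
  OrderDate: String,
  Region: String,
  Rep: String,
  Item: String,
  Units: Int,
  UnitCost: Double,
  Total: Double
)

// Split one CSV line into trimmed fields, as in the load step.
// This does not handle quoted fields that contain commas.
def splitLine(line: String): Array[String] = line.split(",").map(_.trim)

// Convert one parsed row into a typed record (numeric columns are converted).
def toSampleData(p: Array[String]): SampleData =
  SampleData(p(0), p(1), p(2), p(3), p(4).toInt, p(5).toDouble, p(6).toDouble)

// Invented rows for illustration only; they are not taken from SampleData.csv.
val lines = Seq(
  "OrderDate,Region,Rep,Item,Units,UnitCost,Total",
  "2016-01-05, East, Rep A, Pencil, 10, 2.50, 25.00",
  "2016-01-12, West, Rep B, Desk, 4, 300.00, 1200.00",
  "2016-02-03, East, Rep C, Binder, 20, 5.00, 100.00"
)

// Same header handling as the load step: keep rows whose first field
// differs from the header's first field.
val rows = lines.map(splitLine)
val header = rows.head
val records = rows.filter(_(0) != header(0)).map(toSampleData)

// Plain-collection counterpart of groupBy("Region").count()
val ordersByRegion: Map[String, Int] =
  records.groupBy(_.Region).map { case (region, rs) => region -> rs.size }

// Plain-collection counterpart of
// SELECT Region, round(sum(Total)) ... GROUP BY Region
val totalByRegion: Map[String, Long] =
  records.groupBy(_.Region).map { case (region, rs) =>
    region -> math.round(rs.map(_.Total).sum)
  }
```

In a Zeppelin paragraph, a class along these lines would be defined before the load step, and the row mapping would convert the numeric fields in the same way before calling toDF(), so that Total becomes a numeric column rather than a string.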

I will keep exploring more analysis with Zeppelin and Spark in a Windows environment. Stay tuned!
