Category Archives: Apache Spark

Data Analysis Using Zeppelin on Windows

We know that Apache Zeppelin is a web-based multipurpose notebook. It provides interactive data analysis and much more, including data ingestion, data discovery, data visualization, and collaboration. In this post I will explore some basic data analysis using Zeppelin and Spark.

To run Apache Spark and Zeppelin on a Windows system, you need to download and install Sparklet.
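Once Sparklet is set up, a single Zeppelin paragraph bound to the Spark/Scala interpreter is enough for a first look at some data. The sketch below is only illustrative: the CSV path and its age,balance layout are hypothetical, and it relies on the sc and sqlContext variables that Zeppelin injects into its Spark interpreter.

    %spark
    // Hypothetical CSV of accounts: age,balance (no header row)
    case class Account(age: Int, balance: Double)
    val accounts = sc.textFile("C:/data/accounts.csv")
      .map(_.split(","))
      .map(f => Account(f(0).toInt, f(1).toDouble))
    import sqlContext.implicits._
    accounts.toDF().registerTempTable("accounts")
    // Average balance per age group
    sqlContext.sql("SELECT age, AVG(balance) AS avg_balance FROM accounts GROUP BY age").show()

Registering the temp table also makes it queryable from a %sql paragraph, where Zeppelin renders the result as a table or chart.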

Continue reading

Sparklet – Apache Spark and Zeppelin installer for Windows

Needless to say, Apache Spark is becoming the de facto platform for big data analytics. At the same time, a notebook revolution is under way. Data scientists and others who use notebooks simply love them: a notebook provides a browser-based interactive environment to write and execute code, view output, make plots, and more. IPython Notebook is no doubt leading this revolution, but it only supports Python code.

Apache Zeppelin is a new entrant to the league. It enables interactive data analytics: one can create beautiful data-driven, interactive, and collaborative documents with SQL, Scala, and more. Zeppelin is built around the concept of an interpreter that can be bound to any language or data processing backend. Basically, Zeppelin is a web-based notebook server. Its backend already supports quite a few interpreters, such as Spark, Scala, Python, Hive, and Markdown, with many more to come. That means from a single notebook you can work with different big data platforms and build your analytics solution. Zeppelin aims to cater to all your needs: data ingestion, data discovery, data analytics, data visualization, and collaboration. It comes with Spark/Scala as its default interpreter.
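To make the interpreter concept concrete, here is a hedged sketch of a paragraph in a note: the directive on the paragraph's first line selects the backend, and all names and values below are made up for illustration.

    %spark
    // Bound to the Spark/Scala interpreter; Zeppelin provides sc and sqlContext
    import sqlContext.implicits._
    val squares = sc.parallelize(1 to 100).map(n => (n, n * n)).toDF("n", "square")
    squares.registerTempTable("squares")
    // A later paragraph can start with %sql, %md, %hive, etc. to switch
    // backends, e.g. a %sql paragraph running: SELECT * FROM squares WHERE n < 10

This is what lets a single notebook mix, say, Scala for data preparation with SQL for exploration against the same tables.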

Continue reading

Apache Spark – Big Data Platform for All

Apache Spark is a powerful open source in-memory cluster computing framework built around speed, ease of use, and sophisticated analytics. It runs everywhere: on Hadoop (YARN), on Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, S3, and more. Spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for building scalable, fault-tolerant streaming applications. These can all be combined seamlessly in a single application.
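As a quick illustration of that seamless combination, here is a minimal standalone Scala sketch that mixes the core RDD API with Spark SQL in one program. It is written against the Spark 1.x API of the time (SparkContext plus SQLContext), and the input path is hypothetical.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("WordCount").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Core RDD API: the classic word count
        val counts = sc.textFile("C:/data/readme.txt")   // hypothetical path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        // Hand the result to Spark SQL without leaving the application
        counts.toDF("word", "cnt").registerTempTable("words")
        sqlContext.sql("SELECT word, cnt FROM words ORDER BY cnt DESC LIMIT 10").show()

        sc.stop()
      }
    }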

Continue reading

Using R with Apache Spark

To use R with Spark we need the SparkR package. Below are the steps needed to build, install, and use SparkR on a Windows system.

Building SparkR in Windows

  • Make sure that you have installed R (version > 3.1) and that the path to its bin directory is added to the system PATH variable,
    e.g. C:\R\R-3.1.3\bin\x64
  • Download Rtools from the link below:
    http://cran.r-project.org/bin/windows/Rtools/
  • Select the components to install

Continue reading