Category Archives: Hadoop

Exploring Azure Data Lake Store Preview

Azure Data Lake Store is a cloud repository where you can easily store data of any size or type. It is the Hadoop Distributed File System for the cloud, available on demand. Data stored in Data Lake Store is easily accessible to Azure Data Lake Analytics and Azure HDInsight. It will also be possible to integrate it with other Hadoop distributions and projects such as Hortonworks, Cloudera, Spark, Storm, and Flume.

Below are the steps to create an Azure Data Lake Store and manage it using the Azure Portal and Azure CLI.


Enterprise Data Lake – Azure Way!

What is an Enterprise Data Lake?

Way back in 2010, Pentaho co-founder and CTO James Dixon coined the term ‘Data Lake’. While there are many interpretations of the term these days, it usually means a repository that holds a vast amount of raw data in its native format until it is needed. Raw data is stored at its most granular level so that any ad-hoc analysis can be performed at any time.


File operations in HDFS using Java

I am using HDP for Windows (1.3.0.0) on a single node, with Eclipse as the development environment. Below are a few samples to read from and write to HDFS.

  • Create a new Java Project in Eclipse.
  • In Java Settings, go to Libraries and choose Add External JARs. Browse to the Hadoop installation folder and add the JAR file below.
    hadoop-core.jar
  • Go into the lib folder and add the JAR files below.
    commons-configuration-1.6.jar
    commons-lang-2.4.jar
    commons-logging-api-1.0.4.jar
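
Once the JARs are on the build path, a quick sanity check can confirm that the project actually sees them before any HDFS code is written. The sketch below is not part of the original setup: the class name ClasspathCheck and its helper methods are hypothetical, and the one-class-per-JAR mapping is my assumption about what each archive contains. It only tries to load one well-known class from each library and reports any that cannot be found.

```java
import java.util.ArrayList;
import java.util.List;

/** Sanity check: can one representative class from each added JAR be loaded? */
final class ClasspathCheck {

    // Assumed mapping of one representative class per JAR (not stated in the post).
    static final String[] REQUIRED = {
        "org.apache.hadoop.fs.FileSystem",                // Hadoop core JAR
        "org.apache.commons.configuration.Configuration", // Commons Configuration JAR
        "org.apache.commons.lang.StringUtils",            // Commons Lang JAR
        "org.apache.commons.logging.Log"                  // Commons Logging API JAR
    };

    /** Returns true if the named class can be loaded without initializing it. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            // The class file exists but something it depends on is missing.
            return false;
        }
    }

    /** Returns the subset of the given class names that could not be loaded. */
    static List<String> missing(String... classNames) {
        List<String> notFound = new ArrayList<String>();
        for (String name : classNames) {
            if (!isOnClasspath(name)) {
                notFound.add(name);
            }
        }
        return notFound;
    }

    public static void main(String[] args) {
        List<String> notFound = missing(REQUIRED);
        if (notFound.isEmpty()) {
            System.out.println("All required classes were found on the classpath.");
        } else {
            System.out.println("Missing from the classpath: " + notFound);
        }
    }
}
```

Running this class from Eclipse (Run As, Java Application) should list any library that still needs to be added. If a class is reported missing, the corresponding JAR is most likely absent from the Libraries tab.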


Quick notes on YARN (Hadoop 2.0)

Problems we had before YARN:

  • JobTracker is solely responsible for handling resources and tracking task progress.
  • Scalability limitation: maximum cluster size is 4,000 nodes.
  • Maximum number of concurrent tasks is 40,000.
  • Single point of failure: a JobTracker failure kills all queued and running jobs, and users need to resubmit them.
  • Restarting is complex.
  • Low resource utilization because there is no flexibility in sharing and allocating cluster resources.
  • Supports only MapReduce. Other applications, such as iterative ones, run much slower when implemented on top of MapReduce.
