Quick notes on YARN (Hadoop 2.0)

Tweet about this on TwitterShare on LinkedInShare on Google+Share on Facebook

Problems we had before YARN:

  • JobTracker is solely responsible for handling resources and tasks progress.
  • Scalability Limitation: Maximum cluster size is 4000
  • Maximum concurrent task is 40,000
  • On failure in one job execution: Kills the complete job queue. User needs to resubmit all the jobs.
  • Restarting is complex.
  • Low resource utilization because no flexibility in sharing and allocation of cluster resources.
  • Supports only map reduce. Other iterative application implemented using map reduce is very slower.

Job Tracker

 

The above picture shows the communication between Job Tracker and Task Trackers. Job Tracker solely responsible for creating Job queue and Task queue and monitors the execution of Jobs/Tasks execution. Task Trackers run the Map and Reduce task and report to the Job Tracker. Hence performance bottleneck happens when cluster size increases more than 4000.

What is YARN?

YARN – Yet Another Resource Negotiator
YARN also known as MapReduce 2.0 and a framework to develop and execute distributed processing application like MapReduce, Spark etc. It is considered to be the operating system for Hadoop (2.0).

  • Job Tracker is divided into Global Resource Manager and per-application Application Master.
  • Scales up to 10,000 nodes.
  • The Application Manager of Resource Manager provides the service for restarting the Application Master container on failure.
  • The Scheduler of Resource Manager is responsible for sharing the cluster resources adding flexibility to proper utilization of resources.
  • Supports frameworks like MapReduce, Graph Processing, Tez and provides platforms develop new ones.

The picture below shows the YARN architecture:

YARN Architecture

When a job is submitted to the ResourceManager, the ApplicationManager selects a Node Manger with available resources and requests a container for Application Master. Node Manager allocates container for the Application Master. Application Master executes and coordinates the tasks.
Each request has its own Application Master and the Application Master handles the containers. The picture below describes how YARN fits in Hadoop 2.0 ecosystem that is different from Hadoop 1.0.

Hadoop 2.0 Ecosystem

Quick References:

  • http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/YARN.html
  • http://hortonworks.com/labs/yarn/
  • http://hortonworks.com/blog/philosophy-behind-yarn-resource-management/

Leave a Reply

Your email address will not be published. Required fields are marked *


six − = 2