Friday, November 14, 2014

OpenStack Series: Part 14 – Sahara – Data Processing Service

Sahara aims to provide a service for provisioning data-intensive application clusters on OpenStack infrastructure.

The OpenStack documentation uses the term "data intensive application cluster". I think most people would say "big data processing" or "data analytics" instead, where a huge amount of computing power is required to process the raw data collected.

The common impression of Sahara is that it is about running Hadoop on OpenStack. At this time Hadoop is the main application cluster that Sahara supports. Spark is also being worked on, and there is a Spark plugin in Sahara. I think which Big Data applications get supported will depend on which ones are popular in the enterprise environments where OpenStack is used.

Hadoop 2.0 is, like OpenStack, a set of services, in this case for data processing. It has two pillars:
  • YARN - Yet Another Resource Negotiator
  • HDFS - Hadoop Distributed File System

Another view of Hadoop:


This one shows the Hadoop ecosystem with MapReduce, PIG, HIVE, HBASE ... etc.

Sahara Architecture

The OpenStack documentation describes the various components of Sahara as:
  • Auth component - responsible for client authentication & authorization, communicates with Keystone
  • DAL - Data Access Layer, persists internal models in DB
  • Provisioning Engine - component responsible for communication with Nova, Heat, Cinder and Glance
  • Vendor Plugins - pluggable mechanism responsible for configuring and launching Hadoop on provisioned VMs; existing management solutions like Apache Ambari and Cloudera Management Console could be utilized for that matter
  • EDP - Elastic Data Processing (EDP) responsible for scheduling and managing Hadoop jobs on clusters provisioned by Sahara
  • REST API - exposes Sahara functionality via REST
  • Python Sahara Client - similar to other OpenStack components Sahara has its own python client
  • Sahara pages - GUI for the Sahara is located on Horizon

Sahara Use Cases
Sahara supports two key use cases:
  • on-demand cluster provisioning
  • on-demand Hadoop task execution (Elastic Data Processing)
Cluster Provisioning
  • A cluster consists of node groups
  • Three types of node groups: Master, Core Workers and Workers
  • Two kinds of templates: node group templates and cluster templates
  • Users can override template parameters via the API
  • Templates are specific to a Hadoop distribution, because each distribution uses different parameters
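
To make the relationship between the two kinds of templates concrete, here is a minimal sketch in Python. It is not Sahara's actual data model: the class names, the fields, the "vanilla" plugin label and the flavor names are all assumptions chosen for illustration. It only shows the idea that a cluster template refers to node group templates by name and that a user can override values at launch time without changing the template.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class NodeGroupTemplate:
    """Hypothetical stand-in for a node group template."""
    name: str
    flavor: str                  # instance size (illustrative flavor name)
    processes: List[str]         # Hadoop processes to run on each node
    configs: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class ClusterTemplate:
    """Hypothetical stand-in for a cluster template."""
    name: str
    plugin: str                  # Hadoop distribution the template targets
    node_groups: Dict[str, int]  # node group template name -> instance count


def effective_counts(template, overrides=None):
    """Return node group counts after applying per-launch overrides."""
    counts = dict(template.node_groups)  # copy, so the template is untouched
    counts.update(overrides or {})
    return counts


master = NodeGroupTemplate("master", "m1.large", ["namenode", "resourcemanager"])
worker = NodeGroupTemplate("worker", "m1.medium", ["datanode", "nodemanager"])
template = ClusterTemplate("small-hadoop", "vanilla", {"master": 1, "worker": 3})

# Scale the worker group up for this launch only.
effective = effective_counts(template, overrides={"worker": 10})
print(effective)
```

The template keeps its default of three workers, so the next launch without overrides would again use three.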
Provisioning Plugins
  • Responsible for provisioning a Hadoop cluster
  • One plugin per Hadoop distribution (for example Apache Hadoop, Hortonworks)
  • A list of available plugins can be found here
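
The "one plugin per distribution" idea can be sketched as a small registry keyed by plugin name. This is an illustration only: `ProvisioningPlugin`, `configure_cluster`, `get_plugin` and the plugin names below are hypothetical and are not Sahara's real plugin interface.

```python
from abc import ABC, abstractmethod


class ProvisioningPlugin(ABC):
    """Hypothetical plugin interface; not Sahara's real base class."""

    name = ""

    @abstractmethod
    def configure_cluster(self, cluster_name):
        """Configure Hadoop on already provisioned VMs."""


class VanillaPlugin(ProvisioningPlugin):
    name = "vanilla"  # illustrative label for plain Apache Hadoop

    def configure_cluster(self, cluster_name):
        return f"{cluster_name}: configured with Apache Hadoop"


class HortonworksPlugin(ProvisioningPlugin):
    name = "hdp"  # illustrative label for the Hortonworks distribution

    def configure_cluster(self, cluster_name):
        return f"{cluster_name}: configured with Hortonworks"


# The registry maps a plugin name to the object that knows that distribution.
PLUGINS = {p.name: p for p in (VanillaPlugin(), HortonworksPlugin())}


def get_plugin(name):
    """Look up a plugin by name, failing clearly for unknown distributions."""
    try:
        return PLUGINS[name]
    except KeyError:
        raise ValueError(
            f"unknown plugin {name!r}; available: {sorted(PLUGINS)}"
        ) from None


print(get_plugin("vanilla").configure_cluster("demo"))
```

Keeping distribution-specific logic behind one interface is what lets the rest of the service stay the same when a new distribution is added.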
Image Registry
  • Cluster provisioning needs pre-built images with an installed OS
  • The registry helps filter images during cluster creation
  • This page explains how to work with the image registry
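
One way to picture the filtering step is below. The sketch assumes that each registered image carries a set of tags naming the plugin and Hadoop version it was built for, plus the login user name. The `RegisteredImage` class, the `images_for` helper and every image name are made up for this example.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class RegisteredImage:
    """Hypothetical record for an image known to the registry."""
    name: str
    username: str         # user to log in as on instances of this image
    tags: FrozenSet[str]  # assumed: plugin name and Hadoop version


def images_for(images, required_tags):
    """Keep only the images that carry every required tag."""
    required = frozenset(required_tags)
    return [img for img in images if required <= img.tags]


images = [
    RegisteredImage("ubuntu-vanilla-2.4.1", "ubuntu", frozenset({"vanilla", "2.4.1"})),
    RegisteredImage("centos-hdp-2.0.6", "cloud-user", frozenset({"hdp", "2.0.6"})),
    RegisteredImage("plain-ubuntu", "ubuntu", frozenset()),
]

# Only images built for the chosen plugin and version are offered.
matches = images_for(images, {"vanilla", "2.4.1"})
print([img.name for img in matches])
```

An image with no tags, like the last one, never shows up for any plugin, which is the point of registering images first.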


Elastic Data Processing
  • Workflow management
  • Similar to Amazon Web Services Elastic MapReduce (EMR), which is heavily used in the public cloud
  • Jobs can be launched via either the OpenStack Dashboard or the CLI
  • An API lets users launch jobs without having to know the underlying Hadoop infrastructure
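
The last point, launching a job without knowing the Hadoop infrastructure, can be sketched with a small in-memory model. This is not the Sahara API: `ElasticDataProcessing`, `JobExecution`, the round-robin cluster choice, the cluster names and the `swift://` URLs are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class JobExecution:
    """Hypothetical record of one submitted job."""
    id: int
    job_type: str
    input_url: str
    output_url: str
    cluster: str
    status: str = "PENDING"


class ElasticDataProcessing:
    """Hypothetical front end that hides cluster selection from the caller."""

    def __init__(self, clusters):
        self._clusters = list(clusters)
        if not self._clusters:
            raise ValueError("at least one cluster is required")
        self._ids = count(1)
        self.executions = {}

    def launch(self, job_type, input_url, output_url):
        # The caller names only the job and its data; the cluster is picked
        # here (simple round-robin, chosen just for this sketch).
        cluster = self._clusters[len(self.executions) % len(self._clusters)]
        execution = JobExecution(
            next(self._ids), job_type, input_url, output_url, cluster
        )
        self.executions[execution.id] = execution
        return execution


edp = ElasticDataProcessing(["hadoop-a", "hadoop-b"])
first = edp.launch("Pig", "swift://demo/input", "swift://demo/output-1")
second = edp.launch("Hive", "swift://demo/input", "swift://demo/output-2")
print(first.cluster, second.cluster)
```

The caller never mentions a cluster, a NameNode or a job tracker, only the job type and where the data lives.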
For a detailed discussion of Elastic Data Processing, please visit the OpenStack documentation site. This diagram still uses the old project name "Savanna" instead of Sahara.

Related Posts:
OpenStack Series Part 1: How do you look at OpenStack?
OpenStack Series Part 2: What's new in the Juno Release?
OpenStack Series Part 3: Keystone - Identity Service
OpenStack Series Part 4: Nova - Compute Service
OpenStack Series Part 5: Glance - Image Service
OpenStack Series Part 6: Cinder - Block Storage Service
OpenStack Series Part 7: Swift - Object Storage Service
OpenStack Series Part 8: Neutron - Networking Service
OpenStack Series Part 9: Horizon - a Web Based UI Service
OpenStack Series Part 10: Heat - Orchestration Service
OpenStack Series Part 11: Ceilometer - Monitoring and Metering Service
OpenStack Series Part 12: Trove - Database Service
OpenStack Series Part 13: Docker in OpenStack
OpenStack Series Part 15: Messaging and Queuing System in OpenStack
OpenStack Series Part 16: Ceph in OpenStack
OpenStack Series Part 17: Congress - Policy Service
OpenStack Series Part 18: Network Function Virtualization in OpenStack
OpenStack Series Part 19: Storage Policies for Object Storage
OpenStack Series Part 20: Group-based Policy for Neutron

"OpenStack." Architecture — Sahara. N.p., n.d. Web. 28 Oct. 2014.
"OpenStack." Getting Started — Sahara. N.p., n.d. Web. 29 Oct. 2014.

