Friday, November 14, 2014

OpenStack Series: Part 14 – Sahara – Data Processing Service

Sahara provides a service for provisioning data-intensive application clusters on OpenStack infrastructure.

The OpenStack documentation uses the term "data intensive application cluster". I think most people would call this "big data processing" or "data analytics", where a huge amount of computing power is required to process the raw data collected.

The common impression of Sahara is that it exists to run Hadoop on OpenStack. At this time, Hadoop is the main application cluster that Sahara supports. Spark support is also being worked on, and there is already a Spark plugin in Sahara. I think which applications get supported will depend on which Big Data applications are popular in the enterprise environments where OpenStack is used.

Like OpenStack, Hadoop 2.0 is itself a set of services, in this case for data processing. It has two pillars:
  • YARN - Yet Another Resource Negotiator
  • HDFS - Hadoop Distributed File System
  
image source: http://4.bp.blogspot.com/-Pm2Q_uyZmPw/U2IvDO7my1I/AAAAAAAABFg/8CAyVoO7F30/s1600/YARN.png

Another view of Hadoop:

image source: http://www.rosebt.com/uploads/8/1/8/1/8181762/5829807_orig.jpg

This one shows the Hadoop ecosystem, including MapReduce, Pig, Hive, HBase and others.
image source: http://hortonworks.com/wp-content/uploads/2013/10/HDP2.0Stack.png


Sahara Architecture
image source: http://docs.openstack.org/developer/sahara/_images/sahara-architecture.png

The OpenStack documentation describes the various components of Sahara as:
  • Auth component - responsible for client authentication & authorization, communicates with Keystone
  • DAL - Data Access Layer, persists internal models in DB
  • Provisioning Engine - component responsible for communication with Nova, Heat, Cinder and Glance
  • Vendor Plugins - pluggable mechanism responsible for configuring and launching Hadoop on provisioned VMs; existing management solutions like Apache Ambari and Cloudera Management Console could be utilized for that matter
  • EDP - Elastic Data Processing (EDP) responsible for scheduling and managing Hadoop jobs on clusters provisioned by Sahara
  • REST API - exposes Sahara functionality via REST
  • Python Sahara Client - similar to other OpenStack components Sahara has its own python client
  • Sahara pages - GUI for the Sahara is located on Horizon

Sahara Use Cases
Sahara supports two key use cases:
  • on-demand cluster provisioning
  • on-demand Hadoop tasks execution (Elastic Data Processing)
Cluster Provisioning
Cluster
  • Consists of node groups
  • Three types of node groups: Master, Core Workers and Workers
Templates
  • Two kinds of templates: node group templates and cluster templates
  • Users can override the template parameters via the API
  • Templates are specific to a Hadoop distribution, because each distribution uses different parameters
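
To make the two kinds of templates and the override idea concrete, here is a minimal sketch in plain Python. It is not Sahara's actual schema or API: the class names, field names, flavor names and process names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List


@dataclass(frozen=True)
class NodeGroupTemplate:
    """One kind of node (hypothetical fields, not Sahara's schema)."""
    name: str
    flavor: str                      # VM size, e.g. a Nova flavor name
    processes: List[str]             # Hadoop processes to run on each node
    configs: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class ClusterTemplate:
    """Combines node group templates with a node count for each."""
    name: str
    plugin: str                      # Hadoop distribution the template targets
    node_groups: Dict[str, int]      # node group template name -> node count


def with_overrides(template: NodeGroupTemplate, **overrides) -> NodeGroupTemplate:
    """Return a copy of the template with selected parameters replaced."""
    return replace(template, **overrides)


# Illustrative values only.
master = NodeGroupTemplate("master", "m1.medium", ["namenode", "resourcemanager"])
worker = NodeGroupTemplate("worker", "m1.small", ["datanode", "nodemanager"])
cluster = ClusterTemplate("demo", "vanilla", {"master": 1, "worker": 3})

# Override one parameter at launch time; the stored template is unchanged.
big_worker = with_overrides(worker, flavor="m1.large")
```

The point of the sketch is that a template is a reusable description, and an override produces a modified copy for one launch rather than editing the template itself.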
Provisioning Plugins
  • Responsible for provisioning a Hadoop cluster
  • One plugin for each specific Hadoop distribution (Apache Hadoop, Hortonworks)
  • A list of available plugins can be found in the Sahara documentation
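
The "one plugin per distribution" idea can be sketched as a small plugin registry. This is a hypothetical interface written for illustration, not Sahara's real plugin base class; the plugin names, the `configure_cluster` method and the step descriptions are all assumptions.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class ProvisioningPlugin(ABC):
    """Hypothetical plugin interface; not Sahara's real base class."""

    name: str

    @abstractmethod
    def configure_cluster(self, nodes: List[str]) -> List[str]:
        """Return the distribution-specific steps for the given nodes."""


# Registry mapping a plugin name to its class.
PLUGINS: Dict[str, Type[ProvisioningPlugin]] = {}


def register(cls: Type[ProvisioningPlugin]) -> Type[ProvisioningPlugin]:
    """Class decorator that adds a plugin to the registry."""
    PLUGINS[cls.name] = cls
    return cls


@register
class ApacheHadoopPlugin(ProvisioningPlugin):
    name = "apache-hadoop"

    def configure_cluster(self, nodes: List[str]) -> List[str]:
        return [f"write Hadoop config files on {n}" for n in nodes]


@register
class HortonworksPlugin(ProvisioningPlugin):
    name = "hortonworks"

    def configure_cluster(self, nodes: List[str]) -> List[str]:
        return [f"register {n} with the management server" for n in nodes]


def provision(plugin_name: str, nodes: List[str]) -> List[str]:
    """Look up the plugin by name and let it configure the nodes."""
    plugin = PLUGINS[plugin_name]()
    return plugin.configure_cluster(nodes)


steps = provision("apache-hadoop", ["vm-1", "vm-2"])
```

The provisioning code only knows the plugin name; everything specific to a distribution stays inside the plugin class.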
Image Registry
  • To support cluster provisioning, pre-built images with an installed OS are needed
  • Helps filter out unsuitable images during cluster creation
  • The Sahara documentation explains how to work with the image registry
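
As a rough illustration of how a registry can filter images during cluster creation, the sketch below assumes each registered image carries a set of tags (for example a plugin name and a version) and keeps only the images that have every required tag. The class, function, image names and tag values are made up for this example and are not taken from Sahara.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class RegisteredImage:
    """An image known to the registry (hypothetical model)."""
    name: str
    username: str            # login user baked into the image
    tags: FrozenSet[str]     # e.g. plugin name and version


def images_for(images: Iterable[RegisteredImage],
               required_tags: Iterable[str]) -> List[RegisteredImage]:
    """Keep only the images that carry every required tag."""
    required = frozenset(required_tags)
    return [img for img in images if required <= img.tags]


# Illustrative data only.
images = [
    RegisteredImage("ubuntu-vanilla", "ubuntu", frozenset({"vanilla", "2.4.1"})),
    RegisteredImage("centos-hdp", "cloud-user", frozenset({"hdp", "2.0.6"})),
    RegisteredImage("plain-ubuntu", "ubuntu", frozenset()),
]

matches = images_for(images, ["vanilla", "2.4.1"])
print([img.name for img in matches])
```

An untagged image such as "plain-ubuntu" never matches a non-empty tag request, which is how a registry can hide images that were not prepared for a given plugin.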

image source: http://docs.openstack.org/developer/sahara/_images/hadoop-cluster-example.jpg

Elastic Data Processing
  • Workflow management
  • Similar to Amazon Web Services Elastic MapReduce (EMR), which is heavily used in the public cloud
  • Jobs can be launched either via the OpenStack Dashboard or the CLI
  • Provides an API to launch jobs without the user having to know the underlying Hadoop infrastructure
For a detailed discussion of Elastic Data Processing, please visit the OpenStack documentation site. Note that this diagram still uses the old project name "Savanna" instead of Sahara.
image source: https://wiki.openstack.org/wiki/File:EDP_diagram.png
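
The idea of launching a job without knowing the Hadoop details underneath can be sketched as a toy job manager. Everything here is an assumption made for illustration: the class names, fields, status strings and URLs are hypothetical and do not reflect Sahara's actual EDP API.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict


@dataclass
class Job:
    """What to run (hypothetical model)."""
    name: str
    job_type: str        # e.g. "Pig", "Hive" or "MapReduce"
    main_binary: str     # location of the script or jar


@dataclass
class JobExecution:
    """One run of a job on a particular cluster."""
    id: int
    job: Job
    cluster: str
    input_url: str
    output_url: str
    status: str = "PENDING"


class JobManager:
    """Tracks job executions; the caller never touches Hadoop directly."""

    def __init__(self) -> None:
        self._ids = count(1)
        self.executions: Dict[int, JobExecution] = {}

    def launch(self, job: Job, cluster: str,
               input_url: str, output_url: str) -> JobExecution:
        execution = JobExecution(next(self._ids), job, cluster,
                                 input_url, output_url)
        self.executions[execution.id] = execution
        return execution

    def mark(self, execution_id: int, status: str) -> None:
        self.executions[execution_id].status = status


manager = JobManager()
wordcount = Job("wordcount", "Pig", "swift://jobs/wordcount.pig")
run = manager.launch(wordcount, "demo-cluster",
                     "swift://data/input", "swift://data/output")
manager.mark(run.id, "SUCCEEDED")
```

The user supplies only the job, the target cluster and the input and output locations; scheduling and Hadoop-specific configuration would live behind `launch` in a real service.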

Related Posts:
OpenStack Series Part 1: How do you look at OpenStack?
OpenStack Series Part 2: What's new in the Juno Release?
OpenStack Series Part 3: Keystone - Identity Service
OpenStack Series Part 4: Nova - Compute Service
OpenStack Series Part 5: Glance - Image Service
OpenStack Series Part 6: Cinder - Block Storage Service
OpenStack Series Part 7: Swift - Object Storage Service
OpenStack Series Part 8: Neutron - Networking Service
OpenStack Series Part 9: Horizon - a Web Based UI Service
OpenStack Series Part 10: Heat - Orchestration Service
OpenStack Series Part 11: Ceilometer - Monitoring and Metering Service
OpenStack Series Part 12: Trove - Database Service
OpenStack Series Part 13: Docker in OpenStack
OpenStack Series Part 15: Messaging and Queuing System in OpenStack
OpenStack Series Part 16: Ceph in OpenStack
OpenStack Series Part 17: Congress - Policy Service
OpenStack Series Part 18: Network Function Virtualization in OpenStack
OpenStack Series Part 19: Storage Policies for Object Storage
OpenStack Series Part 20: Group-based Policy for Neutron

Reference:
"OpenStack." Architecture — Sahara. N.p., n.d. Web. 28 Oct. 2014.
"OpenStack." Getting Started — Sahara. N.p., n.d. Web. 29 Oct. 2014.
