Destiny - The Cloud: OpenStack Series: Part 14 – Sahara

Friday, November 14, 2014

OpenStack Series: Part 14 – Sahara – Data Processing Service

Sahara provides a service for provisioning data-intensive application clusters on OpenStack infrastructure.

The OpenStack documentation uses the term "data intensive application cluster", but I think most people would say "big data processing" or "data analytics", where a huge amount of computing power is required to process the raw data collected.

Sahara is often thought of simply as a way to run Hadoop on OpenStack. At this time, Hadoop is the main application cluster that Sahara supports. Spark support is also being worked on, and there is a Spark plugin in Sahara. I think which big data applications get supported will depend on which ones are popular in the enterprise environments where OpenStack is used.

Like OpenStack, Hadoop 2.0 is itself a set of services, in this case for data processing. It has two pillars:

YARN - Yet Another Resource Negotiator
HDFS - Hadoop Distributed File System

image source: https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgWlmlf5oMIdBHlX3M9HbdH2bedW4n6G8sgfQRPH6vMKrLXXUzwgKSJm7Xwf47qJmPsf_9d6MJfwG_BkOziXfGtK6BjqFUE_zNY21-cr7olgz5beB2AlewxoQdbh3j6G8_MWg-0U_58e0I/s1600/YARN.png

Another view of Hadoop:

image source: http://www.rosebt.com/uploads/8/1/8/1/8181762/5829807_orig.jpg

This one shows the Hadoop ecosystem with MapReduce, PIG, HIVE, HBASE ... etc.

image source: http://hortonworks.com/wp-content/uploads/2013/10/HDP2.0Stack.png

Sahara Architecture

image source: http://docs.openstack.org/developer/sahara/_images/sahara-architecture.png

The OpenStack documentation describes the components of Sahara as follows:

Auth component - responsible for client authentication & authorization, communicates with Keystone
DAL - Data Access Layer, persists internal models in DB
Provisioning Engine - component responsible for communication with Nova, Heat, Cinder and Glance
Vendor Plugins - pluggable mechanism responsible for configuring and launching Hadoop on provisioned VMs; existing management solutions like Apache Ambari and Cloudera Management Console could be utilized for that matter
EDP - Elastic Data Processing (EDP) responsible for scheduling and managing Hadoop jobs on clusters provisioned by Sahara
REST API - exposes Sahara functionality via REST
Python Sahara Client - similar to other OpenStack components, Sahara has its own Python client
Sahara pages - the GUI for Sahara is located in Horizon

Sahara Use Cases
Sahara supports two key use cases:

on-demand cluster provisioning
on-demand Hadoop tasks execution (Elastic Data Processing)

Cluster Provisioning
Cluster

Consists of node groups
Three types of node groups: Master, Core Workers and Workers
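
The cluster structure above can be sketched as a small data model. This is only an illustration, not Sahara's own code: the class names, the flavor names and the assignment of Hadoop processes to each node group are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodeGroup:
    """A set of identically configured instances (hypothetical model)."""
    name: str
    flavor: str                      # hypothetical flavor name
    count: int                       # number of instances in the group
    processes: List[str] = field(default_factory=list)


@dataclass
class Cluster:
    """A cluster is simply a named collection of node groups."""
    name: str
    node_groups: List[NodeGroup]

    def total_instances(self) -> int:
        # Sum the instance counts over all node groups.
        return sum(group.count for group in self.node_groups)


# Assumed process layout: the master runs the management services, core
# workers store data and compute, plain workers only compute.
cluster = Cluster(
    name="demo",
    node_groups=[
        NodeGroup("master", "m1.large", 1, ["namenode", "resourcemanager"]),
        NodeGroup("core-workers", "m1.medium", 3, ["datanode", "nodemanager"]),
        NodeGroup("workers", "m1.medium", 2, ["nodemanager"]),
    ],
)

print(cluster.total_instances())
```

Scaling the cluster in this model means changing the count of one node group, which is why grouping identical nodes is convenient.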

Templates

Two kinds of templates: node group templates and cluster templates
Users can override template parameters via the API
Templates are specific to a Hadoop distribution because each distribution uses different parameters
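
To illustrate how a template can be reused while some of its parameters are overridden at launch time, here is a minimal sketch. The dictionary keys, the plugin name and the helper function apply_overrides are hypothetical and are not taken from the Sahara API.

```python
from copy import deepcopy


def apply_overrides(template: dict, overrides: dict) -> dict:
    """Return a copy of the template with the overrides merged in recursively."""
    result = deepcopy(template)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            # Merge nested sections instead of replacing them wholesale.
            result[key] = apply_overrides(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


# Hypothetical node group template; the field names are illustrative only.
node_group_template = {
    "name": "worker-template",
    "plugin_name": "vanilla",
    "flavor": "m1.medium",
    "node_configs": {"HDFS": {"dfs.replication": 3}},
}

# The user overrides the flavor and one Hadoop parameter for this launch.
launched = apply_overrides(
    node_group_template,
    {"flavor": "m1.large", "node_configs": {"HDFS": {"dfs.replication": 2}}},
)

print(launched["flavor"], launched["node_configs"]["HDFS"]["dfs.replication"])
```

The template itself is left untouched, so it can be reused for the next cluster.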

Provisioning Plugins

Responsible for provisioning a Hadoop cluster
One plugin for each specific Hadoop distribution (Apache Hadoop, Hortonworks)
A list of available plugins can be found in the Sahara documentation
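
The "one plugin per distribution" idea can be sketched as a simple registry keyed by plugin name. The class names, the method configure_cluster and the returned step descriptions are assumptions for illustration; Sahara's real plugin interface is richer than this.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class ProvisioningPlugin(ABC):
    """Hypothetical base class: one subclass per Hadoop distribution."""
    name: str

    @abstractmethod
    def configure_cluster(self, cluster_name: str) -> List[str]:
        """Return the steps this plugin would perform on the provisioned VMs."""


class VanillaPlugin(ProvisioningPlugin):
    name = "vanilla"  # illustrative name for an Apache Hadoop plugin

    def configure_cluster(self, cluster_name: str) -> List[str]:
        return [
            f"{cluster_name}: install Apache Hadoop",
            f"{cluster_name}: start services",
        ]


class HdpPlugin(ProvisioningPlugin):
    name = "hdp"  # illustrative name for a Hortonworks plugin

    def configure_cluster(self, cluster_name: str) -> List[str]:
        # A vendor plugin can delegate to an existing management tool.
        return [f"{cluster_name}: hand off to Ambari"]


# Registry that maps a plugin name to its implementation.
PLUGINS: Dict[str, ProvisioningPlugin] = {
    plugin.name: plugin for plugin in (VanillaPlugin(), HdpPlugin())
}


def provision(plugin_name: str, cluster_name: str) -> List[str]:
    try:
        plugin = PLUGINS[plugin_name]
    except KeyError:
        raise ValueError(f"unknown plugin: {plugin_name}") from None
    return plugin.configure_cluster(cluster_name)


print(provision("hdp", "demo"))
```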

Image Registry

To support cluster provisioning, pre-built images with an installed OS are needed
The registry helps filter images during cluster creation
The Sahara documentation explains how to work with the image registry
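
One way to picture the filtering is to attach tags to each registered image and select only the images that carry every tag a cluster needs. This is a sketch under assumptions: the class RegisteredImage, the function find_images, the image names and the tag values are all made up for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class RegisteredImage:
    """Hypothetical registry entry for a pre-built image."""
    name: str
    username: str          # login user baked into the image
    tags: FrozenSet[str]   # e.g. plugin name and Hadoop version


def find_images(images: Iterable[RegisteredImage],
                required_tags: Iterable[str]) -> List[RegisteredImage]:
    """Keep only the images that carry all of the required tags."""
    required = frozenset(required_tags)
    return [image for image in images if required <= image.tags]


registry = [
    RegisteredImage("ubuntu-vanilla", "ubuntu", frozenset({"vanilla", "2.4.1"})),
    RegisteredImage("centos-hdp", "cloud-user", frozenset({"hdp", "2.0.6"})),
    RegisteredImage("ubuntu-plain", "ubuntu", frozenset()),
]

matches = find_images(registry, ["vanilla", "2.4.1"])
print([image.name for image in matches])
```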

image source: http://docs.openstack.org/developer/sahara/_images/hadoop-cluster-example.jpg

Elastic Data Processing

Workflow management
Similar to Amazon Web Services Elastic MapReduce (EMR), which is heavily used in the public cloud
Jobs can be launched either via the OpenStack Dashboard or the CLI
API to launch jobs without the user having to know the underlying Hadoop infrastructure
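
The last point can be pictured as a thin facade: the user names a job type, a cluster, an input and an output, and never touches Hadoop directly. The sketch below is an in-memory stand-in, not the Sahara client; the class names, the status value and the data source URLs are hypothetical.

```python
import itertools
from dataclasses import dataclass
from typing import Dict


@dataclass
class JobExecution:
    """Hypothetical record of one submitted job."""
    id: int
    job_type: str
    cluster: str
    input_url: str
    output_url: str
    status: str = "PENDING"


class JobService:
    """In-memory stand-in for a job submission API."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self.executions: Dict[int, JobExecution] = {}

    def launch(self, job_type: str, cluster: str,
               input_url: str, output_url: str) -> JobExecution:
        # The caller describes what to run and where the data lives;
        # scheduling on the Hadoop cluster is hidden behind this call.
        execution = JobExecution(next(self._ids), job_type, cluster,
                                 input_url, output_url)
        self.executions[execution.id] = execution
        return execution


service = JobService()
run = service.launch("Pig", "demo",
                     "swift://container/input",    # hypothetical URL
                     "swift://container/output")   # hypothetical URL
print(run.id, run.status)
```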

For a detailed discussion of Elastic Data Processing, please visit the OpenStack documentation site. The EDP diagram still uses the old project name "Savanna" instead of Sahara.

image source: https://wiki.openstack.org/wiki/File:EDP_diagram.png

Related Posts:
OpenStack Series Part 1: How do you look at OpenStack?
OpenStack Series Part 2: What's new in the Juno Release?
OpenStack Series Part 3: Keystone - Identity Service
OpenStack Series Part 4: Nova - Compute Service
OpenStack Series Part 5: Glance - Image Service
OpenStack Series Part 6: Cinder - Block Storage Service
OpenStack Series Part 7: Swift - Object Storage Service
OpenStack Series Part 8: Neutron - Networking Service
OpenStack Series Part 9: Horizon - a Web Based UI Service
OpenStack Series Part 10: Heat - Orchestration Service
OpenStack Series Part 11: Ceilometer - Monitoring and Metering Service
OpenStack Series Part 12: Trove - Database Service
OpenStack Series Part 13: Docker in OpenStack
OpenStack Series Part 15: Messaging and Queuing System in OpenStack
OpenStack Series Part 16: Ceph in OpenStack
OpenStack Series Part 17: Congress - Policy Service
OpenStack Series Part 18: Network Function Virtualization in OpenStack
OpenStack Series Part 19: Storage Policies for Object Storage
OpenStack Series Part 20: Group-based Policy for Neutron

Reference:
"OpenStack." Architecture — Sahara. N.p., n.d. Web. 28 Oct. 2014.
"OpenStack." Getting Started — Sahara. N.p., n.d. Web. 29 Oct. 2014.