Hadoop YARN: Super Simplified


Tech world is filled with overwhelming number of terms and acronyms, Hadoop is no exception. Hadoop v1.0 was centered around just two things: HDFS (Hadoop File System) and Map Reduce. In fact, most of earlier versions of Hadoop used Map Reduce to process data. Hadoop 2.0 introduced YARN (Yet Another Resource Negotiator) as Hadoop moved from Map Reduce to more generic programming framework or model, with ability to support Apache Spark among others.

YARN architecture can be a little confusing. An attempt is made in this blog to ‘super simplify’, without sacrificing too much of context and features.

Consider a single host / server. Operating system does the (a) scheduling and allocates (b) resources for an application. If there are number of instances of an application on the same server, it is conceivable to have (c) an application master managing all instances for that application.

Expand this idea to a cluster. A number of applications may be spawned by a corresponding Application Master. A cluster wise master or resource manager is required to coordinate activities and schedule to keep cluster healthy. Tasks or workers are run and managed by application master. Application master requests resource manager, which in turn allocates resources in coordination with node manager.

Confusion confined! Please check details such as flow / control structure at Apache documentation

Big data store for 1 Peta Byte (compressed)


I was checking cost effective way to archive a large set of data. A store of this size is desirable for analyzing massive collection of data. One approach to machine learning is to apply ensemble models and let the algorithms figure out best predictive analytical selection. We can process compressed data directly, without unzipping in many cases. This data can feed key-value store, Hadoop or traditional databases such as Postgres.

Cloud storage can get very expensive and is challenging for fast access. Enterprise storage from companies such as NetApp is not cheap either.

Back Blaze has published series of studies. Here is a 180 TB system with price tag of around $10K: Compression ration of 1:5 or better is assumed: https://www.backblaze.com/blog/backblaze-storage-pod-4/

This is definitely not for enterprise grade reliability and performance. We still need to decide on file system, back up / recovery strategy, easy administration among other challenges.

Hadoop Cluster on VMware ESX server with Ambari and Spark


I am sure some of us wanted to experience Hadoop in a true cluster, not just limiting to a pseudo cluster (aka, single node).  I played a bit with other distributions in the past, including PHD and CDH. Hortonworks Hadoop installation and configuration with Ambari is relatively painless and even enlightening! Set aside a few hours or a weekend, one can get a Hadoop cluster up and running with default services.

Here are high level observations, your mileage may slightly vary.

  1. Get an ESX server from VMware, free if you are running 2 CPU x 6 cores machine or with lower specs. This machine has 8 x 1TB SAS directly attached drives on a Dell rack mountable server.
  2. Install CentOS, free from thoughtpolice.co.uk
  3. Setup 6 or 8 nodes of VM with default settings. Here is mine when viewed from vSphere client.

    Six VM CentOS images

    Six VM CentOS images

  4. Ensure ssh is setup correctly both for root and hduser so that these users can access any VM without a password.
  5. Download Hadoop 2.1 with Ambari 1.6.1. Follow steps 1 through 4.Make required directories on the nodes and turn off IP tables. (http://docs.hortonworks.com/HDPDocuments/Ambari- 
  6. Setup repositories, ensure that they match with OS version and release.
  7. Two of the six VMs reported problem (though all six are practically supposed to be identical). “sudo yum -y update –skip-broken” fixed these problems.
  8. Here is VM (host) image and Ambari UI. Not only we can see services, configuration and activity through web interface, one can access Ganglia and Nagios as well.
    All six hosts displayed here (no order)

    All six hosts displayed here (no order)

    Ambari interface

  9. Run a sample map reduce job to check all is well.
  10. Download Spark:

    wget http://public-repo-1.hortonworks.com/HDP-LABS/Projects/spark/1.1.0/spark-

  11. Untar and run a small Spark job after setting YARN_CONF_DIR accordingly:

    ./bin/spark-submit –class org.apache.spark.examples.SparkPi    –master yarn-cluster  –num-executors 3 –driver-memory 512m  –executor-memory 512m   –executor-cores 1  lib/spark-examples*.jar 10

  12. Above may run into an error if permissions & directories are not set properly.

    sudo -u hdfs hadoop fs -mkdir /user/root

    hadoop fs -ls /user

  13. Foot note: All settings used are default, no tuning or scaling performed.

Mongodb architecture: super simplified


Mongodb uses doubly linked lists (DLL) to maintain structure. There are two levels; this does not take into indexing and other features.

  1. Extents: they are similar to a table in a relational system. Extents contain one or more documents and are connected through DLL. (Slide 10 & 10, see below)
  2. Documents: these are similar to a row in a relational databases. They are maintain their own DLL. (Slide 19).

Mongodb references files on disk through memory mapping. Indexes are through well known B-trees.

These are illustrated in http://2013.nosql-matters.org/bcn/wp-content/uploads/2013/12/storage-talk-mongodb.pdf

Deeper coverage: http://horicky.blogspot.com/2012/04/mongodb-architecture.html

Note: MongoDB is suited for document display & processing; if we try to link or join docs, we will find some difficulties. Definitely, not a competing technology to (relational or MPP) databases.

Next big Hadoop wave: Spark


One of the popular and widely talked about technologies in Hadoop world is Spark. Spark Summit concluded about a week back and was quickly sold out!

If we consider (traditional) Hadoop as batch oriented, Oracle Times Ten or Pivotal Gemfire as memory based data stores, Spark may be positioned in between these two extremes; near real time interactive solution.

Data Bricks (based on Spark) has demonstrated their product offering(s): http://www.youtube.com/watch?v=dJQ5lV5Tldw&list=PLcI18OaXgJ1ucFdq6xWkLoKzP1T_R5mVD&index=1

Product demo starts from minute 15 onwards. It appears that ‘deep learning’ algorithms are built into this product as it can “predict” on its own; think of it as unsupervised as opposed to supervised learning. A number of features are strikigly similar to Greenplum’s Chorus.


Good Database Comparison Site: db-engines.com


This site is useful for quick comparison of databases and data stores. For instance, if you like to generate a summarized report for NZ vs TD, here is a link: http://db-engines.com/en/system/Netezza%3BTeradata

We can include or remove any product(s) to generate a nice report.


RB ranking are given at http://db-engines.com/en/ranking

Note that these findings are not based on any scientific analysis or financial or revenue generation or even customer installation base. They are consolidated from web and digital access from internet users like you.

Postgres is fourth popular database, not surprising as it is the engine for Netezza, Greenplum, Vertica, etc. Btw, Teradata uses Postgres to store performance results (in viewpoint).