The tech world is filled with an overwhelming number of terms and acronyms, and Hadoop is no exception. Hadoop 1.0 was centered around just two things: HDFS (Hadoop Distributed File System) and MapReduce. In fact, most earlier versions of Hadoop used MapReduce to process data. Hadoop 2.0 introduced YARN (Yet Another Resource Negotiator) as Hadoop moved from MapReduce to a more generic programming framework, with the ability to support Apache Spark among others.
YARN architecture can be a little confusing. This post attempts to 'super simplify' it, without sacrificing too much context or too many features.
Consider a single host / server. The operating system does the (a) scheduling and allocates (b) resources for an application. If there are a number of instances of an application on the same server, it is conceivable to have (c) an application master managing all instances of that application.
Now expand this idea to a cluster. A number of applications may be spawned, each with a corresponding Application Master. A cluster-wide master, the Resource Manager, is required to coordinate activities and schedule work to keep the cluster healthy. Tasks (workers) are run and managed by the Application Master. The Application Master requests resources from the Resource Manager, which in turn allocates them in coordination with the Node Managers.
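The request flow can be sketched as a toy simulation. All class and method names here are illustrative stand-ins, not the actual YARN API:

```python
# Toy sketch of the YARN resource-request flow (illustrative names,
# not the real YARN API).

class NodeManager:
    """Per-node agent: launches containers while it has capacity."""
    def __init__(self, name, slots):
        self.name, self.slots = name, slots

    def launch_container(self):
        self.slots -= 1
        return f"container on {self.name}"

class ResourceManager:
    """Cluster-wide master: tracks nodes and grants containers."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, num):
        granted = []
        for node in self.nodes:
            while node.slots > 0 and len(granted) < num:
                granted.append(node.launch_container())
        return granted

class ApplicationMaster:
    """Per-application master: asks the RM for containers, runs tasks."""
    def __init__(self, rm):
        self.rm = rm

    def run(self, num_tasks):
        containers = self.rm.allocate(num_tasks)
        return [f"task {i} in {c}" for i, c in enumerate(containers)]

rm = ResourceManager([NodeManager("node1", 2), NodeManager("node2", 2)])
am = ApplicationMaster(rm)
tasks = am.run(3)
print(tasks)  # three tasks spread across the two nodes
```

The key point the sketch captures: the Application Master never launches work on its own authority; it goes through the Resource Manager, which hands out capacity node by node.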
I was looking for a cost-effective way to archive a large set of data. A store of this size is desirable for analyzing massive collections of data. One approach to machine learning is to apply ensemble models and let the algorithms figure out the best predictive selection. In many cases we can process compressed data directly, without unzipping. This data can feed a key-value store, Hadoop, or a traditional database such as Postgres.
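As a small illustration of processing compressed data in place, gzip-compressed text can be scanned with Python's standard gzip module, so no unzipped copy ever hits the disk (the file name and contents below are made up for the example):

```python
import gzip

# Write a small gzip-compressed sample file (stand-in for archived data).
with gzip.open("sample.txt.gz", "wt") as f:
    f.write("alpha,1\nbeta,2\ngamma,3\n")

# Scan and aggregate directly from the compressed file.
with gzip.open("sample.txt.gz", "rt") as f:
    total = sum(int(line.split(",")[1]) for line in f)

print(total)  # 6
```

The same pattern (stream, decompress on the fly, aggregate) is what lets Hadoop input formats and Postgres COPY pipelines consume compressed archives without a staging area.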
Cloud storage can get very expensive and is challenging for fast access. Enterprise storage from companies such as NetApp is not cheap either.
Backblaze has published a series of studies. Here is a 180 TB system with a price tag of around $10K; a compression ratio of 1:5 or better is assumed: https://www.backblaze.com/blog/backblaze-storage-pod-4/
This is definitely not enterprise-grade reliability or performance. We still need to decide on a file system and a backup / recovery strategy, and work out easy administration, among other challenges.
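To put the quoted figures in perspective, here is the back-of-the-envelope arithmetic: 180 TB of raw disk at the assumed 1:5 compression ratio holds roughly 900 TB of logical data, which works out to about $11 per effective TB:

```python
raw_tb = 180          # raw capacity of the storage pod
price_usd = 10_000    # approximate price tag quoted above
compression = 5       # 1:5 compression ratio assumed above

effective_tb = raw_tb * compression
cost_per_tb = price_usd / effective_tb
print(effective_tb, round(cost_per_tb, 2))  # 900 11.11
```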
I am sure some of us have wanted to experience Hadoop on a true cluster, not just a pseudo cluster (aka, single node). I played a bit with other distributions in the past, including PHD and CDH. Hortonworks Hadoop installation and configuration with Ambari is relatively painless and even enlightening! Set aside a few hours or a weekend and one can get a Hadoop cluster up and running with default services.
Here are high-level observations; your mileage may vary slightly.
- Get an ESX server from VMware, free if you are running a 2 CPU x 6 core machine or lower specs. My machine has 8 x 1 TB SAS direct-attached drives on a Dell rack-mountable server.
- Install CentOS; a free image is available from thoughtpolice.co.uk.
- Set up 6 or 8 VM nodes with default settings. Here is mine as viewed from the vSphere client.
Six VM CentOS images
- Ensure ssh is set up correctly for both root and hduser so that these users can access any VM without a password.
- Download Hadoop 2.1 with Ambari 1.6.1. Follow steps 1 through 4. Make the required directories on the nodes and turn off iptables. (http://docs.hortonworks.com/HDPDocuments/Ambari-184.108.40.206/bk_using_Ambari_book/content/ambari-chap1.html)
- Set up repositories; ensure that they match the OS version and release.
- Two of the six VMs reported problems (though all six are practically supposed to be identical). `sudo yum -y update --skip-broken` fixed these problems.
- Here are the VM (host) image and the Ambari UI. Not only can we see services, configuration, and activity through the web interface, one can access Ganglia and Nagios as well.
All six hosts displayed here (no order)
- Run a sample MapReduce job to check that all is well.
- Download Spark:
- Untar and run a small Spark job after setting YARN_CONF_DIR accordingly:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10
- The above may run into an error if permissions and directories are not set properly; create the user directory in HDFS:
sudo -u hdfs hadoop fs -mkdir /user/root
hadoop fs -ls /user
- Footnote: all settings used are defaults; no tuning or scaling was performed.
MongoDB uses doubly linked lists (DLLs) to maintain its storage structure. There are two levels; this does not take into account indexing and other features.
- Extents: similar to a table in a relational system. Extents contain one or more documents and are connected through a DLL. (Slide 10, see below)
- Documents: similar to a row in a relational database. They maintain their own DLL. (Slide 19)
MongoDB references files on disk through memory mapping. Indexes use the well-known B-tree structure.
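A minimal sketch of that two-level structure, purely illustrative and not MongoDB's actual on-disk format: extents are chained in one DLL, and each extent's documents form their own DLL.

```python
class Node:
    """A doubly linked list node: an extent or a document record."""
    def __init__(self, payload):
        self.payload = payload
        self.prev = None
        self.next = None

def link(payloads):
    """Chain payloads into a DLL and return the head node."""
    head = prev = None
    for payload in payloads:
        node = Node(payload)
        if prev:
            prev.next, node.prev = node, prev
        else:
            head = node
        prev = node
    return head

# Level 1: extents chained together. Level 2: each extent's documents
# form their own DLL (stored here as the extent's payload).
docs_a = link([{"_id": 1}, {"_id": 2}])
docs_b = link([{"_id": 3}])
extents = link([docs_a, docs_b])

# A full scan walks the extent chain, then each document chain.
ids = []
ext = extents
while ext:
    doc = ext.payload
    while doc:
        ids.append(doc.payload["_id"])
        doc = doc.next
    ext = ext.next
print(ids)  # [1, 2, 3]
```

The prev/next pointers are what make inserting and removing records cheap without rewriting neighbors, which is the point of using DLLs at both levels.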
These are illustrated in http://2013.nosql-matters.org/bcn/wp-content/uploads/2013/12/storage-talk-mongodb.pdf
Deeper coverage: http://horicky.blogspot.com/2012/04/mongodb-architecture.html
Note: MongoDB is suited for document display and processing; if we try to link or join documents, we will run into difficulties. It is definitely not a competing technology to (relational or MPP) databases.
Spark is gaining enormous popularity; many folks in the industry consider it the next phase in the Hadoop / big data wave. The Spark Summit, which concluded about a week ago, was quickly sold out!
If we consider (traditional) Hadoop as batch-oriented, and Oracle TimesTen or Pivotal GemFire as memory-based data stores, Spark may be positioned between these two extremes: a near real-time, interactive solution.
Databricks (based on Spark) has demonstrated their product offering(s): http://www.youtube.com/watch?v=dJQ5lV5Tldw&list=PLcI18OaXgJ1ucFdq6xWkLoKzP1T_R5mVD&index=1
The product demo starts from minute 15 onwards. It appears that 'deep learning' algorithms are built into the product, as it can "predict" on its own; think of it as unsupervised as opposed to supervised learning. A number of features are strikingly similar to Greenplum's Chorus.
This site is useful for a quick comparison of databases and data stores. For instance, if you would like to generate a summarized report for Netezza vs. Teradata, here is a link: http://db-engines.com/en/system/Netezza%3BTeradata
We can include or remove any product(s) to generate a nice report.
DB-Engines rankings are given at http://db-engines.com/en/ranking
Note that these rankings are not based on any scientific analysis, financial or revenue figures, or even customer installation base. They are consolidated from web and digital activity by internet users like you.
Postgres is the fourth most popular database, which is not surprising as it is the engine behind Netezza, Greenplum, Vertica, etc. By the way, Teradata uses Postgres to store performance results (in Viewpoint).