Saturday, December 27, 2014

First steps with Hadoop


I need to improve one of the batch processes I run. It was built using PHP and parses text files into a MySQL database. So for a change I tried to learn Hadoop and Hive.

Installation 1

Installing Hadoop is a bit more complex than installing MySQL. OK, so 'a bit' is an understatement. I tried to follow the Windows installation procedure from HadoopOnWindows. I downloaded the binary package instead of the source package, because I was not in the mood to wait for mvn to download an endless list of jars. Well, some errors prevented me from continuing down this path.

Installation 2

Virtual machines seem to be the way to go. Not wanting to spend too much time installing and configuring VMs, I installed Vagrant, a tool to download images and configure VMs automatically. VirtualBox is required as Vagrant's default provider, so I installed it too.
At first I tried to follow a blog post titled Installing a Hadoop Cluster in Three Commands, but it somehow didn't work either. So the steps below are copied from Gabriele Baldassarre's blog post, which supplies a working Vagrantfile and a few shell scripts:
  • git clone
  • cd cdh5-vagrant
  • vi Vagrantfile
  • vagrant up
I needed to change the network settings a bit in the Vagrantfile because my server's external network does not use DHCP, and bridging is out of the question. What I need is for Vagrant to set up a host-only network that is not internal to VirtualBox.

Vagrantfile line 43:

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
  config.vm.box = "centos65-x86_64-20140116"
  config.vm.box_url = ""
  config.vm.define "cdh-master" do |master|
#    master.vm.network :public_network, :bridge => 'eth0'
    master.vm.network :private_network, ip: ""
    master.vm.network :private_network, ip: "#{privateSubnet}.#{privateStartingIp}", :netmask => "", virtualbox__intnet: "cdhnetwork"
    master.vm.hostname = "cdh-master"

So it doesn't matter that there are two private_network entries.

The scripts set up 80 GB of secondary storage on the cdh-master node, so please be sure you have plenty of space on the HDD.

After the downloads and configuration complete, access the Hue interface on the host-only IP and create the first user. This user will be defined as the administrator of this Hadoop system.

The problem 

After installation completed, I found that all of the VMs were running at 100% CPU usage, attributed to the flume user. It turns out that the provisioning script copied the Flume configuration file verbatim, and that file is configured to use a continuous sequence generator as the event source. Changing the event source to syslogtcp and restarting the flume-ng-agent service cures this condition.
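The source change can be sketched as a small text rewrite. This is only an illustration in Python, not what the provisioning scripts do: the agent name `agent`, the source name `seqGenSrc`, the bind address, and port 5140 are all assumptions, so check them against the actual Flume configuration file on the VM. The syslogtcp source needs a host and a port in addition to the type, which is why the sketch appends them.

```python
def switch_source_type(conf_text, agent, source, new_type, extra):
    """Rewrite <agent>.sources.<source>.type and add extra source properties."""
    prefix = f"{agent}.sources.{source}."
    out = []
    for line in conf_text.splitlines():
        # The property key is everything before the first "=".
        key = line.split("=", 1)[0].strip()
        if key == prefix + "type":
            out.append(f"{prefix}type = {new_type}")
            # Properties the new source type needs (assumed values).
            out.extend(f"{prefix}{k} = {v}" for k, v in extra.items())
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical excerpt of a Flume config that uses the sequence generator.
sample = """agent.sources = seqGenSrc
agent.sources.seqGenSrc.type = seq
agent.sources.seqGenSrc.channels = memoryChannel
"""

print(switch_source_type(sample, "agent", "seqGenSrc", "syslogtcp",
                         {"host": "0.0.0.0", "port": 5140}))
```

After writing the result back to the configuration file, the flume-ng-agent service still has to be restarted for the change to take effect.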

It seems that the provisioned VMs all have the default yarn.nodemanager.resource.memory-mb value, which is 8192 MB. For the 2048 MB VMs, I created this property in /etc/hadoop/conf/yarn-site.xml and set the value to 1600.
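I edited the file by hand, but the same change can be sketched in Python for anyone who wants to script it across nodes. The helper name `set_hadoop_property` is made up for this example, and it assumes the usual Hadoop layout of a `<configuration>` root holding `<property>` elements with `<name>` and `<value>` children.

```python
import xml.etree.ElementTree as ET


def set_hadoop_property(xml_text, name, value):
    """Return the config XML with `name` set to `value`, adding it if absent."""
    root = ET.fromstring(xml_text)  # the <configuration> element
    for prop in root.findall("property"):
        if (prop.findtext("name") or "").strip() == name:
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # No existing property with that name: append a new one.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


sample = "<configuration></configuration>"
print(set_hadoop_property(sample, "yarn.nodemanager.resource.memory-mb", "1600"))
```

Note that ElementTree drops comments and the XML declaration when parsing this way, so for a real yarn-site.xml it is worth keeping a backup of the original file.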

Somehow there are misconfigured lines in the created VMs. In yarn-site.xml, I needed to change yarn.nodemanager.remote-app-log-dir from hdfs:///var/log/hadoop-yarn/apps to hdfs://cdh-master:8020/var/log/hadoop-yarn/apps. I also needed to change the data directory setting in /etc/hadoop/conf/hdfs-site.xml to file:///dfs/dn to prevent 'could only be replicated to 0 nodes' errors.
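For the yarn-site.xml change, the two values differ only in the authority part of the URI (host and port). A quick way to see this is to parse both with Python's standard library. This is just an illustration of the URI structure, not something Hadoop itself runs; as far as I understand, when the authority is empty Hadoop has to take the NameNode address from the default filesystem setting instead.

```python
from urllib.parse import urlparse

old_uri = "hdfs:///var/log/hadoop-yarn/apps"
new_uri = "hdfs://cdh-master:8020/var/log/hadoop-yarn/apps"

for uri in (old_uri, new_uri):
    parts = urlparse(uri)
    # netloc is the authority: empty in the old value, host:port in the new one.
    print(f"authority={parts.netloc!r} path={parts.path!r}")
```

Both values point at the same path. Only the second one names the NameNode explicitly.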

After destroying and recreating all the VMs, it seems that the hdfs-site.xml setting also needs reconfiguring on each of the data nodes, so that they all point to file:///dfs/dn.

Installation 3

In parallel, I also tried a lightweight alternative to VMs called Docker. The first challenge is that my PC runs Windows, so I installed boot2docker first, enabled VT in the BIOS, then tried this one-liner from the blog post Ambari provisioned Hadoop cluster on Docker:

curl -LOs && . ambari-singlenode

It finished, but somehow the web UI shows everything as offline. It needs some debugging to get it right, and at the moment I have so little knowledge of what is happening behind the scenes that I am postponing the debugging until later.
