First steps with Hadoop

Background

One of the batch processes I run needs some improvement. It was built using PHP, parsing text files into a MySQL database. So, for a change, I tried to learn Hadoop and Hive.

Installation 1

Installing Hadoop is a bit more complex than installing MySQL. OK, so 'a bit' is an understatement. I tried to follow the Windows installation procedure from HadoopOnWindows. I downloaded the binary package instead of the source package, because I was not in the mood to wait for mvn to download an endless list of jars. Unfortunately, some errors prevented me from continuing down this path.

Installation 2

Virtual machines seem to be the way to go. Not wanting to spend too much time installing and configuring VMs, I installed Vagrant, a tool that downloads images and configures VMs automatically. VirtualBox is required as Vagrant's default provider, so I installed it too.
At first I tried to follow a blog post titled Installing a Hadoop Cluster in Three Commands, but somehow it didn't work either. So the steps below are copied from Gabriele Baldassarre's blog post, which provides a working Vagrantfile and a few shell scripts:
  • git clone https://github.com/theclue/cdh5-vagrant
  • cd cdh5-vagrant
  • vi Vagrantfile
  • vagrant up
I needed to change the network settings a bit in the Vagrantfile, because my server's external network is not DHCP and bridging is out of the question. What I need is for Vagrant to set up a host-only network that is not internal to VirtualBox.

Vagrantfile line 43:

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
  config.vm.box = "centos65-x86_64-20140116"
  config.vm.box_url = "https://github.com/2creatives/vagrant-centos/releases/download/v6.4.2/centos64-x86_64-20140116.box"
  config.vm.define "cdh-master" do |master|
#    master.vm.network :public_network, :bridge => 'eth0'
    master.vm.network :private_network, ip: "192.168.0.10"
    master.vm.network :private_network, ip: "#{privateSubnet}.#{privateStartingIp}", :netmask => "255.255.255.0", virtualbox__intnet: "cdhnetwork"
    master.vm.hostname = "cdh-master"

It turns out it doesn't matter that there are two private_network definitions on the master VM.

The scripts set up 80 GB of secondary storage on the cdh-master node, so please be sure you have plenty of free space on your HDD.

After the downloads and configuration complete, access the Hue interface on the host-only IP (http://192.168.0.10:8888) and create the first user; this user becomes the administrator of the Hadoop system.

The problem 

After the installation completed, I found that all of the VMs were running at 100% CPU usage, attributed to the flume user. It turns out the provisioning script copied the Flume configuration file verbatim, and that file is configured to use a continuous sequence generator as the event source. Changing the event source to syslogtcp and restarting the flume-ng-agent service cures this condition, as in the sketch below.
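
A minimal sketch of the corrected source stanza in the Flume config (the agent name, source name, and port here are assumptions; match them to whatever the provisioned /etc/flume-ng/conf/flume.conf actually defines):

# Replace the sequence-generator source with a syslog TCP source.
# 'agent' and 'src1' are placeholder names; use the ones in your flume.conf.
agent.sources = src1
agent.sources.src1.type = syslogtcp      # was: seq (continuous sequence generator)
agent.sources.src1.host = 0.0.0.0        # listen on all interfaces
agent.sources.src1.port = 5140           # assumed syslog port
agent.sources.src1.channels = ch1        # keep the existing channel binding

Then restart the agent, e.g. with: sudo service flume-ng-agent restart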

UPDATE:
It seems that the provisioned VMs all keep the default yarn.nodemanager.resource.memory-mb value of 8192 MB. Since these VMs only have 2048 MB of RAM, I created this property in /etc/hadoop/conf/yarn-site.xml and set the value to 1600, as shown below.
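
The stanza I added looks something like this (a sketch, with the value from the note above):

<!-- /etc/hadoop/conf/yarn-site.xml: cap NodeManager memory below the VM's 2048 MB -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>1600</value>
</property>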

UPDATE 2:
Somehow there are misconfigured lines in the created VMs. In yarn-site.xml, I needed to change yarn.nodemanager.remote-app-log-dir from hdfs:///var/log/hadoop-yarn/apps to hdfs://cdh-master:8020/var/log/hadoop-yarn/apps. I also needed to change dfs.namenode.name.dir in /etc/hadoop/conf/hdfs-site.xml to file:///dfs/dn to prevent 'could only be replicated to 0 nodes' errors. Both fixes are sketched below.
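
A sketch of the two corrected stanzas, with the values exactly as described above:

<!-- /etc/hadoop/conf/yarn-site.xml: fully qualify the NameNode URI -->
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>hdfs://cdh-master:8020/var/log/hadoop-yarn/apps</value>
</property>

<!-- /etc/hadoop/conf/hdfs-site.xml -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///dfs/dn</value>
</property>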

UPDATE 3:
After destroying and recreating all the VMs, it seems that dfs.datanode.data.dir also needs reconfiguring on each of the data nodes, making them all point to file:///dfs/dn; see the sketch below.
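
On each data node, the corrected stanza in /etc/hadoop/conf/hdfs-site.xml would look something like this:

<!-- /etc/hadoop/conf/hdfs-site.xml on each data node -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///dfs/dn</value>
</property>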

Installation 3

In parallel, I also tried a lightweight alternative to VMs, called Docker. The first challenge is that my PC runs Windows, so I installed boot2docker first, enabled VT in the BIOS, and then tried this one-liner from the blog post Ambari provisioned Hadoop cluster on Docker:

curl -LOs j.mp/ambari-singlenode && . ambari-singlenode

It finished, but somehow the web UI shows every service as offline. It needs some debugging to get it right, and at this point I have so little knowledge of what is happening behind the scenes that I'm postponing the debugging for later.
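
When I do get back to it, a reasonable first step would be something like the following (standard boot2docker and Docker CLI; the container name is an assumption, whatever the script actually created):

boot2docker ssh                 # enter the boot2docker VM from Windows
docker ps -a                    # list containers; check whether the Ambari one is running
docker logs ambari-singlenode   # inspect the container's logs (name is an assumption)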



