First steps with Hadoop

Background

One of the batch processes I run needs some improvement. It was built using PHP, parsing text files into a MySQL database. So, for a change, I tried to learn Hadoop and Hive.

Installation 1

Installing Hadoop is a bit more complex than installing MySQL. OK, so 'a bit' is an understatement. I tried to follow the Windows installation procedure from HadoopOnWindows. I downloaded the binary package instead of the source package, because I was not in the mood to wait for mvn to download an endless list of jars. Unfortunately, some errors prevented me from continuing down this path.

Installation 2

Virtual machines seem to be the way to go. Not wanting to spend too much time installing and configuring VMs, I installed Vagrant, a tool that downloads images and configures VMs automatically. VirtualBox is required as Vagrant's default provider, so I installed it too.
At first I tried to follow a blog post titled Installing a Hadoop Cluster in Three Commands, but somehow it didn't work either. So the steps below are copied from Gabriele Baldassarre's blog post, which provides a working Vagrantfile and a few shell scripts:
  • git clone https://github.com/theclue/cdh5-vagrant
  • cd cdh5-vagrant
  • vi Vagrantfile
  • vagrant up
I needed to change the network settings a bit in the Vagrantfile, because my server's external network is not DHCP and bridging is out of the question. What I need is for Vagrant to set up a host-only network that is not internal to VirtualBox.

Vagrantfile line 43:

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
  config.vm.box = "centos65-x86_64-20140116"
  config.vm.box_url = "https://github.com/2creatives/vagrant-centos/releases/download/v6.4.2/centos64-x86_64-20140116.box"
  config.vm.define "cdh-master" do |master|
#    master.vm.network :public_network, :bridge => 'eth0'
    master.vm.network :private_network, ip: "192.168.0.10"
    master.vm.network :private_network, ip: "#{privateSubnet}.#{privateStartingIp}", :netmask => "255.255.255.0", virtualbox__intnet: "cdhnetwork"
    master.vm.hostname = "cdh-master"

It turns out it doesn't matter that there are two private_network definitions on the master VM.

The scripts set up 80 GB of secondary storage on the cdh-master node, so please be sure you have plenty of free space on your HDD.

After the downloads and configuration complete, access the Hue interface on the host-only IP (http://192.168.0.10:8888) and create the first user; this user becomes the administrator of the Hadoop system.

The problem 

After the installation completed, I found that all of the VMs were running at 100% CPU usage, attributed to the flume user. It turns out the provisioning script copied the Flume configuration file verbatim, and that file is configured to use a continuous sequence generator as the event source. Changing the event source to syslogtcp and restarting the flume-ng-agent service cures this condition, as in the sketch below.
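
A minimal sketch of the corrected source stanza in the Flume config (the agent name, source name, and port here are assumptions; match them to whatever the provisioned /etc/flume-ng/conf/flume.conf actually defines):

# Replace the sequence-generator source with a syslog TCP source.
# 'agent' and 'src1' are placeholder names; use the ones in your flume.conf.
agent.sources = src1
agent.sources.src1.type = syslogtcp      # was: seq (continuous sequence generator)
agent.sources.src1.host = 0.0.0.0        # listen on all interfaces
agent.sources.src1.port = 5140           # assumed syslog port
agent.sources.src1.channels = ch1        # keep the existing channel binding

Then restart the agent, e.g. with: sudo service flume-ng-agent restart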

UPDATE:
It seems that the provisioned VMs all keep the default yarn.nodemanager.resource.memory-mb value of 8192 MB. Since these VMs only have 2048 MB of RAM, I created this property in /etc/hadoop/conf/yarn-site.xml and set the value to 1600, as shown below.
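
The stanza I added looks something like this (a sketch, with the value from the note above):

<!-- /etc/hadoop/conf/yarn-site.xml: cap NodeManager memory below the VM's 2048 MB -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>1600</value>
</property>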

UPDATE 2:
Somehow there are misconfigured lines in the created VMs. In yarn-site.xml, I needed to change yarn.nodemanager.remote-app-log-dir from hdfs:///var/log/hadoop-yarn/apps to hdfs://cdh-master:8020/var/log/hadoop-yarn/apps. I also needed to change dfs.namenode.name.dir in /etc/hadoop/conf/hdfs-site.xml to file:///dfs/dn to prevent 'could only be replicated to 0 nodes' errors. Both fixes are sketched below.
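
A sketch of the two corrected stanzas, with the values exactly as described above:

<!-- /etc/hadoop/conf/yarn-site.xml: fully qualify the NameNode URI -->
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>hdfs://cdh-master:8020/var/log/hadoop-yarn/apps</value>
</property>

<!-- /etc/hadoop/conf/hdfs-site.xml -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///dfs/dn</value>
</property>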

UPDATE 3:
After destroying and recreating all the VMs, it seems that dfs.datanode.data.dir also needs reconfiguring on each of the data nodes, making them all point to file:///dfs/dn; see the sketch below.
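
On each data node, the corrected stanza in /etc/hadoop/conf/hdfs-site.xml would look something like this:

<!-- /etc/hadoop/conf/hdfs-site.xml on each data node -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///dfs/dn</value>
</property>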

Installation 3

In parallel, I also tried a lightweight alternative to VMs, called Docker. The first challenge is that my PC runs Windows, so I installed boot2docker first, enabled VT in the BIOS, and then tried this one-liner from the blog post Ambari provisioned Hadoop cluster on Docker:

curl -LOs j.mp/ambari-singlenode && . ambari-singlenode

It finished, but somehow the web UI shows every service as offline. It needs some debugging to get it right, and at this point I have so little knowledge of what is happening behind the scenes that I'm postponing the debugging for later.
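
When I do get back to it, a reasonable first step would be something like the following (standard boot2docker and Docker CLI; the container name is an assumption, whatever the script actually created):

boot2docker ssh                 # enter the boot2docker VM from Windows
docker ps -a                    # list containers; check whether the Ambari one is running
docker logs ambari-singlenode   # inspect the container's logs (name is an assumption)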



