Sunday, December 28, 2014

Openshift Origin - Gear process hangs and ps as root also hangs

Background

A few months ago we installed an OpenShift Origin cluster on a few of our servers. We had tried OpenShift M3 with good results. After upgrading to M4 and deploying several live apps there, we found gears that hang every week or two and cannot be revived by stopping and starting the gear. After the third occurrence this was getting troublesome.

Symptoms

- The application URL doesn't respond
- The haproxy-status URL shows the gears in red before it eventually hangs as well
- A full process listing done by root on the node hangs when the process in question is about to be printed

Investigation

Before we understood the root cause, restarting the VM was sometimes the only remedy when the problem occurred. Strace-ing the hanging ps shows that ps stops while trying to read some memory.
A few instances of the problem 'healed' by themselves after one day.
In some instances, killing the process with the cmdline anomaly solved the problem.
The real eye-opener was the blog post Reading /proc/pid/cmdline could hang forever. It had never occurred to me that OpenShift Origin disables the kernel's out-of-memory (OOM) killer for its gears; when the OOM condition is in effect, the kernel suspends the task/process that triggered it for that cgroup instead of killing it. The keyword is cgroup memory control.

Reconstruction

First, let's watch the cgroup memory status. Under /cgroup/memory we have a folder per cgroup; we are interested in the files named memory.oom_control, because these files have an additional line that shows whether an out-of-memory (OOM) condition is currently happening.
Monitoring OOM condition from root console
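In command form, that monitoring amounts to something like this (the cgroup path is an assumption about how this node lays out gear cgroups; adjust it to your own):

watch -n1 cat /cgroup/memory/openshift/542e715098988b7c23000009/memory.oom_control
# the file shows two flags:
#   oom_kill_disable 1   <- OpenShift turns the OOM killer off for gears
#   under_oom 0          <- becomes 1 while the gear is stuck at its memory limit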
Now let's try to consume memory. I chose to use rhc ssh to connect to the gear and, from bash, execute a loop that consumes memory. I used a script I found in a StackOverflow post.
A="0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"
for power in $(seq 8); do
  A="${A}${A}"
done
for power in $(seq 8); do
  A="${A}${A}"
done
for power in $(seq 8); do
  A="${A}${A}"
done

Meanwhile, check for the OOM condition from the root console. When OOM occurs, the rhc ssh session hangs and the under_oom flag becomes 1.

under_oom flag is now 1
strace ps auxw, hangs in root console
The ps hangs when it tries to read /proc/pid/cmdline. Let's try checking the process list with a different command, making use of the OpenShift gear id whose cgroup has under_oom set to 1.

ps -u works but ps -u -f hangs
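In other words (the uid below is the gear's UNIX user, which on OpenShift nodes equals the gear id):

ps -u 542e715098988b7c23000009        # works: only needs the process table, not cmdline
ps -u 542e715098988b7c23000009 -f     # hangs: the full format reads /proc/<pid>/cmdline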

Sometimes ^C works, sometimes it doesn't. This is a dangerous condition. Let's re-enable the OOM killer so that we can proceed.
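That is a single write into the gear's memory.oom_control; a minimal sketch, assuming the same cgroup directory as the prompt shown further below:

cd /cgroup/memory/openshift/542e715098988b7c23000009
echo 0 > memory.oom_control    # oom_kill_disable back to 0: the kernel may kill inside this cgroup again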

/var/log/messages shows us the result of re-enabling the OOM killer:

Dec 28 23:39:11 node3 kernel: 3687037 total pagecache pages
Dec 28 23:39:11 node3 kernel: 9138 pages in swap cache
Dec 28 23:39:11 node3 kernel: Swap cache stats: add 549704, delete 540566, find 268368/295738
Dec 28 23:39:11 node3 kernel: Free swap  = 33397076kB
Dec 28 23:39:11 node3 kernel: Total swap = 33554428kB
Dec 28 23:39:12 node3 kernel: 4194288 pages RAM
Dec 28 23:39:12 node3 kernel: 111862 pages reserved
Dec 28 23:39:12 node3 kernel: 2181578 pages shared
Dec 28 23:39:12 node3 kernel: 1915782 pages non-shared
Dec 28 23:39:12 node3 kernel: [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
Dec 28 23:39:12 node3 kernel: [32264]  6248 32264    65906      998   5       0             0 logshifter
Dec 28 23:39:12 node3 kernel: [32265]  6248 32265     5571      436   9       0             0 haproxy
Dec 28 23:39:12 node3 kernel: [32266]  6248 32266     2834      257   8       0             0 bash
Dec 28 23:39:12 node3 kernel: [32267]  6248 32267    49522     1563   6       0             0 logshifter
Dec 28 23:39:12 node3 kernel: [32281]  6248 32281    11893     1993   1       0             0 ruby
Dec 28 23:39:12 node3 kernel: [32368]  6248 32368   150982     4796   4       0             0 httpd
Dec 28 23:39:12 node3 kernel: [32369]  6248 32369    65906     1102   6       0             0 logshifter
Dec 28 23:39:12 node3 kernel: [32378]  6248 32378     1018      117   5       0             0 tee
Dec 28 23:39:12 node3 kernel: [32379]  6248 32379     1018      113   7       0             0 tee
Dec 28 23:39:12 node3 kernel: [32380]  6248 32380   152662     4506   6       0             0 httpd
Dec 28 23:39:12 node3 kernel: [32381]  6248 32381   152662     4504   0       0             0 httpd
Dec 28 23:39:12 node3 kernel: [ 8297]     0  8297    27799     1691   1       0             0 sshd
Dec 28 23:39:12 node3 kernel: [ 8304]  6248  8304    27799     1021   0       0             0 sshd
Dec 28 23:39:12 node3 kernel: [ 8305]  6248  8305   174666   112273   8       0             0 bash
Dec 28 23:39:12 node3 kernel: [ 9207]  6248  9207   150982     1839   7       0             0 httpd
Dec 28 23:39:12 node3 kernel: Memory cgroup out of memory: Kill process 8305 (bash) score 917 or sacrifice child
Dec 28 23:39:12 node3 kernel: Killed process 8305, UID 6248, (bash) total-vm:698664kB, anon-rss:447584kB, file-rss:1508kB

Check the OOM condition again:
node3 542e715098988b7c23000009 # cat memory.oom_control
oom_kill_disable 0
under_oom 0
node3 542e715098988b7c23000009 #

We can see that under_oom has returned to 0. After that, I tried ps auxw as root and it no longer hangs.

Conclusion

The hanging of the root console seems to be an unexpected side effect of the memory cgroup: it is supposed to suspend only the application inside the cgroup, but it also suspends any other task that tries to access the frozen app's memory, such as ps reading /proc/pid/cmdline.
When this condition happens, the solution is either to re-enable the OOM killer (as demonstrated above) or to increase the memory limit. Either step resumes the suspended task and stops other tasks from hanging. It turns out this is exactly what OpenShift's watchman does: increase the memory limit by 10% and restart the gear. The logic is in its oom_plugin.
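The alternative route, raising the limit by hand, looks roughly like this (the 10% figure mirrors what the oom_plugin is said to do; the memsw file and exact paths are assumptions, not the plugin's actual code):

cd /cgroup/memory/openshift/542e715098988b7c23000009
cur=$(cat memory.limit_in_bytes)
new=$((cur + cur / 10))                    # about 10% of extra head room
echo $new > memory.memsw.limit_in_bytes    # raise the mem+swap limit first, it must stay >= the memory limit
echo $new > memory.limit_in_bytes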
The reason this happened to us at all is that watchman was not running. That is strange, because the installation team followed the OpenShift Comprehensive Deployment Guide precisely. It seems the guide is missing the steps to enable the watchman service.
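If your nodes have the same gap, enabling it looks something like this (the init script name openshift-watchman is what our M4 nodes use; verify it on yours):

chkconfig openshift-watchman on    # start at boot
service openshift-watchman start
service openshift-watchman status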






Saturday, December 27, 2014

First steps with Hadoop

Background

I need some improvement in one of the batch processes I run. It was built using PHP, parsing text files into a MySQL database. So, for a change, I tried to learn Hadoop and Hive.

Installation 1

Hadoop is a bit more complex to install than MySQL. OK, so 'a bit' is an understatement. I tried to follow the Windows installation procedure from HadoopOnWindows. I downloaded the binary package instead of the source package, because I was not in the mood to wait for mvn to download an endless list of jars. Some errors prevented me from continuing down this path.

Installation 2

Virtual machines seem to be the way to go. Not wanting to spend too much time installing and configuring VMs, I installed Vagrant, a tool to download images and configure VMs automatically. VirtualBox is required as Vagrant's default provider, so I installed it too.
At first I tried to follow a blog post titled Installing a Hadoop Cluster in Three Commands, but that somehow didn't work either. So the steps below are copied from Gabriele Baldassarre's blog post, which supplies a working Vagrantfile and a few shell scripts:
  • git clone https://github.com/theclue/cdh5-vagrant
  • cd cdh5-vagrant
  • vi Vagrantfile
  • vagrant up
I needed to change the network settings a bit in the Vagrantfile, because my server's external network is not DHCP and bridging is out of the question. What I need is for Vagrant to set up a host-only network that is not internal to VirtualBox.

Vagrantfile line 43:

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
  config.vm.box = "centos65-x86_64-20140116"
  config.vm.box_url = "https://github.com/2creatives/vagrant-centos/releases/download/v6.4.2/centos64-x86_64-20140116.box"
  config.vm.define "cdh-master" do |master|
#    master.vm.network :public_network, :bridge => 'eth0'
    master.vm.network :private_network, ip: "192.168.0.10"
    master.vm.network :private_network, ip: "#{privateSubnet}.#{privateStartingIp}", :netmask => "255.255.255.0", virtualbox__intnet: "cdhnetwork"
    master.vm.hostname = "cdh-master"

It doesn't matter that there are two private_network entries: the first one is the host-only address reachable from the host, and the second is the internal network shared by the cluster nodes.

The scripts set up 80 GB of secondary storage on the cdh-master node, so make sure you have plenty of free space on the HDD.

After the downloads and configuration complete, access the Hue interface on the host-only IP (http://192.168.0.10:8888) and create the first user; this user will become the administrator of the Hadoop system.

The problem 

After the installation completed, I found all of the VMs running at 100% CPU usage, attributed to the flume user. It turns out that the provisioning script copies the flume configuration file verbatim, and that file is configured to use a continuous sequence generator as the event source. Changing the event source to syslogtcp and restarting the flume-ng-agent service cures this condition.
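The change itself is a couple of lines in the flume configuration; a sketch, assuming the default CDH config path and an agent/source named agent/source1 (check your own flume.conf for the real names):

# in /etc/flume-ng/conf/flume.conf
#   before:  agent.sources.source1.type = seq
#   after:   agent.sources.source1.type = syslogtcp
#            agent.sources.source1.host = 0.0.0.0
#            agent.sources.source1.port = 5140
service flume-ng-agent restart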

UPDATE:
It seems that the provisioned VMs all keep the default yarn.nodemanager.resource.memory-mb value, which is 8096 MB. For the 2048 MB VMs, I created this property in /etc/hadoop/conf/yarn-site.xml and set the value to 1600.
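The property block I added, inside the <configuration> element of yarn-site.xml on each 2048 MB node:

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>1600</value>
</property>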

UPDATE 2:
Somehow there are misconfigured lines in the created VMs. In yarn-site.xml, I needed to change yarn.nodemanager.remote-app-log-dir from hdfs:///var/log/hadoop-yarn/apps to hdfs://cdh-master:8020/var/log/hadoop-yarn/apps. I also needed to change dfs.namenode.name.dir in /etc/hadoop/conf/hdfs-site.xml to file:///dfs/dn to prevent 'could only be replicated to 0 nodes' errors.
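Spelled out, those two edits are the following property blocks (values exactly as above; adjust the namenode host and port to your cluster):

<!-- /etc/hadoop/conf/yarn-site.xml -->
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>hdfs://cdh-master:8020/var/log/hadoop-yarn/apps</value>
</property>

<!-- /etc/hadoop/conf/hdfs-site.xml -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///dfs/dn</value>
</property>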

UPDATE 3:
After destroying and recreating all the VMs, it seems that dfs.datanode.data.dir also needs reconfiguring on each of the data nodes, making them all point to file:///dfs/dn.

Installation 3

In parallel, I also tried a lightweight alternative to VMs, called Docker. The first challenge is that my PC runs Windows, so I installed boot2docker first, enabled VT in the BIOS, then tried this one-liner from the blog post Ambari provisioned Hadoop cluster on Docker:

curl -LOs j.mp/ambari-singlenode && . ambari-singlenode

It finished, but somehow the web UI shows everything as offline. It needs some debugging to get right, and at the moment I have too little knowledge of what is happening behind the scenes, so I'm postponing the debugging for later.




Saturday, December 20, 2014

Hex editing in Linux text console using Vim

I don't usually edit binary files. But when binary editing is needed, a capable tool is a must-have. In the past I used hexedit in the Linux text console, but yesterday I couldn't find the right package to install on one of our CentOS servers. To my surprise, Vim is perfectly capable of hex editing, if only you know the secret.

VIM binary mode

Did you know that Vim, our reliable text editor, has a binary mode option?

vim -b filename

If we don't enable binary mode, some EOL (end of line) characters will get converted to another form; binary corruption is possible if Vim is still in text mode.

Convert to hex dump

Use this command to convert the buffer into its hex dump:

 :%!xxd 

Edit the file


Edit the hex part of the file. The ASCII part on the right side will not get converted back to binary, so stick to the left and middle columns of the screen.

Convert back to binary

This command will convert the hex dump back to binary:

 :%!xxd -r

Afterwards you can write the file back:

:%wq

Caveats: back up before you edit anything. Binary file editing is an error-prone procedure.
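Putting the whole session together (the file name is just an example):

cp data.bin data.bin.bak    # keep a backup first
vim -b data.bin             # open in binary mode
# inside vim:
:%!xxd                      # convert the buffer to a hex dump
                            # edit the hex columns, not the ASCII column
:%!xxd -r                   # convert the hex dump back to binary
:wq                         # write and quit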

Encountering Zookeeper

I have tried several NoSQL databases, but have yet to see one in a production environment. Well, except for the MongoDB that is part of the OpenShift Origin cluster we installed in our data center. Last week's events made me interact with Apache Zookeeper, which is hidden inside three EMC ViPR controller nodes.

Basic Facts

Apache Zookeeper has the following characteristics:
- in-memory database system
- data is modeled as a tree, like a filesystem
- built using the Java programming language
- usually runs as a cluster of at least 3 hosts
- usually listens on port 2181

The zookeeper cluster (called an ensemble) is supposed to be resilient to failure. As an in-memory database, it needs more memory than the size of the entire data tree.

All changes to the database are strictly ordered and coordinated between all nodes in the ensemble. At any time there must be one leader, and all other hosts become followers.

Checking a Zookeeper

Telnet to port 2181 and issue the 'ruok' command; a healthy zookeeper will reply with 'imok' (see the example after the list). Per the Zookeeper Admin documentation, the four-letter commands recognized by zookeeper versions below 3.3 are:
'stat' : print server statistics, a summary of the server and connected clients
'dump' : list outstanding sessions and ephemeral nodes, only works on the leader
'envi' : print details of the running environment
'srst' : reset server statistics
'ruok' : check that the server is running in a non-error state
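A quick health check from the shell, using nc instead of an interactive telnet session (the hostname is just an example):

echo ruok | nc zk-host 2181    # a healthy server replies: imok
echo stat | nc zk-host 2181    # version, latency counters, connections, and leader/follower mode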

Bugs

We were recently hit by ZOOKEEPER-1573, where zookeeper is unable to load its database because a transaction refers to a child of a data node that doesn't exist. The cause seems to be that zookeeper snapshots are 'fuzzy': they are written while the tree is being updated, so when the transaction log is replayed against the snapshot, some of its operations have already been applied and others have not. The fix seems to be either to upgrade zookeeper to a version that ignores such operations, or to delete the problematic database on the broken host and let it resynchronize from the other hosts.
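For the second option, a rough sketch of what that recovery could look like (the service name and dataDir path are assumptions; check zoo.cfg, keep a copy of the data, and only do this on the broken host while the rest of the ensemble is healthy):

service zookeeper stop
mv /var/lib/zookeeper/version-2 /var/lib/zookeeper/version-2.broken   # set aside, don't delete outright
service zookeeper start          # the node pulls a fresh copy of the tree from the leader
echo ruok | nc localhost 2181    # should answer imok once it has rejoined the ensemble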