Sunday, December 28, 2014

Openshift Origin - Gear process hangs and ps as root also hangs

Background

A few months ago we installed an OpenShift Origin cluster on a few of our servers. We had tried OpenShift M3 with good results. After upgrading to M4 and deploying several live apps there, we found gears that hang every week or two and that cannot be recovered by stopping and starting the gear. After the third time this got troublesome.

Symptoms

- The application URL doesn't respond
- The haproxy-status URL shows gears in red status before eventually hanging itself
- A full process listing done by root on the node hangs when the process in question is being printed

Investigation

Before we understood the root cause, restarting the VM was sometimes the only solution when the problem occurred. Strace-ing the hanging ps shows that ps stops when trying to read some memory.
A few instances of the problem 'healed' by themselves after one day.
In some instances, killing the process with the cmdline anomaly solved the problem.
The real eye-opener was the blog post Reading /proc/pid/cmdline could hang forever. It never occurred to me that OpenShift Origin disables the kernel's out-of-memory (OOM) killer for its gears, which, when in effect, suspends the task/process that triggers the out-of-memory condition for the cgroup. The keyword is cgroup memory control.

Reconstruction

First let's watch the cgroup memory status. Under /cgroup/memory we have folders; we are interested in the files named memory.oom_control, because these files have an additional line that shows whether an out-of-memory (OOM) condition is in effect.
Monitoring OOM condition from root console
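A quick way to keep an eye on the flag is a one-liner like this (a sketch; it scans every memory cgroup on the node, so adjust /cgroup/memory if your gear cgroups sit in a subdirectory):

watch -n1 'grep -r . /cgroup/memory --include=memory.oom_control'   # refresh the oom flags once per second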
Let's try to consume memory. I chose to use rhc ssh to connect to the gear and, from bash, execute a loop that consumes memory. I used a script I found in a StackOverflow post.
A="0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"
for power in $(seq 8); do
  A="${A}${A}"
done
for power in $(seq 8); do
  A="${A}${A}"
done
for power in $(seq 8); do
  A="${A}${A}"
done

Meanwhile, check for the OOM condition from the root console. When OOM occurred, the rhc ssh session hung and the under_oom flag went to 1.

under_oom flag is now 1
strace ps auxw, hangs in root console
The ps hangs when it tries to read /proc/pid/cmdline. Let's try checking the process list with a different command, making use of the OpenShift gear id whose cgroup has under_oom set to 1.

ps -u works but ps -u -f hangs
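Concretely, something like this (a sketch; the gear id doubles as the gear's unix user name, and the UUID below is the example gear from this post):

ps -u 542e715098988b7c23000009        # works: the default listing shows only the short command name
ps -u 542e715098988b7c23000009 -f     # hangs: the full format has to read /proc/<pid>/cmdline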

Sometimes ^C works, sometimes it doesn't. This is a dangerous condition. Let's re-enable the OOM killer so we can proceed.
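Re-enabling it is a single write to the gear's memory.oom_control file (a sketch, run as root from the gear's memory cgroup directory, as in the console capture below):

echo 0 > memory.oom_control    # 0 = oom_kill_disable off, so the kernel may kill the offending task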

/var/log/messages shows us the result of re-enabling the OOM killer:

Dec 28 23:39:11 node3 kernel: 3687037 total pagecache pages
Dec 28 23:39:11 node3 kernel: 9138 pages in swap cache
Dec 28 23:39:11 node3 kernel: Swap cache stats: add 549704, delete 540566, find 268368/295738
Dec 28 23:39:11 node3 kernel: Free swap  = 33397076kB
Dec 28 23:39:11 node3 kernel: Total swap = 33554428kB
Dec 28 23:39:12 node3 kernel: 4194288 pages RAM
Dec 28 23:39:12 node3 kernel: 111862 pages reserved
Dec 28 23:39:12 node3 kernel: 2181578 pages shared
Dec 28 23:39:12 node3 kernel: 1915782 pages non-shared
Dec 28 23:39:12 node3 kernel: [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
Dec 28 23:39:12 node3 kernel: [32264]  6248 32264    65906      998   5       0             0 logshifter
Dec 28 23:39:12 node3 kernel: [32265]  6248 32265     5571      436   9       0             0 haproxy
Dec 28 23:39:12 node3 kernel: [32266]  6248 32266     2834      257   8       0             0 bash
Dec 28 23:39:12 node3 kernel: [32267]  6248 32267    49522     1563   6       0             0 logshifter
Dec 28 23:39:12 node3 kernel: [32281]  6248 32281    11893     1993   1       0             0 ruby
Dec 28 23:39:12 node3 kernel: [32368]  6248 32368   150982     4796   4       0             0 httpd
Dec 28 23:39:12 node3 kernel: [32369]  6248 32369    65906     1102   6       0             0 logshifter
Dec 28 23:39:12 node3 kernel: [32378]  6248 32378     1018      117   5       0             0 tee
Dec 28 23:39:12 node3 kernel: [32379]  6248 32379     1018      113   7       0             0 tee
Dec 28 23:39:12 node3 kernel: [32380]  6248 32380   152662     4506   6       0             0 httpd
Dec 28 23:39:12 node3 kernel: [32381]  6248 32381   152662     4504   0       0             0 httpd
Dec 28 23:39:12 node3 kernel: [ 8297]     0  8297    27799     1691   1       0             0 sshd
Dec 28 23:39:12 node3 kernel: [ 8304]  6248  8304    27799     1021   0       0             0 sshd
Dec 28 23:39:12 node3 kernel: [ 8305]  6248  8305   174666   112273   8       0             0 bash
Dec 28 23:39:12 node3 kernel: [ 9207]  6248  9207   150982     1839   7       0             0 httpd
Dec 28 23:39:12 node3 kernel: Memory cgroup out of memory: Kill process 8305 (bash) score 917 or sacrifice child
Dec 28 23:39:12 node3 kernel: Killed process 8305, UID 6248, (bash) total-vm:698664kB, anon-rss:447584kB, file-rss:1508kB

Check the OOM condition:
node3 542e715098988b7c23000009 # cat memory.oom_control
oom_kill_disable 0
under_oom 0
node3 542e715098988b7c23000009 #

We can see that under_oom has returned to 0. After that, I tried ps auxw as root and it no longer hangs.

Conclusion

The hanging of the root console seems to be an unexpected side effect of memory cgroups: they are supposed to suspend the application inside the cgroup, but they also suspend any other task that tries to access the app's memory.
When this condition happens, the solution is either to re-enable the OOM killer (as demonstrated above) or to increase the memory limit. Both steps resume the suspended task and prevent other tasks from hanging. It turns out that this is exactly what the OpenShift watchman does: increase the memory limit by 10% and restart the gear. The logic is in the oom_plugin.
The reason this happened to us is that watchman is not running. That is strange, because the installation team precisely followed the OpenShift Comprehensive Deployment Guide. It seems that the guide is missing the steps to enable the watchman service.
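If watchman is indeed missing, enabling it should be a matter of something like this on each node (a sketch; openshift-watchman is the service name I would expect, so verify it against your installed packages):

service openshift-watchman start
chkconfig openshift-watchman on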






Saturday, December 27, 2014

First steps with Hadoop

Background

I need some improvement in one of the batch processes I run. It was built using PHP, parsing text files into a MySQL database. So for a change I tried to learn Hadoop and Hive.

Installation 1

Hadoop is a bit more complex to install than MySQL. Ok, so 'a bit' is an understatement. I tried to follow the Windows installation procedure from HadoopOnWindows. I downloaded the binary package instead of the source package, because I was not in the mood to wait for mvn to download an endless list of jars. Well, some errors prevented me from continuing down this path.

Installation 2

Virtual machines seem to be the way to go. Not wanting to spend too much time installing and configuring VMs, I installed Vagrant, a tool that downloads images and configures VMs automatically. VirtualBox is required as Vagrant's default provider, so I installed it too.
At first I tried to follow the blog post titled Installing a Hadoop Cluster in Three Commands, but somehow that didn't work either. So the steps below are copied from Gabriele Baldassarre's blog post, which supplies a working Vagrantfile and a few shell scripts:
  • git clone https://github.com/theclue/cdh5-vagrant
  • cd cdh5-vagrant
  • vi Vagrantfile
  • vagrant up
I needed to change the network setting a bit in the Vagrantfile because my server's external network is not DHCP, and bridging is out of the question. What I need is for Vagrant to set up a host-only network that is not internal to VirtualBox. 

Vagrantfile line 43:

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
  config.vm.box = "centos65-x86_64-20140116"
  config.vm.box_url = "https://github.com/2creatives/vagrant-centos/releases/download/v6.4.2/centos64-x86_64-20140116.box"
  config.vm.define "cdh-master" do |master|
#    master.vm.network :public_network, :bridge => 'eth0'
    master.vm.network :private_network, ip: "192.168.0.10"
    master.vm.network :private_network, ip: "#{privateSubnet}.#{privateStartingIp}", :netmask => "255.255.255.0", virtualbox__intnet: "cdhnetwork"
    master.vm.hostname = "cdh-master"

So it doesn't matter that there are two private_network entries: the first gives the master a host-only IP reachable from the host, while the second is the internal cluster network.

The scripts set up 80 GB of secondary storage on the cdh-master node, so please be sure you have plenty of free space on the HDD.

After the downloads and configuration are complete, access the Hue interface on the host-only IP, http://192.168.0.10:8888, and create the first user; this user will become the administrator of this Hadoop system.

The problem 

After the installation completed, I found that all of the VMs were running at 100% CPU usage, attributed to the flume user. It turns out that the provisioning script copied the flume configuration file verbatim, which is configured to use a continuous sequence generator as the event source. Changing the event source to syslogtcp and restarting the flume-ng-agent service cures this condition.

UPDATE:
It seems that the provisioned VMs all have the default yarn.nodemanager.resource.memory-mb value, which is 8192 MB. For the 2048 MB VMs, I created this property in /etc/hadoop/conf/yarn-site.xml and set the value to 1600.

UPDATE 2:
Somehow there are misconfigured lines in the created VMs. In yarn-site.xml, I needed to change yarn.nodemanager.remote-app-log-dir from hdfs:///var/log/hadoop-yarn/apps to hdfs://cdh-master:8020/var/log/hadoop-yarn/apps. I also needed to change dfs.namenode.name.dir in /etc/hadoop/conf/hdfs-site.xml to file:///dfs/dn to prevent 'could only be replicated to 0 nodes' errors.

UPDATE 3:
After destroying and recreating all the VMs, it seems that dfs.datanode.data.dir also needs reconfiguring on each of the data nodes; make them all point to file:///dfs/dn.

Installation 3

In parallel, I also tried a lightweight kind of VM, called Docker. The first challenge is that my PC runs Windows, so I installed boot2docker first, enabled VT in the BIOS, and then tried this one-liner from the blog post Ambari provisioned Hadoop cluster on Docker:

curl -LOs j.mp/ambari-singlenode && . ambari-singlenode

It finished, but somehow the web UI shows everything as offline. It needs some debugging to get right, and at the moment I have so little knowledge of what is happening behind the scenes that I am postponing the debugging for later.
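When I get back to it, the first stop will probably be the container itself (a sketch; the container id is a placeholder):

docker ps                      # find the container the one-liner started
docker logs <container-id>     # read the Ambari server/agent output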




Saturday, December 20, 2014

Hex editing in Linux text console using Vim

I don't usually edit binary files. But when there is a need for binary editing, a capable tool is a must-have. In the past I used hexedit in the Linux text console. But yesterday I couldn't seem to find the correct package to install on one of our CentOS servers. To my surprise, Vim is perfectly capable of doing hex editing, if only you know the secret.

VIM binary mode

Did you know that Vim, our reliable text editor, has a binary mode option?

vim -b filename

If we don't enable binary mode, some EOL (end of line) characters will get converted to another form. Binary corruption is possible if Vim is still in text mode.

Convert to hex dump

Use this command to change the file into its hex dump:

 :%!xxd 

Edit the file


Edit the hex part of the file. The ASCII part on the right side will not get converted back to binary, so stick to the left and middle columns of the screen.

Convert back to binary

This command converts the hex dump back to binary:

 :%!xxd -r

Afterwards you can write the file back 

:%wq

Caveat: back up before you edit anything. Binary file editing is an error-prone procedure.
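If you prefer, the same round trip can be done with xxd alone, outside of Vim (a sketch; the file name is a placeholder):

cp blob.bin blob.bin.bak       # keep an untouched copy
xxd blob.bin > blob.hex        # dump to an editable hex listing
vi blob.hex                    # edit the hex columns only
xxd -r blob.hex > blob.bin     # rebuild the binary from the edited dump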

Encountering Zookeeper

I have tried several NoSQL databases, but have yet to see any in a production environment. Well, except for the MongoDB that was part of the OpenShift Origin cluster we installed in our data center. Last week's events made me interact with Apache Zookeeper, which is hidden inside three EMC VIPR controller nodes.

Basic Facts

Apache Zookeeper has the following characteristics:
- in-memory database system
- data is modeled as a tree, like a filesystem
- built using the Java programming language
- usually runs as a cluster of at least 3 hosts
- usually listens on port 2181

The Zookeeper cluster (called an ensemble) is supposed to be resilient to failure. As an in-memory database, it needs more memory than the entire data tree.

Any changes to the database are strictly ordered and coordinated between all nodes in the ensemble. At any time there must be a leader, and all other hosts become followers.

Checking a Zookeeper

Do a telnet to port 2181 and issue the 'ruok' command: type ruok, and a healthy Zookeeper will reply with 'imok' (see the sketch after this list). According to the Zookeeper Admin documentation, the four-letter commands recognized by Zookeeper versions below 3.3 are:
'stat' : print server statistics, a summary of the server and connected clients
'dump' : list outstanding sessions and ephemeral nodes, only works on the leader
'envi' : print details of the running environment
'srst' : reset server statistics
'ruok' : check that the server is running in a non-error state
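Here is the quick health check from a shell, using nc instead of telnet (a sketch; zk-host is a placeholder for one of the ensemble members):

echo ruok | nc zk-host 2181    # a healthy server answers: imok
echo stat | nc zk-host 2181    # shows the server's role (leader/follower), clients and latency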

Bugs

We were recently hit by ZOOKEEPER-1573, in which a Zookeeper is unable to load its database because an operation refers to a child of a data node that doesn't exist. The cause seems to be that Zookeeper snapshots are 'fuzzy': they are written while the tree is updating, so when the transaction log is replayed, some parts have already been applied and other parts have not. The fix seems to be either to upgrade the Zookeeper version so that such an operation is ignored, or to delete the problematic database and rely on another host's database to resynchronize.

Monday, October 20, 2014

Configuring Openshift Origin with S3-based persistent shared storage

This post describes the steps that I took to provide shared storage for an OpenShift Origin M4 installation. There were some difficulties that had to be solved by non-standard methods.

Requirement

When hosting applications on the OpenShift Origin platform, we are confronted with a bitter truth:
writing applications for cloud platforms requires us to avoid writing to local filesystems, and there is no support for storage shared between gears. But we still need to support multiple PHP applications that store their attachments in the local filesystem, with minimal code changes. So we need a way to quickly implement shared storage between gears of the same application. And maybe we could loosen the application isolation requirement just for the shared storage.

Basic Idea

The idea is to mount an S3 API-based storage on all nodes. Each gear could then refer to the application's folder inside the shared storage to store and retrieve file attachments. My implementation uses an EMC VIPR shared storage with an S3 API, which I assume is harder than using real Amazon S3 storage. I used the S3FS implementation from https://github.com/s3fs-fuse/s3fs-fuse to mount the S3 storage as folders.

Pitfalls

OpenShift gears are not allowed to write to arbitrary directories. The gears can't even peek into other gears' directories, which is restricted using SELinux Multi Category Security (MCS). Custom SELinux policies were implemented, complex enough that a run-of-the-mill admin will struggle to understand them. So mounting the S3 storage on the nodes is only half of the battle.
S3FS needs a newer version of Fuse than the one packaged with RHEL 6. And Fuse needs a bit of a patch to allow mounting the S3 storage using contexts other than fusefs_t.
Access control for a given directory is cached for a process's lifetime, so if a running httpd has been denied access, make sure it is restarted after we remount the S3 storage with a different context.
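When chasing the inevitable SELinux denials, these two commands are handy (a sketch; requires auditd to be running):

ausearch -m avc -ts recent            # recent SELinux denial records
ls -Zd /var/[our-shared-folder]       # verify the context the mount ended up with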

Step-by-step

First, make sure that your system is clean of the old fuse package, then download the latest fuse version from SourceForge and extract it.

# wget http://downloads.sourceforge.net/project/fuse/fuse-2.X/2.9.3/fuse-2.9.3.tar.gz
# tar xzf fuse-2.9.3.tar.gz
# cd fuse-2.9.3
# ./configure --prefix=/usr
# export PKG_CONFIG_PATH=/usr/lib/pkgconfig:/usr/lib64/pkgconfig/

We need to add the missing context option to lib/mount.c:
        FUSE_OPT_KEY("default_permissions",     KEY_KERN_OPT),
        FUSE_OPT_KEY("context=", KEY_KERN_OPT),
        FUSE_OPT_KEY("fscontext=",              KEY_KERN_OPT),
        FUSE_OPT_KEY("defcontext=",             KEY_KERN_OPT),
        FUSE_OPT_KEY("rootcontext=",            KEY_KERN_OPT),
        FUSE_OPT_KEY("max_read=",               KEY_KERN_OPT),
        FUSE_OPT_KEY("max_read=",               FUSE_OPT_KEY_KEEP),
        FUSE_OPT_KEY("user=",                   KEY_MTAB_OPT),

The inserted option lines (shown in bold in the original post) are what let the SELinux context mount options pass through. Make the change, save, then compile the whole thing.

# make
# make install
# ldconfig
# modprobe fuse
# pkg-config --modversion fuse

Now we can download the latest s3fs package.

# wget https://github.com/s3fs-fuse/s3fs-fuse/archive/master.zip
# unzip master.zip
# cd s3fs-fuse-master
# ./configure --prefix=/usr/local
# make
# make install

Put your AWS credentials or other access key in the .passwd-s3fs file. The syntax is:
accessKeyId:secretAccessKey
or
bucketName:accessKeyId:secretAccessKey
Ensure that ~/.passwd-s3fs is readable only by the user.

chmod 600 ~/.passwd-s3fs 

Let's mount the S3 storage.

s3fs always-on-non-core /var/[our-shared-folder] -o url=http://[server]:[port]/ -o use_path_request_style -o context=system_u:object_r:openshift_rw_file_t:s0 -o umask=000 -o allow_other

Change the our-shared-folder part to the mount point we want to use, and the server and port parts to the S3 service endpoint. If we are using real S3, we omit the -o url part. You might also want to omit use_path_request_style to use the newer (virtual-hosted) API style; we only need use_path_request_style when using an S3-compatible storage.

Configure the applications

As root in each node, create a folder for the application.
# mkdir /var/[our-shared-folder]/[appid]

Create a .openshift/action_hooks/build file inside the application's git repository, with the u+x bit set.
Fill it with:
#! /bin/bash
ln -sf /var/[our-shared-folder]/[appid] $OPENSHIFT_REPO_DIR/[appfolder]

Change the appfolder part to the folder where we want to store the attachments under the application's root directory. Afterwards we could create a file in the folder using PHP, like:

$f = fopen("/file1.txt","wb");


Reference :
https://code.google.com/p/s3fs/issues/detail?id=170
http://tecadmin.net/mount-s3-bucket-centosrhel-ubuntu-using-s3fs/

Saturday, October 4, 2014

Debugging Ruby code - Mcollective server

In this post I record the steps that I took to debug some Ruby code. The code in question is the Ruby mcollective server that is installed as part of an OpenShift Origin node. The bug is that the server consistently fails to respond to client queries in my configuration. I documented the steps taken even though I haven't nailed the bug yet.

First thing first

First we need to identify the entry point. These commands would do the trick:
[root@broker ~]# service ruby193-mcollective status
mcollectived (pid  1069) is running...
[root@broker ~]# ps afxw | grep 1069
 1069 ?        Sl     0:03 ruby /opt/rh/ruby193/root/usr/sbin/mcollectived --pid=/opt/rh/ruby193/root/var/run/mcollectived.pid --config=/opt/rh/ruby193/root/etc/mcollective/server.cfg
12428 pts/0    S+     0:00          \_ grep 1069

We found out that the service is :
  • running with pid 1069
  • running with configuration file /opt/rh/ruby193/root/etc/mcollective/server.cfg
  • service's source code is at /opt/rh/ruby193/root/usr/sbin/mcollectived

The most intrusive way yet the simplest

The simplest way is to insert 'puts' calls inside the code you want to debug. For objects, you want to call the inspect method.

But the code I am interested in is deep inside the call graph of mcollectived. I want to find out the details of the activemq subscription. Skipping hours of skimming the mcollective source (https://github.com/puppetlabs/marionette-collective/) and the OpenShift Origin mcollective server source (https://github.com/openshift/origin-server/tree/master/plugins/msg-node/mcollective), let's jump to the activemq.rb file:
[root@broker ~]# locate activemq.rb
/opt/rh/ruby193/root/usr/libexec/mcollective/mcollective/connector/activemq.rb

Let's hack some code (if you're doing this for real, make a backup first):
[root@broker ~]# vi /opt/rh/ruby193/root/usr/libexec/mcollective/mcollective/connector/activemq.rb

Add some puts calls here and there:
      # Subscribe to a topic or queue
      def subscribe(agent, type, collective)
        source = make_target(agent, type, collective)
        puts "XXXX subscribe to "
        puts agent
        puts type
        puts collective

And... it doesn't work, because the service redirects standard output to /dev/null. Ah. But not far from the make_target call there is a Log.debug call; let's imitate it:
      def subscribe(agent, type, collective)
        source = make_target(agent, type, collective)
        Log.debug("XXXX subscribe to #{agent} - #{type} - #{collective}")
        unless @subscriptions.include?(source[:id])
          Log.debug("Subscribing to #{source[:name]} with headers #{source[:headers].inspect.chomp}")

And we need to know where the log goes. Check the configuration file (or the /proc/[pid]/fd directory, if you want):

vi /opt/rh/ruby193/root/etc/mcollective/server.cfg

topicprefix = /topic/
main_collective = mcollective
collectives = mcollective
libdir = /opt/rh/ruby193/root/usr/libexec/mcollective
logfile = /var/log/openshift/node/ruby193-mcollective.log
loglevel = debug
daemonize = 1
direct_addressing = 1
registerinterval = 30

Restart the service :
service ruby193-mcollective restart

View the logs:
cat /var/log/openshift/node/ruby193-mcollective.log | grep XXX

[root@broker ~]# cat /var/log/openshift/node/ruby193-mcollective.log | grep XXX
D, [2014-10-04T09:59:22.392472 #17552] DEBUG -- : activemq.rb:371:in `subscribe' XXXX subscribe to discovery - broadcast - mcollective
D, [2014-10-04T09:59:26.049920 #17552] DEBUG -- : activemq.rb:371:in `subscribe' XXXX subscribe to openshift - broadcast - mcollective
D, [2014-10-04T09:59:26.095865 #17552] DEBUG -- : activemq.rb:371:in `subscribe' XXXX subscribe to rpcutil - broadcast - mcollective
D, [2014-10-04T09:59:26.191664 #17552] DEBUG -- : activemq.rb:371:in `subscribe' XXXX subscribe to mcollective - broadcast - mcollective
D, [2014-10-04T09:59:26.202263 #17552] DEBUG -- : activemq.rb:371:in `subscribe' XXXX subscribe to mcollective - directed - mcollective

There, I found what I came for: the parameters of the subscribe method calls.

The nonintrusive way, but not yet successful

Actually, we are not supposed to hack source code like that. Let's learn the real Ruby debugger.
Check the command line and then stop the service.
[root@broker ~]# ps auxw | grep mcoll
root     17552  0.5  4.3 378212 44520 ?        Sl   09:59   0:03 ruby /opt/rh/ruby193/root/usr/sbin/mcollectived --pid=/opt/rh/ruby193/root/var/run/mcollectived.pid --config=/opt/rh/ruby193/root/etc/mcollective/server.cfg
root     19873  0.0  0.0 103240   852 pts/0    S+   10:08   0:00 grep mcoll
[root@broker ~]# service ruby193-mcollective stop
Shutting down mcollective:                                 [  OK  ]
[root@broker ~]# ruby -rdebug  /opt/rh/ruby193/root/usr/sbin/mcollectived --pid=/opt/rh/ruby193/root/var/run/mcollectived.pid --config=/opt/rh/ruby193/root/etc/mcollective/server.cfg
/usr/lib/ruby/1.8/tracer.rb:16: Tracer is not a class (TypeError)
        from /usr/lib/ruby/1.8/debug.rb:10:in `require'
        from /usr/lib/ruby/1.8/debug.rb:10

Oops. Something is wrong. I used the built-in Ruby, which is 1.8, not 1.9.3. Let's try again.

[root@broker ~]# scl enable ruby193 bash
[root@broker ~]# ruby -rdebug  /opt/rh/ruby193/root/usr/sbin/mcollectived --pid=/opt/rh/ruby193/root/var/run/mcollectived.pid --config=/opt/rh/ruby193/root/etc/mcollective/server.cfg
Debug.rb
Emacs support available.

/opt/rh/ruby193/root/usr/sbin/mcollectived:3:require 'mcollective'
(rdb:1)

Now we are in rdb, the Ruby debugger. What are the commands?
(rdb:1) help
Debugger help v.-0.002b
Commands
  b[reak] [file:|class:]
  b[reak] [class.]
                             set breakpoint to some position
  wat[ch]       set watchpoint to some expression
  cat[ch] (|off)  set catchpoint to an exception
  b[reak]                    list breakpoints
  cat[ch]                    show catchpoint
  del[ete][ nnn]             delete some or all breakpoints
  disp[lay]     add expression into display expression list
  undisp[lay][ nnn]          delete one particular or all display expressions
  c[ont]                     run until program ends or hit breakpoint
  s[tep][ nnn]               step (into methods) one line or till line nnn
  n[ext][ nnn]               go over one line or till line nnn
  w[here]                    display frames
  f[rame]                    alias for where
  l[ist][ (-|nn-mm)]         list program, - lists backwards
                             nn-mm lists given lines
  up[ nn]                    move to higher frame
  down[ nn]                  move to lower frame
  fin[ish]                   return to outer frame
  tr[ace] (on|off)           set trace mode of current thread
  tr[ace] (on|off) all       set trace mode of all threads
  q[uit]                     exit from debugger
  v[ar] g[lobal]             show global variables
  v[ar] l[ocal]              show local variables
  v[ar] i[nstance]  show instance variables of object
  v[ar] c[onst]     show constants of object
  m[ethod] i[nstance]  show methods of object
  m[ethod]    show instance methods of class or module
  th[read] l[ist]            list all threads
  th[read] c[ur[rent]]       show current thread
  th[read] [sw[itch]]  switch thread context to nnn
  th[read] stop        stop thread nnn
  th[read] resume      resume thread nnn
  p expression               evaluate expression and print its value
  h[elp]                     print this help
           evaluate

Let's check out where we are (w).
(rdb:1) w
--> #1 /opt/rh/ruby193/root/usr/sbin/mcollectived:3
(rdb:1)

Ok, and list the source code (l):
(rdb:1) l
[-2, 7] in /opt/rh/ruby193/root/usr/sbin/mcollectived
   1  #!
   2
=> 3  require 'mcollective'
   4  require 'getoptlong'
   5
   6  opts = GetoptLong.new(
   7    [ '--help', '-h', GetoptLong::NO_ARGUMENT ],

Step to next line (n) :
(rdb:1) n
/opt/rh/ruby193/root/usr/sbin/mcollectived:4:require 'getoptlong'

The execution proceeds to the next line.
(rdb:1) n
/opt/rh/ruby193/root/usr/sbin/mcollectived:6:opts = GetoptLong.new(
(rdb:1) n
/opt/rh/ruby193/root/usr/sbin/mcollectived:12:if MCollective::Util.windows?
(rdb:1) n
/opt/rh/ruby193/root/usr/share/ruby/mcollective/util.rb:1:module MCollective

I found it a little strange that the debugger steps into another source file.

(rdb:1) n
/opt/rh/ruby193/root/usr/sbin/mcollectived:15:  configfile = "/opt/rh/ruby193/root/etc/mcollective/server.cfg"
(rdb:1) l
[10, 19] in /opt/rh/ruby193/root/usr/sbin/mcollectived
   10  )
   11
   12  if MCollective::Util.windows?
   13    configfile = File.join(MCollective::Util.windows_prefix, "etc", "server.cfg")
   14  else
=> 15    configfile = "/opt/rh/ruby193/root/etc/mcollective/server.cfg"
   16  end
   17  pid = ""
   18
   19  opts.each do |opt, arg|

But it quickly returns to the original source.

(rdb:1) n
/opt/rh/ruby193/root/usr/sbin/mcollectived:17:pid = ""
(rdb:1) n
/opt/rh/ruby193/root/usr/sbin/mcollectived:19:opts.each do |opt, arg|
(rdb:1) n
/opt/rh/ruby193/root/usr/sbin/mcollectived:31:config = MCollective::Config.instance
(rdb:1) n
/opt/rh/ruby193/root/usr/share/ruby/mcollective/config.rb:1:module MCollective
(rdb:1) n
/opt/rh/ruby193/root/usr/sbin/mcollectived:33:config.loadconfig(configfile) unless config.configured
(rdb:1) n
warn 2014/10/04 10:16:16: config.rb:117:in `block in loadconfig' Use of deprecated 'topicprefix' option.  This option is ignored and should be removed from '/opt/rh/ruby193/root/etc/mcollective/server.cfg'
/opt/rh/ruby193/root/usr/share/ruby/psych/core_ext.rb:16: `' (NilClass)
        from /opt/rh/ruby193/root/usr/share/rubygems/rubygems/custom_require.rb:36:in `require'
        from /opt/rh/ruby193/root/usr/share/rubygems/rubygems/custom_require.rb:36:in `require'
        from /opt/rh/ruby193/root/usr/libexec/mcollective/mcollective/facts/yaml_facts.rb:3:in `'
        from /opt/rh/ruby193/root/usr/libexec/mcollective/mcollective/facts/yaml_facts.rb:2:in `'
        from /opt/rh/ruby193/root/usr/libexec/mcollective/mcollective/facts/yaml_facts.rb:1:in `'
        from /opt/rh/ruby193/root/usr/share/ruby/mcollective/pluginmanager.rb:169:in `load'
        from /opt/rh/ruby193/root/usr/share/ruby/mcollective/pluginmanager.rb:169:in `loadclass'
        from /opt/rh/ruby193/root/usr/share/ruby/mcollective/config.rb:142:in `loadconfig'
        from /opt/rh/ruby193/root/usr/sbin/mcollectived:33:in `
'
/opt/rh/ruby193/root/usr/share/ruby/psych/core_ext.rb:16:  remove_method :to_yaml rescue nil
(rdb:1)

This is a NilClass error, similar to a NullPointerException, but I could proceed further into other code by repeatedly pressing n:

(rdb:1)
n
/opt/rh/ruby193/root/usr/share/ruby/psych/core_ext.rb:17:  alias :to_yaml :psych_to_yaml
(rdb:1) n
/opt/rh/ruby193/root/usr/share/ruby/psych/core_ext.rb:20:class Module
(rdb:1) n
/opt/rh/ruby193/root/usr/share/ruby/psych/core_ext.rb:29: `' (NilClass)
        from /opt/rh/ruby193/root/usr/share/rubygems/rubygems/custom_require.rb:36:in `require'
        from /opt/rh/ruby193/root/usr/share/rubygems/rubygems/custom_require.rb:36:in `require'
        from /opt/rh/ruby193/root/usr/libexec/mcollective/mcollective/facts/yaml_facts.rb:3:in `'
        from /opt/rh/ruby193/root/usr/libexec/mcollective/mcollective/facts/yaml_facts.rb:2:in `'
        from /opt/rh/ruby193/root/usr/libexec/mcollective/mcollective/facts/yaml_facts.rb:1:in `'
        from /opt/rh/ruby193/root/usr/share/ruby/mcollective/pluginmanager.rb:169:in `load'
        from /opt/rh/ruby193/root/usr/share/ruby/mcollective/pluginmanager.rb:169:in `loadclass'
        from /opt/rh/ruby193/root/usr/share/ruby/mcollective/config.rb:142:in `loadconfig'
        from /opt/rh/ruby193/root/usr/sbin/mcollectived:33:in `
'
/opt/rh/ruby193/root/usr/share/ruby/psych/core_ext.rb:29:  remove_method :yaml_as rescue nil
(rdb:1) n
/opt/rh/ruby193/root/usr/share/ruby/psych/core_ext.rb:30:  alias :yaml_as :psych_yaml_as
(rdb:1) n
/opt/rh/ruby193/root/usr/share/ruby/psych/core_ext.rb:33:if defined?(::IRB)
(rdb:1) n
/opt/rh/ruby193/root/usr/share/ruby/psych.rb:12:require 'psych/deprecated'
(rdb:1) n
/opt/rh/ruby193/root/usr/share/ruby/psych/deprecated.rb:79: `' (NilClass)
        from /opt/rh/ruby193/root/usr/share/rubygems/rubygems/custom_require.rb:36:in `require'
        from /opt/rh/ruby193/root/usr/share/rubygems/rubygems/custom_require.rb:36:in `require'
        from /opt/rh/ruby193/root/usr/libexec/mcollective/mcollective/facts/yaml_facts.rb:3:in `'
        from /opt/rh/ruby193/root/usr/libexec/mcollective/mcollective/facts/yaml_facts.rb:2:in `'
        from /opt/rh/ruby193/root/usr/libexec/mcollective/mcollective/facts/yaml_facts.rb:1:in `'
        from /opt/rh/ruby193/root/usr/share/ruby/mcollective/pluginmanager.rb:169:in `load'
        from /opt/rh/ruby193/root/usr/share/ruby/mcollective/pluginmanager.rb:169:in `loadclass'
        from /opt/rh/ruby193/root/usr/share/ruby/mcollective/config.rb:142:in `loadconfig'
        from /opt/rh/ruby193/root/usr/sbin/mcollectived:33:in `
'
/opt/rh/ruby193/root/usr/share/ruby/psych/deprecated.rb:79:  undef :to_yaml_properties rescue nil

(rdb:1) n
/opt/rh/ruby193/root/usr/share/ruby/psych.rb:94:module Psych
(rdb:1) n
/opt/rh/ruby193/root/usr/share/ruby/yaml.rb:86:    engine = 'psych'
(rdb:1) n
/opt/rh/ruby193/root/usr/share/ruby/yaml.rb:96:module Syck # :nodoc:
(rdb:1) n
/opt/rh/ruby193/root/usr/share/ruby/yaml.rb:100:module Psych # :nodoc:
(rdb:1) n
/opt/rh/ruby193/root/usr/share/ruby/yaml.rb:104:YAML::ENGINE.yamler = engine
(rdb:1) n
/opt/rh/ruby193/root/usr/libexec/mcollective/mcollective/facts/yaml_facts.rb:10:    class Yaml_facts
(rdb:1) n
/opt/rh/ruby193/root/usr/share/ruby/mcollective/facts/base.rb:1:module MCollective
(rdb:1) n
/opt/rh/ruby193/root/usr/share/ruby/mcollective/config.rb:143:        PluginManager.loadclass("Mcollective::Connector::#{@connector}")
(rdb:1) l
[138, 147] in /opt/rh/ruby193/root/usr/share/ruby/mcollective/config.rb
   138          if @logger_type == "syslog"
   139            raise "The sylog logger is not usable on the Windows platform" if Util.windows?
   140          end
   141
   142          PluginManager.loadclass("Mcollective::Facts::#{@factsource}_facts")
=> 143          PluginManager.loadclass("Mcollective::Connector::#{@connector}")
   144          PluginManager.loadclass("Mcollective::Security::#{@securityprovider}")
   145          PluginManager.loadclass("Mcollective::Registration::#{@registration}")
   146          PluginManager.loadclass("Mcollective::Audit::#{@rpcauditprovider}") if @rpcaudit
   147          PluginManager << {:type => "global_stats", :class => RunnerStats.new}

It seems that the context of those errors is the facts-loading process. The connector is of particular interest, because:

(rdb:1) n
/opt/rh/ruby193/root/usr/share/ruby/mcollective/config.rb:144:        PluginManager.loadclass("Mcollective::Security::#{@securityprovider}")
(rdb:1) PluginManager["connector_plugin"]
#"unset", "activemq.pool.size"=>"1", "activemq.pool.1.host"=>"broker.openshift.local", "activemq.pool.1.port"=>"61613", "activemq.pool.1.user"=>"mcollective", "activemq.pool.1.password"=>"marionette", "yaml"=>"/opt/rh/ruby193/root/etc/mcollective/facts.yaml"}, @connector="Activemq", @securityprovider="Psk", @factsource="Yaml", @identity="broker.openshift.local", @registration="Agentlist", @registerinterval=30, @registration_collective=nil, @classesfile="/var/lib/puppet/state/classes.txt", @rpcaudit=false, @rpcauditprovider="", @rpcauthorization=false, @rpcauthprovider="", @configdir="/opt/rh/ruby193/root/etc/mcollective", @color=true, @configfile="/opt/rh/ruby193/root/etc/mcollective/server.cfg", @logger_type="file", @keeplogs=5, @max_log_size=2097152, @rpclimitmethod=:first, @libdir=["/opt/rh/ruby193/root/usr/libexec/mcollective"], @fact_cache_time=300, @loglevel="debug", @logfacility="user", @collectives=["mcollective"], @main_collective="mcollective", @ssl_cipher="aes-256-cbc", @direct_addressing=true, @direct_addressing_threshold=10, @default_discovery_method="mc", @default_discovery_options=[], @ttl=60, @mode=:client, @publish_timeout=2, @threaded=false, @logfile="/var/log/openshift/node/ruby193-mcollective.log", @daemonize=true>, @subscriptions=[], @msgpriority=0, @base64=false>

Yes, it loads the Activemq connector. Let's create a breakpoint in the subscribe method:
(rdb:1) b PluginManager["connector_plugin"].subscribe
Set breakpoint 1 at #.subscribe

Then continue (c)..

(rdb:1) c
[root@broker ~]#

Wait. It stops. Browsing some source code tells me that the code forks somewhere after that, and the forked code seems to be detached from the debugger. So it is a dead end for now.

Well, that's all for the record. I can't promise that there will be a more successful debugging session, but I surely hope there will be.

How to move an EC2 Instance to another region

In this post I describe the process of moving an EC2 instance to another region.

The background

I have a server in one of the EC2 regions that is a bit pricier than the rest. It seemed that moving it to another region would save me some bucks. Well, it turns out that I made a few blunders that may have made the savings negligible.

The initial plan

I read that snapshots can be copied to other regions. So the original plan was to create snapshots of the existing volumes that back the instance (I have one instance with three EBS volumes), copy these to another region, and create a new instance in the new region.

The mistake

My mistake was assuming that creating a new instance is a simple matter of selecting the platform (i386 or x86_64) and the root EBS volume. Actually, it is not. First we create an AMI (Amazon Machine Image) from an EBS snapshot, not an EBS volume. Then we can launch a new instance based on that AMI. As shown below, when creating a new AMI from a snapshot we need to choose:

  • Architecture (i386 or x86_64)
  • Root device name - I knew this one
  • RAM disk ID 
  • Virtualization type - I chose paravirtual because that's what the original instance is
  • Kernel ID


The problem is, I could not find a Kernel ID in the new region that matches the Kernel ID in the original region. Choosing the defaults for these two parameters resulted in an instance that was unable to boot successfully.


The real deal

So, it turns out that I chose the wrong path. From the instance, I could simply Create Image, and after the image is created, copy it to another region.






After copying the image, we could launch a new instance based on the image.



Summary

Now we understand that the most efficient way to copy an instance to another region is to create an AMI from the instance, copy it to the other region, and launch the AMI in the new region.
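For the record, the same three steps can be scripted with the AWS CLI (a sketch; ids, names and regions are placeholders):

aws ec2 create-image --region us-east-1 --instance-id i-xxxxxxxx --name "my-server"
aws ec2 copy-image --region eu-west-1 --source-region us-east-1 --source-image-id ami-xxxxxxxx --name "my-server"
aws ec2 run-instances --region eu-west-1 --image-id ami-yyyyyyyy --instance-type t2.micro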



How to Peek inside your ActiveMQ Server

This post describes steps that sysadmins can take to peek inside an ActiveMQ server. We assume root capability; otherwise we need a user that has access to the ActiveMQ configuration files.

Step 1. Determine running ActiveMQ process

ps auxw | grep activemq

We got a java process running ActiveMQ :

[root@broker ~]# ps auxw | grep activemq
activemq  1236  0.1  0.0  19124   696 ?        Sl   07:00   0:02 /usr/lib/activemq/linux/wrapper /etc/activemq/wrapper.conf wrapper.syslog.ident=ActiveMQ wrapper.pidfile=/var/run/activemq//ActiveMQ.pid wrapper.daemonize=TRUE wrapper.lockfile=/var/lock/subsys/ActiveMQ
activemq  1243  3.2 12.2 2016568 125264 ?      Sl   07:00   1:06 java -Dactivemq.home=/usr/share/activemq -Dactivemq.base=/usr/share/activemq -Djavax.net.ssl.keyStorePassword=password -Djavax.net.ssl.trustStorePassword=password -Djavax.net.ssl.keyStore=/usr/share/activemq/conf/broker.ks -Djavax.net.ssl.trustStore=/usr/share/activemq/conf/broker.ts -Dcom.sun.management.jmxremote -Dorg.apache.activemq.UseDedicatedTaskRunner=true -Djava.util.logging.config.file=logging.properties -Dactivemq.conf=/usr/share/activemq/conf -Dactivemq.data=/usr/share/activemq/data -Xmx1024m -Djava.library.path=/usr/share/activemq/bin/linux-x86-64/ -classpath /usr/share/activemq/bin/wrapper.jar:/usr/share/activemq/bin/activemq.jar -Dwrapper.key=zvZTrwPTV6sBMrMd -Dwrapper.port=32000 -Dwrapper.jvm.port.min=31000 -Dwrapper.jvm.port.max=31999 -Dwrapper.pid=1236 -Dwrapper.version=3.2.3 -Dwrapper.native_library=wrapper -Dwrapper.service=TRUE -Dwrapper.cpu.timeout=10 -Dwrapper.jvmid=1 org.tanukisoftware.wrapper.WrapperSimpleApp org.apache.activemq.console.Main start
root     10249  0.0  0.0 103244   860 pts/0    S+   07:35   0:00 grep activemq

From the result above, we know that the configuration file is in /usr/share/activemq/conf

Step 2. Determine whether ActiveMQ console are enabled

vi /usr/share/activemq/conf/activemq.xml

Find the part that imports jetty.xml (typically an <import resource="jetty.xml"/> element) and make sure it is enabled, i.e. not commented out.
Also check jetty.xml for the console's port number.
vi /usr/share/activemq/conf/jetty.xml

Step 3. If we had changed activemq.xml, restart it


service activemq restart

Step 4. Obtain admin password

 vi /usr/share/activemq/conf/jetty-realm.properties

Right next to "admin:" is the admin's password.

Step 5. Finally, we could browse to localhost port 8161


If the server is not your localhost, use SSH tunneling to forward port 8161 to 127.0.0.1:8161; otherwise, just open a browser and go to http://localhost:8161/.
Use the admin password we got in step 4. No, you must check your own admin password; I won't tell you mine.
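A local port forward for that looks like this (a sketch; the host name is a placeholder):

ssh -L 8161:127.0.0.1:8161 root@your-activemq-server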

http://localhost:8161/
Click on the 'Manage ActiveMQ broker'.

home
Click on the Connections on the top menu.
Now we see one client using Stomp connected to the ActiveMQ server. Click on it.


The client, in this case an OpenShift Origin node in the same VM as the broker, is registered as a listener for:

  • Queue mcollective.nodes
  • Topic mcollective.discovery.agent
  • Topic mcollective.mcollective.agent
  • Topic mcollective.rpcutil.agent
  • Topic mcollective.openshift.agent

Summary

In this post, I have shown how to enable the ActiveMQ web console in an ActiveMQ server configuration, and how to use it to examine a client connected to the server.

Friday, October 3, 2014

Verification of Node installation in Openshift Origin M4

The OpenShift Origin Comprehensive Installation Guide (http://openshift.github.io/documentation/oo_deployment_guide_comprehensive.html) states that there are several things that can be done to ensure a node is ready for integration into an OpenShift cluster:

  • built-in script to check the node : 
    • oo-accept-node
  • check that facter runs properly :
    • /etc/cron.minutely/openshift-facts
  • check that mcollective communication works :
    • in the broker, run : oo-mco ping 
What I found is that this is not enough. For example, openshift-facts shows blanks even when there is an error in the facter functionality. So check facter directly with:
  • facter
And oo-mco ping works fine even when there is something wrong with the RPC channel. I would suggest running these on the broker:
  • oo-mco facts kernel
  • oo-mco inventory

In one of our OpenShift Origin M4 clusters, I have these lines in /opt/rh/ruby193/root/etc/mcollective/server.cfg:

main_collective = mcollective
collectives = mcollective
direct_access = 1

When I changed direct_access to 0, the oo-mco facts command doesn't work, and neither does oo-admin-ctl-district -c add-node -n -i.

On the other cluster, I have these lines :

topicprefix = /topic/
main_collective = mcollective
collectives = mcollective
direct_access = 0

And the nodes work, albeit with warnings about topicprefix.

Additional notes :
Facter errors in my VMs (which have eth1 as the only working network interface) were fixed by ensuring that /etc/openshift/node.conf contains these lines:
PUBLIC_NIC="eth1"
EXTERNAL_ETH_DEV="eth1"
INTERNAL_ETH_DEV="eth1"