Saturday, November 12, 2016

How to create LVM volume with thin provisioning

This post shows how to create an LVM volume with thin provisioning, that is, a volume where only the ranges that are actually used get allocated.

Check volume groups

First, check the LVM volume groups to find out which VG has space for our thin volume pool.

vgdisplay

Choose one of the volume groups with sufficient space. Because we are using thin provisioning, the pool can be smaller than the total space we would need with normal provisioning.

Second, also check the existing logical volumes.

lvs

Creating thin volume pool


Next, we create a thin volume pool in the chosen volume group (for example, vgdata).

lvcreate -L 50G --thinpool globalthinpool vgdata

Print the resulting volumes using lvs :


We see that globalthinpool is created with a logical size of 50 gigabytes.

Creating thinly provisioned volume

Now we create a thinly provisioned volume using the previously created pool.

lvcreate -V100G -T vgdata/globalthinpool -n dockervol

The command creates a 100 GB logical volume using thin provisioning. Note that the volume is 100 GB, which is larger than the thin pool itself. Beware that we must monitor the actual usage, because if the 50 GB pool runs out of space, the programs using the volume will freeze. See http://unix.stackexchange.com/questions/197412/thin-lvm-pool-frozen-due-to-lack-of-free-space-what-to-do if you encounter such a condition (hint: something like lvresize -L +100g vgdata/globalthinpool).
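To keep an eye on the pool, lvs can report the data and metadata usage percentages, and lvextend grows the pool; for example (names follow the example above, sizes are illustrative):

lvs -o lv_name,lv_size,data_percent,metadata_percent vgdata
lvextend -L +20G vgdata/globalthinpool

LVM can also auto-extend the pool via the thin_pool_autoextend_threshold and thin_pool_autoextend_percent settings in the activation section of /etc/lvm/lvm.conf.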

The result is shown in the picture below:

Complications


If we have errors such as /usr/sbin/thin_check execvp failed, usually this means thin_check is not installed yet. On Ubuntu, install it with:

apt-get install thin-provisioning-tools

Formatting the new volume


Before the new volume can be used, format it using mkfs.ext4:

mkfs.ext4 /dev/mapper/vgdata-dockervol


Now we can mount it:

mkdir /mnt/dockervol
mount /dev/mapper/vgdata-dockervol /mnt/dockervol
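To make the mount persistent across reboots, an /etc/fstab entry along these lines can be added (using the names from the example above):

/dev/mapper/vgdata-dockervol /mnt/dockervol ext4 defaults 0 2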

Conclusion



With LVM thin provisioning, only the blocks that are actually used get allocated in the volume group.


Note: I have also uploaded a screencast of the session to YouTube:


Sunday, October 30, 2016

How to Run X Windows Server inside Docker Container

Background

Sometimes I need to run X Windows-based applications inside Docker containers, and running the X server locally is impractical because of latency, or because the laptop I work on has no X server installed. I first tried to create a VirtualBox-based VNC server, and it worked fine albeit a little slow, but Docker containers seem to have a smaller memory and disk footprint. So I tried to create a VNC server running X Windows inside a Docker container. I had already tried suchja/x11server (ref), but it had a strange problem of ignoring my MacBook's cursor keys on WebKit pages (such as Pentaho Data Integration's Formula page).

Starting point

Many of my Docker images are based on Debian Jessie, so I started from the instructions in this DigitalOcean article: https://www.digitalocean.com/community/tutorials/how-to-set-up-vnc-server-on-debian-8. This VNC server is based on the XFCE Desktop Environment. The steps are basically to install:
  • xfce4 
  • xfce4-goodies 
  • gnome-icon-theme 
  • tightvncserver
  • iceweasel
After that, run vncserver :[display-number] and set a user-specified password when prompted.
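For example (the display number, geometry, and depth are just illustrative values):

vncpasswd
vncserver :3 -geometry 1280x800 -depth 24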

The initial Dockerfile is as follows:
And the resulting docker-compose.yml is:

Problems and problems

The first problem is that vncserver fails while looking for some 'default' fonts.

Creating vncserver0
Attaching to vncserver0
vncserver0 | xauth:  file /home/vuser/.Xauthority does not exist
vncserver0 | Couldn't start Xtightvnc; trying default font path.
vncserver0 | Please set correct fontPath in the vncserver script.
vncserver0 | Couldn't start Xtightvnc process.
vncserver0 | 
vncserver0 | 
vncserver0 | 30/10/16 09:20:28 Xvnc version TightVNC-1.3.9
vncserver0 | 30/10/16 09:20:28 Copyright (C) 2000-2007 TightVNC Group
vncserver0 | 30/10/16 09:20:28 Copyright (C) 1999 AT&T Laboratories Cambridge
vncserver0 | 30/10/16 09:20:28 All Rights Reserved.
vncserver0 | 30/10/16 09:20:28 See http://www.tightvnc.com/ for information on TightVNC
vncserver0 | 30/10/16 09:20:28 Desktop name 'X' (8732cbbb4029:3)
vncserver0 | 30/10/16 09:20:28 Protocol versions supported: 3.3, 3.7, 3.8, 3.7t, 3.8t
vncserver0 | 30/10/16 09:20:28 Listening for VNC connections on TCP port 5903
vncserver0 | Font directory '/usr/share/fonts/X11/misc/' not found - ignoring
vncserver0 | Font directory '/usr/share/fonts/X11/Type1/' not found - ignoring
vncserver0 | Font directory '/usr/share/fonts/X11/75dpi/' not found - ignoring
vncserver0 | Font directory '/usr/share/fonts/X11/100dpi/' not found - ignoring
vncserver0 | 
vncserver0 | Fatal server error:
vncserver0 | could not open default font 'fixed'

After comparing the container with my VirtualBox VM (which already worked), I saw that the VM had xfonts-100dpi installed, because xfce4 recommends xorg, which in turn requires xfonts-100dpi. The apt line in the Dockerfile uses the standard --no-install-recommends option to keep the image small, so the font packages never got pulled in. The first step is therefore to change the Dockerfile into:
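The essential change is to install the missing font packages explicitly, since --no-install-recommends skips them. The apt line becomes something like this (a sketch, not the exact original line):

RUN apt-get update && apt-get install -y --no-install-recommends \
      xfce4 xfce4-goodies gnome-icon-theme tightvncserver iceweasel \
      xfonts-base xfonts-75dpi xfonts-100dpi xfonts-scalable \
 && rm -rf /var/lib/apt/lists/*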


After making that change, the next problem occurred:
Setting up libfontconfig1:amd64 (2.11.0-6.3+deb8u1) ...
Setting up fontconfig (2.11.0-6.3+deb8u1) ...
Regenerating fonts cache... done.
Setting up keyboard-configuration (1.123) ...
debconf: unable to initialize frontend: Dialog
debconf: (TERM is not set, so the dialog frontend is not usable.)
debconf: falling back to frontend: Readline
Configuring keyboard-configuration
----------------------------------

Please select the layout matching the keyboard for this machine.

  1. English (US)
  2. English (US) - Cherokee
  3. English (US) - English (Colemak)
  4. English (US) - English (Dvorak alternative international no dead keys)
  5. English (US) - English (Dvorak)
  6. English (US) - English (Dvorak, international with dead keys)
  7. English (US) - English (Macintosh)
  8. English (US) - English (US, alternative international)
  9. English (US) - English (US, international with dead keys)
  10. English (US) - English (US, with euro on 5)
  11. English (US) - English (Workman)
  12. English (US) - English (Workman, international with dead keys)
  13. English (US) - English (classic Dvorak)
  14. English (US) - English (international AltGr dead keys)
  15. English (US) - English (left handed Dvorak)
  16. English (US) - English (programmer Dvorak)
  17. English (US) - English (right handed Dvorak)
  18. English (US) - English (the divide/multiply keys toggle the layout)
  19. English (US) - Russian (US, phonetic)
  20. English (US) - Serbo-Croatian (US)
  21. Other

Keyboard layout: 

During the image build there is a prompt asking about the keyboard layout, and typing an answer results in a hung process. This error is similar to this Stack Overflow question. I tried the suggestions there (copying a prepared file to /etc/default/keyboard), but the build still hung. After struggling through many experiments, I finally used this Dockerfile:
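A sketch of the general shape of such a Dockerfile (the preseeded keyboard layout, the noninteractive frontend, and the start script are illustrative assumptions, not the exact file):

FROM debian:jessie
ENV DEBIAN_FRONTEND noninteractive
# preseed the keyboard layout so keyboard-configuration never asks
RUN echo 'keyboard-configuration keyboard-configuration/layoutcode string us' | debconf-set-selections \
 && apt-get update \
 && apt-get install -y --no-install-recommends \
      xfce4 xfce4-goodies gnome-icon-theme tightvncserver iceweasel \
      xfonts-base xfonts-75dpi xfonts-100dpi xfonts-scalable \
 && rm -rf /var/lib/apt/lists/*
RUN useradd -m -s /bin/bash vuser
USER vuser
WORKDIR /home/vuser
EXPOSE 5903
# start-vnc.sh (hypothetical) sets the VNC password, runs vncserver :3, then tails its log
CMD ["/home/vuser/start-vnc.sh"]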

How to use vnc server container

First, you build the image:

  • docker-compose build
Then, create the containers:
  • docker-compose up -d
Check the IP of the container (replace vncserver1 with the container_name from your docker-compose.yml):
  • docker inspect vncserver1 | grep 172
Tunnel VNC by performing ssh port forwarding to that IP (for example, 172.17.0.9) and port 5903:
  • ssh -L 5903:172.17.0.9:5903 user@servername
Now, to view the screen, you can connect to the VNC server using a VNC viewer or by opening the vnc://localhost:5903 URL in Safari.
For the X Windows-based application, you first must grab the magic cookie:
docker exec -it vncserver1 bash
vuser@5eae70a4a75d:~$ ls                                                                                                                                                          
Desktop
vuser@5eae70a4a75d:~$ xauth list
5eae70a4a75d:3  MIT-MAGIC-COOKIE-1  1469d123c6fcb10e0fe8915e3f44ed71
5eae70a4a75d/unix:3  MIT-MAGIC-COOKIE-1  1469d123c6fcb10e0fe8915e3f44ed71
And then connect to the Docker container where the X Windows-based application needs to run:
docker exec -it pentahoserver bash
pentaho@3abd451c9b88:~/data-integration$ xauth add 172.17.0.9:3 MIT-MAGIC-COOKIE-1  1469d123c6fcb10e0fe8915e3f44ed71
pentaho@3abd451c9b88:~/data-integration$ DISPLAY=172.17.0.9:3
pentaho@3abd451c9b88:~/data-integration$ export DISPLAY
pentaho@3abd451c9b88:~/data-integration$ ./spoon.sh


And the resulting X Windows session is shown through the VNC channel:

Variations

This Dockerfile uses LXDE:
This one uses only the Openbox window manager:

Conclusion

We are able to run an X Window server inside a Docker container. The resulting image is about 625 MB, which could be a lot smaller if we removed Firefox (iceweasel) and used only the Openbox window manager.

Friday, October 28, 2016

Docker Basic 101

Background

This post describes notes resulting from my initial exploration of Docker. Docker could be described as a thin VM. Essentially, Docker runs processes on a Linux host in a semi-isolated environment. It is a brilliant technical accomplishment that exploits several characteristics of running applications on a Linux-based OS. First, that the result of package installation is the distribution of package files into certain directories and changes to certain files. Second, that an executable from one Linux distribution can run on another Linux distribution, provided all the required shared libraries and configuration files are in place.

Basic characteristic of Docker images

Docker images are essentially similar to zip archives, organized as layers over layers. Each additional layer provides new or changed files.
A Docker image should be portable, meaning it can be used by different instances of an application on different hosts.
Docker images are built using a Dockerfile and a Docker entry script:
a. The Dockerfile essentially lists the steps to install the application packages or files. After executing each RUN command in the Dockerfile, Docker creates a layer that stores the files added or updated by that command.
b. The Docker entry script defines what command will be executed when the image is run. This could be a line running an existing program, but it could also point to a provided shell script.


The Dockerfile is written in a domain-specific language that describes how to build the layers composing the Docker image. Some of the Dockerfile directives are (a minimal example follows the list):
  • ADD : adds a file to the Docker image filesystem; if the file is a tar archive, Docker extracts it first, and the source may be a file from the local filesystem or a URL on the internet
  • FROM : refers to a parent Docker image, so that the files from the parent image are available in the current image
  • RUN : executes a program/command inside the Docker environment; Docker captures the file changes resulting from the execution
  • CMD : defines the Docker entry command; this could point to a shell script already inside the image filesystem or to an existing program
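A minimal, hypothetical example using these four directives (the application name and paths are made up for illustration):

FROM debian:jessie
# install a dependency without the recommended extras to keep the layer small
RUN apt-get update && apt-get install -y --no-install-recommends curl && rm -rf /var/lib/apt/lists/*
# ADD extracts local tar archives automatically
ADD myapp.tar.gz /opt/myapp/
# entry command executed when a container is started from this image
CMD ["/opt/myapp/start.sh"]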

Basic Docker usage

There are a few basic docker commands that you need to know when using docker.

  • docker ps : prints running processes (containers in Docker lingo). The variant docker ps -a shows both running and terminated containers.
  • docker inspect [container] : shows various information about a container. Pipe it through grep for shorter output, e.g. docker inspect [container] | grep 172 to filter for IP addresses. A container can be referred to by its name or its id.
  • docker history [image] : shows the layers composing an image, including no-operation layers such as MAINTAINER.
  • docker exec -it [container] bash : runs a bash shell inside a container interactively. This works mostly like ssh, without needing an ssh daemon running inside the container.
  • docker run -d --name [name] [image] [command] [params] : creates a new container from an image and runs it in the background using the entry point. There are other useful parameters before --name, such as:
    • -v hostdir:containerdir -> mounts a host directory inside the container; this also works for a single file
    • --link othercontainer -> puts the IP of othercontainer in /etc/hosts, so the container can reach the other container by name
    • -e VAR=value -> sets an environment variable
  • docker start [container] : starts the container using the entry point
  • docker stop [container] : terminates the process inside the container
  • docker rm [container] : deletes the container; note that named data volumes are not deleted
Docker images from the internet can be referred to when creating new containers, or inside a Dockerfile. For example, to create a container using the official mariadb image, with the data directory mounted from the host and using the TokuDB storage engine, use this command:

docker run -d --name maria -e MYSQL_ROOT_PASSWORD=rootpassword -v /data/docker/mysql:/var/lib/mysql mariadb  --plugin-load=ha_tokudb --plugin-dir=/usr/lib/mysql/plugin --default-storage-engine=tokudb --wait_timeout=2000 --max_connections=500

Docker would try to retrieve the mariadb image from the official image source (see https://hub.docker.com/_/mariadb/). The Docker image reference syntax is [registry/][namespace/]name[:tag].
Specifying no tag in the command implies retrieving mariadb:latest.

Docker compose

Host-specific configuration is given as parameters to the docker run command, and these can get very long, complicating the container creation command. To simplify this, use docker-compose:
  • docker-compose up -d : reads docker-compose.yml and starts new containers according to the compose specification, assuming the Docker images are already built
  • docker-compose build : reads docker-compose.yml and builds images from the specified Dockerfiles where required
  • docker-compose stop : stops the containers specified in docker-compose.yml
  • docker-compose start : starts the processes in the containers (similar to docker start) referred to by docker-compose.yml
Docker-compose uses docker-compose.yml as the blueprint for the containers.
This example compose file shows the various parts (note that this is the 1.x docker-compose format):

pdi:
    volumes:
     - ~/pentahorepo:/home/pentaho/pentahorepo
     - /tmp/.X11-unix:/tmp/.X11-unix
     - ~/pdidocker/dotkettle:/home/pentaho/data-integration/.kettle
     - ~/phpdocker/public/isms/comm:/home/pentaho/comm
     - ~/phpdocker/public/isms/mobility:/home/pentaho/mobility
    image: y_widyatama:pdi
    container_name: pdiisms
    environment:
     - DISPLAY=172.17.0.1:10.0
    external_links:
     - maria

The volumes clause specifies volumes that are mounted inside the container, like the -v parameter. A path before the colon (:) refers to a directory on the host; a simple name instead is interpreted as a Docker data volume, which is created if it doesn't exist yet. The path after the colon is the mount point inside the container.
The image clause specifies the base image of the container. Alternatively, a build/dockerfile clause could be specified instead of the image clause.
The environment clause lists additional environment variables that will be set inside the container.
The external_links clause refers to names of existing running containers that will be added to the container's /etc/hosts.

Another example using two containers :

ismsdash-webserver:
  image: phpdockerio/nginx:latest
  container_name: ismsdash-webserver
  volumes:
      - .:/var/www/ismsdash
      - ./nginx/nginx.conf:/etc/nginx/conf.d/default.conf
  ports:
   - "80:80"
  links:
   - ismsdash-php-fpm

ismsdash-php-fpm:
  build: .
  dockerfile: php-fpm/Dockerfile
  container_name: ismsdash-php-fpm
  volumes:
    - .:/var/www/ismsdash
    - /data/isms/CDR:/var/www/ismsdash/public/CDR
    - ./php-fpm/php-ini-overrides.ini:/etc/php5/fpm/conf.d/99-overrides.ini
  external_links:
   - maria
   - pdiisms

This example shows two containers with dependency links. The first, ismsdash-webserver, links to the second container using the links clause. The second, ismsdash-php-fpm, refers to two existing containers outside this docker-compose.yml file.
The first container is created from an existing Docker image. The second container requires its image to be built first using the specified Dockerfile.
The ports clause specifies port forwarding from a host port to a container port.

Conclusion

Knowledge of several basic commands is necessary in order to use Docker. This post described some of these commands, in the hope that the reader will be able to start using Docker with them.

Wednesday, October 19, 2016

Running X11 Apps inside Docker on Remote Server

Background

Docker is a fast-growing trend that I could no longer ignore, so I tried Docker on a Linux server machine. Running a server app inside Docker is a breeze, but I also need to run Pentaho Data Integration on the server, and it uses an X11 display. There are several references about forwarding X11 connections to a Docker container, but none works for my setup, which has the Quartz X server running on a Mac OS X laptop and the Docker service running on a remote Linux server.

The usual way

The steps to run X Windowed applications in Docker containers can be read in Running GUI Apps with Docker and Alternatives to SSH X11 Forwarding for Docker Containers; they are essentially:
  1. Forward the DISPLAY environment variable to the container
  2. Forward the directory /tmp/.X11-unix to the container
I tried these steps with no results, because I need to add another step before these two: forwarding the X11 connection through an ssh connection to the server (not the container). From an OS X laptop the step is as follows:

ssh -X username@servername.domain

Complications


With this, DISPLAY is set to something like localhost:10.0, and the /tmp/.X11-unix directory is empty. The reason is that forwarding through ssh results in a TCP-based X11 service, which is the opposite of the Unix-socket-based X11 service.



Inside the container we are also unable to connect to localhost:10.0, because there localhost refers to the container's loopback interface, which is different from the host's loopback. For X11 display connections, the 10.0 means TCP port 6000+10, which as a result of the X11 forwarding only listens on the 127.0.0.1 address.

Solutions

So we need to ensure the container is able to connect to the forwarded X11 port. One solution is to use socat:

nohup socat tcp-listen:6010,bind=172.17.0.1,reuseaddr,fork tcp:127.0.0.1:6010 &

Then the DISPLAY variable should be altered to point to the host IP on the Docker network (172.17.0.1).


Unfortunately, as you can see, there are authentication problems. I tried to copy the .Xauthority file from the host into the container, but it still failed. We could only understand why after getting it to work:

It turns out that because a different IP is being used, the .Xauthority file needs an entry with 172.17.0.1:10 as the key.
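So the working sequence is roughly: read the cookie on the host, then register it inside the container under the bridge IP (the cookie value below is a placeholder for the one xauth prints):

# on the host: list the cookie for the forwarded display
xauth list "$DISPLAY"

# inside the container: register the same cookie under the docker bridge IP and point DISPLAY at it
xauth add 172.17.0.1:10 MIT-MAGIC-COOKIE-1 <cookie-from-the-host>
export DISPLAY=172.17.0.1:10.0
xclock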

With xclock working, we can also try something bigger, like Pentaho's Spoon UI:

Additional notes

In order to get the Pentaho Spoon UI working, I needed to add the GTK library as one layer in the Dockerfile:

USER root

RUN apt-get update && apt-get install -y libgtk2.0-0 xauth x11-apps

Caveat: for the formula editor to work in the Spoon UI, we need to install Firefox and the libraries bridging Firefox and SWT, which is a topic for another post.

Conclusion

We have found out how to run X11-based apps from a Docker container with the X server running on a laptop. What good is this? In some cases we need to run a GUI-based app inside a Docker container without installing VNC and an X server inside the container. However, we still have to provide the X11 libraries inside the container.


Thursday, June 9, 2016

SAP System Copy Lessons Learned

Background

Earlier this year I was part of a team that performed a system copy of a 20-plus-terabyte SAP ERP RM-CA system. And just now I am involved in doing two system copies in just over one week, for a much smaller amount of data. I think I should note some lessons learned from the experience in this blog. For the record, we are migrating from HP-UX and AIX to the Linux x86 platform.

Things that go wrong

First, following the System Copy guide carefully is quite a lot of work, mainly because some important information is hidden in references inside the guide. And reading a SAP Note that is referenced in another SAP Note, which is referenced in the Installation Guide, is a bit too much. Let me describe the things that went wrong.

VM Time drift

The Oracle RAC cluster had a time drift problem, killing one instance when the other was shutting down. The cure for our VMware-based Linux database servers is hidden in SAP Note 989963 "Linux VMWARE Timing": basically, add tinker panic 0 to ntp.conf and remove the local undisciplined time source. There are also additional kernel parameters if your kernel is not new enough.
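The ntp.conf change is small (a sketch; the local clock lines may be named slightly differently on your system):

tinker panic 0
# remove or comment out the local undisciplined clock:
# server 127.127.1.0
# fudge  127.127.1.0 stratum 10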

Memory usage grows beyond what is being used

Memory usage was very high, given that our production cluster hosts so many servers and thus so many concurrent database connections. The page table overhead was so large that the server was drowning in it. The cure is to enable hugepages in the kernel, allocate enough huge pages for the Oracle SGA, and ensure Oracle actually uses them. See SAP Note 1672954 "Hugepages for Oracle".
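A minimal sketch of the hugepages side (the page count is an example only; size it from your SGA and see the note for the Oracle-side settings):

echo 'vm.nr_hugepages = 10240' >> /etc/sysctl.conf    # 10240 x 2 MB pages = 20 GB
sysctl -p
grep -i huge /proc/meminfo                            # verify allocation and usage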

Temporary Sequential (TemSe) Objects

For the QA system copy, the step deleting inconsistencies in the TemSe consistency check took forever, like a whole day, and it was not even finished. It seems the standard SAP daily job to purge job logs had never been run on that server. Our solution was to use report RSBTCDEL2 to delete job logs in time intervals, for example deleting every finished job log older than 1000 days, then older than 500 days. See SAP Note 48400 for information about the kinds of objects stored in TemSe and the transaction to purge each kind; for example, report RSBTCDEL2 for job logs and RSBDCREO for BDC sessions.
The lesson is: please do purge your job logs in the QA and DEV systems. In our PROD system we have 14 days of retention for jobs; in QA and DEV, maybe 100 days of retention is enough.

NFS problems

Our former senior Basis admin confirms that Linux NFS is more problematic than AIX's NFS. Our lessons learned are:
  1. If it is possible to avoid NFS access, avoid it, for example when transporting datafiles from source to target. NFS problems delayed our process by about 8 hours, which we resolved by obtaining terabytes of storage and copying the datafiles to local disk via SCP.
  2. NFS problems occur mostly when the NFS server is restarted. Be prepared to restart the NFS service on each client first, and restart the client OS if things still don't work.
  3. The recommended NFS mount options for Linux are (ref: SAP Wiki):
rw,bg,hard,[intr],rsize=32768,wsize=32768,tcp,vers=3,suid,timeo=600

Out of disk space in non-database partitions

This is a manifestation of other problems, like:
  1. Forgetting to deactivate an app server during the system copy process. The dev_w? log files were filled with endless database connection errors after the system copy recreated the database.
  2. Using the wrong directory as the export destination.
These two should be obvious, but they happen. The lessons are: check your running processes, and make sure the export destination has enough free space.

Out of disk space in sapdata partition

Our database space usage estimate was very much off from what was really needed. The mistake was that we estimated the size from the source system database size, while sapinst creates datafiles with additional space added as a safety margin.
In order to size the space correctly, we need to be aware that:
  1. The sizes in DBSIZE.XML are only for estimating SAP's core tablespaces, and for SYSTEM and PSAPTEMP the estimate is a bit off. For SYSTEM we needed an additional 800 MB on top of DBSIZE.XML in order to pass the ABAP import phase successfully.
  2. The PSAPTEMP size recommended by SAP in Note 936441 is 20% of the total data size used, which is very different from the estimate in DBSIZE.XML. 20% might be a bit too much for large (>500 GB) databases, but if it is too small some large indexes might fail during import. In our case, DBSIZE.XML estimated 10 GB for PSAPTEMP; we needed 20 GB to pass the ABAP import phase (by Note 936441 it should have been 100 GB, though).
  3. Sapinst will create SAP core tablespace datafiles that might consume too much disk space. Be ready to provide larger storage capacity in order to have a smooth system copy operation. In our case, we were forced to shrink the PSAPSR3 datafiles in order to make room for the SYSTEM and PSAPTEMP tablespaces.
  4. The default reserved space for root is 5% of a partition. This is a very significant amount of wasted space (5% of a 600 GB sapdata partition is 30 GB, for example). Some references say that ext2/ext3 performance drops above 95% disk usage, but because sapdata mainly stores datafiles with autoextend off, this should not be an issue. For sapdata I change the reserved space to 10000 blocks: tune2fs -r 10000 /x/y/sapdata

Permission issues

We used local Linux users for OS authentication and authorization, and the uids/gids need to match across the servers in order for transport via NFS to work. We only found minimal clues about this issue in the System Copy guide. Our lessons here are:
  • restart sapinst after changing groups
  • saproot.sh is a tool to fix permissions AFTER the gid/uid are already fixed
  • double-check that you don't have duplicate entries in /etc/group or /etc/passwd as a result of the gid/uid update efforts
  • on SUSE there is an nscd service that caches groups; restart the service to refresh the cache, and you don't need to restart the OS
  • ls -l can sometimes be confusing; ls -nl is better, because it shows numerical uids/gids


Performance issues

Migrating to a new OS and hardware environment, we faced some performance issues. SAP Notes that provide some insight are Note 1817553 "What to do in case of general performance issues" and Note 853576 "Performance analysis w ASH and Oracle Advisors". Currently our standard operating procedure is like this:
  1. Execute AWR from ST04 -> Performance -> Wait Event Analysis -> Workload Reporting
  2. For the AWR, choose a begin and end snapshot spanning the half hour where the performance issue occurs
  3. In the AWR result, check SQLs with high elapsed time or IO
  4. Check the indexes and tables involved for stale statistics
  5. Check index storage quality with report RSORAISQN (see Note 970538)
  6. Execute Oracle SQL Tuning on the SQL id (sqltrpt.sql)
  7. Apply the recommendations

Summary

If you only do one dry run before the real system copy, be prepared to handle unexpected things during the real process, such as obtaining additional storage. We might have avoided some problems by reading the System Copy guide carefully and reading 'between the lines'.

Saturday, April 16, 2016

'Cached' memory in Linux Kernel

It is my understanding that the free memory in a Linux operating system can be seen on the second line of the output of "free -m":





The first line shows memory that is really, really free. The second line shows free memory combined with buffers and cache. The reason, I was told, is that buffer and cache memory can be converted to free memory whenever there is a need. The cache memory is filled by the filesystem cache of the Linux operating system.

The problem is, I was wrong. There are several cases where I found that the cache memory is not reduced when an application needs more memory. Instead, part of the application memory is sent to swap, increasing swap usage and causing pauses in the system (while the memory pages are written to disk). In one case, an Oracle database instance restarted, and the team thinks it was because the memory demand was too high (I think this is a bug).

The cache memory is supposed to be reduced when we issue this command (ref: how-do-you-empty-the-buffers-and-cache-on-a-linux-system):
# echo 1 > /proc/sys/vm/drop_caches
But on our Oracle database instances, running the command only produced a small reduction in the cached column. I also tried echo with 2 and 3 as the value, with the same results.

The truth

First, the 'cached' part does not only contain the file cache. It also contains memory-mapped files and anonymous mappings. Shared memory also falls under 'anonymous mappings'. On Oracle systems without hugepages enabled, the SGA (System Global Area) is created as shared memory (see pythian-goodies-free-memory-swap-oracle-and-everything). Of course, the SGA cannot be freed from memory; otherwise the database would be offline!

Another way to get a better understanding of memory usage is to look at /proc/meminfo:

You might want to check the Shmem usage there. On this example server, the zero values in the HugePages section show that hugepages are not being used. For an Oracle database with a large amount of RAM (say, 64 GB) and a large number of processes (500 or so), this can become a problem, mainly because PageTables is going to be exceedingly large (that's another story). The memory in PageTables cannot be used for anything else, so it is counted as 'used' in the 'free -m' output.
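A quick way to pull out the interesting counters:

grep -E 'Shmem|PageTables|HugePages' /proc/meminfo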

Second, some people said in their blogs that the page cache (file cache) competes with application memory for portions of real memory. The SUSE documentation confirms this:
The kernel swaps out rarely accessed memory pages to use freed memory pages as cache to speed up file system operations, for example during backup operations.

Limiting page cache usage

SUSE developed additional kernel parameters to control this behavior: vm.pagecache_limit_mb and vm.pagecache_limit_ignore_dirty. These two parameters can be used to limit the page cache (file cache) size that competes with ordinary memory. When the page cache is below this limit, it competes directly with application memory, and the kernel may swap out whichever blocks (file cache or application memory) have not been accessed recently. Page cache above the limit is deemed less important than application memory, so application memory will not get swapped out while there is a large amount of page cache that could be freed.
Red Hat Enterprise Linux 5 has the kernel parameter vm.pagecache, which is similar to SUSE's parameter but uses a percentage value instead. The default value is 100, meaning the whole memory is available for use as page cache (see Memory_Usage_and_Page_Cache-Tuning_the_Page_Cache). I believe this is also the case with CentOS.
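For example (the values are illustrative only):

# SUSE: cap the page cache at 4 GB and ignore dirty pages when enforcing the limit
echo 'vm.pagecache_limit_mb = 4096' >> /etc/sysctl.conf
echo 'vm.pagecache_limit_ignore_dirty = 1' >> /etc/sysctl.conf
sysctl -p

# RHEL 5: limit the page cache to 40% of RAM
sysctl -w vm.pagecache=40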

Conclusion

You might want to check your Linux distribution's documentation about the page cache. Each distribution has some non-standard parameters that allow us to contain the page cache usage of the operating system.

Thursday, March 3, 2016

Nostalgic Programming in Pascal

A writer once said that in the new world, programmers would be free to choose any programming language to do their job. They would be able to use the language most productive for themselves and for the task at hand. Feeling a bit nostalgic, I found that there is an open source compiler called Free Pascal Compiler (FPC). In the past I learned programming with Pascal as my second language, specifically Turbo Pascal.

Background

I needed to write a simple program to verify that a CSV file has the specified number of columns. It could be done using awk and friends, but I need the program to be fast, because the file is large and the row count is in the order of hundreds of thousands.

The program
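A minimal sketch of such a field-counting program (the delimiter and the expected column count below are assumptions; adjust them to your file):

program csvcheck;
{$H+}  { use AnsiString, otherwise lines are truncated at 255 characters }
const
  expected = 25;   { assumed number of columns }
  delim = ',';     { assumed delimiter }
var
  f: Text;
  line: string;
  i, fields, lineno: integer;
begin
  Assign(f, ParamStr(1));
  Reset(f);
  lineno := 0;
  while not Eof(f) do
  begin
    ReadLn(f, line);
    Inc(lineno);
    fields := 1;
    for i := 1 to Length(line) do
      if line[i] = delim then
        Inc(fields);
    if fields <> expected then
      WriteLn('line ', lineno, ': ', fields, ' fields');
  end;
  Close(f);
end.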


Explanation

At first the field counter returned strange results, very different from what I expected. I was baffled, until I remembered that Pascal strings have a maximum width of 255 characters (and the file has more characters on each line). It turns out there is a switch ($H+) to enable null-terminated AnsiStrings in FPC instead of the normal 255-character strings (aka ShortString). See http://wiki.freepascal.org/Character_and_string_types :

The type String may refer to ShortString or AnsiString, depending on the {$H} switch. If the switch is off ({$H-}) then any string declaration will define a ShortString. Its size will be 255 chars, if not otherwise specified. If it is on ({$H+}), a string without a length specifier will define an AnsiString, otherwise a ShortString with the specified length. In mode 'delphiunicode', String is UnicodeString.

Compiling 

On my Ubuntu system, I installed fpc using apt-get:
 apt-get install fp-compiler-2.6.2

Then all you need to do is run fpc filename.pas:

The warning 'contains output sections' is fixed in later versions of fpc; for this version it is harmless.

Conclusion

You no longer need to seek out Turbo Pascal to do Pascal programming; now you can use Free Pascal.

Monday, February 1, 2016

Deploying Yii Application in Openshift Origin

This post describes how we deploy a Yii application to OpenShift Origin. It should also work on OpenShift Enterprise and OpenShift Online.

Challenges

A PHP-based application that runs in a gear must not write to the application directory, because it is not writable. OpenShift provides a data directory for each gear, which we can use for writing assets and the application runtime log.

In the case of a load-balanced application, error messages written to the application log are also spread across multiple gears, making troubleshooting more complex than it already is.

Solution

Use the deploy action hook to create directories in the data directory and symbolic links into the application. Change the deploy script to be executable; on Windows systems without TortoiseGit we need to do some git magic (shown further below).
Create this file as .openshift/action_hooks/deploy in the application source code.

If your application is hosted in the 'php' directory of the source code:
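A sketch of such a hook (OPENSHIFT_DATA_DIR and OPENSHIFT_REPO_DIR are the standard gear environment variables; the Yii directory names are assumptions, adjust them to your layout):

#!/bin/bash
# create writable directories in the gear's data dir
mkdir -p "$OPENSHIFT_DATA_DIR"/assets "$OPENSHIFT_DATA_DIR"/runtime
# link them into the (read-only) application directory
ln -sfn "$OPENSHIFT_DATA_DIR"/assets  "$OPENSHIFT_REPO_DIR"/php/assets
ln -sfn "$OPENSHIFT_DATA_DIR"/runtime "$OPENSHIFT_REPO_DIR"/php/protected/runtime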


If your application is hosted in the root of the application source code, drop the php/ component from the link targets above:


Then make it executable and commit it :
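The 'git magic' on Windows is typically setting the executable bit directly in the index:

git update-index --chmod=+x .openshift/action_hooks/deploy
git commit -m "make deploy hook executable"
git push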




To check application.log, do this in the Ruby (rhc) command prompt (a trick I learned from this blog):



Another way to troubleshoot is to do port forwarding to the different gears.
After port forwarding is set up, you can access the Apache instance in a gear by going to http://127.0.0.1:8080 in your browser (adjust to the actual port shown by rhc port-forward).

Adapting Openshift Origin for High load

OpenShift Origin is a Platform as a Service which enables us to horizontally scale applications and manage multiple applications in one cluster. One OpenShift node can contain many gears; the default settings allow for 100 gears (which could be 100 different applications, or maybe only 4 applications with 25 gears each). Each gear contains a separate Apache instance. This post describes adjustments that I have made on an OpenShift M4 cluster that was deployed using the definitive guide. Maybe I really should upgrade the cluster to a newer version, but we are currently running production load on it.

The Node architecture

Load balancing in an OpenShift application is done by haproxy. The general application architecture is shown below (replace Java with PHP for a PHP-based application) (ref: OpenShift Blog: How haproxy scales apps).
The gears shown running code, for PHP applications, each consist of one Apache HTTPD instance.

What is not shown above is that the user actually connects to another Apache instance running on the node host, working as a reverse proxy server; this can be seen in the image below (taken from Red Hat Access: OpenShift Node Architecture):
Using haproxy as a load balancer means the 'Gear 1' in the picture above is replaced by an haproxy instance which distributes the load to the other gears. I tried to draw this in yEd, with this diagram as the result:

Haproxy limits


First, the haproxy cartridge specifies 128 as the session limit. What if I plan to have 1000 concurrent users? If the users are on IE, each could create 3 concurrent connections, totaling 3000 connections. So this needs to be changed.
I read somewhere that haproxy is capable of proxying tens of thousands of connections, so I changed the maxconn values from 256 and 128 to 25600 and 12800, respectively. The file to change is haproxy/conf/haproxy.cfg in the application's primary (front) gear, lines 28 and 53. A normal user is allowed to make this change (rhc ssh to your gear first).
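The relevant lines end up looking roughly like this (the exact line numbers and sections may differ between cartridge versions):

global
    maxconn 25600    # was 256
defaults
    maxconn 12800    # was 128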

It seems the open files limit also needs to be raised. The indication is these error messages in app-root/logs/haproxy.log:

[WARNING] 018/142729 (27540) : [/usr/sbin/haproxy.main()] Cannot raise FD limit to 51223.
[WARNING] 018/142729 (27540) : [/usr/sbin/haproxy.main()] FD limit (1024) too low for maxconn=25600/maxsock=51223. Please raise 'ulimit-n' to 51223 or more to avoid any trouble.

To fix this, edit the configuration file /etc/security/limits.conf as root and add this line:

* hard nofile 60000

Then stop and start your gear after the change.

Preventing port number exhaustion

A high load might cause the OpenShift node to run out of port numbers. The symptom is this error message in /var/log/httpd/error_log (only accessible to root):

[Mon Feb 01 10:35:37 2016] [error] (99)Cannot assign requested address: proxy: HTTP: attempt to connect to 127.4.238.2:8080 (*) failed
[Mon Feb 01 10:35:37 2016] [error] (99)Cannot assign requested address: proxy: HTTP: attempt to connect to 127.4.238.2:8080 (*) failed
[Mon Feb 01 10:35:37 2016] [error] (99)Cannot assign requested address: proxy: HTTP: attempt to connect to 127.4.238.2:8080 (*) failed
[Mon Feb 01 10:35:37 2016] [error] (99)Cannot assign requested address: proxy: HTTP: attempt to connect to 127.4.238.2:8080 (*) failed
[Mon Feb 01 10:35:37 2016] [error] (99)Cannot assign requested address: proxy: HTTP: attempt to connect to 127.4.238.2:8080 (*) failed
[Mon Feb 01 10:35:37 2016] [error] (99)Cannot assign requested address: proxy: HTTP: attempt to connect to 127.4.238.2:8080 (*) failed
[Mon Feb 01 10:35:37 2016] [error] (99)Cannot assign requested address: proxy: HTTP: attempt to connect to 127.4.238.2:8080 (*) failed

Such error messages are related to bug https://bugzilla.redhat.com/show_bug.cgi?id=1085115, which gives us a clue about what really happened. On my system, mod_rewrite is used as the reverse proxy module, which has the shortcoming of not reusing socket connections to the backend gears. Opening and closing sockets too often keeps many sockets in the TIME_WAIT state, and by default their port numbers are not reused for 2 minutes. So one workaround is to enable net.ipv4.tcp_tw_reuse; add this line to /etc/sysctl.conf as root:
 
net.ipv4.tcp_tw_reuse = 1

And to put it into effect immediately, run this at the shell prompt:

sysctl -w net.ipv4.tcp_tw_reuse=1

Conclusion

These changes, done properly, adjust the software limits on the OpenShift node, enabling it to handle a high number of requests and/or concurrent connections. Of course, you still need to design your database queries to be quick and keep track of memory usage. But these software limits, if left unchanged, might become a bottleneck for your OpenShift application's performance.

Sunday, January 31, 2016

Long running process in Linux using PHP

Background

To do my work, I usually create web-based applications written in PHP. Sometimes we need to run something that takes a long time, far longer than the 10-second psychological limit for web pages.
A bit of googling on Stack Overflow found this: http://stackoverflow.com/questions/2212635/best-way-to-manage-long-running-php-script, but I will tell a similar story with a different solution. One of the long-running tasks that needs to run is a Pentaho Data Integration transformation.

Difficulties in long running PHP scripts

I encountered some problems when trying to make PHP do long-running tasks:
  1. PHP script timeout. This can be solved by calling set_time_limit(0); before the long-running task.
  2. Memory leaks. The framework I normally use has some memory issues; this can be solved either by patching the framework (it is a bit difficult, but I did something similar in the past) or by splitting the data into several batches. If you loop over the batches in one PHP run, make sure that after each batch there are no dangling references to the processed objects.
  3. A browser disconnect in an Apache-PHP environment terminates the PHP script. During my explorations I found that:
    1. Some firewalls disconnect an HTTP connection after 60 seconds.
    2. Firefox has a long timeout (300 seconds or so, ref here).
    3. Chrome has a timeout similar to Firefox (about 300 seconds, ref here), and longer for AJAX (a Stack Overflow ref reports no timeout after 15 hours).
  4. Difficulties in running Pentaho transformations, because the PHP module runs as www-data and is unable to access the Kettle repository stored in another user's home directory.

Workarounds

I have experience using these workarounds to force PHP to serve long-running web pages:
  • Workaround 1 : use set_time_limit(0); and ignore_user_abort(true); to ensure the script keeps running even after the client disconnects (a minimal sketch follows this list). Unfortunately, the user will no longer see the result of the script.
  • Workaround 2 : use HTTPS, so the firewall is unable to do layer-7 processing and doesn't dare disconnect the connection. If the user closes the browser, the script still terminates, unless you also apply workaround 1.
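A minimal sketch of workaround 1:

<?php
set_time_limit(0);        // lift the PHP execution time limit
ignore_user_abort(true);  // keep running even if the client disconnects
// ... the long-running task goes here, e.g. processing data in batches
//     and dropping references to processed objects after each batch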
I haven't tried detaching a child process yet, but my other solutions involve a separate process for background processing, with similar benefits.

Solution A - Polling task tables using cron

It is better to separate the user interface part (the PHP web script) from the background processing part. My first solution was to create a cron task that runs every 3 minutes and executes a PHP CLI script which checks a background task table for tasks in the 'SUBMITTED' state. Upon picking up a task, the script updates its state to 'PROCESSING'.
So the user interface / front end only reads the background task table and, when the user orders it to, inserts a task there with the required specification, setting the state to 'SUBMITTED'.
When cron runs the PHP CLI script, it checks for tasks and, if there are any, changes the first task's state to PROCESSING and begins processing. When processing completes, the PHP CLI script changes the state to COMPLETED.
Complications happen, so we also need to manage risk by:
  1. logging the phases of the process in a database table, including any warnings issued during processing
  2. recording error rows, if any, in another database table, so the user can view the problematic rows
Currently this solution works, but recently I came across another approach that might be a better fit for running a Linux process.

Solution B - Using inotifywait and control files

In this solution, I created a control file which contains only one line of CSV. I prepared a PHP CLI script which parses the CSV and executes a long-running process, and also a PHP web page which writes to the control file. inotifywait, from inotify-tools, listens for file system notifications from the Linux kernel related to changes on the control file (a sketch of the watcher loop is shown after the scenario below).
The scenario is like this :
  1. The user opens the PHP web page, chooses parameters for the background task, and clicks Submit
  2. The PHP web page receives the submitted parameters and writes them into the control file, including a job id. The user receives a page that states 'task submitted'.
  3. A shell script running inotifywait waits for notifications on the control file, specifically for the close_write event
  4. After the close_write event is received, the shell script continues and runs the PHP CLI script to do the background processing
  5. The PHP CLI script reads the control file for the parameters and job id
  6. The PHP CLI script executes the Linux process, redirecting the output to a file identified by the job id in a specific directory
  7. The web page that states 'task submitted' can periodically poll the output file with the job id and show the output to the end user (OK, this one I still need to actually try)
  8. The PHP CLI script returns, and the shell script loops back to (3)
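A minimal sketch of the watcher shell script (the paths are assumptions):

#!/bin/bash
CONTROL=/var/app/control.csv
while inotifywait -e close_write "$CONTROL"; do
    # the CLI script reads the job id and parameters from the control file
    # and redirects the task output to a file named after the job id
    php /var/app/run_task.php "$CONTROL"
done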

Conclusion

By using Linux file system notifications, we can trigger task execution with parameters specified from a PHP web page. The task can run as a different Linux user, namely the user running the shell script. Data sanitization is done by PHP, so no strange commands can be passed to the background task.

These solutions are built entirely from open source components. I saw that Azure has WebJobs, which might fulfill similar requirements, but that is on the Azure platform, which I have never used.

Hack : Monitoring CPU usage from Your Mobile

Background

Sometimes I need to run a long-running background process on the server, and I need to know when the CPU usage returns to (almost) zero, indicating the process has finished. I know there are other options, like sending myself an email when the process finishes, but currently I am satisfied with monitoring the CPU usage.

The old way

I have an Android phone, which allows me to:

  1. Launch ConnectBot, type the ssh username and password, and connect to the server
  2. Type top
  3. Watch the top output

The new way

Because I am more familiar with PHP than with anything else right now (OK, there are times I am more familiar with C#, but that is another story), I did a quick Google search for 'php cpu usage' and found http://stackoverflow.com/questions/13131003/get-cpu-percent-usage-in-php. Using stix's solution, I created this simple JSON web service in PHP:
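The service itself is small; a sketch along the lines of that answer, sampling /proc/stat twice, looks like this:

<?php
// read the aggregate cpu line: user nice system idle iowait irq softirq ...
function readCpu() {
    $line = preg_replace('!^cpu +!', '', fgets(fopen('/proc/stat', 'r')));
    return array_map('floatval', array_slice(explode(' ', trim($line)), 0, 7));
}
$a = readCpu();
usleep(200000);              // 0.2 second sampling interval
$b = readCpu();
$d = array();
for ($i = 0; $i < 7; $i++) $d[$i] = $b[$i] - $a[$i];
$total = array_sum($d);
header('Content-Type: application/json');
echo json_encode(array(
    'cpu' => $total > 0 ? round(100 * ($total - $d[3]) / $total, 1) : 0,  // total busy %
    'sys' => $total > 0 ? round(100 * $d[2] / $total, 1) : 0,             // system %
));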


For displaying the CPU as a graph, another Google search pointed me to Flot, a JavaScript library for drawing (plotting) simple charts. A tutorial shows how to draw a CPU chart similar to the one in Windows: http://www.jqueryflottutorial.com/how-to-make-jquery-flot-realtime-update-chart.html.
The principle is to use AJAX to periodically call the PHP JSON web service to get CPU usage statistics.

I adapted the code to add system CPU usage in addition to total CPU usage.
The source code is shown below:
Put the HTML and PHP files in a folder on the server, unzip the Flot files in the same directory, and you're good to go.

Conclusion

Using Flot and PHP we can monitor CPU usage remotely, and their compatibility with mobile browsers allows us to use our mobile devices to monitor the server's CPU usage.


Saturday, January 2, 2016

Installing MariaDB and TokuDB in Ubuntu Trusty

Background

In this post I tell the story of installing MariaDB on Ubuntu Trusty, and the process I went through to enable the TokuDB engine. I needed to experiment with the engine as an alternative to the Archive engine, to store compressed table rows. It has better performance than compressed InnoDB tables (row_format=compressed), and it was recommended in some blog posts (this post and this one).

Packages for Ubuntu Trusty

In order to be able to use TokuDB, I consulted the documentation and found that Ubuntu 12.10 and newer on the 64-bit platform requires the mariadb-tokudb-engine-5.5 package. Despite the existence of mariadb-5.5 packages, I found no package containing the tokudb keyword in the official Ubuntu Trusty repositories. The MariaDB 5.5 server package also doesn't contain ha_tokudb.so (see the file list).

The solution is to use the repository from the MariaDB online repository wizard.
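The setup the wizard generates is roughly the following (the mirror URL is a placeholder; take the exact lines from the wizard):

apt-get install -y software-properties-common
apt-key adv --recv-keys --keyserver hkp://keyserver.ubuntu.com:80 0xcbcb082a1bb943db
add-apt-repository 'deb http://<mirror>/mariadb/repo/10.1/ubuntu trusty main'
apt-get update
apt-get install -y mariadb-server-10.1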

Installing mariadb-server-10.1, we get many storage engines, TokuDB and Cassandra being the more interesting ones.

Preparation - disable Hugepages

Kernel transparent hugepages are not compatible with the TokuDB engine. I disabled them by adding some lines to /etc/rc.local:

if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
   echo never > /sys/kernel/mm/transparent_hugepage/enabled
fi
if test -f /sys/kernel/mm/transparent_hugepage/defrag; then
   echo never > /sys/kernel/mm/transparent_hugepage/defrag
fi


Enabling TokuDB


I enabled TokuDB by running this command at MariaDB's SQL prompt as root:

INSTALL SONAME 'ha_tokudb';

In retrospect, maybe I was supposed to uncomment the plugin load line in /etc/mysql/conf.d/tokudb.conf instead.

Using TokuDB

Having enabled TokuDB, check it with SHOW ENGINES:

Cool.

The syntax to use it from MariaDB is a bit different from the Percona or Tokutek distributions:

CREATE TABLE xxx (columns .., PRIMARY KEY pk_name(pk_field1,pk_field2..)) ENGINE = TokuDB COMPRESSION=TOKUDB_SNAPPY;

We can also transform an existing InnoDB table (or any other kind of table) into a TokuDB table, but beware that this will recreate the entire table in the TokuDB engine:

ALTER TABLE xxx ENGINE =TokuDB COMPRESSION=TOKUDB_SNAPPY;

There are two ways of optimizing TokuDB tables. The first is light 'maintenance':

OPTIMIZE TABLE xxx;

But if you want to free some space, you need to recreate (reorganize) the table:

ALTER TABLE xxx ENGINE =TokuDB COMPRESSION=TOKUDB_SNAPPY;

The compression options (refer here, but beware of syntax differences) are as follows:
  • tokudb_default, tokudb_zlib: compress using the zlib library; medium CPU usage and compression ratio
  • tokudb_fast, tokudb_quicklz: use the quicklz library; the lightest compression with low CPU usage
  • tokudb_small, tokudb_lzma: use the lzma library; the highest compression and highest CPU usage
  • tokudb_uncompressed: no compression is used
  • tokudb_snappy: compression using Google's snappy algorithm; reasonable compression and fast performance

Caveats

  • Currently I still cannot enable InnoDB/XtraDB page-level compression
  • Syntax differences confused me at times; some information that is not clear on MariaDB's website can be found on Percona's website
  • Xtrabackup doesn't work for TokuDB tables; you need plain mysqldump or mydumper to back them up
  • The mydumper in the Ubuntu Trusty repository doesn't work with MariaDB 10.1
  • I am still unable to compile a recent mydumper version for the Ubuntu Trusty / MariaDB 10.1 combination