Thursday, March 9, 2017

Securing Openshift Origin Nodes

Background

We have deployed an OpenShift Origin cluster based on the Origin Milestone 4 release. When a security assessment was performed on several of the applications in the cluster, some issues cropped up that needed remediation. Some issues were related to application code; others were related to the OpenShift node configuration, which we will discuss here.

SSH issues

One of the issues is SSH weak algorithm support.
To remediate it, we need to tweak /etc/ssh/sshd_config by inserting these additional lines:

# mitigation for the SSH weak algorithm support finding from the security assessment
Ciphers aes128-ctr,aes192-ctr,aes256-ctr
MACs hmac-sha1,hmac-ripemd160,hmac-ripemd160@openssh.com
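
A quick way to apply and verify the change (a sketch; it assumes the stock sshd service and that sshd -T is available):

service sshd restart
sshd -T | grep -Ei '^(ciphers|macs)'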

SSL issues

The other issue is related to the SSL cipher configuration. The 3DES cipher suite is no longer considered secure, so we need to tweak /etc/httpd/conf.d/000001_openshift_origin_node.conf (line 63) by adding !3DES:!DES-CBC3-SHA :

SSLCipherSuite kEECDH:+kEECDH+SHA:kEDH:+kEDH+SHA:+kEDH+CAMELLIA:kECDH:+kECDH+SHA:kRSA:+kRSA+SHA:+kRSA+CAMELLIA:!aNULL:!eNULL:!SSLv2:!RC4:!DES:!EXP:!SEED:!IDEA:!3DES:!DES-CBC3-SHA


We also need to disable SSLv2 and SSLv3 in 000001_openshift_origin_node.conf (line 58):

SSLProtocol ALL -SSLv2 -SSLv3

And because SSL certificate chains are a bit tricky, we also need an SSLCertificateChainFile line (inserted at line 32 of the same file):

SSLCertificateChainFile /etc/pki/tls/certs/localhost.crt

The httpd SSL virtual host configuration conflicts with OpenShift's, so we need to delete all the virtual host lines in /etc/httpd/conf.d/ssl.conf.

As the final step, the files localhost.crt and localhost.key, in /etc/pki/tls/certs/localhost.crt and /etc/pki/tls/private/localhost.key respectively, need to be replaced with the company's valid SSL certificate and key.

Restart httpd afterwards.
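
To confirm the new settings after the restart, a hedged spot check from another machine (node.example.com is a placeholder for the node's hostname, assuming it terminates HTTPS on port 443; both handshakes below should now be rejected):

openssl s_client -connect node.example.com:443 -ssl3 < /dev/null
openssl s_client -connect node.example.com:443 -cipher 'DES-CBC3-SHA' < /dev/null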

SSL in node proxy issue

The Node.js websocket proxy runs on port 8443 and also has SSL issues. We use the websocket proxy when an application in OpenShift requires websockets.

In /etc/openshift/web-proxy-config.json (between the private key line at line 125 and the } at line 126), we need to add this line (remember to add a trailing comma to the preceding private key line so the JSON stays valid):

"ciphers" : "kEECDH:+kEECDH+SHA:kEDH:+kEDH+SHA:+kEDH+CAMELLIA:kECDH:+kECDH+SHA:kRSA:+kRSA+SHA:+kRSA+CAMELLIA:!aNULL:!eNULL:!SSLv2:!RC4:!DES:!EXP:!SEED:!IDEA:+3DES:!DES-CBC3-SHA"

We also need to replace the file /opt/rh/nodejs010/root/usr/lib/node_modules/openshift-node-web-proxy/lib/utils/http-utils.js with the latest version from https://raw.githubusercontent.com/openshift/origin-server/master/node-proxy/lib/utils/http-utils.js. You can simply edit the file in vi, delete all lines, and paste in the raw lines from GitHub.
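
A sketch of doing the same replacement non-interactively (assuming the node has outbound internet access), plus a spot check that the weak cipher is refused on the proxy port once the proxy service is restarted (node.example.com is a placeholder):

cp /opt/rh/nodejs010/root/usr/lib/node_modules/openshift-node-web-proxy/lib/utils/http-utils.js{,.bak}
curl -o /opt/rh/nodejs010/root/usr/lib/node_modules/openshift-node-web-proxy/lib/utils/http-utils.js https://raw.githubusercontent.com/openshift/origin-server/master/node-proxy/lib/utils/http-utils.js
openssl s_client -connect node.example.com:8443 -cipher 'DES-CBC3-SHA' < /dev/null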

Conclusion

Some maintenance is needed to ensure OpenShift Origin nodes are not a security liability. These steps reduce the number of security issues that need to be dealt with when securing apps in the OpenShift Origin cluster.


Saturday, November 12, 2016

How to create LVM volume with thin provisioning

This post shows how to create an LVM volume with thin provisioning, that is, storage is allocated only for the ranges of the volume that are actually used.

Check volume groups

First, check the LVM volume groups to find out which VG has space for our thin volume pool.

vgdisplay

Choose one of the volume groups with sufficient space. Because we are using thin provisioning, we can get by with less space than with normal provisioning.

Second, check existing logical volumes also. 

lvs

Creating thin volume pool


Next, we create a thin volume pool in the chosen volume group (for example, vgdata).

lvcreate -L 50G --thinpool globalthinpool vgdata

Print the resulting volumes using lvs :


We see that globalthinpool is created with a logical size of 50 gigabytes.

Creating thinly provisioned volume

Now we create a thinly provisioned volume using the previously created pool.

lvcreate -V100G -T vgdata/globalthinpool -n dockervol

The command creates a 100 GB logical volume using thin provisioning. Note that the volume created is 100 GB, which is larger than the actual thin pool. Beware that we must monitor actual usage, because if the 50 GB pool runs out, the programs using it will freeze. See http://unix.stackexchange.com/questions/197412/thin-lvm-pool-frozen-due-to-lack-of-free-space-what-to-do if you encounter such a condition (hint: something like lvresize -L +100g vgdata/globalthinpool).
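
A small sketch of keeping an eye on the pool (the Data% and Meta% columns) and growing it before it fills up, assuming vgdata still has free extents:

lvs -o lv_name,data_percent,metadata_percent vgdata
lvextend -L +50G vgdata/globalthinpool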

The result is shown in the picture below:

Complications


If we get errors such as /usr/sbin/thin_check: execvp failed, it usually means thin_check is not installed yet. On Ubuntu, install it using

apt-get install thin-provisioning-tools

Formatting the new volume


Before the new volume can be used, format it using mkfs.ext4:

mkfs.ext4 /dev/mapper/vgdata-dockervol


Now we can mount it:

mkdir /mnt/dockervol
mount /dev/mapper/vgdata-dockervol /mnt/dockervol
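
To make the mount survive reboots, a hedged example of the matching /etc/fstab entry:

echo '/dev/mapper/vgdata-dockervol /mnt/dockervol ext4 defaults 0 2' >> /etc/fstab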

Conclusion



LVM thin provisioning can be used so that only the blocks that are actually used get allocated in the LVM.


Note: I have also uploaded the screencast session to YouTube.


Sunday, October 30, 2016

How to Run X Windows Server inside Docker Container

Background

Sometimes I need to run X Window-based applications inside Docker containers, and running the X server locally is impractical because of latency, or because the work laptop has no X server. First I tried to create a VirtualBox-based VNC server, and it worked fine albeit a little slowly, but Docker containers seem to have a better memory and disk footprint. So I tried to create a VNC server running X Windows inside a Docker container. I had already tried suchja/x11server (ref), but it has a strange problem of ignoring the cursor keys of my MacBook on WebKit pages (such as Pentaho Data Integration's Formula page).

Starting point

Many of my Docker images are based on Debian Jessie, so I started from the instructions in this DigitalOcean article: https://www.digitalocean.com/community/tutorials/how-to-set-up-vnc-server-on-debian-8. This VNC server is based on the XFCE desktop environment. The steps are basically to install:
  • xfce4 
  • xfce4-goodies 
  • gnome-icon-theme 
  • tightvncserver
  • iceweasel
After that, run vncserver :[display-number] with a user-specified password.
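
For example (the display number, geometry, and depth are just illustrative values):

vncpasswd
vncserver :3 -geometry 1280x800 -depth 24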

The initial Dockerfile that resulted is as follows:
And the resulting docker-compose.yml is:

Problems and problems

The problem is that vncserver fails while looking for some 'default' fonts.

Creating vncserver0
Attaching to vncserver0
vncserver0 | xauth:  file /home/vuser/.Xauthority does not exist
vncserver0 | Couldn't start Xtightvnc; trying default font path.
vncserver0 | Please set correct fontPath in the vncserver script.
vncserver0 | Couldn't start Xtightvnc process.
vncserver0 | 
vncserver0 | 
vncserver0 | 30/10/16 09:20:28 Xvnc version TightVNC-1.3.9
vncserver0 | 30/10/16 09:20:28 Copyright (C) 2000-2007 TightVNC Group
vncserver0 | 30/10/16 09:20:28 Copyright (C) 1999 AT&T Laboratories Cambridge
vncserver0 | 30/10/16 09:20:28 All Rights Reserved.
vncserver0 | 30/10/16 09:20:28 See http://www.tightvnc.com/ for information on TightVNC
vncserver0 | 30/10/16 09:20:28 Desktop name 'X' (8732cbbb4029:3)
vncserver0 | 30/10/16 09:20:28 Protocol versions supported: 3.3, 3.7, 3.8, 3.7t, 3.8t
vncserver0 | 30/10/16 09:20:28 Listening for VNC connections on TCP port 5903
vncserver0 | Font directory '/usr/share/fonts/X11/misc/' not found - ignoring
vncserver0 | Font directory '/usr/share/fonts/X11/Type1/' not found - ignoring
vncserver0 | Font directory '/usr/share/fonts/X11/75dpi/' not found - ignoring
vncserver0 | Font directory '/usr/share/fonts/X11/100dpi/' not found - ignoring
vncserver0 | 
vncserver0 | Fatal server error:
vncserver0 | could not open default font 'fixed'

After comparing the container to my VirtualBox VM (which already worked), I found that the VM had downloaded xfonts-100dpi, because xfce4 recommends xorg, which in turn requires xfonts-100dpi. The apt line in the Dockerfile has the standard no-install-recommends clause in order to keep images small. So the first step is to change the Dockerfile so the apt line pulls in the missing font packages.
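
One way to get the missing fonts while keeping --no-install-recommends is to name the font packages explicitly; a sketch (it assumes the missing font directories map to the usual Debian packages, so the exact list in the real Dockerfile may differ):

RUN apt-get update && apt-get install -y --no-install-recommends xfonts-base xfonts-75dpi xfonts-100dpi xfonts-scalable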


After making that change, the next problem occurred:
Setting up libfontconfig1:amd64 (2.11.0-6.3+deb8u1) ...
Setting up fontconfig (2.11.0-6.3+deb8u1) ...
Regenerating fonts cache... done.
Setting up keyboard-configuration (1.123) ...
debconf: unable to initialize frontend: Dialog
debconf: (TERM is not set, so the dialog frontend is not usable.)
debconf: falling back to frontend: Readline
Configuring keyboard-configuration
----------------------------------

Please select the layout matching the keyboard for this machine.

  1. English (US)
  2. English (US) - Cherokee
  3. English (US) - English (Colemak)
  4. English (US) - English (Dvorak alternative international no dead keys)
  5. English (US) - English (Dvorak)
  6. English (US) - English (Dvorak, international with dead keys)
  7. English (US) - English (Macintosh)
  8. English (US) - English (US, alternative international)
  9. English (US) - English (US, international with dead keys)
  10. English (US) - English (US, with euro on 5)
  11. English (US) - English (Workman)
  12. English (US) - English (Workman, international with dead keys)
  13. English (US) - English (classic Dvorak)
  14. English (US) - English (international AltGr dead keys)
  15. English (US) - English (left handed Dvorak)
  16. English (US) - English (programmer Dvorak)
  17. English (US) - English (right handed Dvorak)
  18. English (US) - English (the divide/multiply keys toggle the layout)
  19. English (US) - Russian (US, phonetic)
  20. English (US) - Serbo-Croatian (US)
  21. Other

Keyboard layout: 

During the image build, there is a prompt asking us about the keyboard layout, and typing an answer results in a hung process. This error is similar to this Stack Overflow question. I tried the suggestions there (copying to /etc/default/keyboard), but it still hangs. After struggling with many experiments, I finally arrived at a working Dockerfile.
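
One common way to avoid this kind of debconf prompt during an image build (a hedged suggestion, not necessarily the exact content of the final Dockerfile) is to run the apt-get step non-interactively:

RUN DEBIAN_FRONTEND=noninteractive apt-get install -y xfce4 xfce4-goodies gnome-icon-theme tightvncserver iceweasel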

How to use vnc server container

First, you build the image:

  • docker-compose build
Then, create the containers :
  • docker-compose up -d
Check the IP of the container (replace vncserver1 with container_name in your docker-compose.yml)
  • docker inspect vncserver1 | grep 172
Tunnel VNC by performing ssh port forwarding to the local IP (example, 172.17.0.9) and port 5903
  • ssh -L 5903:172.17.0.9:5903 user@servername 
Now to view the screen, you can connect to the VNC server using a VNC viewer, or by opening the vnc://localhost:5903 URL in Safari.
For the X Window-based application, you must first grab the magic cookie:
docker exec -it vncserver1 bash
vuser@5eae70a4a75d:~$ ls                                                                                                                                                          
Desktop
vuser@5eae70a4a75d:~$ xauth list
5eae70a4a75d:3  MIT-MAGIC-COOKIE-1  1469d123c6fcb10e0fe8915e3f44ed71
5eae70a4a75d/unix:3  MIT-MAGIC-COOKIE-1  1469d123c6fcb10e0fe8915e3f44ed71
And then connect to the Docker container where the X Window-based application needs to run:
docker exec -it pentahoserver bash
pentaho@3abd451c9b88:~/data-integration$ xauth add 172.17.0.9:3 MIT-MAGIC-COOKIE-1  1469d123c6fcb10e0fe8915e3f44ed71
pentaho@3abd451c9b88:~/data-integration$ DISPLAY=172.17.0.9:3
pentaho@3abd451c9b88:~/data-integration$ export DISPLAY
pentaho@3abd451c9b88:~/data-integration$ ./spoon.sh


And the resulting X Window session is shown through the VNC channel:

Variations

This Dockerfile uses LXDE  :
This one uses OpenBox Window Manager only :

Conclusion

We are able to run an X Window server inside a Docker container. The resulting image is about 625 MB, which could be a lot smaller if we removed Firefox (iceweasel) and used only the Openbox window manager.

Friday, October 28, 2016

Docker Basic 101

Background

This post describes notes resulting from my initial exploration of Docker. Docker could be described as a thin VM. Essentially, Docker runs processes on a Linux host in a semi-isolated environment. It is a brilliant technical accomplishment that exploits several characteristics of running applications on a Linux-based OS. First, the result of package installation is the distribution of package files into certain directories, plus changes to certain files. Second, an executable file from one Linux distribution can run on another Linux distribution provided that all the required shared libraries and configuration files are in place.

Basic characteristic of Docker images

Docker images are essentially similar to zip archives, organized as layers upon layers. Each additional layer provides new or changed files.
A Docker image should be portable, meaning it can be used by different instances of an application on different hosts.
Docker images are built using a Dockerfile and a docker entry script, where:
a. the Dockerfile essentially shows the steps to install the application packages or files. After executing each RUN command in the Dockerfile, Docker creates a layer that stores the files added or updated by that RUN command.
b. the docker entry script shows what command will be executed when the image is run. This could be a line running an existing program, but it could also point to a provided shell script.


The Dockerfile is written using a domain-specific language that shows how to build the layers composing the Docker image. Some of the Dockerfile instructions are (a short illustrative sketch follows the list):
  • ADD : adds a file to the Docker image filesystem; if the file is a tar archive Docker extracts it first, and the source may be a file from the local filesystem or a URL on the internet
  • FROM : refers to one parent Docker image, so the files from the parent image are available in the current Docker image
  • RUN : executes some program/command inside the Docker environment; Docker captures the file changes resulting from the execution
  • CMD : sets the Docker entry command; this could point to a shell script already inside the Docker image filesystem or to an existing program
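
A minimal illustrative Dockerfile tying these instructions together (the base image, archive name, and paths are made-up examples):

FROM debian:jessie
ADD myapp.tar.gz /opt/myapp/
RUN apt-get update && apt-get install -y --no-install-recommends curl
CMD ["/opt/myapp/start.sh"]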

Basic Docker usage

There are a few basic docker commands that you need to know when using docker.

  • docker ps : prints running processes (containers in Docker lingo). The variant docker ps -a shows both running and terminated containers.
  • docker inspect <container> : shows various information about a running Docker container. Combine with grep to get shorter output, e.g. docker inspect <container> | grep 172 to filter for IP addresses. A container can be referred to by its name or id.
  • docker history <image> : shows the layers composing the image, including no-operation layers such as MAINTAINER.
  • docker exec -it <container> bash : runs a bash shell inside a container interactively. This functions mostly like ssh, without needing an ssh daemon running inside the container.
  • docker run -d --name <name> <image> [command] [params] : creates a new container from an image and runs it in the background using the entry point. There are other useful parameters before --name, such as:
    • -v hostdir:containerdir -> mounts a host directory inside the container; this also works for a single file
    • --link <othercontainer> -> puts the IP of the other container in /etc/hosts so the container can access the other container by name
    • -e VAR=value -> sets an environment variable
  • docker start <container> : starts the container using the entry point
  • docker stop <container> : terminates the process inside the container
  • docker rm <container> : deletes the container; note that named data volumes will not be deleted
Docker images from the internet can be referred to when creating new containers, and also inside a Dockerfile. For example, to create a container using the official mariadb image, with the data directory mounted from the host and using the TokuDB storage engine, use this command:

docker run -d --name maria -e MYSQL_ROOT_PASSWORD=rootpassword -v /data/docker/mysql:/var/lib/mysql mariadb  --plugin-load=ha_tokudb --plugin-dir=/usr/lib/mysql/plugin --default-storage-engine=tokudb --wait_timeout=2000 --max_connections=500

Docker will try to retrieve the mariadb image from the official image source (see https://hub.docker.com/_/mariadb/). The Docker image reference syntax is [registry/][repository/]name[:tag].
Specifying no tag in the command implies retrieving mariadb:latest.
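
For example, to pull a specific version instead of the implicit latest (the tag here is just an example):

docker pull mariadb:10.1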

Docker compose

Host-specific configuration is given as parameters of the docker run command, and these can be very long, complicating the container creation command. To simplify this, use docker-compose:
  • docker-compose up -d : reads docker-compose.yml and starts new containers according to the compose specification, assuming the docker images are already built
  • docker-compose build : reads docker-compose.yml and creates images from the specified Dockerfiles if required
  • docker-compose stop : stops the containers specified in docker-compose.yml
  • docker-compose start : starts the processes in the containers (similar to docker start) referred to by docker-compose.yml
docker-compose uses docker-compose.yml as the blueprint of the containers.
This example compose file shows the various parts of the format (note that this is the 1.x docker-compose format):

pdi:
    volumes:
     - ~/pentahorepo:/home/pentaho/pentahorepo
     - /tmp/.X11-unix:/tmp/.X11-unix
     - ~/pdidocker/dotkettle:/home/pentaho/data-integration/.kettle
     - ~/phpdocker/public/isms/comm:/home/pentaho/comm
     - ~/phpdocker/public/isms/mobility:/home/pentaho/mobility
    image: y_widyatama:pdi
    container_name: pdiisms
    environment:
     - DISPLAY=172.17.0.1:10.0
    external_links:
     - maria

The volumes clause specifies volumes that are mounted inside the container, like the -v parameter. Specifying a directory before the colon (:) in a volume line means a directory from the host; specifying a simple name instead is interpreted as a Docker data volume, which will be created if it doesn't exist yet. The path after the colon is the target mount directory inside the container.
The image clause specifies the base image of the container. Alternatively, a dockerfile clause could be specified instead of an image clause.
The environment clause lists additional environment variables that will be set inside the container.
The external_links clause refers to the names of existing running containers that will be added to the container's /etc/hosts.

Another example using two containers :

ismsdash-webserver:
  image: phpdockerio/nginx:latest
  container_name: ismsdash-webserver
  volumes:
      - .:/var/www/ismsdash
      - ./nginx/nginx.conf:/etc/nginx/conf.d/default.conf
  ports:
   - "80:80"
  links:
   - ismsdash-php-fpm

ismsdash-php-fpm:
  build: .
  dockerfile: php-fpm/Dockerfile
  container_name: ismsdash-php-fpm
  volumes:
    - .:/var/www/ismsdash
    - /data/isms/CDR:/var/www/ismsdash/public/CDR
    - ./php-fpm/php-ini-overrides.ini:/etc/php5/fpm/conf.d/99-overrides.ini
  external_links:
   - maria

   - pdiisms

This example shows two containers with some dependency links. The first, ismsdash-webserver, links to the second container using the links clause. The second, ismsdash-php-fpm, refers to two existing containers outside this docker-compose.yml file.
The first container is created from an existing Docker image. The second container requires its image to be built first using the specified Dockerfile.
The ports clause specifies port forwarding from a host port to a container port.

Conclusion

Knowledge of several basic commands is necessary in order to use Docker. This post described some of those commands, in the hope that the reader will be able to start using Docker with them.

Wednesday, October 19, 2016

Running X11 Apps inside Docker on Remote Server

Background

Docker is a fast-growing trend that I can no longer ignore, so I tried Docker on a Linux server machine. Running a server app inside Docker is a breeze, but I need to run Pentaho Data Integration on the server, and it uses an X11 display. There are several references about forwarding X11 connections to a Docker container, but none works for my setup, which has the Quartz X server running on a Mac OS X laptop and the Docker service running on a remote Linux server.

The usual way

The steps to run X Window applications in Docker containers can be read in Running GUI Apps with Docker and Alternatives to SSH X11 Forwarding for Docker Containers, and they are essentially as follows:
  1. Forward the DISPLAY environment variable to the container
  2. Forward the /tmp/.X11-unix directory to the container
I had already tried these steps with no results, because I needed to add another step before these two: forwarding the X11 connection through an ssh connection to the server (not the container). From an OS X laptop the step is as follows:

ssh -X username@servername.domain

Complications


We have DISPLAY set to something like localhost:10.0, and an empty /tmp/.X11-unix directory. The reason is that forwarding through ssh results in a TCP-based X11 service, rather than a Unix-socket-based X11 service.



Inside the container, we are also unable to connect to localhost:10.0, because inside the container localhost refers to the container's loopback interface, which is different from the host's loopback. For X11 display connections, the 10.0 means TCP port 6000+10, which as a result of the X11 forwarding only listens on the 127.0.0.1 address.

Solutions

So we need to ensure the container is able to connect to the forwarded X11 port. One solution is to use socat:

nohup socat tcp-listen:6010,bind=172.17.0.1,reuseaddr,fork tcp:127.0.0.1:6010 &

Then the DISPLAY variable should be altered to point to the host's IP on the Docker network (172.17.0.1).
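
Inside the container this means something like (172.17.0.1 being the docker0 bridge address of the host):

export DISPLAY=172.17.0.1:10.0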


Unfortunately as you can see, there are authentication problems. I tried to copy the .Xauthority file from the host into the container, but it still failed. We could only understand why after getting it to work :

It seems that because a different IP is being used, the Xauthority file needs a row with 172.17.0.1:10 as the key.
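
A sketch of adding that row inside the container, reusing the cookie value shown by xauth list on the host (the cookie below is a placeholder):

xauth add 172.17.0.1:10 MIT-MAGIC-COOKIE-1 <cookie-from-the-host>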

With xclock working, we could also try something bigger, like Pentaho's Spoon UI:

Additional notes

In order to get the Pentaho Spoon UI working, I needed to add the GTK library as one layer in the Dockerfile:

USER root

RUN apt-get update && apt-get install -y libgtk2.0-0 xauth x11-apps

Caveat: for the formula editor to work in the Spoon UI, we need to install Firefox and the libraries bridging Firefox and SWT, which is a topic for another post.

Conclusion

We have found out how to run X11-based apps from a Docker container with the X server running on a laptop. What is this good for? In some cases we need to run a GUI-based app inside a Docker container without installing VNC and an X server inside the container. However, we still have to provide the X11 libraries inside the container.


Thursday, June 9, 2016

SAP System Copy Lessons Learned

Background

Earlier this year I was part of a team that did a system copy for a 20-plus-terabyte SAP ERP RM-CA system. And just now I have been involved in doing two system copies in just over one week, for a much smaller amount of data. I think I should note some lessons learned from the experience in this blog. For the record, we are migrating from HP-UX and AIX to the Linux x86 platform.

Things that go wrong

First, following the System Copy guide carefully is quite a lot of work, mainly because some important stuff is hidden in references in the guide. And reading a SAP note that is referenced in another SAP note, which is referenced in the Installation Guide... is a bit too much. Let me describe what went wrong.

VM Time drift

The Oracle RAC cluster had a time drift problem, killing one instance when the other was shutting down. The cure for our VMware-based Linux database servers is hidden in SAP Note 989963 "Linux VMWARE Timing", which is basically to add tinker panic 0 to ntp.conf and remove the local undisciplined time source. And I think there are additional kernel parameters if your kernel is not new enough.
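
A sketch of the ntp.conf change on the affected VMs (it assumes the stock local-clock lines are present; SAP Note 989963 remains the authoritative reference):

sed -i '1i tinker panic 0' /etc/ntp.conf
sed -i -e 's/^\(server  *127\.127\.1\.0\)/#\1/' -e 's/^\(fudge  *127\.127\.1\.0.*\)/#\1/' /etc/ntp.conf
service ntpd restart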

Memory usage grows beyond whats being used

Memory usage was very high, given that our production cluster hosts so many servers, and thus so many concurrent database connections. The page table overhead was so large that the server was drowning in it. The cure is to enable hugepages in the kernel, allocate enough pages for the Oracle SGA as hugepages, and ensure Oracle can use the huge pages. See SAP Note 1672954 "Hugepages for Oracle".
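
A rough sketch of reserving hugepages for the SGA (the page count is only an example; SAP Note 1672954 has the formula for your SGA size, and the oracle user's memlock limit must also be raised accordingly):

sysctl -w vm.nr_hugepages=20480
echo 'vm.nr_hugepages = 20480' >> /etc/sysctl.conf
grep -i hugepages /proc/meminfo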

Temporary Sequential (TemSe) Objects

For the QA system copy, the step deleting inconsistencies in the TemSe consistency check took forever, like a whole day, and it still wasn't finished. It seems the SAP daily job to purge job logs had never been run on that server. Our solution was to use report RSBTCDEL2 to delete job logs in time period intervals, for example deleting every finished job log older than 1000 days, then older than 500 days. See SAP Note 48400 for information about the kinds of objects that are stored in TemSe, and the tcode or report to purge each kind of them. For example, report RSBTCDEL2 for job logs and RSBDCREO for BDC.
The lesson is: please do purge your job logs in the QA and DEV servers. In our PROD system we have 14 days of retention for jobs; in QA and DEV, maybe 100 days of retention is enough.

NFS problems

Our former senior Basis admin confirms that Linux NFS is more problematic than AIX's NFS. Our lessons learned are (an example mount command is sketched after this list):
  1. If it is possible to avoid NFS access, avoid it. For example, for transporting datafiles from source to target, NFS problems delayed our process by about 8 hours; this was resolved by obtaining terabytes of storage and copying the data files to local disk via SCP.
  2. NFS problems occur mostly when the NFS server is restarted. Be prepared to restart the NFS service on each client first, and restart the client OS if things still don't work.
  3. The recommended NFS mount options for Linux are (ref: SAP Wiki):
rw,bg,hard,[intr],rsize=32768,wsize=32768,tcp,vers=3,suid,timeo=600
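
An example mount command using those options (the server name and paths are placeholders):

mount -t nfs -o rw,bg,hard,intr,rsize=32768,wsize=32768,tcp,vers=3,suid,timeo=600 nfsserver:/export/trans /usr/sap/trans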

Out of disk space in non-database partitions

This is a manifestation of other problems, such as:
  1. Forgetting to deactivate an app server during the system copy process. The dev_w? log files were filled with endless database connection errors after the system copy recreated the database.
  2. Using the wrong directory as the export destination.
These two should be obvious, but they happen. The lessons are: check your running processes, and make sure the export destination has enough free space.

Out of disk space in sapdata partition

Our database space usage estimate was very much off from what was really needed. The mistake was that we estimated the size from the source system database size, while sapinst creates datafiles with additional space added as a safety margin.
In order to correctly estimate space, we need to be aware that:
  1. The sizes in DBSIZE.XML are only estimates for SAP's core tablespaces, and for SYSTEM and PSAPTEMP the estimate is a bit off. For SYSTEM we needed an additional 800 MB on top of the DBSIZE.XML figure in order to pass the ABAP import phase successfully.
  2. The PSAPTEMP size is recommended by SAP in Note 936441 to be 20% of the total data size used, which is very different from the estimate in DBSIZE.XML. 20% might be a bit too much for large (>500 GB) databases, but if it is too small some large indexes might fail during import. In our case, DBSIZE.XML estimated 10 GB for PSAPTEMP and we needed 20 GB to pass the ABAP import phase (by Note 936441 it should have been 100 GB, though).
  3. Sapinst will create SAP core tablespace datafiles that might consume too much disk space. Be ready to provide larger storage capacity in order to have a smooth system copy operation. In our case, we were forced to shrink the PSAPSR3 datafiles in order to make room for the SYSTEM and PSAPTEMP tablespaces.
  4. The default reserved space for root is 5% per partition. This is a very significant amount of wasted space (5% of a 600 GB sapdata partition is 30 GB wasted, for example). Some references say that ext2fs performance drops after 95% disk usage, but because sapdata mainly stores datafiles with autoextend off, this should not be an issue. For sapdata I changed the reserved space to 10000 blocks: tune2fs -r 10000 /x/y/sapdata

Permission issues

We used local Linux users for OS authentication and authorization, and the uid/gid need to match across the servers in order to make transport via NFS work. We found only minimal clues about this issue in the System Copy guide. Our lessons here are (a sketch of aligning the ids follows this list):
  • you need to restart sapinst after changing groups
  • saproot.sh is a tool to fix permissions AFTER the gid/uid are already fixed
  • double check that you don't have duplicate entries in /etc/group or /etc/passwd as a result of the gid/uid update efforts
  • in SUSE there is an nscd service that caches groups; restart the service to update the cache, and you don't need to restart the OS
  • ls -l can sometimes be confusing; ls -nl is better, because it shows numerical uids and gids.
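
A sketch of aligning the ids with the source system (the numeric ids and the user name are examples; afterwards, re-check ownership with ls -nl and fix it with chown or saproot.sh as noted above):

groupmod -g 1001 sapsys
usermod -u 1002 -g sapsys prdadm
ls -nl /usr/sap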


Performance issues

Migrating to a new OS and hardware environment, we faced some performance issues. SAP Notes that give some insight on this are Note 1817553 "What to do in case of general performance issues" and Note 853576 "Performance analysis w ASH and Oracle Advisors". Currently our standard operating procedure is like this:
  1. Execute AWR from ST04 -> Performance -> Wait Event Analysis -> Workload Reporting
  2. For the AWR, choose a begin snapshot and end snapshot spanning the half hour in which the performance issue occurs
  3. In the AWR result, check SQLs with high elapsed time or IO
  4. Check the indexes and tables involved for stale statistics
  5. Check index storage quality with report RSORAISQN (see Note 970538)
  6. Execute Oracle SQL Tuning on the SQL id (sqltrpt.sql; a sketch follows this list)
  7. Apply the recommendation
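
Step 6 can be run with sqlplus on the database server; a short sketch (the script lists the top SQL statements and prompts for the SQL id to tune):

sqlplus / as sysdba
SQL> @?/rdbms/admin/sqltrpt.sql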

Summary

If you only do one dry run before the real system copy, be prepared to handle unexpected things during the real process, such as obtaining additional storage. We might avoid some problems by reading the System Copy guide carefully and reading 'between the lines'.

Saturday, April 16, 2016

'Cached' memory in Linux Kernel

It is my understanding that the free memory in the Linux operating system can be seen by checking the second line in the result of "free -m":
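
The output of free -m looks roughly like this (the numbers are purely illustrative):

             total       used       free     shared    buffers     cached
Mem:         64418      63891        527        920        310      30984
-/+ buffers/cache:      32597      31821
Swap:        20479       1045      19434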





The first line shows the free memory that is really, truly free. The second line shows the free memory combined with buffers and cache. The reason, I was told, is that buffer and cache memory can be converted to free memory whenever there is a need. The cache memory is filled with the filesystem cache of the Linux operating system.

The problem is, I was wrong. There are several cases where I have found that the cache memory is not reduced when an application needs more memory. Instead, part of the application memory is sent to swap, increasing swap usage and causing pauses in the system (while the memory pages are written to disk). In one case an Oracle database instance restarted, and the team thinks it was because memory demand was too high (I think this is a bug).

The cache memory is supposed to be reduced when we issue this command (ref: how-do-you-empty-the-buffers-and-cache-on-a-linux-system):
# echo 1 > /proc/sys/vm/drop_caches
But on our Oracle database instances, running the command only produced a small reduction in the cached column. I also tried echo with 2 and 3 as the value, with the same results.

The truth

First, the 'cached' part does not only contain the file cache. It also contains memory-mapped files and anonymous mappings. Shared memory also falls under 'anonymous mappings'. On Oracle systems without hugepages enabled, the SGA (System Global Area) is created as shared memory (see pythian-goodies-free-memory-swap-oracle-and-everything). Of course, the SGA cannot be freed from memory... otherwise the database would be offline!

Another way to get better understanding of the memory usage is to use /proc/meminfo :
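
A quick way to pull out the interesting fields:

grep -E 'Shmem|PageTables|HugePages' /proc/meminfo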

You might want to check the Shmem usage there. On this example server, the 0 values in the Hugepages section show that hugepages are not being used. For an Oracle database with a large amount of RAM (say, 64 GB) and a large number of processes (500 or so) this can become a problem, mainly because the PageTables entry becomes exceedingly large (that's another story). The memory in PageTables cannot be used for anything else, so it is counted as 'used' in the 'free -m' output.

Second, some people say in their blogs that the page cache (file cache) competes with application memory for portions of the real memory. A SUSE document confirms this:
"The kernel swaps out rarely accessed memory pages to use freed memory pages as cache to speed up file system operations, for example during backup operations."

Limiting page cache usage

SUSE developed additional kernel parameters to control this behavior: vm.pagecache_limit_mb and vm.pagecache_limit_ignore_dirty. These two parameters can be used to limit the page cache (= file cache) size that competes with ordinary memory. When the page cache is below this limit, it is allowed to compete directly with application memory, allowing the kernel to swap out either one (file cache or application memory) when a block has not been accessed recently. Page cache above this limit is deemed less important than application memory, so application memory will not get swapped out as long as there is a large amount of page cache that could be freed.
In Red Hat Enterprise Linux 5 we have the kernel parameter vm.pagecache, which is similar to SUSE's parameter but uses a percentage value instead. The default value is 100, meaning the whole of memory is available for use as page cache (see Memory_Usage_and_Page_Cache-Tuning_the_Page_Cache). I believe this is also the case with the CentOS Linuxes.
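
A sketch of setting the SUSE parameters (the 4 GB limit is only an example, and this works only on SUSE kernels that carry the pagecache-limit patch):

sysctl -w vm.pagecache_limit_mb=4096
sysctl -w vm.pagecache_limit_ignore_dirty=1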

Conclusion

You might want to check your Linux distribution's documentation about the page cache. There are some non-standard parameters in each distribution that enable us to contain the page cache usage of the Linux operating system.