Sunday, December 10, 2017

Ruby Blues in Openshift Origin v2 (or How I broke my app by installing new relic gem)

The day I tried to install the New Relic RPM gem, it caused almost a day of downtime for a Rails application run by Phusion Passenger. It turns out that there are several things worth noting about Ruby dependency management.

The First Error

The error message that caused the downtime looked like this :

Ruby (Rack) application could not be started

These are the possible causes:
  • There may be a syntax error in the application's code. Please check for such errors and fix them.
  • A required library may not installed. Please install all libraries that this application requires.
  • The application may not be properly configured. Please check whether all configuration files are written correctly, fix any incorrect configurations, and restart this application.
  • A service that the application relies on (such as the database server or the Ferret search engine server) may not have been started. Please start that service.
Further information about the error may have been written to the application's log file. Please check it in order to analyse the problem.
Error message:
Could not find newrelic_rpm-3.18.1.330 in any of the sources (Bundler::GemNotFound)
Exception class:
PhusionPassenger::UnknownError
Application root:
/var/www/openshift/console
The nature of this error is tightly coupled to how 'bundle' and 'gem install' work.

Bundle install and Gemfile.lock

If we execute bundle install, bundle would :
1) connect to the internet to check for the latest gem versions
2) sometimes update Gemfile.lock when it finds a newer gem version (I still need to explore exactly when it does)
3) install the gems using the standard gem environment
Any of these three can break your application in some circumstances. In my case, the environment running the application was a bit different from the environment on the command line.
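One way to see whether a particular bundle run touched the lock file is to checksum Gemfile.lock before and after the command. The following is only a minimal sketch: the helper name lock_changed is my own invention, and it assumes the application directory already contains a Gemfile.lock.

```shell
# Hypothetical helper: run a command inside an application directory and
# report whether Gemfile.lock was modified by it.
# Usage: lock_changed /var/www/openshift/console bundle install --local
lock_changed() {
  app_dir=$1; shift
  lock="$app_dir/Gemfile.lock"
  before=$(cksum "$lock" 2>/dev/null)
  # The command's own output goes to stderr so stdout carries only the verdict.
  ( cd "$app_dir" && "$@" ) >&2
  after=$(cksum "$lock" 2>/dev/null)
  if [ "$before" != "$after" ]; then
    echo "changed"
  else
    echo "unchanged"
  fi
}
```

Keeping a copy of the lock file before experimenting (as I did with Gemfile.lock.copy further below) makes it easy to go back when the verdict is "changed".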

Gem environment

The result of running 'gem environment' tells much about the situation :

RubyGems Environment:
  - RUBYGEMS VERSION: 1.8.24
  - RUBY VERSION: 1.9.3 (2013-11-22 patchlevel 484) [x86_64-linux]
  - INSTALLATION DIRECTORY: /opt/rh/ruby193/root/usr/local/share/gems
  - RUBY EXECUTABLE: /opt/rh/ruby193/root/usr/bin/ruby
  - EXECUTABLE DIRECTORY: /opt/rh/ruby193/root/usr/local/bin
  - RUBYGEMS PLATFORMS:
    - ruby
    - x86_64-linux
  - GEM PATHS:
     - /opt/rh/ruby193/root/usr/local/share/gems
     - /root/.gem/ruby/1.9.1
     - /opt/rh/ruby193/root/usr/share/gems
  - GEM CONFIGURATION:
     - :update_sources => true
     - :verbose => true
     - :benchmark => false
     - :backtrace => false
     - :bulk_threshold => 1000
  - REMOTE SOURCES:
     - http://rubygems.org/

The thing to note here is the difference between INSTALLATION DIRECTORY and GEM PATHS. Both /usr/local/share/gems and /usr/share/gems (under /opt/rh/ruby193/root) appear in GEM PATHS, but gem install puts new gems under /usr/local/share/gems.

The environment for running Ruby is a bit different too :


broker ~ # cat /var/www/openshift/console/script/console_ruby
export LD_LIBRARY_PATH=/opt/rh/ruby193/root/usr/local/lib64:/opt/rh/ruby193/root/usr/lib64:/opt/rh/v8314/root/usr/lib64
export GEM_HOME=/opt/rh/ruby193/root/usr/share/gems
export GEM_PATH=/opt/rh/root/usr/local/share/gems:/opt/rh/ruby193/root/usr/share/gems

ruby193-ruby $@

So the first GEM_PATH entry is missing the ruby193 prefix. Without the prefix, the entry points to /opt/rh/root/usr/local/share/gems instead of /opt/rh/ruby193/root/usr/local/share/gems, so gems installed in the actual directory are not found, causing the error. What's with the local directory anyway? This is not the first time I have been confused by the Linux/Unix directory scheme. It has historical significance, but IMHO it is better not to have too many directories nowadays.
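A quick way to catch this kind of typo is to check that every GEM_PATH entry is an existing directory. This is a minimal sketch; the function name missing_gem_dirs is hypothetical and not part of RubyGems or OpenShift.

```shell
# Hypothetical helper: print every entry of a colon-separated path list
# that is not an existing directory.
# Usage: missing_gem_dirs "$GEM_PATH"
missing_gem_dirs() {
  old_ifs=$IFS
  IFS=:
  # $1 is left unquoted on purpose so it is split on the colons.
  for dir in $1; do
    [ -d "$dir" ] || echo "$dir"
  done
  IFS=$old_ifs
}
```

Run against the GEM_PATH from console_ruby, I would expect it to report the entry without the ruby193 prefix, since that directory should not exist on the broker host.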

Solution to Problem I (or how to install newrelic_rpm in openshift broker&console)

Instead of fixing the GEM_PATH, I opted to install the gem into the /usr/share/gems directory by using --install-dir :


gem install newrelic_rpm -v 3.18.1.330 --install-dir=/opt/rh/ruby193/root/usr/share/gems

Then add this one-liner to /var/www/openshift/broker/Gemfile :

gem 'newrelic_rpm', '3.18.1.330'


Second error

Actually, before I found the solution above, another error cropped up with these symptoms :


broker console # bundle install --local
Could not find rake-0.9.2.2 in any of the sources
broker console # bundle install
Fetching gem metadata from https://rubygems.org/...........
Fetching gem metadata from https://rubygems.org/..
Could not find openshift-origin-console-1.26.3.1 in any of the sources

As someone who is not familiar with Ruby development, I found these two errors confusing. The first error, where bundle cannot find the rake package, is very strange, because the 0.9.2.2 package is there :
broker console # gem list rake

*** LOCAL GEMS ***

rake (0.9.2.2)
broker console # ls -l /opt/rh/ruby193/root/usr/share/gems/gems | grep rake
drwxr-xr-x.  4 root root 4096 Nov 18  2014 rake-0.9.2.2
broker console #

Because version 0.9.2.2 is available on the 'net (see https://rubygems.org/gems/rake/versions), I tried to use the internet by removing the --local flag, and this time bundle complained about openshift-origin-console, which is not available on the great internet.

The clue for the second error is the output of the bundle config command :
broker console # bundle config
Settings are listed in order of priority. The top value will be used.

path
Set for your local app (/var/www/openshift/console/.bundle/config): "vendor"

disable_shared_gems
Set for your local app (/var/www/openshift/console/.bundle/config): "1"

This is not the same as the neighboring application, which has no settings at all :

broker broker # bundle config
Settings are listed in order of priority. The top value will be used.

broker broker # 

Solution to problem II

The solution is to remove the .bundle/config file, which was inadvertently created when I tried running bundle install --path vendor. The file redirects local gem searches to the ./vendor directory, thus skipping the /opt/rh/ruby193/root/usr/share/gems directory and causing the second error. The bundle program was unable to find the openshift gem because it was searching the ./vendor directory.


broker console # ls -al
total 104
drwxr-x---. 14 apache apache  4096 Dec 10 20:46 .
drwxr-xr-x.  4 root   root    4096 Jun 15  2014 ..
drwxr-x---.  8 apache apache  4096 Dec  8 08:05 app
drwxr-xr-x.  2 root   root    4096 Dec 10 20:46 .bundle
drwxr-x---.  5 apache apache  4096 Dec  8 08:38 config
-rw-r-----.  1 apache apache   166 Jul 11  2014 config.ru
-rw-r-----.  1 apache apache   809 Dec  8 08:37 Gemfile
-rw-r--r--.  1 root   root    3487 Dec 10 20:31 Gemfile.lock
-rw-r--r--.  1 root   root    3453 Dec  8 07:18 Gemfile.lock.copy
drwxr-x---.  6 apache apache  4096 Dec  9 13:12 httpd
drwxr-x---.  2 apache apache  4096 Dec  8 08:38 log
drwxr-x---.  2 apache apache  4096 Jul 11  2014 .openshift
-rw-r-----.  1 apache apache 11754 Jul 11  2014 openshift-origin-console.spec
drwxr-x---.  3 apache apache  4096 Dec  8 08:05 public
-rw-r-----.  1 apache apache   398 Jul 11  2014 Rakefile
-rw-r-----.  1 apache apache  9208 Jul 11  2014 README.rdoc
drwxr-x---.  2 apache apache  4096 Jul 11  2014 run
drwxr-x---.  2 apache apache  4096 Dec  9 13:12 script
drwxr-x---.  7 apache apache  4096 Dec  8 08:05 test
drwxr-x---.  6 apache apache  4096 Dec  8 08:05 tmp
drwxr-x---.  6 apache apache  4096 Dec 10 20:46 vendor
-rw-r--r--.  1 root   root    1485 Dec 10 20:27 versionlist.log
broker console # rm -rf .bundle
broker console #

Now I can run bundle install --local again.
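In hindsight, moving the directory aside would have been a gentler fix than rm -rf, because the settings stay available for inspection. A minimal sketch of that idea follows; the function name park_bundle_config and the .bundle.disabled target name are my own choices, not Bundler conventions.

```shell
# Hypothetical helper: show an app-local Bundler config and then move the
# whole .bundle directory aside instead of deleting it.
# Usage: park_bundle_config /var/www/openshift/console
# Note: if .bundle.disabled already exists, mv would move .bundle into it.
park_bundle_config() {
  app_dir=$1
  cfg="$app_dir/.bundle/config"
  if [ -f "$cfg" ]; then
    cat "$cfg"    # print the settings that are about to be disabled
    mv "$app_dir/.bundle" "$app_dir/.bundle.disabled"
  fi
}
```

After running it, bundle config in that directory should no longer list the path and disable_shared_gems settings for the local app.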

Lessons learned

What I learned is that bundle install is evil. It installs new gems and changes application dependencies, and sometimes this breaks your app (especially if your app is written by Red Hat and ships with a wrong GEM_PATH entry). You must understand bundle's basic mechanism, because incorrect parameters can break your application.

Saturday, October 21, 2017

Running Pods as Anyuid in Openshift Origin

When using Openshift Origin, by default all pods run under the 'restricted' security context, where they are forced to use a generated user id. Some containers just don't work that way, so we need to relax the restriction a bit. Reference : https://blog.openshift.com/understanding-service-accounts-sccs/

Creating A service account

First, create a service account in your project (see https://docs.openshift.com/enterprise/3.0/admin_guide/manage_scc.html). Here is a sample YAML to do that :
kind: ServiceAccount
apiVersion: v1
metadata:
  name: mysvcacct
Note that underscores are not allowed in a service account name, even though the official openshift example contains one.

Assigning anyuid

Then, a cluster administrator should log in, switch to the project, and assign the anyuid SCC :


oc login
oc project theproject
oc adm policy add-scc-to-user anyuid -z mysvcacct
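The -z flag is shorthand for a service account in the current project. When the command is run from a different project, the full user name of the service account can be spelled out instead. A small sketch, where the helper name sa_user is hypothetical :

```shell
# Hypothetical helper: build the full user name of a service account
# from a project name and a service account name.
sa_user() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# Example (not executed here):
#   oc adm policy add-scc-to-user anyuid "$(sa_user theproject mysvcacct)"
```

This form does not depend on which project is currently selected with oc project.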

Using the service account

Now, edit the deployment config or the replication controller config to use the service account :

apiVersion: v1
kind: ReplicationController
metadata:
  name: spark-master-controller
  namespace: sparkz
  selfLink: /api/v1/namespaces/sparkz/replicationcontrollers/spark-master-controller
  uid: a1f26de8-b6e3-11e7-846c-005056a56b12
  resourceVersion: '129053544'
  generation: 2
  creationTimestamp: '2017-10-22T04:44:04Z'
  labels:
    component: spark-master
spec:
  replicas: 1
  selector:
    component: spark-master
  template:
    metadata:
      creationTimestamp: null
      labels:
        component: spark-master
    spec:
      containers:
        - name: spark-master
          image: 'gcr.io/google_containers/spark:latest'
          command:
            - /start-master
          ports:
            - containerPort: 7077
              protocol: TCP
            - containerPort: 8080
              protocol: TCP
          resources:
            requests:
              cpu: 100m
          terminationMessagePath: /dev/termination-log
          imagePullPolicy: Always
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      serviceAccountName: mysvcacct
      securityContext: {}
status:
  replicas: 1
  fullyLabeledReplicas: 1
  readyReplicas: 1
  availableReplicas: 1
  observedGeneration: 2

Note that serviceAccountName sits at the same level as containers inside the template spec. Add the line if it doesn't exist yet.
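Instead of editing the YAML by hand, the same field can probably be set with oc patch. This is a sketch under the assumption that the replication controller and namespace are named as in the example above; the helper sa_patch is hypothetical.

```shell
# Hypothetical helper: build a patch document that sets
# spec.template.spec.serviceAccountName to the given name.
sa_patch() {
  printf '{"spec":{"template":{"spec":{"serviceAccountName":"%s"}}}}\n' "$1"
}

# Example (not executed here):
#   oc patch rc/spark-master-controller -n sparkz -p "$(sa_patch mysvcacct)"
```

As far as I know, a replication controller does not recreate pods that are already running, so existing pods need to be deleted before the new service account takes effect.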

Cleaning Openshift Origin Images Registry

When running and tending an Openshift Origin cluster (for example, Origin version 3.7), it is normal to start with a small storage allocation. However, we soon find that the registry storage fills up quickly with images from each build process. This post shows how to clean them up.

Preparation before pruning

First you need the oc (origin client) binary and a user account with cluster administration capability.
If the openshift docker registry is installed inside the cluster without external access, then you are also going to need OS access to one of the hosts inside the cluster.
The first step is to log in to the cluster from your client or from inside one of the hosts :
oc login

Prune steps

Reading the documentation (https://docs.openshift.com/enterprise/3.0/admin_guide/pruning_resources.html), we find that pruning starts with deployments, then builds, and lastly images.

Pruning Deployments

Run this to preview which deployments are going to be pruned :
oc adm prune deployments
Then execute the pruning :
oc adm prune deployments --confirm
We can use either the CLI utility oadm (origin adm) or the oc adm command, depending on which executable is available.

Pruning Builds

Run this to preview the builds :
oc adm prune builds
or
oadm prune builds
Then execute the pruning :
oc adm prune builds --confirm

Pruning Registry Images

And finally, we prune images. Images cannot be pruned while deployments or builds still refer to them, so the deployment and build prune steps above need to be done first.

oadm prune images
Confirm the pruning with the additional --confirm flag :
oadm prune images --confirm

If the registry is not accessible, we get this message :
error: error communicating with registry: Get http://IPredacted:5000/healthz: dial tcp IPredacted:5000: getsockopt: operation timed out
Such an error means we need to ssh into one of the hosts in the cluster to be able to prune images.
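The three prune steps can be wrapped in one small function so that the order is always the same. This is a sketch only: the OC_ADM variable and the prune_all name are my own conventions, and the function simply passes any extra arguments (such as --confirm) to each step.

```shell
# Command prefix for the admin client; set OC_ADM=oadm where only oadm exists.
OC_ADM=${OC_ADM:-"oc adm"}

# Run the prune steps in the documented order: deployments, builds, images.
# Without arguments each step only previews; pass --confirm to really prune.
prune_all() {
  for kind in deployments builds images; do
    # $OC_ADM is unquoted on purpose so "oc adm" splits into two words.
    $OC_ADM prune "$kind" "$@" || return 1
  done
}

# Example (not executed here):
#   prune_all            # preview
#   prune_all --confirm  # actually prune
```

Stopping at the first failing step keeps the image prune from running when the deployment or build prune did not succeed.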
 

Lessons Upgrading MySQL DB 5.1 to Percona 5.7

I recently upgraded a database server that was previously running MySQL 5.1 (standard, Sun/Oracle version) to Percona Server 5.7. A few quirks were notable enough to warrant this blog post.

Planning and Preparation

A Percona blog post (mysql-upgrade-best-practices) states that the best way to upgrade across such a large gap in major versions (5.1 to 5.7) is to do a full logical dump of all databases except mysql, dump users and grants, uninstall the database and remove the data files, then install the new version and import the logical dump and grants. But alas, the database we were going to upgrade is so big that the IO subsystem became a bottleneck during the logical dump. Our colleagues tried mysqldump and it took more than 2 days to run, prompting us to cancel the backup (otherwise it would have interfered with workday application usage of the database). Reading the blog I noted that :
  1. for a major version upgrade, a logical dump and restore is the safest way to go.
  2. for a minor version upgrade, an in-place upgrade is possible.
  3. for a major version upgrade, do not skip versions. I deduced that the sequence is : 5.1 ⇨ 5.5 ⇨ 5.6 ⇨ 5.7
  4. two mandatory readings for my upgrade : http://dev.mysql.com/doc/refman/5.5/en/upgrading-from-previous-series.html and http://dev.mysql.com/doc/refman/5.6/en/upgrading-from-previous-series.html
  5. another two mandatory readings for Percona Server : https://www.percona.com/doc/percona-server/5.5/upgrading_guide_51_55.html#changes-in-server-configuration and https://www.percona.com/doc/percona-server/5.6/changed_in_56.html
The main thing is to be careful. In the IT world, being careful means doing backups before you do something big.
Backing up is mandatory for any upgrade task on a production system. Our database server runs on the VMware platform, so we prepared to take a snapshot before doing the upgrade. But an alternative backup is in order, so we installed Percona Xtrabackup to create one. Another option is to use mydumper (see https://www.percona.com/blog/2015/11/12/logical-mysql-backup-tool-mydumper-0-9-1-now-available/), but in the past I had problems compiling it, so I avoided it.

Percona Xtrabackup

For open-source software, it is strange that the Percona Xtrabackup PDF manual is not easily found. You need to register your name, address, company, et cetera, just to get hold of the PDF version of the manual. Anyway, the quirk is that the manual sometimes says 'innobackupex' and sometimes says 'xtrabackup'. Checking the installed files reveals that innobackupex is a symbolic link to xtrabackup. It seems that the two are now interchangeable, but no such statement is found in the PDF manual.
Quoting chapter ten of the Percona Xtrabackup 2.4.8 PDF manual :
innobackupex is the tool which provides functionality to backup a whole MySQL database instance using the xtrabackup in combination with tools like xbstream and xbcrypt
The paragraph above is confusing, because it suggests that the correct way to do a backup is by using innobackupex. But in the web version of the manual we get :

innobackupex
innobackupex is the symlink for xtrabackup. innobackupex still supports all features and syntax as 2.2 version did, but is now deprecated and will be removed in next major release.

It seems that the web version is the best one to follow. Unfortunately, I noticed this only after completing the backup task.
Our backup command was like this :
xtrabackup --backup --compress --compress-threads=4  --target-dir=/targetdir/compressed/ --user=root --password=xxxxxx
One quirk is that xtrabackup doesn't like the built-in InnoDB engine (error : Built-In InnoDB 5.1 is not supported in this release), so I needed to add these lines to /etc/my.cnf and restart the db before running the xtrabackup command :

ignore-builtin-innodb
plugin-load=innodb=ha_innodb_plugin.so
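A backup is only useful if it can be restored, and a compressed xtrabackup backup needs two more steps before it is usable. The sketch below is based on my reading of the xtrabackup 2.4 options (--decompress and --prepare); the XTRABACKUP variable and the prepare_backup name are hypothetical, and as far as I know --decompress needs the qpress tool to be installed.

```shell
# Binary to call; can be overridden, for example for a dry run.
XTRABACKUP=${XTRABACKUP:-xtrabackup}

# Make a compressed backup restorable: decompress the files, then apply
# the redo log so the data files are consistent.
prepare_backup() {
  target_dir=$1
  $XTRABACKUP --decompress --target-dir="$target_dir" || return 1
  $XTRABACKUP --prepare --target-dir="$target_dir"
}

# Example (not executed here):
#   prepare_backup /targetdir/compressed/
```

Doing this on a copy of the backup, on another machine if possible, keeps the original backup untouched.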

Replacing MySQL 5.1 with Percona Server 5.1

To replace mysql 5.1, we uninstall it and then install Percona Server. Both (mysql and percona server) use my.cnf. The steps are :
  1. stop mysqld service (note: mysqld is the Oracle-based service, percona-based service in RHEL is mysql)
  2. ensure backups are done; if not, create one.
  3. uninstall mysql by : yum remove mysql mysql-server
  4. install percona server by : yum install Percona-Server-client-51 Percona-Server-server-51 (refer to the yum-related installation steps in https://www.percona.com/doc/percona-server/5.1/installation.html)
These steps are quite straightforward, but Percona Server 5.1 doesn't like the innodb plugin settings that were added before (Error: The option ignore-builtin-innodb is incompatible with Percona Server with XtraDB), forcing me to remove or comment out these two lines :


#ignore-builtin-innodb
#plugin-load=innodb=ha_innodb_plugin.so
After that, and after starting the mysql service (not mysqld), all was well.

Upgrading Percona Server 5.1 to 5.5

  1. stop mysql service
    • service mysql stop
  2. check installed packages 
    • rpm -qa | grep Percona-Server
  3. uninstall 
    • rpm -qa | grep Percona-Server | xargs rpm -e --nodeps
  4. install 5.5 version by 
    • yum install Percona-Server-server-55 Percona-Server-client-55
  5. run in skip grant tables mode:
    • /usr/sbin/mysqld --skip-grant-tables --user=mysql &
  6. then do the actual upgrade process:
    • mysql_upgrade
  7. stop and start :
    • service mysql stop
    • service mysql start
The process ran smoothly with no quirks.
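One detail in the steps above is that mysqld is started in the background in step 5, so mysql_upgrade in step 6 may run before the server accepts connections. A small retry loop can bridge that gap. This is a generic sketch; wait_until is a hypothetical helper name, and using mysqladmin ping as the readiness check is my assumption.

```shell
# Hypothetical helper: retry a command once per second until it succeeds
# or the given number of attempts is used up.
wait_until() {
  attempts=$1; shift
  while [ "$attempts" -gt 0 ]; do
    "$@" >/dev/null 2>&1 && return 0
    attempts=$((attempts - 1))
    sleep 1
  done
  return 1
}

# Example (not executed here):
#   /usr/sbin/mysqld --skip-grant-tables --user=mysql &
#   wait_until 60 mysqladmin ping && mysql_upgrade
```

If the server never comes up, the function returns non-zero and mysql_upgrade is not started.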

Upgrading Percona Server 5.5 to 5.6


For the 5.5 ⇨ 5.6 upgrade, similar steps are found in https://www.percona.com/doc/percona-server/5.6/upgrading_guide_55_56.html :
  1. stop mysql service
    • service mysql stop
  2. uninstall by 
    • rpm -qa | grep Percona-Server | xargs rpm -e --nodeps
  3. install by
    • yum install Percona-Server-server-56 Percona-Server-client-56
  4. run in skip grant tables mode:
    • /usr/sbin/mysqld --skip-grant-tables --user=mysql &
  5. then do the actual upgrade process:
    • mysql_upgrade
  6. stop and start :
    • service mysql stop
    • service mysql start
In step 4, the server refused to start because of the unknown 'log_slow_queries' option in /etc/my.cnf. It seems that in 5.6 this is replaced by slow_query_log_file (see https://stackoverflow.com/questions/10755151/mysql-what-is-the-difference-between-slow-query-log-vs-log-slow-queries), so I replaced log_slow_queries with slow_query_log_file.
With that resolved, we proceeded to step 5 and found another error :
mysqlcheck: Got error: 1045: Access denied for user 'root'@'localhost' (using password: NO) when trying to connect
This error popped up when I tried to run mysql_upgrade. It seems this is a known bug (https://bugs.mysql.com/bug.php?id=72896) that didn't get fixed, in which mysql_upgrade calls flush privileges, which rereads the grant tables and so ends the effect of --skip-grant-tables. Our solution was to use an alternate syntax to execute mysql_upgrade :
mysql_upgrade -u root -p
After doing that, all was well.
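The log_slow_queries rename described above can be scripted so the same change is applied consistently on every server. This is a minimal sketch; the function name rename_slow_log_option is hypothetical, and it only handles lines of the form option = value.

```shell
# Hypothetical helper: rename log_slow_queries to slow_query_log_file in a
# my.cnf-style file, keeping the original as <file>.bak.
# Usage: rename_slow_log_option /etc/my.cnf
rename_slow_log_option() {
  cnf=$1
  cp "$cnf" "$cnf.bak" || return 1
  # Accept both log_slow_queries and log-slow-queries, keep indentation.
  sed 's/^\([[:space:]]*\)log[_-]slow[_-]queries\([[:space:]]*=\)/\1slow_query_log_file\2/' \
    "$cnf.bak" > "$cnf"
}
```

As far as I know, slow_query_log_file only sets the file name, so a line slow_query_log = 1 is also needed if the slow query log should stay enabled.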

Upgrading Percona Server 5.6 to 5.7

The reference for the 5.6 ⇨ 5.7 upgrade is https://www.percona.com/doc/percona-server/5.7/upgrading_guide_56_57.html. It is essentially the same as the 5.5 ⇨ 5.6 upgrade, so I will not duplicate it here. But there are a few major differences :

  • By default, mysql_upgrade will convert tables that use date, time, and timestamp columns to the MySQL 5.6.4 format (note that the 5.6 upgrade process does not issue a warning about these tables, and doesn't suggest a conversion process either). The new binary date/time format is more space-efficient and allows fractional seconds (such as TIMESTAMP(4)). According to https://www.percona.com/blog/2016/04/27/upgrading-to-mysql-5-7-focusing-on-temporal-types/, the majority of mysql_upgrade's running time is now spent converting these tables (shown as ALTER TABLE ... FORCE). A workaround to prevent this is to run mysql_upgrade with the -s / --upgrade-system-tables flag, which does not upgrade data tables.
  • The default sql_mode now includes ONLY_FULL_GROUP_BY and STRICT_TRANS_TABLES. ONLY_FULL_GROUP_BY enforces that GROUP BY queries are in proper form, but unfortunately this breaks many sloppy SQL statements that were previously allowed to run. STRICT_TRANS_TABLES changes the MySQL behavior on INSERT and UPDATE queries regarding invalid field contents (for transactional tables, invalid values are rejected with an error instead of being adjusted with a warning).
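If an application cannot be fixed right away, one stopgap is to drop a single mode from the server's sql_mode value. The sketch below only does the string manipulation; strip_sql_mode is a hypothetical helper, and the mysql invocations in the comments are an assumption about how it would be used.

```shell
# Hypothetical helper: remove one mode from a comma-separated sql_mode value.
strip_sql_mode() {
  printf '%s\n' "$1" | tr ',' '\n' | grep -v -x "$2" | paste -s -d ',' -
}

# Example (not executed here):
#   current=$(mysql -N -B -e 'SELECT @@GLOBAL.sql_mode')
#   mysql -e "SET GLOBAL sql_mode = '$(strip_sql_mode "$current" ONLY_FULL_GROUP_BY)'"
```

SET GLOBAL only affects new connections and is lost on restart, so the same value would also have to go into my.cnf to make it permanent.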

Conclusion

In-place upgrades from 5.1 to 5.7 are indeed feasible, and are especially attractive when the data size is quite large and mysqldump takes too much time to run. For backups, Percona Xtrabackup performs the task faster than mysqldump (but produces a binary backup).





Thursday, March 9, 2017

Securing Openshift Origin Nodes

Background

We have deployed an Openshift Origin cluster based on the Origin Milestone 4 release. When a security assessment was performed on several of the applications in the cluster, some issues cropped up that needed remediation. Some issues relate to application code, others to the openshift node configuration, which we shall discuss here.

SSH issues

One of the issues is SSH weak algorithm support.
To remediate that, we need to tweak /etc/ssh/sshd_config by inserting additional lines :

# mitigation for the security assessment finding: SSH weak algorithm support
Ciphers aes128-ctr,aes192-ctr,aes256-ctr
MACs hmac-sha1,hmac-ripemd160,hmac-ripemd160@openssh.com
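To double-check the result, the Ciphers line can be scanned for CBC-mode or arcfour entries, which are the ones such assessments usually flag. This is a minimal sketch; weak_ssh_ciphers is a hypothetical helper and its idea of "weak" is my own simplification.

```shell
# Hypothetical helper: list CBC-mode or arcfour ciphers enabled by the
# Ciphers line of an sshd_config-style file (prints nothing if none).
# Usage: weak_ssh_ciphers /etc/ssh/sshd_config
weak_ssh_ciphers() {
  awk 'tolower($1) == "ciphers" {
         n = split($2, c, ",")
         for (i = 1; i <= n; i++)
           if (c[i] ~ /-cbc|^arcfour/) print c[i]
       }' "$1"
}
```

Running sshd -t before restarting the service is also worthwhile, because it checks the configuration file for syntax errors.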

SSL issues

The other issue is related to SSL crypto algorithms. The 3DES cipher is no longer considered secure, so we need to tweak /etc/httpd/conf.d/000001_openshift_origin_node.conf (line 63) by adding !3DES:!DES-CBC3-SHA :

SSLCipherSuite kEECDH:+kEECDH+SHA:kEDH:+kEDH+SHA:+kEDH+CAMELLIA:kECDH:+kECDH+SHA:kRSA:+kRSA+SHA:+kRSA+CAMELLIA:!aNULL:!eNULL:!SSLv2:!RC4:!DES:!EXP:!SEED:!IDEA:!3DES:!DES-CBC3-SHA


We also need to disable SSLv2 and SSLv3 in 000001_openshift_origin_node.conf (line 58) :

SSLProtocol ALL -SSLv2 -SSLv3

And, because SSL certificate chains are a bit tricky, we are required to have an SSLCertificateChainFile line too (inserted at line 32 of the same file) :

SSLCertificateChainFile /etc/pki/tls/certs/localhost.crt

The httpd SSL virtual host configuration conflicts with openshift's, so we need to delete all virtual host lines in /etc/httpd/conf.d/ssl.conf.

As the final step, the files /etc/pki/tls/certs/localhost.crt and /etc/pki/tls/private/localhost.key need to be replaced with the company's valid SSL certificate and key.

Restart httpd afterwards.

SSL in node proxy issue

The Node.js websocket proxy runs on port 8443 and also has SSL issues. We use the websocket proxy if an application in openshift requires websocket technology.

In /etc/openshift/web-proxy-config.json (between the private key line at line 125 and the } at line 126), we need to add this line (the preceding line then needs a trailing comma to keep the JSON valid) :

"ciphers" : "kEECDH:+kEECDH+SHA:kEDH:+kEDH+SHA:+kEDH+CAMELLIA:kECDH:+kECDH+SHA:kRSA:+kRSA+SHA:+kRSA+CAMELLIA:!aNULL:!eNULL:!SSLv2:!RC4:!DES:!EXP:!SEED:!IDEA:+3DES:!DES-CBC3-SHA"

We also need to replace the file /opt/rh/nodejs010/root/usr/lib/node_modules/openshift-node-web-proxy/lib/utils/http-utils.js with the latest version from https://raw.githubusercontent.com/openshift/origin-server/master/node-proxy/lib/utils/http-utils.js. Just open the file in vi, delete all lines, and paste in the raw lines from GitHub.

Conclusion

Some maintenance is needed to ensure openshift origin nodes are not a security liability. These steps reduce the number of security issues that need to be dealt with when securing apps in the Openshift Origin cluster.