Sunday, January 31, 2016

Long running process in Linux using PHP

Background

To do stuff, I usually create web-based applications written in PHP. Sometimes we need to run something that takes a long time, far longer than the 10 second psychological limit for web pages.
A bit of googling on Stack Overflow found this question: http://stackoverflow.com/questions/2212635/best-way-to-manage-long-running-php-script, but I will tell a similar story with a different solution. One of the long running tasks that needs to be run is a Pentaho data integration transformation.

Difficulties in long running PHP scripts

I encountered some problems when trying to make PHP do long running tasks :
  1. PHP script timeout. This could be solved by running set_time_limit(0); before the long running tasks.
  2. Memory leaks. The framework I normally use has a few memory issues; this can be solved either by patching the framework (ok, it is a bit difficult to do, but I did something similar in the past) or by splitting the data to process into several batches. And if you are going to loop over the batches in one PHP run, make sure that after each batch there are no dangling references to the objects processed (see the sketch after this list).
  3. Browser disconnects in an Apache-PHP environment would terminate the PHP script. During my explorations I found that :
    1. Some firewalls disconnect an HTTP connection after 60 seconds.
    2. Firefox has a long timeout (300 seconds or so, ref here).
    3. Chrome has a timeout similar to Firefox's (about 300 seconds, ref here), and a longer one for AJAX (a stackoverflow ref reports no timeout even after 15 hours).
  4. Difficulty in running Pentaho transformations, because the PHP module runs as www-data and is unable to access the Kettle repository stored in another user's home directory.
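For point 2 above, here is a minimal sketch of the batching idea, assuming a hypothetical big_table and a processRow() worker function (both names are mine, not from any framework) :

<?php
// Batch processing sketch: handle rows in fixed-size chunks so memory usage stays bounded.
set_time_limit(0);
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret'); // assumed credentials
$batchSize = 1000;
$offset = 0;
do {
    $sql   = sprintf('SELECT * FROM big_table ORDER BY id LIMIT %d OFFSET %d', $batchSize, $offset);
    $rows  = $pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);
    $count = count($rows);
    foreach ($rows as $row) {
        processRow($row);      // hypothetical per-row worker
    }
    $offset += $batchSize;
    unset($rows);              // drop dangling references after each batch
    gc_collect_cycles();       // reclaim cyclic garbage between batches
} while ($count === $batchSize);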

Workarounds

I have experience using these workarounds to force PHP to serve long running web pages :
  • Workaround 1 : use set_time_limit(0); and ignore_user_abort(true); to ensure the script keeps running even after the client disconnects (a minimal sketch follows this list). Unfortunately the user will no longer see the result of our script.
  • Workaround 2 : use HTTPS so the firewall is unable to do layer 7 processing and doesn't dare disconnect the connection. If the user closes the browser the script will still terminate, unless you also apply workaround 1.
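A minimal sketch of workaround 1, with doLongRunningTask() standing in for the actual job :

<?php
// Workaround 1: keep running even after the browser disconnects.
set_time_limit(0);          // remove PHP's execution time limit
ignore_user_abort(true);    // do not abort the script when the client goes away
echo "Task started, you may close this page.\n";
flush();                    // push the message out before the long work begins
doLongRunningTask();        // hypothetical function representing the long job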
I haven't tried detaching a child process yet, but my other solutions involve a separate process for background processing with similar benefits.

Solution A - Polling task tables using cron

It is better to separate the user interface part (the PHP web script) from the background processing part. My first solution is to create a cron task that runs every 3 minutes, executing a PHP CLI script which checks a background task table for tasks with the state 'SUBMITTED'. Upon picking up a task, the script updates its state to 'PROCESSING'.
So the user interface / front end only touches the background task table: when the user gives the order, it inserts a task there with the specification required by the task, setting the state to 'SUBMITTED'.
When cron runs the PHP CLI script, it checks for tasks and, if there are any, changes the first task's state to PROCESSING and begins processing. When processing is complete, the PHP CLI script changes the state to COMPLETED.
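A rough sketch of such a CLI worker, assuming a background_task table with id, state and parameter columns (the table layout and runBackgroundTask() are my own illustrations) :

<?php
// worker.php - run from cron, e.g. */3 * * * * php /path/to/worker.php
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret'); // assumed credentials
$task = $pdo->query("SELECT id, parameter FROM background_task WHERE state = 'SUBMITTED' ORDER BY id LIMIT 1")
            ->fetch(PDO::FETCH_ASSOC);
if ($task === false) {
    exit(0); // nothing to do on this run
}
$update = $pdo->prepare("UPDATE background_task SET state = ? WHERE id = ?");
$update->execute(['PROCESSING', $task['id']]);
runBackgroundTask($task['parameter']); // hypothetical function, e.g. invoking a Pentaho transformation
$update->execute(['COMPLETED', $task['id']]);

In a real setup the SELECT and the first UPDATE should be guarded by a lock or done atomically, so two overlapping cron runs never pick up the same task.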
Complications happen, so we will need to do some risk management by :
  1. logging the phases of the process in a database table, including any warnings issued during processing.
  2. recording error rows, if there are any, in another database table, so the user can view the problematic rows.
Currently this solution works, but recently I came across another solution that might be a better fit for running a Linux process.

Solution B - Using inotifywait and control files

In this solution, I created a control file which contains only one line of CSV. I prepared a PHP CLI script which parses the CSV and executes a long running process, and a PHP web page which writes to the control file. inotifywait from inotify-tools listens for file system notifications from the Linux kernel related to changes to the control file.
The scenario is like this :
  1. The user opens the PHP web page, chooses parameters for the background task, and clicks Submit
  2. The PHP web page receives the submitted parameters and writes them into the control file, including a job id. The user receives a page that states 'task submitted'.
  3. A shell script running inotifywait waits for notifications on the control file, specifically the close_write event (a sketch follows this list)
  4. After the close_write event is received, the shell script continues and runs the PHP CLI script to do the background processing
  5. The PHP CLI script reads the control file for the parameters and job id
  6. The PHP CLI script executes the Linux process, redirecting the output to a file identified by the job id in a specific directory
  7. The web page that states 'task submitted' could periodically poll the output file using the job id and show the output to the end user (OK, this one I still need to actually try)
  8. The PHP CLI script returns, and the shell script loops forever by going back to (3)
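A sketch of the listening shell script, assuming the control file lives at /home/worker/control.csv and the CLI script at /home/worker/run_task.php (both paths are illustrative) :

#!/bin/sh
# Wait until the web page finishes writing the control file, then run the PHP CLI worker.
while true; do
    inotifywait -e close_write /home/worker/control.csv
    php /home/worker/run_task.php /home/worker/control.csv
done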

Conclusion

By using Linux file system notifications, we can trigger task execution with parameters specified from a PHP web page. The task can be run as another Linux user, provided that user is the one running the shell script. Data sanitization is done by PHP, so no strange commands can be passed to the background task.
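That sanitization boils down to the CLI script only ever passing escaped values to the shell. A sketch of steps 5 to 7, with a placeholder command and made-up parameter names :

<?php
// run_task.php - read the one-line CSV control file, then run the long process (steps 5-7).
list($jobId, $param) = str_getcsv(trim(file_get_contents($argv[1])));
$outputFile = '/var/spool/jobs/' . basename($jobId) . '.log'; // output file identified by job id
$cmd = '/usr/local/bin/long_task '            // placeholder for the real command, e.g. Pentaho's pan.sh
     . escapeshellarg($param)
     . ' > ' . escapeshellarg($outputFile) . ' 2>&1';
exec($cmd, $ignored, $exitCode);              // run the Linux process and capture its exit code
exit($exitCode);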

These solutions are built entirely from open source components. I saw that Azure has WebJobs, which might fulfill similar requirements, but it lives on the Azure platform, which I have never used.

Hack : Monitoring CPU usage from Your Mobile

Background

Sometimes I need to run a long-running background process on the server, and I need to know when the CPU usage returns to (almost) 0, indicating the process has finished. I know there are other options, like sending myself an email when the process finishes, but currently I am satisfied with monitoring the CPU usage.

The old way

I have an Android cellphone, which allows me to :

  1. Launch ConnectBot, type the ssh username and password, and connect to the server
  2. Type top
  3. Watch the top output

The new way

Because I am more familiar with PHP than with anything else right now (ok, there are times I am more familiar with C Sharp, but that is another story), I did a quick Google search for 'php cpu usage' and found http://stackoverflow.com/questions/13131003/get-cpu-percent-usage-in-php. Using stix's solution I created a simple JSON web service in PHP :
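A minimal sketch of such a service, based on sampling /proc/stat twice; the one-second interval and the JSON field names are my own choices, not necessarily what the linked answer uses :

<?php
// cpu.php - report total and sys CPU usage (percent) as JSON from two /proc/stat samples.
function readCpuLine() {
    $line   = explode("\n", file_get_contents('/proc/stat'))[0];
    $fields = preg_split('/\s+/', trim($line));
    array_shift($fields);                 // drop the leading "cpu" label
    return array_map('intval', $fields);  // user, nice, system, idle, iowait, ...
}
$a = readCpuLine();
sleep(1);                                 // sample again one second later
$b = readCpuLine();
$total = array_sum($b) - array_sum($a);
$idle  = $b[3] - $a[3];
$sys   = $b[2] - $a[2];
header('Content-Type: application/json');
echo json_encode(array(
    'total' => round(100 * ($total - $idle) / $total, 2), // overall CPU usage
    'sys'   => round(100 * $sys / $total, 2),             // kernel (system) share
));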


For displaying the CPU usage as a graph, another Google search pointed me to Flot, a JavaScript library that lets us draw (plot) simple charts. A tutorial shows how to draw a CPU chart similar to the Windows one : http://www.jqueryflottutorial.com/how-to-make-jquery-flot-realtime-update-chart.html.
The principle is to use AJAX to periodically call the PHP JSON web service to get CPU usage statistics.

I adapted the code to add sys CPU usage in addition to total CPU usage.
The source code is shown below :
Put the HTML and PHP files in a folder on the server, unzip Flot's files into the same directory, and you're good to go.

Conclusion

Using Flot and PHP we can monitor CPU usage remotely, and their compatibility with mobile browsers allows us to use our mobile devices to watch the server's CPU usage.


Saturday, January 2, 2016

Installing MariaDB and TokuDB in Ubuntu Trusty

Background

In this post I will tell the story of installing MariaDB on Ubuntu Trusty, and the process I went through to enable the TokuDB engine. I need to experiment with the engine as an alternative to the Archive engine for storing compressed table rows. It has better performance than compressed InnoDB tables (row_format=compressed), and it was recommended in some blog posts (this post and this one).

Packages for Ubuntu Trusty

In order to be able to use TokuDB, I checked the documentation and found that Ubuntu 12.10 and newer on the 64-bit platform requires the mariadb-tokudb-engine-5.5 package. Despite the existence of mariadb-5.5 packages, I found no package containing the tokudb keyword in the official Ubuntu Trusty repositories. The MariaDB 5.5 server package also doesn't contain ha_tokudb.so (see the file list).

The solution is to use the repository from this online wizard.
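The commands generated by the wizard look roughly like the ones below; the mirror URL here is a placeholder, so use the exact key ID and repository line that the wizard prints for Trusty :

sudo apt-get install software-properties-common
sudo apt-key adv --recv-keys --keyserver hkp://keyserver.ubuntu.com:80 0xcbcb082a1bb943db
sudo add-apt-repository 'deb [arch=amd64,i386] http://mirror.example.com/mariadb/repo/10.1/ubuntu trusty main'
sudo apt-get update
sudo apt-get install mariadb-server-10.1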

After installing mariadb-server-10.1, we have many storage engines available, TokuDB and Cassandra being the more interesting ones.

Preparation - disable Hugepages

Kernel transparent hugepages are not compatible with the TokuDB engine. I disabled them by adding a few lines to /etc/rc.local :

if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
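   # TokuDB refuses to initialize while transparent hugepages are enabled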
   echo never > /sys/kernel/mm/transparent_hugepage/enabled
fi
if test -f /sys/kernel/mm/transparent_hugepage/defrag; then
   echo never > /sys/kernel/mm/transparent_hugepage/defrag
fi


Enabling TokuDB


I enabled TokuDB by running this command at MariaDB's SQL prompt as root:

INSTALL SONAME 'ha_tokudb';

In retrospect, maybe I was supposed to uncomment the plugin load line in /etc/mysql/conf.d/tokudb.conf instead.
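From memory, the relevant part of that file looks roughly like this once uncommented (check the file shipped by your package before editing) :

[mariadb]
plugin-load-add = ha_tokudb.so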

Using TokuDB

Having enabled TokuDB, check with SHOW ENGINES :

Cool.
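If you prefer a query to eyeballing the full engine list, checking just the TokuDB row of information_schema should give the same answer :

SELECT ENGINE, SUPPORT, COMMENT FROM information_schema.ENGINES WHERE ENGINE = 'TokuDB';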

The syntax to use it from MariaDB is a bit different from the Percona or Tokutek distributions :

CREATE TABLE xxx (columns .., PRIMARY KEY pk_name(pk_field1,pk_field2..)) ENGINE = TokuDB COMPRESSION=TOKUDB_SNAPPY;

We can also transform an existing InnoDB table (or any other kind of table) into a TokuDB table, but beware that this will recreate the entire table in the TokuDB engine :

ALTER TABLE xxx ENGINE =TokuDB COMPRESSION=TOKUDB_SNAPPY;

There are two ways of optimizing TokuDB tables; the first is to do light 'maintenance' :

OPTIMIZE TABLE xxx;

But if you want to free some space you need to recreate (reorg?) the table :

ALTER TABLE xxx ENGINE =TokuDB COMPRESSION=TOKUDB_SNAPPY;

The compression options (refer here but beware of syntax differences) are as follows : 
  • tokudb_default, tokudb_zlib: compress using the zlib library, medium CPU usage and compression ratio
  • tokudb_fast, tokudb_quicklz: use the quicklz library, the lightest compression with low CPU usage
  • tokudb_small, tokudb_lzma: use the lzma library, the highest compression and highest CPU usage
  • tokudb_uncompressed: no compression is used.
  • tokudb_snappy: compress using Google's snappy algorithm, reasonable compression and fast performance.

Caveats

  • Currently I still cannot enable InnoDB/XtraDB page level compression
  • Syntax differences confused me at times; some information that is not clear on MariaDB's website can be found on Percona's website.
  • Xtrabackup doesn't work for TokuDB tables; you need plain mysqldump or mydumper to back up TokuDB tables
  • The mydumper package in the Ubuntu Trusty repository doesn't work with MariaDB 10.1
  • I am still unable to compile a recent mydumper version with the Ubuntu Trusty - MariaDB 10.1 combination