Posts

Showing posts from January, 2015

Processing CSV Files using Hive / Hadoop / HDFS

Background

When there is a need to process large CSV files, Apache Hive is a good option, since it allows us to query these files directly. I will try to describe my recent experience using Apache Hive. In this case I needed to group the rows and count the rows in each group. I will compare it with my existing systems using a MySQL database, one built with PHP and the other built with a combination of Pentaho and PHP.

Installation & Configuration

The Hadoop-Hive-HDFS ecosystem has many components:
- HDFS: the NameNode service and DataNode services
- MapReduce: the ResourceManager service and NodeManager services
- Hive
- ZooKeeper

Each component has its own configuration file (or files) and its own log files. For simplicity, in my opinion, nothing beats the Apache-MySQL-PHP stack; minus points for Hadoop-Hive-HDFS from a complexity standpoint. I think we need an additional management layer to be able to cope with the complexity, maybe something like Cloudera Manager or Apache Ambari, whi…
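The group-and-count workload described above can be sketched in HiveQL. This is a minimal sketch under stated assumptions: the table name, column layout, and HDFS path are hypothetical, not from the original post.

```sql
-- Map an external table over CSV files already sitting in HDFS.
-- Table name, columns, and the LOCATION path are illustrative assumptions.
CREATE EXTERNAL TABLE sales_csv (
  region STRING,
  amount INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/hive/input/sales';

-- Group the rows and count the rows in each group;
-- Hive compiles this into a MapReduce job over the CSV files.
SELECT region, COUNT(*) AS row_count
FROM sales_csv
GROUP BY region;
```

Because the table is EXTERNAL, dropping it leaves the CSV files in HDFS untouched, which is convenient when the same files are also consumed by other tools.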

Openshift Log Aggregation And Analysis using Splunk

Splunk is one of the popular tools we use to analyze log files. In this post I will describe how to configure an OpenShift cluster to send all of the platform log files (note that this excludes gear log files) to Splunk.

Configure Splunk to listen on a TCP port

From the Splunk web console home, choose 'Add Data', 'monitor', 'TCP/UDP', fill in port 10514 (TCP), click 'Next', and select the sourcetype Operating System - linux_messages_syslog.

Configure Rsyslog Forwarding

These steps should be done on every OpenShift node, broker, and console. As root, create an /etc/rsyslog.d/forward.conf file as follows (change splunkserver to your Splunk server IP; @@ means TCP, while a single @ means UDP):

$WorkDirectory /var/lib/rsyslog   # where to place spool files
$ActionQueueFileName fwdRule1     # unique name prefix for spool files
$ActionQueueMaxDiskSpace 1g       # 1gb space limit (use as much as possible)
$ActionQueueSaveOnShutdown on     # save messages to disk on shutdown
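The excerpt cuts off mid-file. Based on the standard rsyslog forwarding template (not the original post), the file typically continues with the queue type, a retry policy, and the forwarding rule itself; splunkserver and port 10514 are assumed from the Splunk setup above:

$ActionQueueType LinkedList   # run the action asynchronously
$ActionResumeRetryCount -1    # infinite retries if the host is down
*.* @@splunkserver:10514      # forward all messages to Splunk over TCP

The queue directives matter here: without them, rsyslog would drop messages whenever the Splunk server is unreachable instead of spooling them to disk and retrying.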