Processing CSV Files using Hive / Hadoop / HDFS
Background

When there is a need to process large CSV files, Apache Hive is a good option because it lets us query these files directly. I will try to describe my recent experience using Apache Hive. In this case I need to group the rows and count the rows in each group. I will compare it with my existing systems that use a MySQL database: one built with PHP and the other built with a combination of Pentaho and PHP.

Installation & Configuration

The Hadoop-Hive-HDFS ecosystem has many components:

HDFS: NameNode service, DataNode services
MapReduce (on YARN): ResourceManager service, NodeManager services
Hive
ZooKeeper

Each component has its own configuration file (or files) and its own log files. For simplicity, in my opinion, nothing beats the Apache-MySQL-PHP stack. Minus points for Hadoop-Hive-HDFS from a complexity standpoint. I think we need an additional management layer to be able to cope with the complexity, maybe something like Cloudera Manager or Apache Ambari, whi