Project: Merge small files on HDFS for Hive table

Introduction

GitHub: https://github.com/sskaje/hive_merge

This is a solution to the small-files problem on HDFS, but for Hive tables only.

Here is why I wrote this project: see “Solving Small Files Problem on CDH4” below.

This script simply INSERTs the requested table/partition into a new table, lets the data be merged by Hive itself, then INSERTs it back with compression.
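
The core idea, as a minimal HiveQL sketch (the table name and codec below are illustrative assumptions, not the actual script; see the GitHub repo for the real thing):

    -- Compress the rewritten output (standard Hive/MR1 settings on CDH4).
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

    -- Stage the data into a temporary table; Hive rewrites it as fewer, larger files.
    CREATE TABLE logs_merged AS SELECT * FROM logs;

    -- Write it back compressed, replacing the original small files.
    INSERT OVERWRITE TABLE logs SELECT * FROM logs_merged;
    DROP TABLE logs_merged;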

Solving Small Files Problem on CDH4

This morning when I opened my Cloudera Manager, it showed the NameNode server as ‘Concerning’, with a message like ‘The DataNode has xxx blocks. Warning threshold: 200,000 block(s).’.
Googling suggested that there are probably too many files on HDFS: the default block size on my CDH4 is 128MB, and every file, even a 1-byte one, occupies at least one block of its own, so lots of small files inflate the block count.

Then I ran hdfs dfs -count to find the number of files in each directory on HDFS: about 70k files under /user/hdfs/.staging and 170k under a folder used by Flume-NG.
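
For reference, a quick sketch of that check (the Flume directory here is a hypothetical path):

    # Output columns: DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME
    hdfs dfs -count /user/hdfs/.staging /flume/syslog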

I’m collecting logs with Flume-NG on CDH4 and analysing them with Hive: events come in from syslog and are sent through sinks to both HDFS and MySQL (Infobright). The HDFS part of the configuration looks like:
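
(The original snippet is cut off in this excerpt; below is a generic sketch of a Flume-NG HDFS sink of that shape, with agent, channel, and path names assumed.)

    # Hypothetical agent/channel/path names; the hdfs.* keys are standard Flume-NG sink properties.
    agent.sinks.hdfs-sink.type = hdfs
    agent.sinks.hdfs-sink.channel = mem-channel
    agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode/flume/syslog/%Y%m%d
    agent.sinks.hdfs-sink.hdfs.fileType = DataStream
    # Rolling this frequently is exactly what produces piles of small files.
    agent.sinks.hdfs-sink.hdfs.rollInterval = 30
    agent.sinks.hdfs-sink.hdfs.rollSize = 0
    agent.sinks.hdfs-sink.hdfs.rollCount = 0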
