Solving Small Files Problem on CDH4

This morning when I opened my Cloudera Manager, it showed the NameNode server as ‘Concerning’, with a message like ‘The DataNode has xxx blocks. Warning threshold: 200,000 block(s).’.
Some googling suggested that there were simply too many files on HDFS: the default block size on my CDH4 is 128MB, and every file, no matter how small, occupies at least one block, so even a 1-byte file adds another block to the count the NameNode has to track.
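To see the actual totals behind the warning, hdfs getconf can print the configured block size and hdfs fsck prints a block summary (a quick sanity check; the exact output format varies a little between CDH releases):

    # configured block size, in bytes (134217728 = 128MB)
    hdfs getconf -confKey dfs.blocksize

    # filesystem summary, including "Total blocks (validated)"
    hdfs fsck / | tail -n 20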

Then I ran hdfs dfs -count to find the number of files in each directory on HDFS: about 70k files under /user/hdfs/.staging and another 170k under a folder used by Flume-NG.
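For reference, the count command prints the directory count, file count, content size, and path for each argument, so per-directory file counts are easy to eyeball (the Flume path below is a placeholder, not the real one from this cluster):

    # columns: DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME
    hdfs dfs -count /user/hdfs/.staging /flume/syslog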

I’m collecting logs with Flume-NG on CDH4 and analysing them with Hive: events come in from syslog and are sunk to both HDFS and MySQL (Infobright). The HDFS part of the configuration looks roughly like the sketch below.
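This is a minimal sketch assuming a syslog UDP source and a memory channel; the agent, channel, and sink names, the port, the HDFS path, and the roll settings are illustrative placeholders, not the exact values from this setup:

    agent.sources  = syslog-src
    agent.channels = mem-ch
    agent.sinks    = hdfs-sink

    # syslog UDP source (port is a placeholder)
    agent.sources.syslog-src.type     = syslogudp
    agent.sources.syslog-src.host     = 0.0.0.0
    agent.sources.syslog-src.port     = 5140
    agent.sources.syslog-src.channels = mem-ch

    agent.channels.mem-ch.type     = memory
    agent.channels.mem-ch.capacity = 10000

    # HDFS sink (path is a placeholder)
    agent.sinks.hdfs-sink.type          = hdfs
    agent.sinks.hdfs-sink.channel       = mem-ch
    agent.sinks.hdfs-sink.hdfs.path     = /flume/syslog/%Y%m%d
    agent.sinks.hdfs-sink.hdfs.fileType = DataStream
    # rolling on a short interval/count with rollSize = 0 closes a new
    # file on every roll; this is where the flood of small files comes from
    agent.sinks.hdfs-sink.hdfs.rollInterval = 30
    agent.sinks.hdfs-sink.hdfs.rollSize     = 0
    agent.sinks.hdfs-sink.hdfs.rollCount    = 100

With settings like these, each roll produces a file far smaller than one 128MB block, so raising rollInterval/rollCount, or rolling on size alone with a rollSize close to the block size, is the usual first fix.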

