Solving Small Files Problem on CDH4

This morning when I opened my Cloudera Manager, it showed the NameNode server as ‘Concerning’, with a message like ‘The DataNode has xxx blocks. Warning threshold: 200,000 block(s).’
Some googling suggested there were too many files on HDFS: the default block size on my CDH4 is 128MB, and even a 1-byte file still occupies a block of its own, so the block count grows with the number of files rather than with their total size.

Then I ran hdfs dfs -count to find out the number of files in each directory on HDFS: about 70k files under /user/hdfs/.staging and 170k under the folder used by Flume-NG.
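For reference, hdfs dfs -count prints DIR_COUNT, FILE_COUNT, CONTENT_SIZE and PATHNAME for each path given, so a command like the following (the Flume directory here is just a placeholder) shows where the files pile up:

    hdfs dfs -count /user/hdfs/.staging /flume/syslog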

I’m collecting logs with Flume-NG on CDH4 and analysing them with Hive: events come in from syslog and are sunk to both HDFS and MySQL (Infobright). The HDFS part of the configuration looks like this:
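(The exact snippet isn’t reproduced here; a minimal sketch of an HDFS sink matching the behaviour described next, assuming an agent named a1, a sink named k1 and a placeholder HDFS path:)

    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://namenode/flume/syslog/%Y%m%d
    a1.sinks.k1.hdfs.filePrefix = FlumeData
    a1.sinks.k1.hdfs.fileType = CompressedStream
    a1.sinks.k1.hdfs.codeC = lz4
    # roll a new file every 30 seconds or every 1,000 events, whichever comes first;
    # rollSize = 0 disables size-based rolling
    a1.sinks.k1.hdfs.rollInterval = 30
    a1.sinks.k1.hdfs.rollCount = 1000
    a1.sinks.k1.hdfs.rollSize = 0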

which means that every 30 seconds the in-progress .tmp file is rolled over to FlumeData.%TIMESTAMP%.lz4, or, whenever the current file reaches 1,000 lines, a new file is started.
Every day, about 10 million lines of logs are generated and sent to this cluster, which means roughly 8,000 – 10,000 new small (<= 50KB) files written to my HDFS each day. This is also why my daily Hive SELECT SUM() takes such a long time to run. Time to merge files!
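The exact HiveQL isn’t shown here; a sketch of the idea, using an illustrative temp table named ad_show_tmp with the same schema as ext_ad_show:

    -- step 1: rewrite the data uncompressed into a temp table; with compression
    -- off, Hive's merge step (hive.merge.mapfiles / hive.merge.mapredfiles,
    -- merge target hive.merge.size.per.task, 256MB by default) combines the
    -- thousands of tiny lz4 files into a few large ones
    SET hive.exec.compress.output=false;
    INSERT OVERWRITE TABLE ad_show_tmp
    SELECT * FROM ext_ad_show;

    -- step 2: write the merged result back, compressed this time
    SET hive.exec.compress.output=true;
    INSERT OVERWRITE TABLE ext_ad_show
    SELECT * FROM ad_show_tmp;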

I don't know how others use Hive, but this approach works for me. You might notice that I turn off hive.exec.compress.output before inserting into the temp table, then turn it back on afterwards. I tried keeping it on, but nothing changed in /ad/show: no merge, no split. With the approach above, the temp table's folder contains only 4 files after the first INSERT, at an average size of 256MB; after the INSERT back into ext_ad_show, there are still 4 files, but compressed down to about 80MB each, much better than before.