Project: Merge small files on HDFS for Hive table

Introduction

GitHub: https://github.com/sskaje/hive_merge

This is a solution to the small files problem on HDFS, for Hive tables only. Here is why I wrote this project: Solving Small Files Problem on CDH4. This script simply INSERTs the requested table/partition into a new table and lets the data be merged by …

Solving Small Files Problem on CDH4

This morning when I opened Cloudera Manager, it showed the NameNode server as ‘Concerning’, with a message like ‘The DataNode has xxx blocks. Warning threshold: 200,000 block(s).’ I googled this, and the results suggested that there might be too many files on HDFS. As the DataNode’s default block size is 128MB on my CDH4, a single …