Project: Merge small files on HDFS for Hive table
Introduction
GitHub: https://github.com/sskaje/hive_merge
This is a solution to the small-files problem on HDFS, but it works for Hive tables only.
Here is why I wrote this project: Solving Small Files Problem on CDH4.
The script simply INSERTs the requested table/partition into a new table, lets Hive merge the data itself, then INSERTs the data back with compression.
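The round trip described above can be sketched as a small Python function that only builds the HiveQL. This is an illustration, not the code in merge.py: the function name `build_merge_hql`, the `_merge_tmp` suffix, and the use of `CREATE TABLE ... LIKE` are assumptions of mine. It also assumes `partition_keys` lists every partition column of the table and that dynamic partitioning is enabled.

```python
def build_merge_hql(database, table, partition_keys=(), filters=None):
    """Build HiveQL for a copy-out / copy-back merge (hypothetical helper)."""
    filters = filters or {}
    source = "{}.{}".format(database, table)
    tmp = "{}.{}_merge_tmp".format(database, table)  # assumed temp-table name

    # Dynamic-partition clause: needed only when the table is partitioned.
    part = ""
    if partition_keys:
        part = " PARTITION ({})".format(", ".join(partition_keys))

    # Restrict the copy to the requested partition(s). Values are not escaped,
    # so this is unsuitable for untrusted input.
    where = ""
    if filters:
        conditions = ["{}='{}'".format(k, v) for k, v in filters.items()]
        where = " WHERE " + " AND ".join(conditions)

    return [
        "CREATE TABLE {} LIKE {}".format(tmp, source),
        # Step 1: copy out; Hive's merge settings combine the small files here.
        "INSERT OVERWRITE TABLE {}{} SELECT * FROM {}{}".format(tmp, part, source, where),
        # Step 2: copy back (with compression settings enabled, if requested).
        "INSERT OVERWRITE TABLE {}{} SELECT * FROM {}".format(source, part, tmp),
        "DROP TABLE {}".format(tmp),
    ]


for stmt in build_merge_hql("lecai_ad", "ext_ad_show",
                            ["entry_date"], {"entry_date": "2013-12-29"}):
    print(stmt + ";")
```

With dynamic partitions, an INSERT OVERWRITE replaces only the partitions present in the selected data, so copying a single partition back should leave the other partitions untouched.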
Configuration Properties in Hive
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    SET hive.exec.max.dynamic.partitions.pernode=10000;
    SET hive.exec.max.dynamic.partitions=100000;
    SET hive.exec.max.created.files=1000000;
    -- hive merge
    SET hive.merge.size.per.task=256000000;
    SET hive.merge.mapfiles=true;
    SET hive.merge.mapredfiles=true;
    SET hive.merge.smallfiles.avgsize=16000000;
    -- Following are for compression
    SET mapred.output.compress=true;
    SET mapred.output.compression.type=BLOCK;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.Lz4Codec;
    SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.Lz4Codec;
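To make the two merge thresholds concrete: Hive adds a merge stage when the average output file size of a job falls below hive.merge.smallfiles.avgsize, and hive.merge.size.per.task is the target size of the merged files. The Python 3 sketch below only does that arithmetic. The function name `merge_plan` is hypothetical, and the resulting file count is a rough estimate, since the real number depends on how Hive splits the input.

```python
import math


def merge_plan(file_sizes, avgsize=16000000, size_per_task=256000000):
    """Return (merge_triggered, estimated_file_count) for a list of file sizes in bytes."""
    if not file_sizes:
        return False, 0
    total = sum(file_sizes)
    average = total / len(file_sizes)
    # A merge stage is added only when the average file is smaller than avgsize.
    triggered = average < avgsize
    if not triggered:
        return False, len(file_sizes)
    # Rough estimate: merged files are about size_per_task bytes each.
    return True, max(1, math.ceil(total / size_per_task))


# 1,000 files of 1 MB each: the average is far below 16 MB, so a merge is triggered.
print(merge_plan([1000000] * 1000))  # (True, 4)
```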
Usage
    Hive Merge v0.1
    Author: sskaje (https://sskaje.me/)

    Error: database and table are required.

    Usage: python merge.py OPTIONS
    options:
        -h, --help                      Display this menu
        -D, --debug                     Debug mode, display HiveQL only
        -d, --database=database         Database name
        -t, --table=table               Table name
        -c, --compress                  Enable compression
        -C, --compress-codec=codec      Compression codec. lz4, gzip, bzip2,lzo, snappy, deflate(default).
        -p, --pk=partition_key          Partition key
        -P, --pv=partition_value        Partition value
        -S, --merge-size=merge_size     Merge size before compression, hive.merge.size.per.task, 256000000 by default
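One plausible way to turn the short codec names accepted by -C into the compression settings listed earlier is a lookup table. The mapping below is my assumption, not taken from merge.py. The class names are the standard Hadoop codecs, except LZO, which ships in the separate hadoop-lzo package and must be installed on the cluster.

```python
# Assumed mapping from the -C short names to Hadoop codec classes.
CODECS = {
    "lz4": "org.apache.hadoop.io.compress.Lz4Codec",
    "gzip": "org.apache.hadoop.io.compress.GzipCodec",
    "bzip2": "org.apache.hadoop.io.compress.BZip2Codec",
    "snappy": "org.apache.hadoop.io.compress.SnappyCodec",
    "deflate": "org.apache.hadoop.io.compress.DefaultCodec",
    # Not bundled with Hadoop; provided by the hadoop-lzo package.
    "lzo": "com.hadoop.compression.lzo.LzoCodec",
}


def compression_settings(codec="deflate"):
    """Return the SET statements for the given codec short name (hypothetical helper)."""
    if codec not in CODECS:
        raise ValueError("unknown codec: {}".format(codec))
    cls = CODECS[codec]
    return [
        "SET mapred.output.compress=true",
        "SET mapred.output.compression.type=BLOCK",
        "SET mapred.output.compression.codec=" + cls,
        "SET mapred.map.output.compression.codec=" + cls,
    ]


for stmt in compression_settings("lz4"):
    print(stmt + ";")
```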
Examples
    # Merge files in a table
    sudo -u hdfs python merge.py -d lecai_ad -t ext_ad_show

    # Merge files in a table, lz4 compressed
    sudo -u hdfs python merge.py -d lecai_ad -t ext_ad_show -c -C lz4

    # Merge files in a partition (entry_date='2013-12-29'), lz4 compressed
    sudo -u hdfs python merge.py -d lecai_ad -t ext_ad_show -p entry_date -P '2013-12-29' -c -C lz4

    # Merge files in a partition (entry_date='2013-12-29', type='1'), lz4 compressed
    sudo -u hdfs python merge.py -d lecai_ad -t ext_ad_show -p entry_date -P '2013-12-29' -p type -P 1 -c -C lz4
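The last example repeats -p and -P to address a multi-level partition. As a sketch of how such repeated options could be paired up, the snippet below uses the standard argparse module. The function name `parse_partitions` is hypothetical, and merge.py itself may parse its options differently.

```python
import argparse


def parse_partitions(argv):
    """Pair each -p key with the -P value at the same position (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="merge.py")
    parser.add_argument("-d", "--database", required=True)
    parser.add_argument("-t", "--table", required=True)
    # action="append" collects every occurrence of a repeated option, in order.
    parser.add_argument("-p", "--pk", action="append", default=[])
    parser.add_argument("-P", "--pv", action="append", default=[])
    args = parser.parse_args(argv)
    if len(args.pk) != len(args.pv):
        parser.error("each -p needs a matching -P")
    return list(zip(args.pk, args.pv))


pairs = parse_partitions([
    "-d", "lecai_ad", "-t", "ext_ad_show",
    "-p", "entry_date", "-P", "2013-12-29",
    "-p", "type", "-P", "1",
])
print(pairs)  # [('entry_date', '2013-12-29'), ('type', '1')]
```

Pairing by position keeps the command line flat, at the cost of requiring the keys and values to be given in matching order.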
Project: Merge small files on HDFS for Hive table by @sskaje: https://sskaje.me/2013/12/project-merge-small-files-hdfs-hive-table/