Project: Merge small files on HDFS for Hive table
This is a solution to the small files problem on HDFS, but for Hive tables only.
Here is why I wrote this project: Solving Small Files Problem on CDH4.
This script simply INSERTs the requested table/partition into a new table, lets Hive merge the data itself, then INSERTs it back with compression.
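To make the idea concrete, here is a minimal sketch of the kind of HiveQL such a round trip could issue. It is not the project's actual script: the helper name `merge_statements`, the temporary table suffix, the example table `weblogs` and the 128MB threshold are all assumptions for illustration, and it only covers an unpartitioned table.

```python
def merge_statements(table, tmp_table=None):
    """Build HiveQL for a copy-out / copy-back merge of an unpartitioned table.

    Hypothetical helper: names and values here are illustrative only.
    """
    tmp = tmp_table or f"{table}_merge_tmp"
    return [
        # Ask Hive to merge small output files at the end of each job.
        "SET hive.merge.mapfiles=true",
        "SET hive.merge.mapredfiles=true",
        # Merge when the average output file is below ~128MB (assumed value).
        "SET hive.merge.smallfiles.avgsize=134217728",
        # Copy the data out; Hive writes fewer, larger files.
        f"CREATE TABLE {tmp} LIKE {table}",
        f"INSERT OVERWRITE TABLE {tmp} SELECT * FROM {table}",
        # Copy it back, this time with compressed output.
        "SET hive.exec.compress.output=true",
        f"INSERT OVERWRITE TABLE {table} SELECT * FROM {tmp}",
        f"DROP TABLE {tmp}",
    ]


if __name__ == "__main__":
    # Print a script that could be reviewed and then run with `hive -f`.
    print(";\n".join(merge_statements("weblogs")) + ";")
```

Generating the statements as text keeps the sketch easy to inspect before anything touches real data.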
Continue reading “Project: Merge small files on HDFS for Hive table” »
This morning when I opened Cloudera Manager, it showed the NameNode server as ‘Concerning’, with a message like ‘The DataNode has xxx blocks. Warning threshold: 200,000 block(s).’.
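I googled this and found that the likely cause is too many small files on HDFS: the default block size on my CDH4 is 128MB, and every file, even one holding a single byte, takes at least one block. Such a block does not use 128MB of disk, but it still counts toward the DataNode’s block total and costs NameNode memory.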
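The block count, not the bytes stored, is what the warning tracks. A small calculation shows how quickly it grows; the file sizes below are made up for illustration, and replication is ignored.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128MB, the block size mentioned above


def blocks_for(file_size, block_size=BLOCK_SIZE):
    """Number of HDFS blocks one file needs (before replication)."""
    # A non-empty file always needs at least one block, however small it is.
    return math.ceil(file_size / block_size) if file_size > 0 else 0


def total_blocks(file_sizes, block_size=BLOCK_SIZE):
    """Total blocks when every file is stored separately."""
    return sum(blocks_for(size, block_size) for size in file_sizes)


# Hypothetical workload: 170k files of 1KB each.
small_files = [1024] * 170_000
separate = total_blocks(small_files)    # one block per file
merged = blocks_for(sum(small_files))   # same bytes in a single file
print(separate, merged)                 # 170000 2
```

The same 174MB or so of data costs 170,000 blocks as separate files but only 2 once merged.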
Then I ran hdfs dfs -count to find the number of files in each directory on HDFS: about 70k files under /user/hdfs/.staging and 170k under a folder used by Flume-NG.
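When there are many directories to check, the output of hdfs dfs -count can be ranked with a few lines of Python. The command prints four columns per path (directory count, file count, content size in bytes, path name). The sample text below is invented for illustration, including the /flume/logs path and all byte sizes.

```python
def parse_count_output(text):
    """Parse `hdfs dfs -count` output and sort by file count, largest first.

    Each line has the columns: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 3)  # path is the 4th field and may contain spaces
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        dirs, files, size, path = parts
        rows.append((path.strip(), int(dirs), int(files), int(size)))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical sample output; real numbers would come from the cluster.
sample = """\
           1        70000          350000000 /user/hdfs/.staging
         120       170000         1740800000 /flume/logs
           3           12         9000000000 /user/hive/warehouse
"""

for path, dirs, files, size in parse_count_output(sample):
    avg = size // files if files else 0
    print(f"{files:>8} files  avg {avg:>10} bytes  {path}")
```

A large file count together with a small average size is the signature of the small files problem.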
I’m collecting logs from syslog with Flume-NG on CDH4, sinking them to HDFS and MySQL (Infobright), and trying to analyse them with Hive. The HDFS part of the configuration looks like:
Continue reading “Solving Small Files Problem on CDH4” »
Integer factorization: http://en.wikipedia.org/wiki/Integer_factorization
In number theory, integer factorization or prime factorization is the decomposition of a composite number into smaller non-trivial divisors, which when multiplied together equal the original integer.
Msieve is a C library implementing a suite of algorithms to factor large integers. It contains an implementation of the SIQS and GNFS algorithms; the latter has helped complete some of the largest public factorizations known.
Msieve supports CUDA, too!
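To illustrate what factorization means in practice, here is a small sketch using plain trial division. This is not what Msieve does (SIQS and GNFS are far more sophisticated); trial division is only practical for small numbers and is used here purely to show the definition at work.

```python
def factorize(n):
    """Return the prime factors of n (with multiplicity) by trial division."""
    if n < 2:
        raise ValueError("n must be an integer >= 2")
    factors = []
    d = 2
    # Any composite n has a divisor no larger than sqrt(n).
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1 if d == 2 else 2  # after 2, try odd candidates only
    if n > 1:
        factors.append(n)  # whatever remains is itself prime
    return factors


print(factorize(8051))  # [83, 97]
print(factorize(360))   # [2, 2, 2, 3, 3, 5]
```

Multiplying the returned factors together gives back the original integer, which is exactly the property in the definition above.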
Continue reading “Collections: Integer factorization” »
I was looking for articles/wikis on how to emulate ARM Linux (armel) on CentOS/Ubuntu, and found this one from MDN: https://developer.mozilla.org/en-US/docs/Developer_Guide/Virtual_ARM_Linux_environment.
That article uses an old Linaro release based on Ubuntu Natty, which can no longer be found on http://ports.ubuntu.com.
Ubuntu has said that armel will no longer be supported, which is why the latest Ubuntu release supporting armel has a code name beginning with ‘Q’.
I found another server release and a new nano image and tried them with similar commands; my notes are below:
Continue reading “Virtualized ARM on Ubuntu” »