Discussion:
Native (GZIP) decompress not faster than builtin
Jens Riboe
2009-05-10 09:03:21 UTC
Hi,

During the past week I decided to use native decompress for a Hadoop job
(using 0.20.0). But before implementing it, I decided to write a small
benchmark just to understand how much faster it actually was. The result
came as a surprise:

May 6, 2009 10:56:47 PM org.apache.hadoop.util.NativeCodeLoader <clinit>
INFO: Loaded the native-hadoop library
May 6, 2009 10:56:47 PM org.apache.hadoop.io.compress.zlib.ZlibFactory <clinit>
INFO: Successfully loaded & initialized native-zlib library
May 6, 2009 10:56:47 PM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Time of Hadoop  decompressor running 'small' job = 0:00:01.684 (1.684 ms/file)
Time of Hadoop  decompressor running 'large' job = 0:00:10.074 (1007.400 ms/file)
Time of Vanilla decompressor running 'small' job = 0:00:01.340 (1.340 ms/file)
Time of Vanilla decompressor running 'large' job = 0:00:10.094 (1009.400 ms/file)
Hadoop vs. Vanilla [small]: 125.67%
Hadoop vs. Vanilla [large]: 99.80%

For a small file, Hadoop native decompress takes 25% longer to run than
Java's built-in GZIPInputStream, and for a few-megabyte file the speed
difference is negligible.

I wrote a blog post about it, which also contains the full source code
of the benchmark:
http://blog.ribomation.com/2009/05/07/comparison-of-decompress-ways-in-hadoop/
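
For reference, here is a minimal sketch of the kind of comparison the
benchmark makes (the class name, the round count, and the drain() helper
are illustrative, not the actual code from the blog post):

import java.io.*;
import java.util.zip.GZIPInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.Decompressor;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class DecompressBench {
    // Read a stream to EOF, returning the number of uncompressed bytes.
    static long drain(InputStream in) throws IOException {
        byte[] buf = new byte[64 * 1024];
        long total = 0;
        for (int n; (n = in.read(buf)) > 0; ) total += n;
        return total;
    }

    public static void main(String[] args) throws IOException {
        File gz = new File(args[0]);  // a .gz file, decompressed repeatedly
        int rounds = 1000;

        // Hadoop path: GzipCodec (native zlib if loaded) plus a pooled
        // decompressor.
        Configuration conf = new Configuration();
        CompressionCodec codec =
            ReflectionUtils.newInstance(GzipCodec.class, conf);
        long t0 = System.currentTimeMillis();
        for (int i = 0; i < rounds; i++) {
            Decompressor d = CodecPool.getDecompressor(codec);
            InputStream in =
                codec.createInputStream(new FileInputStream(gz), d);
            try {
                drain(in);
            } finally {
                in.close();
                CodecPool.returnDecompressor(d);
            }
        }
        long hadoopMs = System.currentTimeMillis() - t0;

        // Vanilla path: java.util.zip.GZIPInputStream.
        t0 = System.currentTimeMillis();
        for (int i = 0; i < rounds; i++) {
            InputStream in = new GZIPInputStream(new FileInputStream(gz));
            try {
                drain(in);
            } finally {
                in.close();
            }
        }
        long vanillaMs = System.currentTimeMillis() - t0;

        System.out.println("hadoop=" + hadoopMs + " ms, vanilla="
                           + vanillaMs + " ms");
    }
}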

My questions are:
[1] Am I missing some key information about how to correctly use native
GZIP decompression?
    I'm using codec pooling, by the way.

[2] Will native decompress only pay off for files larger than 100 MB or
1000 MB?
    In my application I'm reading many KB-sized .gz files from an
    external source, so I cannot change the compression method or the
    file size.

[3] Has anybody experienced something similar to my result?


Kind regards /jens
Stefan Podkowinski
2009-05-10 15:36:21 UTC
Jens,

As your test shows, using a native codec won't make much sense for
small files, since the JNI overhead involved will likely outweigh any
possible gains. With all the performance improvements in Java 5 and 6,
it's reasonable to ask whether the native implementation really
improves performance. I'd look at it as another option for squeezing
out some more performance if you really need to.
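
If you still want to chase the small-file case, one thing you could try
(an untested sketch; it borrows the codec and drain() names from your
benchmark, and assumes resetting the decompressor between streams is
enough) is to hold on to one pooled decompressor for a whole batch of
files instead of going through the pool per file:

// Assumes: codec and drain() as in the benchmark sketch above.
static void decompressBatch(CompressionCodec codec, File[] files)
        throws IOException {
    Decompressor d = CodecPool.getDecompressor(codec);
    try {
        for (File gz : files) {
            d.reset();  // reuse the (possibly native) decompressor
            InputStream in =
                codec.createInputStream(new FileInputStream(gz), d);
            try {
                drain(in);
            } finally {
                in.close();
            }
        }
    } finally {
        CodecPool.returnDecompressor(d);  // hand it back to the pool
    }
}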

- Stefan