Wednesday, August 5, 2009

Packing all-in-one JAR for Hadoop (HadoopJar)

Hadoop allows us to pack our code into a jar file and run it with "hadoop jar mycode.jar". However, if our code depends on other jars (as most non-trivial code does), distributing those dependency jars becomes a problem.

Here I introduce the approach I use myself: pack our own code and the dependency jars into one jar. This all-in-one jar can be run with "hadoop jar my-all-in-one.jar", and all dependencies are resolved without any further setup. I call this solution HadoopJar.

I use an Ant target to do this. The XML snippet is as follows:
<target name="hadoop-jar" depends="compile" description="Create binary distribution">
  <!-- First, copy all dependency jars into build/lib, where build is the root folder of the future jar. -->
  <copy todir="${path.build.classes}/lib">
    <fileset dir="lib">
      <include name="**/*.jar" />
      <!-- Exclude hadoop-*-core.jar because it is already on the Hadoop classpath. -->
      <exclude name="**/hadoop-*-core.jar" />
    </fileset>
  </copy>

  <!-- Join the names of all dependency jars into one string that can be used as a Class-Path value. -->
  <pathconvert property="hadoop-jar.classpath" pathsep=" ">
    <regexpmapper from="^(.*)/lib/(.*\.jar)$" to="lib/\2" />
    <path>
      <fileset dir="${path.build.classes}/lib">
        <include name="**/*.jar" />
      </fileset>
    </path>
  </pathconvert>

  <!-- Generate a manifest file that contains the Class-Path string built above. -->
  <manifest file="MANIFEST.MF">
    <attribute name="Class-Path" value="${hadoop-jar.classpath}" />
    <!-- Set a default entry point. -->
    <attribute name="Main-Class" value="org.nogroup.Main" />
  </manifest>

  <!-- Pack everything into one HadoopJar. -->
  <jar basedir="${path.build.classes}" manifest="MANIFEST.MF" jarfile="${path.build}/learning-hadoop.jar">
    <include name="**/*.class" />
    <include name="**/*.jar" />
  </jar>

  <!-- Delete the temporary lib folder and the manifest file. -->
  <delete dir="${path.build.classes}/lib" />
  <delete file="MANIFEST.MF" />

</target>
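
The hadoop-jar target refers to the properties path.build and path.build.classes and to a compile target, none of which are shown in this post. The following is only a sketch of what they might look like. The folder names (src, build, build/classes), the path.src property and the compile.classpath id are assumptions made for illustration, not the original build file.

```xml
<!-- Assumed project layout (not from the original post): sources in src/, third-party jars in lib/ -->
<property name="path.src" value="src" />
<property name="path.build" value="build" />
<property name="path.build.classes" value="${path.build}/classes" />

<!-- Compile-time classpath: every jar under lib/, including the Hadoop core jar -->
<path id="compile.classpath">
  <fileset dir="lib">
    <include name="**/*.jar" />
  </fileset>
</path>

<!-- Hypothetical compile target that hadoop-jar depends on -->
<target name="compile" description="Compile sources into the build folder">
  <mkdir dir="${path.build.classes}" />
  <javac srcdir="${path.src}" destdir="${path.build.classes}"
         classpathref="compile.classpath" includeantruntime="false" />
</target>
```

With definitions along these lines placed in the same build.xml, running "ant hadoop-jar" would compile the code first and then build the all-in-one jar.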


We are done :). I tried this on our 6-machine hadoop-0.15.0 cluster. It also works with later versions of Hadoop, including Hadoop in local mode.
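
To try the result straight from Ant, one option is a small wrapper target around the "hadoop jar" command. This is a sketch only: the target name run-hadoop-jar is made up for illustration, and it assumes the hadoop launcher script is on the PATH of the machine running Ant.

```xml
<!-- Hypothetical convenience target: build the all-in-one jar, then submit it with "hadoop jar" -->
<target name="run-hadoop-jar" depends="hadoop-jar" description="Run the HadoopJar">
  <!-- Assumes the "hadoop" script is on the PATH; failonerror stops the build if the job fails -->
  <exec executable="hadoop" failonerror="true">
    <arg value="jar" />
    <arg value="${path.build}/learning-hadoop.jar" />
  </exec>
</target>
```

Because no class name is passed on the command line, the entry point comes from the Main-Class attribute written into the manifest.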

Hope this helps.

P.S. Dependency jars are also called third-party jars or third-party libraries.
I also wrote a Chinese version here.
