Including external jars in a Hadoop job
One disadvantage of setting up a Hadoop development environment in Eclipse is that I have depended on Eclipse to take care of job submission, so I never had to worry about doing it by hand. I have been developing mostly on a single-node cluster (i.e., my laptop), which meant I never needed to submit a job to an actual cluster, a remote one in this case. Also, the first MapReduce programs I wrote and ran on the cluster (more to follow) did not depend on third-party jars. However, the program I am working on now depends on a third-party XML parser, which in turn depends on another jar.
As it turns out, I had to specify 3 external jars every time I submit a job. I knew there was a -libjars option, as I had seen it mentioned (including in the hadoop help output printed when you don't supply all the arguments for a command), but I had not paid attention because I did not need it then. Googling around, I found a suggestion to copy the jars to the lib folder of the Hadoop installation. That seems like a good solution until you think about a multi-node cluster, where you would have to copy the libraries to every node. Also, what if you do not have complete control of the cluster? Will you have write permission to the lib folder?
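For reference, here is a rough sketch of how a submission using -libjars is put together. It only assembles the command line; it does not run Hadoop. The jar names, the class name `com.example.MyJob` and the helper `LibJarsCommand` are made up for illustration. It also assumes the job's driver runs through `ToolRunner` (or otherwise uses `GenericOptionsParser`), since that is what actually interprets -libjars, and that the option is placed before the job's own arguments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: assembles a "hadoop jar" command line with -libjars.
class LibJarsCommand {

    // jobJar, mainClass, libJars and appArgs are illustrative parameters.
    static List<String> build(String jobJar, String mainClass,
                              List<String> libJars, List<String> appArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.add("hadoop");
        cmd.add("jar");
        cmd.add(jobJar);
        cmd.add(mainClass);
        if (!libJars.isEmpty()) {
            // -libjars takes a single comma-separated list of jar paths.
            cmd.add("-libjars");
            cmd.add(String.join(",", libJars));
        }
        // The job's own arguments (input path, output path, ...) come last.
        cmd.addAll(appArgs);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build(
                "myjob.jar",                       // hypothetical job jar
                "com.example.MyJob",               // hypothetical driver class
                Arrays.asList("parser.jar", "parser-dep.jar"),
                Arrays.asList("input", "output"));
        System.out.println(String.join(" ", cmd));
    }
}
```

The printed line is what you would type at the shell on a machine with the Hadoop client installed.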
Luckily, I bumped into a solution suggested by Doug Cutting in answer to someone in a similar predicament. The solution was to create a "lib" folder in your project and copy all the external jars into it. According to Doug, Hadoop will look for third-party jars in this folder. It works great!
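To make the layout concrete, here is a minimal sketch of packaging a job jar by hand, assuming the "lib" folder needs to end up inside the job jar itself, next to the compiled classes. The class `JobJarBuilder` and all paths are hypothetical, and it uses only the standard `java.util.jar` API; an Eclipse export or an Ant or Maven build could produce the same layout.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: writes a job jar with compiled classes at the root
// and third-party jars under lib/.
class JobJarBuilder {

    static void build(Path jobJar, Path classesDir, List<Path> thirdPartyJars)
            throws IOException {
        // Collect every regular file under the compiled-classes directory.
        List<Path> classFiles;
        try (Stream<Path> walk = Files.walk(classesDir)) {
            classFiles = walk.filter(Files::isRegularFile)
                             .sorted()
                             .collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(jobJar);
             JarOutputStream jar = new JarOutputStream(out)) {
            for (Path file : classFiles) {
                // Jar entry names always use forward slashes.
                String name = classesDir.relativize(file).toString().replace('\\', '/');
                addEntry(jar, name, file);
            }
            for (Path dep : thirdPartyJars) {
                // Each dependency is stored unmodified as lib/<file name>.
                addEntry(jar, "lib/" + dep.getFileName(), dep);
            }
        }
    }

    private static void addEntry(JarOutputStream jar, String name, Path source)
            throws IOException {
        jar.putNextEntry(new JarEntry(name));
        Files.copy(source, jar);
        jar.closeEntry();
    }

    // Usage: JobJarBuilder <job.jar> <classesDir> [dependency.jar ...]
    public static void main(String[] args) throws IOException {
        List<Path> deps = new ArrayList<>();
        for (int i = 2; i < args.length; i++) {
            deps.add(Paths.get(args[i]));
        }
        build(Paths.get(args[0]), Paths.get(args[1]), deps);
    }
}
```

Listing the result with `jar tf` should show the class files at the top level and each dependency under `lib/`, which is an easy way to check that a jar exported from Eclipse has the same shape.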

December 4, 2009 5:15 PM
I tried the solutions mentioned in the blog. However, only the second solution worked for me. The better one, which the author suggested, unfortunately didn't work for me, even though the same approach is proposed by Amazon. So I'm wondering how exactly that solution works. In my case, I created a jar file that contains my code as well as a lib directory with the 3rd party jars in it. Any insight is greatly appreciated.