Adku: Giving back - Extending MRJob

Friday, May 27, 2011

If you’ve ever used MRJob to run a MapReduce job and wanted to debug a problem, you’ve probably realized that attaching a debugger to your mappers and reducers is very difficult. We’ve solved this by building a local, inline job runner that makes debugging your MapReduce job as easy as debugging a standalone Python script.

Over the last few weeks, Adku has been playing around with a cool new project named MRJob, a Python package that streamlines the development of MapReduce jobs. One of the core features of MRJob is that you can write your MapReduce job once (in Python!) and run it pretty much anywhere. Out of the box, MRJob comes with support for launching your jobs onto a Hadoop cluster. It also comes with built-in support for launching jobs directly on Amazon's Elastic MapReduce (EMR) service. Why is this exciting, you ask? Well, Amazon makes it simple to launch Hadoop clusters of nearly any size in a matter of minutes. If you need a tiny cluster to test that your job works remotely, you can run your cluster on a single “m1.small” instance. If you need a huge Hadoop cluster to crunch gigabytes or even terabytes of data, you can spin up a cluster with dozens of “m1.large” instances.

Before spinning up hundreds of machines to run your mind-blowing MapReduce job, you'll probably want to test that it works so you don't blow through your IT budget. Nothing sucks more than spinning up a huge cluster, waiting for Amazon to provision all your machines, and then finding out that you had a typo in your job. Because Amazon rounds up on instance-hours used, your 5 minute jaunt to nowhere just cost you “$/instance-hour * # of instances * 1 hour”. Luckily, MRJob provides ways for you to test your job before launching on EMR.
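To make that arithmetic concrete, here is a minimal Python sketch. The function name `emr_cost` and the hourly rate are made up for illustration; the rate is not a real Amazon price, and the only billing rule modeled is the round-up to whole instance-hours described above.

```python
import math


def emr_cost(rate_per_instance_hour, num_instances, minutes_used):
    """Estimate cost when usage is rounded up to whole instance-hours."""
    # Any partial hour is billed as a full hour.
    billed_hours = math.ceil(minutes_used / 60)
    return rate_per_instance_hour * num_instances * billed_hours


# Hypothetical numbers: 100 instances at $0.50 per instance-hour,
# and a job that dies from a typo after 5 minutes.
print(emr_cost(rate_per_instance_hour=0.50, num_instances=100, minutes_used=5))
```

Under these assumed numbers, a 5 minute failure is billed exactly like a full hour on every instance in the cluster.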

Out of the box, MRJob comes with support for a local job runner that acts like a mini-Hadoop cluster with a single mapper and a single reducer. The local runner works by first trying to emulate the environment you would see in Hadoop. Then it runs the equivalent of “python mapper -> sorter -> python reducer”. Because the local job runner tries to clone the remote environment, it is useful for testing that you’ve uploaded all of your job's dependencies. It emulates Hadoop even further by passing data between steps through pipes, the way Hadoop streaming does. Unfortunately, because the local job runner works by spinning up subprocesses, your favorite debugger will not be able to attach to your map/reduce phases.
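The “mapper -> sorter -> reducer” flow can be sketched in plain Python. This is only an illustration of the data flow, assuming a simple word count; it is not MRJob's code, it runs everything in one process rather than in subprocesses connected by pipes, and the names `mapper`, `reducer` and `run_pipeline` are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    # Emit a (word, 1) pair for every word on the line.
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    # Sum all the counts seen for a single word.
    yield word, sum(counts)


def run_pipeline(lines):
    # Map phase: turn every input line into key/value pairs.
    mapped = [pair for line in lines for pair in mapper(line)]
    # Sort phase: bring identical keys next to each other.
    mapped.sort(key=itemgetter(0))
    # Reduce phase: hand each key and its values to the reducer.
    results = []
    for word, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(word, (count for _, count in group)))
    return results


print(run_pipeline(["the quick fox", "the lazy dog"]))
# [('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```

The sort step is what lets the reducer see all values for a key together, which is the same job the sorter does between the two Python processes in the local runner.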

Over the last few weeks, we've struggled with debugging our local MapReduce jobs. Raising exceptions in the map/reduce phases felt gross. Emitting counters with entire error messages felt even worse. If you’ve ever tried to debug a MapReduce job in a debugger or an IDE, you’ll know how difficult it is to get something as simple as a traceback. Really, what we wanted to do was attach a debugger to our map/reduce phases to see exactly what was going on. After some poking and prodding at MRJob's internals, we figured out a way to run mappers and reducers in the same process as the launcher script. By running your job with “--runner inline”, you can now attach a debugger to your job and observe pretty much everything! “--runner inline” is designed to be used in conjunction with “--runner local”. It is not meant to replace the default run mode (local), as we do NOT faithfully replicate the file system you would see in Hadoop. This runner is purely for development/debugging purposes.
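Here is a rough, standard-library-only sketch of why running in the same process matters. It does not use MRJob at all; `buggy_mapper`, `run_in_subprocess` and `run_inline` are hypothetical names invented for this demo. The subprocess version only gives the parent an exit code and some stderr text, while the in-process version gives you a live exception whose frames a debugger can inspect.

```python
import subprocess
import sys


def buggy_mapper(line):
    # Deliberate bug for the demo: int() fails on non-numeric input.
    return int(line) * 2


def run_in_subprocess(line):
    """Run the same failing logic in a child Python process."""
    code = "import sys; print(int(sys.argv[1]) * 2)"
    return subprocess.run(
        [sys.executable, "-c", code, line],
        capture_output=True,
        text=True,
    )


def run_inline(line):
    """Call the mapper in this process; return the locals of the failing frame."""
    try:
        buggy_mapper(line)
    except ValueError as exc:
        # Walk to the innermost frame, as a post-mortem debugger would.
        tb = exc.__traceback__
        while tb.tb_next is not None:
            tb = tb.tb_next
        # A real session could call pdb.post_mortem(exc.__traceback__) here.
        return dict(tb.tb_frame.f_locals)
    return None


proc = run_in_subprocess("oops")
print(proc.returncode != 0)         # the parent only learns that the child failed
print("ValueError" in proc.stderr)  # the traceback is just text on stderr
print(run_inline("oops"))           # in-process: the failing frame's locals
```

With MRJob itself you would not write any of this; you would pass the flag from the post to your job script, along the lines of “python your_job.py --runner inline input.txt”, where the script and input file names are placeholders.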

Where can you get this patch, you ask? Well, we love giving back to the community, so we sent our patch upstream a few days ago. Today, I’m happy to report that the maintainers over at Yelp accepted our changeset and have released it as part of mrjob 0.2.6!

PyPI registration - http://pypi.python.org/pypi/mrjob
Homepage - https://github.com/Yelp/mrjob
Documentation - http://packages.python.org/mrjob/

-Matt