Friday, May 27, 2011

Giving back - Extending MRJob

If you’ve ever used MRJob to run a MapReduce job and wanted to debug a problem, you’ve probably realized that attaching a debugger to your mappers and reducers is very difficult. We’ve solved this by building a local, inline job runner that makes debugging your MapReduce job as easy as debugging a standalone Python script.

Over the last few weeks, Adku has been playing around with a cool new project named MRJob, a Python package that streamlines the development of MapReduce jobs. One of the core features of MRJob is that you can write your MapReduce job once (in Python!) and run it pretty much anywhere. Out of the box, MRJob comes with support for launching your jobs onto a Hadoop cluster. It also comes with built-in support for launching jobs directly on Amazon's ElasticMapReduce (EMR) service. Why is this exciting, you ask? Well, Amazon makes it simple to launch Hadoop clusters of almost any size in a matter of minutes. If you need a tiny cluster to test that your job works remotely, you can run your cluster on a single “m1.small” instance. If you need a huge Hadoop cluster to crunch gigabytes or even terabytes of data, you can spin up a cluster with dozens of “m1.large” instances.

Before spinning up hundreds of machines to run your mind-blowing MapReduce job, you'll probably want to test that the job works so you don't blow through your IT budget. Nothing sucks more than spinning up a huge cluster, waiting for Amazon to provision all your machines, and then finding out that you had a typo in your job. Because Amazon rounds up on instance-hours used, your 5-minute jaunt to nowhere just cost you “$/instance-hour * # of instances * 1 hour”. Luckily, MRJob provides ways for you to test your job before launching on EMR.
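To make that arithmetic concrete, here is a minimal sketch of the cost formula above. The function name and the price and instance count are made up for illustration; they are not real Amazon rates, and the only billing rule assumed is the round-up to whole instance-hours described above.

```python
import math


def emr_cost(price_per_instance_hour, num_instances, minutes_used):
    """Estimate cost when each started instance-hour is billed as a full hour."""
    # Assumption: any usage, however short, is billed as at least one hour.
    billed_hours = max(1, math.ceil(minutes_used / 60))
    return price_per_instance_hour * num_instances * billed_hours


# Hypothetical numbers for illustration only, not actual Amazon prices.
five_minute_typo = emr_cost(price_per_instance_hour=0.50,
                            num_instances=100,
                            minutes_used=5)
print(five_minute_typo)
```

With these made-up numbers, a job that dies after 5 minutes is billed exactly the same as one that runs for the full hour.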

Out of the box, MRJob comes with support for a local job runner that acts like a mini-Hadoop cluster with a single mapper and a single reducer. The local runner works by first trying to emulate the environment you would see in Hadoop. Then it tries to run the equivalent of “python mapper -> sorter -> python reducer”. Because the local job runner tries to clone the remote environment, it is useful for testing that you’ve uploaded all job dependencies. The local job runner emulates Hadoop even further by passing data between phases through pipes, as Hadoop streaming does. Unfortunately, because the local job runner works by spinning up subprocesses, your favorite debugger will not be able to attach to your map/reduce phases.
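The “python mapper -> sorter -> python reducer” idea can be sketched with nothing but the standard library. This is not MRJob's actual implementation; the word-count scripts and the helper names (run_phase, run_local) are hypothetical. It only illustrates why a debugger started in the launcher never sees the map/reduce code: each phase lives in its own subprocess.

```python
import subprocess
import sys

# Hypothetical word-count mapper, run as a separate Python process.
MAPPER = r'''
import sys
for line in sys.stdin:
    for word in line.split():
        print(word.lower() + "\t1")
'''

# Hypothetical reducer; it relies on its input being sorted by key.
REDUCER = r'''
import sys
from itertools import groupby
pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
for word, group in groupby(pairs, key=lambda kv: kv[0]):
    print(word + "\t" + str(sum(int(count) for _, count in group)))
'''


def run_phase(script, input_text):
    """Run one phase in a child process, feeding stdin and capturing stdout."""
    result = subprocess.run([sys.executable, "-c", script],
                            input=input_text, capture_output=True,
                            text=True, check=True)
    return result.stdout


def run_local(input_text):
    """Emulate mapper -> sorter -> reducer with one subprocess per phase."""
    mapped = run_phase(MAPPER, input_text)
    # The sort step stands in for Hadoop's shuffle/sort between phases.
    sorted_lines = "".join(sorted(mapped.splitlines(keepends=True)))
    return run_phase(REDUCER, sorted_lines)


if __name__ == "__main__":
    print(run_local("the cat\nthe dog\n"), end="")
```

A breakpoint set in the launcher stops in run_local, but the mapper and reducer code runs in child interpreters that the debugger is not attached to.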

Over the last few weeks, we've struggled with debugging our local MapReduce jobs. Raising exceptions in the map/reduce phases felt gross. Emitting counters with entire error messages felt even worse. If you’ve ever tried to debug a MapReduce job in a debugger or an IDE, you’ll know how difficult it is to get something as simple as a traceback. Really, what we wanted to do was attach a debugger to our map/reduce phases to see exactly what was going on. After some poking and prodding at MRJob's internals, we figured out a way to run mappers and reducers in the same process as the launcher script. By running your job with “--runner inline” you can now attach a debugger to your job and observe pretty much everything! “--runner inline” is designed to be used in conjunction with “--runner local”. It is not meant to replace the default run mode (local), as we do NOT faithfully replicate the file system you would see in Hadoop. This runner is purely for development/debugging purposes.
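The core idea of running everything in one process can be sketched as follows. This is a simplified illustration and not the code that went into MRJob: the mapper, reducer, and run_inline names are hypothetical, and the job is a plain word count.

```python
from itertools import groupby
from operator import itemgetter


def mapper(_, line):
    """Emit (word, 1) for every word; a debugger can stop on any line here."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    """Sum the counts for one word."""
    yield word, sum(counts)


def run_inline(lines):
    """Run map, sort, and reduce as ordinary function calls in this process."""
    mapped = [pair for line in lines for pair in mapper(None, line)]
    # Sorting by key stands in for the shuffle/sort step between phases.
    mapped.sort(key=itemgetter(0))
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(key, (value for _, value in group)))
    return results


if __name__ == "__main__":
    for word, count in run_inline(["the cat", "the dog"]):
        print(word, count)
```

Because the phases are just function calls, dropping “import pdb; pdb.set_trace()” into mapper or reducer stops exactly where you expect, and an uncaught exception produces a normal traceback. The trade-off is the one noted above: nothing here resembles the environment the job will see on a real cluster.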

Where can you get this patch, you ask? Well, we love giving back to the community, so we sent our patch upstream a few days ago. Today, I’m happy to report that the maintainers over at Yelp have accepted our changeset and released it as part of mrjob 0.2.6!

PyPI registration - http://pypi.python.org/pypi/mrjob

Homepage - https://github.com/Yelp/mrjob
Documentation - http://packages.python.org/mrjob/

-Matt

Monday, May 16, 2011

We’ve moved into new offices!


It's been really busy over the last few months while we've been building our product and company, and we haven’t had much time to come up for air. We've got one piece of fun news that we'd like to share - we just moved into our new office! As we've expanded the team, we could no longer fit into our previous space, which was graciously provided by our friends at Pivotal Labs.

We’re based in the heart of San Francisco's tech center, in SoMa at 300 Beale St, close to great technology startups and software giants as well as great restaurants and public transportation.

We looked at quite a few different spaces and really wanted something unique and creative and, after a lot of looking, we found the perfect place for us. For those who’ve been around San Francisco for a while, this space used to be occupied by Elroy’s, which incidentally happened to be one of the first restaurants I ever ate at when I moved to the Bay Area.

If you’re in the area, please stop by and say hello! We’ve got a lot of plans for the office and I’ve posted some pictures of how the space looks in the raw. If you have ideas on how we should decorate, just let us know in the comments.

This is just another step in our journey as a startup and we’re really excited to share it with you.
Entrance

Bottom floor looking up at the mezzanine
Front conference/game room

Back conference room
Future break room and bike racks
Top floor looking down.