Tuesday, October 11, 2011

Speed feels good

Sitting in front of my computer typing at over 120 words per minute feels DAMN good... or at least I imagine it would if I could actually type 120wpm. Sadly, as much as I try, I can't type that fast, but... I *can* cheat my way into working as fast as my fictional friend, let's call him Carlos, can (or faster, shhh!).

There are probably better methods of working faster out there, but in this post I want to convince you to alias your already short commands to be even shorter, and to show how doing that will make both your fingers AND your brain more efficient.

Efficient Fingers

How many times a day do you type "ls"? 100? 1000? If you aliased "ls" to a single letter,
you would save 33% of the keystrokes used to issue that command (including the enter key). 33% is pretty damn significant. That's 100-1000 fewer keystrokes per day. Your fingers already look like body builders, you type-a-holic; they don't need to be working out any harder than they have to.

What about "cd .."?
increases your productivity by 300%! That's like 200 less keystrokes a day?

Not to belabor the point, but how often do you type "cd" without immediately typing "ls" right afterwards?

Here are some common ones I use daily.
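Something along these lines goes in your ~/.bashrc (the exact list is illustrative; the "cs" helper is just one way to act on the cd-then-ls observation above):

alias s='ls'             # "ls" for half the keystrokes
alias d='cd ..'          # pop up a directory
cs() { cd "$1" && ls; }  # cd and immediately ls, since the two almost always go together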

You may have noticed that I chose "d" and "s" for these common commands. The reason is that "d" and "s" are on the home row of your qwerty keyboard. This means your fingers have a shorter distance to travel, which means MORE SPEED. They are also typed with your left hand, so your right hand can sit on top of the enter key, ready and waiting to get all up on that enter key RIGHT AFTER you type your 1-letter command. 2 hands are better than 1.

You can alias all day long and find longer and longer commands that you type often. Things like "ssh -i ec2-keypair ubuntu@ec2.adku.com" or "cd /mnt/var/log/supervisor", each of which I run 100 times a day, have their own aliases, saving me completely UNFATHOMABLE amounts of time. I seriously can't even comprehend the savings.
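For instance (the alias names here are made up, but the commands are the real ones from above):

alias ec2='ssh -i ec2-keypair ubuntu@ec2.adku.com'   # one word to get onto the box
alias logs='cd /mnt/var/log/supervisor'              # one word to get to the logs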

Anyway, the point is, (fictional) Carlos may type faster than me, but that doesn't mean he gets more done than me =)

Efficient Brain

In practice, after years of obsessing over aliases and reducing keystrokes, I've unexpectedly found that the productivity gain is actually quite a bit more than just the reduced mechanical overhead of hitting keys. It also lets you work that much closer to the speed of thought, eliminating the costly context switch between deciding what to do and executing the commands. We all know how expensive context switching is when writing code, but there is a much smaller, more pervasive context switch that happens between the time you decide you want to look at your log files and the time you're actually able to see them. Sometimes you get to the right folder and forget what you were looking for. Getting to your log files in 3 seconds instead of 10 turns out to make a big difference; it keeps your mind on what you're doing and less on telling your fingers which buttons to push == SPEED x 1000!

If you like working fast and getting a lot done, consider applying for a job at Adku.

Monday, October 3, 2011

Scalable realtime stats with Graphite

Introduction


Here at Adku, we’re always looking for ways to move faster and smarter.  There’s nothing worse than having to wait a day or two to see if a code push has a positive or negative effect on our bottom line.  One way to track application issues is to create a stat and graph it.  While there are plenty of solutions out there today, we’re using Graphite, a realtime graphing framework, and we love what it's done for us thus far.  Here’s what Graphite can do: take any stat you send it and turn it into a realtime graph, with storage and retention handled for you.


We recently went through the process of setting up our own Graphite cluster and below are the references/steps that we used to get everything running.  We hope this helps :)


Graphite installation guide


This guide will install all the software packages required to get you up and running.  Graphite will be set up to run across multiple machines.  NOTE: these steps are virtually verbatim copies of our setup scripts.  We are assuming you’re working with a clean Ubuntu 10.04 LTS installation (we’re using Ubuntu AMIs on EC2).  We will NOT be delving into specifics for the non-Graphite related apps.  In case you’re curious, this guide will install the following packages:

Process control - supervisor - akin to init.d
Web serving - nginx, uwsgi - akin to apache + mod_wsgi/mod_python
Caching - memcached
Stats collection - statsite - akin to Etsy’s statsd without a dependency on node.js
Graphite - graphite-web, carbon, whisper - required for stats collection


Step 2 - Run bootstrap.sh

Step 3 - Copy all files from the archive into /etc and /opt respectively

Step 4 - Update the following config variables

/etc/nginx/nginx.conf - worker processes (line 4)
/etc/nginx/sites-enabled/graphite - public hostname (line 3)
/etc/supervisor/supervisord.conf - web processes (line 39)
/opt/graphite/conf/carbon.conf - storage dir and cluster servers (line 2 and line 96)
/opt/graphite/conf/relay-rules.conf - cluster servers (line 3)
/opt/graphite/conf/storage-schemas.conf - retention times (line 4 - optional)
/opt/graphite/webapp/graphite/local_settings.py - storage dir and cluster servers (all lines)
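As a rough illustration of what step 4 amounts to for the webapp, the graphite-web settings end up looking something like the sketch below. The option names follow graphite-web's local_settings.py.example; the hostnames and paths are placeholders you'd swap for your own cluster.

# /opt/graphite/webapp/graphite/local_settings.py (sketch only)
STORAGE_DIR = '/opt/graphite/storage/'               # where the whisper files live
MEMCACHE_HOSTS = ['127.0.0.1:11211']                 # the memcached we installed above
CARBONLINK_HOSTS = ['127.0.0.1:7002']                # carbon-cache's query port on this box
CLUSTER_SERVERS = ['graphite-01.yourhost.com:7001',  # the other graphite-web instances
                   'graphite-02.yourhost.com:7001']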

Step 5 - Set up the graphite-web Django app

cd /opt/graphite/webapp/graphite
yes no | python manage.py syncdb

Step 6 - Spin up supervisord - it should spin up all the other processes

sudo supervisorctl status
sudo supervisord (only if needed)

Step 7 - Ensure that our processes didn’t blow up

sudo tail -f /var/log/supervisor/*.log

Step 8 - Feed stats to your cluster.  There are statsd/Graphite client libraries for most languages; a bare-bones example follows below.
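As a quick sanity check, you can speak the statsd line protocol ("name:value|type") to statsite by hand over UDP. The metric names here are made up, and we're assuming statsite is listening on the usual statsd UDP port, 8125:

echo "adku.web.requests:1|c" | nc -u -w1 statsd.yourhost.com 8125          # bump a counter
echo "adku.web.response_time:320|ms" | nc -u -w1 statsd.yourhost.com 8125  # record a timing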


Step 9 - Check your stats server to make sure your stat showed up!

http://statsd.yourhost.com:7001/


Further reference

Graphite

nginx/uwsgi - used for web serving

statsite - Python implementation of Etsy’s statsd

Thursday, September 15, 2011

A Hodgepodge of Python


Here at Adku, we do most of our development in Python. Python is concise, expressive, and well suited to rapid prototyping and readable code. I must admit though, I was initially skeptical of the choice to use Python. In my mind, dynamic typing = bugs, interpreted language = slow. Ahhhhh! But slowly my misgivings have disappeared, and I’ve come to really appreciate the Python language for the wide coverage of its standard library and the elegant design of many of its idioms.
For the rest of this post, I’d like to go over three of my favorite python features.
The timeit module
My initial worries about Python’s performance versus traditional compiled languages have turned out to be irrelevant*. But in the few cases where I did have cause for worry, I found Python’s timeit module to be a great help. Unlike other profiling modules, timeit can be run straight from the command line, so we can quickly answer questions like the following.
How much faster is xrange than range?
:~/ $ python -m timeit 'for i in xrange(1000): continue'
100000 loops, best of 3: 19.1 usec per loop
:~/ $ python -m timeit 'for i in range(1000): continue'
10000 loops, best of 3: 24.6 usec per loop
What is the fastest way to reverse a list?
:~/ $ python -m timeit 'l=range(1000)' 'l[::-1]'
100000 loops, best of 3: 12.3 usec per loop
:~/ $ python -m timeit 'l=range(1000)' 'l.reverse()'
100000 loops, best of 3: 8.85 usec per loop
And as you’ll notice from the second example, even simple multiline programs can be timed; just pass each statement as a separate quoted argument.
* a poor algorithm seems to be more often responsible.
** Matt just pointed out to me that more detailed profiling can be obtained with a cProfile two-liner:
python -m cProfile (your script path) > timing.txt
cat timing.txt | sort -snk 4 | tail -n 50



zip
You can read about zip here: http://docs.python.org/library/functions.html#zip. The way I like to think about it is: if nested for loops are like walking through a matrix element by element, zip allows you to just “zip” down the diagonal.
Generally, zip is used to combine columns or rows of data. But there are two use cases which I think are really cool.
Matrix transpose:
Suppose we are given a matrix expressed as a nested list:
a = [[...],...[...]]
Then we can find the transpose of this matrix by writing
zip(*a)
Grouping a list:
Suppose we have a list
a = [x1, x2, x3, …, xn]
Then we can group it into [(x1,x2), (x3,x4), … (xn-1,xn)] by writing
zip(*[iter(a)]*2)
Replace 2 with any integer k, and you get groups of size k.
The main thing to understand in both of these examples is the * operator. In Python, * has the usual meaning of multiplication when it’s a binary operator, but as a unary prefix in a function call it unpacks a list and passes its elements as separate arguments to the function. Aside from that, both of these examples just require some staring and thinking to understand. But it’s worth it! I promise.
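To make both tricks concrete, here is what they look like in a Python 2 session (the data is made up):

a = [[1, 2, 3],
     [4, 5, 6]]
print zip(*a)              # [(1, 4), (2, 5), (3, 6)] -- the transpose, as tuples

b = [1, 2, 3, 4, 5, 6]
print zip(*[iter(b)] * 2)  # [(1, 2), (3, 4), (5, 6)] -- pairs
print zip(*[iter(b)] * 3)  # [(1, 2, 3), (4, 5, 6)] -- groups of three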
collections.defaultdict
With collections.defaultdict, you can set a default value for the dictionaries you create. No more
a = {}
if 'foo' not in a:
    a['foo'] = 1
else:
    a['foo'] += 1
Just do
import collections
a = collections.defaultdict(int)
a['foo'] += 1
...
If you want, you can also do nested dictionaries. For example,
a = collections.defaultdict(lambda: collections.defaultdict(int))
will initialize a “matrix” of all zeroes.
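For example, with collections already imported as above (the keys here are made up):

a = collections.defaultdict(lambda: collections.defaultdict(int))
a['clicks']['monday'] += 1
print a['clicks']['monday']   # 1
print a['clicks']['sunday']   # 0 -- missing cells just read as zero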
That’s it for now. I would love to hear about more python features in the comments.

Tuesday, August 16, 2011

pip - python package management

If you’ve ever developed a Python application before, you may be familiar with the litany of packages you need to get everything running. You might start off your web app with Django, then add in Jinja2 to get some template speedups. From there, you might want some additional speed, so you install packages like Cython, MarkupSafe, and simplejson for their C-extensions. After that, you realize you might want to dabble around with PyMongo as well. You go back to hacking for a while until you start thinking to yourself... how am I ever going to manage all these 3rd party packages? Luckily for us, things aren’t as dire as they seem. Python package management has gotten a lot better over the last few years, thanks in part to PyPi, Python’s central package repository, and a tool called “pip”.


In essence, pip is a Python package installer.

Things pip can do
  • Manage python packages much like apt-get manages system packages
  • Automatically build C-extensions as required
  • Automatically upgrade/downgrade packages based on your specs

Here’s a quick cheatsheet to the commands we most commonly use

Enumerating installed packages

$ pip freeze

Uninstalling packages

$ sudo pip uninstall Django

Installation - By name and version

$ sudo pip install Django==1.2.5 pycrypto==2.0.1 simplejson==2.1.5

Installation - From a requirements file

A requirements file is a text file that literally looks like the following. This makes it trivial to version control what python packages/versions we have running in our production systems.
Django==1.2.5
pycrypto==2.0.1
simplejson==2.1.5
$ sudo pip install -r my-requirements.txt

We found "pip" to be VERY fast at enumerating already installed packages. To see this for yourself, re-run the above command! We currently have about 35 packages in production and
"pip" can enumerate all of these in about half a second.

Installation - Using a private PyPi server

All commands thus far have managed to magically discover these packages/versions and install them on our behalf. Behind the scenes, "pip" installs from PyPi, a public HTTP server that has meta-information regarding all these packages. You can think of PyPi as the closest thing Python has to a central package repository. Most of the time, we can rely on PyPi being up. To isolate yourself from the occasional PyPi hiccup though, you can set up your own private PyPi server. To install packages using your own private server, run

$ sudo pip install [-r my-requirements.txt] [package==version] --index-url http://your-pypi-server:8001/simple

Setting up your own private PyPi server has some benefits; you can
  • Isolate yourself from PyPi going down
  • Download all packages over a local network connection (Speedy!)
  • Manage custom-modified packages and have them install as part of the standard requirements process - for example, if you needed to hack Tornado
Setting up a private PyPi server is a bit beyond the scope of this article but here’s an article that definitely helped us on our way
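One small convenience worth mentioning (a sketch, not something covered above): instead of passing --index-url on every invocation, pip will also pick it up from a per-user config file, ~/.pip/pip.conf on Linux:

[global]
index-url = http://your-pypi-server:8001/simple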



Thanks to “pip”, we’ve managed to tame our package installation process. Hopefully, after reading this article, you’ll have tamed yours too :)

Thursday, August 11, 2011

Hardly working hard

I am lazy. I don’t remember when it began, but to the great frustration of my mother, it has yet to end. Luckily though, there are many tools available for the lazy software engineer, some of which I’d like to share today in this post.

Too lazy to type?

I like saving keystrokes. Why type when you can... not type? There are a few bash tricks I learned from Carlos and Jesse that I really like:

  1. cd -: cd to the last visited directory. Everybody knows about cd ~, but this is at least as useful. Instead of taking you home, cd - takes you to the last directory you visited.

Example use case:

cd /path/to/logs

tail log1 | cut -f 3 | sort

tail log2 | cut -f 5 | sort

cd -

vim

2. ctrl + r: Reverse search bash history. Here at Adku, we run A LOT of map reduces. Most of the time, we’re calling the same few commands. ctrl + r lets us run, forget, and then run again later.

Example use case:

fab some_map_reduce_job:arg1='ahhhh',arg2='wahhhhh'

cd ../

touch 'asdf'

ctrl + r

fab

fab some_map_reduce_job:arg1='ahhhh',arg2='wahhhhh'

3. !n:x-y: Recall arguments x through y from the nth most recent command. This is another command that is a great help when running map reduces with complicated argument lists. It is a lot more complicated than the other commands mentioned here, though, and worth reading about separately. True to the theme of this post, I redirect you to http://www.catonmat.net/blog/the-definitive-guide-to-bash-command-line-history/ for a great resource on this, and will stick to giving a few simple examples.

Example use case(s):

>echo "foobar"

foobar

>echo "moocow"

moocow

>!-2

foobar

>!-2

moocow

--------

cat foo > impossible_to_remember_filename_akj3kj2437fvaj

nano !:3

--------

port install haskell-platform

##oh no, permission failure##

sudo !!

---------

>echo "hi" && echo "why"

hi

why

>!!:0-1

hi

>!-2:3-$

why

Too lazy to go to work?

Being too lazy to go to work is, of course, never a problem for me. But theoretically, I could see the following scenario occur.

1. While at work, start an ssh tunnel between your work computer and a gateway server. (For details on how we do this, check out: http://blog.adku.com/2011/06/working-remotely.html)

2. Go home, relax.

3. Start an ssh session to your work server.

4. Start tmux.

5. Never go to work again.

With the exception of 4 (and 5?), this is probably a familiar process. But here's the beauty of step 4:

connection closed by remote host.

Never again.

To quote the tmux man page:

Each session is persistent and will survive accidental disconnection (such as ssh(1) connection timeout) or intentional detaching (with the `C-b d' key strokes).

Aside from never losing your work due to connection problems, Tmux also makes it easy to start a job at work, go home, and continue monitoring that job remotely, or vice versa.

For example:

I start a tmux session and begin tailing a random log at work.

To remember this session, I name it by typing

ctrl + b :
rename-session "foo"

Then I go home.

At home, I ssh into my work computer and type

tmux attach -t foo

and everything just works!
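(For what it's worth, you can also name the session when you create it and skip the rename step. The whole round trip looks roughly like this, with made-up host and log names:)

# at work
tmux new -s foo              # start a named session
tail -f /var/log/some.log    # do some work inside it
# ctrl + b d detaches; a dropped connection is just as harmless

# at home
ssh my-work-machine
tmux attach -t foo           # picks up exactly where you left off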

In addition to creating and naming sessions, tmux makes it easy to create and name windows within a session. I often find myself running many jobs of the same type, which I like to organize into named windows within a single session.

The windows I have open are called “mo”, “meeny”, and “eeny”. In each window I am running a different map reduce job. By naming the windows, I’ll remember exactly what I was running in each one wherever I go.

Basic window usage:

create

ctrl + b c

delete

ctrl + b x

rename

ctrl + b ,

navigate to previous window

ctrl + b p

navigate to next window

ctrl + b n

navigate to window #

ctrl + b #

Too lazy to...continue?

Monday, July 25, 2011

Working with Git - Part 1 - Our initial experience

When Adku first opened shop, we used Subversion for source control. Since then, we’ve migrated over to the wonderful world of Git and wanted to share why we moved and some aha moments we’ve experienced along the way.

Why we moved to Git
  1. Speedy commits and updates; Subversion commits got to be painfully slow, sometimes taking minutes
  2. Cheap local branching
  3. Better merging. e.g. merges follow renames and code moves
  4. Better support for file renames and file permission changes
  5. Interactive commits or the ability to commit partial files

Our aha moments
  1. A “commit” is essentially a diff plus a pointer to the parent commit it was created from
  2. A “branch” is a pointer that can be updated to point to any commit
  3. Deleting a “branch” is akin to deleting the pointer to that commit; if you have another “branch” (pointer) pointing to the same commit, your changeset will NOT disappear
  4. A “remote” is a reference to a clone of this repository. This clone is usually hosted on another machine but it can also be another repository on your own hard drive.
  5. “git pull” is effectively the same thing as a “git fetch && git merge”
  6. “git fetch” is safe to re-run and does NOT update any files in your branch. It updates your knowledge of where the “remote” repository thinks it is.
  7. “git rebase origin/master” works by uncommitting all your commits in reverse order, updating your branch to origin/master, and then replaying your commits in order
  8. Your repository will continue to think a branch exists on a “remote” repository even if it was deleted by another developer. To remove branches that don’t actually exist on the “remote” anymore, you can run “git fetch --prune”
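To tie items 5 through 8 together, a typical update of a local branch boils down to something like this (a sketch, not a prescription):

git fetch --prune            # refresh your view of the remote and drop branches deleted upstream
git rebase origin/master     # replay your local commits on top of the freshly fetched master
# a plain "git pull" would instead do the fetch plus a merge:
# git fetch && git merge origin/master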

At the end of the day, we realized that these insights did not come cheaply. We invested a lot of time trying to understand Git and we recognized that we could not afford to have every new hire spending weeks getting up to speed on a version control system.

We ultimately decided to write bash scripts for the most common developer actions to help encapsulate all our Git knowledge and best practices. The culmination of that project is a development model we’re calling Dr. Git (Develop/Release with Git) which we’re excited to share in a following blog post.

Tuesday, July 19, 2011

Redmine

There are so many issue tracking tools available today with so many features, it was hard to figure out which one was best for us.  We looked at JIRA, Trac, FogBugz, Plan.io and a few others, but in the end, we ended up choosing Redmine and hosting it ourselves.  

For our usage patterns, we had some very specific features that we wanted in an issue tracker.  Things like road map planning tools, search, and a good UI were pretty common, but there were a couple of features that we had trouble finding out of the box.
  • Tight email integration
    • We wanted to be able to reply-all to an email and add the issue tracker to create a new issue with the thread in the description.
      • Subsequent emails to the same gmail thread would then be automatically added to the issue.
    • We wanted to be able to modify issues directly from email using special syntax like "assignee: jesse" or "status: in progress".
  • Github integration
    • We wanted to be able to attach git commits to issues using special syntax in the commit message like "closes #12".
    • We didn't want to migrate our repository off of Github.

A lot of issue trackers do have email integration, but with one slight quirk.  If you reply-all and add the issue tracker, it creates a new issue.  If someone else then replies to the thread, a brand new issue gets created!  We wanted the issue tracker to be intelligent enough to realize that it was the same issue and update it instead of creating a new one.  Redmine didn't do this out of the box, but because it's open source, we edited a few lines and got it to work the way we wanted in a little less than 5 minutes.  This may seem like a small, subtle issue, but it's one that we now make use of many times each day so the impact is significant in our daily usage pattern.

A lot of issue trackers integrate well with source control, but they require you to host your repository with the issue tracker!  This means that we'd either have to move our repo off of Github or we'd have to set up processes to push our code to two repositories.  Pushing to two repositories sounded like a potential nightmare, and few issue trackers allow you to integrate with an external repository like Github.  What we ended up doing was syncing our Redmine repository directly from Github using a cronjob (see the sketch below).  Post-commit hooks would have been better, but we left that as a TODO since it would require more than 5 minutes to set up and the cronjob was sufficient for the job.
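The cron job itself is nothing fancy; something along these lines, run every few minutes, keeps the local clone that Redmine reads in sync with Github (the path is hypothetical, and we assume the clone was created with git clone --mirror so a plain fetch updates everything):

*/5 * * * * cd /opt/redmine/repos/adku.git && git fetch -q origin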

Redmine's got some great plugins too.  We use the Backlogs plugin for a great drag-n-drop UI and burndown charts.

Given that we've already made two edits to the issue tracker, we felt great peace of mind choosing an issue tracker that was open source and self-hosted so we can modify it to suit our needs whatever they may be.  Oh and also, it's free =).