Adku: February 2011

Tuesday, February 8, 2011

Startup Interviews

I spent almost 6 years at Google interviewing nearly 300 potential Googlers. That interviewing career started with an interviewing class and some shadowing of fellow engineers before diving in solo. At Google I focused my interviews on figuring out whether the person I was interviewing would have a good shot at being successful there. My requirements for Google success changed a bit over time, but in general I looked for:

Raw problem solving ability. I generally checked this by asking what was essentially a math/problem solving question disguised as a coding question. It usually required almost no coding knowledge beyond what someone might learn in their first few months of coding.
Ability to quickly see multiple solutions, weigh them, verbalize them, and select the most appropriate given the conditions that I've put forth. I'm constantly surprised at how rarely applicants verbalize the different options for solutions to a problem, outline the pros and cons of the solutions and then proceed.
A substantial level of curiosity for anything they have spent time on. "If architectural decisions were made before you joined the team or without you, did you figure out why they were made, how would you have done things differently?", "How much time do you spend looking at library source files to really understand the technology you are using?", etc.
Communication skills. This one is a bit trite but must be mentioned as it is as important as the others listed here. During an interview, the candidate's ability to articulate their thinking process, their background, and their goals is key.
Ability to contribute to overall team happiness. If you are going to spend over 60% of your waking life during the week working, hopefully you enjoy what you are doing and who you are working with. When someone adds to that enjoyment, the result is a happy team, and happy teams are generally more productive, more communicative, and in the end, more successful.

At Adku, am I looking for anything different? I'm looking for every one of those qualities and more. The more:

A self starter/entrepreneur personality. "Do you have side projects that you've worked on outside of school/work?", "Have you thought about starting your own company, why didn't you?" I believe the best hires for a startup are entrepreneurs or almost-entrepreneurs.
A coding fearlessness of the unknown. A startup generally won't have an entire team devoted to just one slice of the entire stack. Because of this, it is likely that anyone hired will have to dive into any part of the stack and be extremely productive. If the person can't pick up a new language and/or technology and get a Hello World working within hours, then they are likely the wrong person.
A level of self awareness and honesty. Being a part of a startup is a wild and amazing ride. Efficiency is built not only on the ability to code and code review quickly, but on the ability to recognize issues and course correct as fast as possible. Being successful is based on many things, including your ability to be honest about your mistakes. Assessing this in an interview is difficult and subtle, but it can be done. Applicants shouldn't "gloss" over details they are unfamiliar with in the hope that the interviewer won't notice the lack of deep understanding; admit what you understand and what you don't.
Product strategy/tech business curiosity. Product discussion is a very good thing, and having every member of the team engaged in that discussion is important. The ability to have a healthy discourse on product/company strategy, from both business and technical points of view, is key.

Does this sound like you? If so, we'd love to talk to you!

Wednesday, February 2, 2011

HBase vs Cassandra

Adku recently decided to migrate part of its data to a NoSQL database in order to deal with increasing load on our MySQL database. We evaluated many options, including MongoDB, Amazon SimpleDB, and a few others, but narrowed them down to just Cassandra and HBase. We experimented with both databases, evaluated them more deeply, and ultimately decided to use Cassandra. This post explains the high-level reasons why we chose Cassandra.

Disclaimer: While these decisions apply to Adku, they might not apply to your situation. Always do your own investigation and experimentation before choosing any large part of your system.

Reliability
- We wrote a stress test tool to simulate how the databases would behave under high load. We used the default minimal configuration for each database, following their respective documentation. Our stress test continuously inserted 1 million rows, each containing the full text of "Alice in Wonderland" (~180 KB), from a single thread into a 3-node cluster. Surprisingly, the HBase region servers crashed on us consistently. Cassandra never crashed once. Although we would obviously scale the cluster to cope with high load and crashing nodes, this is still concerning and a large win for Cassandra.
Performance
- We also wrote a simulation tool to test a more realistic scenario. Similar to the above, the tool continuously inserted 1 million rows, each containing 2000 characters of ASCII text, from a single thread into a 3-node cluster. HBase averaged 507 microseconds per write, while Cassandra averaged 480 microseconds per write. We interpreted this as basically equivalent performance, so there was no real winner here.
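As a rough illustration of how a per-write average like the ones above can be measured, here is a minimal single-threaded timing sketch in Python. It is not the tool we used: `average_write_micros` and `in_memory_write` are hypothetical names, and the dictionary-backed writer is only a stand-in for a real Cassandra or HBase client call.

```python
import time

def average_write_micros(write_row, num_rows, payload):
    """Call write_row(key, payload) num_rows times from one thread and
    return the mean wall-clock time per write in microseconds."""
    start = time.perf_counter()
    for i in range(num_rows):
        write_row(f"row-{i}", payload)
    elapsed = time.perf_counter() - start
    return elapsed / num_rows * 1_000_000

# Hypothetical stand-in for a database client: rows go into a dict.
store = {}

def in_memory_write(key, value):
    store[key] = value

payload = "x" * 2000  # 2000 characters of ASCII text per row
avg = average_write_micros(in_memory_write, 10_000, payload)
print(f"{avg:.1f} microseconds per write")
```

To compare two databases, the same function would be called once per database with a writer that wraps the corresponding client, using the same row count and payload for both.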
Consistency
- Consistency is not a hard requirement for our specific use case, so Cassandra's eventual consistency model is fine for us. If we do end up needing consistency, Cassandra can support it through its configurable consistency model; we would just have to take a performance hit. HBase is consistent out of the box, so it is technically the winner here, but since consistency is not a hard requirement for us, this doesn't carry much weight.
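The performance trade-off can be sketched with the usual rule of thumb for replicated stores like Cassandra: a read is guaranteed to see the latest write when the number of replicas read (R) plus the number of replicas written (W) exceeds the replication factor (N). The snippet below is only arithmetic, not client code; the function names are made up for illustration, and a replication factor of 3 is an assumption.

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1

def reads_see_latest_write(n, w, r):
    """True when every read set must overlap every write set (R + W > N)."""
    return r + w > n

n = 3  # assumed replication factor
levels = {"ONE": 1, "QUORUM": quorum(n), "ALL": n}

for w_name, w in levels.items():
    for r_name, r in levels.items():
        verdict = "consistent" if reads_see_latest_write(n, w, r) else "eventual"
        print(f"write={w_name:<6} read={r_name:<6} -> {verdict}")
```

Writing and reading at ONE touches the fewest replicas and is the fastest combination, but it only gives eventual consistency; moving either side to QUORUM or ALL means waiting on more replicas, which is where the performance hit comes from.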
Single Point of Failure
- Hadoop's namenode, which HBase depends on, is a single point of failure. If the namenode goes down, the entire database is unreachable. All Cassandra nodes are identical, so there is no single point of failure. This is a win for Cassandra.
Hot Spot Problem
- Our relevant row keys are currently all timestamps. HBase stores rows in sorted row key order, so the row key determines which node holds the data. Cassandra by default places each row on an effectively random node, in random order. With timestamp keys, consecutive writes in HBase land next to each other, which creates a hotspot where one node handles most of the write traffic. Cassandra, however, distributes the load evenly across all nodes. This is a win for Cassandra.
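The effect is easy to reproduce in a toy model. The sketch below compares sorted-range placement with hash-based placement for increasing timestamp keys on 3 nodes. It is a simplification under stated assumptions: the split points are invented, and the hash placement uses MD5 modulo the node count rather than Cassandra's actual token ring, so it only illustrates the shape of the problem.

```python
import bisect
import hashlib
from collections import Counter

NUM_NODES = 3

def range_node(key, split_points):
    """Sorted-range placement: keys below split_points[0] go to node 0,
    keys below split_points[1] go to node 1, the rest go to the last node."""
    return bisect.bisect_right(split_points, key)

def hash_node(key, num_nodes=NUM_NODES):
    """Hash placement: the node is picked from a digest of the key."""
    digest = hashlib.md5(key.encode("ascii")).hexdigest()
    return int(digest, 16) % num_nodes

# Timestamp-like row keys arriving in increasing order.
base = 1296600000
keys = [str(base + i) for i in range(9000)]

# Assumed split points taken from older data; every new key sorts after them.
split_points = [str(base - 2000), str(base - 1000)]

range_counts = Counter(range_node(k, split_points) for k in keys)
hash_counts = Counter(hash_node(k) for k in keys)

print("sorted-range placement:", dict(range_counts))
print("hash placement:        ", dict(hash_counts))
```

With sorted-range placement all 9000 writes go to the last node, because every new timestamp is larger than the existing split points. With hash placement the writes are spread across all three nodes in roughly equal shares.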
MapReduce
- HBase is built on top of HDFS and Hadoop, so MapReduce is very easy. Cassandra supports MapReduce but doesn't support streaming MapReduce, so you have to write jobs in Java. I'm also not sure what the relative performance is. Cassandra supports data locality, so MapReduce tasks end up processing data stored on the machine they run on; it's possible that performance is comparable, but I haven't done adequate tests. HBase is the winner here, but so far we haven't seen any drawbacks to running MapReduce on Cassandra (besides having to write in Java).
Simpler, Hackable
- Cassandra is a simpler implementation and much easier to hack. This is the same reason we chose Tornado instead of Apache as our web server, and we've made quite a few modifications to Tornado as a result. Bugs are also much easier to track down. HBase by comparison is much more complicated and harder to debug and hack. This is a win for Cassandra, and we've already submitted patches back to open source Cassandra.
Community Support
- As of today, there are 175 users in the #cassandra channel on irc.freenode.net. HBase's channel by comparison has only 74. Aside from IRC, it does appear that the Cassandra community is larger and more helpful than HBase's. Another win for Cassandra.