1. Oct 3rd, 2007

    Not all keys created equal

    Sam Ruby on Key + Data:

    What do dynamo, memcached, Berkley DB, and couchdb have in common with each other, and in many ways with other structures like my hard drive or your mail or the www? Namely that everything is accessed by a primary key, and that metadata is either attached to, or embedded within, that data.

    Perhaps, though I’m not sure they all used quite the same way.

    In Memcached and Dynamo, primary keys are hashed entity names, so you decide on the primary key and use it to create, read or update the record. They’re equivalent to URLs you can GET/PUT, or path names you can store a file at. Not surprising, they API is simply get(id) and put(id, record).

    Given a user name, you can calculate the primary key and retrieve the record directly. You can have multiple keys as well: create a record identified by e-mail with value being the user’s primary key. That’s a two-step lookup, but it’s exactly what a database does to retrieve a record from its index.

    They’re not easy to order, though, there’s no easy way to get all the A-B users. (Not sure about Dynamo, but Memcaches and MySQL hash tables are certainly not ordered). For Memcached that’s not a problem, you don’t ask Memcached for all the A-B users, unless you’re asking for cached results of a previous query, which you ran against the database. Memcached as front-end to the database is still relational all the way down.

    It’s actually possible with certain restrictions to run a lot of applications without ordered indexes. Take my blog for example. Every post needs one primary key (date/slug) which happens to be its URL (or most of it). Separately, I can have a record keyed by year/month that I update (last only) every time I publish a post. With that, given the size of records, I can easily pull all the posts for a month, a year, or even ten posts at a time for paged view. That’s indexing at the application, but it’s not particularly complicated or expensive to implement.

    But how would that work if I’m handling massive amount of records, and I need all those keyed between yesterday 11:45 and today 12:30? There’s something to be said in defense of sparse ordered indexes.

    Quite possibly this can be handled by the application. And to clarify, I’m referring to the application tier here, where you run the code, not the code you have to write. This will be abstracted away like the ORM and database driver you use, not reinvent in every single project. It’s a simple application of map/reduce, and that gets a lot of people excited, but cool and practical don’t always coincide. Map/reduce applied liberally may end up being the next EJB.

    [Correction: As Jan commented below, this is no longer true]

    CouchDB by contrast has a primary key that is a hashed record identifier auto generated by the database. It’s a location, but you don’t name it, so you need a priori knowledge to look it up, typically resolved from a different record (relation). And that record may be an index, which is not the same thing as name-value pair, not when the indexing is done for you, and ordered for free.

    (Yes, I’m assuming those indexes are nicely ordered, no, I did not check that for a fact)

    So I’m not convinced both models are the same, and right now I’m going to err on the auto-generated record identifier and ordered indexes, which is more RDBMS than Memcached.

    1. Oct 3rd, 2007

      Amazon reveals its secret key-data overlords from the planet Cloud - Laughing Meme

      [...] Meanwhile Assaf argues well that not all keys are created equal [...]

    2. Oct 4th, 2007

      Jason

      You may be interested in PHT’s, which are one way to handle range queries in a DHT like memcached. One nice property is they’re actually doubly log since you can jump directly into the middle of the search tree rather than walking down from the root node.

      Ultimately I believe the answer is finding appropriate abstractions to deal with these systems in a better way. Currently I’m enamored by how high performance de-normalized databases resemble cached forward chaining.

    3. Oct 9th, 2007

      Julien Couvreur

      This is an interesting analysis. The general question is what are the kinds of applications which you can build on top of a DHT?
      The fact that Amazon implemented a shopping cart shows that they have found a way to use the hashtable to store some order information. Maybe the key is the userID and the value is an XML blob listing the itemIDs?
      If so, the same approach may work for implementing a blog, or at least a blog which does not have thousands of posts ;-)

    4. Oct 18th, 2007

      Jan

      You can provide your own ids in CouchDB instead of using the auto-generation. :-)

      Cheers,
      Jan

    5. Oct 25th, 2007

      Assaf

      Thanks, I added a correction.

    Your comment, here ⇓

    Or using OpenID