What do dynamo, memcached, Berkley DB, and couchdb have in common with each other, and in many ways with other structures like my hard drive or your mail or the www? Namely that everything is accessed by a primary key, and that metadata is either attached to, or embedded within, that data.
Perhaps, though I’m not sure they all used quite the same way.
In Memcached and Dynamo, primary keys are hashed entity names, so you decide on the primary key and use it to create, read or update the record. They’re equivalent to URLs you can GET/PUT, or path names you can store a file at. Not surprising, they API is simply get(id) and put(id, record).
Given a user name, you can calculate the primary key and retrieve the record directly. You can have multiple keys as well: create a record identified by e-mail with value being the user’s primary key. That’s a two-step lookup, but it’s exactly what a database does to retrieve a record from its index.
They’re not easy to order, though, there’s no easy way to get all the A-B users. (Not sure about Dynamo, but Memcaches and MySQL hash tables are certainly not ordered). For Memcached that’s not a problem, you don’t ask Memcached for all the A-B users, unless you’re asking for cached results of a previous query, which you ran against the database. Memcached as front-end to the database is still relational all the way down.
It’s actually possible with certain restrictions to run a lot of applications without ordered indexes. Take my blog for example. Every post needs one primary key (date/slug) which happens to be its URL (or most of it). Separately, I can have a record keyed by year/month that I update (last only) every time I publish a post. With that, given the size of records, I can easily pull all the posts for a month, a year, or even ten posts at a time for paged view. That’s indexing at the application, but it’s not particularly complicated or expensive to implement.
But how would that work if I’m handling massive amount of records, and I need all those keyed between yesterday 11:45 and today 12:30? There’s something to be said in defense of sparse ordered indexes.
Quite possibly this can be handled by the application. And to clarify, I’m referring to the application tier here, where you run the code, not the code you have to write. This will be abstracted away like the ORM and database driver you use, not reinvent in every single project. It’s a simple application of map/reduce, and that gets a lot of people excited, but cool and practical don’t always coincide. Map/reduce applied liberally may end up being the next EJB.
[Correction: As Jan commented below, this is no longer true]
CouchDB by contrast has a primary key that is a hashed record identifier auto generated by the database. It’s a location, but you don’t name it, so you need a priori knowledge to look it up, typically resolved from a different record (relation). And that record may be an index, which is not the same thing as name-value pair, not when the indexing is done for you, and ordered for free.
(Yes, I’m assuming those indexes are nicely ordered, no, I did not check that for a fact)
So I’m not convinced both models are the same, and right now I’m going to err on the auto-generated record identifier and ordered indexes, which is more RDBMS than Memcached.