1. WideFinder, Ruby Pic, and Scaling Up, Out and Away

    September 30th, 2007

    Scaling Up

    I’m starting with the sequential version of WideFinder. I have two implementations of Ruby, so this will be interesting.

    Ruby 1.8.6 pegs the CPU at 100% and completes the sample in 7s. JRuby 1.0.1 goes all the way to 110% CPU utilization, but that’s reported against the JVM; I’m assuming 100% is taken by the processing and the other 10% by overhead, like the garbage collector. Unfortunately, I have to wait a patient 50s for the results.

    The parallelized version fares worse on CRuby. CRuby doesn’t scale out, so still only 100% CPU utilization, plus the overhead of parallelizing on a single thread. On the other hand, JRuby has fun all the way to 150~160% CPU time.

    On Linux each core is considered a separate CPU, so the theoretical upper limit for a dual-core CPU is 200%. In practice, there are other things running in the background, and the best I’ve seen so far is 180%~190%.

    I’m not sure why JRuby can’t get there. It does in fact stretch its legs all the way to 180%~190% as reported for Java in the first few seconds, while the JIT kicks in. It might in fact be the JIT working the 20% shift. Past the first few seconds, it slows down to a constant rate of 110% (sequential) or 150~160% (parallel).

    Perhaps 160% is the upper limit on reading the log file?

    Scaling Out

    This time I’m turning WideFinder into an N>1 problem, where each “machine” has a log file to process, and one process that drives them all and collects the results. I’m quoting “machine”, since I’m still running this benchmark on a single machine, but now that I’m only coordinating as opposed to feeding them all a single log file, I can scale out easily.

    The sample size is different, so don’t compare processing times with the previous run.

    The first run is single threaded, one log file at a time, single core (CRuby) pegged at 100% and takes 1m28s to complete.

    The second run is parallelized, each process handles a different log file (ten in total), still single core pegged at 100%, completes in 1m30s. 2 seconds difference from pretending to parallelize.

    The third run forks a process to handle each log file, stretching it all the way up to 150-160% and finishing in 46s.

    Interesting because now I’m keeping both cores busy and completing the workload in half the time, even though it’s barely pushing 160%. I’m not exactly sure how to explain this result, other than forking multiple processes makes magic happen.
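
    For reference, the fork-per-file approach from the third run can be sketched roughly as follows. This is an assumption about its general shape, not the code used for the benchmark: count_file and count_in_forks are hypothetical names, the pattern is borrowed from the WideFinder counting code, and each child hands its counts back to the parent over a pipe.

```ruby
# Sketch only: fork one child per log file and merge the partial counts.
# count_file and count_in_forks are hypothetical names, not the benchmark code.

PATTERN = %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }

# Count matching article hits in a single log file.
def count_file(path)
  counts = Hash.new(0)
  File.foreach(path) do |line|
    counts[$1] += 1 if line =~ PATTERN
  end
  counts
end

# Fork a child for each path; each child serializes its counts into a pipe.
def count_in_forks(paths)
  children = paths.map do |path|
    reader, writer = IO.pipe
    pid = fork do
      reader.close
      writer.write Marshal.dump(count_file(path))
      writer.close
      exit! 0                       # skip at_exit handlers in the child
    end
    writer.close                    # parent keeps only the read end
    [pid, reader]
  end

  # Read every child's result, reap it, and merge into one hash.
  children.inject({}) do |total, (pid, reader)|
    partial = Marshal.load(reader.read)
    reader.close
    Process.wait(pid)
    total.merge(partial) { |_key, a, b| a + b }
  end
end
```

    Because each child is a separate operating system process, the kernel is free to schedule them on different cores, which a single CRuby process does not get from its own threads.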

    Scaling Away

    Erlang has the right concurrency model, too bad about the language.

    I intentionally did not include the Erlang WideFinder in my tests. I’ll just go by the reports that it’s slow, because frankly I don’t care. In the overall scheme of things, you spend more effort scaling a room full of servers that generate logs than programs that process these logs. I prefer to focus on those servers first, and from what I hear Erlang is up to that task, so the WideFinder performance can be forgiven.

    The point of the exercise was to illustrate the concurrency model, and how you can retrofit it into a variety of languages. When scaling out, it’s as simple as programming against a library. For low-level tasks, it clearly requires the performance and simplicity that comes from changing the underlying VM. And this can be done for Java or Ruby or Python. Though I doubt we’ll see it happen in Java or Ruby, much for the same reason I doubt Erlang will hit a tipping point.

    I think the problem is rooted in how developers react to concurrency. When faced with uncertainty, we resort to hierarchy and structure.

    Concurrency is fundamentally a problem of uncertainty. Order is undetermined. Some people embrace that by building systems that thrive on concurrency, and once you start, a world of opportunities opens up. Have a look at Erlang and Pict. But most developers race (no pun intended) to solve it with structural programming. Threads, forks, synchronization blocks, atomic transactions, parallel FORtran loops. Those are all examples where we scope race conditions and face the consequences, only to avoid thinking in terms of message passing.

    I’m waiting for the day when “Synchronization blocks considered harmful” will become the hot topic of the day, and we’ll learn to do without method signatures and without waiting for a return value. Today, though, we’re still forced to do asynchronous, but prefer to think synchronous.

  2. Rounded Corners - 154 (Simon says)

    September 30th, 2007

    If you build it, will they come? This reads much like conclusions in search of data points, but the data points are interesting:

    Another way to measure the impact of web apps is to ask how much time people spend using them. People who use web applications say that they spend about 22% of their total computing time doing so. That amounts to about 40% of the total time they spend with applications of any sort.

    If you use this, they will come. You won’t read about them on TechCrunch or Mashable, but don’t be fooled. They just don’t need more exposure. They reinvented Web 2.0 before we knew the Web comes in versions. They taught Apple how to design UIs, Google how to scale, and Ruby how to keep it simple. So when Oracle tells you to dump AJAX and use JSF instead, you better listen!

    Just for fun. Chalain (via Reg):

    If Java is the answer, it must have been a really verbose question

    Simon says: don’t. Here’s one Simon says you should learn to play well: don’t make your users play Simon says.

    Products of dreams. This ranges between I’m scared to I want!

    Above, all geek and no LOLcat make for a very dull post.

  3. Pi-cture this: Pi-calculus, Ruby and WideFinder

    September 28th, 2007

    Tim Bray started the WideFinder meme spreading, so I’m going to bite.

    But I’m going to talk about something different. Not a language shoot-out or micro benchmark. I’ll let others talk about that. I’m going to use the WideFinder as an example to show you some principles of concurrent processes and how you can use them in your code. I’ll use Ruby here, link to a Java implementation at the end.

    With all the recent interest in coreness and RESTful services, it might be the right time to bring back a favorite subject of mine.

    Process Calculus

    This is something I got involved with around the turn of the century. I like how that sounds, turn of the century, all credible and awe-inspiring, don’t you think? Sad part is, it took me a decade before I found out about pi-calculus, and only while trailing down the wrong path. I started with and recommend reading A Calculus of Mobile Processes (Milner, Parrow and Walker, 1989).

    The essential features of process calculus, courtesy of the Wikipedia entry:

    • Representing interactions between independent processes as communication (message-passing), rather than as the modification of shared variables
    • Describing processes and systems using a small collection of primitives, and operators for combining those primitives
    • Defining algebraic laws for the process operators, which allow process expressions to be manipulated using equational reasoning

    Process calculus is to concurrent systems what relational algebra is to databases. And just for saying that, half of you moved on to the next post, and the rest, I can hear you snoring. But hang on, this is seriously cool stuff, especially if you like the science part of computers and like to learn new skills. I’ll do my part by showing more code than concepts.

    In process calculus, we describe everything as concurrent processes interacting through message passing, with the ability to pass channels in messages. We can use that to describe protocols from TCP through HTTP all the way to a Web of services talking to each other. We can also use that to describe low-level constructs like threads, locks, semaphores and such. Also stuff that doesn’t happen in parallel.

    These processes are not distributed services. They’re not operating system processes. They’re not threads. They’re a conceptual way we can describe any and all of these. So the processes I’m going to show you here deal with writing lock-free asynchronous code and letting go of shared state. It doesn’t prescribe a particular way for distributing the workload.

    Why Should I Care?

    I find it fascinating, but I guess you need something a bit more concrete.

    When you’re building concurrent systems, you can work hard or work smart. Threads, locks and other synchronization constructs help you work hard. Smart is avoiding any shared state. Think about it as functional programming for concurrency. And it’s a great way to pull off asynchronous processing and anything you need to scale larger than a method call.

    So that’s one reason.

    It will also help you build distributed systems and deal with distributed fallacies. Those are all easier when you’re using messages to transfer state. Substitute process for resource and you get REST. And you know why that one is important. So there.

    Pi-calculus Cliff Notes

    Pi-calculus was my introduction into process calculus, so I’m going to show it briefly before moving on to code. I do recommend you spend some time reading about it in detail. My apologies to math geeks worldwide, though pi-calculus uses mathematical notation, I’ll approximate it using ASCII.

    The smallest process you can have sends a message on a channel, receives a message from a channel, does something we don’t care to describe (call it, magic) or nothing at all:

    P = x!y   #  Send value y on channel x
    Q = x?y   #  Receive on channel x, store in y
    0         #  Nothing

    The simplest composition is just a sequence of processes, separated with dots:

    P = x?y
    Q = y!z
    S = P.Q

    Therefore:

    S = x?y.y!z

    Interesting? Not so much. So let’s add the parallel composition:

    P = {x} Q | R
    Q = x!y
    R = x?y

    The process P consists of two other processes and a new channel known to both (think of it as a scoped variable). Process Q sends a message on that channel and then reduces to nothing (0). Process R receives that message and reduces to nothing, at which point process P reduces to … I’ll let you guess. Think of reduction as single-stepping through your code.
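
    To make the parallel composition concrete, here is a minimal sketch that models P = {x} Q | R with Ruby threads and a queue standing in for the channel. It is only an illustration of the notation: run_p is a hypothetical name, and this is not how the pic library works.

```ruby
require "thread"

# Sketch only: P = {x} Q | R, with a Queue standing in for channel x.
# run_p is a hypothetical name used for illustration.
def run_p(value)
  x = Queue.new                      # {x}: a new channel, scoped to P
  q = Thread.new { x.push(value) }   # Q = x!y, then reduces to nothing
  r = Thread.new { x.pop }           # R = x?y, then reduces to nothing
  q.join
  r.value                            # the value R received on x
end
```

    Neither thread touches a shared variable; the only thing they have in common is the channel.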

    We’re almost there, but the world is non-deterministic, so we also need to consider those cases:

    P = x?y.Q + z?y.R

    This just means process P either receives a message on channel x and reduces to Q, or receives a message on channel z and reduces to R. Not both. So now we have conditions.

    (Wondering what happened to good old if? It will take a few minutes to think this through, Google may help, but there is a way to express if with these constructs alone)

    But we still can’t reduce more than once, so let’s introduce the bang:

    !P = !P | P

    The bang just says that we can reduce the process any number of times, once for each message. And that we can use to implement a Web server, 8080?req, and to handle recursion (and therefore loops). Here’s an example for an infinite loop:

    P = x?y.x!y
    Q = {x} !P | x!0

    So with inputs, outputs, conditions and recursion we can express all sort of functions (the lambda part) but also describe concurrency and distributed systems.

    Code Time

    So let’s see what this looks like in code. Obviously we’re working higher level than pi-calculus, we have things like variables, functions, loops, ifs, etc. We only use the process model for concurrent work.

    We can start by writing a simple DSL. I did. One of my first experiences meta-programming in Ruby ended up as a miserable failure and an important lesson: abstractions are good when they help you express more, not when they limit what you can express. Here, an API will do just fine.

    Sending a message to a process:

    process.send *args

    Receiving a message:

    message = receive

    This is always called inside the process.

    We need conditional receives. In pi-calculus we keep the conceptual model simple, but end up with a boatload of channels. In the real world we don’t want to be in the business of tracking all these channels, so we’ll multiplex different messages on the same communication channel (one per process) and use pattern matching to tell them apart:

    receive do |match|
      match.when :foo do |args|
        . . .
      end
      match.when :bar do |args|
        . . .
      end
    end
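
    One way to picture the multiplexing is a single mailbox per process plus a small matcher object. The sketch below is an assumption about how such an API could be built, not the pic library’s implementation: Mailbox, Matcher, send_message and handler_for are hypothetical names (send_message avoids overriding Ruby’s own Object#send), and unmatched messages are simply dropped to keep it short.

```ruby
require "thread"

# Sketch only: hypothetical classes, not the pic library.
class Matcher
  def initialize
    @handlers = {}
  end

  # Register a handler for messages tagged with the given symbol.
  def when(tag, &block)
    @handlers[tag] = block
  end

  def handler_for(tag)
    @handlers[tag]
  end
end

class Mailbox
  def initialize
    @queue = Queue.new
  end

  # Deliver a tagged message; returns self so calls can be chained.
  def send_message(tag, *args)
    @queue.push([tag, *args])
    self
  end

  # Block until a message matches one of the registered handlers,
  # then return whatever that handler returns.
  def receive
    matcher = Matcher.new
    yield matcher
    loop do
      tag, *args = @queue.pop
      handler = matcher.handler_for(tag)
      return handler.call(*args) if handler
      # Simplification: a message nobody asked for is dropped.
    end
  end
end
```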

    Parallel work is easy. Since fork is already in use, we’ll call it spawn instead:

    foo = spawn { ... }
    spawn { ..., foo, ... }

    There’s no bang. We don’t need it when we can just spawn, loop and recurse. (Although, when I implemented a bang method I ran into scoping issues when using it and just defaulted to spawn; perhaps there’s a better way to bang)

    Speaking of recursion, have you heard of the stack limit? Let’s get around that with stack-less recursion:

      tail { ... }
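
    A common way to get stack-less recursion in Ruby is a trampoline: the recursive call is wrapped in a block and returned instead of being invoked, and a driver loop keeps calling blocks until a real value comes back. The sketch below is an assumption about the technique, not the library’s actual tail; TailCall, trampoline and sum_to are hypothetical names.

```ruby
# Sketch only: a hypothetical trampoline, not the pic library's tail.
class TailCall
  attr_reader :thunk

  def initialize(thunk)
    @thunk = thunk
  end
end

# Wrap the recursive call instead of making it.
def tail(&block)
  TailCall.new(block)
end

# Keep unwrapping deferred calls until a plain value comes back.
def trampoline(result)
  result = result.thunk.call while result.is_a?(TailCall)
  result
end

# Example: sums 1..n without growing the stack.
def sum_to(n, acc = 0)
  return acc if n.zero?
  tail { sum_to(n - 1, acc + n) }
end
```

    With this shape, trampoline(sum_to(100_000)) runs in constant stack depth, because each step returns to the driver loop before the next one starts.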

    Now we’re ready, let’s use all of these to write a WideFinder.

    WideFinder In Ruby Flavored With Pi-C

    First thing I’m going to do is pull out counting and reporting into separate functions. I’m using the same code from Tim’s Ruby WideFinder, so we can get that out of the way and move to the more interesting parts:

    # I map from lines to hash.
    def count(lines)
      lines.inject(Hash.new(0)) do |counts, line|
        if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
          counts[$1] += 1
        end
        counts
      end
    end
    
    # I just output the top-ten results.
    def report(counts)
      puts "Results are in"
      keys = counts.keys.sort { |a, b| counts[b] <=> counts[a] }
      keys[0 .. 9].each do |key|
        puts "#{counts[key]}: #{key}"
      end
    end

    Next we’re going to write a process that reads the log line by line and splits it into chunks, then spawns a process to count lines in each chunk, so all the chunks counting happens in parallel.

    There are so many ways to write that. I decided to pick one that has no shared state or synchronization blocks. But we do need to wait for all counters and collect the results, so we’ll add another process for dealing with that. One maps, the other reduces.

    Here’s a spawn that starts a process (everything in the block) to count a chunk of lines and send the results to the collecting process:

    spawn { collector.send :result, count(lines) }

    I’ll do one better and insist on the process not holding any mutable state, only using messages to change from one state to the other. That’s how you build concurrent systems.

    So no Object, global variables, and all local variables are immutable. There’s more overhead, but I’m not shooting for maximum performance, I’m trying to show you how to think asynchronously.

    So without variables that can change state, we’ll use the oldest trick in the book and recurse. Since the Ruby stack can only take so much abuse, we’ll optimize tail recursion:

    tail { split(collector, limit, source, lines + [source.readline], sets) }

    The process looks like this:

    def split(collector, limit, source, lines = [], sets = 0)
      if lines.size >= limit
        # Over the limit, spawn a new process to count the lines and pass the
        # results back to collector.  Then repeat for a new set of lines.
        spawn { collector.send :result, count(lines) }
        tail { split(collector, limit, source, [], sets + 1) }
      elsif source.eof?
        # Send whatever lines we collected so far to collector.  Also tell collector
        # how many sets of results we have.
        spawn { collector.send :result, count(lines) }
        collector.send :expecting, sets + 1
      else
        # Collect the new line, repeat.
        tail { split(collector, limit, source, lines + [source.readline], sets) }
      end
    end

    The collector receives all the results, groups them together and prints out the report. It follows the same process-functional style.

    Since messages may arrive in any order, we can’t tell the collector when we’re done. Instead, we tell it how many result sets to expect and let it figure things out. Here’s what it looks like:

    def collect(counts = {}, sets = 0, expecting = nil)
      if expecting == sets
        # All sets are in, report and return.
        report counts
      else
        receive do |match|
          match.when :result do |_, result|
            # Results from counter, combine with what we already have, and loop back.
            counts = result.keys.inject(counts) { |h,k| h.merge!(k=>(h[k] || 0) + result[k]) }
            tail { collect(counts, sets + 1, expecting) }
          end
    
          match.when :expecting do |_, expecting|
            # If we know how many sets there are, we know when to end.
            tail { collect(counts, sets, expecting) }
          end
        end
      end
    end
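
    The merge step in the collector updates the counts hash in place with merge!. If you want to stay strictly within the no-mutable-state rule, a non-mutating variant is possible with Hash#merge and a block that resolves duplicate keys. This is a sketch of an alternative, not the code from the example; combine is a hypothetical name.

```ruby
# Sketch only: combine two count hashes without modifying either one.
# combine is a hypothetical helper name.
def combine(counts, result)
  counts.merge(result) { |_key, so_far, extra| so_far + extra }
end
```

    The block is only called for keys present in both hashes, so new keys are copied over as they are and existing ones are summed.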

    Now we need to tie it all together:

    collector = spawn { collect }
    spawn { split(collector, 10000, ARGF) }

    And run:

    ruby wf.rb o10k.ap
    => Results are in
    42: 2006/09/29/Dynamic-IDE
    8: 2006/07/28/Open-Data
    3: 2003/07/25/NotGaming
    3: 2004/04/27/RSSticker
    2: 2003/09/18/NXML
    2: 2004/10/01/AutumnLeaves
    2: 2006/09/07/JRuby-guys
    2: 2004/02/27/RSS-Unreal
    2: 2003/04/10/Concorde
    2: 2005/12/29/Selling-Art

    Misc

    The entire example is here, and the pic library here.

    If Ruby is not your deal and you’d much rather use Java, have a look at Jacob. It’s a much larger framework, so a bit harder to use, but as a bonus it can persist state in the database and pull other cool tricks.

  4. Rounded Corners - 153 (Divide by zero)

    September 26th, 2007

    Debug this. Apparently there’s a gender divide related to debugging:

    A couple of years ago, they stumbled upon an intriguing tidbit: Men, it seemed, were more likely than women to use advanced software features, specifically ones that help users find and fix errors. Programmers call this “debugging,” and it’s a crucial step in building programs that work.

    Other data points from ComputerWorld.

    Test-driven living. Personal unit tests:

    Basically, I’ve made a list of personal unit tests: assertions about myself that I’d like to be true.

    Metaphorically speaking. Instead of deciding which software development methodology is the ultimate hammer for your production nails, try picking a metaphor and running with it:

    There are lots of discussions on Software Development “methodologies”. They are codified sets of practices on how to do software development “correctly”. When you get down to the root of each of these methodologies, the root of all of them is a metaphor.

    The real deal. Using protect_from_forgery (originally CsrfKiller) to avoid CSRF attacks in Rails 2.0.

    Special place in hell. I was just going through the SalesForce API, I heard it’s CRUD-over-SOAP, so I went looking for anything resembling conditional GET or PUT. After all, tunneling HTTP over SOAP over HTTP has been done before, and conditional PUTs are great for read consistency, which the rest of the API does quite well.

    I didn’t find what I was looking for. But I did find this:

    UserTerritoryDeleteHeader: Specify a user to whom open opportunities are assigned when the current owner is removed from a territory. If this header is not used or the value of its element is null, the opportunities are transferred to the forecast manager in the territory above, if one exists. If one does not exist, the user being removed from the territory keeps the opportunities.

    Yes. This is a SOAP header. It goes in the header part of the envelope. But it only affects certain payloads. By causing side effects. At the application level.

    Above, the zero way out.

  5. Rounded Corners - 152 (Rails heavy)

    September 23rd, 2007

    Keep ‘em separated. Jonathan Weiss points out that browsers don’t do REST, and sometimes it’s best to separate different interactions across separate controllers:

    Sometimes the responds_to stuff makes sense. If you mostly expose data, then offering JSON and XML besides XHTML makes sense if by doing this you save your clients the transformation or expose the data to more clients that you could not reach before. But if you have a Web Application with a lot of interaction then coupling the Desktop Browser version with the iPhone version, the mobile version, the JSON version, and the XML version does not make any sense to me.

    Burning calories. Krishna Kotecha has an interesting take on vendorism, in the context of the Rails 2.0 release:

    This all ensures that their 3rd party developers are kept focussed on the vendor’s wares and technology, and not distracted by things such as evaluating other approaches and possibly better techniques. Mastered COM? All change for .Net. Did we say Windows Forms? We meant Avalon WPF (see Joel Spolsky’s take on this in ‘How Microsoft Lost the API War’). Java seems a bit too simple? Here you go guys, sink your teeth into this plethora of J2EE specs - at the end of all of that you might decide POJO (Plain Old Java Objects) is the way to go anyway.

    Deja vu. Also, Krishna’s take on Sun’s strategy: Engage and Contain.

    I remember back in the days there was, what did we call it back then, embrace and extend? Not sure of the details, but I think Microsoft was being accused. Was it Sun doing the accusation? Details are sketchy, but I think something about a new platform for building and delivering applications.

    Authorize away. First draft of OAuth 1.0 is out, Chris Messina explains what it’s all about:

    OAuth takes that approach one step further and extracts the best practices from the popular authentication systems I mentioned above and turns it into one elegant, unified authentication protocol that anyone can implement.

    Friendly reminder. The language you know is 10x better than the language that’s hip. Derek Sivers on switching back from Rails to PHP.

    Picture: net without the neutrality.

  6. Rounded Corners - 151 (Life is messy)

    September 21st, 2007

    The other side of Yahoo. Some of their stuff rocks (Flickr), some not so much (Mash and the recent spam). But kudos on its open source efforts. Doug Cutting on Hadoop. And I’m hearing there are more killer features down the line.

    No setup necessary. James Newkirk explains why you shouldn’t use SetUp and TearDown in NUnit:

    The first and primary complaint is that when I am reading each test I have to glance up to BeforeTest() to see the values that are being used in the test. Worse yet if there was a TearDown method I would need to look in 3 methods. The second issue is that BeforeTest() initializes member variables for all 3 tests which complicates BeforeTest() and makes it violate the single responsibility pattern.

    I don’t have any clear-cut decision, but I’m also experiencing the same dislike for prepping the test in the setup. I’m starting to think setup is only about the environment.

    Life is messy, deal. Bill de hÓra on the search for Java synchronize blocks on Amazon’s S3 service (I put my bet on SETI):

    The way I look at it is that compensating action or dealing with out of order events is kind of like dealing with domain business “logic”, insofar as there is that real world messiness to contend with.

    script/generate undo. I’d like to see more of that.

    So how much is zero, exactly? Handy guide to calculate how much trans fat you can fit in food products labeled with “0g Trans Fat”.

    Above, Dave Astels has a “Microsoft moment”. 

  7. Read Consistency: Dumb Databases, Smart Services

    September 20th, 2007

    Write Consistency

    This is quite a common scenario. You have a shared resource, and a lot of applications tapping into that shared resource. The shared resource holds state, the applications read and change that state. So far nothing particularly fancy. What things would you worry about?

    Off the top of my head:

    • Access control – Exerting control over who can modify and see what.
    • Concurrency control – Prevent race conditions from messing things up.
    • Types – Make sure everyone has a unified interface into the state.
    • Integrity checks – Don’t let one application ruin the day for other applications.

    And one way to deal with all of this is by enforcing write consistency.

    First, let’s define write. I’d like to talk about the application perspective, and what it wants to do. It might be updating a single record, or a bulk update. It might create one record and update another, and maybe these two are controlled by a single transaction. Either way, there’s some point at which the application says “these are the changes I’m making, please write them down.”

    Write consistency is about reaching a consistent state at the end of the write. The simplest definition I’m going to offer is this: a write followed by a read returns the same state. This is of course the general case, we can make for some allowances, for example, the case where you write less data than you read: default values, auto generated keys, etc. But conceptually those are all the same, at the end of the write you can predict what will come out of a subsequent read.

    So that’s write consistency. What happens if I don’t have write consistency? I might be writing ten new records, follow up with a read, and only retrieve five of them; it takes a second read to find the other five. I can say that over some period of time I reached a consistent state, but it didn’t happen in the write itself.

    Read Consistency

    There’s a part of me that wakes up in the morning and wishes every application I develop will only ever have to deal with write consistency. Imagine if all your Web resources had cascading deletes: no more broken links! Imagine search engines updating their indexes as the content gets updated.

    Then there’s the other part of me that’s looking forward to building new applications that exist only because of situations that have no write consistency. Again, like the Web.

    A while back I got frustrated with the limited vocabulary I had to describe these kinds of scenarios. What we call a lack of ‘framework for reasoning’. Whatever design decisions I would make were very opportunistic, with no coherent set of practices I can take from one project to another. Worse, it involved a lot of hand waving and broad generalizations of the like, “When you’re using Google …,” or “The RESTful way …”.

    So I decided to create two buckets. I looked at the overall characteristics, and found one bucket that deals with write consistency, and another that deals with read consistency. Those are not opposite or mutually exclusive, and there’s also a third bucket for everything else, but I’m not concerned with that. Here I’m only going to talk about things that fall in one of these two buckets.

    So what is read consistency? Read consistency is being able to construct a consistent state by reading it. And I use it as an umbrella term to describe all that makes it possible: resolving references across resources, specific error codes, version numbers, etc. Even the way we would do updates in a read consistent environment: no locks, conflict detection, compensation, etc.

    By way of illustration, I’m going to use read consistency to perform a search and return a list of records, but only after I prune the search results to discard dead-ends (404’s in HTTP parlance). I don’t have a write consistency guarantee that deleting a record immediately removes it from the index, or that updating a record immediately reflects it in the index. But I can easily create a consistent state by only retrieving existing records. And if I want to be more fancy, I can even match the record’s version number against the index, or do subsequent filtering by running the search again on every record I find.
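
    The pruning step can be sketched in a few lines. Everything here is an illustrative assumption: Record, IndexEntry and consistent_search are hypothetical names, the store is a plain hash keyed by record id, and a hit whose version no longer matches the stored record is simply dropped (re-running the search on the fresh record would be another valid policy).

```ruby
# Sketch only: hypothetical types and names, in-memory data.
Record = Struct.new(:id, :version, :body)
IndexEntry = Struct.new(:id, :version)

# Build a consistent result set at read time from a possibly stale index.
def consistent_search(index_hits, store)
  index_hits.each_with_object([]) do |hit, found|
    record = store[hit.id]
    next if record.nil?                        # dead end, the 404 case
    next unless record.version == hit.version  # index lags behind this record
    found << record
  end
end
```

    The index is allowed to be wrong; the reader is responsible for turning what it returns into a consistent view.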

    You might be wondering why I would need a new term to describe something that’s quite common, something we intuitively do every time we use a search engine. This all goes back to the framework for reasoning, or being able to explain what I mean when I talk about ___.

    Let’s look at this from another point of view. Would I use a search engine that returns links to non-existing records? My initial instinct says “no, why would I care to search for something that doesn’t exist?”. But I just talked about that use case in detail. Am I contradicting myself? I really don’t care for a search engine that never returns useful results. But I do care for a search engine that returns relevant results, even if there’s a time lag between updates to the collection of records and the search index. I’m interested in it, because it’s merely a consistency problem that I can resolve at reading time.

    Enough With The Search Engines

    That’s my initial reaction whenever I read about the mythical search engine use case. I’m thinking two things: content and search. But in practice, I don’t have a big content problem to solve, most often I get to deal with structured data, and what works well for one doesn’t work as well for the other. And search is interesting, but hardly a top priority. In most applications it’s a second or third category feature.

    So why am I talking about search? Because of all the examples I came up with, this seemed like the most neutral case, one that’s void of any ‘we do it this way!’ complications that happen when you start talking about orders or financing or customers. Think of it as a Lorem Ipsum use case for data management.

    It also raises two important points.

    I’m personally sick of structured search forms. I don’t care that the database has ten data fields you can query on, I want one search bar, just like a search engine. Doing smart search on structured data is a simple problem to solve, have a look at Lucene and how people are using it. So the first point is that we need to look at application design beyond the narrow view of entity-relationship diagrams, and imagine things that are possible beyond Visual Basic forms.

    The second point is, I think most people intuitively understand how search works. You keep the data in one place, you keep the index in another, and the index is updated asynchronously to catch up with the data. You can do certain things to speed it up, like pinging the search engine, or mapping out your data. That’s the simplest, most intuitive example for read consistency. In fact, asynchronously updating indexes is a feature that read consistency databases offer, and we’ll get to talk about that too.

    Shared Resources

    Some database servers are designed for the sole purpose of storing data. But I think most of you are far more familiar, and more actively involved with using, database servers designed to solve a different type of problem. I’m talking about database servers designed to be used as shared resources.

    In a typical client-server environment — the one in which the relational databases of today spent their formative years — we have a lot of different applications hitting the same resource, sharing data through it. Storage is not good enough, we also need to handle the coordination problem. Coordination in this sense is making sure that you don’t have one application ruining the day for everyone else. And so our first instinct is to centralize by moving as much business logic as possible into the database itself. We let the database enforce compliance on its clients.

    What kind of business logic? We can enforce every order to contain a date field that must exist and must hold a datetime value. We can enforce every order to always reference an existing customer, so you can no longer delete the customer and keep the order hanging around, but we’ll make it easy to delete a customer and all their orders. Those are all declarative business rules.

    We also have imperative rules, which we implement using triggers and stored procedures. In some cases, we’re going to decide the rules are complicated enough that no application has privileges to update order records directly, but instead needs to use stored procedures. In other cases, we’ll just write the entire application inside the database, and use clients as dumb terminals. Performance is another coordination problem, and we can control that too using stored procedures and views.

    Web of Services

    In a loosely coupled environment, we’re going to find a different pattern. Here we have a variety of applications accessing a variety of services. I’m using these two terms, rather than client and server, because it’s easier to conceptualize how these roles play together. An application is something I have visibility into, the codebase I’m designing or working on, and services are black-boxes of functionality. Of course, what is application to me may be a service to you, and vice versa.

    Both applications and services need to store state, possibly using a database. But the important distinction here is that we’re pursuing a loosely coupled architecture, so we no longer share data directly through the database. Each service is independent, it exposes an interface that we’re going to use to retrieve and change state. We just downsized databases from the role of shared resources to mere storage engines.

    Where am I going to put my business logic now? I favor the application. Since the database is no longer a point of coordination, I don’t get as much benefit from moving my business logic into the database server. In fact, the last thing I want is the unpleasant reminder of the days we wrote COBOL programs and deployed them on mainframes. And that allegory is not by mistake, today’s database design traces all the way back.

    I’d much rather write validation logic inside my application where I can return errors like ‘555-XYZ is not a valid phone number’ rather than ‘SQL error 704: invalid column value’. It’s easier to develop, maintain, package and reuse, not to mention the variety of i18n/l10n libraries I can use. And by the same token, I’d much rather write any complex update, query or business rule I need using a modern day programming language.
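
    As a small illustration of validation living in the application, here is a sketch of a validator that produces the friendlier message. The phone format (three digits, a dash, four digits) and the function name are assumptions made for the example.

```python
import re

# Hypothetical format: three digits, a dash, four digits (e.g. 555-0123).
PHONE_PATTERN = re.compile(r"\d{3}-\d{4}")

def validate_phone(value):
    """Return None when the value is valid, or a message meant for the user."""
    if PHONE_PATTERN.fullmatch(value):
        return None
    return f"{value} is not a valid phone number"
```

    Because the message is built in application code, it can be passed through whatever i18n/l10n library the application already uses.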

    Scalability

    So let’s talk for a second about Google. I have nothing to say. I’m only bringing it up so we can get it out of the way, because it seems like a common and misguided knee-jerk reaction every time someone brings up scalability. It’s a false dichotomy that your scaling needs are either large, or non-existent. The problem I have is not being Google. The problem I have is not being Google.

    Me: Relational databases are hard to scale.
    DBA: Phfft. What do you know!
    Me: Well, my server is about to max out at 1TB.
    DBA: Piece of cake.
    DBA: Get a budget approved for a bigger server, once it ships, come back and we’ll schedule a migration.
    DBA: The last project that needed a bigger server got it done in a couple of weeks.
    Me: I already have one server, I don’t have budget to replace it for a bigger one.
    DBA: Then get another just like it.
    DBA: Here’s some material to explain the difference between read-only slaves and master-master, and how it will affect your code.
    DBA: Once it ships, come back so we can provision rack space, and then we’ll schedule a day to install and synchronize the two.
    Me: I see.
    Me: Hey, I’m just wondering.
    DBA: Yes?
    Me: Say I used Amazon S3 to store my data, and just hit 1TB. What would I need to do then?
    DBA: Hmm. Nothing. I guess?

    The other side of scalability is the unconscious design decisions we make when we’re conservative about what we store. Think of all the applications you didn’t know you could develop before you saw a demo of AJAX. Think how you designed Web apps back then, and how you design them today. The same thing is going to happen with your database.

    For me, this will enable new types of applications I can’t even imagine today. Right now, my design decisions are focused on limiting the data I store, and discarding it as quickly as possible. I’m being conservative with disk space and, unfortunately, also conservative with application features. That’s about to change.

    Smart Databases, Dumb Databases

    Smart databases, I’m borrowing the term from smart clients, do more than just handle data. They combine data with business logic. Some is declarative, some is imperative, and it all comes from the need to solve the coordination problem of a shared resource.

    Dumb databases, on the other hand, just store and retrieve data. They contain no business logic, zero, zilch, none. The only thing they can do is access and modify records efficiently.

    A lot of applications developed today use smart database servers in this minimal form. They shift all the imperative logic into the application, and replicate the declarative logic in both places. That’s a usage pattern, but that’s not a dumb database. A dumb database will not contain declarative logic, it won’t know what to make of it.

    If you take a database server and dumb it all the way down, you end up with a glorified file system, of which we have enough. The dumb databases I’m talking about have two other interesting characteristics that separate them from smart databases and file stores. They’re particularly good at dealing with read consistency. And they’re particularly good at delegating to the application in all manners relating to logic.

    So let’s look at what a read consistency dumb database looks like.

    Open Schemas

    Database schemas serve two purposes. One is to enforce rules on the structure, values and semantics of the data they store, for the benefit of the database and as a unified interface for all applications accessing the shared resource. My application already deals with that much better than any CREATE TABLE can do, and if you’re using an ORM or similar technology you already captured all that information inside your application as well. I don’t feel a particular need to replicate this logic in two places, nor joy at migrating schema changes.

    The second purpose is to increase the density of bald spots on my head. The original design traces back to the days when we stored years as double digits, and considered fixed-length CHAR fields a feature rather than a bug. Those days are gone. Although modern databases allow you more flexibility in the form of BLOBs, array fields and such, it’s clear that they really don’t like it that much and penalize you for doing so.

    So the first feature a dumb database has is no schema definitions. That part is delegated to the application.

    Versions and Generations

    Write consistency databases can scale out in one of two ways. You replicate the database, but you have to keep both replicas identical, so writes don’t scale, and you can only have as much storage as can fit on a single node. The other option is partitioning, which is read consistency on top of a write consistency database.

    A read consistency database scales with ease. Partitioning is something that happens, not something you have to work hard for, writes scale as well as reads, and if you run out of space, you just add another database. The price you pay for more data is the cost of space, think Amazon S3 if you need to visualize the economics of it. So if you can expand to fit all available space, what would you do?

    I’m going to quote the wise GMail: “Don’t delete, archive!” You just store and store and store. Of course you don’t have to turn into a pack rat and store data that will never be used, and you do need to delete stuff, retention policies and all that. But since you can afford the space, your default mode of operation would be ’store at will, lazy on delete’.

    So now you can start keeping versions around, the same way a Wiki retains all previous edits. Turns out being lazy with deletes solves the read consistency problems very well. You can use generational counters to retrieve a view of the database at a particular point in time. I call those generational counters, rather than versions, because they may span multiple records from different tables.
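
    A minimal sketch of the idea, assuming a toy in-memory store with one counter shared by every key (the class and method names are invented for the example). Writes only ever append, so reading “as of generation N” is a matter of ignoring anything newer.

```python
class GenerationalStore:
    """Toy in-memory store: writes append versions, nothing is overwritten."""

    def __init__(self):
        self.generation = 0
        self.history = {}  # key -> list of (generation, value), oldest first

    def put(self, key, value):
        self.generation += 1
        self.history.setdefault(key, []).append((self.generation, value))
        return self.generation

    def get(self, key, at=None):
        """Return the value of key as of generation `at` (default: latest)."""
        at = self.generation if at is None else at
        result = None
        for generation, value in self.history.get(key, []):
            if generation > at:
                break
            result = value
        return result

store = GenerationalStore()
store.put("orders/123", "pending")
snapshot = store.put("products/456", 4.99)
store.put("orders/123", "shipped")
store.put("products/456", 5.49)
```

    Reading both keys with at=snapshot gives a consistent picture of the order and the product at that moment, even though both records changed afterwards. That is why the counter spans records rather than belonging to any one of them.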

    Separately, we’re going to use versions to help clients cache data and perform conditional updates. This is a common enough pattern that we expect the database to handle it for us, on every single table, offering Last-Modified and ETag on the cheap.
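
    Here is a rough sketch of what a per-record version buys you, assuming a toy table where the version number plays the role of an ETag and the if_match argument plays the role of an If-Match header. All names are hypothetical.

```python
class VersionConflict(Exception):
    pass

class VersionedTable:
    def __init__(self):
        self.rows = {}  # key -> (version, value)

    def read(self, key):
        return self.rows[key]

    def write(self, key, value, if_match=None):
        """Write the value; with if_match, only if the version is unchanged."""
        current_version = self.rows.get(key, (0, None))[0]
        if if_match is not None and if_match != current_version:
            raise VersionConflict(f"{key} is at version {current_version}")
        self.rows[key] = (current_version + 1, value)
        return current_version + 1

table = VersionedTable()
table.write("orders/123", {"total": 15.99})
version, order = table.read("orders/123")

# The first conditional update wins and bumps the version.
new_version = table.write("orders/123", {"total": 20.98}, if_match=version)

# A second writer still holding the old version is turned away.
try:
    table.write("orders/123", {"total": 0}, if_match=version)
    conflict = False
except VersionConflict:
    conflict = True
```

    The same version value lets a client cache the record and skip re-reading it until the version changes.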

    Update Feeds

    Like I said before, I don’t have much love for triggers and stored procedures, I’d much rather use a proper programming language. So how do we get those to happen in the application?

    Going back to the big picture, we have applications hitting multiple services at the same time, and I personally don’t believe in network fallacies, so I’m going to care for latency by doing as much work as possible outside the request-response cycle. Request comes in, I do the minimum amount of work on the inputs, store the minimum amount of data, and quickly send back a 200 (OK), 201 (Created) or whatever other status fits the bill.

    First instinct would be to suggest a message queue. Not a bad idea, but let’s get something straight: it’s a design pattern. Design patterns are the way we work around limitations of the original design, by introducing boilerplate complexity. So let’s instead tackle it at the database level by introducing update feeds.

    We get two kinds of update feeds. Push feeds are callbacks to the application that inform it of individual writes (create, update, or delete). This is done asynchronously, so it does not extend the write or require any locking of resources. Still it’s blocking, so we’re going to reserve it for simple and priority updates, and one specific case that we’ll cover shortly. It is done at least once for each write, so we have a guarantee that a push feed will always see the most recent updates to the database.

    Pull feeds allow the application to grab all the recent updates and process them at once. Since the database keeps track of Last-Modified, it’s a simple matter to catch up on all the updates since the last pull. Like push feeds, we can determine all the recent create/update/delete writes on a table. The difference is that we are pulling, so we can perform longer units of work, or decide to pull at different intervals that depend on the workload.
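
    The two kinds of feeds can be sketched with a toy table that keeps an append-only event log. This is an illustration under assumptions: a sequence number stands in for Last-Modified, the push callback runs inline rather than asynchronously, and there is no retry for at-least-once delivery.

```python
class FeedTable:
    def __init__(self):
        self.rows = {}
        self.events = []       # append-only log of (sequence, action, key)
        self.subscribers = []  # push feed callbacks

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def _record(self, action, key):
        event = (len(self.events) + 1, action, key)
        self.events.append(event)
        for callback in self.subscribers:
            callback(event)  # a real database would do this off the write path

    def put(self, key, value):
        action = "update" if key in self.rows else "create"
        self.rows[key] = value
        self._record(action, key)

    def delete(self, key):
        del self.rows[key]
        self._record("delete", key)

    def pull(self, since=0):
        """Return every event with a sequence number greater than `since`."""
        return [event for event in self.events if event[0] > since]

table = FeedTable()
pushed = []
table.subscribe(pushed.append)

table.put("orders/1", "new")
table.put("orders/1", "paid")
first_batch = table.pull()
last_seen = first_batch[-1][0]

table.delete("orders/1")
second_batch = table.pull(since=last_seen)
```

    The puller decides when to catch up and how much to process in one go, while the push subscriber hears about each write individually. Note that the delete shows up as an event even though the record itself is gone.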

    Push and pull feeds are great for a variety of uses. One we talked about is minimizing the response time, performing the bulk of the work asynchronously, an architectural pattern supported by the database. Another is chaining updates together, which I’m exploring in the context of pushing updates to multiple services and handling complex transactions with compensation. I’ll talk about this at length in a future post.

    We can also use pull feeds to collect updates from existing tables and use those to populate computed tables. Computed tables are one way we can trade space for time, using more space to store duplicate data, but improving query time. There are enough uses for this in typical applications that do not cross over into the territory of OLAP (e.g. ranking, recommendations, social graphs). If the words map-reduce cross your mind, then you obviously know of one particular implementation for handling this type of workload.

    Keep in mind that, unlike queries returning records, update feeds return events. While you can use queries to retrieve records based on their timestamp, you need update feeds to determine when records are deleted. Besides the use cases we described above, you can also use update feeds for replication and for indexing records outside the database. Remember the mythical search engine scenario? Update feeds are an easy way to feed structured data into a search engine.

    Update feeds give us two important characteristics: a database that delegates all the logic to the application, and that acts as the primary place for storing the application state. It’s also a critical feature for handling indexes.

    The Relational Model

    I use the relational model principles to design, analyze and make predictions about the database. I use it to decide when to store data in 3rd normal form, and when to denormalize liberally. Yes, I denormalize data! Which is why I’m still thinking relational model, even though I’m talking about something other than a relational database. But if you are looking for a database that enforces 3rd normal form, and optimizes for tabular data, then you’ll be disappointed.

    So now example time:

    GET /orders/123
    
    <order>
      <item>
        <link>/products/456</link>
        <quantity>5</quantity>
      </item>
      <item>
        <link>/products/789</link>
        <quantity>1</quantity>
      </item>
      <total>15.99</total>
      <created-by>assaf.labnotes.org</created-by>
    </order>
    
    GET /products/456
    
    <product>
      <text>LOLcat picture frame</text>
      <price>4.99</price>
    </product>

    This is an elephant. Some people look at it and see XML data, some people look and see service calls, some people look and see relations. I think they’re all there.

    I brought up this example to illustrate several points. First of which, is that everything you know about data and relations still holds. In this example I wanted to illustrate how I can join data pulled over HTTP from a Web service. The other two points deal with the way we’re going to model our entities. Differently.

    In modeling the entities, I realized that orders and products are distinct with weak ties between them. They may in fact be offered by different services, or stored in different databases, so they might as well be in different tables. Goes without mention that I won’t even dream of duplicating product details inside the order, or listing orders inside a product record.

    But in modeling the order entity, I made two different decisions. The first, is to calculate the order total and store the computed result in the order itself. Was that a good idea or short sighted on my part? I won’t argue either way, but if you do have an opinion in the matter, preferably a strong one, then you’re using your relational model instincts to reason about read consistency databases. I just wanted to illustrate that all that we know is still useful.

    The other decision I made was to store line items inside the order itself. I realized I have no compelling use case to keep those separate. When I add or remove a line item, I’m changing the order, I expect the order to have a new version and updated timestamp. When I delete the order, I assume all the line items will go away. And when I query the order, I intend to find all the line items there, without resorting to Cartesian join and result-set gymnastics.

    So I designed the order entity from that perspective, and simplified the application logic. I also created three issues that we’ll talk about next: indexes, updates and conflicts.

    Asynchronous Indexing

    As the number of orders grows, I’m going to face a problem. How can I find all the orders related to a product without scanning through the entire orders table?

    If I used a relational database, I would break the line items and orders into separate tables. One reason is to allow fine grain updates into the order. Another is the constraint imposed on indexes: an index is derived by reducing a table row into a set of fields, ordering these fields, and sorting over the collection.

    Since I’m designing for a read consistency database that scales extremely well for writes, I’m not too concerned about the granular updates. I’d much rather optimize for reads (more frequent) by preserving data locality. As for indexes, well, that’s a separate issue.

    Remember that my dumb database can hold no declarative logic, it can’t by itself decide what goes in the index. All the dumb database is able to do is store the indexes and use them efficiently to retrieve records, but it needs the application to decide what data goes in the index. This is a special case for update feeds: the database delegates write events to the application, and the application resolves each event into a set of index records.

    This sounds a little bit complicated, so let’s work that into our example. I’m going to define an index by giving it a name in the database, and a callback function that, given an order, will return a list of product URLs. I can then access that index with a product URL to find all the orders that contain that product. The database does all the heavy lifting, but the index structure is decided by the application.

    With that index, I get efficient queries on my orders, without having to create and maintain a separate line item table for the sole purpose of indexing. I only need to design indexes that support my queries.

    I happen to think asynchronous indexes are a powerful feature that simplifies entity management. Here’s another example. Given the same list of orders, I’m going to define a function that only selects completed orders, takes the difference between completed and created dates, and reduces that into a ‘days to complete’ index. I can now find all orders that took five days to complete, without having to store computed values in the table, or see NULLs in my index.

    Asynchronous indexes have three interesting properties. The decision on what and how to index is done by the application, which also means they are sparse indexes. And they are updated asynchronously, much like a search index, which reduces contention on writes.
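
    Both indexes described above can be sketched with a toy index class that takes its key function from the application. The class, the field names and the sample dates are assumptions, and the sketch leaves out removing stale entries when a record is updated or deleted.

```python
from datetime import date

class AsyncIndex:
    """Toy index: the application supplies the function that yields index keys."""

    def __init__(self, key_function):
        self.key_function = key_function
        self.entries = {}  # index key -> set of record ids

    def on_write(self, record_id, record):
        # Called from the update feed, some time after the write itself.
        for key in self.key_function(record):
            self.entries.setdefault(key, set()).add(record_id)

    def find(self, key):
        return sorted(self.entries.get(key, set()))

def product_urls(order):
    return [item["link"] for item in order["items"]]

def days_to_complete(order):
    if order.get("completed") is None:
        return []  # open orders get no entry, so the index stays sparse
    return [(order["completed"] - order["created"]).days]

orders = {
    "/orders/123": {"items": [{"link": "/products/456"}, {"link": "/products/789"}],
                    "created": date(2007, 9, 1), "completed": date(2007, 9, 6)},
    "/orders/124": {"items": [{"link": "/products/456"}],
                    "created": date(2007, 9, 2), "completed": None},
}

by_product = AsyncIndex(product_urls)
by_days = AsyncIndex(days_to_complete)
for order_id, order in orders.items():
    by_product.on_write(order_id, order)
    by_days.on_write(order_id, order)
```

    Looking up /products/456 in by_product finds both orders, while by_days only ever holds the completed one. The store never needs to know what a line item or a completion date is.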

    Transactions

    Two things you need to know about read consistency databases: 1) there are no locks, and 2) there are no locks.

    You may have already decided that it’s impossible to build a database without some sort of locking mechanism. Perhaps, although some voices from the functional programming world may argue otherwise. Either way, what I mean by no locks is that you can’t use one action to block another, and you certainly can’t deadlock. This makes the database and application that uses it a parallel problem. And parallel problems yield nicely to multi-core CPUs and banks of inter-connected nodes.

    There’s a mythical example that explains how relational databases work. It involves an atomic transaction that moves data from one account (debit balance) to another (credit balance), in such a way that both happen together without intermediate results. In the past I used that as an example to illustrate the role of atomic transactions in storage. Nowadays I use this as an example to illustrate a bit of social engineering. Don’t laugh.

    The point of this example is to confuse database transactions with financial transactions, using something that affects us directly: our checking account. Banks don’t work that way, in real life financial transactions are much different. In fact, the transaction in our checking account is a record of the money changing hands. The sum of these records is the bank account. And we can use these records to calculate a snapshot and store it as the daily balance, or present the current balance from daily balance and pending transactions combined. A classical application of read consistency.
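
    The arithmetic is simple enough to sketch. The amounts below are made up, kept in cents to avoid floating point rounding, with credits positive and debits negative.

```python
def daily_balance(previous_balance, cleared):
    """Snapshot: the previous balance plus every transaction that cleared."""
    return previous_balance + sum(cleared)

def current_balance(snapshot, pending):
    """Present the balance from the stored snapshot and pending transactions."""
    return snapshot + sum(pending)

# Amounts are in cents; credits are positive, debits negative.
cleared_today = [250_00, -40_00, -15_50]
snapshot = daily_balance(100_00, cleared_today)

pending = [-20_00, 5_00]
balance_now = current_balance(snapshot, pending)
```

    Nothing here updates two accounts atomically. The account is whatever the records add up to, and the snapshot is a computed value that can be rebuilt from them at any time.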

    For a large number of applications, what you need is the ability to make progress, dodge race conditions, and end up with a consistent view.

    So the first thing we need to understand is how we develop applications for real life scenarios. In real life scenarios, we’re going to deal with incremental state changes (e.g. credit card charges take time to clear), resolve conflicts as they happen (e.g. order ready to ship, when we lost the last item we had in stock), and coordinate outcomes at higher levels (two phase commit for airplane tickets and hotel rooms? show me). Those all fit well within our read consistency model.

    While we like the database to be dumb, we don’t tolerate stupidity: we can’t stand for lost updates. That’s a critical requirement we can address in a variety of ways. Idempotent writes, so clients can retry those until successful (see below). Reliable storage, through traditional mechanisms like RAIDs, logs and geographical fail-over. At-least-once semantics, that one is important since we do a lot of work asynchronously, so it’s baked into the update feeds (see above).

    Lost updates is also a term that describes one update overwriting another. We don’t have locks, but we’re going to use conditional updates instead (relatives of optimistic locks). We already identified this as a feature offered by the database itself in the form of versions and cheap ETags. Since we tend to handle coarse grain entities, the ones that represent our units of data, we can often perform the equivalent of a transaction in a single update. We can also use update feeds to chain two updates together, so a change in one record will be reflected in another.

    There’s still an issue of pushing updates to multiple independent entities. This, it turns out, is a much larger architectural problem to solve. How do you push updates reliably into your ERP and CRM, when those are independent services? So we start thinking in terms of versions, ordering of operations, chained updates and compensation. At this point I’m going to wave my hands a little. It’s a really interesting topic, but much too large in scope for this post, so I’ll defer it to some other time.

    There are obviously applications for which this is not enough, and applications for which you would prefer the convenience of ACID transactions. But for a large class of applications that are sensitive about the correctness of their data, and must handle it reliably, a read consistency database would work just fine.

    Identities

    Let’s start with the basic stuff. Each record has a unique identifier created by the database. These identifiers are opaque, you can’t use them to infer order or locality, but you can certainly use them for equality. Nothing new so far, but that’s not the only type of identity we have to contend with.

    How can we create a record exactly once? Remember that we don’t have transactions, but we do have conditional updates, and conditional updates are slightly different from optimistic locks. We use conditional updates to update a record only when it has a certain value we know, typically from a previous read, but sometimes any value will do.

    So we’re going to ask the database to allocate a new record identifier for us, fairly cheap request. Then we’re going to make an update on the condition that the record doesn’t already exist. If you’re familiar with HTTP, think of a GET to a known resource (e.g. /orders/new), extracting the URL out of the Location header, and using it to make a PUT with If-None-Match set to ‘*’. In short, we’re only going to make a successful update if no one else beat us to the punch, including any previous attempt we made before. Create once.
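
    A sketch of create-once against a toy in-memory store. The put_if_absent method stands in for a PUT with If-None-Match set to ‘*’, and every name here is invented for the example.

```python
import uuid

class PreconditionFailed(Exception):
    pass

class RecordStore:
    def __init__(self):
        self.records = {}

    def allocate_id(self):
        """Hand out a fresh, opaque identifier; nothing is stored yet."""
        return f"/orders/{uuid.uuid4().hex}"

    def put_if_absent(self, record_id, value):
        """Equivalent in spirit to PUT with If-None-Match: *."""
        if record_id in self.records:
            raise PreconditionFailed(record_id)
        self.records[record_id] = value

def create_once(store, record_id, value):
    """Safe to retry: a repeat attempt finds the record already there."""
    try:
        store.put_if_absent(record_id, value)
        return True
    except PreconditionFailed:
        return False

store = RecordStore()
order_id = store.allocate_id()
first = create_once(store, order_id, {"total": 15.99})
retry = create_once(store, order_id, {"total": 15.99})
```

    The client holds on to the allocated identifier across retries. If the first request succeeded but the response was lost, the retry fails the precondition instead of creating a second order.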

    So what does this have to do with identity?

    A common scenario is one where the entity has an identity, different from the unique identifier created by the database. Imagine for example that we’re creating user accounts, we decide on the username as identity, therefore no two users can have the same username. We better be able to do that.

    Let’s revisit asynchronous indexes. They’re updated asynchronously, duh, so we’re opening up to a race condition in which two writers create two records, and the index updates to point at both. We can decide this is a read consistency issue, and simply formulate our query to ignore all but the first record. Or, given that our indexes are sparse, decide to create an index entry once, and the first record wins.

    This works nicely for asynchronous updates, since race conditions are rare and we don’t care much for a few orphaned records. But what if we’re doing something synchronously: the user is waiting for us to confirm the new account, or ask them to pick a different user name? We can block. Create the record, wait for it to show up in the index, decide if it’s the same record as the one created, and return the appropriate response. That’s one option.

    Let’s look at another use for conditional updates. We’re going to first allocate a new record identifier, then update the index to point there on the condition that no index entry exists, and then update the record on the condition that no record exists. Why do we need both conditions? On the chance that the index entry was previously created, and then abandoned before creating the record. Might have been us in a different thread. So if we do find an index entry, we’ll use that to make a conditional update.

    All this, without locks. And obviously, it’s abstracted by the client library, so we don’t need to run the entire sequence, just ask to create a record with identity, or create-new/update-existing. But it’s helpful to know how this is handled by the database.
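
    The sequence a client library might run can be sketched as follows. This is a single-threaded toy in which dict.setdefault stands in for the conditional index update; the class and method names are hypothetical, and a real store would have to make each conditional step atomic on its own.

```python
class IdentityStore:
    def __init__(self):
        self.index = {}    # identity (e.g. username) -> record id
        self.records = {}  # record id -> value
        self.next_id = 0

    def allocate_id(self):
        self.next_id += 1
        return f"/users/{self.next_id}"

    def create_with_identity(self, identity, value):
        """Return the record id on success, or None if the identity is taken."""
        candidate = self.allocate_id()
        # Conditional update 1: claim the index entry only if none exists.
        record_id = self.index.setdefault(identity, candidate)
        # Conditional update 2: write the record only if none exists. An index
        # entry without a record is an abandoned attempt that we can finish.
        if record_id in self.records:
            return None
        self.records[record_id] = value
        return record_id

store = IdentityStore()
first = store.create_with_identity("assaf", {"name": "Assaf"})
second = store.create_with_identity("assaf", {"name": "Impostor"})

# Simulate an abandoned attempt: an index entry with no record behind it.
store.index["lolcat"] = "/users/99"
resumed = store.create_with_identity("lolcat", {"name": "LOLcat"})
```

    The second attempt on the same username is refused, while the abandoned entry is picked up and completed under the identifier it already pointed to.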

    “Wait!”, you say, “so you can update indexes directly, why didn’t you say that before?” Because the more you work with traditional databases, the more you’re conditioned to think of storage as a synchronous problem. And that’s wrong. There’s a world of possibilities out there that comes from thinking about and solving problems asynchronously. Read consistency databases open that door, but it’s also important to understand how to use them properly. So I wanted to focus more on this new frame of mind than on reaffirming habits of the past.

    Conflict Resolution

    Data partitioning means placing different subsets of the data in different places. Data partitioning is free in the sense that you don’t have to work to make it happen inside the application. We’re going to leave it up to the database (or a proxy, think about that) to decide how to distribute individual records, how to locate them and combine results, and how to shuffle data around when adding new nodes. All the features we covered so far make this transparent to the application.

    However, data partitioning is not enough for all workloads, sometimes we need replication. Replication brings with it a different problem, that of network partitioning. When the network partitions, it’s possible to make an update in one replica, but read a stale value from another. It is also possible to perform independent updates on replicas of the same records.

    Read consistency helps us deal with out-of-sync replicas at read time, but we still need to solve independent writes and reconcile those. Again, we’re going to use update feeds to delegate conflict resolution to the application, it’s just another type of update events.

    Time To Junk The RDBMS?

    That depends on how you’re using it, and I’m the last to suggest you junk your RDBMS just because a new shiny object comes around and becomes the sound bite of the day. If it ain’t broke, build something new.

    But if anything I wrote sounds vaguely familiar because you somehow managed to dumb your RDBMS into storing structured data in BLOBs, added versions and timestamps on all records, grappled with minimizing transactions and locks, denormalized data like there’s no tomorrow, or relied too much on a message queue, then time to rethink. Are you using a hammer to polish your china? (Tip: not a good idea, invest in soft cloth)

    The thing about relational databases, dumbing them down doesn’t create a dumb database that you can scale easily, and doing read consistency on top of write consistency is two problems to solve. It’s still a shared resource programmed in COBOL pretending to be a mainframe from the day structured data would fit nicely in tabular form. Which, granted is perfectly fine for a lot of applications. And insufficient for others.

  8. Rounded Corners - 150 (OPTIONS)

    September 17th, 2007

    Drive the point home. Via DailyKos:

    The Edwards proposal would cut off health care for the president, Congress and all political appointees in mid 2009, if a universal health care plan for all Americans has not been passed by then.

    Finally, a policy I can understand.

    There’s also that. Scott Rosenberg summarizes the Big Ball of Mud architecture:

    Despite the best efforts of “best practices” advocates and methodology gurus, mud is everywhere you look in the software field. … Their answer: “People build big balls of mud because they work. In many domains, they are the only things that have been shown to work.”

    Yep.

    Do some evil? What’s up with Yahoo these days? First they have to scare you into getting a Yahoo Mash account. And then spam on your behalf:

    I imported my address book from GMail into Yahoo Mail today - and was HORRIFIED when Yahoo proclaimed that it was now going to spam all my contacts and tell them about my “new” Yahoo e-mail address.

    What happened to their strategy to build compelling services?

    Recently installed:

    PulseAudio: A sound server to fix the deficiency that is aRts/ESound/ALSA. Works nicely, at least with Amarok (natively supported in the latest Xine engine).

    CompizFusion: The new Beryl.

    DD-WRT v23: Adds real time bandwidth graphs, I presume rendered in SVG, works flawlessly on Firefox. And DNS tunneling.

    Google Presentation: Just kidding. This one just installed itself. Limited themes (read: they’re all ugly), but it does export nicely into HTML.

    OPTIONS in Rails. Curiosity killed a few hours. The RESTful routing implementation in Rails 2.0 is not for the faint of heart, but with a few trials and errors, and Ruby is great for trying things out, I managed to spit out the right Allow header for each OPTIONS request. So yes, Rails can support this obscure and fairly unused HTTP feature.

    Unfortunately, this all happens before filters (authentication, checks, etc), and I assume most developers will be oblivious and never bother to override the default behavior. I think this will be more trouble than it’s worth, so I filed it under “nice to know it can be done”, but no plans to release working code.

    What did I learn? Rails 2.0 code is as thick as milkshake, but does not resist adding new features in odd places. And OPTIONS is a lottery feature.

    Above, an Internet safety tip.

  9. Rounded Corners - 149 (We didn’t think this would be a problem)

    September 15th, 2007

    Both sides of the fence. Rafe Colburn:

    I’m beginning to feel like every time I touch anything, I have planted the seeds for a future outage.

    The more systems administration tasks I perform, the more I understand why systems administrators tend to hate programmers.

    This day in particular, I take no particular pride in either one. Not my administrative abilities (can’t upload images to my own blog), or the pain I inflict on others with the software I build (bug fix coming up!). Though, I intentionally do administer my own stuff, it helps in designing code others would use.

    And on that note: never underestimate sane defaults.

    Live and learn. Interesting comments on Tim Bray’s mod_atom post, which I want to take out of the Atom context because they’re not specific to one particular format.

    David Megginson on Namespaces in retrospect:

    … in retrospect, we got too far in front of implementors’ requirements and delivered a spec to solve problems someone might have some day in the future, instead of problems people actually had at the time.

    And:

    I liked the final Namespace spec, even though it wasn’t what I had originally argued for, but when you have a spec that almost *everyone* ignores or gets wrong (XSLT and SOAP excepted), it might be time to acknowledge that the problem is the spec instead of the implementors.

    Reinier Zwitserloot on XML as binary format:

    That wasn’t the case, and as a result, we’ve got this mess. It’s unfortunate that 99% of all so-called ‘XML-based’ standards are actually not XML in the vox populi definition - doing whacky XML tricks like non-default namespaces, or user-defined entities, breaks everything.

    You try, you learn, and next time around you do something else.

    Yahoo 360, Reincarnate. I have to preface this by saying, I do like the random invites I get, a chance to explore. Some I stick with, some I don’t, can’t tell in advance, mostly it’s as logical as deciding to, say, like peanut butter or dislike S’mores. But Yahoo Mash and I seem to have started off on the wrong foot.

    On Facebook people send you friend invites. On Yahoo Mash, you’ve got to watch out for your friends!

    What the e-mail says: Soandso started a profile for you on Mash! It’s good to be loved! ;)

    What it really means: Soandso took possession of your public identity. Is that ok with you? Of course not! So register to reclaim your online identity (and enjoy our ads)!

    As a courtesy, Yahoo 360Mash creates a default profile that you can then reclaim:

    What the profile says: This is you.

    What it really says: This is you on drugs. You post hand drawn pictures of the worst kind. You write bad poetry, the likes of which have not been seen since 3rd grade.

    Still, bad taste has never stopped anyone. This may be the most happening spot in town.

    What I want to do next: Check Soandso’s profile and their network.

    What I have to do next: “Claim your profile before wandering around on Mash!”

    Maybe some other time.

    Apropos, social networking rehab.

    Yay to DRM. For some reason, iTunes decided I need to re-authorize an album before I can listen to it. I barely have any DRMed stuff, but a shuffle play of the library bumped into one of those. Anyway, re-authorize I did, and now that track is authorized on 2 out of 5 machines. I’m just wondering where that other machine is, the one I supposedly own and authorized before. Maybe an iTunes bug?

    And vendor lock-in. Speaking of no DRM, that’s because my primary machine runs Linux. It’s the one playing music the most, and syncs with the iPod. Turns out, that may not last for long. Though you can’t believe everything you read on the Web, and this link comes courtesy of BoingBoing which tends to favor the-sky-is-falling posts, if it happens to be true, it might signal the end of a love affair between me and the in-your-pocket Apple product line.

    Apropos, the iFlop.

  10. Rounded Corners - 148 (Magic trick)

    September 14th, 2007

    Magic tricks. Steven Frank explains how bugs are like magic, which besides being a good approximation, also makes me feel so much better. My code, not a piece of crap anymore, but something of Houdinic proportions. DailyWTF, ever so useful, illustrates by example what Steven is talking about.

    Me neither. I couldn’t have said it better, so I’ll just summarize the main points of Elinor Mills’ “Want to ‘converse’ with advertisers? Me neither”:

    I can’t help but view conversational marketing as a thinly veiled attempt by the ad industry to insinuate itself into the popular social media craze. Calling it a “conversation” makes it sound benign and implies that it is consensual.

    And:

    The most genuine conversation occurs when it is started by the consumer/reader or the blogger. A blog post about a product or company that elicits a response from the company is very effective, said Barak Berkowitz, chairman and chief executive of Six Apart.

    Read the article and have fun spotting the irony. (What is it about journalists and elephant blindness?)

    Erlang for dummies. Nice introduction to Erlang, if you ever hope to read other people’s Erlang code. This post won’t teach you what’s good about Erlang, but its examples will help you follow along when other people talk about it.

    I’m personally not in the Erlang camp, but I like the source of influence, and I’m thinking of going artistic on some of its process handling capabilities.

    Public service announcement. Judging by the headers, I think someone at Google left an SMTP server wide open. For the past couple of days, spammers have been sending a gazillion e-mails masquerading as labnotes.org, and it all seems to be routed through Google, even though there is no such labnotes.org account. Definitely not coming from my machine. Apologies in advance if you’re on the receiving end.

    Innovate, don’t litigate. I should change that to ‘innovate or bust’. SCO is going to the cleaners.

    Above, I sympathize with the generation that will have to clean up this mess.