1. Sep 30th, 2007

    WideFinder, Ruby Pic, and Scaling Up, Out and Away

    Scaling Up

    I’m starting with the sequential version of WideFinder. I have two implementations of Ruby, so this will be interesting.

    Ruby 1.8.6 pegs the CPU at 100%, and completes the sample size in 7s. JRuby 1.0.1 goes all the way to 110% CPU utilization, but that’s reported against the JVM, I’m assuming 100% is taken by the processing, the other 10% by overhead, like the garbage collector. Unfortunately, I have to be 50s patient waiting for the results.

    The parallelized version fares worse on CRuby. CRuby doesn’t scale out, so still only 100% CPU utilization, plus the overhead of parallelizing on a single thread. On the other hand, JRuby has fun all the way to 150~160% CPU time.

    On Linux each core is considered a separate CPU, so the theoretical upper limit for a dual-core CPU is 200%. In practice, there are other things running in the background, and the best I’ve seen so far is 180%~190%.

    I’m not sure why JRuby can’t get there. It does in fact stretch its legs all the way to 180%~190% as reported for Java in the first few seconds, while the JIT kicks in. It might in fact be the JIT working the 20% shift. Past the first few seconds, it slows down to a constant rate of 110% (sequential) or 150~160% (parallel).

    Perhaps 160% is the upper limit on reading the log file?

    Scaling Out

    This time I’m turning WideFinder into an N>1 problem, where each “machine” has a log file to process, and one process that drives them all and collects the results. I’m quoting “machine”, since I’m still running this benchmark on a single machine, but now that I’m only coordinating as opposed to feeding them all a single log file, I can scale out easily.

    The sample size is different, so don’t compare processing times with the previous run.

    The first run is single threaded, one log file at a time, single core (CRuby) pegged at 100% and takes 1m28s to complete.

    The second run is parallelized, each process handles a different log file (ten in total), still single core pegged at 100%, completes in 1m30s. 2 seconds difference from pretending to parallelize.

    The third run forks a process to handle each log file, stretching it all the way up to 150-160% and finishing in 46s.

    Interesting because now I’m keeping both cores busy and completing the workload in half the time, even though it’s barely pushing 160%. I’m not exactly sure how to explain this result, other than forking multiple processes makes magic happen.

    Scaling Away

    Erlang has the right concurrency model, too bad about the language.

    I intentionally did not include the Erlang WideFinder in my tests. I’ll just go by the reports that it’s slow, because frankly I don’t care. In the overall scheme of things, you spend more efforts scaling a room full of servers that generate logs than programs that process these logs. I prefer to focus on those servers first, and from what I hear Erlang is up to that task, so the WideFinder performance can be forgiven.

    The point of the exercise was to illustrate the concurrency model, and how you can retrofit it into a variety of languages. When scaling out, it’s as simple as programming against a library. For low-level tasks, it clearly requires the performance and simplicity that comes from changing the underlying VM. And this can be done for Java or Ruby or Python. Though I doubt we’ll see it happen in Java or Ruby, much for the same reason I doubt Erlang will hit a tipping point.

    I think the problem is rooted in how developers react to concurrency. When faced with uncertainty, we resort to hierarchy and structure.

    Concurrency is fundamentally a problem of uncertainty. Order is undetermined. Some people embrace that by building systems that thrive on concurrency, and once you start a world of opportunities opens up. Have a look at Erlang and Pict. But most developers race (no pun intended) to solve it with structural programming. Threads, forks, synchronization blocks, atomic transactions, parallel FORtran loops. Those are all examples where we scope race conditions and face the consequences, only to avoid thinking in terms of message passing.

    I’m waiting for the day when “Synchronization blocks considered harmful” will become the hot topic of the day, and we’ll learn to do without method signatures and without waiting for a return value. Today, though, we’re still forced to do asynchronous, but prefer to think synchronous.

    1. Sep 30th, 2007

      Charles Oliver Nutter

      Just FYI, there’s a couple facts about JRuby you should know:

      1.0.1 didn’t compile more than about 50% of Ruby syntax, which equated to a far smaller percentage of whole methods that would compile. That means 1.0.1 is still usually running interpreted.

      Also, there’s still an unfortunate bottleneck in JRuby regexp process, since we have to represent strings as byte[] and all Java regex engines work with char[]. On JRuby trunk JRuby is less than 2x slower than Ruby, largely because of this bottleneck. On non-regexp benchmarks, it’s always faster than Ruby.

      I’d expect JRuby to continue improving as we knock down the remaining bottlenecks, so check back or try trunk right now.

    2. Oct 1st, 2007

      Mikael Lind

      “The second run is parallelized, each process handles a different log file (ten in total), still single core pegged at 100%, completes in 1m30s. 2 seconds difference from pretending to parallelize.”

      Should “process” not actually be “thread” here? I am also curious about why multi-threading (if that is what you are describing) is slower than multi-processing here. Have you considered asynchronous I/O?

    3. Oct 1st, 2007

      Assaf

      Charles, the pingpong test runs twice as fast on CRuby/YARV as JRuby. I did not try the JRuby compiler yet, wouldn’t be surprised if it ends up beating CRuby in future release.

    4. Nov 3rd, 2007

      Tim Bray

      I don’t see a link to the actual code… post it or email it to me and I’ll try it on the big many-core SPARC.

    5. Nov 30th, 2007

      Assaf

      Mikael, the code is written to run “processes”, in the pi-calculus notion of a process. Some of the tests run them as separate threads (green or native, depending on the VM) in the same OS process, and some run them in separate OS processes. But design wise, they’re modeled as independent processes.

      Tim, code e-mailed your way.