  1. Git forking for fun and profit

    April 30th, 2008

    Originally written and posted to the Buildr developers mailing list. In the past few weeks all changes were pushed using git-svn and a synchronized Git repository hosted on Github. It rocks!

    Apache built a great infrastructure around SVN, lots of sweat and tears went into making it happen, and at first I felt like we were circumventing all of that. But the longer I thought about it, the more I realized that Git is just more social than SVN, and that’s exactly what Apache is about.

    So there you go. Feel free to adapt to whatever project you’re working on.

    On The Morality of Forking

    One thing I love about open source is that it gives you the right to fork. Don’t like how the project is managed? Want to take it in a different direction? Tired of seeing a broken trunk and re-fixing the same typos? Copy it over and start a new project. The Apache project started life that way.

    In open source culture, forking is often used as a four-letter word. One fork means two different code bases. What happens next depends on the tools you use, but typically keeping these two forks synchronized, sharing changes and bug fixes, can be a pretty daunting if not impossible task. Even branching in the same code repository is a tricky maneuver — when was the last time you did an SVN branch only to fix a one-line bug?

    There’s a high bar to forking, so people don’t do it lightly. We generally prefer not to, reserving forking for dead projects and irreconcilable artistic differences. Being forked is a stigma you don’t want on your project’s resume.

    At least it used to be that way. Back in the days of ugly source control systems, forking would lead to all sorts of nasty side effects. So I want to correct that impression and explore a different kind of forking — one that’s fun, healthy and a way to build a better community around the forked project.

    Forking Alone

    The way SVN works, and I assume you’re familiar with SVN, you check the code out of a central repository and into your working copy. You make changes locally, and when you’re done, you commit those back to the central repository (or hand someone a patch to commit on your behalf).

    The working directory is an offline copy of “the real thing”, a local cache for easy editing. Officially we’re all working on the same code base; whatever edits you make in the privacy of your own computer are your business (until you commit).

    Distributed version control works in a different way. I’m going to talk exclusively about Git for reasons that will become obvious later, but the same goes for whatever distributed version control you decide to use.

    When your source control is distributed, there’s really nothing else to do all day but fork. To get anything meaningful to happen you start by cloning the central repository to your hard disk. What you end up with is a full blown copy, history and all. Clone, branch, or whatever else you call it, it really is a fork.

    Now that you’re working with your very own fork, you can branch, commit, rollback, merge and do all sorts of interesting things on your own repository. You can also fetch the latest changes from the central repository, and push the work you’ve done back to the central repository, or send a patch that someone else can pull in.

    Forking in private is most of what you do, but not all.

    Forking In Public

    Open source is great, open source with open development is teh awesome. Open development is done in public. You don’t go hiding in a cave only to emerge a few months later with a big code drop. Do the work where everybody else gets to check it out, participate and hopefully contribute.

    We don’t look too kindly on anti-social behavior. On the other hand, tools like Git are great for cave digging and dwelling. Why would I think forking is such a good idea?

    To begin with, social problems are not solved with technology. The point of a source control system is to make development easier, not annoy people into socializing. That should come from a fun, creative and supportive community.

    Git is wonderful for committers. Rule #1 of source control: don’t break the trunk! When you break the trunk, everybody else has a bad day. They can’t get any work done.

    But during development we often reach this point when you’ve got something incomplete, perhaps broken, but significant enough that you’ll want to check it in. You want that checkpoint because it allows you to move forward and experiment with different ideas. Worst case, you can always roll back. The ability to take these checkpoints, making local commits and branches without breaking trunk, is quite powerful. Use it wisely.

    Git is also wonderful for those of you who are not committers (yet). You can get to be a committer by racking up karma points. You get more points from major contributions. Major contributions require a lot of work: you’ll want to source control it, and you’ll want to involve other developers so they can help make it happen. With SVN you can’t do that until you get your committer status. Catch-22.

    So fork in public. You do that by setting up a public repository, synchronizing it with the central repository, and pushing your changes to your public repository. Other developers can then clone your repository, check the code you’re working on, use it and test it, send you patches and even push changes to your repository, working together away from the trunk.

    When you’re done working on a big enough change you can merge all these changes into a set of patches and send them over for inclusion in the central repository. That way you can contribute as little or as much as you like without waiting for SVN access. Even better, you can share these features with others while waiting for them to be included upstream.

    Forking is Fun

    So let’s review some of the things you get out of using Git.

    You can branch, commit, go back in history and do all sorts of useful things offline. Offline means you can do them on a plane or a train, which some people think is really cool. Even if you’re always connected, you’ll love how everything happens so damn fast. It’s like strapping a T3 line straight into the ethernet port.

    You don’t have to bother anyone else. You can branch as often as you need to, which is damn useful when you’re working on two things at the same time. (Yes, SVN had branches since forever.  Not the same thing.) Actually, I recommend branching any time you’re working on something: branch, change, commit, merge.  Did I mention, fast?

    You don’t have to hold off on commits until everything works. You can write test cases and commit, get some code working against these tests and commit, get more code working and commit again. When you’re finally ready with that changeset is when you push it into the central repository, by which point you won’t be breaking any trunk. Frequent commits are great if you like to experiment with new ideas, or share work in progress.

    If you’re doing something big, you can fork the central repository and get other people working against that fork, helping to make it happen. You don’t need to be a committer to start off on a major contribution, you don’t have to wait for a patch inclusion before others can start using your code. Best, you don’t need to bother with SVN branches, which ironically are harder to synchronize with trunk than using Git.

    Come to think of it, just giving all that power to contribute to developers who are not yet committers is a killer feature, and why I’m writing this piece to begin with.

    I realize all this fun stuff might be hard to imagine, might not even sound plausible, if you’re used to the SVN way of doing things, but once you let go of centralized source control, everything in the universe will start making sense.

    Parenting and Custody

    I started by saying open source provides the right to fork. Each open source license expresses that right in a different way, the one we use is the Apache Software License. That works as long as we provide all the code under the terms of the ASL, which we can do since every contribution included an agreement to let Apache distribute it under the ASL.

    To avoid all sorts of nasty custody fights, which we really don’t have time to deal with (we’re trying to get software done), we have to make sure all the code coming in is accompanied by the Contributor License Agreement. We have two ways of doing that.

    Some code comes directly from committers, all of whom signed the CLA. Easy. Other code comes from patches, which go through JIRA. JIRA gives you the option to CLA the patch before uploading it, telling committers they can go right ahead and add it to the code base. That way we have a commit trail showing who contributed what.

    So you understand why the official source repository has to be hosted by Apache, and while we’re waiting for Git to happen, right now we’re stuck with SVN. No Git for us? Turns out, it’s not such a big problem.

    For starters, you can always use git-svn to clone the SVN repository and then use Git instead of SVN. You get all the awesomeness of Git and an easy way to keep it consistent with trunk. I’ll explain how to do this when we talk about the mechanics of forking.

    If you find someone you trust who’s already managing a Git clone that’s synchronized with SVN, you can clone their Git repository. I use Victor’s Git repository.

    If you’re working on something big, you’ll probably want to fork in public, creating a remote repository that others can tap into. Lots of options to choose from, the one I use, because it’s wonderful and I don’t want to host one myself, is Github. If you want to clone someone else’s repository to create your own public repository, you’ll love the “fork” button. Can’t miss it.

    Now, all of this introduces an interesting problem. Say you decide to work on something big enough that you need a public repository. You also need other people to help you, by contributing their changes and fixes. You want to bring all that code into the central repository so this cool feature shows up in all future releases, but the code is now a mishmash of contributions from different people.  How would you get something like that approved?

    Here are three things you can do to help us approve these contributions:

    1.  If in doubt, ask. Mailing list is the best place. We’ll revise these guidelines as we learn what works best.

    2.  Keep an ongoing commit trail. If you accept a patch from someone else, commit it and include an attribution in the commit message (see here for guidance). When you use git format-patch, it creates one patch for each commit; we apply these individually, preserving the commit trail. Pushing does the same (but double check).

    3.  Ask contributors to sign the CLA. It’s quick, it’s easy and you don’t have to be a committer. Check the list of committers and non-committers who already signed the CLA.

    The Mechanics of Forking

    So let’s discuss the few ways in which you can fork, starting with git-svn:

    $ git svn clone http://svn.apache.org/repos/asf/incubator/buildr/trunk buildr -r <revision>

    Apache maintains one huge repository shared by all projects, and while Git will only clone the history for a given project, it will need some time to process through an endless stream of revisions. How long? Really long. Most likely you don’t need all that history going back to the very first day, so just clone from a recent revision, it will only take a couple of minutes.

    Since all projects share the same repository, svn info will show you two revision numbers. The first, the actual SVN revision number, is not the one you want — cloning it will fetch nothing. The second, the “Last Changed Rev” is the one you want to clone from.
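
    Picking the right number out of svn info is easy to script. Here is a minimal Ruby sketch, assuming the usual "Revision:" and "Last Changed Rev:" fields in the output. The sample text and its revision numbers are made up for illustration, and last_changed_rev is a hypothetical helper, not part of any tool mentioned here.

```ruby
# Sample `svn info` output. The revision numbers are invented for illustration.
SAMPLE_SVN_INFO = <<~INFO
  Path: .
  URL: http://svn.apache.org/repos/asf/incubator/buildr/trunk
  Revision: 652000
  Last Changed Rev: 651900
INFO

# Hypothetical helper: returns the "Last Changed Rev" value as an Integer,
# or nil when the field is missing from the given text.
def last_changed_rev(svn_info_output)
  match = svn_info_output.match(/^Last Changed Rev:\s*(\d+)$/)
  match && match[1].to_i
end

rev = last_changed_rev(SAMPLE_SVN_INFO)
# Print the clone command with the revision filled in.
puts "git svn clone http://svn.apache.org/repos/asf/incubator/buildr/trunk buildr -r #{rev}"
```

    Feed it the real output of svn info instead of the sample, then run the git svn clone command it prints.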

    Check that it worked:

    $ cd buildr
    $ ls
    .....
    $ git svn info
    Path: .
    URL: http://svn.apache.org/repos/asf/incubator/buildr/trunk
    .....

    You’ll want to set your name/e-mail so they show up in all commits, which you can do on each Git repository, or once using the --global option:

    $ git config --global user.name Assaf
    $ git config --global user.email assaf@apache.org

    To pull updates from SVN and fix (rebase) all your local commits against the most recent SVN update:

    $ git svn rebase

    Best way to work on a new feature is to start with a new branch:

    $ git checkout -b teh-awesome
    $ git branch
      master
    * teh-awesome

    Do some work, commit as often as necessary and when you’re done, rebase these commits against the latest changes from SVN, and generate some patches:

    $ git svn rebase
    $ git format-patch origin

    You’ll get one patch file per commit. Depending on what you did to get here, that could be a boatload of patches, so you might want to roll together (squash) some commits, or even change their order (that way, we’ll think you wrote test cases ahead of the code!) Check the documentation, git rebase -i master is your friend. 

    Cloning someone else’s repository is just as easy, for example:

    $ git clone git://github.com/vic/buildr.git
    $ cd buildr

    This time around you’re working against a Git remote repository, so you grab updates using git fetch/pull and rebase accordingly. Everything else involving branching and patching works the same way.

    You can also work with both at the same time: a local repository that clones a remote repository, the one you’re using to share your work with others, and also synchronizes with SVN trunk. (Bet you didn’t know, but your local repository can synchronize with several remote repositories.)

    You’ll want to start with git svn clone and then add the remote repository using git remote add. Or just use the buildr-git script, courtesy of Victor, which sets up everything to work just right, and adds useful commands like git apache-fetch, git apache-pull and git synchronize.

    There are a few command line options you can set to use this script with any other Apache project.

    So go, have fun, and Git away!

     

  2. Buildr goes Apache, and see you next week at ApacheCon

    November 8th, 2007

    Quick update. Last week, yes I am that behind on my posts, Buildr was voted for incubation under Apache. Kudos to InfoQ for reporting the news even before we got to count the votes. We expect the movers to come in next week and help move the site, SVN and mailing list to our new digs.

    As you can read from the comments, some people think it’s a daft idea (my words) to write a build system in Ruby for building Java (and Scala) code. And some are still holding on to the ideal of programming in XML. And as much as I’d hate to admit it, even more mistake the F5 key for a build process.

    Either way, a lot of developers out there don’t mind language plurality, and judging from the mailing list responses, love what they see. And it’s always a pleasant surprise when you see a high quality patch coming from a developer who just picked up Ruby by hacking away some Buildr code. It is, that easy.

    Couple of challenges remaining. There are some unresolved issues regarding licensing and releases, but we’ll figure those out as we go along, and hopefully get to present what we learned at a future ApacheCon.

    And two things are still bothering me. The Buildr mailing list is hosted on Google Groups, and now I’m spoiled and hate having to go back to the 90’s (Apache uses ezmlm). Also, as we go through incubation the site URL will change twice. Say goodbye to PageRank and hello to broken links.

    Maybe it’s time to start discussing a more modern infrastructure at Apache.

    So the first full-on Ruby project at Apache. I hope it’s the sign of things to come. Apache is still known for Java/C hegemony, not a conscious decision, just a fact, but it does affect perception, so let’s change that.

    And before I forget. I will be attending ApacheCon next week in Atlanta, if you have some time come and say hi.

  3. Build Testing For The Rest Of Us

    July 6th, 2007

    Nick Sieger, on testing your Rakefiles:

    Perhaps someone out there will run with this idea and take up the challenge and write a Rakefile completely in a test-driven or behaviour-driven style. It’s always been a sore point for me with Make, Ant, Maven, and virtually every other build tool in existence that you have no other way of automatically verifying your build script is doing what you intended without manually running it and inspecting its output – it just feels so dirty!

    I don’t know of many people who actually test their builds and automated tasks. But I do know someone who manages to break the builds every so often. Like that time I experimented and accidentally erased the LICENSE file, and then made a release with an empty license. Oops. Or that time I moved stuff around and ended up releasing a WAR that passed the integration tests, but still missed some critical files. Or that time the other day … well, you get the point.

    So I started thinking, what would it look like if I tested the build file. And I started with the simplest thing that could possibly work:

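    check do
      # Open the packaged JAR and inspect its entries.
      Zip::ZipFile.open(package(:jar)) do |jar|
        fail "No MANIFEST.MF" unless jar.entries.include?("META-INF/MANIFEST.MF")
        # The license must be there and must actually contain the license text.
        license = jar.read("META-INF/LICENSE")
        fail "Empty license" unless license =~ /Apache License/
        # %r{} saves us from escaping the slashes in the path.
        classes = jar.entries.select { |entry| entry.to_s =~ %r{org/apache/ode/utils/.*class} }
        fail "No classes" if classes.empty?
      end
    end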

    That one is more defense than foresight: it checks for self-inflicted problems I ran into before, to make sure they won’t happen again. But it’s a good start.

    If you used Ruby for any length of time you’ll immediately recognize two key characteristics of this code. It tests stuff. And it’s crap. A month from now I’ll want to add something else and look at the code and wonder what the hell it does. It’s write only. And that’s not good enough for Ruby.

    I’m a big fan of RSpec, so I decided to write the same thing using RSpec and see what it would look like:

    check do
      describe package(:jar) do
        it "should contain MANIFEST.MF" do
          package(:jar).should contain("META-INF/MANIFEST.MF")
        end
        it "should contain an Apache license file" do
          package(:jar).file("META-INF/LICENSE").should contain(/Apache License/)
        end
        it "should contain classes" do
          package(:jar).should contain("org/apache/ode/utils/*.class")
        end
      end
    end

    Much. Better. Now at least I know what I’m testing. Because, hate it as we may, tests must be maintained.

    But it’s longer, and a few tests like this will quickly turn any build file into a haystack. There’s some redundancy. When you work with RSpec you’re writing test cases outside the code, so you need to organize them into logical units: contexts. We don’t need that here. We already have a context that happens when we build something. We don’t need to isolate it, set it up, or tear it down. So let’s get rid of the unnecessary describe:

    check do
      it "should contain MANIFEST.MF" do
        package(:jar).should contain("META-INF/MANIFEST.MF")
      end
      it "should contain an Apache license file" do
        package(:jar).file("META-INF/LICENSE").should contain(/Apache License/)
      end
      it "should contain classes" do
        package(:jar).should contain("org/apache/ode/utils/*.class")
      end
    end

    So we got rid of the contexts but not entirely, we still have something like that, only it happens to be the object we’re testing. I call them subjects. And we write expectations against the subject. So let’s separate the descriptive part, where we decide on the subject and say what it should do, and the code that complains if it doesn’t:

    check package(:jar), "should contain MANIFEST.MF" do
      it.should contain("META-INF/MANIFEST.MF")
    end
    check package(:jar).file("META-INF/LICENSE"), "should contain an Apache license file" do
      it.should contain(/Apache License/)
    end
    check package(:jar), "should contain classes" do
      it.should contain("org/apache/ode/utils/*.class")
    end

    You might recognize that we moved away from RSpec, which is perfectly fine: we’re testing the build, not running unit tests in isolation from the code. But we are using the ever so sleek should and should_not, and the niceness of expectations and custom matchers.
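
    To give a feel for how little machinery a subject-based check needs, here is a rough Ruby sketch. It is not Buildr’s actual implementation: the Expectation and Contain classes are hypothetical, the subject is a plain array of entry names standing in for a packaged JAR, and the matcher only looks at names, not file contents.

```ruby
# Hypothetical sketch of a subject-based check. Not Buildr's actual code.

# Matcher: true when any entry name matches a glob string or a Regexp.
class Contain
  def initialize(pattern)
    @pattern = pattern
  end

  def matches?(entries)
    entries.any? do |entry|
      @pattern.is_a?(Regexp) ? !!(entry =~ @pattern) : File.fnmatch(@pattern, entry)
    end
  end
end

# Pairs a subject with a description and the block stating the expectation.
class Expectation
  attr_reader :description

  def initialize(subject, description, &block)
    @subject = subject
    @description = description
    @block = block
  end

  # Returns self so the block can say `it.should ...`. On Ruby 3.4+ a bare
  # `it` is the implicit block parameter, which instance_eval also sets to
  # this object, so both readings land in the same place.
  def it
    self
  end

  def contain(pattern)
    Contain.new(pattern)
  end

  def should(matcher)
    raise "#{description}: expectation failed" unless matcher.matches?(@subject)
  end

  def should_not(matcher)
    raise "#{description}: expectation failed" if matcher.matches?(@subject)
  end

  # Evaluates the block against this object. Returns true, or raises on failure.
  def run
    instance_eval(&@block)
    true
  end
end

def check(subject, description, &block)
  Expectation.new(subject, description, &block)
end

# Stand-in for the entry names of a packaged JAR (made-up sample data).
jar_entries = ["META-INF/MANIFEST.MF", "META-INF/LICENSE", "org/apache/ode/utils/Foo.class"]

checks = [
  check(jar_entries, "should contain MANIFEST.MF") { it.should contain("META-INF/MANIFEST.MF") },
  check(jar_entries, "should contain classes") { it.should contain("org/apache/ode/utils/*.class") }
]

checks.each do |expectation|
  expectation.run
  puts "ok: #{expectation.description}"
end
```

    The description and the subject live outside the block, so a build tool could collect all the checks up front and only run them once the package exists.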

    What about turning the entire build file upside down and making it behavior-driven? I think that would work. But the key to testing is saying the same thing twice, once to make it happen and once to prove that it works. Flipping it around would change the syntax but prove nothing. So we do want the duplication of code that builds and expectations that match.

    If you’re using Buildr 1.2, you just got this cool feature (documented here). If you’re using Rake, I can’t imagine it would be too hard to rip the code and use it elsewhere. And if anyone is interested in getting this into the next Rake release, please do!

  4. Patience, Buildr docs coming up

    May 7th, 2007

    Right now if you go to the Buildr web page, it redirects you to the RDocs. As Steve Ivy says:

    This reminds me of the bad old days when all Java code just shipped with the JavaDocs and that was “enough”. The API is not the documentation, folks. It’s a useful part of the documentation, but it’s not enough.

    I agree. RDocs are not acceptable documentation.

    RDocs are reference material, information you need after you know what you’re looking for, and just need to quickly find it.

    As part of developing the code, I also wrote a getting started guide for Buildr. It ended up reading like a WS-* spec. Bad. And it fell out of sync with the last set of changes to Buildr. Also bad.

    So I’m rewriting it from scratch. To be readable, usable and friendly.

    Expect an official Buildr announcement later this week that will include usable documentation. One you can actually learn from. Hopefully, one you’ll enjoy reading.

  5. Buildr, or when Ruby is faster than Java

    May 3rd, 2007

    [Photo]

    Somewhere in my ever expanding list of drafts I’ll never get to finish is another post about the economics of Ruby, and how raw performance is less of a problem when you’re bound by the database, spend less on development, and can optimize in the large. Basically, regurgitating the same justifications I used to explain Java a decade ago.

    But this is not that post.

    Today, I’m going to talk about something else, and share with you an interesting discovery from working on Buildr. There will be no language theologies or abstractions of performance, just the facts.

    It started when Maven hit the fan, and we had enough of the clunky XML pseudo-code, unreliable builds and maintenance straight out of Elm Street. So we replaced Maven with Buildr, a build system written in Ruby. To make our life easier, we designed Buildr to be a drop-in replacement for Maven. We’re building the same code, running the same tests, compiling the same XMLBeans, creating the same Hibernate schemas, sharing the same remote and local repositories.

    All this to say, they’re black box equivalent. Feed them the same project, and they generate the same JARs, WARs and distro files.

    But they’re not entirely equivalent.

    Off the bat, we downsized 5,443 lines of XML abuse spread over 52 files, into a single build script weighing a measly 485 lines. It’s amazing what a real language, with proper variables and (gasp!) functions and objects, can do. Now that our build is down to a healthy BMI, it even looks sexier.

    It works repeatedly, and works like we expect it to, so we spend no time fighting the build, and more time writing new code, fixing test cases and anything else we’re supposed to be doing. Let me tell you a secret: we even get to go home earlier.

    But that’s not all.

    Ladies and gentlemen, the moment you’ve all been waiting for. Where we get to talk raw horsepower and 0-60 times. But before we get to numbers, let’s explain what they mean.

    Remember, we’re using Ruby, not exactly the fastest programming language. We’re using it to build Java code, so we run the Ruby VM alongside the Java VM. There’s a lot of dependency management going on, you need that for reliable builds. And, when it comes to implementation, we chose clear and maintainable over fast and furious.

    So let’s put expectations in perspective. “Fast” means not much slower than Maven. The question is, how close did we hit our target?

    [assaf@casper ode]$ time rake clean install test=off
    real    1m1.827s
    user    0m38.228s
    sys     0m3.464s
    [assaf@casper ode]$ time mvn install  -Dmaven.test.skip=true -o
    real    1m58.082s
    user    2m23.287s
    sys     0m10.160s

    Swooooosh.

    Buildr does a full build in about 52% of the time Maven takes. Nearly twice as fast.

    Consistently.

    Let’s try something else:

    [assaf@casper bpel-runtime]$ touch src/main/java/org/apache/ode/bpel/intercept/ThrottlingInterceptor.java
    [assaf@casper bpel-runtime]$ time rake build test=no
    real    0m5.340s
    user    0m4.612s
    sys     0m0.534s
    [assaf@casper bpel-runtime]$ touch src/main/java/org/apache/ode/bpel/intercept/ThrottlingInterceptor.java
    [assaf@casper bpel-runtime]$ time mvn clean compile -Dmaven.test.skip=true -o
    real    0m14.740s
    user    0m24.684s
    sys     0m0.649s

    I measured full builds and partial builds, compiled different modules, measured downloads and uploads, test cases and what not. In every single case, Buildr performed as well as, or faster than Maven. Buildr flew through the partial build in under 6 seconds while Maven needed almost 15, and on the full build it sat there waiting nearly a full minute for Maven to catch up!

    Wow.
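
    If you want to check the arithmetic, here is a small Ruby sketch that turns the “real” times from the two runs above into ratios. The seconds helper is a hypothetical name, and it only understands the NmS.SSSs format that time printed here.

```ruby
# Hypothetical helper: converts a `time`-style duration such as "1m58.082s"
# into seconds.
def seconds(duration)
  match = duration.match(/\A(\d+)m(\d+(?:\.\d+)?)s\z/)
  raise ArgumentError, "unrecognized duration: #{duration}" unless match
  match[1].to_i * 60 + match[2].to_f
end

# Wall-clock ("real") times copied from the runs above.
runs = {
  "full build"    => { buildr: "1m1.827s", maven: "1m58.082s" },
  "partial build" => { buildr: "0m5.340s", maven: "0m14.740s" }
}

runs.each do |name, times|
  ratio = seconds(times[:maven]) / seconds(times[:buildr])
  puts format("%s: Maven took %.2fx as long as Buildr", name, ratio)
end
# Prints:
#   full build: Maven took 1.91x as long as Buildr
#   partial build: Maven took 2.76x as long as Buildr
```

    Wall-clock time is the number to compare. Maven’s user time is higher than its real time, which suggests it kept both cores busy.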

    Of course, we’re not measuring raw Ruby against pure Java. We’re comparing one implementation against another, where they both do the same thing. Black box equivalent. That’s a real life benchmark.

    Language speed tests are as relevant as promises made by politicians on election day. What really matters is the type of solution you can build, the effort it takes to build and maintain it, and how well it behaves.

    We know the Ruby-based solution performs significantly faster, is much more reliable, requires less work to use and maintain, and took all of 3 months from concept to working release.

    Ruby might be slow, but what you build with it can be devilish fast.

    (*) All tests were run multiple times, to make sure we get repeatable results. Test machine runs a 2GHz Core 2 Duo, Fedora 6, JDK 1.5.11 and Ruby 1.8.5. The non-trivial test case.

    Photo by xxxtoff