Scraping with style: scrAPI toolkit for Ruby

There are a lot of ways to scrape HTML.

There are regular expressions, which deal well with text. But HTML is not just text, it’s markup. So you have to deal with elements that are implicitly closed, or out of balance. Attributes are sometimes quoted, sometimes not. Nested lists and tables are a challenge. Good regular expressions take a lot of time to write, and are nearly impossible to read.
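
To make the quoting problem concrete, here is a small sketch in plain Ruby. The markup and both patterns are made up for illustration; they are not taken from any real scraper. The naive pattern assumes double-quoted attributes and silently misses two of the three links, while the tolerant one is already hard to read.

```ruby
# Sample markup (invented for this example): the same kind of link written three ways.
HTML = <<~PAGE
  <a href="/item/1">Double quoted</a>
  <a href='/item/2'>Single quoted</a>
  <a href=/item/3>Unquoted</a>
PAGE

# Naive pattern: only matches href values wrapped in double quotes.
NAIVE = /<a href="([^"]*)">/

# More tolerant pattern: double quotes, single quotes, or no quotes at all.
TOLERANT = /<a\s+href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))[^>]*>/i

naive_urls = HTML.scan(NAIVE).flatten
# Each match yields three capture groups, only one of which is non-nil.
tolerant_urls = HTML.scan(TOLERANT).map { |groups| groups.compact.first }

puts naive_urls.inspect
puts tolerant_urls.inspect
```

And this still ignores extra attributes before href, line breaks inside tags, and everything else real pages throw at you.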

Or you can clean up the HTML with Tidy, get a DOM and walk the tree. The DOM is much easier to work with: it’s clean markup with a nice API. But you have to do a lot of walking to find the few elements you’re scraping. That’s still too much work.
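
As a rough sketch of what that walking looks like, the example below uses a hypothetical, hand-rolled Node struct instead of a real DOM library, and an invented table of items. Finding the prices means visiting every node and checking its tag, its parent and the parent's class by hand.

```ruby
# Hypothetical minimal DOM node: tag name, attributes, child nodes, optional text.
Node = Struct.new(:name, :attrs, :children, :text)

# Small helper to build nodes with sensible defaults.
def el(name, attrs = {}, children = [], text = nil)
  Node.new(name, attrs, children, text)
end

# Invented document: a table with two rows, each holding a title and a price.
doc = el("table", { "class" => "items" }, [
  el("tr", {}, [
    el("td", { "class" => "title" }, [el("a", { "href" => "/item/1" }, [], "Lamp")]),
    el("td", { "class" => "price" }, [el("span", {}, [], "$12.00")])
  ]),
  el("tr", {}, [
    el("td", { "class" => "title" }, [el("a", { "href" => "/item/2" }, [], "Desk")]),
    el("td", { "class" => "price" }, [el("span", {}, [], "$80.00")])
  ])
])

# Depth-first walk collecting the text of every <span> whose parent is <td class="price">.
def prices(node, parent = nil, found = [])
  if node.name == "span" && parent && parent.name == "td" && parent.attrs["class"] == "price"
    found << node.text
  end
  node.children.each { |child| prices(child, node, found) }
  found
end

puts prices(doc).inspect
```

That is a dozen lines of traversal for one rule, and every additional field you want needs its own set of checks.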

So how about scraping HTML with style?

I’m talking about CSS selectors. CSS selectors have a simple, elegant syntax. And if you write HTML you already know CSS, so that’s one thing less to learn.

Good chance the elements you’re scraping are also styled, so they’re meant to be found with CSS. And, surprisingly, selectors are good anchors into the page. You can change the HTML a lot before the scraper breaks.

Last year I started using CSS selectors for the microformat parser. It worked very well for microformats, which are better structured than your average HTML page.

Earlier this year I came out with co.mments. It scrapes comments from blog posts, and that means dealing with some complex HTML and hundreds of scraping rules. I wrote a few scrapers to deal with the different blogging platforms, always looking for ways to make them easier to write.

Easy to write is easy to test and easy to fix.

What emerged is a framework for writing scrapers. It uses CSS selectors and a small set of methods that define the processing rules. You’ll be surprised how much you can do with a few lines of code.

Here’s an example that scrapes auctions from eBay:

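# Scraper for a single auction row.
ebay_auction = Scraper.define do
  # Link text becomes the description, its href attribute the URL.
  process "h3.ens>a", :description=>:text,
                      :url=>"@href"
  process "td.ebcPr>span", :price=>:text
  process "div.ebPicture >a>img", :image=>"@src"

  # Values returned for each auction.
  result :description, :url, :price, :image
end

# Scraper for the whole listing page.
ebay = Scraper.define do
  # Collect every match into an array instead of keeping only one.
  array :auctions

  # Run the auction scraper on each row of the listing table.
  process "table.ebItemlist tr.single",
          :auctions => ebay_auction

  result :auctions
end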

And using the scraper:

auctions = ebay.scrape(html)

# No. of auctions found
puts auctions.size

# First auction:
auction = auctions[0]
puts auction.description
puts auction.url

This is the first official release. The code is stable: I’m using it in production, and all the key features work. The documentation definitely needs more work, and it needs more examples.

To install the Gem:

gem install scrapi

To get the bleeding edge code from SVN:

svn co http://labnotes.org/svn/public/ruby/scrapi

If you have ideas for improvement or would like to help with documentation and examples, I would appreciate that.

If you’re using it in your application let me know, I’d like to link to it.

Update: I just added the ever-useful pseudo classes (:nth-child, :empty, etc.), check here for more details.

Update: By popular demand, scrAPI is now available as a Gem. Thanks to RubyForge for hosting. You can still get bleeding edge from the Labnotes SVN, but if you want stable, Gems are your friends.

Rdocs are here.

Update: Having problems figuring out the right CSS Selector to use? Try Firequark:

Firequark automatically extracts css selector for a single or multiple html node(s) from a web page using Firebug (a web development plugin for Firefox). The css selector generated can be given as an input to html screen scrapers like Scrapi to extract information.

107 thoughts on “Scraping with style: scrAPI toolkit for Ruby”

  1. I DIG IT! I’ve incorporated scrAPI into a project I’m working on, and am happy with how simple it makes scraping multiple elements on a single page, but what about scraping multiple pages?

  2. Help! When I run a script with scrapi required I get a msgbox saying:

    “The procedure entry point ruby_snprintf could not be located in…msvcrt-ruby191.dll”

  3. I disagree with your scraper code being readable just because someone knows CSS. I have been using CSS for years and your code is unreadable at first glance. It is obfuscated to the power of 10.

    I don’t find it useful. Sorry!

    watir-webdriver can scrape in a far more readable way.

  4. I’m in the same situation as “Soc88” and others above, any recommendation would be warmly received! Thanks “Luke” for the link on more information. Best regards.
