There are a lot of ways to scrape HTML.
There are regular expressions, which deal well with text. But HTML is not just text, it’s markup, so you have to deal with elements that are implicitly closed or out of balance. Attributes are sometimes quoted, sometimes not. Nested lists and tables are a challenge. Good regular expressions take a lot of time to write, and are impossible to read.
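To make that fragility concrete, here’s a toy illustration (not from the post): a regex that extracts link URLs from tidy input, but silently misses a perfectly valid variation.

```ruby
# A naive link-extraction regex; it assumes href is always double-quoted.
pattern = /<a href="([^"]+)">/

quoted   = '<a href="http://example.com">link</a>'
unquoted = '<a href=http://example.com>link</a>'  # unquoted attributes are legal HTML

quoted[pattern, 1]    # => "http://example.com"
unquoted[pattern, 1]  # => nil -- the regex misses it without any error
```

Handling unquoted attributes, single quotes, extra whitespace, and reordered attributes is all possible, but each case makes the expression longer and harder to read.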
Or you can clean up the HTML with Tidy, get a DOM and walk the tree. The DOM is much easier to work with: it’s clean markup with a nice API. But you have to do a lot of walking to find the few elements you’re scraping. That’s still too much work.
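For contrast, here’s a minimal sketch (mine, not from the post) of the DOM-walking approach, using Ruby’s bundled REXML on markup that Tidy has already cleaned into well-formed XHTML:

```ruby
require 'rexml/document'

# Assume Tidy has already turned the page into well-formed XHTML.
xhtml = '<div id="menu"><ul><li>Home</li><li>About</li></ul></div>'

doc = REXML::Document.new(xhtml)
items = []
# Walk the tree, collecting the text of every <li> element.
doc.elements.each('//li') { |li| items << li.text }
items  # => ["Home", "About"]
```

It works, but for a real page you end up writing this traversal boilerplate over and over, just to reach the handful of elements you actually care about.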
So how about scraping HTML with style?
I’m talking about CSS selectors. CSS selectors have a simple, elegant syntax. And if you write HTML you already know CSS, so that’s one less thing to learn.
Good chance the elements you’re scraping are also styled, so they’re meant to be found with CSS. And, surprisingly, selectors are good anchors into the page. You can change the HTML a lot before the scraper breaks.
Last year I started using CSS selectors for the microformat parser. It worked very well for microformats, which are better structured than your average HTML page.
Earlier this year I came out with co.mments. It scrapes comments from blog posts, and that means dealing with some complex HTML and hundreds of scraping rules. I wrote a few scrapers to deal with the different blogging platforms, always looking for ways to make them easier to write.
Easy to write is easy to test and easy to fix.
What emerged is a framework for writing scrapers. It uses CSS selectors and a small set of methods that define the processing rules. You’ll be surprised how much you can do with a few lines of code.
Here’s an example that scrapes auctions from eBay:
ebay_auction = Scraper.define do
  process "h3.ens>a", :description=>:text, :url=>"@href"
  process "td.ebcPr>span", :price=>:text
  process "div.ebPicture >a>img", :image=>"@src"
  result :description, :url, :price, :image
end

ebay = Scraper.define do
  array :auctions
  process "table.ebItemlist tr.single", :auctions => ebay_auction
  result :auctions
end
And using the scraper:
auctions = ebay.scrape(html)

# No. of auctions found
puts auctions.size

# First auction:
auction = auctions[0]
puts auction.description
puts auction.url
This is the first official release. The code is stable, I’m using it in production and all the key features work. The documentation definitely needs more work, and it needs more examples.
To install the Gem:
gem install scrapi
To get the bleeding edge code from SVN:
svn co http://labnotes.org/svn/public/ruby/scrapi
If you have ideas for improvement or would like to help with documentation and examples, I would appreciate that.
If you’re using it in your application let me know, I’d like to link to it.
Update: I just added the ever-useful pseudo classes (:nth-child, :empty, etc.), check here for more details.
Update: By popular demand, scrAPI is now available as a Gem. Thanks to RubyForge for hosting. You can still get bleeding edge from the Labnotes SVN, but if you want stable, Gems are your friends.
Update: Having problems figuring out the right CSS Selector to use? Try Firequark:
Firequark automatically extracts css selector for a single or multiple html node(s) from a web page using Firebug (a web development plugin for Firefox). The css selector generated can be given as an input to html screen scrapers like Scrapi to extract information.