
There are many ways to scrape HTML.
There are regular expressions, which deal well with text. But HTML is not just text, it’s markup. So you have to deal with elements that are implicitly closed or out of balance. Attributes are sometimes quoted, sometimes not. Nested lists and tables are a challenge. Good regular expressions take a lot of time to write, and are impossible to read.
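To make the quoting problem concrete, here is a small sketch in plain Ruby. The pattern and the sample markup are made up for illustration; the pattern assumes every link puts a double-quoted href first, which real pages rarely guarantee.

```ruby
# A naive pattern: assumes href is the first attribute and is double-quoted.
NAIVE_HREF = /<a\s+href="([^"]*)"/i

# Hypothetical sample markup with four links written four different ways.
html = <<~HTML
  <a href="/item/1">One</a>
  <a href='/item/2'>Two</a>
  <a href=/item/3>Three</a>
  <a class="x" href="/item/4">Four</a>
HTML

# scan returns one array per match; flatten gives the captured URLs.
found = html.scan(NAIVE_HREF).flatten
puts found.inspect
```

Only the first of the four links is captured. Covering single quotes, unquoted values and attribute order is possible, but each case makes the pattern longer and harder to read.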
Or you can clean up the HTML with Tidy, get a DOM and walk the tree. The DOM is much easier to work with: it’s clean markup with a nice API. But you have to do a lot of walking to find the few elements you’re scraping. That’s still too much work.
So how about scraping HTML with style?
I’m talking about CSS selectors. CSS selectors have a simple, elegant syntax. And if you write HTML you already know CSS, so that’s one less thing to learn.
There’s a good chance the elements you’re scraping are also styled, so they’re meant to be found with CSS. And, surprisingly, selectors are good anchors into the page. You can change the HTML a lot before the scraper breaks.
Last year I started using CSS selectors for the microformat parser. It worked very well for microformats, which are better structured than your average HTML page.
Earlier this year I came out with co.mments. It scrapes comments from blog posts, and that means dealing with some complex HTML and hundreds of scraping rules. I wrote a few scrapers to deal with the different blogging platforms, always looking for ways to make them easier to write.
Easy to write is easy to test and easy to fix.
What emerged is a framework for writing scrapers. It uses CSS selectors and a small set of methods that define the processing rules. You’ll be surprised how much you can do with a few lines of code.
Here’s an example that scrapes auctions from eBay:
ebay_auction = Scraper.define do
  process "h3.ens>a", :description=>:text,
                      :url=>"@href"
  process "td.ebcPr>span", :price=>:text
  process "div.ebPicture >a>img", :image=>"@src"
  result :description, :url, :price, :image
end

ebay = Scraper.define do
  array :auctions
  process "table.ebItemlist tr.single",
          :auctions => ebay_auction
  result :auctions
end
And using the scraper:
auctions = ebay.scrape(html)
# No. of auctions found
puts auctions.size
# First auction:
auction = auctions[0]
puts auction.description
puts auction.url
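For a feel of how a block-based definition like the one above can collect its rules, here is a toy sketch in plain Ruby. It is not scrAPI’s implementation: ToyScraper, Rule, and the rules and fields accessors are hypothetical names, and the sketch only records the rules without parsing any HTML or matching any selectors.

```ruby
# Toy rule collector. Hypothetical, NOT how scrAPI is implemented.
class ToyScraper
  # One rule: a CSS selector string plus a hash of field => extractor.
  Rule = Struct.new(:selector, :extractors)

  attr_reader :rules, :fields

  # Evaluate the block against a fresh instance so that bare calls to
  # `process` and `result` inside the block land on that instance.
  def self.define(&block)
    scraper = new
    scraper.instance_eval(&block)
    scraper
  end

  def initialize
    @rules = []
    @fields = []
  end

  # Record a selector and what to extract from matching elements.
  def process(selector, extractors)
    @rules << Rule.new(selector, extractors)
  end

  # Record which fields make up the result.
  def result(*names)
    @fields = names
  end
end

toy = ToyScraper.define do
  process "h3.ens>a", :description => :text, :url => "@href"
  process "td.ebcPr>span", :price => :text
  result :description, :url, :price
end

toy.rules.each { |rule| puts "#{rule.selector} -> #{rule.extractors.inspect}" }
```

The point of the design is that the definition reads as a list of declarations, while the selector matching and extraction (omitted here) can live in one place.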
This is the first official release. The code is stable: I’m using it in production and all the key features work. The documentation definitely needs more work, and more examples.
To install the Gem:
gem install scrapi
To get the bleeding edge code from SVN:
svn co http://labnotes.org/svn/public/ruby/scrapi
If you have ideas for improvement or would like to help with documentation and examples, I would appreciate that.
If you’re using it in your application, let me know. I’d like to link to it.
Update: I just added the ever-useful pseudo-classes (:nth-child, :empty, etc.); check here for more details.
Update: By popular demand, scrAPI is now available as a Gem. Thanks to RubyForge for hosting. You can still get bleeding edge from the Labnotes SVN, but if you want stable, Gems are your friends.
Update: Having problems figuring out the right CSS Selector to use? Try Firequark:
Firequark automatically extracts css selector for a single or multiple html node(s) from a web page using Firebug (a web development plugin for Firefox). The css selector generated can be given as an input to html screen scrapers like Scrapi to extract information.