
There are a lot of ways to scrape HTML.
There are regular expressions, which deal well with text. But HTML is not just text, it’s markup. So you have to deal with elements that are implicitly closed or out of balance. Attributes are sometimes quoted, sometimes not. Nested lists and tables are a challenge. Good regular expressions take a lot of time to write, and are nearly impossible to read.
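To make the quoting problem concrete, here is a minimal sketch using only Ruby’s built-in regular expressions. The sample markup, the URLs and the method names `naive_links` and `tolerant_links` are made up for illustration. The first pattern handles only the tidy case; the second copes with single-quoted and unquoted attributes, and is already much harder to read.

```ruby
# First attempt: only matches lowercase tags with double-quoted attributes.
def naive_links(html)
  html.scan(/<a href="([^"]*)">/).flatten
end

# More tolerant: case-insensitive, and accepts double-quoted,
# single-quoted or unquoted attribute values.
def tolerant_links(html)
  pattern = /<a\s+[^>]*?href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/i
  # Each match has three capture groups, and only one of them is set.
  html.scan(pattern).map { |groups| groups.compact.first }
end

# Hypothetical sample: the same kind of link written three ways.
html = <<~PAGE
  <a href="http://example.com/one">One</a>
  <a href='http://example.com/two'>Two</a>
  <A HREF=http://example.com/three>Three</A>
PAGE

puts naive_links(html).length     # prints 1
puts tolerant_links(html).length  # prints 3
```

Even the tolerant version is only a sketch: it still knows nothing about nesting, comments or implicitly closed elements.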
Or you can clean up the HTML with Tidy, get a DOM and walk the tree. The DOM is much easier to work with: it’s clean markup with a nice API. But you have to do a lot of walking to find the few elements you’re scraping. That’s still too much work.
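Here is a rough sketch of what that walking looks like. It assumes the page has already been cleaned up into well-formed XHTML (the sample markup below is hypothetical) and uses REXML, which is bundled with Ruby, as a stand-in for whatever DOM library you prefer. The method name `auction_links` is made up for the example.

```ruby
require "rexml/document"

# Walk the tree by hand: find every <tr class="single">, then every
# <h3 class="ens"> below it, then the <a> elements directly inside that.
def auction_links(xhtml)
  doc = REXML::Document.new(xhtml)
  links = []
  doc.root.each_recursive do |node|
    next unless node.name == "tr" && node.attributes["class"] == "single"
    node.each_recursive do |child|
      next unless child.name == "h3" && child.attributes["class"] == "ens"
      child.each_element("a") do |a|
        links << { description: a.text, url: a.attributes["href"] }
      end
    end
  end
  links
end

# Hypothetical, already tidied markup.
xhtml = <<~PAGE
  <html><body>
  <table class="ebItemlist">
    <tr class="header"><td>Item</td></tr>
    <tr class="single"><td><h3 class="ens"><a href="http://example.com/item/1">Red bicycle</a></h3></td></tr>
    <tr class="single"><td><h3 class="ens"><a href="http://example.com/item/2">Blue kettle</a></h3></td></tr>
  </table>
  </body></html>
PAGE

auction_links(xhtml).each do |link|
  puts "#{link[:description]} -> #{link[:url]}"
end
```

That is three nested loops to reach one link, and every extra field you want to extract adds more of the same.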
So how about scraping HTML with style?
I’m talking about CSS selectors. CSS selectors have a simple, elegant syntax. And if you write HTML you already know CSS, so that’s one thing less to learn.
Good chance the elements you’re scraping are also styled, so they’re meant to be found with CSS. And, surprisingly, selectors are good anchors into the page. You can change the HTML a lot before the scraper breaks.
Last year I started using CSS selectors for the microformat parser. It worked very well for microformats, which are better structured than your average HTML page.
Earlier this year I came out with co.mments. It scrapes comments from blog posts, and that means dealing with some complex HTML and hundreds of scraping rules. I wrote a few scrapers to deal with the different blogging platforms, always looking for ways to make them easier to write.
Easy to write is easy to test and easy to fix.
What emerged is a framework for writing scrapers. It uses CSS selectors and a small set of methods that define the processing rules. You’ll be surprised how much you can do with a few lines of code.
Here’s an example that scrapes auctions from eBay:
ebay_auction = Scraper.define do
  process "h3.ens>a", :description=>:text,
                      :url=>"@href"
  process "td.ebcPr>span", :price=>:text
  process "div.ebPicture >a>img", :image=>"@src"
  result :description, :url, :price, :image
end

ebay = Scraper.define do
  array :auctions
  process "table.ebItemlist tr.single",
          :auctions => ebay_auction
  result :auctions
end
And using the scraper:
auctions = ebay.scrape(html)
# No. of auctions found
puts auctions.size
# First auction:
auction = auctions[0]
puts auction.description
puts auction.url
This is the first official release. The code is stable, I’m using it in production and all the key features work. The documentation definitely needs more work, and it needs more examples.
To install the Gem:
gem install scrapi
To get the bleeding edge code from SVN:
svn co http://labnotes.org/svn/public/ruby/scrapi
If you have ideas for improvement or would like to help with documentation and examples, I would appreciate it.
If you’re using it in your application let me know, I’d like to link to it.
Update: I just added the ever useful pseudo-classes (:nth-child, :empty, etc.); check here for more details.
Update: By popular demand, scrAPI is now available as a Gem. Thanks to RubyForge for hosting. You can still get bleeding edge from the Labnotes SVN, but if you want stable, Gems are your friends.
Update: Having problems figuring out the right CSS Selector to use? Try Firequark:
Firequark automatically extracts css selector for a single or multiple html node(s) from a web page using Firebug (a web development plugin for Firefox). The css selector generated can be given as an input to html screen scrapers like Scrapi to extract information.
