Based on your feedback, I decided to change the behavior so processing rules no longer “consume” the element they process. Instead, if you decide that you don’t want to process that element (and its children) with any other rule, either call the skip method, or pass the argument :skip=>true. The old behavior was premature optimization (bad), the new one is more explicit and easier to control.
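A minimal sketch of the new behavior (the selectors and names here are illustrative, and it assumes :skip=>true is passed in the same hash as the extractors):

page = Scraper.define do
  array :subheads
  # Grab the post title and skip that element (and its children), so the
  # broader rule below never processes the same h2. Assumes :skip=>true
  # rides along with the extractors; selectors and names are illustrative.
  process "div.post h2.title", :title=>:text, :skip=>true
  # Without the skip, this rule would also see the title’s h2:
  process "h2", :subheads=>:text
  result :title, :subheads
end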
Out of that, I extracted a Microformats helper for Rails. And it was only reasonable that I use one piece of code to produce the output and another piece of code to test it. So I wrote a simple hAtom scraper using scrAPI. It’s an early release that does hAtom and very basic hCard, but it’s worth checking out. It’s also an example of how to write scrapers; I incorporated a few tips and tricks in there.
You can find it in
Last notable change is the addition of a collect() method that gets called before result(). It turned out to be essential: for example, when working with hAtom, if the updated date/time is missing, it defaults to published. That all happens during collect().
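Something along these lines (a sketch only; it assumes the extracted values show up as instance variables inside collect):

hentry = Scraper.define do
  process ".published", :published=>"@title"
  process ".updated",   :updated=>"@title"

  # collect runs after the processing rules and before result, so defaults
  # go here. Assumes extracted values are available as instance variables.
  def collect
    @updated ||= @published
  end

  result :published, :updated
end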
Duane Merrill on mashups:
A new breed of Web-based data integration applications is sprouting up all across the Internet. Colloquially termed mashups, their popularity stems from the emphasis on interactive user participation and the monster-of-Frankenstein-like manner in which they aggregate and stitch together third-party data. The sprouting metaphor is a reasonable one; a mashup Web site is characterized by the way in which it spreads roots across the Web, drawing upon content and functionality retrieved from data sources that lay outside of its organizational boundaries.
And check out the links at the end, under products and technology. He mentions SOAP and he mentions scrAPI.
It’s a good introductory level article, but I do have issues with some of the conclusions. So I’ll use this forum to share my thoughts.
It’s Not Your Daddy’s Scraper
Let’s start with scraping. As Duane puts it:
Screen scraping is often considered an inelegant solution, and for good reasons. It has two primary inherent drawbacks. The first is that, unlike APIs with interfaces, scraping has no specific programmatic contract between content-provider and content-consumer. … Web sites have a tendency to overhaul their look-and-feel periodically to remain fresh and stylish, which imparts severe maintenance headaches on behalf of the scrapers because their tools are likely to fail.
As the theory goes, look and feel changes often, and anytime you change it, anything that depends on the look and feel breaks. At least, that’s the theory.
How about the real world?
Changing the UI will always break the users. As people get accustomed to a site and learn how to navigate it, they become comfortable with the UI. Change the UI and you ruin their experience. The big sites know that. Check out Amazon, eBay, Craigslist, Google. When was the last time they changed their look and feel?
For good reason.
As you’re building a site, you change the UI often. You’re still experimenting, adding features, looking for focus. But brand new sites are not interesting to scrape. Amazon in its first year? You could get more information from a $99 catalog CD. As Amazon grew and attracted more users, it added reviews, recommendations, price comparison, small merchants. It became interesting to scrape, it also became stable.
Small changes do happen routinely, and some scrapers are very susceptible to moving pixels around. Good scrapers aren’t. Most sites have a framework into which they put content pieces. Those pieces are identified by structure, type and ID. Those are anchors that remain stable, even as new content is added, moved around, or the level of detail changes.
In my experience, if you find the right anchors into the HTML, you’re going to get the same stability from the UI as you would from the API. Both will change, and about equally often.
Duane also states:
The second issue is the lack of sophisticated, re-usable screen-scraping toolkit software, colloquially known as scrAPIs. The dearth of such APIs and toolkits is largely due to the extremely application-specific needs of each individual scraping tool. This leads to large development overheads as designers are forced to reverse-engineer content, develop data models, parse, and aggregate raw data from the provider’s site.
And I’m glad to say we moved past that.
scrAPI brings scraping to the level of CSS. If you can style the HTML (and who can’t?) you can scrape it. Using CSS selectors for scraping, which has none of the browser incompatibility woes, is drop dead simple to learn and use. For me, simpler than either SOAP or RDF.
Check Out The Semantics
On to the Semantic Web:
Enter the Semantic Web, which is the vision that the existing Web can be augmented to supplement the content designed for humans with equivalent machine-readable information. In the context of the Semantic Web, the term information is different from data; data becomes information when it conveys meaning (that is, it is understandable).
I’m not a believer. I blogged about it before, so you know where I stand on the big experiment we call the Semantic Web. It makes a great piece for Scientific American, right next to string theory and potential for life on Mars.
The Web is not broken, so there’s no need to fix it.
Do a view source on this blog (Ctrl+U for those of you using Firefox). What do you see? You’re seeing an HTML feed (hAtom to be precise).
It’s the ordered list element marked with hfeed. It’s made of items, properly marked with hentry. Want to find the post title? In scrAPI you’d do “.hfeed .hentry .entry-title”. Incidentally, the CSS does the same thing to style the title’s color and font. How about the published date? Find “.hfeed .hentry .published” and get the value of the title attribute. It’s an ISO timestamp, indifferent to localization and formatting.
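Wire those selectors into a scraper and a handful of lines reads the whole blog (a sketch; the result names are mine):

entry = Scraper.define do
  process ".entry-title", :title=>:text
  process ".published",   :published=>"@title"  # the ISO timestamp
  result :title, :published
end

blog = Scraper.define do
  array :entries
  process ".hfeed .hentry", :entries=>entry
  result :entries
end

entries = blog.scrape(html)  # html is the page source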
This blog is semantically rich, yet it’s plain HTML. You got that right, it’s not even XHTML.
There’s a lot of semantics on the Web, and with the work done by Microformats, it’s only getting better. The best part: it’s DRY. And we love DRY, don’t we?
You pull the data from the database and spit out HTML that holds the data, wrapped in content, ready for presentation. The entire template is one page of PHP. It doesn’t need to duplicate the data into different files of various content types just to please the plurality of potential data formats. It doesn’t stress over XSLT, twist around RDF graphs, or struggle with SOAP envelopes. One format that sums it all up.
And it happens to be one format that millions of people know how to tweak, that search engines index, that available tools can view, print, e-mail, save, load and interact with.
As discussed in the previous section, deriving parsing and acquisition tools and data models requires significant reverse-engineering effort.
The last time I did some “significant reverse-engineering effort”, I was looking at eBay auction lists. I reverse engineered them with view source, and fifteen minutes later had a working scraper. It would take me longer just to hunt down their API documentation, figure out the SOAP and create stubs.
Mashups Are About People With Needs
But how quickly we forget.
Now this is where I get to rant a bit and hope for things to turn out better this time. It’s not specific to this article (I’ve read worse from architecture pundits), but the slippery slope starts here:
Before mashups can make the transition from cool toys to sophisticated applications, much work will have to go into distilling robust standards, protocols, models, and toolkits. For this to happen, major software development industry leaders, content providers, and entrepreneurs will have to find value in mashups, which means viable business models. API providers will need to determine whether or not to charge for their content, and if so, how (for example, by subscription or by per-use).
Mashups have a lot of issues. Duane lists a few, but I can think of many more.
There are no industry standard protocols, no common formats, no integration frameworks; sources are not provisioned, QoS is not negotiated; there’s no end-to-end governance or even the slightest form of active collaboration between consumers and producers. The architecture sucks.
For a reason.
Now, all of that can change if we want it to. But what happens when you bring in the architects? When you get the standards bodies to take control, when you follow vendor release cycles, when you govern and provision, when you hold biz dev meetings? You end up solving all of those issues, and recreating enterprise integration.
Why? Do we really need the long cycles, the high barriers, the costs, the red tape, the “only projects with significant ROI”?
Mashups capture our collective imagination because they’re everything enterprise integration is not. They’re here, now, they work, they’re cheap, they scratch an itch. Mashups are the Lotus 1-2-3 of the Web. They let people solve their own problems in real time, not wait for time-sharing on the mainframe.
Mashups are about people with needs, people who need now over later or never. If you keep it simple and cheap, you can scratch your own itch.
Want to plot real estate prices against Google maps for better visibility into the market? You can get IT to find the budget, architect, negotiate QoS and turn around in a year or two. Or you can go and do it yourself. It will take a day or two. And if it breaks and you can no longer use Google Maps, it will take another day to switch to Yahoo.
If you want to learn about mashups and learn from mashups, learn this one thing. Keep it simple and just do it.
NB: There are a few things that enterprises should be concerned about that don’t affect your weekend fun toy mashup. For example, intellectual property. There’s a risk in bringing it in, and of information seeping out, but those risks are not specific to mashups and are easily solved. My point is, if you try to erect too many barriers and bring up complexity, you’re losing all that mashups have to offer.
Ignore the buzz, pick the real lessons from the world of mashups.
scrAPI is now available as a Ruby Gem. To install:
gem install scrapi
Thanks to the always helpful RubyForge for hosting.
I’m still going to do incremental fixes in SVN, so if you want the bleeding edge, get it from SVN (http://labnotes.org/svn/public/ruby/scrapi). If you want to wait for official releases, Gems are your friends.
This is also an official announcement for release 1.1.2, which includes a bug fix for last-of-type and adds support for multiple negations. For example:
process "h1, h2"
Will process every element that is either header level 1, or header level 2. On the other hand:
process ":not(h1):not(h2)"
Will process all elements but these headers. You can do a lot of interesting things with the :not pseudo class.
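For example, a rule like this (an illustrative one of mine) grabs every link in a post except the permalinks:

process "div.post a:not(.permalink)", :urls=>"@href"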
To get CSS pseudo classes working for scrAPI and assert_select, I had to rewrite the CSS selector parser. I’m sure there’s a Lex and Yacc for Ruby somewhere, but I ended up with a much simpler solution. One I can actually read and fix.
I ended up using sub! and blocks.
Starting with the current expression, I simply sub! the token I’m looking for, testing if it exists and removing it at the same time, reducing the expression by one token.
To test if :empty comes next:
if statement.sub!(/^:empty/, "")
  @pseudo << some_code  # placeholder for the pseudo class handler
  next
end
To deal with tokens that have values, I use blocks. For example:
next if statement.sub!(/^#(\w+)/) do |match|
  id = $1
  attributes << ["id", id]
  ""
end
Again, test for a match, do something with the token, and reduce the expression.
In CSS selectors, identifiers, class names, attributes and pseudo classes can come in any order. So a loop repeats on the expression until it doesn’t find any token it recognizes, or there’s nothing left to parse.
Inside the loop, I could use if and elsif, but I found it’s easier to keep the code readable (less indentation) by repeating on each match and breaking at the end. So the loop looks something like:
while true
  next if statement.sub!(/^#(\w+)/) do |match|
    # handle ID
    ""
  end
  next if statement.sub!(/^\.(\w+)/) do |match|
    # handle class name
    ""
  end
  # And so forth.
  break
end
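To see the whole approach end to end, here’s a self-contained toy version (not the real scrAPI parser, just the same sub!-and-reduce trick on a simplified selector):

def parse_simple_selector(statement)
  attributes, pseudo = [], []
  while true
    # Each sub! both tests for the token and removes it from the expression.
    next if statement.sub!(/^#(\w+)/)  { attributes << ["id", $1]; "" }
    next if statement.sub!(/^\.(\w+)/) { attributes << ["class", $1]; "" }
    next if statement.sub!(/^:empty/)  { pseudo << :empty; "" }
    break  # nothing recognized, or nothing left to parse
  end
  [attributes, pseudo]
end

p parse_simple_selector("#main.post:empty".dup)  # dup because sub! mutates
# => [[["id", "main"], ["class", "post"]], [:empty]]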
If you’re interested, you can check the code here.
Major update to both libraries.
I added a full test suite, and in the process caught and fixed a few bugs, like case-sensitive matching where it shouldn’t be, group selectors not working as expected, and a few other small gotchas.
I also added pseudo classes from CSS 3. Pseudo classes are a bit tricky to explain, so let me show with some examples:
table tr:nth-child(odd)
Selects every other (odd) row in the table.
table tr:nth-child(-n+6)
Selects the first six rows in the table.
table tr:nth-child(6)
Selects the sixth row in the table.
table tr:first-child
Selects the first row in the table.
div p:first-child
Will almost work like you expect it to, but only if the paragraph is the first element in the div. Otherwise, it selects nothing.
div p:first-of-type
Will select the first paragraph in the div, ignoring any elements that are not a paragraph.
div p:not(.post)
Will select all the paragraphs in the div, except those that have the class “post”.
p:not(:empty)
Will select all the paragraphs except the ones that are empty.
Will select all paragraphs that have a single link, no paragraphs that have zero, two or more links.
You can install assert_select as a plugin with:
./script/plugin install http://labnotes.org/svn/public/ruby/rails_plugins/assert_select
To download the scrAPI toolkit for Ruby:
gem install scrapi
svn export http://labnotes.org/svn/public/ruby/scrapi
And if you have cool tricks for scraping that you’d like to share, leave a comment or e-mail me. I’d like to collect them all into a tips & tricks post (with attribution, of course).
To view this presentation you’ll need a computer to which you can download, install and run a Web browser.
Wait … you already have that?
Isn’t HTML great?
That’s half the presentation right there. For all the hype around Web services, SOAP and RDF, sometimes it’s the simple protocols that do the most. So let’s use some HTML/HTTP.
I start by busting a few of the myths surrounding HTML/HTTP and scraping in general. In real life it works better than you’d expect.
How come? Because simple solutions are more resilient than we give them credit for. And what’s easy to build is also easy to test and easy to fix.
The second part shows how to scrape eBay auctions in ten lines of code. There’s also a very brief introduction to Ruby, just enough to understand the examples. If you know Ruby, skip that. If you don’t know Ruby, change that!
And it ends with metrics from a real live application. You know the one I’m talking about.
I wrote the framework so I can spend less time writing scrapers, and more time working on features that matter. So have fun and explore.
N.B. Someone asked if you could use scrAPI for testing Web UIs. Yes. But you might find assert_select easier to use. It shares the same code and style (pun intended) but is geared towards test cases.
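For example, a functional test might look something like this (the action and counts are made up for illustration):

def test_front_page_markup
  get :index
  assert_select ".hfeed .hentry", 10               # exactly ten hAtom entries
  assert_select ".hfeed .hentry .entry-title", 10  # and a title for each one
end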
There are a lot of ways to scrape HTML.
There are regular expressions; they deal well with text. But HTML is not just text, it’s markup, so you have to deal with elements that are implicitly closed, or out of balance. Attributes are sometimes quoted, sometimes not. Nested lists and tables are a challenge. Good regular expressions take a lot of time to write, and are impossible to read.
Or you can clean up the HTML with Tidy, get a DOM and walk the tree. The DOM is much easier to work with, it’s a clean markup with a nice API. But you have to do a lot of walking to find the few elements you’re scraping. That’s still too much work.
So how about scraping HTML with style?
I’m talking about CSS selectors. CSS selectors have a simple, elegant syntax. And if you write HTML you already know CSS, so that’s one thing less to learn.
Good chance the elements you’re scraping are also styled, so they’re meant to be found with CSS. And, surprisingly, selectors are good anchors into the page. You can change the HTML a lot before the scraper breaks.
Last year I started using CSS selectors for the microformat parser. It worked very well for microformats, which are better structured than your average HTML page.
Earlier this year I came out with co.mments. It scrapes comments from blog posts, and that means dealing with some complex HTML and hundreds of scraping rules. I wrote a few scrapers to deal with the different blogging platforms, always looking for ways to make them easier to write.
Easy to write is easy to test and easy to fix.
What emerged is a framework for writing scrapers. It uses CSS selectors and a small set of methods that define the processing rules. You’ll be surprised how much you can do with a few lines of code.
Here’s an example that scrapes auctions from eBay:
ebay_auction = Scraper.define do
  process "h3.ens>a", :description=>:text, :url=>"@href"
  process "td.ebcPr>span", :price=>:text
  process "div.ebPicture >a>img", :image=>"@src"
  result :description, :url, :price, :image
end

ebay = Scraper.define do
  array :auctions
  process "table.ebItemlist tr.single", :auctions => ebay_auction
  result :auctions
end
And using the scraper:
auctions = ebay.scrape(html)
# No. of auctions found
puts auctions.size
# First auction:
auction = auctions[0]
puts auction.description
puts auction.url
This is the first official release. The code is stable, I’m using it in production and all the key features work. The documentation definitely needs more work, and it needs more examples.
To install the Gem:
gem install scrapi
To get the bleeding edge code from SVN:
svn co http://labnotes.org/svn/public/ruby/scrapi
If you have ideas for improvement or would like to help with documentation and examples, I would appreciate that.
If you’re using it in your application let me know, I’d like to link to it.
Update: I just added the ever useful pseudo classes (:nth-child, :empty, etc); check here for more details.
Update: By popular demand, scrAPI is now available as a Gem. Thanks to RubyForge for hosting. You can still get bleeding edge from the Labnotes SVN, but if you want stable, Gems are your friends.
Update: Having problems figuring out the right CSS Selector to use? Try Firequark:
Firequark automatically extracts css selector for a single or multiple html node(s) from a web page using Firebug (a web development plugin for Firefox). The css selector generated can be given as an input to html screen scrapers like Scrapi to extract information.
To get Flickr comments working, I implemented a custom parser that scrapes Flickr pages. Flickr has great styling, but their pages lack semantic markup.
At the Flickr 2.0 party, I brought this up with Eran, who asked why I’m not using the Flickr API. I didn’t know there was a Flickr API for comments. So I opened the Flickr Hacks book — they were giving some away at the party — and looked at the table of contents. Sure enough, they dedicated a few pages to comments. The “official” Flickr API looks something like this:
- You read the HTML page.
- You look up the comments table.
- You parse each individual row.
The table rows are the comments. Except for being written in Perl, the Flickr hack looks exactly like the co.mments Flickr scraper.
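In scrAPI terms, the same hack would look something like this (a sketch only: the selectors are hypothetical, since Flickr’s real markup isn’t this tidy):

comment = Scraper.define do
  process "td.comment-author a", :author=>:text
  process "td.comment-body",     :body=>:text
  result :author, :body
end

flickr_comments = Scraper.define do
  array :comments
  process "table#comments tr", :comments=>comment
  result :comments
end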
When I heard about scrAPIs at MashupCamp, I imagined this is exactly what a scrAPI is. A scrAPI uses HTTP transport, HTML parsing and some custom code for making sense of the data. Each scrAPI has its own custom code, depending on the service being used and what data you’re looking for.
In short, there’s an API, it just requires a little bit of scraping.
Flickr has a scrAPI, as does WordPress, MoveableType, Blogger, MetaFilter and a whole set of other sites. And co.mments uses the scrAPIs of these services (and others) to keep track of new comments.
ScrAPIs and Microformats
How are scrAPIs different from microformats? A microformat is not specific to the underlying service. The microformat for events is the same, whether you’re reading events from Upcoming.org, WordPress or any other source. Microformats work across services.
A scrAPI can use microformats. You can create a scrAPI for events that depends on the use of hCalendar to read event information. A scrAPI can also be specific to a service. A scrAPI for grabbing driving directions from Google Maps, actions items from Basecamp, next week’s events from 30Boxes.
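For example, an events scrAPI built on hCalendar might look something like this (the class names are hCalendar’s; the field names are illustrative):

event = Scraper.define do
  process ".summary",  :summary=>:text
  process ".dtstart",  :starts_at=>"@title"
  process ".location", :location=>:text
  result :summary, :starts_at, :location
end

events = Scraper.define do
  array :events
  process ".vevent", :events=>event
  result :events
end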
Yesterday I had coffee with Thor Muller. Thor’s idea of scrAPIs is a bit different. He proposes that a scrAPI is the API provided by a scraper. Not the service being scraped, but the piece of software doing the scraping.
In his model, Flickr and Basecamp don’t have a scrAPI. But if you develop a piece of software that grabs contacts from both Flickr and Basecamp, and offer an API to access that information, you’ve created a contacts scrAPI.
In both web and enterprise cases, there’s a better solution: build a layer around the non-API-enabled site/application, and provide an API to allow multiple applications to access the underlying application’s data without each of them having to do site/screen scraping.
Scraping or Scraper
If we want to make the Web more usable, we need some way to use unofficial APIs. APIs need to be developed and maintained, and that comes at the expense of other features. Sometimes the best tradeoff is to offer no official API.
But there are ways to build applications that are scrape-friendly. Use semantic HTML, structure the content, apply microformats whenever possible. There’s a lot of good information available out there for Web developers: from using ID attributes to identify microcontent, to using semantic markup like dt/dd and blockquote, to smart (semantic) use of class names.
On the other hand, there’s tremendous value in scraping services and libraries that operate across a number of services. Instead of scraping individual services, one at a time, you tap into a service or use an existing library. All the data without the pain.
And it’s easy to imagine an ecosystem of services and libraries that provide scraping services. In fact, when co.mments opens up its API, it will be doing exactly that. As Thor says:
Some people have asked how important it is that scrAPIs be open source. Put simply, a scrAPI is simply a screen scraper with an open API. But because of the nature of maintaining a scrAPI of any complexity, parsing pages that may change with some frequency, it should ideally harness open source-style collaboration by the developers that use it.
I summarized the pros and cons of each approach:
| Scraping pros | Scraping cons |
| --- | --- |
| Makes more structured data available to the scraper. | Requires a change to the service. |
| Quick and easy to develop. | Not every service provider gets it. |
| One-off and ad hoc applications are possible. | You’re still dealing with one application at a time. |

| Scraper pros | Scraper cons |
| --- | --- |
| One API to work across a variety of services. | Only applies to the 20%. |
| No change to the service. | Limited by scraper-unfriendly services. |
| Open source development model. | Economic value? |
This is not an either/or proposition. Both are valuable and both need to exist. They solve different problems.
But I’m interested to know what you think. Where do you think the value is for the community?
Update: Holly Ward is also joining the scrAPI conversation.