To get Flickr comments working, I implemented a custom parser that scrapes Flickr pages. Flickr has great styling, but their pages lack semantic markup.
At the Flickr 2.0 party, I brought this up with Eran, who asked why I wasn’t using the Flickr API. I didn’t know there was a Flickr API for comments. So I opened the Flickr Hacks book (they were giving some away at the party) and looked at the table of contents. Sure enough, they dedicated a few pages to comments. The “official” Flickr API looks something like this:
- You read the HTML page.
- You look up the comments table.
- You parse each individual row.
The table rows are the comments. Except for being written in Perl, the Flickr hack looks exactly like the co.mments Flickr scraper.
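To make the three steps concrete, here is a minimal sketch in Python using only the standard library. It is not the co.mments scraper or the code from the book. The table id `comments`, the two-cell row layout, and the sample page are assumptions made up for illustration; real Flickr markup differs and changes over time.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def read_page(url):
    """Step 1: read the HTML page (not called below; the demo uses a sample string)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class CommentTableParser(HTMLParser):
    """Steps 2 and 3: find one table by id, then collect the cells of each row."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.in_table = False
        self.row = None    # cells of the row being parsed
        self.cell = None   # text fragments of the cell being parsed
        self.comments = []

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == self.table_id:
            self.in_table = True
        elif self.in_table and tag == "tr":
            self.row = []
        elif self.row is not None and tag == "td":
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self.cell is not None:
            # Collapse runs of whitespace inside the cell text.
            self.row.append(" ".join("".join(self.cell).split()))
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self.comments.append(tuple(self.row))
            self.row = None
        elif tag == "table":
            self.in_table = False


def scrape_comments(html, table_id="comments"):
    parser = CommentTableParser(table_id)
    parser.feed(html)
    parser.close()
    return parser.comments


# Made-up sample page: one comments table, one unrelated table.
SAMPLE_PAGE = """
<html><body>
<table id="comments">
  <tr><td>alice</td><td>Great shot!</td></tr>
  <tr><td>bob</td><td>Love the   colors.</td></tr>
</table>
<table id="other"><tr><td>x</td><td>y</td></tr></table>
</body></html>
"""

for author, text in scrape_comments(SAMPLE_PAGE):
    print(f"{author}: {text}")
```

All the service-specific knowledge sits in the table id and the row layout, which is why a scraper like this breaks whenever the page design changes.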
Scraping APIs
When I heard about scrAPIs at MashupCamp, I imagined this is exactly what a scrAPI is. A scrAPI uses HTTP transport, HTML parsing and some custom code for making sense of the data. Each scrAPI has its own custom code, depending on the service being used and what data you’re looking for.
In short, there’s an API, it just requires a little bit of scraping.
Flickr has a scrAPI, as do WordPress, Movable Type, Blogger, MetaFilter and a whole set of other sites. And co.mments uses the scrAPIs of these services (and others) to keep track of new comments.
ScrAPIs and Microformats
How are scrAPIs different from microformats? A microformat is not specific to the underlying service. The microformat for events is the same, whether you’re reading events from Upcoming.org, WordPress or any other source. Microformats work across services.
A scrAPI can use microformats: you can create a scrAPI for events that relies on hCalendar to read event information. A scrAPI can also be specific to a service, such as one for grabbing driving directions from Google Maps, action items from Basecamp, or next week’s events from 30Boxes.
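As a rough illustration of a microformat-based scrAPI, the sketch below reads hCalendar events from any page that marks them up with the `vevent`, `summary`, `dtstart` and `location` class names. Those class names come from hCalendar itself; the parser class, the sample markup and the events in it are invented for this example, and the sketch covers only a small subset of the format.

```python
from html.parser import HTMLParser

PROPERTIES = ("summary", "dtstart", "location")
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag


class HCalendarParser(HTMLParser):
    """Collect a few hCalendar properties from every element with class 'vevent'."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.event = None     # event currently being read, if any
        self.depth = 0        # element depth inside the current vevent
        self.prop = None      # property whose text is being collected
        self.prop_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.event is None:
            if "vevent" in classes:
                self.event = {}
                self.depth = 1
            return
        self.depth += 1
        for name in PROPERTIES:
            if name in classes:
                if tag == "abbr" and attrs.get("title"):
                    # abbr pattern: the machine-readable value is in the title.
                    self.event[name] = attrs["title"]
                else:
                    self.prop, self.prop_depth = name, self.depth
                    self.event[name] = ""

    def handle_data(self, data):
        if self.prop is not None:
            self.event[self.prop] += data

    def handle_endtag(self, tag):
        if self.event is None or tag in VOID_TAGS:
            return
        if self.prop is not None and self.depth == self.prop_depth:
            self.event[self.prop] = " ".join(self.event[self.prop].split())
            self.prop = None
        self.depth -= 1
        if self.depth == 0:  # the vevent element itself just closed
            self.events.append(self.event)
            self.event = None


def read_events(html):
    parser = HCalendarParser()
    parser.feed(html)
    parser.close()
    return parser.events


# Made-up markup in two different page styles; the same parser reads both.
SAMPLE = """
<div class="vevent">
  <span class="summary">Photo walk</span> on
  <abbr class="dtstart" title="2006-06-01">June 1</abbr> at
  <span class="location">Golden Gate Park</span>
</div>
<ul><li class="vevent"><a class="summary" href="#">Dinner with <b>friends</b></a>
  <abbr class="dtstart" title="2006-06-03">Saturday</abbr></li></ul>
"""

for event in read_events(SAMPLE):
    print(event)
```

Nothing in the parser names a particular site, which is the difference from a service-specific scraper: it depends on the class names, not on where the page comes from.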
Scraper APIs
Yesterday I had coffee with Thor Muller. Thor’s idea of scrAPIs is a bit different. He proposes that a scrAPI is the API provided by a scraper. Not the service being scraped, but the piece of software doing the scraping.
In his model, Flickr and Basecamp don’t have a scrAPI. But if you develop a piece of software that grabs contacts from both Flickr and Basecamp, and offer an API to access that information, you’ve created a contacts scrAPI.
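A small sketch of what that could look like in code, under heavy assumptions: `ContactsScrAPI`, `Contact` and the two stand-in scraper functions are hypothetical names invented here, and the stand-ins return canned data instead of scraping Flickr or Basecamp.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Contact:
    name: str
    source: str  # which service the contact was scraped from


# A scraper takes a username and yields contact names for that user.
Scraper = Callable[[str], Iterable[str]]


class ContactsScrAPI:
    """One API in front of several service-specific scrapers."""

    def __init__(self) -> None:
        self._scrapers: Dict[str, Scraper] = {}

    def register(self, service: str, scraper: Scraper) -> None:
        self._scrapers[service] = scraper

    def contacts(self, user: str) -> List[Contact]:
        found = []
        # Sorted by service name so the result order is predictable.
        for service, scraper in sorted(self._scrapers.items()):
            for name in scraper(user):
                found.append(Contact(name=name, source=service))
        return found


# Hypothetical stand-ins: a real scraper would fetch and parse pages here.
def fake_flickr_scraper(user: str) -> Iterable[str]:
    return ["alice", "bob"]


def fake_basecamp_scraper(user: str) -> Iterable[str]:
    return ["carol"]


api = ContactsScrAPI()
api.register("flickr", fake_flickr_scraper)
api.register("basecamp", fake_basecamp_scraper)

for contact in api.contacts("someuser"):
    print(contact.source, contact.name)
```

Callers only see `contacts()`. When a service changes its pages, only that service's scraper function needs fixing, which is where shared maintenance would pay off.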
Thor makes compelling arguments for his idea. He also raises the point that there could be an ecosystem of scrAPIs, including unifying services and open source libraries.
Sandy Kemsley has an interesting post about scrAPIs that makes the same point:
In both web and enterprise cases, there’s a better solution: build a layer around the non-API-enabled site/application, and provide an API to allow multiple applications to access the underlying application’s data without each of them having to do site/screen scraping.
Scraping or Scraper
If we want to make the Web more usable, we need some way to use unofficial APIs. APIs need to be developed and maintained, and that comes at the expense of other features. Sometimes the best tradeoff is to offer no official API.
But there are ways to build applications that are scrape-friendly. Use semantic HTML, structure the content, apply microformats whenever possible. There’s a lot of good information available out there for Web developers, from using ID attributes to identify microcontent, to using semantic markup like dt/dd and blockquote, to smart (semantic) use of class names.
On the other hand, there’s tremendous value in scraping services and libraries that operate across a number of services. Instead of scraping individual services, one at a time, you tap into a service or use an existing library. All the data without the pain.
And it’s easy to imagine an ecosystem of services and libraries that provide scraping services. In fact, when co.mments opens up its API, it will be doing exactly that. As Thor says:
Some people have asked how important it is that scrAPIs be open source. Put simply, a scrAPI is simply a screen scraper with an open API. But because of the nature of maintaining a scrAPI of any complexity, parsing pages that may change with some frequency, it should ideally harness open source-style collaboration by the developers that use it.
I summarized the pros and cons of each approach:
| Scraping pros | Scraping cons |
| --- | --- |
| Makes more structured data available to the scraper. | Requires a change to the service. |
| Quick and easy to develop. | Not every service provider gets it. |
| One-off and ad hoc applications are possible. | You’re still dealing with one application at a time. |

| Scraper pros | Scraper cons |
| --- | --- |
| One API to work across a variety of services. | Only applies to the 20%. |
| No change to the service. | Limited by scraper-unfriendly services. |
| Open source development model. | Economic value? |
This is not an either/or proposition. Both are valuable and both need to exist. They solve different problems.
But I’m interested to know what you think. Where do you think the value is for the community?
Update: Holly Ward is also joining the scrAPI conversation.