
Duane Merrill on mashups:
A new breed of Web-based data integration applications is sprouting up all across the Internet. Colloquially termed mashups, their popularity stems from the emphasis on interactive user participation and the monster-of-Frankenstein-like manner in which they aggregate and stitch together third-party data. The sprouting metaphor is a reasonable one; a mashup Web site is characterized by the way in which it spreads roots across the Web, drawing upon content and functionality retrieved from data sources that lay outside of its organizational boundaries.
And check out the links at the end, under products and technology. He mentions SOAP and he mentions scrAPI.
I’m honored.
It’s a good introductory-level article, but I do have issues with some of the conclusions. So I’ll use this forum to share my thoughts.
It’s Not Your Daddy’s Scraper
Let’s start with scraping. As Duane puts it:
Screen scraping is often considered an inelegant solution, and for good reasons. It has two primary inherent drawbacks. The first is that, unlike APIs with interfaces, scraping has no specific programmatic contract between content-provider and content-consumer. … Web sites have a tendency to overhaul their look-and-feel periodically to remain fresh and stylish, which imparts severe maintenance headaches on behalf of the scrapers because their tools are likely to fail.
As theory goes, look and feel changes often, and anytime you change it, anything related to look and feel breaks. At least, that’s the theory.
How about the real world?
Changing the UI will always break the users. As people get accustomed to a site and learn how to navigate it, they become comfortable with the UI. Change the UI and you ruin their experience. The big sites know that. Check out Amazon, eBay, Craigslist, Google. When was the last time they changed their look and feel?
For good reason.
As you’re building a site, you change the UI often. You’re still experimenting, adding features, looking for focus. But brand new sites are not interesting to scrape. Amazon in its first year? You could get more information from a $99 catalog CD. As Amazon grew and attracted more users, it added reviews, recommendations, price comparison, small merchants. It became interesting to scrape, and it also became stable.
Small changes do happen routinely, and some scrapers are very susceptible to pixels moving around. Good scrapers aren’t. Most sites have a framework into which they put content pieces. Those pieces are identified by structure, type and ID. Those are anchors that remain stable, even as new content is added or moved around, or the level of detail changes.
In my experience, if you find the right anchors into the HTML, you’re going to get the same stability from the UI as you would from the API. Both will change, but both will change about equally often.
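To make the point about anchors concrete, here is a minimal sketch in Python, standard library only. The two pages, the `price` ID and the helper names (`TextGrabber`, `grab`) are all made up for illustration; this is not scrAPI and not any real site’s markup. It compares a scraper that counts table cells with one that anchors on an ID, before and after a column is added.

```python
from html.parser import HTMLParser


class TextGrabber(HTMLParser):
    """Collect the text of every element accepted by wanted(tag, attrs).

    Illustrative only: assumes the wanted elements have closing tags.
    """

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.tag = None    # tag name of the element being captured
        self.depth = 0     # > 0 while inside a captured element
        self.found = []    # one text string per captured element

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.tag:      # same tag nested inside the capture
                self.depth += 1
        elif self.wanted(tag, dict(attrs)):
            self.tag, self.depth = tag, 1
            self.found.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.found[-1] += data


def grab(html, wanted):
    parser = TextGrabber(wanted)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.found]


# Hypothetical page before a redesign.
OLD_PAGE = """
<table><tr>
  <td>Widget</td><td id="price">$9.99</td><td>In stock</td>
</tr></table>
"""

# Same page after a ratings column was added in front of the price.
NEW_PAGE = """
<table><tr>
  <td>Widget</td><td>4.5 stars</td><td id="price">$9.99</td><td>In stock</td>
</tr></table>
"""


def is_cell(tag, attrs):
    # Fragile: relies on the price being the second cell.
    return tag == "td"


def is_price(tag, attrs):
    # Stable: relies on the ID the site itself uses for the price.
    return attrs.get("id") == "price"


for name, page in (("old", OLD_PAGE), ("new", NEW_PAGE)):
    print(name, "second cell:", grab(page, is_cell)[1])
    print(name, "id=price:   ", grab(page, is_price)[0])
```

The positional scraper picks up the new ratings column after the change, while the ID-anchored one keeps returning the price. That is the whole difference between a scraper that breaks on every tweak and one that doesn’t.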
Duane also states:
The second issue is the lack of sophisticated, re-usable screen-scraping toolkit software, colloquially known as scrAPIs. The dearth of such APIs and toolkits is largely due to the extremely application-specific needs of each individual scraping tool. This leads to large development overheads as designers are forced to reverse-engineer content, develop data models, parse, and aggregate raw data from the provider’s site.
And I’m glad to say we moved past that.
scrAPI brings scraping to the level of CSS. If you can style the HTML – and who can’t? – you can scrape it. Using CSS selectors for scraping, which has none of the browser incompatibility woes, is drop-dead simple to learn and use. For me, simpler than either SOAP or RDF.
Check Out The Semantics
On to the Semantic Web:
Enter the Semantic Web, which is the vision that the existing Web can be augmented to supplement the content designed for humans with equivalent machine-readable information. In the context of the Semantic Web, the term information is different from data; data becomes information when it conveys meaning (that is, it is understandable).
I’m not a believer. I blogged about it before, so you know where I stand on the big experiment we call the Semantic Web. It makes a great piece for Scientific American, right next to string theory and potential for life on Mars.
The Web is not broken, so there’s no need to fix it.
Do a view source on this blog (Ctrl+U for those of you using Firefox). What do you see? You’re seeing an HTML feed (hAtom to be precise).
It’s the ordered list element marked with hfeed. It’s made of items, properly marked with hentry. Want to find the post title? In scrAPI you’d do “.hfeed .hentry .entry-title”. Incidentally, the CSS does the same thing to style the title’s color and font. How about the published date? Find “.hfeed .hentry .published” and get the value of the title attribute. It’s an ISO timestamp, indifferent to localization and formatting.
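For readers who want to see those two selectors at work without installing anything, here is a rough sketch in Python using only the standard library. It is not scrAPI: `ClassChainScraper` and `scrape` are hypothetical helpers that understand just one thing, a chain of class names separated by spaces. The sample markup and its timestamps are invented to mimic an hAtom feed, not copied from this blog.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassChainScraper(HTMLParser):
    """Match a descendant chain of classes, e.g. ".hfeed .hentry .published"."""

    def __init__(self, selector):
        super().__init__()
        self.chain = [part.lstrip(".") for part in selector.split()]
        self.stack = []    # currently open elements: (tag, classes)
        self.open = []     # matches still collecting text: (stack index, record)
        self.results = []  # one record per matched element

    def _matches(self, classes):
        # The element itself must carry the last class in the chain ...
        if self.chain[-1] not in classes:
            return False
        # ... and the earlier classes must appear, in order, on its ancestors.
        wanted = iter(self.chain[:-1])
        need = next(wanted, None)
        for _, ancestor_classes in self.stack:
            if need is not None and need in ancestor_classes:
                need = next(wanted, None)
        return need is None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        if self._matches(classes):
            record = {"text": "", "attrs": attrs}
            self.results.append(record)
            if tag not in VOID_TAGS:
                self.open.append((len(self.stack), record))
        if tag not in VOID_TAGS:
            self.stack.append((tag, classes))

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                self.open = [(d, r) for d, r in self.open if d < i]
                break

    def handle_data(self, data):
        for _, record in self.open:
            record["text"] += data


def scrape(html, selector):
    parser = ClassChainScraper(selector)
    parser.feed(html)
    parser.close()
    return parser.results


# Invented hAtom-style markup; the dates are sample values.
SAMPLE = """
<ol class="hfeed">
  <li class="hentry">
    <h2 class="entry-title">First post</h2>
    <abbr class="published" title="2006-08-10T09:30:00-07:00">August 10</abbr>
  </li>
  <li class="hentry">
    <h2 class="entry-title">Second post</h2>
    <abbr class="published" title="2006-08-11T18:05:00-07:00">August 11</abbr>
  </li>
</ol>
"""

titles = [m["text"].strip() for m in scrape(SAMPLE, ".hfeed .hentry .entry-title")]
dates = [m["attrs"]["title"] for m in scrape(SAMPLE, ".hfeed .hentry .published")]
print(titles)
print(dates)
```

The title comes from the element’s text and the date from the title attribute, so the scraper never has to parse “August 10” or guess at a locale. A real CSS selector engine does far more than this; the sketch only covers the descendant-by-class case used in this post.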
This blog is semantically rich, yet it’s plain HTML. You got that right, it’s not even XHTML.
There’s a lot of semantics on the Web, and with the work done by Microformats, it’s only getting better. The best part: it’s DRY. And we love DRY, don’t we?
You pull the data from the database and spit out HTML that holds the data and wraps it in content, ready for presentation. The entire template is one page of PHP. It doesn’t need to duplicate the data into different files of various content types just to please the plurality of potential data formats. It doesn’t stress over XSLT, twist around RDF graphs, or struggle with SOAP envelopes. One format that sums it all up.
And it happens to be one format that millions of people know how to tweak, that search engines index, that available tools can view, print, e-mail, save, load and interact with.
As discussed in the previous section, deriving parsing and acquisition tools and data models requires significant reverse-engineering effort.
Maybe.
The last time I did some “significant reverse-engineering effort”, I was looking at eBay auction lists. I reverse engineered them with view source, and fifteen minutes later had a working scraper. It would take me longer just to hunt their API documentation, figure out the SOAP and create stubs.
Mashups Are About People With Needs
But how quickly we forget.
Now this is where I get to rant a bit and hope for things to turn out better this time. And it’s not specific to this article; I’ve read worse from architecture pundits. But the slippery slope starts here:
Before mashups can make the transition from cool toys to sophisticated applications, much work will have to go into distilling robust standards, protocols, models, and toolkits. For this to happen, major software development industry leaders, content providers, and entrepreneurs will have to find value in mashups, which means viable business models. API providers will need to determine whether or not to charge for their content, and if so, how (for example, by subscription or by per-use).
Mashups have a lot of issues with them. Duane lists a few, but I can think of many more.
There are no industry standard protocols, no common formats, no integration frameworks. Sources are not provisioned, QoS is not negotiated, and there’s no end-to-end governance or even the slightest form of active collaboration between consumers and producers. The architecture sucks.
For a reason.
Now, all of that can change if we want it to. But what happens when you bring in the architects? When you get the standards bodies to take control, when you follow vendor release cycles, when you govern and provision, when you hold biz dev meetings? You end up solving all of those issues, and recreating enterprise integration.
Why? Do we really need the long cycles, the high barriers, the costs, the red tape, the “only projects with significant ROI”?
Mashups capture our collective imagination because they’re everything enterprise integration is not. They’re here, now, they work, they’re cheap, they scratch an itch. Mashups are the Lotus 1-2-3 of the Web. They let people solve their own problems in real time, not wait for time sharing on the mainframe.
Mashups are about people with needs, people who need now over later or never. If you keep it simple and cheap, you can scratch your own itch.
Want to plot real estate prices against Google Maps for better visibility into the market? You can get IT to find the budget, architect, negotiate QoS and turn it around in a year or two. Or you can go and do it yourself. It will take a day or two. And if it breaks and you can no longer use Google Maps, it will take another day to switch to Yahoo.
If you want to learn about mashups and learn from mashups, learn this one thing. Keep it simple and just do it.
NB: There are a few things that enterprises should be concerned about that don’t affect your weekend fun toy mashup. For example, intellectual property. There’s a risk in bringing it in, and a risk of information seeping out, but those are not specific to mashups and are easily solved. My point is, if you try to erect too many barriers and pile on complexity, you lose all that mashups have to offer.
Ignore the buzz, pick the real lessons from the world of mashups.

