Scraping !=
letter * 4
:right_tools => "easy"
HTML/HTTP
Simple, everywhere
Scraping: Google or Evil
Data, Structure, Stability
Myth #1: It's not data
- Products: Amazon
- Prices: eBay
- Facts: Wikipedia
- Links: del.icio.us
- Events: Upcoming
Myth #2: It's not structured
<div id="sku-123456" class="product">
  <span class="quantity">2</span>
  <a href="http://bracke.ts">Angle brackets</a>
  for <span class="price">$50</span>
</div>
=> 2 Angle brackets for $50
=> { :quantity=>2, :what=>"Angle brackets",
:price=>"$50", :url=>"http://bracke.ts" }
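The markup-to-hash step, as a minimal sketch in plain Ruby. It assumes the snippet is well-formed XML so that REXML (bundled with Ruby) can parse it; real pages usually need a more forgiving HTML parser. The names `PRODUCT_HTML` and `product_from` are made up for illustration.

```ruby
require "rexml/document"

PRODUCT_HTML = <<~'MARKUP'
  <div id="sku-123456" class="product">
    <span class="quantity">2</span>
    <a href="http://bracke.ts">Angle brackets</a>
    for <span class="price">$50</span>
  </div>
MARKUP

# Hypothetical helper: turn the product markup into a Hash.
def product_from(markup)
  doc  = REXML::Document.new(markup)
  link = REXML::XPath.first(doc, "//a")
  {
    quantity: REXML::XPath.first(doc, "//span[@class='quantity']").text.to_i,
    what:     link.text,
    price:    REXML::XPath.first(doc, "//span[@class='price']").text,
    url:      link.attributes["href"]
  }
end

p product_from(PRODUCT_HTML)
```

The class and id attributes do the work here: they are what makes the page "structured".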
Myth #3: It's not stable
Changing the UI breaks the users
Scraping 101
Service discovery
(Ctrl+U)
<form method="post"
action="/settings/assafarkin/export">
<input id="showtags" type="checkbox"
name="showtags" checked="checked" />
<label for="showtags">include my tags</label>
Service call
(curl, net/http)
curl -b _user=assafarkin~~~~~~~~~~~~~ \
  -d export=showtags=true\&showextended=true \
  http://del.icio.us/settings/assafarkin/export \
  > ~assaf/backup/delicious.html
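The same call with net/http, as a rough sketch. The cookie value is elided on the slide, so it is passed in by the caller; `build_export_request` and `export_bookmarks` are illustrative names, not part of any library. The request is built separately from sending so it can be inspected first.

```ruby
require "net/http"
require "uri"

# Build (but do not send) the POST that the curl command performs.
# The cookie is supplied by the caller because the slide elides it.
def build_export_request(user, cookie)
  uri = URI("http://del.icio.us/settings/#{user}/export")
  request = Net::HTTP::Post.new(uri)
  request["Cookie"] = cookie
  request.content_type = "application/x-www-form-urlencoded"
  # Same body that curl sends with -d.
  request.body = "export=showtags=true&showextended=true"
  [uri, request]
end

# Send the request and save the response body to a file.
def export_bookmarks(user, cookie, path)
  uri, request = build_export_request(user, cookie)
  response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
  File.write(path, response.body)
end
```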
Structure,
Type,
ID
Structure
<head>
<title>Scrape me</title>
</head>
Type
<span class="price">$50</span>
ID
<div id="sku-12345">
With style...
Structure
html>head>title
Type
.price
ID
#sku-12345
Select element
process ".price", :price=>:text
Extract value
process ".price", :price=>:text
Store it
process ".price", :price=>:text
Code please...
process "html>head>title", :title=>:text
process "#sku-12345", :product=>:text
process ".price", :price=>:text
=> { :title => "Scrape me",
     :product => "Angle brackets",
     :price => "$50" }
Ruby and Meta Programming
Just another language ...
class HelloWorld
  def say_hello
    puts "Hello world"
  end
end
class MyScraper < Scraper::Base
  process "title", :title=>:text
  process ".price", :price=>:text
end
MyScraper.rules.size => 2
MyScraper.scrape(html) =>
#<struct #<Class:0xb7d9ace0>
title="Scrape me", price="$50">
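How can a class body "declare" rules? A toy sketch of the meta-programming trick, assuming nothing about how scrAPI really does it: `process` is a class method that runs while the class body is evaluated and only records the rule. `MiniScraper`, `MyMiniScraper`, `PAGE` and `to_xpath` are invented for this example, it only understands the three selector shapes from the slides, and it only extracts text (the extractor symbol is ignored).

```ruby
require "rexml/document"

# Toy stand-in for a scraper base class; NOT scrAPI's implementation.
class MiniScraper
  class << self
    # Rules declared so far, as [xpath, field] pairs.
    def rules
      @rules ||= []
    end

    # Class-level "macro": called from the class body, it records a rule
    # instead of doing any work.
    def process(selector, mapping)
      mapping.each_key { |field| rules << [to_xpath(selector), field] }
    end

    # Apply every rule and return a Struct with one member per field.
    def scrape(markup)
      doc = REXML::Document.new(markup)
      values = rules.map do |xpath, _field|
        node = REXML::XPath.first(doc, xpath)
        node && node.text
      end
      Struct.new(*rules.map { |_xpath, field| field }).new(*values)
    end

    private

    # Translate "#id", ".class" and "a>b>c" selectors to XPath.
    def to_xpath(selector)
      case selector
      when /\A\#(.+)\z/ then "//*[@id='#{$1}']"
      when /\A\.(.+)\z/ then "//*[@class='#{$1}']"
      else "//#{selector.gsub('>', '/')}"
      end
    end
  end
end

class MyMiniScraper < MiniScraper
  process "title", :title => :text
  process ".price", :price => :text
end

PAGE = <<~'MARKUP'
  <html>
    <head><title>Scrape me</title></head>
    <body><div id="sku-12345"><span class="price">$50</span></div></body>
  </html>
MARKUP

page = MyMiniScraper.scrape(PAGE)
puts MyMiniScraper.rules.size  # prints 2
puts page.title                # prints Scrape me
puts page.price                # prints $50
```

Rules live in a class-level instance variable, so each subclass keeps its own list.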
Methods ...
def truncate(element)
  text(element)[0...50]
end
process "div.description", :excerpt=>:truncate
And blocks ...
process "div.description" do |element|
  self.excerpt = text(element)[0...50]
end
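Method or block, the rule ends up doing the same thing. A toy sketch of how such dispatch could work, assuming a Symbol names an instance method and a block is evaluated with `self` set to the scraper. `TinyRules`, `WithMethod` and `WithBlock` are invented names, and "elements" are plain strings here to keep the example self-contained.

```ruby
# Toy rule runner; not scrAPI's code.
class TinyRules
  def self.rules
    @rules ||= []
  end

  # Accept either a field => method-name mapping or a block.
  def self.process(selector, mapping = {}, &block)
    rules << [selector, mapping, block]
  end

  attr_accessor :excerpt

  # elements: Hash of selector => text, standing in for a parsed page.
  def run(elements)
    self.class.rules.each do |selector, mapping, block|
      next unless elements.key?(selector)
      element = elements[selector]
      # Symbol extractor: call the instance method of that name.
      mapping.each { |field, meth| public_send("#{field}=", send(meth, element)) }
      # Block extractor: run it with self set to this scraper instance.
      instance_exec(element, &block) if block
    end
    self
  end
end

class WithMethod < TinyRules
  def truncate(element)
    element[0...50]
  end
  process "div.description", :excerpt => :truncate
end

class WithBlock < TinyRules
  process "div.description" do |element|
    self.excerpt = element[0...50]
  end
end

puts WithMethod.new.run("div.description" => "a" * 80).excerpt.length  # prints 50
puts WithBlock.new.run("div.description" => "a" * 80).excerpt.length   # prints 50
```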
Auction time!
# Each row is scraped by EBayAuction, so define it first.
class EBayAuction < Scraper::Base
  process "h3.ens>a", :description=>:text,
                      :url=>"@href"
  process ".ebcPr>span", :price=>:text
  process ".ebPicture img", :image=>"@src"
end
class EBay < Scraper::Base
  array :auctions
  process "table.ebItemlist tr.single",
          :auctions=>EBayAuction
  result :auctions
end
pp EBay.scrape(html) =>
#<struct
 description="APPLE iPOD nano 4GB MP3 PLAYER WHITE 1 DAY AUCTION",
 url="http://cgi.ebay.com/APPLE-iPOD-nano-4GB-MP3-PLAYER-WHITE-1-DAY-AUCTION_W0QQitemZ160001444906QQihZ006QQcategoryZ118268QQrdZ1QQcmdZViewItem",
 price="$150.02",
 image="ipod-nano_files/1600014449068080_0.jpg">,
#<struct
 description="Apple iPod Nano 4GB MP3 Player 4 GB Black 1K Song i Pod",
 url="http://cgi.ebay.com/Apple-iPod-Nano-4GB-MP3-Player-4-GB-Black-1K-Song-i-Pod_W0QQitemZ200002160358QQihZ010QQcategoryZ118267QQrdZ1QQcmdZViewItem",
 price="$200.00",
 image="ipod-nano_files/2000021603588080_0.jpg">,
. . .
# Get all auctions from the page
auctions = EBay.scrape(html)
# Size of array
puts "Found #{auctions.size}"
# First item
auction = auctions[0]
puts "First auction:"
puts auction.description
puts auction.url
In The Real World
What I learned
- Easy to write: few lines of code.
- Resilient: broke twice.
- Easy to fix: < 30 min.
- Better than 200Kb/sec.
Watch for
- Bad HTML
- Really bad HTML
- Redirects
- Funky URLs
- Caching
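Redirects and caching, as a hedged sketch with net/http. It assumes the server honors conditional GET via ETag and Last-Modified; `conditional_headers`, `redirect_target` and `fetch` are illustrative names, and the redirect limit of 5 is an arbitrary choice.

```ruby
require "net/http"
require "uri"

# Headers for a conditional GET, built from what a previous fetch returned.
def conditional_headers(etag: nil, last_modified: nil)
  headers = {}
  headers["If-None-Match"] = etag if etag
  headers["If-Modified-Since"] = last_modified if last_modified
  headers
end

# Resolve a Location header, which may be relative, against the current URL.
def redirect_target(current_url, location)
  URI.join(current_url, location).to_s
end

# Fetch a page, following up to `limit` redirects.
# Returns [final_url, response]; a Net::HTTPNotModified response means
# the cached copy is still good.
def fetch(url, headers = {}, limit = 5)
  raise ArgumentError, "too many redirects" if limit.zero?
  uri = URI(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.get(uri.request_uri, headers)
  end
  # 304 is also in the redirection class, so require a Location header.
  if response.is_a?(Net::HTTPRedirection) && response["location"]
    fetch(redirect_target(url, response["location"]), headers, limit - 1)
  else
    [url, response]
  end
end
```

Keeping the final URL matters: relative links in the page resolve against it, not against the URL you started with.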
module Scraper
  class Base
    # Information about the HTML page scraped. A structure with the following
    # attributes:
    # * url -- The URL of the document being scraped. Passed in
    #   the constructor but may have changed if the page was redirected.
    # * original_url -- The original URL of the document being
    #   scraped as passed in the constructor.
    # * encoding -- The encoding of the document.
    # * last_modified -- Value of the Last-Modified header returned
    #   from the server.
    # * etag -- Value of the ETag header returned from the server.
    PageInfo = Struct.new(:url, :original_url, :encoding, :last_modified, :etag)
    class << self
      # :call-seq:
      #   process(symbol?, selector, values?, extractor)
      #   process(symbol?, selector, values?) { |element| ... }
      #
      # Defines a processing rule. A processing rule consists of a selector
      # that matches elements, and an extractor that does something interesting
      # with their value.