The UI is the API

Scraping with Ruby

Assaf Arkin
http://labnotes.org


Scraping !=
letter * 4

:right_tools => "easy"

HTML/HTTP

Simple, everywhere

Scraping: Google or Evil

Data, Structure, Stability

Myth #1: It's not data

Myth #2: It's not structured

<div id="sku-123456" class="product">
  <span class="quantity">2</span>
  <a href="http://bracke.ts">Angle brackets</a>
  for <span class="price">$50</span>
</div>

=> 2 Angle brackets for $50

=> { :quantity=>2, :what=>"Angle brackets",
     :price=>"$50", :url=>"http://bracke.ts" }
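To make the point concrete, here is a minimal sketch that pulls that hash out of the markup above. It is not scrAPI: it assumes the snippet is well-formed XML and uses REXML (bundled with Ruby) as a stand-in parser. The name `extract_product` is made up for this illustration.

```ruby
require "rexml/document"

# The product snippet from the slide, unchanged.
SNIPPET = <<~XHTML
  <div id="sku-123456" class="product">
    <span class="quantity">2</span>
    <a href="http://bracke.ts">Angle brackets</a>
    for <span class="price">$50</span>
  </div>
XHTML

# Hypothetical helper: reads the class-tagged children of the product div.
def extract_product(markup)
  root = REXML::Document.new(markup).root
  link = root.elements["a"]
  {
    :quantity => root.elements["span[@class='quantity']"].text.to_i,
    :what     => link.text,
    :price    => root.elements["span[@class='price']"].text,
    :url      => link.attributes["href"]
  }
end

p extract_product(SNIPPET)
```

The class names and the href carry the structure; the visible text alone would not.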
  

Myth #3: It's not stable

Changing the UI breaks the users

Scraping 101

Service discovery
(Ctrl+U)

<form method="post"
  action="/settings/assafarkin/export">
  <input id="showtags" type="checkbox"
    name="showtags" checked="checked" />
  <label for="showtags">include my tags</label>

Service call
(curl, net/http)

curl -b _user=assafarkin~~~~~~~~~~~~~ \
     -d export=showtags=true\&showextended=true \
     http://del.icio.us/settings/assafarkin/export \
     > ~assaf/backup/delicious.html
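The same service call can be sketched with net/http. This is an illustration only: `build_export_request` and `save_export` are hypothetical names, the cookie value is passed in because the slide elides it, and the endpoint is taken as-is from the curl line (it may no longer respond this way).

```ruby
require "net/http"
require "uri"

EXPORT_URL = URI("http://del.icio.us/settings/assafarkin/export")

# Builds (but does not send) the POST request that the curl command issues.
# user_cookie is the value of the _user cookie, supplied by the caller.
def build_export_request(user_cookie)
  request = Net::HTTP::Post.new(EXPORT_URL)
  request["Cookie"] = "_user=#{user_cookie}"
  # Same payload as curl's -d argument, sent as a URL-encoded form body.
  request.body = "export=showtags=true&showextended=true"
  request.content_type = "application/x-www-form-urlencoded"
  request
end

# Sends the request and writes the response body to path.
def save_export(user_cookie, path)
  response = Net::HTTP.start(EXPORT_URL.host, EXPORT_URL.port) do |http|
    http.request(build_export_request(user_cookie))
  end
  File.write(path, response.body)
end
```

Keeping request construction separate from sending makes it possible to inspect the request without touching the network.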

Structure,
Type,
ID

Structure

<head>
  <title>Scrape me</title>
</head>

Type

<span class="price">$50</span>

ID

<div id="sku-12345">

With style...

Structure

html>head>title

Type

.price

ID

#sku-12345
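A rough sketch of what those three selector shapes mean, written against REXML rather than scrAPI. The `to_xpath` translation is a hypothetical helper that handles only these three forms, and it assumes an element has exactly one class. Real CSS selector engines do far more.

```ruby
require "rexml/document"

# A small well-formed page combining the three earlier snippets.
PAGE = REXML::Document.new(<<~XHTML)
  <html>
    <head><title>Scrape me</title></head>
    <body>
      <div id="sku-12345">
        <span class="price">$50</span>
      </div>
    </body>
  </html>
XHTML

# Hypothetical translation of "#id", ".class" and "a>b>c" into XPath.
def to_xpath(selector)
  case selector
  when /\A\#(.+)\z/ then "//*[@id='#{Regexp.last_match(1)}']"
  when /\A\.(.+)\z/ then "//*[@class='#{Regexp.last_match(1)}']"
  else "/" + selector.split(">").join("/")
  end
end

# Returns the first element matching the selector, or nil.
def select_first(doc, selector)
  REXML::XPath.first(doc, to_xpath(selector))
end

puts select_first(PAGE, "html>head>title").text
puts select_first(PAGE, ".price").text
puts select_first(PAGE, "#sku-12345").name
```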

Select element

process ".price", :price=>:text

Extract value

process ".price", :price=>:text

Store it

process ".price", :price=>:text

Code please...

  process "html>head>title", :title=>:text
  
  process "#sku-12345", :product=>:text

  process ".price", :price=>:text

  => { :title => "Scrape me",
       :product => "Angle brackets",
       :price => "50% off" }
  

Ruby and Meta Programming

Just another language ...

class HelloWorld

  def say_hello
     puts "Hello world"
  end

end

class MyScraper < Scraper::Base

  process "title", :title=>:text
  process ".price", :price=>:text

end

MyScraper.rules.size => 2

MyScraper.scrape(html) =>
  #<struct #<Class:0xb7d9ace0>
  title="Scrape me", price="$50">
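How can a bare `process` line inside a class body add a rule? A toy sketch of the idea, not scrAPI's actual implementation: `process` is an ordinary class method that runs while the class is being defined and appends to a per-class list. `ToyBase` is a made-up stand-in for `Scraper::Base`.

```ruby
# Toy stand-in for Scraper::Base; it only records rules.
class ToyBase
  class << self
    # Each class (and each subclass) keeps its own list of rules.
    def rules
      @rules ||= []
    end

    # Called at class-definition time: remember the selector and
    # the field => extractor mapping for later.
    def process(selector, extractors)
      rules << [selector, extractors]
    end
  end
end

class MyScraper < ToyBase
  process "title", :title => :text
  process ".price", :price => :text
end

puts MyScraper.rules.size
```

Because `@rules` lives on the class object itself, rules declared in one subclass do not leak into the base class or its siblings.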

Methods ...

def truncate(element)
  text(element)[0...50]
end

process "div.description", :excerpt=>:truncate

And blocks ...

process "div.description" do |element|
  self.excerpt = text(element)[0...50]
end
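One way a rule could dispatch to either a named method or a block, as a self-contained sketch. Everything here is assumed for illustration: `ToyRuleRunner` is hypothetical, an "element" is just a String, and `text` simply returns it. scrAPI's real internals may differ.

```ruby
class ToyRuleRunner
  class << self
    def rules
      @rules ||= []
    end

    # Accepts either a field => method mapping or a block.
    def process(selector, mapping = {}, &block)
      rules << [selector, mapping, block]
    end
  end

  attr_accessor :excerpt

  # Stand-in for a text helper: the "element" is already a String here.
  def text(element)
    element
  end

  def truncate(element)
    text(element)[0...50]
  end

  # Applies every rule to one already-selected element.
  def run(element)
    self.class.rules.each do |_selector, mapping, block|
      # A block runs with self set to the scraper instance.
      instance_exec(element, &block) if block
      # A symbol names an instance method whose result is stored in the field.
      mapping.each do |field, extractor|
        public_send("#{field}=", send(extractor, element))
      end
    end
    self
  end
end

class ByMethod < ToyRuleRunner
  process "div.description", :excerpt => :truncate
end

class ByBlock < ToyRuleRunner
  process "div.description" do |element|
    self.excerpt = text(element)[0...50]
  end
end
```

Both subclasses end up with the same 50-character excerpt; the block form simply trades the named method for inline code.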

Auction time!

class EBay < Scraper::Base

  array :auctions

  process "table.ebItemlist tr.single",
          :auctions=>EBayAuction

  result :auctions

end

class EBayAuction < Scraper::Base

  process "h3.ens>a", :description=>:text,
          :url=>"@href"

  process ".ebcPr>span", :price=>:text

  process ".ebPicture img", :image=>"@src"

end

pp EBay.scrape(html)  =>

#<struct
 description="APPLE iPOD nano 4GB MP3 PLAYER WHITE 1 DAY AUCTION",
 url="http://cgi.ebay.com/APPLE-iPOD-nano-4GB-MP3-PLAYER-WHITE-1-DAY-AUCTION_W0QQitemZ160001444906QQihZ006QQcategoryZ118268QQrdZ1QQcmdZViewItem",
 price="$150.02",
 image="ipod-nano_files/1600014449068080_0.jpg">,
#<struct
 description="Apple iPod Nano 4GB MP3 Player 4 GB Black 1K Song i Pod",
 url="http://cgi.ebay.com/Apple-iPod-Nano-4GB-MP3-Player-4-GB-Black-1K-Song-i-Pod_W0QQitemZ200002160358QQihZ010QQcategoryZ118267QQrdZ1QQcmdZViewItem",
 price="$200.00",
 image="ipod-nano_files/2000021603588080_0.jpg">,
 . . .

# Get all auctions from the page
auctions = EBay.scrape(html)

# Size of array
puts "Found #{auctions.size}"

# First item
auction = auctions[0]

puts "First auction:"
puts auction.description
puts auction.url
  

In The Real World

What I learned

Watch for

module Scraper

  class Base

    # Information about the HTML page scraped. A structure with the following
    # attributes:
    # * url -- The URL of the document being scraped. Passed in
    #   the constructor but may have changed if the page was redirected.
    # * original_url -- The original URL of the document being
    #   scraped as passed in the constructor.
    # * encoding -- The encoding of the document.
    # * last_modified -- Value of the Last-Modified header returned
    #   from the server.
    # * etag -- Value of the Etag header returned from the server.
    PageInfo = Struct.new(:url, :original_url, :encoding, :last_modified, :etag)

    class << self

      # :call-seq:
      #   process(symbol?, selector, values?, extractor)
      #   process(symbol?, selector, values?) { |element| ... }
      #
      # Defines a processing rule. A processing rule consists of a selector
      # that matches elements, and an extractor that does something interesting
      # with their values.
  

scrAPI toolkit

http://labnotes.org