1. Scraping with style: scrAPI toolkit for Ruby

    July 11th, 2006

    There are a lot of ways to scrape HTML.

    There are regular expressions, and they deal well with text. But HTML is not just text, it’s markup. So you have to deal with elements that are implicitly closed, or out of balance. Attributes are sometimes quoted, sometimes not. Nested lists and tables are a challenge. Good regular expressions take a lot of time to write, and are impossible to read.

    Or you can clean up the HTML with Tidy, get a DOM and walk the tree. The DOM is much easier to work with: it’s clean markup with a nice API. But you have to do a lot of walking to find the few elements you’re scraping. That’s still too much work.

    So how about scraping HTML with style?

    I’m talking about CSS selectors. CSS selectors have a simple, elegant syntax. And if you write HTML you already know CSS, so that’s one thing less to learn.

    Good chance the elements you’re scraping are also styled, so they’re meant to be found with CSS. And, surprisingly, selectors are good anchors into the page. You can change the HTML a lot before the scraper breaks.

    Last year I started using CSS selectors for the microformat parser. It worked very well for microformats, which are better structured than your average HTML page.

    Earlier this year I came out with co.mments. It scrapes comments from blog posts, and that means dealing with some complex HTML and hundreds of scraping rules. I wrote a few scrapers to deal with the different blogging platforms, always looking for ways to make them easier to write.

    Easy to write is easy to test and easy to fix.

    What emerged is a framework for writing scrapers. It uses CSS selectors and a small set of methods that define the processing rules. You’ll be surprised how much you can do with a few lines of code.

    Here’s an example that scrapes auctions from eBay:

    require "rubygems"
    require "scrapi"

    # Scraper for a single auction row.
    ebay_auction = Scraper.define do
      process "h3.ens>a", :description=>:text,
                  :url=>"@href"
      process "td.ebcPr>span", :price=>:text
      process "div.ebPicture >a>img", :image=>"@src"

      result :description, :url, :price, :image
    end

    # Scraper for the results page: collects one auction per table row.
    ebay = Scraper.define do
      array :auctions

      process "table.ebItemlist tr.single",
                  :auctions => ebay_auction

      result :auctions
    end

    And using the scraper:

    # html is the page source as a string; you can also pass a URI
    auctions = ebay.scrape(html)
    
    # No. of auctions found
    puts auctions.size
    
    # First auction:
    auction = auctions[0]
    puts auction.description
    puts auction.url

    This is the first official release. The code is stable, I’m using it in production and all the key features work. The documentation definitely needs more work, and it needs more examples.

    To install the Gem:

    gem install scrapi

    To get the bleeding edge code from SVN:

    svn co http://labnotes.org/svn/public/ruby/scrapi

    If you have ideas for improvement or would like to help with documentation and examples, I would appreciate that.

    If you’re using it in your application let me know, I’d like to link to it.

    Update: I just added the ever useful pseudo classes (:nth-child, :empty, etc), check here for more details.

    Update: By popular demand, scrAPI is now available as a Gem. Thanks to RubyForge for hosting. You can still get bleeding edge from the Labnotes SVN, but if you want stable, Gems are your friends.

    Rdocs are here.

    Update: Having problems figuring out the right CSS Selector to use? Try Firequark:

    Firequark automatically extracts css selector for a single or multiple html node(s) from a web page using Firebug (a web development plugin for Firefox). The css selector generated can be given as an input to html screen scrapers like Scrapi to extract information.

    1. Matthieu Riou

      Congratulations, that’s really cool! I might scrape your blog soon, seems easier than RSS :)

    2. iolaire

      Can you comment on how fast this toolkit is? I’ve used Rubyful Soup http://www.crummy.com/software/RubyfulSoup/ on a project and it seems generally very slow. So I’m wondering if this would be a good alternative.

    3. Ruby gets a stylish HTML scraper - scrAPI

      [...] The indefatigable Assaf Arkin has done it again by developing a new Ruby HTML scraping toolkit, scrAPI. Peter Szinek recently wrote a popular article about scraping from Ruby using Manic Miner, RubyfulSoup, REXML, and WWW::Mechanize, but none of these are as immediately useful as scrAPI.. so why? [...]

    4. Assaf

      iolaire,

      My test suite uses a mix of several scrapers and pages that I pulled from different Web sites, a good reflection of what you’ll find on blogs. I’m using Tidy for the cleanup.

      Running it on a 1.8GHz Core Duo, I get to process around 210Kb/s.

    5. iolaire

      Assaf, Thanks, I’ll give it a try in the future.

    6. Chris

      This definitely looks impressive. However, what needs to be require’d to use scrAPI? I built the gem and have require 'scrapi', but it fails when trying to load Tidy.

    7. Assaf

      Chris,

      gem install tidy

      installs the Tidy library for Ruby. But it also needs to find the Tidy DLL (Windows) or shared library (Linux). It looks for those in the directory lib/tidy, so that’s one place to put them.

      Or, set the path using Tidy.path = '…'

    8. Assaf

      I added these instructions to the README file:

      By default scrAPI uses Tidy to clean up the HTML.

      You need to install the Tidy Gem for Ruby:
      gem install tidy

      And the Tidy binary libraries, available here:

      http://tidy.sourceforge.net/

      By default scrAPI looks for the Tidy DLL (Windows) or shared library (Linux) in the directory lib/tidy. That’s one place to put the Tidy library.

      Alternatively, just point Tidy to the library (if you’re running Linux it may already be installed) with:

      Tidy.path = "…."

      For testing purposes, you can also use the built-in HTML parser. It’s useful for testing and getting to grips with scrAPI, but it doesn’t deal well with broken HTML. So for testing only:

      Scraper::Base.parser :html_parser

    9. Chris

      Any idea about OS X support? I pointed Tidy.path to the location, but it fails with a LoadError.

    10. Dennis W

      This looks very nice. Could you please post some other examples? I am a newbie in ruby but this might motivate me. Thanks

    11. Assaf

      On my Mac, which has the default Ruby/Tidy installation that comes with OS X, this works:

      Tidy.path = "/usr/lib/libtidy.dylib"

      (Ruby is 1.8.2, Tidy only says Dec 2004)

    12. Assaf

      Dennis,

      I’ll do a follow up post with more examples. Check the feed, probably in a few days.

    13. Michael Campbell

      How does this handle HTML that *ISN’T* CSS styled? It just… doesn’t? Non-styled HTML is “invisible” to it?

    14. Assaf

      Michael,

      The HTML doesn’t have to be styled. At all.

      For example:

      process "html>head>title", :title=>:text

      gets the title of the page. There’s no way to style the title.

      Anything that’s inside the HTML you can scrape, you don’t need a CSS stylesheet. What scrAPI uses is the CSS selector syntax so it’s real easy to write scraping rules.

      The bonus point is that pages that are styled, are also easy to scrape because the HTML developer took care of making the HTML elements easy to identify. They did it for the purpose of styling, but they made life easy for those scraping the page.

      If you look at the eBay example, eBay styles the auction description using:

      h3.ens {
      ….
      }

      And the scraper looks up the auction link using:

      process "h3.ens>a", :description=>:text

      Notice those are two different selectors, one picks the header (for styling), another picks the anchor (for scraping).

      But because eBay styles their headers, they make them easy to identify using the class “ens”.

    15. Andre Lewis

      Assaf, this is very cool. I’m sure I will use this on a future project.

    16. Like Your Work » Blog Archive » links for 2006-07-14

      [...] Labnotes » Blog Archive » Scraping with style: scrAPI toolkit for Ruby (tags: ruby RubyOnRails rails scraper) [...]

    17. Joshua Sierles

      Are there plans to support HTML attributes? Much of my scraping goes up against old sites with font tags and other goodies, but no CSS in sight.

    18. Assaf

      Joshua,

      You can definitely use attributes. For font:

      process "font[size=3]"

      Color:

      process "table[bgcolor=#ffddee]"

      Or even better, case insensitive:

      process "table[bgcolor=?]", /#ffddee/i

    19. Labnotes » Blog Archive » links for 2006-07-19

      [...] Scraping with style: scrAPI toolkit for Ruby Scraping HTML with style. A toolkit for writing scrapers using CSS selectors and a scraping DSL. (tags: projects ruby scraping) [...]

    20. HTML Blog » Components for Ruby

      [...] Scrapi: a library for parsing HTML pages with CSS selectors. Conceptually it is much more convenient than regular expressions, if only because CSS selectors are much easier for a person to understand. The speed, I suspect, leaves something to be desired. Usage example: [...]

    21. Michael Campbell

      Assaf, I’ve found this to be very powerful so far; like it a lot. But, I think we really need some more concrete examples. The doc in the code is getting me by…a little, but there’s still a lot I don’t understand.

      Any chance of getting some more complete doc out, and a gem?

    22. Assaf

      Michael,

      I definitely need to work on examples, and if anyone can contribute some, I’d love to post them on the blog (with attribution).

      Gem coming up.

    23. Assaf

      I just did a new release that adds :nth-child, :first-child, :empty and more pseudo class goodness. If you haven’t used CSS 3 yet, check out this post. There’s a lot of cool things you can do with the new features.

      I also fixed some bugs I found along the way, and added a CSS selector test suite.

    24. Michael Campbell

      Assaf,

      For us Windows users, the sourceforge windows port blurb says:

      “Hash Table Versions

      For large documents, the hash table versions can be *much* faster. Tidying a complex ~40MiB XML document took 18 minutes to finish, while with the hash table version it took just 14 seconds. For small documents you won’t notice much difference. It comes with a catch though - if you declare new elements after a parsing run (via setting TidyBlockTags etc), it can break. As long as you don’t do that or just keep from re-using a TidyDoc, you are safe.”

      Fast is always good, so was wondering if scrapi is set up such that the hashtable versions will work?

    25. Assaf

      Michael,

      scrAPI doesn’t declare new elements (you can check the default options list in /lib/scraper/reader.rb), and uses the Tidy Gem which AFAIK doesn’t reuse documents. So you can use the hash table version.

    26. Brook

      First off I’m new to Ruby, just started playing with it so please excuse a stupid question.

      For the example above, I inserted the code into a file.
      Installed the GEM
      Moved the libtidy.so over to the proper directory (on linux, so /usr/local/lib)

      Ran the script as shown above and I get this error.

      Please advise what I’m missing, must be simple.

      “/home/myuser/Desktop/scrapi_test.rb:1: uninitialized constant Scraper (NameError)”

      Thanks,

    27. Assaf

      Brook,

      You need to start with:
      require "lib/scrapi"

      to load the Scraper and other related classes before you can use them.

    28. High Earth Orbit » Blog Archive » scrAPI - Microformat Parsing in Ruby

      [...] Talking on #microformats I was pointed to LabNotes newer incarnation of a parser: scrAPI. It’s a much more generic HTML parser/scraper, that can handle getting data from HTML by structure, class, or id. Here is Assaf’s presentation at Mashup Camp II where he gives some good tutorials and discussion about the API. [...]

    29. Labnotes » Mashups: In The Spirit of Simplicity

      [...] And check out the links at the end, under products and technology. He mentions SOAP and he mentions scrAPI. [...]

    30. Bill Kudrle

      I downloaded the code and am impressed with the sophistication. But being relatively new to Ruby, the comments in reader.rb and base.rb still leave many questions, however. It would be SO helpful to just have a SIMPLE example that works. For example, I get scraper_test.rb to work without errors, but am not sure how to then apply that to the above simple example for EBay. For the EBay example above I am not sure what the html value should be (this is where it gives me an error when I try to run it). No doubt to you it is obvious, but to me I am still at the beginning stage.

      Great code it seems, but a couple of simple examples would be TREMENDOUSLY helpful. If I could make it through some simple examples, perhaps I could help you with some more documentation. I am using scraping in my job and am motivated to use your approach rather than the HTree approach that I am using now. Your approach seems to be more elegant if I could just get the first base hit.

    31. Assaf

      Bill,

      To run the eBay example, pick up one of the eBay search results URLs and call ebay.scrape(url).

      For example:
      auctions = ebay.scrape(URI.parse("http://search.ebay.com/ipod-nano_W0QQcatrefZC6QQfromZR3QQfsooZ1QQfsopZ1QQkeywordZonQQsacatZQ2d1QQstrkwZipod"))

    32. john

      This thing was helpful too for understanding a little more about how to extract stuff... but seriously now, for us beginners? How about a step by step guide?

      Thanks!

      http://trac.labnotes.org/cgi-bin/trac.cgi/wiki/Ruby/MicroformatParser

    33. Assaf

      John,

      I’m going to need help with that, just not enough time to write good enough documentation.

      Any volunteers to help write up a scrAPI guide?

    34. High Earth Orbit » Blog Archive » Converting table-based Calendars to hCalendar

      [...] Employing some slick Ruby scripting - and using the very useful scrAPI from Assaf we can define scrapers to walk over the multiple days, and then within those days grab each of the sessions. These are then output into proper hCalendar format like: [...]

    35. Ruby/Rails UI Scraping at Matt Didcoe

      [...] Links: StreetEasy - http://www.streeteasy.com scrAPI toolkit for rails - http://blog.labnotes.org/2006/07/11/scraping-with-style-scrapi-toolkit-for-ruby/ Ruby on Rails podcast - http://podcast.rubyonrails.com  [...]

    36. Mi viaje en tren » Blog Archive » Avistamientos #4

      [...] Y este si esta hecho para “scrapping”. [...]

    37. the ryan king » Progress

      [...] In addition to starting school again (for the last time). I’ve been working hard. I’ve reengineered the technology behind our microformats search at Technorati (currently still in our kitchen, aka ‘labs’). It’s been a fun project especially because I’ve gotten to work with a fun language, framework and libraries (this one too). [...]

    38. Vish

      I’ll help write the docs, if you don’t have a volunteer already. That way I’ll get my doubts cleared too.:-)

      Write me at the included email address to let me know.

    39. Uses for Rcrawl

      [...] I mentioned in a previous post, Announcing Rcrawl, that it uses Assaf Arkin’s scrAPI toolkit for Ruby for link extraction. Now take that a step further and write your own scrAPI, and use Rcrawl to gather the HTML for you. [...]

    40. Uses for Rcrawl at Digital Duckies

      [...] I mentioned in a previous post, Announcing Rcrawl, that it uses Assaf Arkin’s scrAPI toolkit for Ruby for link extraction. Now take that a step further and write your own scrAPI, and use Rcrawl to gather the HTML for you. [...]

    41. epicblog» Blog Archive » scrAPI and redirects without complete URL

      [...] I’ve been using scrAPI and loving it to scrape web pages in ruby. Unfortunately, I got stuck for a while when trying to read a page with a 302 redirect to a URL not beginning with http (see careerbuilder.com for examples). Turns out it’s a straightforward fix. I’ve sent in a bug request, but I’m also providing a patch file and instructions until that gets done. [...]

    42. Assaf

      I’ve added a redirect patch to the code in SVN, will be part of the next Gem release.
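
      If you hit the same problem in your own code, the root of it is a Location header that holds a path instead of a full URL. Here is a minimal sketch of resolving such a header against the URL that was requested, using Ruby's standard URI library. This is not the patch that went into SVN, just an illustration of the idea; the helper name and the URLs are made up.

      ```ruby
      require 'uri'

      # Resolve a redirect target against the URL that was originally requested.
      # location is the raw Location header value and may be absolute or relative.
      # (Hypothetical helper, not part of scrAPI.)
      def resolve_redirect(request_url, location)
        URI.join(request_url, location).to_s
      end

      # A path-only redirect is resolved against the requesting host.
      puts resolve_redirect('http://example.com/jobs/search?q=ruby',
                            '/jobs/results.aspx?page=1')

      # An absolute redirect is returned unchanged.
      puts resolve_redirect('http://example.com/jobs/search?q=ruby',
                            'http://other.example.com/')
      ```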

    43. Ninjawords - a fast online dictionary… fast like a ninja » eightpence - Phil Crosby

      [...] It’s built on Ruby on rails (surprise), and I’m using the poorly documented but excellent scrapi page scraping library to pull definitions from wiktionary that my local dictionary is missing. [...]

    44. Notes » links for 2006-11-29

      [...] Labnotes » Scraping with style: scrAPI toolkit for Ruby (tags: crawler ruby programming) [...]

    45. sting

      Hi…scrAPI doesn’t seem to support double byte html pages….what internal encoding is it using? I tried, and every Chinese character was transformed to ucs encoding….

    46. sting

      Hi…I’ve studied Tidy…it seems to be a Tidy problem..I would like to pass the option “char-encoding: raw” to Tidy but I don’t know how. I tried to edit reader.rb and no luck… Any advice?

      Sting

    47. sting

      hi…sorry to update so soon…I’ve got it working…Thanks. My RoR application just needed a restart to see the change to reader.rb

    48. hone.wornpath.net : Web Scraping in Ruby!

      [...] Posted by hone Fri, 15 Dec 2006 02:58:12 GMT I needed to scrape nba.com for my database final project. I could either write my own regular expressions like I did for my Information Retrieval class or use tools out there like HTree + REXML which will be like parsing an XML document using REXML. A friend recommended Beautiful Soup which is a python scraper. Ruby has one called RubyfulSoup, but I heard it’s fairly slow, though I don’t know the extent of that and I don’t think I’m going to do massive scraping. I decided to use scrAPI, which looked fairly simple to use. Hpricot looked well developed as well and seems to be better documented. I can’t seem to find a lot of scrAPI documentation at the moment. [...]

    49. Assaf

      sting,

      I pass the page encoding from the HTTP response/HTML meta to Ruby, but for the output I use ASCII encoding. Tidy will then create HTML entities for non-ASCII characters. I haven’t yet experimented with double-byte encoding in Ruby, I might do after Rails 1.2 comes out.

    50. Camilo

      Hello, thanks for a fine gem. I’m wondering, how can I deal with nil values in the result set? I’m scraping addresses from an html page of definition-lists (). The odd is without any content which means that the scraped results show a post code in a city field for example. Am I making sense?
      Thanks

    51. Assaf

      Camilo, I’m not sure what you mean by that. Maybe an example?

    52. R&D Party :: Apartment 2D

      [...] set my sights on menupages.com for the next Earthify script. However, with my discovery of ScrAPI and the gloriously easy scraping that it allows, I am now considering moving to Ruby with the [...]

    53. Sean Mountcastle » April NoVA Ruby Users Group

      [...] RegExps), POOH (Plain Old Open-URI and Hpricot),WWW::Mechanize, scRUBYt, WATIR and FireWatir, and scrAPI.  Of the examples shown, Hpricot looks like an excellent HTML parser (though it does require [...]

    54. http://blog.bigsmoke.us/

      Are scrAPI’s API docs published anywhere? I can’t find a link at all, while I noticed that there *is* some documentation in scrAPI’s code.

    55. Assaf

      http://content.labnotes.org/rdoc/scrapi/.

    56. http://blog.bigsmoke.us/

      Ah, thanks for the URL. :-)

    57. Blog of BigSmoke » Web scraping in Ruby: why I had to use scrAPI instead of WWW::Mechanize and Hpricot

      [...] I was crushed, I came across a reference to a Ruby scraper with decent support for CSS3 selectors: scrAPI. Credits for this discovery go to the documentors of scRUBYt, a featurefull scraper layered on top [...]

    58. Mike H

      Awesome.

      Phases of scrAPI usage:

      1. Elation - Wow, this is so easy and powerful. I’m gonna scrape the world!!!!!

      2. Despair - What the hell is the syntax for the selectors, I’m so confused, and there are no docs

      3. Elation - scrAPI has great test coverage, you can learn everything you need to know about the selectors from the tests.

      Seriously, this is awesome. Many thanks.

    59. http://blog.bigsmoke.us/

      Something that isn’t clear from the API docs: is it possible to turn SGML/XML character entities into their character equivalents when scraping? When you’re scraping data for anything other than reinclusion in an HTML document, this is quite essential.

    60. Assaf

      When you’re using Tidy, you can control the output character encoding. Common entities like & and < are always converted, other entities only if they’re supported by the character set.
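
      If entities still end up in the scraped text, one option outside scrAPI is to decode the string afterwards with Ruby's standard CGI library. A minimal sketch, assuming the scraped value is an ordinary string; the sample text is made up. CGI.unescapeHTML decodes the basic named entities (&amp;amp;, &amp;lt;, &amp;gt;, &amp;quot;) and numeric references, and leaves other named entities such as &amp;eacute; as they are.

      ```ruby
      require 'cgi'

      # Sample scraped text containing the basic HTML entities.
      scraped = 'Tom &amp; Jerry &lt;classic&gt; &quot;cartoon&quot;'

      # Decode the entities back into plain characters.
      plain = CGI.unescapeHTML(scraped)
      puts plain

      # Named entities outside the basic set are passed through untouched.
      untouched = CGI.unescapeHTML('caf&eacute;')
      puts untouched
      ```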

    61. Nakul

      Thanks for the great gem Assaf.
      I just ended up making the other part easier (getting CSS Selectors easy way from firebug).

    62. Assaf

      Nakul, thanks, this is super cool.

      Updated the post to include a link to Firequark.

    63. Nick. Bhanji

      When I run the code included below under Mac OS X, I do not get any errors. However, when I run the same code under Linux (centos5, fedora core 7) I get an error. I checked the gems installed, they are the same. The following is the error:

      http://careerbuilder.com/Jobseeker/Jobs/JobResults.aspx?
      IPath=QH&ch=&rs=&s_rawwords=computer%20technician&
      s_jobtypes=ALL&s_freshness=30&s_education=DRNS&
      s_freeloc=sarasota%2Cfl&qsbButton=Find+Jobs+%3E%3E&sd=2
      Scraper::Reader::HTTPRedirectLimitError: Scraper::Reader::HTTPRedirectLimitError
      from /usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/reader.rb:112:in `read_page'
      from /usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/reader.rb:163:in `read_page'
      from /usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/base.rb:876:in `request'
      from /usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/base.rb:861:in `document'
      from /usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/base.rb:749:in `scrape'
      from /usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/base.rb:347:in `scrape'
      from ./script/../config/../config/../app/models/job_search.rb:132:in `get_jobs_CB'
      from (irb):3

      the HTTP string I built based on the input for the user (ie. computer%20technician)

      the function that is being called is the following:

      def get_jobs_CB(str = SEARCH_STRING)
        job_site = CB_SEARCH_STMT1 + remove_spaces(str) + CB_SEARCH_STMT2

        puts job_site

        listing = Scraper.define do
          process "td > a:nth-child(#{CB_TITLE_COL})", :title => :text, :url => '@href'
          process "td:nth-child(#{CB_COMPANY_COL})", :company => :text
          process "td:nth-child(#{CB_LOCATION_COL})", :location => :text
          process "td:nth-child(#{CB_POSTED_COL})", :posted => :text

          result :title, :company, :location, :posted, :url
        end

        listed_jobs = Scraper.define do
          array :data
          process "table#JL_D tr", :data => listing
          result :data
        end

        jobs = listed_jobs.scrape(URI.parse(job_site))

        @job_list = check_empty_list(jobs)
        @job_list = change_date_format(@job_list) if !@job_list.nil?

        return @job_list
      end

      Can someone point to me what I am doing wrong or what is missing.

      Thanks in advance.

      Nick.B

    64. Assaf

      Nick, only thing I can think of is that these two environments might run different versions of Ruby, and the Net::HTTP library might behave differently.

    65. Nick. Bhanji

      I updated the capistrano on both machines and using same version — is there any way i can check the version of Net library?

    66. Assaf

      That one depends on the version of Ruby you’re using, it’s a core library, so just ruby -v. I used it successfully with 1.8.5 and 1.8.6.

    67. Nick Bhanji

      i am using ruby 1.8.6, downloaded fedora core7

    68. Nick

      Hi Assaf,
      I’m thrilled to try this out! However, I am brand new to RoR. Is a step by step guide on the way or do you know anyone who has one already?
      I don’t really know where to do anything and think that this is an awesome project to learn the new language/framework.
      Thanks for your hard work!

    69. SEO

      That’s a very cool piece of code. Thank you for sharing.

      I am trying to scrape pages that have internal links to MP3 files. How can I make the scraper download these files to my local machine?
      Any idea?

      Thank you,
      Ed
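
      Nobody answered this one in the thread, so here is a rough sketch of one way to do it. It assumes you have already scraped the href values with a rule such as :url=>"@href" and only covers the part after that, using Ruby's standard library rather than anything in scrAPI. The helper names, URLs and file names are made up for illustration.

      ```ruby
      require 'net/http'
      require 'uri'

      # Turn scraped hrefs into absolute URLs and keep only the MP3 links.
      # (Hypothetical helper, not part of scrAPI.)
      def mp3_urls(page_url, hrefs)
        hrefs.map { |href| URI.join(page_url, href) }
             .select { |uri| uri.path.downcase.end_with?('.mp3') }
      end

      # Download one file into dir, named after the last path segment.
      # Net::HTTP.get does not follow redirects, so this is a simple case only.
      def download(uri, dir = '.')
        target = File.join(dir, File.basename(uri.path))
        data = Net::HTTP.get(uri)
        File.binwrite(target, data)
        target
      end

      urls = mp3_urls('http://example.com/music/index.html',
                      ['tracks/one.mp3', '/about.html', 'http://cdn.example.com/two.MP3'])
      puts urls

      # Uncomment to actually fetch the files:
      # urls.each { |uri| download(uri) }
      ```

      Relative links are resolved against the page they came from, so it matters that page_url is the URL you actually scraped.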

    70. Dakota

      That’s cool work!
      Thank you assaf.

    71. Michael Hartl

      This looks like an awesome gem, but the content on the site runs off the screen to the right, rendering it virtually unreadable. Thought you’d like to know.

    72. Assaf

      Thanks, Dakota. And sorry about that Michael, one of the comments ran too long, trimmed it and the style is back to normal.

    73. Colin Z

      Hi Assaf,

      I love the simplistic style of the API. It reminds me quite a bit of XSLT templates.

      Have you thought about adding support for XPath selectors in addition to css?

    74. Assaf

      XPath is more complicated than CSS, so that would end up with a different API, not an API I’d want to use.

    75. Brian

      Wow, that’s a great tool. This is exactly what I needed for my research project.

      Great Job Assaf.

    76. Ra

      Wonderful work!
      One question.
      I have an href and I want to extract substrings from it “x,y,z = href.scan(/(\d+).+(\d+-\d+-\d+)\+(\d+%3[Aa]\d+%3[Aa]\d+)/).flatten” and I want x,y,z to be part of the structure returned by result, how?

      10x in advance.
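
      This didn't get an answer in the thread. One approach that needs nothing special from scrAPI is to scrape the href as a plain attribute and split it up afterwards. The sketch below shows only that post-processing step in plain Ruby. The sample href, the Link struct and the parse_href helper are invented for illustration, and the pattern is adapted from the one in the comment: with this sample href, a greedy .+ would leave only the last digit of the year in the date, so it is made lazy here.

      ```ruby
      require 'uri'

      # Pattern adapted from the comment above, with a lazy .+? in the middle.
      HREF_PATTERN = /(\d+).+?(\d+-\d+-\d+)\+(\d+%3[Aa]\d+%3[Aa]\d+)/

      # Hypothetical container for the three extracted parts.
      Link = Struct.new(:id, :date, :time)

      # Returns a Link, or nil when the href does not match the pattern.
      def parse_href(href)
        m = HREF_PATTERN.match(href)
        return nil unless m
        # The time part is URL-encoded (%3A is a colon), so decode it.
        Link.new(m[1], m[2], URI.decode_www_form_component(m[3]))
      end

      link = parse_href('/events/42/show?when=2007-10-05+14%3A30%3A00')
      puts link.id
      puts link.date
      puts link.time
      ```

      Mapping the scraped results through a helper like this gives you one object per link with the three values as named fields.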

    77. Mona

      Thanks Assaf, that is very useful work. I used to use hpricot. But I guess your script is far better.

      Thanks

    78. Fabrizio

      Hi!
      Where do I have to set Tidy.path = "/usr/lib/libtidy.dylib" exactly?
      Can you please post some example of a trivial script?
      Thanks

    79. Ra

      Hi Assaf,
      When iterating over a lot of pages, the lib eats a lot of memory. Do you have any idea about this behaviour?
      I’m starting to debug, but any help is appreciated.

    80. Ra

      Sorry Assaf. My fault, was a problem with mechanize and history.

    81. Michael Staton

      I’ve been using Hpricot and Mechanize for a little while, and I like scrAPI much better. However, there’s a few things i’m at a loss for. For instance, how would I return an array of objects with multiple attributes?

      array :dept, :url
      process "a", :dept => :text,
                   :url => "@href"

      result :dept, :url

      this returns all depts and then all the urls. how can i get it to return an array of both?

    82. Michael Staton

      There’s also an easy way to filter your results in Hpricot that I’d like to use,

      new_doc/"a:contains('faculty')"

      but trying to use Hpricot returns the error

      scrapi_test.rb:5:in `initialize’: No such file or directory - ‘designated_url” (Errno::ENOENT)
      from scrapi_test.rb:5:in `open’
      from scrapi_test.rb:5

      I’ve been trying to use attribute_match but have been unsuccessful.

    83. Michael Staton

      Now I’m getting /usr/local/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/html/selector.rb:454:in `select’: undefined method `tag?’ for “http://www.berkeley.edu/catalog/”:String (NoMethodError)
      from scrapi_test.rb:53
      from scrapi_test.rb:52:in `each’
      from scrapi_test.rb:52

      for

      right = HTML::Selector.new "a[href*=dept]"

    84. Michael Staton

      how do you use attribute match?

    85. Assaf

      Michael, for handling arrays you need to create a scraper for the array, and inside that use a scraper for the individual attributes. The eBay example in the post shows how to do it.

      The error you’re getting is from passing a string to the select method. The method expects an HTML document/element (HTML::Document or HTML::Tag).

    86. James Burt

      Thanks Assaf, this is a very useful piece of information. Would be glad if there is a dummy guide for beginners like me.

      Thanks

    87. Al

      Interesting reading. I am just researching for a TV comparison site / project. My diy code is stuck on 302 redirects, when requests are made by non IE clients.

    88. Assaf

      Al, the read_page method takes an option (:user_agent) that allows you to specify a different user agent. By default it doesn’t send any User-Agent header, and some servers ignore requests without it.
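
      For DIY code written directly on Net::HTTP, which is Ruby's standard library and not scrAPI, setting the header looks roughly like this. The URL and the agent string are placeholders, and the request is only built here, not sent.

      ```ruby
      require 'net/http'
      require 'uri'

      uri = URI.parse('http://example.com/tv/compare')

      # Build a GET request that identifies itself with an explicit User-Agent.
      request = Net::HTTP::Get.new(uri.request_uri)
      request['User-Agent'] = 'Mozilla/5.0 (compatible; MyScraper/0.1)'

      puts request['User-Agent']

      # To actually send it:
      # response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
      ```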

    89. jouer en ligne

      That’s cool work!
      Thank you assaf.

    90. John

      Hi, I really like what you’ve put together here and am trying to use it in a project but am running into a little trouble. I’m trying to define a Scraper dynamically, setting the selector for it based on a variable, but that doesn’t seem to work. Everything seems to be class methods, so creating a new Scraper class and setting instance variables doesn’t work. Putting the variable into a Scraper.define block doesn’t work either, because the contents of the block get evaluated later on and the variable used in the block has no meaning at that point. Any idea how to get around this? Any help at all would be really appreciated.

    91. John

      Nevermind, I figured it out, I was being stupid.
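
      John didn't say what the fix was, so this is only a guess at the kind of thing that trips people up. Assuming the block given to Scraper.define is evaluated in the context of a newly created class (an assumption about scrAPI's internals, not something confirmed in this thread), local variables are still captured by the block, but the caller's instance variables are not visible inside it. The sketch below shows that plain Ruby behaviour with a stand-in MiniScraper class; it does not use scrAPI, and every name in it is hypothetical.

      ```ruby
      # Stand-in for a DSL entry point: evaluates the block inside a fresh class,
      # so self within the block is that class, not the caller.
      class MiniScraper
        class << self
          attr_reader :selectors

          # Record a selector on whichever class is self at the time.
          def process(selector)
            (@selectors ||= []) << selector
          end
        end

        def self.define(&block)
          klass = Class.new(MiniScraper)
          klass.class_eval(&block)
          klass
        end
      end

      class Caller
        def initialize(selector)
          @selector = selector
        end

        def build
          selector = @selector          # copy into a local so the block can see it
          MiniScraper.define do
            process selector            # local variable: captured by the closure
            process @selector.inspect   # instance variable: nil, self is the new class
          end
        end
      end

      scraper = Caller.new('td.price>span').build
      p scraper.selectors
      ```

      The first entry is the selector that was passed in; the second is the string "nil", because @selector is looked up on the new class rather than on the Caller instance.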

    92. Timo

      This has probably saved me a few hours of searching. Thanks, Assaf!
