1. Jul 11th, 2006

    Scraping with style: scrAPI toolkit for Ruby

    There are a lot of ways to scrape HTML.

    There are regular expressions, which deal well with text. But HTML is not just text, it’s markup. So you have to deal with elements that are implicitly closed, or out of balance. Attributes are sometimes quoted, sometimes not. Nested lists and tables are a challenge. Good regular expressions take a lot of time to write, and are impossible to read.
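
    To make the quoting problem concrete, here is a minimal sketch in plain Ruby. The sample markup and both patterns are made up for illustration and are not part of scrAPI: the first pattern assumes double-quoted attributes, the second tolerates the three common quoting styles and is already much harder to read.

```ruby
# A naive pattern that assumes every href value is double-quoted.
QUOTED_HREF = /<a\s+href="([^"]*)"/i

# A more tolerant pattern: double-quoted, single-quoted, or unquoted values.
ANY_HREF = /<a\s+[^>]*?href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/i

# Hypothetical sample markup using three different quoting styles.
html = %q{<a href="/one">1</a> <a href='/two'>2</a> <A HREF=/three>3</A>}

# scan returns one array of capture groups per match.
naive    = html.scan(QUOTED_HREF).flatten
tolerant = html.scan(ANY_HREF).map { |groups| groups.compact.first }

puts naive.inspect     # ["/one"]
puts tolerant.inspect  # ["/one", "/two", "/three"]
```

    The naive pattern silently misses two of the three links, which is the kind of failure that is hard to notice until the scraped data comes up short.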

    Or you can clean up the HTML with Tidy, get a DOM and walk the tree. The DOM is much easier to work with: it’s clean markup with a nice API. But you have to do a lot of walking to find the few elements you’re scraping. That’s still too much work.

    So how about scraping HTML with style?

    I’m talking about CSS selectors. CSS selectors have a simple, elegant syntax. And if you write HTML you already know CSS, so that’s one less thing to learn.

    There’s a good chance the elements you’re scraping are also styled, so they’re meant to be found with CSS. And, surprisingly, selectors are good anchors into the page. You can change the HTML a lot before the scraper breaks.

    Last year I started using CSS selectors for the microformat parser. It worked very well for microformats, which are better structured than your average HTML page.

    Earlier this year I came out with co.mments. It scrapes comments from blog posts, and that means dealing with some complex HTML and hundreds of scraping rules. I wrote a few scrapers to deal with the different blogging platforms, always looking for ways to make them easier to write.

    Easy to write is easy to test and easy to fix.

    What emerged is a framework for writing scrapers. It uses CSS selectors and a small set of methods that define the processing rules. You’ll be surprised how much you can do with a few lines of code.

    Here’s an example that scrapes auctions from eBay:

    ebay_auction = Scraper.define do
      process "h3.ens>a", :description=>:text,
                  :url=>"@href"
      process "td.ebcPr>span", :price=>:text
      process "div.ebPicture >a>img", :image=>"@src"
    
      result :description, :url, :price, :image
    end
    
    ebay = Scraper.define do
      array :auctions
    
      process "table.ebItemlist tr.single",
                  :auctions => ebay_auction
    
      result :auctions
    end

    And using the scraper:

    auctions = ebay.scrape(html)
    
    # No. of auctions found
    puts auctions.size
    
    # First auction:
    auction = auctions[0]
    puts auction.description
    puts auction.url

    This is the first official release. The code is stable, I’m using it in production and all the key features work. The documentation definitely needs more work, and it needs more examples.

    To install the Gem:

    gem install scrapi

    To get the bleeding edge code from SVN:

    svn co http://labnotes.org/svn/public/ruby/scrapi

    If you have ideas for improvement or would like to help with documentation and examples, I would appreciate that.

    If you’re using it in your application let me know, I’d like to link to it.

    Update: I just added the ever-useful pseudo classes (:nth-child, :empty, etc), check here for more details.

    Update: By popular demand, scrAPI is now available as a Gem. Thanks to RubyForge for hosting. You can still get bleeding edge from the Labnotes SVN, but if you want stable, Gems are your friends.

    Rdocs are here.

    Update: Having problems figuring out the right CSS Selector to use? Try Firequark:

    Firequark automatically extracts css selector for a single or multiple html node(s) from a web page using Firebug (a web development plugin for Firefox). The css selector generated can be given as an input to html screen scrapers like Scrapi to extract information.

    1. Jul 11th, 2006

      Matthieu Riou

      Congratulations, that’s really cool! I might scrape your blog soon, seems easier than RSS :)

    2. Jul 12th, 2006

      iolaire

      Can you comment on how fast this toolkit is? I’ve used Rubyful Soup http://www.crummy.com/software/RubyfulSoup/ on a project and it generally seems very slow. So I’m wondering if this would be a good alternative.

    3. Jul 12th, 2006

      Ruby gets a stylish HTML scraper – scrAPI

      [...] The indefatigable Assaf Arkin has done it again by developing a new Ruby HTML scraping toolkit, scrAPI. Peter Szinek recently wrote a popular article about scraping from Ruby using Manic Miner, RubyfulSoup, REXML, and WWW::Mechanize, but none of these are as immediately useful as scrAPI.. so why? [...]

    4. Jul 12th, 2006

      Assaf

      iolaire,

      My test suite uses a mix of several scrapers and pages that I pulled from different Web sites, a good reflection of what you’ll find on blogs. I’m using Tidy for the cleanup.

      Running it on a 1.8GHz Core Duo, I get to process around 210Kb/s.

    5. Jul 12th, 2006

      iolaire

      Assaf, Thanks, I’ll give it a try in the future.

    6. Jul 12th, 2006

      Chris

      This definitely looks impressive. However, what needs to be require’d to use scrAPI? I built the gem and have require 'scrapi', but it fails when trying to load Tidy.

    7. Jul 12th, 2006

      Assaf

      Chris,

      gem install tidy

      installs the Tidy library for Ruby. But it also needs to find the Tidy DLL (Windows) or shared library (Linux). It looks for those in the directory lib/tidy, so that’s one place to put them.

      Or, set the path using Tidy.path = '…'

    8. Jul 12th, 2006

      Assaf

      I added these instructions to the README file:

      By default scrAPI uses Tidy to clean up the HTML.

      You need to install the Tidy Gem for Ruby:
      gem install tidy

      And the Tidy binary libraries, available here:

      http://tidy.sourceforge.net/

      By default scrAPI looks for the Tidy DLL (Windows) or shared library (Linux) in the directory lib/tidy. That’s one place to put the Tidy library.

      Alternatively, just point Tidy to the library (if you’re running Linux it may already be installed) with:

      Tidy.path = "…."

      For testing purposes, you can also use the built-in HTML parser. It’s useful for testing and getting to grips with scrAPI, but it doesn’t deal well with broken HTML. So for testing only:

      Scraper::Base.parser :html_parser

    9. Jul 12th, 2006

      Chris

      Any idea about OS X support? I pointed Tidy.path to the location, but it fails with a LoadError.

    10. Jul 12th, 2006

      Dennis W

      This looks very nice. Could you please post some other examples? I am a newbie in ruby but this might motivate me. Thanks

    11. Jul 12th, 2006

      Assaf

      On my Mac, which has the default Ruby/Tidy installation that comes with OS X, this works:

      Tidy.path = "/usr/lib/libtidy.dylib"

      (Ruby is 1.8.2, Tidy only says Dec 2004)

    12. Jul 12th, 2006

      Assaf

      Dennis,

      I’ll do a follow up post with more examples. Check the feed, probably in a few days.

    13. Jul 12th, 2006

      Michael Campbell

      How does this handle HTML that *ISN’T* CSS styled? It just… doesn’t? Non-styled HTML is “invisible” to it?

    14. Jul 12th, 2006

      Assaf

      Michael,

      The HTML doesn’t have to be styled. At all.

      For example:

      process "html>head>title", :title=>:text

      gets the title of the page. There’s no way to style the title.

      Anything that’s inside the HTML you can scrape, you don’t need a CSS stylesheet. What scrAPI uses is the CSS selector syntax so it’s real easy to write scraping rules.

      The bonus point is that pages that are styled, are also easy to scrape because the HTML developer took care of making the HTML elements easy to identify. They did it for the purpose of styling, but they made life easy for those scraping the page.

      If you look at the eBay example, eBay styles the auction description using:

      h3.ens {
      ….
      }

      And the scraper looks up the auction link using:

      process "h3.ens>a", :description=>:text

      Notice those are two different selectors, one picks the header (for styling), another picks the anchor (for scraping).

      But because eBay styles their headers, they make them easy to identify using the class “ens”.

    15. Jul 12th, 2006

      Andre Lewis

      Assaf, this is very cool. I’m sure I will use this on a future project.

    16. Jul 13th, 2006

      Like Your Work » Blog Archive » links for 2006-07-14

      [...] Labnotes » Blog Archive » Scraping with style: scrAPI toolkit for Ruby (tags: ruby RubyOnRails rails scraper) [...]

    17. Jul 17th, 2006

      Joshua Sierles

      Are there plans to support HTML attributes? Much of my scraping goes up against old sites with font tags and other goodies, but no CSS in sight.

    18. Jul 17th, 2006

      Assaf

      Joshua,

      You can definitely use attributes. For font:

      process "font[size=3]"

      Color:

      process "table[bgcolor=#ffddee]"

      Or even better, case insensitive:

      process "table[bgcolor=?]", /#ffddee/i
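
      How scrAPI applies the regular expression to the attribute is not shown in this thread. The plain-Ruby sketch below only illustrates what the /i flag changes, using made-up attribute values of the kind found on old pages.

```ruby
# The pattern from the rule above, case-insensitive thanks to /i.
pattern = /#ffddee/i

# Hypothetical bgcolor values as they might appear in the wild.
values = ["#ffddee", "#FFDDEE", "#FfDdEe", "#ffffff"]

matching = values.select { |value| value =~ pattern }

puts matching.inspect  # ["#ffddee", "#FFDDEE", "#FfDdEe"]
```

      Note that the pattern is not anchored, so it matches anywhere inside the value; wrapping it in \A and \z would be one way to require an exact match.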

    19. Jul 19th, 2006

      Labnotes » Blog Archive » links for 2006-07-19

      [...] Scraping with style: scrAPI toolkit for Ruby Scraping HTML with style. A toolkit for writing scrapers using CSS selectors and a scraping DSL. (tags: projects ruby scraping) [...]

    20. Jul 20th, 2006

      HTML Blog » Components for Ruby

      [...] Scrapi A library for parsing HTML pages with CSS selectors. Conceptually it is a far more convenient tool than regular expressions, if only because CSS selectors are much easier for a person to read. The speed, I suspect, leaves something to be desired. Usage example: [...]

    21. Jul 26th, 2006

      Michael Campbell

      Assaf, I’ve found this to be very powerful so far; like it a lot. But, I think we really need some more concrete examples. The doc in the code is getting me by…a little, but there’s still a lot I don’t understand.

      Any chance of getting some more complete doc out, and a gem?

    22. Jul 26th, 2006

      Assaf

      Michael,

      I definitely need to work on examples, and if anyone can contribute some, I’d love to post them on the blog (with attribution).

      Gem coming up.

    23. Jul 26th, 2006

      Assaf

      I just did a new release that adds :nth-child, :first-child, :empty and more pseudo class goodness. If you haven’t used CSS 3 yet, check out this post. There’s a lot of cool things you can do with the new features.

      I also fixed some bugs I found along the way, and added a CSS selector test suite.

    24. Jul 27th, 2006

      Michael Campbell

      Assaf,

      For us Windows users, the sourceforge windows port blurb says:

      “Hash Table Versions

      For large documents, the hash table versions can be *much* faster. Tidying a complex ~40MiB XML document took 18 minutes to finish, while with the hash table version it took just 14 seconds. For small documents you won’t notice much difference. It comes with a catch though – if you declare new elements after a parsing run (via setting TidyBlockTags etc), it can break. As long as you don’t do that or just keep from re-using a TidyDoc, you are safe.”

      Fast is always good, so was wondering if scrapi is set up such that the hashtable versions will work?

    25. Jul 27th, 2006

      Assaf

      Michael,

      scrAPI doesn’t declare new elements (you can check the default options list in /lib/scraper/reader.rb), and uses the Tidy Gem which AFAIK doesn’t reuse documents. So you can use the hash table version.

    26. Jul 28th, 2006

      Brook

      First off I’m new to Ruby, just started playing with it so please excuse a stupid question.

      For the example above, I inserted the code into a file.
      Installed the GEM
      Moved the libtidy.so over to the proper directory (on linux, so /usr/local/lib)

      Ran the script as shown above and I get this error.

      Please advise what I’m missing, must be simple.

      “/home/myuser/Desktop/scrapi_test.rb:1: uninitialized constant Scraper (NameError)”

      Thanks,

    27. Jul 29th, 2006

      Assaf

      Brook,

      You need to start with:
      require "lib/scrapi"

      to load the Scraper and other related classes before you can use them.

    28. Aug 4th, 2006

      High Earth Orbit » Blog Archive » scrAPI – Microformat Parsing in Ruby

      [...] Talking on #microformats I was pointed to LabNotes newer incarnation of a parser: scrAPI. It’s a much more generic HTML parser/scraper, that can handle getting data from HTML by structure, class, or id. Here is Assaf’s presentation at Mashup Camp II where he gives some good tutorials and discussion about the API. [...]

    29. Aug 21st, 2006

      Labnotes » Mashups: In The Spirit of Simplicity

      [...] And check out the links at the end, under products and technology. He mentions SOAP and he mentions scrAPI. [...]

    30. Aug 30th, 2006

      Bill Kudrle

      I downloaded the code and am impressed with the sophistication. But being relatively new to Ruby, the comments in reader.rb and base.rb still leave many questions, however. It would be SO helpful to just have a SIMPLE example that works. For example, I get scraper_test.rb to work without errors, but am not sure how to then apply that to the above simple example for EBay. For the EBay example above I am not sure what the html value should be (this is where it gives me an error when I try to run it). No doubt to you it is obvious, but to me I am still at the beginning stage.

      Great code it seems, but a couple of simple examples would be TREMENDOUSLY helpful. If I could make it through some simple examples, perhaps I could help you with some more documentation. I am using scraping in my job and am motivated to use your approach rather than the HTree approach that I am using now. Your approach seems to be more elegant if I could just get the first base hit.

    31. Aug 31st, 2006

      Assaf

      Bill,

      To run the eBay example, pick up one of the eBay search results URLs and call ebay.scrape(url).

      For example:
      auctions = ebay.scrape(URI.parse("http://search.ebay.com/ipod-nano_W0QQcatrefZC6QQfromZR3QQfsooZ1QQfsopZ1QQkeywordZonQQsacatZQ2d1QQstrkwZipod"))

    32. Sep 3rd, 2006

      john

      this thing was helpful too to understand a little more how to extract stuff... but seriously now, for us beginners? How about a step by step guide?

      Thanks!

      http://trac.labnotes.org/cgi-bin/trac.cgi/wiki/Ruby/MicroformatParser

    33. Sep 3rd, 2006

      Assaf

      John,

      I’m going to need help with that, just not enough time to write good enough documentation.

      Any volunteers to help write up a scrAPI guide?

    34. Sep 6th, 2006

      High Earth Orbit » Blog Archive » Converting table-based Calendars to hCalendar

      [...] Employing some slick Ruby scripting – and using the very useful scrAPI from Assaf we can define scrapers to walk over the multiple days, and then within those days grab each of the sessions. These are then output into proper hCalendar format like: [...]

    35. Sep 10th, 2006

      Ruby/Rails UI Scraping at Matt Didcoe

      [...] Links: StreetEasy – http://www.streeteasy.com scrAPI toolkit for rails – http://blog.labnotes.org/2006/07/11/scraping-with-style-scrapi-toolkit-for-ruby/ Ruby on Rails podcast – http://podcast.rubyonrails.com  [...]

    36. Sep 12th, 2006

      Mi viaje en tren » Blog Archive » Avistamientos #4

      [...] And this one really is made for “scrapping”. [...]

    37. Sep 15th, 2006

      the ryan king » Progress

      [...] In addition to starting school again (for the last time). I’ve been working hard. I’ve reengineered the technology behind our microformats search at Technorati (currently still in our kitchen, aka ‘labs’). It’s been a fun project especially because I’ve gotten to work with a fun language, framework and libraries (this one too). [...]

    38. Sep 27th, 2006

      Vish

      I’ll help write the docs, if you don’t have a volunteer already. That way I’ll get my doubts cleared too.:-)

      Write me at the included email address to let me know.

    39. Sep 28th, 2006

      Uses for Rcrawl

      [...] I mentioned in a previous post, Announcing Rcrawl, that it uses Assaf Arkin’s scrAPI toolkit for Ruby for link extraction. Now take that a step further and write your own scrAPI, and use Rcrawl to gather the HTML for you. [...]

    40. Oct 1st, 2006

      Uses for Rcrawl at Digital Duckies

      [...] I mentioned in a previous post, Announcing Rcrawl, that it uses Assaf Arkin’s scrAPI toolkit for Ruby for link extraction. Now take that a step further and write your own scrAPI, and use Rcrawl to gather the HTML for you. [...]

    41. Nov 3rd, 2006

      epicblog» Blog Archive » scrAPI and redirects without complete URL

      [...] I’ve been using scrAPI and loving it to scrape web pages in ruby. Unfortunately, I got stuck for a while when trying to read a page with a 302 redirect to a URL not beginning with http (see careerbuilder.com for examples). Turns out it’s a straightforward fix. I’ve sent in a bug request, but I’m also providing a patch file and instructions until that gets done. [...]

    42. Nov 6th, 2006

      Assaf

      I’ve added a redirect patch to the code in SVN, will be part of the next Gem release.
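
      The patch itself is not reproduced in this thread. As a rough sketch of the idea, Ruby’s standard URI library can resolve a relative Location header against the URL that was requested. The helper name resolve_redirect and the example URLs are hypothetical, not taken from scrAPI.

```ruby
require 'uri'

# Resolve a Location header value against the URL that was requested.
# Absolute locations pass through unchanged; relative ones are merged.
def resolve_redirect(request_url, location)
  URI.join(request_url, location).to_s
end

absolute_path = resolve_redirect("http://example.com/jobs/search?q=ruby", "/jobs/results?page=1")
relative_path = resolve_redirect("http://example.com/jobs/search", "results.html")
full_url      = resolve_redirect("http://example.com/a", "http://other.example.org/b")

puts absolute_path  # http://example.com/jobs/results?page=1
puts relative_path  # http://example.com/jobs/results.html
puts full_url       # http://other.example.org/b
```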

    43. Nov 10th, 2006

      Ninjawords – a fast online dictionary… fast like a ninja » eightpence – Phil Crosby

      [...] It’s built on Ruby on rails (surprise), and I’m using the poorly documented but excellent scrapi page scraping library to pull definitions from wiktionary that my local dictionary is missing. [...]

    44. Nov 28th, 2006

      Notes » links for 2006-11-29

      [...] Labnotes » Scraping with style: scrAPI toolkit for Ruby (tags: crawler ruby programming) [...]

    45. Dec 13th, 2006

      sting

      Hi…scrAPI doesn’t seem to support double byte html pages….what’s the internal encoding it’s using? I tried and every Chinese character was transformed to ucs encoding….

    46. Dec 13th, 2006

      sting

      Hi…I’ve studied Tidy…it seems to be a tidy problem..I would like to pass the option “char-encoding: raw” to Tidy but I don’t know how. I tried to edit reader.rb and no luck… Any advice?

      Sting

    47. Dec 13th, 2006

      sting

      hi…sorry to update so soon…I’ve got it working…Thanks. It just needed a reboot of my RoR application to see the change to reader.rb

    48. Dec 14th, 2006

      hone.wornpath.net : Web Scraping in Ruby!

      [...] Posted by hone Fri, 15 Dec 2006 02:58:12 GMT I needed to scrape nba.com for my database final project. I could either write my own regular expressions like I did for my Information Retrieval class or use tools out there like HTree + REXML which will be like parsing a XML document using REXML. A friend recommended Beautiful Soup which is a python scraper. Ruby has one called RubyfulSoup, but I heard it’s fairly slow, though I don’t know the extent of that and I don’t think I’m going to do massive scraping. I decided to use scrAPI, which looked fairly simple to use. Hpricot looked well developed as well and seems to be better documented. I can’t seem to find a lot of scrAPI documentation at the moment. Comments [...]

    49. Dec 17th, 2006

      Assaf

      sting,

      I pass the page encoding from the HTTP response/HTML meta to Ruby, but for the output I use ASCII encoding. Tidy will then create HTML entities for non-ASCII characters. I haven’t yet experimented with double-byte encoding in Ruby, I might do after Rails 1.2 comes out.

    50. Mar 11th, 2007

      Camilo

      Hello, thanks for a fine gem. I’m wondering, how can I deal with nil values in the result set. I’m scraping addresses from an html page of definition-lists (). The odd is without any content which means that the scraped results show a post code in an city field for example. Am I making sense?
      Thanks

    51. Mar 14th, 2007

      Assaf

      Camilo, I’m not sure what you mean by that. Maybe an example?

    52. Mar 29th, 2007

      R&D Party :: Apartment 2D

      [...] set my sights on menupages.com for the next Earthify script. However, with my discovery of ScrAPI and the gloriously easy scraping that it allows, I am now considering moving to Ruby with the [...]

    53. Apr 19th, 2007

      Sean Mountcastle » April NoVA Ruby Users Group

      [...] RegExps), POOH (Plain Old Open-URI and Hpricot),WWW::Mechanize, scRUBYt, WATIR and FireWatir, and scrAPI.  Of the examples shown, Hpricot looks like an excellent HTML parser (though it does require [...]

    54. Apr 29th, 2007

      http://blog.bigsmoke.us/

      Are scrAPI’s API docs published anywhere? I can’t find a link at all, while I noticed that there *is* some documentation in scrAPI’s code.

    55. Apr 29th, 2007

      Assaf

      http://content.labnotes.org/rdoc/scrapi/.

    56. Apr 30th, 2007

      http://blog.bigsmoke.us/

      Ah, thanks for the URL. :-)

    57. May 2nd, 2007

      Blog of BigSmoke » Web scraping in Ruby: why I had to use scrAPI instead of WWW::Mechanize and Hpricot

      [...] I was crushed, I came across a reference to a Ruby scraper with decent support for CSS3 selectors: scrAPI. Credits for this discovery go to the documentors of scRUBYt, a featurefull scraper layered on top [...]

    58. May 24th, 2007

      Mike H

      Awesome.

      Phases of scrAPI usage:

      1. Elation – Wow, this is so easy and powerful. I’m gonna scrape the world!!!!!

      2. Despair – What the hell is the syntax for the selectors, I’m so confused, and there are no docs

      3. Elation – scrAPI has great test coverage, you can learn everything you need to know about the selectors from the tests.

      Seriously, this is awesome. Many thanks.

    59. Jul 12th, 2007

      http://blog.bigsmoke.us/

      Something that isn’t clear from the API docs: is it possible to turn SGML/XML character entities into their character equivalents when scraping? When you’re scraping data for anything other than reinclusion in an HTML document, this is quite essential.

    60. Jul 12th, 2007

      Assaf

      When you’re using Tidy, you can control the output character encoding. Common entities like & and < are always converted, other entities only if they’re supported by the character set.
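
      If entities do survive in the scraped text, one option outside scrAPI is to decode them afterwards in plain Ruby. The sketch below uses CGI.unescapeHTML from the standard library on a made-up string; treat it as an assumption about a workable post-processing step, not as something scrAPI does for you.

```ruby
require 'cgi'

# Hypothetical scraped text that still contains entities.
scraped = "Fish &amp; Chips &lt;fresh&gt; caf&#233;"

# Decodes &amp;, &lt;, &gt;, &quot; and numeric character references.
decoded = CGI.unescapeHTML(scraped)

puts decoded  # Fish & Chips <fresh> café
```

      CGI.unescapeHTML only knows a handful of named entities plus numeric references, so a named entity such as &eacute; would be left as it is.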

    61. Sep 5th, 2007

      Nakul

      Thanks for the great gem Assaf.
      I just ended up making the other part easier (getting CSS Selectors easy way from firebug).

    62. Sep 5th, 2007

      Assaf

      Nakul, thanks, this is super cool.

      Updated the post to include a link to Firequark.

    63. Sep 20th, 2007

      Nick. Bhanji

      when I run the code included below, under macosx — i do not get any errors. However, when I run the same code under Linux (centos5, fedora core 7) I get error. I checked the gems installed, they are the same. the following is the error:

      http://careerbuilder.com/Jobseeker/Jobs/JobResults.aspx?
      IPath=QH&ch=&rs=&s_rawwords=computer%20technician
      &amp;s_jobtypes=ALL&s_freshness=30&s_education=DRNS
      &amp;s_freeloc=sarasota%2Cfl&qsbButton=Find+Jobs+%3E%3E&sd=2
      Scraper::Reader::HTTPRedirectLimitError: Scraper::Reader::HTTPRedirectLimitError
      from /usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/reader.rb:112:in `read_page'
      from /usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/reader.rb:163:in `read_page'
      from /usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/base.rb:876:in `request'
      from /usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/base.rb:861:in `document'
      from /usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/base.rb:749:in `scrape'
      from /usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/base.rb:347:in `scrape'
      from ./script/../config/../config/../app/models/job_search.rb:132:in `get_jobs_CB'
      from (irb):3

      the HTTP string I built based on the input for the user (ie. computer%20technician)

      the function that is being called is the following:

      def get_jobs_CB(str = SEARCH_STRING)
        job_site = CB_SEARCH_STMT1 + remove_spaces(str) + CB_SEARCH_STMT2

        puts job_site

        listing = Scraper.define do
          process "td > a:nth-child(#{CB_TITLE_COL})", :title => :text, :url => '@href'
          process "td:nth-child(#{CB_COMPANY_COL})", :company => :text
          process "td:nth-child(#{CB_LOCATION_COL})", :location => :text
          process "td:nth-child(#{CB_POSTED_COL})", :posted => :text

          result :title, :company, :location, :posted, :url
        end

        listed_jobs = Scraper.define() do
          array :data
          process "table#JL_D tr", :data => listing
          result :data
        end

        jobs = listed_jobs.scrape(URI.parse(job_site))

        @job_list = check_empty_list(jobs)
        @job_list = change_date_format(@job_list) if !@job_list.nil?

        return @job_list
      end

      Can someone point to me what I am doing wrong or what is missing.

      Thanks in advance.

      Nick.B

    64. Sep 20th, 2007

      Assaf

      Nick, only thing I can think of is that these two environments might run different versions of Ruby, and the Net::HTTP library might behave differently.

    65. Sep 21st, 2007

      Nick. Bhanji

      I updated the capistrano on both machines and using same version — is there any way i can check the version of Net library?

    66. Sep 21st, 2007

      Assaf

      That one depends on the version of Ruby you’re using, it’s a core library, so just ruby -v. I used it successfully with 1.8.5 and 1.8.6.
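
      For comparing two machines from inside a script or irb, a small plain-Ruby sketch along these lines prints the same information as ruby -v, plus the file Net::HTTP was loaded from. The variable names are only illustrative.

```ruby
require 'net/http'

# Version and platform of the running interpreter.
ruby_version = RUBY_VERSION
platform     = RUBY_PLATFORM

# Path of the net/http.rb file that was actually loaded.
net_http_file = $LOADED_FEATURES.grep(%r{net/http\.rb\z}).first

puts "Ruby #{ruby_version} on #{platform}"
puts "Net::HTTP loaded from #{net_http_file}"
```

      Running it on both machines and comparing the output shows quickly whether the two environments really use the same Ruby.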

    67. Sep 22nd, 2007

      Nick Bhanji

      i am using ruby 1.8.6, downloaded fedora core7

    68. Oct 2nd, 2007

      Nick

      Hi Assaf,
      I’m thrilled to try this out! However, I am brand new to RoR. Is a step by step guide on the way or do you know anyone who has one already?
      I don’t really know where to do anything and think that this is an awesome project to learn the new language/framework.
      Thanks for your hard work!

    69. Oct 11th, 2007

      SEO

      That’s a very cool piece of code. Thank you for sharing.

      I am trying to scrape pages that have internal links to MP3 files, How can I make the scraper download these files to my local machine?
      Any idea?

      Thank you,
      Ed

    70. Oct 16th, 2007

      Dakota

      That’s cool work!
      Thank you assaf.

    71. Nov 18th, 2007

      Michael Hartl

      This looks like an awesome gem, but the content on the site runs off the screen to the right, rendering it virtually unreadable. Thought you’d like to know.

    72. Nov 30th, 2007

      Assaf

      Thanks, Dakota. And sorry about that Michael, one of the comments ran too long, trimmed it and the style is back to normal.

    73. Dec 11th, 2007

      Colin Z

      Hi Assaf,

      I love the simplistic style of the API. It reminds me quite a bit of XSLT templates.

      Have you thought about adding support for XPath selectors in addition to css?

    74. Dec 11th, 2007

      Assaf

      XPath is more complicated than CSS, so that would end up with a different API, not an API I’d want to use.

    75. Dec 15th, 2007

      Brian

      Wow, that’s a great tool. This is exactly what I needed for a research project.

      Great Job Assaf.

    76. Dec 16th, 2007

      Ra

      Wonderful work!
      One question.
      I have an href and I want extract substrings from it “x,y,z = href.scan(/(\d+).+(\d+-\d+-\d+)\+(\d+%3[Aa]\d+%3[Aa]\d+)/).flatten” and I want x,y,z be part of the structure returned by result, how?
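
      How to feed these values into a scrAPI result is not answered in this thread. On the plain-Ruby side, though, the pattern has a pitfall worth a quick sketch: the greedy .+ can swallow the leading digits of the date. The href below is a made-up example, and the lazy .+? is one possible fix, not the only one.

```ruby
# Hypothetical href; the real links from the question are not shown.
href = "item=123&date=2007-12-16+10%3A30%3A00"

# The pattern from the question, with a greedy .+ in the middle.
GREEDY = /(\d+).+(\d+-\d+-\d+)\+(\d+%3[Aa]\d+%3[Aa]\d+)/

# Same pattern with a lazy .+? so the date keeps all of its digits.
LAZY = /(\d+).+?(\d+-\d+-\d+)\+(\d+%3[Aa]\d+%3[Aa]\d+)/

greedy = href.scan(GREEDY).flatten
lazy   = href.scan(LAZY).flatten

puts greedy.inspect  # ["123", "7-12-16", "10%3A30%3A00"]
puts lazy.inspect    # ["123", "2007-12-16", "10%3A30%3A00"]
```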

      10x in advance.

    77. Dec 19th, 2007

      Mona

      Thanks Assaf, that is very useful work. I used to use hpricot. But I guess your script is far better.

      Thanks

    78. Dec 22nd, 2007

      Fabrizio

      Hi!
      Where do I have to set Tidy.path = "/usr/lib/libtidy.dylib" exactly?
      Can you please post some example of a trivial script?
      Thanks

    79. Dec 27th, 2007

      Ra

      Hi Assaf,
      when iterating over a lot of pages the lib eats a lot of memory, do you have any idea about this behaviour?
      I’m starting to debug, however any help is appreciated.

    80. Dec 27th, 2007

      Ra

      Sorry Assaf. My fault, was a problem with mechanize and history.

    81. Dec 27th, 2007

      Michael Staton

      I’ve been using Hpricot and Mechanize for a little while, and I like scrAPI much better. However, there’s a few things i’m at a loss for. For instance, how would I return an array of objects with multiple attributes?

      array :dept, :url
      process "a", :dept => :text,
                   :url => "@href"

      result :dept, :url

      this returns all depts and then all the urls. how can i get it to return an array of both?

    82. Dec 27th, 2007

      Michael Staton

      There’s also an easy way to filter your results in Hpricot that I’d like to use,

      new_doc/"a:contains('faculty')"

      but trying to use Hpricot returns the error

      scrapi_test.rb:5:in `initialize’: No such file or directory – ‘designated_url” (Errno::ENOENT)
      from scrapi_test.rb:5:in `open’
      from scrapi_test.rb:5

      I’ve been trying to use attribute_match but have been unsuccessful.

    83. Dec 27th, 2007

      Michael Staton

      Now I’m getting /usr/local/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/html/selector.rb:454:in `select’: undefined method `tag?’ for “http://www.berkeley.edu/catalog/”:String (NoMethodError)
      from scrapi_test.rb:53
      from scrapi_test.rb:52:in `each’
      from scrapi_test.rb:52

      for

      right = HTML::Selector.new "a[href*=dept]"

    84. Dec 31st, 2007

      Michael Staton

      how do you use attribute match?

    85. Jan 1st, 2008

      Assaf

      Michael, for handling arrays you need to create a scraper for the array, and inside that use a scraper for the individual attributes. The eBay example in the post shows how to do it.

      The error you’re getting is from passing a string to the select method. The method expects an HTML document/element (HTML::Document or HTML::Tag).

    86. Jan 5th, 2008

      James Burt

      Thanks Assaf, this is a very useful piece of information. Would be glad if there is a dummy guide for beginners like me.

      Thanks

    87. Jan 15th, 2008

      Al

      Interesting reading. I am just researching for a TV comparison site / project. My diy code is stuck on 302 redirects, when requests are made by non IE clients.

    88. Jan 16th, 2008

      Assaf

      Al, the read_page method takes an option (:user_agent) that allows you to specify a different user agent. By default it doesn’t send any User-Agent header, and some servers ignore requests without it.
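
      For DIY code that talks to the server directly, the same idea can be sketched with Ruby’s standard Net::HTTP. The helper name build_request, the URL and the user agent string are made up for illustration; the request is only built here, not sent.

```ruby
require 'net/http'
require 'uri'

# Build a GET request that carries an explicit User-Agent header.
def build_request(url, user_agent)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri.request_uri)
  request['User-Agent'] = user_agent
  request
end

request = build_request("http://example.com/listings?page=1",
                        "Mozilla/5.0 (compatible; ExampleScraper/0.1)")

puts request['User-Agent']  # Mozilla/5.0 (compatible; ExampleScraper/0.1)
puts request.path           # /listings?page=1
```

      Sending it would then be a matter of passing the request to Net::HTTP.start(host, port) { |http| http.request(request) }.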

    89. Apr 25th, 2008

      jouer en ligne

      That’s cool work!
      Thank you assaf.

    90. Apr 28th, 2008

      John

      Hi, I really like what you’ve put together here and am trying to use it in a project but am running into a little trouble. I’m trying to define a Scraper dynamically, setting the selector for it based on a variable, but that doesn’t seem to work. Everything seems to be class methods so creating a new Scraper class and setting instance variables doesn’t work, and putting the variable into a Scraper.define block doesn’t work either because the contents of the block get evaluated later on and the variable used in the block has no meaning at that point. Any idea how to get around this? Any help at all would be really appreciated.

    91. Apr 29th, 2008

      John

      Nevermind, I figured it out, I was being stupid.

    92. May 8th, 2008

      Timo

      This has probably saved me a few hours of searching. Thanks, Assaf!

    93. May 18th, 2008

      Attack of the Website Scrapers | The BookmarkMoney Blog

      [...] Inside reported on scrAPI a while back. scrAPi is a Ruby-based HTML scraping toolkit written by Assaf [...]

    94. May 25th, 2008

      The sixth sense – » Mashups?Web [...]

      [...] Scraping with style: scrAPI toolkit for Ruby [...] mashup [...]

    95. Jun 25th, 2008

      Emerson

      I’m new to Ruby and I’ve been struggling a bit recently with gems and windows.

      It seems that many projects don’t give much thought to how a gem might work on environments other than linux/osx.

      So for a lot of gems, I end up downloading dll’s separately and putting them into my C:\Ruby\bin directory to make things work.

      I was using Scrapi and noticed that it tries to package Tidy with it rather than relying on “require ‘tidy’” so my usual dll trick didn’t work, and when I tried to run some code I would see an error about “libtidy.so” not being a valid windows object file. You don’t say…

      So I did some digging, and I can see that in the “find_tidy()” method in reader.rb, Scrapi tries to set the Tidy path. But it’s coded wrongly.

      Since the Scrapi gem ships with both a linux “.so” and a windows “.dll” the windows dll will never be found before the linux “.so” according to the strategy in “find_tidy()”.

      def find_tidy()
        return if Tidy.path
        begin
          Tidy.path = File.join(File.dirname(__FILE__), "../tidy", "libtidy.so")
        rescue LoadError
          begin
            Tidy.path = File.join(File.dirname(__FILE__), "../tidy", "libtidy.dll")
          rescue LoadError
            Tidy.path = File.join(File.dirname(__FILE__), "../tidy", "libtidy.dylib")
          end
        end
      end

      I suggest you use a platform detection mechanism like so:

      “RUBY_PLATFORM =~ /mswin32/”

      And then the code will play more nicely. Otherwise, just stick to “require ‘tidy’”, at least then Scrapi will only have the same problem that all the other gems do :)
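
      A minimal sketch of that suggestion, in plain Ruby: pick the library file name from the platform string first, instead of trying each file in turn. The helper name tidy_library_name and the exact patterns are assumptions for illustration and are not the code that ships with scrAPI.

```ruby
# Pick the Tidy library file name for a platform string.
# Note: matching a bare /win/ would also catch "darwin", so the
# Windows patterns are spelled out explicitly.
def tidy_library_name(platform = RUBY_PLATFORM)
  case platform
  when /mswin|mingw|cygwin/ then "libtidy.dll"
  when /darwin/             then "libtidy.dylib"
  else                           "libtidy.so"
  end
end

puts tidy_library_name("i386-mswin32")     # libtidy.dll
puts tidy_library_name("x86_64-darwin20")  # libtidy.dylib
puts tidy_library_name("x86_64-linux")     # libtidy.so
```

      The result could then be joined onto the directory used in find_tidy() above before assigning it to Tidy.path.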

    96. Jul 26th, 2008

      Webdesign

      Thanks a lot, this is very useful. Would be glad if there is a dummy guide for beginners. Anyway thank you very very much!!!

    97. Aug 13th, 2008

      Personalberatung

      I also think that it would be great to get a guide for beginners,
      thanks a lot.