1. Aug 15th, 2006

    scrAPI Gem release 1.1.2

    scrAPI is now available as a Ruby Gem. To install:

    gem install scrapi

    Thanks to the always helpful RubyForge for hosting.

    I’m still going to do incremental fixes in SVN, so if you want the bleeding edge, get it from SVN (http://labnotes.org/svn/public/ruby/scrapi). If you want to wait for official releases, Gems are your friends.

    This is also an official announcement for release 1.1.2, which includes a bug fix for first-of-type and last-of-type, and adds support for multiple negations. For example:

    process "h1, h2"

    Will process every element that is either header level 1, or header level 2. On the other hand:

    process ":not(h1):not(h2)"

    Will process all elements but these headers. You can do a lot of interesting things with the :not pseudo class.
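    To make the difference concrete without depending on scrAPI itself, here is a plain-Ruby sketch that models the two selectors as filters over a list of tag names. The list of tags is a made-up stand-in for a parsed document, and nothing here calls the scrAPI API; it only illustrates that a comma is a union, while chained :not() clauses must all hold at once.

```ruby
# Hypothetical stand-in for a parsed document: just tag names in document order.
tags = %w[html body h1 p h2 div h1 span]

# "h1, h2": a comma is a union, so an element matches if either selector matches.
headers = tags.select { |t| t == "h1" || t == "h2" }

# ":not(h1):not(h2)": every negation must hold, so both header levels are excluded.
non_headers = tags.select { |t| t != "h1" && t != "h2" }

puts headers.inspect
puts non_headers.inspect
```

    Every tag lands in exactly one of the two lists, which is the sense in which the second selector processes "all elements but these headers".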

    1. Oct 15th, 2006

      Dan

      Hi,
      I’m trying to get :not() to work.
      I built a simple test to remove all img tags:

      element_scraper4 = Scraper.define do
        array :some_elements
        process "*:not( img )", :some_elements=>:element
        result :some_elements
      end

      element_scraper4.scrape URI.parse("http://search.ebay.com/search/search.dll?ht=1&from=R4&satitle=cat&sacat=550%26catref%3DC6")

      I see that the img tags are not scraped out.

      I fiddled with a few permutations of the above syntax and I can’t get the img tags removed.

      How do I remove img tags?

      -Dan

    2. Oct 15th, 2006

      Assaf

      Dan,

      This scraper will return an array with all elements, except img elements. But each element in the array will contain all its child elements as well, and these will include img.

      If you want those removed from the document entirely, do something like:

      process("img") { |e| e.detach }
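      The distinction between "not matched" and "not present" can be sketched in plain Ruby. The Node struct and detach_all helper below are hypothetical stand-ins for illustration only, not scrAPI's real node classes: filtering by name leaves img children attached to the elements that were matched, while detaching removes them from the tree.

```ruby
# Hypothetical minimal element type; scrAPI's real nodes are richer than this.
Node = Struct.new(:name, :children) do
  # This node followed by all of its descendants, in document order.
  def self_and_descendants
    [self] + children.flat_map(&:self_and_descendants)
  end
end

img  = Node.new("img", [])
link = Node.new("a", [img])
div  = Node.new("div", [link, Node.new("p", [])])

# Rough equivalent of "*:not(img)": keep every element whose own name is not img.
matched = div.self_and_descendants.reject { |n| n.name == "img" }
matched_names = matched.map(&:name)

# The matched <a> element still carries its img child.
a_still_has_img = matched.find { |n| n.name == "a" }.children.any? { |c| c.name == "img" }

# Hypothetical helper: remove every node with the given name from its parent.
def detach_all(node, name)
  node.children.reject! { |c| c.name == name }
  node.children.each { |c| detach_all(c, name) }
end

detach_all(div, "img")
remaining = div.self_and_descendants.map(&:name)
```

      After detach_all runs, no img node is reachable from the root, so any later serialization of the tree would not include one.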

      Assaf

    3. Oct 21st, 2006

      Dan Bikle

      Hi again,

      I got scrAPI working on my Mac so that’s good.

      FreeBSD is a different story.

      My beastie box is tripping over tidy.

      I think it wants a tidy.so library.

      I’m not sure how to make a tidy.so library.

      I can make a tidy executable using the source I got from SourceForge.

      It looks like I have 2 options:

      - Learn how to make a tidy.so
      - Learn how to configure scrAPI so it uses /usr/local/bin/tidy rather than tidy.so

      Any tips anyone?

      -Dan

    4. Oct 22nd, 2006

      Dan Bikle

      Hi again,

      I got scrAPI working on my beastie box.

      It turns out the tidy I was using was from my /usr/ports directory rather than from SourceForge.

      I had to struggle a bit with the SourceForge tidy.

      My beastie box was missing a bunch of the GNU tools like libtoolize, autoconf, automake…

      Once I got those installed,
      I followed the directions attached to tidy.

      Eventually, I ran a make command which made
      a whole lot of stuff. One of those things
      was a .so file which was named libtidy-0.99.so.0

      I copied it to /tmp/libtidy.so and copied that
      to the location pointed to by the Tidy.path
      variable.

      Mine looks like this:

      Tidy.path = '/usr/local/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/tidy/libtidy.so'

      I’m not sure of the best place to put the above variable.

      I put it in my controller but that is not a good
      DRY place.

      But, now I got scrAPI working on my beastie box and
      I’m feeling good.

      -Dan

    5. Oct 24th, 2006

      Assaf

      libtidy-0.99.so.0 is the one I also use. I leave it in the /usr/lib directory, so I can use the same library in other places.

      But if you want to use it strictly with Ruby, I would suggest putting it in the same package as the Ruby Tidy library, again, so you have one place if you decide to use it with anything other than scrAPI.

      (For example, there’s a Rails plugin that uses Tidy to validate your pages during testing)
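      One way to keep the Tidy.path setting out of individual controllers is to resolve the library location once, in code that runs at startup (in a Rails application, config/environment.rb is one such place). The sketch below is only a suggestion: the TIDY_PATH environment variable, the candidate paths, and the find_libtidy helper are all assumptions made for illustration, and Tidy.path is only assigned if the Tidy library has already been loaded.

```ruby
# Candidate locations for the shared library. These are assumptions;
# adjust them to wherever libtidy.so actually lives on your system.
LIBTIDY_CANDIDATES = [
  ENV["TIDY_PATH"],            # hypothetical override via the environment
  "/usr/lib/libtidy.so",
  "/usr/local/lib/libtidy.so"
].compact.freeze

# Hypothetical helper: return the first candidate that exists on disk, or nil.
def find_libtidy(candidates = LIBTIDY_CANDIDATES)
  candidates.find { |path| File.file?(path) }
end

# Only configure Tidy when the library is loaded and a file was found.
if defined?(Tidy) && (path = find_libtidy)
  Tidy.path = path
end
```

      Keeping the lookup in one place means a shared copy in /usr/lib and a copy bundled with a gem can both be supported without editing application code.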
