1. Sep 5th, 2006

    scrAPI cheat sheet

    scrapi.png

    Download as HTML, PDF, or if you’re using cheat: cheat scrapi.

    1. Sep 6th, 2006

      newcomer

      Hi there,

      I’m a newcomer to scrAPI and I must say that it really saved me some time. Thanks!

      However, I still have some doubts so I’ve decided to ask. For example, how would you parse something like this without using regular expressions?:
      adasfasfda
      adasfasfda
      adfasdfa
      adfasdfa

      The aim is to get only the first two paragraphs, but not the last two and get as a result the contents of tag in one variable and the rest of tag in another one.

      Thanks in advance for your time.

    2. Sep 6th, 2006

      newcomer

      Sorry, the original markup was interpreted by the blog engine. Let’s try again:

      a adfasfafas
      a adfasfafas
      adfasfafas
      adfasfafas


      Again, the aim is to get only the first two paragraphs, but not the last two and get as a result the contents of tag in one variable and the rest of tag in another one.

      I hope this time the comment is OK.

    3. Sep 6th, 2006

      Assaf

      newcomer,

      You can get just the <b> elements by selecting for “b”. And once you get them, you can remove them from the element (leaving just the rest of the text) by calling detach on each element.

      So:
      bolds = HTML::Selector.new("b").select(element)
      bolds.each { |b| b.detach }

      When you’re dealing with the text itself (e.g. you want the first two lines but not the last two), use regular expressions.

      scrAPI makes it easier to find that text.

    4. Sep 8th, 2006

      gik

      Hello Assaf,
      thanks a lot for a great library.
      But I have a few questions about it:
      1. If you use Tidy, why not build XML and extract the content with XPath?
      2. Is it possible to write logical selectors like
      E[foo="bar" && foo2="bar2"]

    5. Sep 8th, 2006

      Assaf

      gik,

      Let’s say you want to find an element with the class “foo”. With XPath, most people will do:

      //*[@class="foo"]

      But it will only work if the element has one class. If the element has two classes, you need some funky string manipulation to find just the foo.

      Try writing that XPath expression and compare it to the CSS one:

      .foo

      I like CSS selectors because they’re simpler to use, and everything I need to know can fit on one cheat sheet page.
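The difference between the two matching rules can be sketched in plain Ruby, without scrAPI or an XPath engine. The helper names here are hypothetical and exist only for illustration; the XPath expression in the comment is a commonly used idiom, not something taken from scrAPI.

```ruby
# What the naive XPath test @class="foo" does: compare the whole attribute value.
def exact_class_match?(class_attr, name)
  class_attr == name
end

# What the CSS selector .foo does: match any whitespace-separated class token.
def token_class_match?(class_attr, name)
  class_attr.to_s.split.include?(name)
end

# A commonly used XPath 1.0 expression with the same token semantics:
#   //*[contains(concat(' ', normalize-space(@class), ' '), ' foo ')]

exact_class_match?("foo", "foo")       # true
exact_class_match?("foo bar", "foo")   # false: a second class defeats the exact test
token_class_match?("foo bar", "foo")   # true
token_class_match?("foobar", "foo")    # false: a substring is not a class token
```

The last case is why a plain contains(@class, 'foo') is not enough in XPath either: it would also match class="foobar".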

      As for the second question:

      E[foo=bar][foo2=bar2] will select E only if it has both attributes (and).

      E[foo=bar], E[foo2=bar2] will select all E with foo=bar, and all E with foo2=bar2 (or).

      E[foo=bar]:not([foo2=bar2]) will select all E with foo=bar but not foo2=bar2.

    6. Sep 8th, 2006

      gik

      Assaf, thanks for the reply.

      I don’t know CSS selectors so well, but is it possible to make a selection in CSS like this XPath:
      //E[contains(F/@class,'some')]
      that is, E with an F child which has the substring 'some' in its class attribute?

    7. Sep 8th, 2006

      Assaf

      There are no predicates in CSS level 3.

      I am thinking of adding this feature to scrAPI. The CSS syntax could look like:

      E:contains(F.some)

      The reason it’s not there yet is very simple. Extending the syntax so it’s simple to understand, and does 80% of the work is easy. A simple :contains will work.

      Extending the syntax to do the other 20% is very hard, and I think the best way is to just use Ruby blocks. But I need a few more examples of real code so I can extract a clean and simple pattern out of them.

      Is that something you can help with?

    8. Sep 8th, 2006

      Chris

      Hi,

      I’m new to ruby and scrAPI.
      I’d like to scrape the results of submitting a form with POST.
      Is there a way to POST form data with the “scrape” method?

      Thanks.

    9. Sep 8th, 2006

      Assaf

      Chris,

      Not from the scrape method itself, but you can use Net::HTTP to make a POST request and pass the result body to the scrape method.

      If enough people think it’s useful, I can add convenience methods that would do POST and handle form submissions.
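A minimal sketch of that approach, using only Ruby's standard Net::HTTP. The URL, the field names, and the helper names are placeholders invented for this example, and `MyScraper` in the closing comment stands for whatever scraper class you have defined.

```ruby
require "net/http"
require "uri"

# Build (but do not send) a form-encoded POST request.
def build_form_post(url, fields)
  uri = URI.parse(url)
  request = Net::HTTP::Post.new(uri.request_uri)
  request.set_form_data(fields)  # sets the body and the form Content-Type
  [uri, request]
end

# Send the request and return the response body as a String.
def post_form_body(url, fields)
  uri, request = build_form_post(url, fields)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end
  response.body
end

# Inspecting the request without touching the network:
_uri, request = build_form_post("http://example.com/search", { "q" => "ruby", "page" => "1" })
request.body  # "q=ruby&page=1"

# With a real endpoint, the body would then be handed to the scraper:
#   html = post_form_body("http://example.com/search", { "q" => "ruby" })
#   result = MyScraper.scrape(html)  # MyScraper is hypothetical
```

Building the request separately from sending it keeps the form encoding easy to check before any network call is made.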

    10. Sep 11th, 2006

      gik

      Hello Assaf,
      I am very new to Ruby, and don’t know the language so well :)
      As far as I can see, it would be very useful to have a selector like this:
      “E

      needed content

      fake content

      fake content

    11. Sep 11th, 2006

      gik

      Again….
      E

    12. Sep 11th, 2006

      gik


      E { F D: a D element which is a child of an E element that is the parent of an F element.

      This would allow scraping content from markup like this:

      [e]
      [f][/f]
      [d]needed content[/d]
      [/e]

      [e]
      [z][/z]
      [d]fake content[/d]
      [/e]

      [e]
      [x][/x]
      [d]fake content[/d]
      [/e]

    13. Sep 11th, 2006

      Assaf

      gik,

      CSS selectors are not part of the Ruby language, but part of the CSS specification. You can learn more about them in this article:

      http://www.xml.com/pub/a/2003/06/18/css3-selectors.html

    14. Sep 11th, 2006

      gik

      Assaf,
      I know about the sources of the CSS specification :)
      But I propose to add this selector to scrAPI, not to CSS :)

    15. Sep 11th, 2006

      Assaf

      In CSS that would be: E>F+D
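To see what E>F+D picks out of gik's sample markup, here is a rough emulation written by hand with REXML. It is not how scrAPI evaluates selectors; the helper name is hypothetical, and the bracketed markup has been rewritten as well-formed XML under an assumed root element.

```ruby
require "rexml/document"

# gik's sample markup as XML; the <root> wrapper is an assumption.
MARKUP = <<~XML
  <root>
    <e><f></f><d>needed content</d></e>
    <e><z></z><d>fake content</d></e>
    <e><x></x><d>fake content</d></e>
  </root>
XML

# Emulate "parent > child + sibling": the sibling must be the element
# immediately following a direct child of the parent.
def child_then_adjacent(doc, parent, child, sibling)
  found = []
  doc.elements.each("//#{parent}") do |parent_el|
    parent_el.elements.each(child) do |child_el|   # direct children only (>)
      following = child_el.next_element            # next element sibling (+)
      found << following if following && following.name == sibling
    end
  end
  found
end

doc = REXML::Document.new(MARKUP)
matches = child_then_adjacent(doc, "e", "f", "d")
matches.map(&:text)  # ["needed content"]
```

Only the first block matches, because it is the only one where a d element directly follows an f child of e.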

    16. Jan 6th, 2007

      Edgar

      I’m just downloading scrAPI

      What is the best way to retrieve all the tags (rel=tag) in every post in a feed?

    17. Jan 6th, 2007

      Assaf

      Scraper.define do
        array :tags
        result :tags
        process "a[href][rel~=tag]", :tags=>"@href"
      end

    18. May 2nd, 2008

      Arvind

      You have done an incredible job with scrAPI. I am learning Ruby mainly to take advantage of your library. So far, I have only scraped one complex page, and the process was smooth enough. I still have to see whether my code will be maintainable going forward.

      One thing I tried is the enhancements suggested at http://www.quarkruby.com/2008/1/30/scrapi-enhancements/. I liked the idea but ran into a couple of issues with those changes. Are you planning to implement such a feature? Their blog has also helped me in using scrAPI.

      One other thing I was wondering: can you point me to performance tips for using scrAPI and Tidy, so that I can make sure my code is performance-conscious? I hear horror stories about Ruby performance in general.

      Also, I noticed there is not much activity on scrAPI. Is there a plan to improve and maintain it going forward?

      Thank you for all your efforts in building scrAPI.
