Ruby / Screen scraping

From WhyNotWiki


screen scraping / screen scrapers / web automation / automated data extraction / automated form posting / web spiders


Tools / libraries

WWW::Mechanize (rated 5 of 5 stars)

http://mechanize.rubyforge.org/

http://mechanize.rubyforge.org/mechanize/

Author's blog: http://tenderlovemaking.com/category/mechanize/

http://tenderlovemaking.com/2007/02/26/ruby-mechanize-065-released/

The Mechanize library automates interaction with websites. It stores and sends cookies automatically, follows redirects and links, and can populate and submit forms. It also keeps a history of the pages you have visited.

Examples

http://mechanize.rubyforge.org/mechanize/files/EXAMPLES_txt.html

Mechanize One Liners (http://tenderlovemaking.com/2006/05/26/mechanize-one-liners/). Retrieved on 2007-03-06 14:59.

Fetch a page and print to stdout:

puts WWW::Mechanize.new.get(ARGV[0]).body

List all links in a page:

WWW::Mechanize.new.get(ARGV[0]).links.each { |l| puts l.text }

Visit all links on a page:

(a = WWW::Mechanize.new).get(ARGV[0]).links.each { |l| puts a.click(l).body }

List all links that match a pattern:

WWW::Mechanize.new.get(ARGV[0]).links.text(/[a-z]/).each { |l| puts l.text }

Visit all links that match a pattern:

(a = WWW::Mechanize.new).get(ARGV[0]).links.text(/[a]/).each { |l| puts a.click(l).body }

Smaller Spider:

(mech = WWW::Mechanize.new).get(ARGV[0])
(a = lambda { |p|
  mech.page.links.each { |l| mech.click(l) && p.call(p) if ! mech.visited? l }
}).call(a)


A spider: A Mechanize Spider (http://tenderlovemaking.com/2006/05/24/a-mechanize-spider/). Retrieved on 2007-03-06 14:59.

agent = WWW::Mechanize.new
stack = agent.get(ARGV[0]).links
while l = stack.pop
  stack.push(*(agent.click(l).links)) unless agent.visited? l.href
end
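The stack-based traversal above can be tried without network access. The following is a minimal sketch of the same idea, not Mechanize code: it assumes a hypothetical in-memory link graph (`SITE`) in place of real pages, and a hypothetical `crawl` helper that tracks visited pages in a `Set` instead of asking the agent's history.

```ruby
require 'set'

# Hypothetical in-memory "site": each page maps to the pages it links to.
SITE = {
  '/'        => ['/about', '/posts'],
  '/about'   => ['/'],
  '/posts'   => ['/posts/1', '/posts/2', '/'],
  '/posts/1' => ['/posts'],
  '/posts/2' => ['/posts/1'],
}.freeze

# Depth-first crawl using an explicit stack and a visited set.
# Returns the pages in the order they were fetched.
def crawl(site, start)
  visited = Set.new([start])
  order   = [start]
  stack   = site.fetch(start, []).dup   # links found on the start page
  while (link = stack.pop)
    next if visited.include?(link)      # skip pages already fetched
    visited << link
    order << link
    stack.push(*site.fetch(link, []))   # queue the links found on that page
  end
  order
end

puts crawl(SITE, '/')
```

Because links are popped from the end of the stack, the crawl goes deep before it goes wide. The visited check is what keeps the loop from running forever on pages that link back to each other.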



RedNails: A template-driven data scraper

Homepage: http://rednails.rubyforge.org/


Project/Development: http://rubyforge.org/projects/rednails/


Description: RedNails is a data scraping library that uses templates to determine what to extract from actual data feeds. From the template, RedNails builds a regular expression that captures the variables the user has marked. When a string of data is passed to RedNails, it applies that regular expression and returns the matches to the user. If the scraped data is regular enough, RedNails is a simple way to extract it: copy a live data feed, mark the points to extract, and use the result as the template.





Example template (from http://rednails.rubyforge.org/):

          <html>
          <body>
          A bunch of photos:
          #{Rep:<img src="@url@" alt="@txt@"/>}
          </body>
          </html>
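To make the template-to-regular-expression idea concrete, here is a minimal sketch in plain Ruby. It is not the RedNails API or its actual implementation. It assumes only that markers look like `@name@`, as in the template above, and it ignores the `#{Rep:...}` repetition syntax. The helpers `template_to_regexp` and `extract` are hypothetical names chosen for illustration.

```ruby
# Minimal sketch of template-driven extraction (NOT the RedNails API).
# Assumption: markers look like @name@, as in the template above.

# Turn a template string into a Regexp with one named capture per marker.
def template_to_regexp(template)
  # Splitting on a pattern with a capture group keeps the marker names,
  # so literal text sits at even indices and marker names at odd indices.
  parts = template.split(/@(\w+)@/)
  source = parts.each_with_index.map do |part, i|
    if i.odd?
      "(?<#{part}>.*?)"     # marker: non-greedy named capture
    else
      Regexp.escape(part)   # literal text: must match exactly
    end
  end.join
  Regexp.new(source, Regexp::MULTILINE)
end

# Return one Hash of marker name => captured value per match in the data.
def extract(template, data)
  regexp = template_to_regexp(template)
  results = []
  data.scan(regexp) { results << Regexp.last_match.named_captures }
  results
end

template = '<img src="@url@" alt="@txt@"/>'
data = <<~HTML
  <html><body>A bunch of photos:
  <img src="/a.jpg" alt="Sunset"/>
  <img src="/b.jpg" alt="Harbour"/>
  </body></html>
HTML

extract(template, data).each { |row| puts "#{row['txt']}: #{row['url']}" }
```

Escaping the literal parts is what lets a copied data feed serve directly as the template. The approach only works when the surrounding markup is as regular as the description requires.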

scRUBYt

Homepage: http://scrubyt.org


Project/Development: http://rubyforge.org/projects/scrubyt/


Description: A web extraction framework in Ruby that is simple to learn and use, yet very powerful. Navigate through the Web, then extract, query, transform and save data from the Web page of interest using the concise, easy-to-use DSL provided by scRUBYt.




Readiness: 4 - Beta


beautiful soup

...


bottone - Ruby Wikipedia client/editor

Categories/Tags: [Wikipedia (category)][MediaWiki (category)]



Project/Development: http://rubyforge.org/projects/bottone/


Description: The aim of this project is to make a tool (or a set of tools) to read, edit and save Wikipedia pages.




Readiness: 1 - Planning, 2 - Pre-Alpha (November 2, 2004, 13:30)



Other

How to automate Facebook

http://shanti.railsblog.com/how-to-automate-facebook-interaction-using-ruby-and-www-mechanize
