Ruby / Screen scraping
From WhyNotWiki
screen scraping / screen scrapers / web automation / automated data extraction / automated form posting / web spiders
Tools / libraries
WWW::Mechanize
http://mechanize.rubyforge.org/
http://mechanize.rubyforge.org/mechanize/
Author's blog: http://tenderlovemaking.com/category/mechanize/
http://tenderlovemaking.com/2007/02/26/ruby-mechanize-065-released/
The Mechanize library automates interaction with websites. It stores and sends cookies automatically, follows redirects and links, and submits forms after their fields have been populated. It also keeps a history of the pages you have visited.
Examples
http://mechanize.rubyforge.org/mechanize/files/EXAMPLES_txt.html
Mechanize One Liners (http://tenderlovemaking.com/2006/05/26/mechanize-one-liners/).
Fetch a page and print to stdout:
puts WWW::Mechanize.new.get(ARGV[0]).body
List all links in a page:
WWW::Mechanize.new.get(ARGV[0]).links.each { |l| puts l.text }
Visit all links on a page:
(a = WWW::Mechanize.new).get(ARGV[0]).links.each { |l| puts a.click(l).body }
List all links that match a pattern:
WWW::Mechanize.new.get(ARGV[0]).links.text(/[a-z]/).each { |l| puts l.text }
Visit all links that match a pattern:
(a = WWW::Mechanize.new).get(ARGV[0]).links.text(/[a]/).each { |l| puts a.click(l).body }
Smaller Spider:
(mech = WWW::Mechanize.new).get(ARGV[0])
(a = lambda { |p| mech.page.links.each { |l| mech.click(l) && p.call(p) if ! mech.visited? l } }).call(a)
A spider: A Mechanize Spider (http://tenderlovemaking.com/2006/05/24/a-mechanize-spider/).
agent = WWW::Mechanize.new
stack = agent.get(ARGV[0]).links
while l = stack.pop
  stack.push(*(agent.click(l).links)) unless agent.visited? l.href
end
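The stack-based traversal in the spider above can be traced without network access. The following is a minimal sketch, not Mechanize code: the `site` hash is a hypothetical stand-in for fetched pages (page path mapped to the paths it links to), and `crawl` is an illustrative name. It follows the same pop, check-visited, push-links loop.

```ruby
# Hypothetical in-memory "site": each page maps to the pages it links to.
# This replaces real HTTP fetches so the traversal logic can be run offline.
site = {
  '/'  => ['/a', '/b'],
  '/a' => ['/b', '/c'],
  '/b' => ['/'],
  '/c' => []
}

# Depth-first crawl with an explicit stack, mirroring the spider's loop:
# pop a link, skip it if already visited, otherwise "fetch" it and push
# every link found on that page.
def crawl(site, start)
  visited = [start]
  stack = site.fetch(start).dup
  while (link = stack.pop)
    next if visited.include?(link)
    visited << link
    stack.push(*site.fetch(link, []))
  end
  visited
end

order = crawl(site, '/')
puts order.join(' ')
```

Because links are popped from the end of the stack, the last link on each page is followed first. The visited check is what keeps the loop from cycling between `/` and `/b`.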
RedNails: A template-driven data scraper
| Homepage: | http://rednails.rubyforge.org/ |
|---|---|
| Project/Development: | http://rubyforge.org/projects/rednails/ |
| Description: | RedNails is a data scraping library that uses templates to determine what data to extract from actual data feeds. RedNails compiles the template into a regular expression that captures the user-defined marker variables. When a string of data is passed to RedNails, it applies the regular expression, extracts the matches, and returns them to the user. If the scraped data is regular enough, RedNails is a simple way to extract it: copy a live data feed, mark the points to extract, and use the result as the template. |
Example template (from http://rednails.rubyforge.org/):
<html>
<body>
A bunch of photos:
#{Rep:<img src="@url@" alt="@txt@"/>}
</body>
</html>
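To illustrate the idea in the description (a template compiled into a regular expression that captures the marked variables), here is a minimal sketch. It is not the RedNails API or its implementation: `template_to_regexp` and `scrape` are hypothetical names, the `@name@` marker syntax is inferred from the template above, and the `#{Rep:...}` repetition wrapper is not handled (repetition is approximated by scanning for every match).

```ruby
# Illustrative only: NOT the RedNails API.
# Compile a template such as '<img src="@url@" alt="@txt@"/>' into a
# Regexp with one named capture group per @marker@.
def template_to_regexp(template)
  # Splitting on a capturing group keeps the markers in the result:
  # literal, marker, literal, marker, ...
  parts = template.split(/(@[A-Za-z_]\w*@)/)
  source = parts.map do |part|
    if part =~ /\A@([A-Za-z_]\w*)@\z/
      "(?<#{$1}>.*?)"      # non-greedy named capture for the marker
    else
      Regexp.escape(part)  # literal template text must match exactly
    end
  end.join
  Regexp.new(source, Regexp::MULTILINE)
end

# Return one hash (marker name => captured text) per match in the data.
def scrape(template, data)
  regexp = template_to_regexp(template)
  results = []
  data.scan(regexp) { results << Regexp.last_match.named_captures }
  results
end

html = '<img src="a.png" alt="First"/><img src="b.png" alt="Second"/>'
photos = scrape('<img src="@url@" alt="@txt@"/>', html)
photos.each { |photo| puts "#{photo['url']}: #{photo['txt']}" }
```

Escaping the literal parts matters: without `Regexp.escape`, characters such as `.` or `?` in the copied feed would be interpreted as regular expression syntax rather than matched verbatim.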
scRUBYt
| Homepage: | http://scrubyt.org |
|---|---|
| Project/Development: | http://rubyforge.org/projects/scrubyt/ |
| Description: | A simple-to-learn yet very powerful web extraction framework written in Ruby. Navigate the Web, then extract, query, transform, and save data from the page of interest using the concise, easy-to-use DSL that scRUBYt provides. |
| Readiness: | 4 - Beta |
beautiful soup
...
bottone - Ruby Wikipedia client/editor
| Categories/Tags: | [Wikipedia (category)][MediaWiki (category)] |
|---|---|
| Project/Development: | http://rubyforge.org/projects/bottone/ |
| Description: | The aim of this project is to make a tool (or a set of tools) to read, edit and save Wikipedia pages. |
| Readiness: | 1 - Planning, 2 - Pre-Alpha, 2004.11.02.13.30 November 2, 2004 |
Other
How to automate Facebook
http://shanti.railsblog.com/how-to-automate-facebook-interaction-using-ruby-and-www-mechanize
