
Super Simple Ruby Web Scraper

Monday, May 19th, 2008

Alrighty folks. Here’s a quick walkthrough for scraping remote web data with Ruby. This is how I did it for the little web scraper I wrote on Saturday.

DISCLAIMER: Web scraping is kind of a gray area. Don’t steal things that are copyrighted and don’t be a jerk. Give credit where credit is due.

First things first: you’ll need to install the Mechanize Ruby gem.

sudo gem install mechanize

Mechanize is pretty slick. It fetches a given URL and lets you access the various HTML elements on the page. On top of that, you can use Hpricot methods to dig out more data. Let’s get going.

require 'rubygems'
require 'mechanize'
require 'uri'
require 'net/http'

url = "http://www.johnyerhot.com"

The way this is set up, you MUST pass a complete URL, scheme (http://) included.
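If you want to fail early on an incomplete URL instead of finding out halfway through a scrape, a check along these lines might help. This is only a sketch: `complete_url?` is a made-up helper name, not something Mechanize provides, and "complete" is assumed to mean an http or https URL with a host.

```ruby
require 'uri'

# Hypothetical helper: true only for absolute http(s) URLs that have a host.
def complete_url?(string)
  uri = URI.parse(string)
  uri.is_a?(URI::HTTP) && !uri.host.nil? && !uri.host.empty?
rescue URI::InvalidURIError
  # Strings that cannot be parsed at all are certainly not complete URLs.
  false
end

complete_url?("http://www.johnyerhot.com") # has a scheme and a host
complete_url?("www.johnyerhot.com")        # no scheme, so URI sees only a path
```

URI::HTTPS is a subclass of URI::HTTP, so https URLs pass the same check.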

@mech = WWW::Mechanize.new # newer Mechanize releases drop the WWW:: prefix: Mechanize.new

@page = @mech.get(url)

Now, let’s say we want to get the URLs of all the embedded images on the web page (http://www.johnyerhot.com):

@imgs = @page.search("img[@src]").map { |img| img['src'] }

You’ve now got an array (@imgs) with the URLs of all the embedded images! What we actually did was use Hpricot’s search method to look for any image tags that have a src attribute, then pull that attribute out of each tag. Mechanize also has its own methods for grabbing tags. For example, you can grab the target of every link in the web page.

# remember, @page is just the page Mechanize fetched
# from http://www.johnyerhot.com

@links = Array.new
@page.links.each do |link|
  @links << link.href
end

Now let’s weed out any links to non-images:

@links.each do |link| # yeah, we're only collecting jpgs
  if (link.to_s.include? ".jpg") || (link.to_s.include? ".JPG") || (link.to_s.include? ".jpeg") || (link.to_s.include? ".JPEG")
    @imgs << link
  end
end
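One thing to watch with the include? checks: they match ".jpg" anywhere in the link, so something like photo.jpg.html would slip through. A rough alternative, assuming you would rather test the file extension of the URL’s path, could look like the following. `jpeg_link?` is a made-up helper name, and the sample links (including example.com) are invented for illustration.

```ruby
require 'uri'

# Hypothetical helper: does the path part of the link end in .jpg or .jpeg?
# The comparison is case-insensitive and ignores any query string.
def jpeg_link?(link)
  path = URI.parse(link.to_s).path.to_s
  %w[.jpg .jpeg].include?(File.extname(path).downcase)
rescue URI::InvalidURIError
  false
end

# Invented sample data; nil stands in for a link without an href.
links = ["/photos/cat.JPG", "http://example.com/dog.jpeg?size=large",
         "/about.html", "/notes/jpg-tips.jpg.html", nil]
jpegs = links.select { |link| jpeg_link?(link) }
```

Here jpegs ends up holding only the first two entries; the .jpg.html page and the nil are dropped.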

Finally, let’s actually grab all those pictures and save them locally using URI and Net::HTTP:

counter = 0
@imgs.each do |image|
  url = URI.parse(image) # parse the url and separate out the needed info
  Net::HTTP.start(url.host, url.port) { |http|
    # request the image by its path on that host
    response = http.get(url.request_uri)
    # actually make the file to save
    File.open("#{url.host}_#{counter}.jpg", "wb") { |file|
      file.write(response.body)
    }
  }
  counter = counter + 1
end
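The download loop assumes every entry in @imgs is already an absolute URL, but src and href values are often relative. A small sketch of resolving them against the address of the scraped page with URI.join is below. The image paths and example.com are invented for illustration; with Mechanize, the base would presumably come from @page.uri rather than a hard-coded string.

```ruby
require 'uri'

# The address of the page that was scraped (trailing slash matters for
# resolving paths that do not start with "/").
base = "http://www.johnyerhot.com/blog/"

# Invented sample values: root-relative, page-relative, and already absolute.
srcs = ["/images/logo.jpg", "pics/me.jpg", "http://example.com/a.jpg"]

# URI.join resolves each reference against the base; absolute URLs pass through.
absolute = srcs.map { |src| URI.join(base, src).to_s }
```

Running the collected URLs through a step like this before the download loop means URI.parse always gets a host and port to work with.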

And there you have it! Put it all together and you should have a functioning Ruby web scraper… sort of. You still have to account for relative vs. absolute URLs. Are you gonna let in more than JPGs? What if the URL needs basic authentication? There are still some missing pieces to implement before this is ready for general use, but the core is there.
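For the basic authentication piece, one possible approach with plain Net::HTTP is to build the request yourself and attach credentials before sending it. This is only a sketch: the path and the credentials are placeholders, and the actual network call is left commented out.

```ruby
require 'net/http'
require 'uri'

# Hypothetical protected image; the path is made up for illustration.
url = URI.parse("http://www.johnyerhot.com/private/photo.jpg")

# Build a GET request and attach an HTTP Basic Authorization header.
request = Net::HTTP::Get.new(url.request_uri)
request.basic_auth("user", "secret") # placeholder credentials

# Sending it would look roughly like this:
# response = Net::HTTP.start(url.host, url.port) { |http| http.request(request) }
```

The response body could then be written to disk the same way as in the download loop.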
Further Reading
Mechanize Docs
Hpricot
Open URI
Ruby net/http Docs