<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>John Yerhot - Weblog &#187; mechanize</title>
	<atom:link href="http://www.johnyerhot.com/tag/mechanize/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.johnyerhot.com</link>
	<description>moof</description>
	<lastBuildDate>Mon, 16 Aug 2010 22:11:50 +0000</lastBuildDate>
	
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Super Simple Ruby Web Scraper</title>
		<link>http://www.johnyerhot.com/2008/05/19/anatomy-of-a-ruby-web-scraper/</link>
		<comments>http://www.johnyerhot.com/2008/05/19/anatomy-of-a-ruby-web-scraper/#comments</comments>
		<pubDate>Mon, 19 May 2008 19:47:37 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Ruby On Rails]]></category>
		<category><![CDATA[How to]]></category>
		<category><![CDATA[mechanize]]></category>
		<category><![CDATA[open uri]]></category>
		<category><![CDATA[Ruby]]></category>
		<category><![CDATA[sweefner.com]]></category>
		<category><![CDATA[web scraper]]></category>

		<guid isPermaLink="false">http://www.johnyerhot.com/2008/05/19/anatomy-of-a-ruby-web-scraper/</guid>
		<description><![CDATA[Alrighty folks.  Quick walkthrough for scraping remote web data with Ruby.  This is how I did it for the little web scraper I wrote on Saturday.
DISCLAIMER: Web scraping is kind of a gray area. Don&#8217;t steal things that are copyrighted and don&#8217;t be a jerk.  Give credit where credit is due.
First [...]]]></description>
			<content:encoded><![CDATA[<p>Alrighty folks.  Quick walkthrough for scraping remote web data with Ruby.  This is how I did it for the little web scraper I wrote on Saturday.</p>
<p><strong>DISCLAIMER:</strong> Web scraping is kind of a gray area. Don&#8217;t steal things that are copyrighted and don&#8217;t be a jerk.  Give credit where credit is due.</p>
<p>First things first.  You&#8217;ll need to install the <a href="http://mechanize.rubyforge.org/mechanize/">Mechanize</a> Ruby gem.</p>
<blockquote><p>sudo gem install mechanize</p></blockquote>
<p>Mechanize is pretty slick.  It fetches the page at a given URL and lets you access its various HTML elements.  You can also use <a href="http://code.whytheluckystiff.net/hpricot/wiki">Hpricot</a> methods to dig further into the data. Let&#8217;s get going.</p>
<blockquote><p>require 'rubygems'<br />
require 'mechanize'<br />
require 'uri'<br />
require 'net/http'</p>
<p>url = "http://www.johnyerhot.com"</p></blockquote>
<p>The way this is set up, you MUST have a complete URL, scheme (http://) included.</p>
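<p>If you want to catch an incomplete URL before Mechanize ever sees it, here is a minimal sketch using Ruby&#8217;s standard uri library. The helper name absolute_http_url? is made up for this example and is not part of Mechanize.</p>

```ruby
require 'uri'

# Hypothetical helper: true only for URLs with an http(s) scheme and a host.
def absolute_http_url?(string)
  uri = URI.parse(string)
  # URI::HTTPS is a subclass of URI::HTTP, so this covers both schemes.
  return false unless uri.is_a?(URI::HTTP)
  !uri.host.to_s.empty?
rescue URI::InvalidURIError
  false
end

puts absolute_http_url?("http://www.johnyerhot.com")  # complete URL
puts absolute_http_url?("www.johnyerhot.com")         # no scheme, so not complete
```

<p>The second call returns false because, without a scheme, the whole string is parsed as a relative path rather than a host.</p>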
<blockquote><p>@mech = WWW::Mechanize.new</p>
<p>@page = @mech.get(url)</p></blockquote>
<p>Now, let&#8217;s say we want to get the URLs of all the embedded images on the web page (http://www.johnyerhot.com).</p>
<blockquote><p>@imgs = @page.search("img[@src]").map {|img| img['src']}</p></blockquote>
<p>You&#8217;ve now got an array (@imgs) with all the URLs for embedded images!  What we actually did was use Hpricot&#8217;s search method to look for image tags and pull the src attribute out of each one.  Mechanize also has its own methods for grabbing tags. For example, you can grab the target of every link on the web page.</p>
<blockquote><p># remember, @page is just the page our Mechanize instance fetched<br />
# from http://www.johnyerhot.com</p>
<p>@links = @page.links.map {|link| link.href}</p></blockquote>
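<p>One thing to watch: the src and href values collected so far are whatever the page author typed, so many of them will be relative paths. Here is a rough sketch of resolving them against the page URL with URI.join from the standard library. The absolutize helper and the sample paths are assumptions for illustration only.</p>

```ruby
require 'uri'

# Hypothetical helper: turn each scraped href/src into an absolute URL,
# using the page's own URL as the base. nil and unparseable values are skipped.
def absolutize(base_url, paths)
  result = []
  paths.compact.each do |path|
    begin
      result.push(URI.join(base_url, path).to_s)
    rescue URI::Error
      next  # skip values that are not valid URLs
    end
  end
  result
end

base = "http://www.johnyerhot.com/2008/05/19/"
puts absolutize(base, ["/images/a.jpg", "b.jpg", "http://example.com/c.jpg", nil])
```

<p>Paths starting with a slash resolve against the site root, bare names resolve against the page&#8217;s directory, and URLs that are already absolute pass through unchanged.</p>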
<p>Now let&#8217;s pick out the links that point to images and add them to @imgs:</p>
<blockquote><p>@links.each do |link|  # yeah, we&#8217;re only collecting JPEGs<br />
if link.to_s =~ /\.jpe?g/i<br />
@imgs &lt;&lt; link<br />
end<br />
end</p></blockquote>
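<p>The check above matches the extension anywhere in the link, so something like photo.jpg.html would slip through. A slightly stricter sketch, assuming you only care about the extension of the URL&#8217;s path, could look like this. The jpeg_link? helper and the example.com links are hypothetical.</p>

```ruby
require 'uri'

# Hypothetical helper: true when the URL's path ends in a JPEG extension.
# Query strings are ignored because only the path is inspected.
def jpeg_link?(link)
  path = URI.parse(link.to_s).path.to_s
  [".jpg", ".jpeg"].include?(File.extname(path).downcase)
rescue URI::InvalidURIError
  false
end

links = ["http://example.com/a.JPG",
         "http://example.com/b.jpeg?size=large",
         "http://example.com/photo.jpg.html",
         nil]
puts links.select { |link| jpeg_link?(link) }
```

<p>Only the first two links survive the filter. The nil entry stands in for an anchor tag with no href.</p>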
<p>Finally, let&#8217;s actually grab all those pictures and save them locally using Net::HTTP&#8230;</p>
<blockquote><p>counter = 0<br />
@imgs.each do |image|<br />
url = URI.parse(image)  # parse the URL to separate out the host, port, and path<br />
Net::HTTP.start(url.host, url.port) { |http|<br />
# request the image path from the host<br />
response = http.get(url.request_uri)<br />
# actually make the file to save<br />
File.open("#{url.host}_#{counter}.jpg", "wb") { |file|<br />
file.write(response.body)<br />
}<br />
counter = counter + 1<br />
}<br />
end</p></blockquote>
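<p>The loop above names every file .jpg no matter what it downloaded. If you ever collect other image types, one option is to build the filename from the URL&#8217;s own extension. This is only a sketch: the local_filename helper is invented for this example, and falling back to .jpg is my assumption.</p>

```ruby
require 'uri'

# Hypothetical helper: build a local filename from the image URL and a
# counter, keeping the URL's own extension (falling back to .jpg).
def local_filename(image_url, counter)
  url = URI.parse(image_url)
  extension = File.extname(url.path.to_s).downcase
  extension = ".jpg" if extension.empty?
  "#{url.host}_#{counter}#{extension}"
end

puts local_filename("http://www.johnyerhot.com/images/logo.PNG", 0)  # www.johnyerhot.com_0.png
puts local_filename("http://www.johnyerhot.com/photo", 1)            # www.johnyerhot.com_1.jpg
```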
<p>And there you have it! Put it all together and you should have a functioning Ruby web scraper&#8230; sort of. You still have to account for relative vs. absolute URLs. Are you going to let in more than JPEGs? What if the URL needs basic authentication? There are still some missing pieces to implement before this is ready for general use, but the core is there.<br />
<strong>Further Reading</strong><br />
<a href="http://mechanize.rubyforge.org/mechanize/">Mechanize Docs</a><br />
<a href="http://code.whytheluckystiff.net/hpricot">Hpricot</a><br />
<a href="http://www.ruby-doc.org/stdlib/libdoc/open-uri/rdoc/">Open URI</a><br />
<a href="http://www.ruby-doc.org/stdlib/libdoc/net/http/rdoc/index.html">Ruby net/http Docs</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnyerhot.com/2008/05/19/anatomy-of-a-ruby-web-scraper/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
