Post on 28-Nov-2014
Practical Web Scraping
with Web::Scraper
Tatsuhiko Miyagawa miyagawa@gmail.com
Six Apart, Ltd. / Shibuya Perl Mongers
SF.pm Lightning Talk
Tatsuhiko Miyagawa 2007/11/027 SF.pm Lightning Talks
How many of you have done
screen-scraping w/ Perl?
How many of you have used
LWP::Simple and regexp?
<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
> perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'
Monday, August 27, 2007 at 12:49:46
It works!
WWW::MySpace 0.70
WWW::Search::Ebay 2.231
There are 3 problems (at least)
(1) Fragile
Easy to break even with slight HTML changes (newlines, order of attributes, etc.)
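To make the fragility concrete, here is a minimal sketch that runs the pattern from the one-liner against three renderings of the same element. The `extra_attr` and `newline` variants are hypothetical markup changes invented for illustration, not taken from the real page.

```perl
use strict;
use warnings;

# The pattern used in the LWP::Simple one-liner.
my $re = qr{<strong id="ctu">(.*?)</strong>};

# Same element, three renderings. Only "original" comes from the slides;
# the other two are made-up examples of harmless-looking HTML changes.
my %html = (
    original   => '<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong>',
    extra_attr => '<strong class="time" id="ctu">Monday, August 27, 2007 at 12:49:46</strong>',
    newline    => qq{<strong id="ctu">Monday, August 27, 2007\nat 12:49:46</strong>},
);

my %matched;
for my $name (sort keys %html) {
    $matched{$name} = $html{$name} =~ $re ? 1 : 0;
    printf "%-10s %s\n", $name, $matched{$name} ? "matched" : "no match";
}
```

Only the original markup matches: the added attribute breaks the literal opening tag, and `.` does not cross the newline without the `/s` modifier.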
(2) Hard to maintain
Regular-expression-based scrapers are good only when they're used in write-only scripts
(3) Improper
HTML & encoding handling
<span class="message">I ♥ Shibuya</span>
> perl -e '$c =~ m@<span class="message">(.*?)</span>@ and print $1'
I ♥ Shibuya
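A small sketch of the entity problem, under the assumption that the page source writes the heart as the entity `&hearts;` (the transcript shows only the rendered character). It uses `decode_entities` from HTML::Entities, which ships with the HTML-Parser distribution.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

binmode STDOUT, ':encoding(UTF-8)';

# Assumption: the source markup spells the heart as an HTML entity.
my $c = '<span class="message">I &hearts; Shibuya</span>';

# The regex returns the text between the tags exactly as written.
my ($raw) = $c =~ m@<span class="message">(.*?)</span>@;
print "$raw\n";        # I &hearts; Shibuya

# Decoding is a separate step a regex-only scraper has to remember.
my $decoded = decode_entities($raw);
print "$decoded\n";    # I ♥ Shibuya
```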
Web::Scraper to the rescue
Web scraping toolkit inspired by scrapi.rb
DSL-ish
Example
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;
use URI;
my $s = scraper {
    process "strong#ctu", time => 'TEXT';
    result 'time';
};
my $uri = URI->new("http://timeanddate.com/worldclock/");
print $s->scrape($uri);
Basics
use Web::Scraper;
my $s = scraper {
    # DSL goes here
};
my $res = $s->scrape($uri);
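The skeleton above can be tried offline, since `scrape` also accepts an HTML string in place of a URI object. A minimal sketch, where the HTML is a simplified stand-in for the world clock page (an assumption for illustration):

```perl
use strict;
use warnings;
use Web::Scraper;

# Simplified stand-in for the real page, so no network access is needed.
my $html = <<'HTML';
<html><body>
<p>Current <strong>UTC</strong> time: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong></p>
</body></html>
HTML

my $s = scraper {
    process "strong#ctu", time => 'TEXT';
};

# Without a "result" call, scrape returns a hash reference of all keys.
my $res = $s->scrape($html);
print $res->{time}, "\n";
```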
process
process $selector,
    $key => $what,
    …;
$selector:
CSS Selector or
XPath (starts with /)
CSS Selector: strong#ctu
XPath: //strong[@id="ctu"]
<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />
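As a quick check that the two selector styles are interchangeable here, this sketch scrapes the same markup once with the CSS selector and once with the XPath expression. The surrounding HTML is a simplified stand-in for the real page.

```perl
use strict;
use warnings;
use Web::Scraper;

# Simplified stand-in for the world clock page (an assumption for illustration).
my $html = '<html><body><p>Current <strong>UTC</strong> time: '
         . '<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong></p></body></html>';

# Same target element, addressed two ways.
my $by_css   = scraper { process 'strong#ctu',          time => 'TEXT'; };
my $by_xpath = scraper { process '//strong[@id="ctu"]', time => 'TEXT'; };

my $css_time   = $by_css->scrape($html)->{time};
my $xpath_time = $by_xpath->scrape($html)->{time};

print "CSS:   $css_time\n";
print "XPath: $xpath_time\n";
```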
$key:
key for the result hash
append "[]" for looping
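The effect of the "[]" suffix can be seen in a short sketch. The list markup and the key names `item` and `items` are made up for illustration.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical markup with several matching nodes.
my $html = '<html><body><ul><li>one</li><li>two</li><li>three</li></ul></body></html>';

my $s = scraper {
    process "li", item      => 'TEXT';   # plain key: first matching node only
    process "li", 'items[]' => 'TEXT';   # "[]" suffix: one entry per matching node
};

my $res = $s->scrape($html);
print "item:  $res->{item}\n";
print "items: @{ $res->{items} }\n";
```

The "[]" is stripped from the key name, so the array lands under `items`.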
$what:
'@attr'
'TEXT'
'RAW'
Web::Scraper
sub { … }
Hash reference
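A rough sketch of several `$what` forms applied to one element. The markup and key names are assumptions for illustration; the callback form receives the matched element as its first argument.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical markup: a link whose text contains a nested tag.
my $html = '<html><body><a href="http://vienna.yapceurope.org/">YAPC::<b>Europe</b></a></body></html>';

my $s = scraper {
    process "a",
        href  => '@href',                                # attribute value
        text  => 'TEXT',                                 # text content, tags stripped
        raw   => 'RAW',                                  # inner markup, tags kept
        chars => sub { length $_[0]->as_text },          # callback on the element
        both  => { link => '@href', name => 'TEXT' };    # hash ref of extractors
};

my $res = $s->scrape($html);
print "href:  $res->{href}\n";
print "text:  $res->{text}\n";
print "raw:   $res->{raw}\n";
print "chars: $res->{chars}\n";
print "both:  $res->{both}{name} -> $res->{both}{link}\n";
```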
<ul class="sites"><li><a href="http://vienna.openguides.org/">OpenGuides</a></li><li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li></ul>
process "ul.sites > li > a",
    'urls[]' => '@href';
# { urls => [ … ] }
process '//ul[@class="sites"]/li/a',
    'names[]' => 'TEXT';
# { names => [ 'OpenGuides', … ] }
process "ul.sites > li > a",
    'sites[]' => {
        link => '@href', name => 'TEXT',
    };
# { sites => [ { link => …, name => … },
# { link => …, name => … } ] };
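The three `process` calls above can be combined into one runnable sketch. It wraps the list from the slides in a minimal HTML document (the wrapper is an assumption) and scrapes it from a string rather than a URI.

```perl
use strict;
use warnings;
use Web::Scraper;

# The list from the slides, wrapped in a minimal document.
my $html = <<'HTML';
<html><body>
<ul class="sites"><li><a href="http://vienna.openguides.org/">OpenGuides</a></li><li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li></ul>
</body></html>
HTML

my $s = scraper {
    process "ul.sites > li > a",         'urls[]'  => '@href';
    process '//ul[@class="sites"]/li/a', 'names[]' => 'TEXT';
    process "ul.sites > li > a",         'sites[]' => { link => '@href', name => 'TEXT' };
};

my $res = $s->scrape($html);

# One line per site, built from the array of hashes.
print "$_->{name}: $_->{link}\n" for @{ $res->{sites} };
```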
Tools
> cpan Web::Scraper
comes with 'scraper' CLI
> scraper http://example.com/
scraper> process "a", "links[]" => '@href';
scraper> d
$VAR1 = {
    links => [
        'http://example.org/',
        'http://example.net/',
    ],
};
scraper> y
---
links:
  - http://example.org/
  - http://example.net/
> scraper /path/to/foo.html
> GET http://example.com/ | scraper
Demo
Thank you
http://search.cpan.org/dist/Web-Scraper
http://www.slideshare.net/miyagawa/webscraper