Web::Scraper for SF.pm LT

Date post: 28-Nov-2014
Upload: tatsuhiko-miyagawa
Transcript
Page 1: Web::Scraper for SF.pm LT

Practical Web Scraping

with Web::Scraper

Tatsuhiko Miyagawa

Six Apart, Ltd. / Shibuya Perl Mongers

SF.pm Lightning Talk

Page 2: Web::Scraper for SF.pm LT

Tatsuhiko Miyagawa 2007/11/027 SF.pm Lightning Talks

How many of you have done screen-scraping w/ Perl?

Page 3: Web::Scraper for SF.pm LT

How many of you have used LWP::Simple and regexp?

Page 4: Web::Scraper for SF.pm LT

Page 5: Web::Scraper for SF.pm LT

<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

Page 6: Web::Scraper for SF.pm LT

<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

> perl -MLWP::Simple -le '$c = get("http://timeanddate.com/worldclock/"); $c =~ m@<strong id="ctu">(.*?)</strong>@ and print $1'

Monday, August 27, 2007 at 12:49:46

Page 7: Web::Scraper for SF.pm LT

It works!

Page 8: Web::Scraper for SF.pm LT

WWW::MySpace 0.70

Page 9: Web::Scraper for SF.pm LT

WWW::Search::Ebay 2.231

Page 10: Web::Scraper for SF.pm LT

There are 3 problems (at least)

Page 11: Web::Scraper for SF.pm LT

(1) Fragile

Easy to break even with slight HTML changes (like newlines, order of attributes, etc.)
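
The fragility is easy to reproduce with core Perl alone. The sketch below runs the same pattern as the LWP::Simple one-liner against three variants of the markup. The variant strings and the helper name extract_time are made up for illustration and are not taken from the real site.

```perl
use strict;
use warnings;

# The pattern from the one-liner, wrapped in a hypothetical helper.
sub extract_time {
    my ($html) = @_;
    return $html =~ m@<strong id="ctu">(.*?)</strong>@ ? $1 : undef;
}

# Illustrative markup variants (assumptions, not real page content).
my %variants = (
    original   => '<strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong>',
    extra_attr => '<strong class="big" id="ctu">Monday, August 27, 2007 at 12:49:46</strong>',
    newline    => qq{<strong id="ctu">Monday, August 27,\n2007 at 12:49:46</strong>},
);

for my $name (sort keys %variants) {
    my $time = extract_time($variants{$name});
    printf "%-10s %s\n", $name, defined $time ? "matched" : "no match";
}
```

Only the original variant matches. An attribute placed before id defeats the literal tag text, and a line break inside the element defeats the dot, which does not match a newline unless the /s modifier is used.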

Page 12: Web::Scraper for SF.pm LT

(2) Hard to maintain

Regular expression based scrapers are good only when they're used in write-only scripts

Page 13: Web::Scraper for SF.pm LT

(3) Improper HTML & encoding handling

Page 14: Web::Scraper for SF.pm LT

<span class="message">I &hearts; Shibuya</span>

> perl -e '$c =~ m@<span class="message">(.*?)</span>@ and print $1'

I &hearts; Shibuya
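
The regexp returns the raw markup, so the entity is never decoded. A minimal sketch of the difference, assuming HTML::Entities is installed (it ships with the HTML-Parser distribution, not with core Perl). The variable names are illustrative.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

my $html = '<span class="message">I &hearts; Shibuya</span>';

# The regexp captures whatever sits between the tags, entity and all.
my ($raw) = $html =~ m@<span class="message">(.*?)</span>@;

# Decoding returns a copy with &hearts; replaced by the character U+2665.
my $text = decode_entities($raw);

binmode STDOUT, ':encoding(UTF-8)';
print "raw:     $raw\n";
print "decoded: $text\n";
```

A regexp-based scraper has to remember this decoding step for every field it extracts, which is one reason a parser-based approach is less error-prone.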

Page 15: Web::Scraper for SF.pm LT

Web::Scraper to the rescue

Page 16: Web::Scraper for SF.pm LT

Web scraping toolkit inspired by scrapi.rb

DSL-ish

Page 17: Web::Scraper for SF.pm LT

Example

#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;
use URI;

my $s = scraper {
    process "strong#ctu", time => 'TEXT';
    result 'time';
};

my $uri = URI->new("http://timeanddate.com/worldclock/");
print $s->scrape($uri);

Page 18: Web::Scraper for SF.pm LT

Basics

use Web::Scraper;

my $s = scraper {
    # DSL goes here
};

my $res = $s->scrape($uri);

Page 19: Web::Scraper for SF.pm LT

process

process $selector,
    $key => $what,
    …;

Page 20: Web::Scraper for SF.pm LT

$selector:

CSS Selector or XPath (starts with /)

Page 21: Web::Scraper for SF.pm LT

CSS Selector: strong#ctu

XPath: //strong[@id="ctu"]

<td>Current <strong>UTC</strong> (or GMT/Zulu)-time used: <strong id="ctu">Monday, August 27, 2007 at 12:49:46</strong> <br />

Page 22: Web::Scraper for SF.pm LT

$key:

key for the result hash

append "[]" for looping

Page 23: Web::Scraper for SF.pm LT

$what:

'@attr'

'TEXT'

'RAW'

Web::Scraper

sub { … }

Hash reference

Page 24: Web::Scraper for SF.pm LT

<ul class="sites"><li><a href="http://vienna.openguides.org/">OpenGuides</a></li><li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li></ul>

Page 25: Web::Scraper for SF.pm LT

process "ul.sites > li > a",
    'urls[]' => '@href';

# { urls => [ … ] }

<ul class="sites"><li><a href="http://vienna.openguides.org/">OpenGuides</a></li><li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li></ul>

Page 26: Web::Scraper for SF.pm LT

process '//ul[@class="sites"]/li/a',
    'names[]' => 'TEXT';

# { names => [ 'OpenGuides', … ] }

<ul class="sites"><li><a href="http://vienna.openguides.org/">OpenGuides</a></li><li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li></ul>

Page 27: Web::Scraper for SF.pm LT

process "ul.sites > li > a",
    'sites[]' => {
        link => '@href', name => 'TEXT',
    };

# { sites => [ { link => …, name => … },
#              { link => …, name => … } ] };

<ul class="sites"><li><a href="http://vienna.openguides.org/">OpenGuides</a></li><li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li></ul>
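
Putting the pieces together, here is a self-contained sketch that runs the nested-hash form against the markup from the slide. It assumes Web::Scraper is installed from CPAN, and it passes the HTML to scrape as a string rather than a URI so that it works offline. The variable names are illustrative.

```perl
use strict;
use warnings;
use Web::Scraper;

# The markup from the slide, held in a string so no network access is needed.
my $html = <<'HTML';
<ul class="sites">
<li><a href="http://vienna.openguides.org/">OpenGuides</a></li>
<li><a href="http://vienna.yapceurope.org/">YAPC::Europe</a></li>
</ul>
HTML

my $s = scraper {
    # One hash per matched <a>, collected into the 'sites' array.
    process "ul.sites > li > a",
        'sites[]' => {
            link => '@href',
            name => 'TEXT',
        };
};

# Without a 'result' call, scrape returns the whole result hash reference.
my $res = $s->scrape($html);

for my $site (@{ $res->{sites} }) {
    print "$site->{name}: $site->{link}\n";
}
```

Scraping a string is also handy for testing a scraper against saved pages before pointing it at a live site.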

Page 28: Web::Scraper for SF.pm LT

Tools

Page 29: Web::Scraper for SF.pm LT

> cpan Web::Scraper

comes with 'scraper' CLI

Page 30: Web::Scraper for SF.pm LT

> scraper http://example.com/
scraper> process "a", "links[]" => '@href';
scraper> d
$VAR1 = {
    links => [
        'http://example.org/',
        'http://example.net/',
    ],
};
scraper> y
---
links:
  - http://example.org/
  - http://example.net/

Page 31: Web::Scraper for SF.pm LT

> scraper /path/to/foo.html

> GET http://example.com/ | scraper

Page 32: Web::Scraper for SF.pm LT

Demo

Page 33: Web::Scraper for SF.pm LT

Thank you

http://search.cpan.org/dist/Web-Scraper

http://www.slideshare.net/miyagawa/webscraper
