A Fast-Track-Overview on Web Scraping with R - UseR! 2015

Date post: 27-Jul-2020
A Fast-Track-Overview on Web Scraping with R UseR! 2015 Peter Meißner Comparative Parliamentary Politics Working Group University of Konstanz https://github.com/petermeissner http://pmeissner.com http://www.r-datacollection.com/ presented: 2015-07-01 / last update: 2015-06-30
Transcript
Page 1: A Fast-Track-Overview on Web Scraping with R - UseR! 2015 · tm.plugin.webmining TR8 translatetranslateR treebase tspmeta tumblR twitteR ucbthesis Rcpp urltools ustyc V8 vdmR vardpoor

A Fast-Track-Overview on Web Scraping with R

UseR! 2015

Peter Meißner
Comparative Parliamentary Politics Working Group

University of Konstanz

https://github.com/petermeissner
http://pmeissner.com

http://www.r-datacollection.com/

presented: 2015-07-01 / last update: 2015-06-30

Introduction

[Figure: cloud of several hundred R package names, from acs to zendeskR, among them stringr, XML, httr, jsonlite, RCurl, rjson, RJSONIO, rvest, xml2, magrittr and Rcpp]

Introduction

phase       problems     examples
download    protocols    HTTP, HTTPS, POST, GET, . . .
            procedures   cookies, authentication, forms, . . .
----------  -----------  ----------------------------------------
extraction  parsing      translating HTML (XML, JSON, . . . ) into R
            extraction   getting the relevant parts
            cleansing    cleaning up, restructure, combine

Conventions

All code examples assume . . .

- dplyr
- magrittr

. . . to be loaded via . . .

library(dplyr)
library(magrittr)

. . . while all other package dependencies will be made explicit on an example-by-example basis.

Reading Text from the Web

news <-
  "http://cran.r-project.org/web/packages/base64enc/NEWS" %>%
  readLines()

news %>% extract(1:10) %>% cat(sep="\n")

## 0.1-2 2014-06-26
## o bugfix: encoding content of more than 65536 bytes without
## linebreaks produced padding characters between chunks because
## chunk size was not divisible by three.
##
##
## 0.1-1 2012-11-05
## o fix a bug in base64decode where output is a file name
##
## o add base64decode(file=...) as a (non-leaking) shorthand for

Extracting Information from Text

. . . with base R

news %>%
  substring(7, 16) %>%
  grep("\\d{4}.\\d{1,2}.\\d{1,2}", ., value=T)

## [1] "2014-06-26" "2012-11-05" "2012-09-07"

Extracting Information from Text

. . . with stringr

library(stringr)

news %>%
  str_extract("\\d{4}.\\d{1,2}.\\d{1,2}")

##  [1] "2014-06-26" NA           NA           NA
##  [5] NA           NA           "2012-11-05" NA
##  [9] NA           NA           NA           NA
## [13] NA           "2012-09-07" NA
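
Both extraction variants above can be tried offline. The following is a minimal base R sketch, assuming a handful of made-up lines in the NEWS layout (`news_sample` is not from the talk). It replaces the `.` separators in the pattern with literal dashes so that only ISO-style dates match, and uses `regmatches()` to return just the hits rather than a vector padded with `NA`.

```r
# Sample lines mimicking the NEWS layout shown above (made up for illustration)
news_sample <- c(
  "0.1-2 2014-06-26",
  " o bugfix: encoding content of more than 65536 bytes",
  "",
  "0.1-1 2012-11-05",
  " o fix a bug in base64decode"
)

# Stricter pattern: literal dashes instead of "." (which matches any character)
date_pattern <- "[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}"

# regexpr() locates the first match per line, regmatches() returns only the hits
dates <- regmatches(news_sample, regexpr(date_pattern, news_sample))

# Convert to Date so the values can be sorted or compared
release_dates <- as.Date(dates)
print(release_dates)
```

Converting to `Date` is optional; it is one way to make the extracted strings directly usable for sorting or date arithmetic.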

HTML / XML

. . . with rvest

library(rvest)

rpack_html <-
  "http://cran.r-project.org/web/packages" %>%
  html()   # later rvest releases replaced html() with read_html()

rpack_html %>% class()

## [1] "HTMLInternalDocument" "HTMLInternalDocument"
## [3] "XMLInternalDocument"  "XMLAbstractDocument"

HTML / XML

. . . with rvest

rpack_html %>% xml_structure(indent = 2)

## {DTD}
## <html [xmlns]>
##   <head>
##     <title> {text}
##     <link [rel, type, href]>
##     <meta [http-equiv, content]>
##   <body>
##     {text}
##     <h1> {text}
##     {text}
##     <h3 [id]> {text}
##     {text}
##     <p> {text}
##     {text}
##     <p>
##       {text}
##       <a [href]> {text}
##       {text}
##     {text}
##     <p>
##       {text}
##       <a [href]> {text}
##       {text}
##     {text}
##     <h3 [id]> {text}
##     {text}
##     <p>
##       {text}
##       <kbd> {text}
##       {text}
##       <kbd> {text}
##       {text}
##       <a [href]> {text}
##       {text}
##     {text}
##     <p>
##       {text}
##       <a [href]> {text}
##       {text}
##     {text}
##     <h3 [id]> {text}
##     {text}
##     <p>
##       {text}
##       <a [href]> {text}
##       {text}
##       <a [href]> {text}
##       {text}
##     {text}
##     <p>
##       {text}
##       <a [href]> {text}
##       {text}
##       <a [href]> {text}
##       {text}
##       <a [href]> {text}
##       {text}
##     {text}
##     <h3 [id]> {text}
##     {text}
##     <p>
##       {text}
##       <a [href]> {text}
##       {text}
##     {text}
##     <h3 [id]> {text}
##     {text}
##     <p>
##       {text}
##       <a [href]> {text}
##       {text}
##       <a [href]> {text}
##       {text}
##     {text}
##     <hr>
##     <h3 [id]> {text}
##     {text}
##     <dl>
##       <dt>
##         <a [href]> {text}
##       {text}
##       <dd> {text}
##       {text}
##       <dt>
##         <a [href]> {text}
##       {text}
##       <dd>
##         {text}
##         <a [href]> {text}
##         {text}
##       {text}
##       <dt>
##         <a [href]> {text}
##       {text}
##       <dd> {text}
##       {text}
##       <dt>
##         <a [href]> {text}
##       {text}
##       <dd> {text}
##       {text}
##       <dt>
##         <a [href]> {text}
##       {text}
##       <dd> {text}
##       {text}

HTML / XML

. . . with rvest

rpack_html %>% html_text() %>% cat()

## CRAN - Contributed Packages
## Contributed Packages
##
## Available Packages
## Currently, the CRAN package repository features 6803 available packages.
## Table of available packages, sorted by date of publication
## Table of available packages, sorted by name
## Installation of Packages
##
## Please type
## help("INSTALL")
## or
## help("install.packages")
## in R for information on how to install packages from this
## repository. The manual
##
## R Installation and Administration
## (also contained in the R base sources)
## explains the process in detail.
##
##
## CRAN Task Views
## allow you to browse packages by topic and provide tools to
## automatically install all packages for special areas of
## interest.
## Currently, 33 views are available.
##
##
## Package Check Results
##
## All packages are tested regularly on machines running
## Debian GNU/Linux,
## Fedora and
## Solaris.
## Packages are also checked under OS X and Windows, but
## typically only on the day the package appears on CRAN.
##
##
## The results are summarized in the
## check summary (some
## timings are also available).
## Additional details for Windows checking and building can be
## found in the
## Windows
## check summary.
##
##
## Writing Your Own Packages
##
## The manual
## Writing R Extensions
## (also contained in the R base sources) explains how to write
## new packages and how to contribute them to CRAN.
##
## Repository Policies
##
## The manual
## CRAN Repository Policy
## [PDF]
## describes the policies in place for the CRAN package repository.
##
##
## Related Directories
## Archive
## Previous versions of the packages listed above, and other packages formerly available.
## Orphaned
## Packages with no active maintainer, see the corresponding README.
## bin/windows/contrib
## Windows binaries of contributed packages
## bin/macosx/contrib
## OS X Snow Leopard binaries of contributed packages
## bin/macosx/mavericks/contrib
## OS X Mavericks binaries of contributed packages

Extraction from HTML / XML

. . . with rvest and XPath

rpack_html %>%
  html_node(xpath="//p/a[contains(@href, 'views')]/..")

## <p>
## <a href="../views/">CRAN Task Views</a>
## allow you to browse packages by topic and provide tools to
## automatically install all packages for special areas of
## interest.
## Currently, 33 views are available.
## </p>

Extraction from HTML / XML

. . . with rvest and XPath

rpack_html %>%
  html_nodes(xpath="//a") %>%
  html_attr("href") %>%
  extract(1:6)

## [1] "available_packages_by_date.html"
## [2] "available_packages_by_name.html"
## [3] "../../manuals.html#R-admin"
## [4] "../views/"
## [5] "http://www.debian.org/"
## [6] "http://www.fedoraproject.org/"
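
The XPath queries above can be practised without a network connection. This is a small sketch, assuming a made-up inline page (the HTML string is illustrative, not the CRAN page) and `read_html()`, the parser entry point in current rvest releases. It also shows the same link selection written as a CSS selector for comparison.

```r
library(rvest)

# A tiny inline page (made up for illustration) instead of a live request
page <- read_html('<html><body>
    <p><a href="../views/">CRAN Task Views</a> allow you to browse packages.</p>
    <p><a href="http://www.debian.org/">Debian</a></p>
  </body></html>')

# All link targets, selected with XPath
hrefs <- page %>% html_nodes(xpath = "//a") %>% html_attr("href")

# The same selection written as a CSS selector
hrefs_css <- page %>% html_nodes("a") %>% html_attr("href")

# Text of the paragraph that contains the link whose href mentions "views"
views_p <- page %>%
  html_node(xpath = "//p/a[contains(@href, 'views')]/..") %>%
  html_text()

print(hrefs)
```

Whether XPath or CSS selectors read better is largely a matter of taste. XPath can walk up the tree (the trailing `/..`), which CSS selectors cannot.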

Extraction from HTML / XML

. . . with rvest convenience functions

"http://cran.r-project.org/web/packages/multiplex/index.html" %>%
  html() %>%
  html_table() %>%
  extract2(1) %>%
  filter(X1 %in% c("Version:", "Published:", "Author:"))

##           X1                       X2
## 1   Version:                      1.6
## 2 Published:               2015-05-19
## 3    Author: J. Antonio Rivero Ostoic

JSON

"https://api.github.com/users/daroczig/repos" %>%
  readLines(warn=F) %>%
  substring(1,300) %>%
  str_wrap(60) %>%
  cat()

## [{"id":
## 12325008,"name":"AndroidInAppBilling","full_name":"daroczig/
## AndroidInAppBilling","owner":{"login":"daroczig","id":
## 495736,"avatar_url":"https://avatars.githubusercontent.com/
## u/495736?v=3","gravatar_id":"","url":"https://
## api.github.com/users/daroczig","html_url":"https://
## github.com/daroczig","f

JSON

. . . with jsonlite

library(jsonlite)

fromJSON("https://api.github.com/users/daroczig/repos") %>%
  select(language) %>%
  table() %>%
  sort(decreasing=TRUE)

## .
##          R JavaScript Emacs Lisp      Groff     Jasmin
##         16          4          1          1          1
##       Java        PHP     Python
##          1          1          1
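
To see what `fromJSON()` does to such a response without calling the GitHub API, here is a minimal sketch on a made-up JSON string (`repos_json` and its values are illustrative only). It assumes jsonlite's default simplification, which turns an array of similar objects into a data frame.

```r
library(jsonlite)

# Made-up JSON in the shape of a repository listing (illustrative values only)
repos_json <- '[
  {"name": "repo_a", "language": "R"},
  {"name": "repo_b", "language": "R"},
  {"name": "repo_c", "language": "JavaScript"},
  {"name": "repo_d", "language": null}
]'

# An array of objects with the same keys is simplified to a data frame
repos <- fromJSON(repos_json)

# Count repositories per language; JSON null becomes NA and is dropped by table()
lang_counts <- sort(table(repos$language), decreasing = TRUE)
print(lang_counts)
```

Working on a saved or hand-written JSON string like this is also a convenient way to develop the extraction step before spending API requests on it.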

HTML forms / HTTP methods

. . . with rvest and httr

library(rvest)
library(httr)

text <-
  "Quirky spud boys can jam after zapping five worthy Polysixes."

mainpage <- html("http://read-able.com")

HTML forms / HTTP methods

. . . with rvest and httr

mainpage %>%
  html_nodes(xpath="//form") %>%
  html_attrs()

## [[1]]
##      method      action
##       "get" "check.php"
##
## [[2]]
##      method      action
##      "post" "check.php"

HTML forms / HTTP methods

. . . with rvest and httr

mainpage %>%
  html_nodes(
    xpath="//form[@method='post']//*[self::textarea or self::input]"
  )

## [[1]]
## <textarea id="directInput" name="directInput" rows="10" cols="60"></textarea>
##
## [[2]]
## <input type="submit" value="Calculate Readability" />
##
## attr(,"class")
## [1] "XMLNodeSet"

HTML forms / HTTP methods

. . . with rvest and httr

response <-
  POST(
    "http://read-able.com/check.php",
    body=list(directInput = text),
    encode="form"
  )

HTML forms / HTTP methods

. . . with rvest and httr

response %>%
  extract2("content") %>%
  rawToChar() %>%
  html() %>%
  html_table() %>%
  extract2(1)

##                            X1   X2 X3
## 1 Flesch Kincaid Reading Ease 61.3 NA
## 2  Flesch Kincaid Grade Level  7.2 NA
## 3           Gunning Fog Score  4.0 NA
## 4                  SMOG Index  6.0 NA
## 5          Coleman Liau Index 14.2 NA
## 6 Automated Readability Index  7.6 NA
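
For intuition about what `encode="form"` sends, the following base R sketch builds a form-encoded body by hand. It is an approximation under stated assumptions, not httr's actual implementation: the helper names `fields`, `encoded` and `body` are made up here, and the exact escaping httr applies may differ in detail.

```r
# Fields of the POST form; directInput is the textarea name found in the form
fields <- list(directInput = "Quirky spud boys can jam.")

# Percent-encode every value (reserved = TRUE also escapes characters like & and =)
encoded <- vapply(
  fields,
  function(v) utils::URLencode(v, reserved = TRUE),
  character(1)
)

# Join as key=value pairs separated by "&"
body <- paste(names(encoded), encoded, sep = "=", collapse = "&")
print(body)
```

The point is only that a form submission is a plain string of escaped key-value pairs, which is why the field name taken from the `<textarea>` matters.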

Overcoming the Javascript Barrier

. . . with RSelenium browser automation

library(RSelenium)
checkForServer()                 # make sure Selenium Server is installed
startServer()
remDr <- remoteDriver()          # defaults to Firefox
dings <- remDr$open(silent=T)    # see: ?remoteDriver !!
remDr$navigate("https://spiegel.de")
remDr$screenshot(
  display   = F,
  useViewer = F,
  file      = paste0("spiegel_", Sys.Date(), ".png")
)

Authentication

. . . with httr and httpuv

library(httpuv)
library(httr)

twitter_token <- oauth1.0_token(
  oauth_endpoints("twitter"),
  twitter_app <- oauth_app(
    "petermeissneruser2015",
    key    = "fP7WB5CcoZNLVQ2Xh8nAdFVAN",
    secret = "PQG1eEJZ65Mb8ANHz8q7yp4MqgAmiAVED90F4ZvQUSTHxiGzPT"
  )
)

Authentication

. . . with httr and httpuv

req <-
  GET(
    paste0(
      "https://api.twitter.com/1.1/search/tweets.json",
      "?q=%23user2015&result_type=recent&count=100"
    ),
    config(token = twitter_token)
  )

Authentication

. . . with httr and httpuv

tweets <-
  req %>%
  content("parsed") %>%
  extract2("statuses") %>%
  lapply(`[`, "text") %>%
  unlist(use.names=FALSE) %>%
  subset(!grepl("^RT ", .)) %>%   # "." is the piped vector; tweets does not exist yet
  extract(1:15)
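
The extraction and retweet filter can be tried without API credentials. Below is a base R sketch, assuming a made-up list in the shape of the parsed `statuses` element (the tweets and ids are invented for illustration).

```r
# Made-up stand-in for the parsed "statuses" element of the API response
statuses <- list(
  list(text = "On my way to #useR2015", id = 1),
  list(text = "RT someone: see you at #useR2015", id = 2),
  list(text = "Nice to be in #Aalborg", id = 3)
)

# Pull the text field out of every status
texts <- vapply(statuses, function(s) s$text, character(1))

# Keep only tweets that do not start with the retweet marker "RT "
originals <- texts[!grepl("^RT ", texts)]
print(originals)
```

`vapply()` is used here instead of `lapply()` plus `unlist()` because it fails loudly if a status lacks a single text string.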

Authentication

. . . with httr and httpuv

tweets %>% substr(1,60) %>% cat(sep="\n")

## We're almost ready! #useR2015 @RevolutionR http://t.co/Lxov7
## The booth is getting ready for you! See you soon! #useR2015
## jra kzvettetni prblunk: 4+ magyaroszgi ltogat a #user2015 ko
## Congress centre only a stones throw from my room #UseR2015 h
## On my way to #useR2015 Wup, wup!
## My first day at #useR2015 is about to get underway! I hope y
## TIBCO: RT ianmcook: In Denmark at user2015aalborg #useR2015
## And please all Hungarian attendees of #user2015 ping me to g
## @MangoTheCat has arrived! #rstats #user2015 #aalborg http://
## great experience in #DataMeetsViz, now ready for #useR2015
## Are you looking forward to see this year's t-shirt? Reg. ope
## Trying the Danish hospitality at #useR2015 http://t.co/Qrsvb
## Nice to be in the beautiful #Aalborg for #useR2015. See you
## Time to get some sleep... need to be alert for #useR2015 #rs
## In Denmark at @user2015aalborg #useR2015 with @TIBCO #Spotfi

Technologies and Packages

- Regular Expressions / String Handling
  - stringr, stringi

- HTML / XML / XPath / CSS Selectors
  - rvest, xml2, XML

- JSON
  - jsonlite, RJSONIO, rjson

- HTTP / HTTPS
  - httr, curl, RCurl

- Javascript / Browser Automation
  - RSelenium

- URL
  - urltools

Reads

- Basics on HTML, XML, JSON, HTTP, RegEx, XPath
  - Munzert et al. (2014): Automated Data Collection with R. Wiley. http://www.r-datacollection.com/

- curl / libcurl
  - http://curl.haxx.se/libcurl/c/curl_easy_setopt.html

- CSS Selectors
  - W3Schools: http://www.w3schools.com/cssref/css_selectors.asp

- Packages: httr, rvest, jsonlite, xml2, curl
  - Readmes, demos and vignettes accompanying the packages

- Packages: RCurl and XML
  - Munzert et al. (2014): Automated Data Collection with R. Wiley.
  - Nolan and Temple Lang (2013): XML and Web Technologies for Data Science with R. Springer.

Conclusion

- Use Mac or Linux, because the time will come when special characters punch you in the face on R/Windows, and according to R-devel this is unlikely to change any time soon.

- Do not listen to guys saying you should use some other language for web scraping. If you like R, use R - for any job.

- Use stringr, rvest and jsonlite first and the other packages if needed.

- If you want to do scraping, learn regular expressions, file manipulation with R (file.create(), file.remove(), . . . ), XPath or CSS selectors and a little HTML, XML and JSON.

- Web scraping in R has evolved to a convenient state but is still a moving target: within a year there might be even more powerful and/or more convenient packages.

- Before scraping data: (1) watch for the download button; (2) have a look at the CRAN Web Technologies Task View; (3) look for an API, or check whether someone else has done it before.
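
As one illustration of why file manipulation belongs on that list, here is a base R sketch of a download cache. `read_cached()` and `fake_fetch()` are hypothetical helpers written for this example, and the URL is a placeholder that is never actually requested.

```r
# Hypothetical helper: read a page from a local cache, fetching it only once
read_cached <- function(url, cache_file, fetch = readLines) {
  if (!file.exists(cache_file)) {
    writeLines(fetch(url), cache_file)   # first call: fetch and store
  }
  readLines(cache_file)                  # every call: read the local copy
}

# Offline demonstration with a stand-in fetch function instead of a real download
fake_fetch <- function(url) c("line one", "line two")
cache <- tempfile(fileext = ".txt")

first  <- read_cached("http://example.com/page", cache, fetch = fake_fetch)
second <- read_cached("http://example.com/page", cache,
                      fetch = function(url) stop("should not be called"))

file.remove(cache)   # clean up the cache file
```

Keeping raw downloads on disk means the extraction code can be rerun and debugged without hitting the server again.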

Thanks

thanks()

## Alex Couture-Beil, Duncan Temple Lang, Duncan Temple Lang,
## Duncan Temple Lang, Duncan Temple Lang, Hadley Wickham,
## Hadley Wickham, Hadley Wickham, Hadley Wickham, Ian
## Bicking, Inc., Jeroen Ooms, Jeroen Ooms, John Harrison,
## Lloyd Hilaiel, Mark Greenaway, Oliver Keyes, R Foundation,
## RStudio, RStudio, RStudio, RStudio, RStudio, See AUTHORS
## file. igraph author details, Simon Potter, Simon Sapin,
## Simon Urbanek, the CRAN Team

. . . and the R Community and all the others.

