  • Introduction to Webscraping with R

    By Robert Vesco

    >> For Access to R Code, Please Open this Presentation in Dedicated PDF Application and Click on Pin <<

  • Outline

    1 Why Use R for Webscraping

    2 Why XML, XPATH Approach

    3 The Basics of Webscraping

    4 R Example

    5 RCurl

    6 Practical Advice

    7 References

  • Why Use R for Webscraping

    No external languages or scripts needed

    Makes workflow more efficient

    Easier to share with othe[R] colleagues

    Can accomplish most webscraping needs quickly and efficiently

  • Why XML, XPATH Approach

    Faster than using regular expressions

    More robust

    Nearly all languages now support XPATH approach

    HTML code in the wild is getting better all the time, which makes XPATH more reliable

    Can and should be used with regular expressions

  • HTML - The Code Behind the Web

    <li id="member_4403063">
    ...
    Robert V
    ...
    <li>
    Joined: April 12, 2010
    ...
    1) I'm a love[r] not a hate[r]
    2)

  • Using XPATH to Select Items

    We want to select just my name

    //*[@class='memName']

    or use a fuller path

    //div/div/div/ etc. ... /a[@class='memName']

    Both will pull out Robert V from this code:

    <a class="memName" ...>Robert V</a>

    BUT - more importantly, the above code will also pull out every name if you're looking at all the code on the page!
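
    A minimal sketch of these two queries in R, assuming the XML package is installed. The HTML fragment is made up for illustration: the second member, the href values and the nesting of div/ul/li are assumptions, not the actual Meetup markup. Only the memName class comes from the slides.

```r
library(XML)

# Hypothetical fragment: two members, each name inside <a class="memName">
html <- '<html><body><div>
  <ul>
    <li id="member_1"><a class="memName" href="#">Robert V</a></li>
    <li id="member_2"><a class="memName" href="#">Jane D</a></li>
  </ul>
</div></body></html>'

# Parse from a string; useInternalNodes = TRUE is needed for XPath queries
doc <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)

# Short form: any element whose class attribute is exactly "memName"
byClass <- xpathSApply(doc, "//*[@class='memName']", xmlValue)

# Fuller path: walk down from the div to the anchor
byPath <- xpathSApply(doc, "//div/ul/li/a[@class='memName']", xmlValue)

print(byClass)
print(byPath)
```

    Both queries return every matching name on the page, not just the first, which is what makes a single XPATH expression enough to collect a whole member list.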

  • XPATH Tutorials

    http://www.zvon.org/xxl/XPathTutorial/General/examples.html

    http://www.w3schools.com/xpath/


  • Important XML Functions in R

    htmlTreeParse() parses a file, cleans malformed HTML, and makes it available for querying

    1 web

  • Some Tools - Firebug + FireXPATH

    Allows you to select items on a webpage and inspect their underlying tags. Also allows you to query your XPATH to see what it will select!

  • Some Tools - Selector Gadget

    Allows you to select multiple elements on a screen

    Useful for very complicated layouts

  • General Steps

    Select and download the pages you want

    Query the document

    Select which items you want and save to dataframe

    Repeat
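
    These steps can be sketched as one small loop. To keep the example runnable offline, it assumes the pages have already been downloaded and stands in for them with two made-up HTML strings. The pages list, the scrapePage helper and the member names are hypothetical and only illustrate the shape of the workflow.

```r
library(XML)

# Step 1 (assumed done): pages already fetched; here two made-up HTML strings
pages <- list(
  '<ul><li><a class="memName">Ann A</a></li><li><a class="memName">Bob B</a></li></ul>',
  '<ul><li><a class="memName">Cy C</a></li></ul>'
)

# Steps 2 and 3: query one document and return the selected items as a data frame
scrapePage <- function(html, pageId) {
  doc <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)
  vNames <- xpathSApply(doc, "//*[@class='memName']", xmlValue)
  data.frame(Page = pageId, Name = vNames, stringsAsFactors = FALSE)
}

# Step 4: repeat for every page, then stack the per-page data frames
results <- lapply(seq_along(pages), function(i) scrapePage(pages[[i]], i))
members <- do.call(rbind, results)
print(members)
```

    Keeping the per-page work in one function makes it easier to test the queries on a single saved page before running the whole loop.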

  • Find Which Pages Have the Info You Need + Download

    #Seq Variables
    cStartSeq <- 0
    cEndSeq <- 100
    cStepSeq <- 20
    #link variables
    chrURLPrefix <- "http://www.meetup.com/RusersDC/members/?offset="
    chrURLSuffix <- "&desc=1&sort=chapter_member.atime"
    #Files will be read and saved to this path. Make sure it exists on your computer
    #or change path to wherever you save this script to!!
    chrSetDir <- "/R/R Meetup/"
    #setwd(chrSetDir) #commented out for sweave
    chrDir <- paste(chrSetDir, "RawData/", sep = "")
    #Check to see if folder exists, else create it.
    if (!("RawData" %in% dir(chrSetDir))) { dir.create(chrDir) }
    for (w in seq(cStartSeq, cEndSeq, cStepSeq))
    {
      #Create URL that will download page
      url <- paste(chrURLPrefix, w, chrURLSuffix, sep = "")
      #Create name for URL. Important because sometimes URL names have illegal characters
      #or lengths for file systems.
      urlName <- paste(chrDir, w, ".html", sep = "")
      #Without error catching script will crash. Websites frequently time out!
      err <- try(download.file(url, destfile = urlName, quiet = TRUE), silent = TRUE)
      if (inherits(err, "try-error"))
      {
        #you may be hitting the server too hard, so back off and try again later.
        Sys.sleep(5) #in seconds, adjust as necessary
        try(download.file(url, destfile = urlName, quiet = TRUE), silent = TRUE)
      }
    }

  • Process Pages and Extract Data

    require(XML)
    require(xtable)
    # put files in rawdata folder into vector and get length
    vFiles <- list.files(chrDir)
    iLenFilesList <- length(vFiles)
    #create list to store dataframes
    ls <- list()
    for (i in seq_len(iLenFilesList))
    {
      #each i will pull a different URL
      url <- vFiles[i]
      chrFileUrl <- paste(chrDir, url, sep = "")
      # this function works on dirty html, adding closing tags and such.
      web <- htmlTreeParse(chrFileUrl, error = function(...) {}, useInternalNodes = TRUE,
                           encoding = "UTF-8", trim = TRUE)
      #Use vectorized function to get names
      vNames <- xpathSApply(web, '//*[@class="memName"]', xmlValue)
      #Same as above, but use regex to clean up a bit
      vDates <- gsub("Joined:|\r\n", "",
                     xpathSApply(web, '//*/ul[@class="D_less memStats"]/li', xmlValue))
      #Since not every person has a quote, we break the problem into parts,
      #first getting chunks of code
      vQuote2 <- getNodeSet(web, "//*/div[@class='D_title']")
      #now we look for quotes - notice ".//" this means subquery -- IMPORTANT!
      vQuote3 <- sapply(vQuote2, function(x) xpathSApply(x, ".//p[@class='D_less']", xmlValue))
      # we get list() for nodes with no quotes. Replace list() with NA
      vQuote4 <- gsub('\r|\n|[ ]', "", sapply(vQuote3, function(x) ifelse(is.list(x), NA, x)))
      #add df to list. This is ok for small scrapes, but for larger ones, you need to
      #write to file or db
      ls[[i]] <- data.frame(Name = vNames, Date = vDates, Quote = vQuote4, stringsAsFactors = FALSE)
    }
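
    The reason for the ".//" subquery can be seen in a small offline sketch. The markup below is made up (two members, the second without a quote) and only borrows the class names from the code above; memberNodes and vQuotes are illustrative names. A page-wide query returns fewer quotes than there are members, so the values cannot be lined up, while a query run inside each member node keeps one slot per member.

```r
library(XML)

# Made-up markup: the second member has no quote paragraph
html <- '<div class="D_title"><a class="memName">Ann A</a><p class="D_less">Hello</p></div>
<div class="D_title"><a class="memName">Bob B</a></div>'
doc <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)

# A page-wide query finds one quote, which cannot be matched to two names
allQuotes <- xpathSApply(doc, "//p[@class='D_less']", xmlValue)

# Query inside each member node instead; ".//" restricts the search to that node
memberNodes <- getNodeSet(doc, "//div[@class='D_title']")
vQuotes <- sapply(memberNodes, function(node) {
  hit <- xpathSApply(node, ".//p[@class='D_less']", xmlValue)
  # No match comes back with length 0, so record a missing value
  if (length(hit) == 0) NA_character_ else hit[[1]]
})

print(length(allQuotes))
print(vQuotes)
```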

  • Combine Data from Each Loop

    #combine df
    df <- do.call(rbind, ls)
    #sample output for latex
    library(Hmisc)
    df2 <- df[1:3, ]
    latex(df2, file = '', col.just = c("l", "l", "p{2in}"))

    df2

       Name          Date            Quote
    1  JOEL ROBERTS  June 9, 2010    My name is Joel Roberts. I am a friend of Bryan Stroube who told me about the DC useR Group. I work with several systems that use XML for information exchange between dissimilar computer / software systems.
    2  Arun          May 1, 2010     NA
    3  Travis M      April 16, 2010  NA

  • RCurl Package

    More flexible: allows one to modify headers, referers, etc.

    Goes beyond http: https, ftp, sftp, scp, ldap, etc.

    Built on the C library libcurl, so it is fast, extensive, and actively developed

    Can use persistent connections and cookies, and can process requests as they come in rather than sequentially

  • Some RCurl Examples

    getURL() allows the use of https, whereas built-in R functions do not (recent R versions may be different; the internet2 method must have a valid cert)

    txt = getURL("https://www.twitter.com", ssl.verifyhost = 0, ssl.verifypeer = 0)

    getCurlHandle() creates a handle that allows persistent connections and settings to be reused across repeated calls to the same server. This is similar to passing the same list of arguments each time, but potentially with better network performance.

    curl = getCurlHandle(cookie = cookie, useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6")
    txt = getURL("http://www.meetup.com/RusersDC/members/", curl = curl, .opts = list(verbose = TRUE))
    txt2 = getURL("http://www.meetup.com/RusersDC/members/", curl = curl, .opts = list(verbose = TRUE))

  • Warning and Practical Advice

    Be nice! Go easy on the servers, especially if you're using RCurl

    Check to make sure the website does not prohibit scraping. See if they already have an API. Otherwise, send an email to the site owners.

    If a login is required, you need to use the RCurl package.

    Always use a proxy - if for nothing else, you don't want your home address to accidentally get blocked (e.g. Google's automated query blocking).
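
    One way to go easy on a server is to pause between requests. The sketch below is only an illustration: politeFetch, fetchFun and delaySeconds are made-up names, and the demonstration uses a stand-in fetcher so it runs offline. In a real scrape fetchFun could wrap download.file() or RCurl's getURL(), with a delay of several seconds.

```r
# Call fetchFun on each URL, pausing between requests so the server is not hammered.
politeFetch <- function(urls, fetchFun, delaySeconds = 2) {
  out <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    out[[i]] <- fetchFun(urls[i])
    # No need to wait after the last request
    if (i < length(urls)) Sys.sleep(delaySeconds)
  }
  out
}

# Offline demonstration with a stand-in fetcher and a short delay
fakeFetch <- function(u) paste("fetched", u)
started <- Sys.time()
res <- politeFetch(c("page1", "page2", "page3"), fakeFetch, delaySeconds = 0.2)
elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
print(unlist(res))
```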

  • Warning and Practical Advice

    Consider setting up your old computer or laptop to run scraping. This frees up your main computer. Set it to email you if it crashes.

    Get familiar with error catching and debugging

    Even if you try to make your code robust, large jobs will require several rounds of code tweaking because, ex ante, you don't know all the possible permutations. Your code will become more generalized, with more exception handling.

    XML packages have slightly different interpretations. Hence going from one language to the next may require slightly different queries
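
    As a starting point for error catching, a retry wrapper around tryCatch() might look like the sketch below. The names withRetry, maxTries and waitSeconds are illustrative, and the flaky function is a stand-in for a download that times out twice before succeeding.

```r
# Retry a failing operation a few times, waiting between attempts.
withRetry <- function(action, maxTries = 3, waitSeconds = 1) {
  for (attempt in seq_len(maxTries)) {
    result <- tryCatch(
      list(ok = TRUE, value = action()),
      error = function(e) list(ok = FALSE, value = conditionMessage(e))
    )
    if (result$ok) return(result$value)
    message("Attempt ", attempt, " failed: ", result$value)
    if (attempt < maxTries) Sys.sleep(waitSeconds)
  }
  NA  # give up on this item, but let the surrounding loop carry on
}

# Stand-in for a flaky download: fails twice, then succeeds
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("timeout")
  "page contents"
}
out <- withRetry(flaky, maxTries = 3, waitSeconds = 0.1)
print(out)
```

    Returning NA instead of stopping means one bad page does not end a long job; the failed items can be logged and revisited afterwards.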

  • References

    R code used in this presentation, click on pin

    Tutorials: http://www.zvon.org/xxl/XPathTutorial/General/examples.html

    http://www.w3schools.com/xpath/

    Firebug: http://getfirebug.com/

    https://addons.mozilla.org/en-US/firefox/addon/11900/

    SelectorGadget: http://www.selectorgadget.com/

    XML package docs: http://www.omegahat.org/RSXML/

    RCurl package: http://www.omegahat.org/RCurl/

    # Used R 2.9
    # Used Frank Harrell Sweavel.sty file for code in presentation
    # Written by Robert Vesco for R-meetup 2010-06-24
    # Purpose: An example to illustrate basic webscraping with R
    ###############################################################################

    ###############################
    # Create loop to download all the relevant pages.
    # We want to do this to minimize timeout errors and keep evidence
    ##################################

    #Seq Variables
    cStartSeq

