Post on 13-May-2015
Web Scraping for Code-ophobes
@AnnieCushing
What I’m not
What I am
THE WIND BENEATH MY WEB-SCRAPING WINGS
@djchrisle
@ethanlyon
3 WAYS TO SCRAPE IN GOOGLE DOCS
• ImportFeed
• ImportHTML
• ImportXML
=ImportFeed
ImportFeed
=ImportFeed(URL, query, headers, numItems)
http://bit.ly/importfeed
=ImportFeed("http://feeds.searchengineland.com/searchengineland")
OR
=ImportFeed(C4) (my preference)
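For readers curious about what a feed import does conceptually, here is a minimal Python sketch. It is not how Google Docs implements ImportFeed; the sample feed, the `import_feed` helper, and its `num_items` parameter are hypothetical and only loosely mirror the `numItems` argument above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hard-coded RSS sample so the sketch runs without network access.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
    <item><title>Third post</title><link>http://example.com/3</link></item>
  </channel>
</rss>"""


def import_feed(rss_text, num_items=None):
    """Return (title, link) rows for the feed's items, like rows in a sheet."""
    channel = ET.fromstring(rss_text).find("channel")
    rows = [(item.findtext("title"), item.findtext("link"))
            for item in channel.findall("item")]
    # Slicing with None keeps every row; a number keeps only the first N.
    return rows[:num_items]


feed_rows = import_feed(SAMPLE_RSS, num_items=2)
for row in feed_rows:
    print(row)
```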
http://slidesha.re/stalker-wil
STALKING FOR LINKS
BY @WILREYNOLDS
=ImportHTML
ImportHTML
TWO OPTIONS
• Table
• List
=ImportHtml(URL, query, index)
URL: "www.domain.com/whatever" OR cell reference
query: "table" or "list" OR cell reference
index: If multiple lists or tables, which one (3 = 3rd table)
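The three arguments can be pictured with a small Python sketch. This is an illustration under assumptions, not Google's implementation: the `import_html` helper and the sample page are hypothetical, the markup is assumed to be well formed, and only `<ul>` lists are handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment; real pages are often messier.
SAMPLE_PAGE = """<html><body>
  <table><tr><td>Nav</td></tr></table>
  <ul><li>Apples</li><li>Pears</li></ul>
  <table>
    <tr><th>Country</th><th>Capital</th></tr>
    <tr><td>France</td><td>Paris</td></tr>
    <tr><td>Japan</td><td>Tokyo</td></tr>
  </table>
</body></html>"""


def import_html(page_text, query, index):
    """Return the index-th (1-based) table or list as rows of cell text."""
    root = ET.fromstring(page_text)
    if query == "table":
        table = root.findall(".//table")[index - 1]
        return [[cell.text for cell in row] for row in table.findall("tr")]
    if query == "list":
        # Assumption: only <ul> lists are considered in this sketch.
        items = root.findall(".//ul")[index - 1]
        return [[li.text] for li in items.findall("li")]
    raise ValueError('query must be "table" or "list"')


table_rows = import_html(SAMPLE_PAGE, "table", 2)  # the 2nd table on the page
list_rows = import_html(SAMPLE_PAGE, "list", 1)    # the 1st list on the page
```

The index counts tables or lists in the order they appear in the page source, which is why the navigation table above has to be skipped with an index of 2.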
Table Example of ImportHTML
List Example of ImportHTML
=ImportXML
ImportXML
http://bit.ly/xpath-tutorial
=ImportXML(URL, query)
Simple Explanation of XPath
XPath uses path expressions to select nodes or node-sets in an XML document.
7 Types of Nodes
ELEMENTS
<div> <p> <blockquote> <price> <ul>
PARENT-CHILD NODES
• As you drill down, you separate nodes with /
• Ex: /html/div/ul/li/a
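Drilling down from parent to child can be tried outside a spreadsheet with a short Python sketch. The sample markup is hypothetical, and the path adds a `body` step because this sample has one. Python's built-in parser supports only a subset of XPath and takes paths relative to the root element, so the leading `/html` is implied rather than written.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document for illustration only.
SAMPLE = """<html><body><div><ul>
  <li><a href="/one">One</a></li>
  <li><a href="/two">Two</a></li>
</ul></div></body></html>"""

root = ET.fromstring(SAMPLE)  # root is the <html> element

# Each "/" moves one level down, from parent to child.
# Relative to <html>, this is the path body/div/ul/li/a.
links = root.findall("body/div/ul/li/a")
link_text = [a.text for a in links]
print(link_text)
```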
ATTRIBUTES
class, id, size
Look for the = sign
KEY CHARACTERS
/: Starts at the root
//: Starts wherever
@: Selects attributes
[]: Answers the question “Which one?”
[*]: All
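A few of these characters can be seen in action in the Python sketch below. The sample markup is hypothetical, and Python's built-in parser handles only a subset of XPath, so treat this as a rough illustration rather than a full reference.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document for illustration only.
SAMPLE = """<html><body>
  <div class="nav"><p>Menu</p></div>
  <div class="main"><p>Intro</p><p>Details</p></div>
</body></html>"""

root = ET.fromstring(SAMPLE)

# "//" starts wherever: every <p> at any depth.
all_paragraphs = [p.text for p in root.findall(".//p")]

# "[]" answers "which one?" and "@" names the attribute to test.
main_paragraphs = [p.text for p in root.findall(".//div[@class='main']/p")]

print(all_paragraphs)
print(main_paragraphs)
```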
Let’s Start Simple
Magic!
Grab the URLs
Because it’s an @tribute!
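To make the attribute idea concrete, here is a hedged Python sketch with hypothetical sample markup. In full XPath, which ImportXML accepts, a path ending in `/@href` returns the attribute value itself. Python's built-in parser cannot select attributes that way, so the sketch filters on `[@href]` and then reads the value with `.get()`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup for illustration only.
SAMPLE = """<ul>
  <li><a href="http://example.com/a">A</a></li>
  <li><a href="http://example.com/b">B</a></li>
  <li><a>No link here</a></li>
</ul>"""

root = ET.fromstring(SAMPLE)

# Selecting the element gives you its text ...
anchor_text = [a.text for a in root.findall(".//a")]

# ... while the URL lives in the href attribute.
# [@href] keeps only anchors that have the attribute; .get() reads its value.
urls = [a.get("href") for a in root.findall(".//a[@href]")]

print(anchor_text)
print(urls)
```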
Let’s dial it up
http://bit.ly/distilled-xml
What if your child nodes look like this?
Could do it this way
At your own risk
Better plan
The world according to Annie
// = blah blah yada yada
Can even be in the middle of the XPath
//div[@class='main']//blockquote[2]
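The same expression can be tested in Python against a hypothetical sample page. The sketch below assumes well-formed markup and adds a leading `.` because Python's built-in parser expects paths relative to the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page: the quotes we want sit a few levels below div.main.
SAMPLE = """<html><body>
  <div class="sidebar"><blockquote>side</blockquote></div>
  <div class="main">
    <section>
      <blockquote>first</blockquote>
      <blockquote>second</blockquote>
    </section>
  </div>
</body></html>"""

root = ET.fromstring(SAMPLE)

# The middle "//" skips over whatever sits between div.main and the
# blockquotes (here a <section>); [2] picks the second blockquote
# among its siblings.
matches = root.findall(".//div[@class='main']//blockquote[2]")
quote_text = [q.text for q in matches]
print(quote_text)
```

Skipping the intermediate levels means the expression keeps working if the page later wraps the quotes in another container.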
Other ways to tell “which one” in XPath
• STARTS-WITH
• CONTAINS
• INDEX VALUE
• LAST()
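The four selectors can be compared side by side in a Python sketch with hypothetical sample markup. One caveat: Python's built-in parser supports index values and `last()` but not the `starts-with()` or `contains()` functions, so those two are emulated here with ordinary string checks. In ImportXML the equivalent expressions would be `//li[starts-with(@class,'item-')]` and `//li[contains(@class,'banner')]`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup for illustration only.
SAMPLE = """<ul>
  <li class="item-first">Alpha</li>
  <li class="item-middle">Beta</li>
  <li class="promo-banner">Gamma</li>
  <li class="item-last">Delta</li>
</ul>"""

root = ET.fromstring(SAMPLE)  # root is the <ul> element
items = root.findall("li")

# Index value: XPath positions start at 1, so li[2] is the second item.
second = root.find("li[2]").text

# last(): the final <li> under its parent.
final = root.find("li[last()]").text

# starts-with(@class, 'item-'), emulated with str.startswith.
starts_with = [li.text for li in items
               if li.get("class", "").startswith("item-")]

# contains(@class, 'banner'), emulated with the "in" operator.
contains = [li.text for li in items if "banner" in li.get("class", "")]

print(second, final, starts_with, contains)
```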
Become a scraping FOOL
@NicoMiceli
• Pull queries from Topsy
• Pull product feeds
• Pull specific elements from a sitemap
• Scrape Twitter followers
• Pull GA metrics
• Scrape HTML tables (e.g., list of countries from Wikipedia)
• Scrape lists (e.g., scraped lists of consumer review sites to create a custom search engine, top sports blogs, etc.)
• Scrape rankings
• Scrape GA codes / Adsense IDs / IPs / IP Country Codes
• Find de-indexed sites
• Scrape directories
• Scrape Yahoo / Google for relevant pages from directory listings
• Scrape title / h1 / meta descriptions
• Scrape page URLs to find if someone is linking to you
• Scrape Google to find snippets of text on a list of domains (for link networks)
• Scrape Quora
SEE IMPORT FUNCTIONS IN THEIR NATURAL HABITAT!
http://bit.ly/annies-gdoc
AWWW YEAHHH!
TO PLAY …
1. Log in
2. File > Make a copy…
3. Poke around and test
RESOURCES
XPath Tutorial: http://bit.ly/xpath-tutorial
Annie’s Gdoc: http://bit.ly/annies-gdoc
Distilled Guide: http://bit.ly/distilled-guide
SEER Cookbook: http://bit.ly/seer-cookbook