Page 1

Register at Utah Geek Events
http://www.utahgeekevents.com/
Free event with lunch provided, dozens of speakers, giveaways, sponsors, and all things data!

Page 2

MIS6330 - Spring 2015

• Database Implementation
• David Paper
• Python
• Object-Oriented Programming
• PL/SQL on Oracle
• PyMongo
• MongoDB

Page 3

http://webextract.net/

Page 4

http://webextract.net/

Page 5

http://scrapy.org/

Page 6

Data Problem:

• 300,000 books published annually
• 129,864,880 books published
• About 10-15% still in print

Page 7

Module Goal

Crawled (200) <GET http://www.amazon.com/Tableau-Your-Data-Analysis-Software/dp/1118612043/ref=zg_bs_549646_25%0A>
Scraped from <200 http://www.amazon.com/Microsoft-Access-2010-Step/dp/0735626928/ref=zg_bs_549646_26%0A>
{'ASIN': [], 'ISBN_10': [u'0735626928'], 'ISBN_13': [u'978-0735626928'], 'Language': [u'English'], 'OutOf5stars': [u'4.0'], 'Pages': [u'<ul><li><b>Paperback:</b> 448'], 'Product_Dimensions': [], 'Publisher': [u'Microsoft Press; Pap/Psc edition (July 20, 2010)'], 'Shipping_Weight': [u'1.7 pounds'], 'Title': []}
Scraped from <200 http://www.amazon.com/Doing-Data-Science-Straight-Frontline-ebook/dp/B00FRSNHDC/ref=zg_bs_549646_27%0A>
{'ASIN': [u'B00FRSNHDC'], 'ISBN_10': [], 'ISBN_13': [], 'Language': [u'English'], 'OutOf5stars': [u'4.4'], 'Pages': [u'<li><b>Print Length:</b> 406'], 'Product_Dimensions': [], 'Publisher': [u"O'Reilly Media; 1 edition (October 10, 2013)"], 'Shipping_Weight': [], 'Title': []}
Scraped from <200 http://www.amazon.com/Tableau-Your-Data-Analysis-Software/dp/1118612043/ref=zg_bs_549646_25%0A>
{'ASIN': [], 'ISBN_10': [u'1118612043'], 'ISBN_13': [u'978-1118612040'], 'Language': [u'English'], 'OutOf5stars': [u'4.3'], 'Pages': [u'<ul><li><b>Paperback:</b> 528'], 'Product_Dimensions': [], 'Publisher': [u'Wiley; 1 edition (November 11, 2013)'], 'Shipping_Weight': [u'2.6 pounds'], 'Title': []}
Scraped from <200 http://www.amazon.com/Sams-Teach-Yourself-Minutes-Edition-ebook/dp/B009XDGF2C/ref=zg_bs_549646_22%0A>
{'ASIN': [u'B009XDGF2C'], 'ISBN_10': [], 'ISBN_13': [], 'Language': [u'English'], 'OutOf5stars': [u'4.5'], 'Pages': [u'<li><b>Print Length:</b> 288'], 'Product_Dimensions': [], 'Publisher': [u'Sams Publishing; 4 edition (October 25, 2012)'], 'Shipping_Weight': [], 'Title': []}

Page 8

Tools to gather HTML source code

• Numerous methods to get/view HTML code
  – Web Browser -> View Source
  – curl
  – wget
  – telnet <host> 80
• HTTP GET (see the Python sketch below)
  – Firebug or Inspector in Firefox or Chrome
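Not from the slides: a rough Python 2 equivalent of the telnet/HTTP GET approach, using the same sample host and page that appear in the later curl/wget examples (nothing more than an illustration):

# Sketch only: issue the HTTP GET yourself, roughly what typing into telnet <host> 80 does by hand.
import httplib

conn = httplib.HTTPConnection('mis6330.go.usu.edu', 80)
conn.request('GET', '/~jweeks/simple.html')     # same sample page used later in the deck
response = conn.getresponse()
print response.status, response.reason          # e.g. 200 OK
print response.read()                           # raw HTML source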

Page 9

Web Browser -> View Source

Page 10

Programs to grab HTML code

How do we write a program that will grab and parse multiple web pages nightly?
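A minimal sketch (not part of the slides) of what such a program could look like in plain Python 2, before bringing in Scrapy; the URL is the sample page used later in the deck:

# Hypothetical starting point: fetch a list of pages and pull one field out of each.
import re
import urllib2

urls = ['http://mis6330.go.usu.edu/~jweeks/simple.html']    # imagine hundreds of product pages here

for url in urls:
    html = urllib2.urlopen(url).read()                      # plain HTTP GET
    title = re.search('<title>(.*?)</title>', html)
    print url, title.group(1) if title else '(no title found)'

Scheduling something like this nightly and scaling it to many pages is exactly the gap Scrapy fills in the rest of the deck.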

Page 11

curl

jweeks@server2:~$ curl mis6330.go.usu.edu/~jweeks/simple.html
<html>
  <head>
    <title>This is a very simple web page</title>
  </head>
  <body>
    <h1>Simple h1 header</h1>
    <table>
      <tr>
        <td>Row1</td> <td>Column2</td> <td>Column3</td>
      </tr>
      <tr>
        <td>Row2</td> <td>Column2</td> <td>Column3</td>
      </tr>
    </table>
  </body>
</html>
jweeks@server2:~$

Page 12

wget

jweeks@server2:~$ wget mis6330.go.usu.edu/~jweeks/simple.html
--2014-03-10 11:17:26--  http://mis6330.go.usu.edu/~jweeks/simple.html
Resolving mis6330.go.usu.edu (mis6330.go.usu.edu)... 129.123.55.48
Connecting to mis6330.go.usu.edu (mis6330.go.usu.edu)|129.123.55.48|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 349 [text/html]
Saving to: `simple.html.1'

100%[======================================>] 349 --.-K/s in 0s

2014-03-10 11:17:26 (40.2 MB/s) - `simple.html.1' saved [349/349]

jweeks@server2:~$ more simple.html
<html>
  <head>
    <title>This is a very simple web page</title>
  </head>
  <body>
    <h1>Simple h1 header</h1>
    <table>
      <tr>
        <td>Row1</td> <td>Column2</td> <td>Column3</td>
      </tr>
      <tr>
        <td>Row2</td> <td>Column2</td> <td>Column3</td>
      </tr>
    </table>
  </body>
</html>
jweeks@server2:~$

Page 13

Firefox + Firebug

Firebug Add-On (being replaced by built-in developer tools)
https://www.getfirebug.com/

Page 14

Firefox -> Tools -> Web Developer -> Inspector

Page 15

Chrome -> Tools -> Developer Tools

Page 16

Problem: Extract Amazon Book Prices

How do we extract 100+ book prices, titles, and ISBNs daily?
What if you want to extract all product prices daily?

Amazon Best Sellers
http://www.amazon.com/gp/bestsellers/books/ref=sv_b_2

Amazon Best Sellers in Databases

Rank 1-20:
http://www.amazon.com/Best-Sellers-Books-Databases/zgbs/books/549646/ref=zg_bs_nav_b_2_5

Rank 21-40:
http://www.amazon.com/Best-Sellers-Books-Databases/zgbs/books/549646/ref=zg_bs_nav_b_2_5#2

Page 17

Problem: Extract Amazon Book Prices

Page 18

Amazon Product Advertising API

What if the Amazon API is not available or restricts access?

Query Restrictions: one query per second
• 60 seconds * 60 minutes * 24 hours = 86,400 queries per day
• Add 1 query per second for every $4,600 of item revenue per month

https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html
http://docs.aws.amazon.com/AWSECommerceService/latest/DG/TroubleshootingApplications.html
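A back-of-the-envelope sketch (not from the slides) of that query budget, assuming the limits quoted above; the revenue figure is purely hypothetical:

# Assumption: 1 query/sec baseline plus 1 query/sec per $4,600 of item revenue per month.
monthly_item_revenue = 23000                    # hypothetical figure, for illustration only
rate = 1 + monthly_item_revenue / 4600          # allowed queries per second
print rate * 60 * 60 * 24                       # daily budget; 86,400 at the baseline rate of 1/sec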

Page 19

Solution: Scrapy

• Web scraping
• Utility to pull data from websites
• Automated HTML gathering and parsing
• Developed in Python

Page 20

Scrapy Minimalist Example

A quick run-through, followed by detailed examples.

Steps to scrape data from websites:
1. Identify web page(s)
   http://mis6330.go.usu.edu/~jweeks/simple.html
2. Identify data you wish to scrape
3. Parse HTML to extract data
4. Report the extracted data

Page 21

Simple Example Web Page
http://mis6330.go.usu.edu/~jweeks/simple.html

Identify data you wish to scrape

Page 22

Scrapy Minimalist Example

SSH via MobaXterm to:
  Server:   server2.go.usu.edu
  Login:    hadoop
  Password: mis6110
  Port:     22 (default)

Page 23

Sample runs on Amazon.com

cd
cd scrapy/amazon/amazon/spiders/
scrapy crawl amazon -o `date +amazon_spider.%Y.%m.%d.json` -t json --nolog
head `date +amazon_spider.%Y.%m.%d.json`

cd ~/scrapy/amazonr/amazonr/spiders/
scrapy crawl amazonr -o `date +amazonr_spider.%Y.%m.%d.json` -t json --nolog
head -3 `date +amazonr_spider.%Y.%m.%d.json`

Page 24

{"Publisher": ["Sams Publishing; 4 edition (November 4, 2012)"], "Language": ["English"], "Title": ["\n ", "\n \n \n \n \n \n \n\t\t ", "\n\t \n "], "OutOf5stars": [], "Product_Dimensions": [], "Shipping_Weight": ["12.8 ounces (<a href=\"/gp/help/seller/shipping.html?ie=UTF8&amp;asin=0672336073&amp;seller=ATVPDKIKX0DER\">View shipping rates and policies</a>)"], "PageTitle": ["SQL in 10 Minutes, Sams Teach Yourself (4th Edition): Ben Forta: 0752063336076: Amazon.com: Books"], "ASIN": [], "ISBN_10": ["0672336073"], "Link": "http://www.amazon.com/Minutes-Sams-Teach-Yourself-Edition/dp/0672336073", "ISBN_13": ["978-0672336072"], "Pages": ["<li><b>Paperback:</b> 288"]},

Page 25

{"Publisher": ["Packt Publishing - ebooks Account (September 2015)"], "Language": ["English"], "Title": ["\n ", "\n \n \n \n \n ", "\n \n \n \n \n \n \n \n \n \n ", "\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n ", "\n \n \n \n \n "], "OutOf5stars": [], "Product_Dimensions": [], "Shipping_Weight": ["2.1 pounds (<a href=\"/gp/help/seller/shipping.html?ie=UTF8&amp;asin=1783555130&amp;seller=ATVPDKIKX0DER\">View shipping rates and policies</a>)"], "PageTitle": ["Python Machine Learning: Sebastian Raschka: 9781783555130: Amazon.com: Books"], "ASIN": [], "ISBN_10": ["1783555130"], "Link": "http://www.amazon.com/Python-Machine-Learning-Sebastian-Raschka/dp/1783555130", "ISBN_13": ["978-1783555130"], "Pages": ["<ul><li><b>Paperback:</b> 454"]},

Page 26

Scrapy Minimalist Example

Do not run these commands during the demo.
Replace "simple" with a project name of your selection.

cd scrapy
scrapy startproject simple
vi ~/scrapy/simple/simple/items.py
vi ~/scrapy/simple/simple/spiders/simple1_spiders.py
cd ~/scrapy/simple/simple/spiders
scrapy crawl simple1

Page 27

Scrapy Minimalist Example

cat ~/scrapy/simple/simple/items.py

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class SimpleItem(Item):
    # define the fields for your item here like:
    link = Field()
    h1 = Field()
    h2 = Field()
    row1col3 = Field()
    row2col3 = Field()

Page 28

cat ~/scrapy/simple/simple/spiders/simple1_spiders.py

from scrapy.spider import BaseSpider
from simple.items import SimpleItem
from scrapy.selector import Selector

class SimpleSpider(BaseSpider):
    name = "simple1"
    start_urls = [
        'http://mis6330.go.usu.edu/~jweeks/simple.html'
    ]

    def parse(self, response):
        sitem = SimpleItem()
        webpage = Selector(response)
        sitem['link'] = response.url
        sitem['h1'] = webpage.xpath('//h1/text()').extract()
        sitem['h2'] = webpage.re('<h2>(.*?)</h2>')
        sitem['row1col3'] = webpage.xpath('.//table/tr[1]/td[3]').extract()
        sitem['row2col3'] = webpage.xpath('//table/tr[2]').re('(<td>.*?3</td>)')
        print sitem
        return sitem

Page 29

jweeks@server2:~/scrapy/simple/simple$ scrapy crawl simple1
2014-03-31 09:09:08-0600 [scrapy] INFO: Scrapy 0.20.0 started (bot: simple)
…
{'h1': [u'Simple h1 header'], 'h2': [u'Hello World!'], 'link': 'http://mis6330.go.usu.edu/~jweeks/simple.html', 'row1col3': [u'<td>Column3</td>'], 'row2col3': [u'<td>Column3</td>']}
2014-03-31 09:09:08-0600 [simple1] DEBUG: Scraped from <200 http://mis6330.go.usu.edu/~jweeks/simple.html>
{'h1': [u'Simple h1 header'], 'h2': [u'Hello World!'], 'link': 'http://mis6330.go.usu.edu/~jweeks/simple.html', 'row1col3': [u'<td>Column3</td>'], 'row2col3': [u'<td>Column3</td>']}
2014-03-31 09:09:08-0600 [simple1] INFO: Closing spider (finished)
2014-03-31 09:09:08-0600 [simple1] INFO: Dumping Scrapy stats:
…
2014-03-31 09:09:08-0600 [simple1] INFO: Spider closed (finished)
jweeks@server2:~/scrapy/simple/simple$

Page 30

Scrapy Tutorial
http://doc.scrapy.org/en/latest/intro/tutorial.html

Steps to scrape data from websites:
1. Identify web page(s)
2. Identify data you wish to scrape
3. Parse HTML to extract data
4. Report the extracted data

Page 32

Scrapy Tutorial

2. Identify data you wish to scrape
   – Title
   – Price
   – ISBN-10
   – ISBN-13
   – Language
   – URL to page
   – 5 star rating
   – # of pages

Page 33

Scrapy Tutorial

3. Parse HTML to extract data
   – Title
   – Price

Page 34

Scrapy Tutorial

3. Parse HTML to extract data
   – ISBN-10
   – ISBN-13
   – Language
   – URL
   – 5 star rating
   – # of pages

Page 35

Scrapy Tutorial

4. Report the extracted data

Crawled (200) <GET http://www.amazon.com/Tableau-Your-Data-Analysis-Software/dp/1118612043/ref=zg_bs_549646_25%0A>
Scraped from <200 http://www.amazon.com/Microsoft-Access-2010-Step/dp/0735626928/ref=zg_bs_549646_26%0A>
{'ASIN': [], 'ISBN_10': [u'0735626928'], 'ISBN_13': [u'978-0735626928'], 'Language': [u'English'], 'OutOf5stars': [u'4.0'], 'Pages': [u'<ul><li><b>Paperback:</b> 448'], 'Product_Dimensions': [], 'Publisher': [u'Microsoft Press; Pap/Psc edition (July 20, 2010)'], 'Shipping_Weight': [u'1.7 pounds'], 'Title': []}
Scraped from <200 http://www.amazon.com/Doing-Data-Science-Straight-Frontline-ebook/dp/B00FRSNHDC/ref=zg_bs_549646_27%0A>
{'ASIN': [u'B00FRSNHDC'], 'ISBN_10': [], 'ISBN_13': [], 'Language': [u'English'], 'OutOf5stars': [u'4.4'], 'Pages': [u'<li><b>Print Length:</b> 406'], 'Product_Dimensions': [], 'Publisher': [u"O'Reilly Media; 1 edition (October 10, 2013)"], 'Shipping_Weight': [], 'Title': []}
Scraped from <200 http://www.amazon.com/Tableau-Your-Data-Analysis-Software/dp/1118612043/ref=zg_bs_549646_25%0A>
{'ASIN': [], 'ISBN_10': [u'1118612043'], 'ISBN_13': [u'978-1118612040'], 'Language': [u'English'], 'OutOf5stars': [u'4.3'], 'Pages': [u'<ul><li><b>Paperback:</b> 528'], 'Product_Dimensions': [], 'Publisher': [u'Wiley; 1 edition (November 11, 2013)'], 'Shipping_Weight': [u'2.6 pounds'], 'Title': []}
Scraped from <200 http://www.amazon.com/Sams-Teach-Yourself-Minutes-Edition-ebook/dp/B009XDGF2C/ref=zg_bs_549646_22%0A>
{'ASIN': [u'B009XDGF2C'], 'ISBN_10': [], 'ISBN_13': [], 'Language': [u'English'], 'OutOf5stars': [u'4.5'], 'Pages': [u'<li><b>Print Length:</b> 288'], 'Product_Dimensions': [], 'Publisher': [u'Sams Publishing; 4 edition (October 25, 2012)'], 'Shipping_Weight': [], 'Title': []}

Page 36

Scrapy Commands

• Create Scrapy project
  – scrapy startproject amazondb
• Crawl and parse the web site
  – scrapy crawl amazon1
  – scrapy crawl amazon1 -t json -o amazondb_outputA.json
• Interactive parse checker
  – scrapy shell "http://www.amazon.com"

Page 37

Scrapy Files

• Create Scrapy project
  – Start a putty (ssh) connection to the Linux server
  – mkdir scrapy
  – cd scrapy
  – scrapy startproject amazondb

Files created by startproject:

amazondb/scrapy.cfg                        Ignore: defines project name
amazondb/amazondb/items.py                 Defines data to be captured
amazondb/amazondb/settings.py              Ignore: defines project settings
amazondb/amazondb/__init__.py              Ignore: empty file
amazondb/amazondb/pipelines.py             Ignore: defines pipeline class
amazondb/amazondb/spiders/__init__.py      Ignore: empty file

Page 38

Scrapy Walk-Through

1. Define data fields to be captured: items.py
2. Define spider to parse and capture data
3. Examine output
4. Identify how to parse HTML source code and capture data fields
   1. Scrapy shell
   2. Regular Expression parsing
   3. XPath parsing with FirePath

Page 39

items.py

Edit amazondb/amazondb/items.py
For each data field "name" you wish to capture, define: <name> = Field()

Page 40

items.py

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class AmazondbItem(Item):
    # define the fields for your item here like:
    # name = Field()
    Link = Field()
    Title = Field()
    aTitle = Field()
    PageTitle = Field()
    Pages = Field()
    Publisher = Field()
    Language = Field()
    ASIN = Field()
    ISBN_10 = Field()
    ISBN_13 = Field()
    # Price will be part of your assignment

Page 41

Spider Configuration

Create a spider configuration file:
amazondb/amazondb/spiders/amazon1_spiders.py

• Define the name of the spider, "amazon1"
• Import appropriate libraries
• Import the AmazondbItem class
• Limit the scope of searches to only "amazon.com"
• Specify the URLs from which to start the search
• Parse/capture each data field's information from the HTML

Page 42

Scrapy Spider Configuration

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from amazondb.items import AmazondbItem

class AmazondbSpider(BaseSpider):
    name = "amazon1"
    allowed_domains = ['amazon.com']
    start_urls = [
        "http://www.amazon.com/Cookbook-Cookbooks-OReilly-Anthony-Molinaro/dp/0596009763/ref=zg_bs_549646_81",
        "http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82"
    ]

    def parse(self, response):
        aitem = AmazondbItem()
        webpage = Selector(response)
        aitem['Link'] = response.url
        aitem['PageTitle'] = webpage.xpath('//title/text()').extract()
        aitem['Title'] = webpage.xpath('.//*[@id="productTitle"]/text()').extract()
        aitem['aTitle'] = webpage.xpath('.//*[@id="btAsinTitle"]/text()').extract()
        webpagetable = webpage.xpath('//table[@id="productDetailsTable"]')
        aitem['Pages'] = webpagetable.re('(.*) pages')
        aitem['Publisher'] = webpagetable.re('<li><b>Publisher:</b> (.*)<\/li>')
        aitem['Language'] = webpagetable.re('<li><b>Language:</b> (.*)<\/li>')
        aitem['ASIN'] = webpagetable.re('<li><b>ASIN:</b> (.*)<\/li>')
        aitem['ISBN_10'] = webpagetable.re('<li><b>ISBN-10:</b> (.*)<\/li>')
        aitem['ISBN_13'] = webpagetable.re('<li><b>ISBN-13:</b> (.*)<\/li>')
        print '======'
        print '======'
        return aitem

Page 43

Scrapy Spider Configuration

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from amazondb.items import AmazondbItem

class AmazondbSpider(BaseSpider):
    name = "amazon1"
    allowed_domains = ['amazon.com']
    start_urls = [
        "http://www.amazon.com/Cookbook-Cookbooks-OReilly-Anthony-Molinaro/dp/0596009763/ref=zg_bs_549646_81",
        "http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82"
    ]

Page 44

Scrapy Spider Configuration

    def parse(self, response):
        aitem = AmazondbItem()
        webpage = Selector(response)
        aitem['Link'] = response.url
        aitem['PageTitle'] = webpage.xpath('//title/text()').extract()
        aitem['Title'] = webpage.xpath('.//*[@id="productTitle"]/text()').extract()
        aitem['aTitle'] = webpage.xpath('.//*[@id="btAsinTitle"]/text()').extract()
        webpagetable = webpage.xpath('//table[@id="productDetailsTable"]')
        aitem['Pages'] = webpagetable.re('(.*) pages')
        aitem['Publisher'] = webpagetable.re('<li><b>Publisher:</b> (.*)<\/li>')
        aitem['Language'] = webpagetable.re('<li><b>Language:</b> (.*)<\/li>')
        aitem['ASIN'] = webpagetable.re('<li><b>ASIN:</b> (.*)<\/li>')
        aitem['ISBN_10'] = webpagetable.re('<li><b>ISBN-10:</b> (.*)<\/li>')
        aitem['ISBN_13'] = webpagetable.re('<li><b>ISBN-13:</b> (.*)<\/li>')
        print '======'
        print '======'
        return aitem

Page 45

Scrapy Spider Configuration

Special Parse/Capture Commands

• URL of the scraped web page
  – response.url
• XPath commands to select specific HTML
  – xpath('//title/text()')
  – xpath('//table[@id="productDetailsTable"]')
  – xpath('.//*[@id="productTitle"]')
• Regular Expression parsing to capture "(.*)" fields
  – re('(.*) pages')
  – re('<li><b>Publisher:</b> (.*)<\/li>')
  – re('<li><b>ISBN-10:</b> (.*)<\/li>')

Additional details:
http://doc.scrapy.org/en/latest/topics/selectors.html#topics-selectors
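For reference (not from the slides), the same Publisher regex can be tried in plain Python; the one-line HTML string here is a made-up stand-in for Amazon's product details table:

# Sketch only: the slide's Publisher pattern applied with the standard re module.
import re

html = '<li><b>Publisher:</b> Microsoft Press (January 11, 2013)</li>'   # made-up sample line
match = re.search('<li><b>Publisher:</b> (.*)<\/li>', html)
if match:
    print match.group(1)    # -> Microsoft Press (January 11, 2013)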

Page 46

Amazon Crawl Run

scrapy crawl <spider_name> \
    -t <output_type> \
    -o <output_file_name>

scrapy crawl amazon2 -t json -o amazon2.json

Page 47

Scrapy Spider Output

jweeks@server2:~/scrapy/amazondb/amazondb/spiders$ scrapy crawl amazon1 -t json -o output1.json
…
============
2014-03-10 18:25:47-0600 [amazon1] DEBUG: Scraped from <200 http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82>
{'ASIN': [], 'ISBN_10': [u'0735668019'], 'ISBN_13': [u'978-0735668010'], 'Language': [u'English'], 'Link': 'http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82', 'PageTitle': [u'Microsoft Visual C# 2012 Step by Step (Step By Step (Microsoft)): John Sharp: 9780735668010: Amazon.com: Books'], 'Pages': [u'<li><b>Paperback:</b> 842'], 'Publisher': [u'Microsoft Press (January 11, 2013)'], 'Title': [u'Microsoft Visual C# 2012 Step by Step (Step By Step (Microsoft))'], 'aTitle': []}
2014-03-10 18:25:48-0600 [amazon1] DEBUG: Crawled (200) <GET http://www.amazon.com/Cookbook-Cookbooks-OReilly-Anthony-Molinaro/dp/0596009763/ref=zg_bs_549646_81> (referer: None)
…
jweeks@server2:~/scrapy/amazondb/amazondb/spiders$

Page 48

Scrapy Spider Output

jweeks@server2:~/scrapy/amazondb/amazondb/spiders$ more output1.json
[{"Publisher": ["Microsoft Press (January 11, 2013)"], "Language": ["English"], "Title": ["Microsoft Visual C# 2012 Step by Step (Step By Step (Microsoft))"], "PageTitle": ["Microsoft Visual C# 2012 Step by Step (Step By Step (Microsoft)): John Sharp: 9780735668010: Amazon.com: Books"], "ASIN": [], "ISBN_10": ["0735668019"], "Link": "http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82", "aTitle": [], "ISBN_13": ["978-0735668010"], "Pages": ["<li><b>Paperback:</b> 842"]},
 {"Publisher": ["O'Reilly Media; 1 edition (December 23, 2005)"], "Language": ["English"], "Title": [], "PageTitle": ["SQL Cookbook (Cookbooks (O'Reilly)): Anthony Molinaro: 9780596009762: Amazon.com: Books"], "ASIN": [], "ISBN_10": ["0596009763"], "Link": "http://www.amazon.com/Cookbook-Cookbooks-OReilly-Anthony-Molinaro/dp/0596009763/ref=zg_bs_549646_81", "aTitle": ["SQL Cookbook (Cookbooks (O'Reilly)) "], "ISBN_13": ["978-0596009762"], "Pages": ["<li><b>Paperback:</b> 636"]}]
jweeks@server2:~/scrapy/amazondb/amazondb/spiders$

JSON Pretty Print Validator http://jsonformatter.curiousconcept.com/
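A local alternative (not in the slides) to the online validator, using only the standard json module on the same output file:

# Sketch: pretty-print the crawl output locally instead of pasting it into a web validator.
import json

with open('output1.json') as f:
    print json.dumps(json.load(f), indent=4)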

Page 49

Scrapy Spider Output

JSON Pretty Print Validator
http://jsonformatter.curiousconcept.com/

Page 50

Scrapy Spider Verbose Output

jweeks@server2:~/scrapy/amazondb/amazondb/spiders$ scrapy crawl amazon1 -t json -o output1.json
2014-03-10 18:25:46-0600 [scrapy] INFO: Scrapy 0.20.0 started (bot: amazondb)
2014-03-10 18:25:46-0600 [scrapy] DEBUG: Optional features available: ssl, http11
2014-03-10 18:25:46-0600 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'amazondb.spiders', 'FEED_FORMAT': 'json', 'SPIDER_MODULES': ['amazondb.spiders'], 'FEED_URI': 'output1.json', 'BOT_NAME': 'amazondb'}
2014-03-10 18:25:46-0600 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-03-10 18:25:46-0600 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-03-10 18:25:46-0600 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-03-10 18:25:46-0600 [scrapy] DEBUG: Enabled item pipelines:
2014-03-10 18:25:46-0600 [amazon1] INFO: Spider opened
2014-03-10 18:25:46-0600 [amazon1] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-03-10 18:25:46-0600 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-03-10 18:25:46-0600 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-03-10 18:25:47-0600 [amazon1] DEBUG: Crawled (200) <GET http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82> (referer: None)
…..
2014-03-10 18:25:48-0600 [amazon1] INFO: Closing spider (finished)
2014-03-10 18:25:48-0600 [amazon1] INFO: Stored json feed (2 items) in: output1.json
2014-03-10 18:25:48-0600 [amazon1] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 580, 'downloader/request_count': 2, 'downloader/request_method_count/GET': 2, 'downloader/response_bytes': 183580, 'downloader/response_count': 2, 'downloader/response_status_count/200': 2, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2014, 3, 11, 0, 25, 48, 141281), 'item_scraped_count': 2, 'log_count/DEBUG': 10, 'log_count/INFO': 4, 'response_received_count': 2, 'scheduler/dequeued': 2, 'scheduler/dequeued/memory': 2, 'scheduler/enqueued': 2, 'scheduler/enqueued/memory': 2, 'start_time': datetime.datetime(2014, 3, 11, 0, 25, 46, 743234)}
2014-03-10 18:25:48-0600 [amazon1] INFO: Spider closed (finished)
jweeks@server2:~/scrapy/amazondb/amazondb/spiders$

Page 51

Scrapy Shell

• Use scrapy shell to test individual field name parsing commands
• Connect to a specified sample URL
• Use correct "double quotes" to reference the URL
• scrapy shell "http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82"

Page 52

Scrapy Shell

scrapy shell "http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82"

• response.url
• XPath commands:
  – sel.xpath('//title/text()')
  – sel.xpath('//title/text()').extract()
  – sel.xpath('//table[@id="productDetailsTable"]')
  – sel.xpath('.//*[@id="productTitle"]').extract()
• Regular Expression parsing to capture "(.*)" fields
  – sel.re('<li><b>ISBN-10:</b> (.*)<\/li>')
  – sel.re('<li><b>Publisher:</b> (.*)<\/li>')

Page 53

Scrapy Shell

• scrapy shell "http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82"

• Print all href for links on the page
  – sel.xpath('//a/@href').extract()

• Print href for all images
  – sel.xpath('//a[contains(@href, "image")]/@href').extract()

• Pretty Print
  – import pprint
  – pp = pprint.PrettyPrinter(indent=4)
  – pp.pprint(sel.xpath('//a/@href').extract())
  – pp.pprint(sel.xpath('//a[contains(@href, "image")]/@href').extract())
  – pp.pprint(sel.xpath('//title/text()'))

Page 54

Scrapy Shell

Extract the Title of the Web Page

>>> sel.xpath('//title')
[<Selector xpath='//title' data=u'<title>Microsoft Visual C# 2012 Step by '>]

>>> sel.xpath('//title').extract()
[u'<title>Microsoft Visual C# 2012 Step by Step (Step By Step (Microsoft)): John Sharp: 9780735668010: Amazon.com: Books</title>']

>>> sel.xpath('//title/text()')
[<Selector xpath='//title/text()' data=u'Microsoft Visual C# 2012 Step by Step (S'>]

>>> sel.xpath('//title/text()').extract()
[u'Microsoft Visual C# 2012 Step by Step (Step By Step (Microsoft)): John Sharp: 9780735668010: Amazon.com: Books']

Page 55

Scrapy Shell

Split parsing into two commands. Why would you do this? (See the sketch below.)

>>> sel.xpath('//ul/li/a/text()').extract()
[u'Your Amazon.com', u"Today's Deals", u'Gift Cards', u'Sell', u'Help', u'Books', u'Advanced Search', u'New Releases', u'Best Sellers', u'The\xa0New\xa0York Times\xae\xa0Best\xa0Sellers', u"Children's Books", u'Textbooks', u'Textbook Rentals', u'Sell\xa0Us Your\xa0Books', u'Best\xa0Books of\xa0the\xa0Month', u'Deals in\xa0Books', u'View shipping rates and policies', u'See Top 100 in Books', u'Careers', u'Investor Relations', u'Press Releases', u'Amazon and Our Planet', u'Amazon in the Community', u'Sell on Amazon', u'Become an Affiliate', u'Advertise Your Products', u'Independently Publish with Us', u'See all', u'Amazon.com Rewards Visa Card', u'Amazon.com Store Card', u'Shop with Points', u'Credit Card Marketplace', u'Amazon Currency Converter', u'Your Account', u'Shipping Rates & Policies', u'Amazon Prime', u'Returns & Replacements', u'Manage Your Kindle', u'Help', u'Australia', u'Brazil', u'Canada', u'China', u'France', u'Germany', u'India', u'Italy', u'Japan', u'Mexico', u'Spain', u'United Kingdom', u'Conditions of Use', u'Privacy Notice', u'Interest-Based Ads']

>>> sample = sel.xpath('//ul/li')
>>> sample.xpath('a/text()').extract()
[u'Your Amazon.com', u"Today's Deals", u'Gift Cards', u'Sell', u'Help', u'Books', u'Advanced Search', u'New Releases', u'Best Sellers', u'The\xa0New\xa0York Times\xae\xa0Best\xa0Sellers', u"Children's Books", u'Textbooks', u'Textbook Rentals', u'Sell\xa0Us Your\xa0Books', u'Best\xa0Books of\xa0the\xa0Month', u'Deals in\xa0Books', u'View shipping rates and policies', u'See Top 100 in Books', u'Careers', u'Investor Relations', u'Press Releases', u'Amazon and Our Planet', u'Amazon in the Community', u'Sell on Amazon', u'Become an Affiliate', u'Advertise Your Products', u'Independently Publish with Us', u'See all', u'Amazon.com Rewards Visa Card', u'Amazon.com Store Card', u'Shop with Points', u'Credit Card Marketplace', u'Amazon Currency Converter', u'Your Account', u'Shipping Rates & Policies', u'Amazon Prime', u'Returns & Replacements', u'Manage Your Kindle', u'Help', u'Australia', u'Brazil', u'Canada', u'China', u'France', u'Germany', u'India', u'Italy', u'Japan', u'Mexico', u'Spain', u'United Kingdom', u'Conditions of Use', u'Privacy Notice', u'Interest-Based Ads']
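One answer, sketched here rather than taken from the slides: a scoped selector can be stored and reused for several fields, exactly like the product-details table used elsewhere in the deck. Inside the same scrapy shell session:

# Sketch: scope once to the details table, then run several queries against that scoped selector.
table = sel.xpath('//table[@id="productDetailsTable"]')      # one xpath lookup
isbn10 = table.re('<li><b>ISBN-10:</b> (.*)<\/li>')          # reuse the scoped selector
isbn13 = table.re('<li><b>ISBN-13:</b> (.*)<\/li>')
pages  = table.re('(.*) pages')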

Page 56

Scrapy Shell

• Selector Tutorial
  – http://doc.scrapy.org/en/latest/topics/selectors.html
• Regular Expression How-To
  – http://docs.python.org/2/howto/regex.html
  – http://docs.python.org/2/library/re.html
• XPath
  – Easiest to determine using Firefox with the Firebug and FirePath add-ons

Page 57

Firefox + Firebug + FirePath

FirePath XPath extension to Firebug
https://addons.mozilla.org/en-US/firefox/addon/firepath/

Firebug Add-On
https://www.getfirebug.com/

FirePath displays the XPath to an element, such as: html/body/table/tr[2]/td[3]

Page 58

Regular Expression Parsing

^              matches start of the string
$              matches end of the string
[5b-d]         matches any of the chars '5', 'b', 'c' or 'd'
[^a-c6]        matches any char except 'a', 'b', 'c' or '6'
\              escapes special characters
\w             Alphanumeric: [0-9a-zA-Z_], or is LOCALE dependent
\W             Non-alphanumeric
\d             Digit
\D             Non-digit
\s             Whitespace: [ \t\n\r\f\v] (space, tab, newline, …)
\S             Non-whitespace
.              matches any character
*              0 or more occurrences
+              1 or more occurrences
?              0 or 1 occurrences
{m}            exactly 'm' occurrences
R|S            matches either regex R or regex S
()             Creates a capture group, and indicates precedence
(?P<name>...)  Creates a named capturing group
(?P=name)      Matches whatever the previously named group matched
(?(id)yes|no)  Match 'yes' if group 'id' matched, else 'no'

Complete cheat sheet with additional details:
http://cloud.github.com/downloads/tartley/python-regex-cheatsheet/cheatsheet.pdf

Page 59

Regular Expression Parsing

Online real-time regular expression tester: https://pythex.org/

Page 60

Regular Expression Parsing

Test Code:

<html>
  <head>
    <title>This is a very simple web page</title>
  </head>
  <body>
    <h1>Simple h1 header</h1>
    <table>
      <tr>
        <td>Row1</td> <td>Column2</td> <td>Column3</td>
      </tr>
      <tr>
        <td>Row2</td> <td>Column2</td> <td>Column3</td>
      </tr>
    </table>
  </body>
</html>

Regular Expressions:

<title>.*</title>
<title>(.*)</title>
\w
\d
\w(\d)
[\s<>]\w{2}[\s<>]
[\s<>](\w{4})[\s<>]
[\s<>](?P<first>\w{4})[\s<>](?P<second>\w{6})[\s<>]
[\s<](?P<first>\w+)[\s>](?P<middle>.+)[\s>]</(?P=first)>
<(?P<first>\w+)>(?P<middle>.+)</(?P=first)>
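Not from the slides: the last pattern above can also be checked in plain Python instead of pythex, here against the <title> line of the test code:

# Sketch: try the named-group / backreference pattern locally with the re module.
import re

html = '<title>This is a very simple web page</title>'
m = re.search('<(?P<first>\w+)>(?P<middle>.+)</(?P=first)>', html)
print m.group('first')     # -> title
print m.group('middle')    # -> This is a very simple web page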

Page 61

Scrapy Shell XPath and RE Example

XPath: //title/text()
RE: extract the elements before a ":"

>>> sel.xpath('//title/text()').re('(\w+):')
[u'Sharp', u'9780735668010', u'com']

>>> sel.xpath('//title/text()').re('(.+):')
[u'Microsoft Visual C# 2012 Step by Step (Step By Step (Microsoft)): John Sharp: 9780735668010: Amazon.com']

>>> sel.xpath('//title/text()').re('(\S+):')
[u'(Microsoft))', u'Sharp', u'9780735668010', u'Amazon.com']

Page 62

Amazon Crawl

Modify the amazon1 spider from scraping two specified web pages to crawling a series of web pages showing the 100 top-ranked database books (ranks 1-20, 21-40, 41-60, 61-80 and 81-100) and scraping the associated 100 books.

Page 63

Amazon Crawl

from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from amazondb.items import AmazondbItem

class AmazondbSpider(CrawlSpider):
    name = "amazon2"
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com/Best-Sellers-Books-Databases/zgbs/books/549646/ref=zg_bs_nav_b_2_5#2']

    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//ol[@class="zg_pagination"]'))),
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="zg_title"]')), callback='parse_item')
    )

    def parse_item(self, response):
        aitem = AmazondbItem()
        webpage = Selector(response)
        aitem['Link'] = response.url
        aitem['Title'] = webpage.xpath('.//*[@id="productTitle"]/text()').extract()
        aitem['aTitle'] = webpage.xpath('.//*[@id="btAsinTitle"]/text()').extract()
        aitem['PageTitle'] = webpage.xpath('//title/text()').extract()
        webpagetable = webpage.xpath('//table[@id="productDetailsTable"]')
        aitem['Pages'] = webpagetable.re('(.*) pages')
        aitem['Publisher'] = webpagetable.re('<li><b>Publisher:</b> (.*)<\/li>')
        aitem['Language'] = webpagetable.re('<li><b>Language:</b> (.*)<\/li>')
        aitem['ASIN'] = webpagetable.re('<li><b>ASIN:</b> (.*)<\/li>')
        aitem['ISBN_10'] = webpagetable.re('<li><b>ISBN-10:</b> (.*)<\/li>')
        aitem['ISBN_13'] = webpagetable.re('<li><b>ISBN-13:</b> (.*)<\/li>')
        return aitem

Page 64

Amazon Crawl

jweeks@server2:~/scrapy/amazondb/amazondb/spiders$ diff amazon1_spider.py amazon2_spider.py

< from scrapy.spider import BaseSpider
> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
> from scrapy.contrib.spiders import CrawlSpider, Rule

< class AmazondbSpider(BaseSpider):
> class AmazondbSpider(CrawlSpider):

< name = "amazon1"
> name = "amazon2"

< start_urls = [
<     "http://www.amazon.com/Cookbook-Cookbooks-OReilly-Anthony-Molinaro/dp/0596009763/ref=zg_bs_549646_81",
<     "http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82" ]
> start_urls = ["http://www.amazon.com/Best-Sellers-Books-Databases/zgbs/books/549646/ref=zg_bs_nav_b_2_5#2"]

> rules = (
>     Rule(SgmlLinkExtractor(restrict_xpaths=('//ol[@class="zg_pagination"]'))),
>     Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="zg_title"]')), callback='parse_item')
> )

< def parse(self, response):
> def parse_item(self, response):

Page 65

Amazon Crawl Changes

• Import additional libraries
• BaseSpider -> CrawlSpider
• Change the name of the spider
• Change the start_urls
• Add "rules" to either crawl or parse
• parse -> parse_item
• Add additional fields, if desired
• Update items.py to include additional fields

Page 66

Amazon Crawl Run

scrapy crawl <spider_name> \
    -t <output_type> \
    -o <output_file_name>

scrapy crawl amazon2 -t json -o amazon2.json

