
1

Register at Utah Geek Events (http://www.utahgeekevents.com/)
Free event with lunch provided, dozens of speakers, giveaways, sponsors, and all things data!

MIS6330 - Spring 2015

• Database Implementation
• David Paper
• Python
• Object-Oriented Programming
• PL/SQL on Oracle
• PyMongo
• MongoDB

3

http://webextract.net/

4

http://webextract.net/

5

http://scrapy.org/

6

Data Problem:

300,000 books published annually
129,864,880 books published
About 10-15% still in print

7

Module Goal

Crawled (200) <GET http://www.amazon.com/Tableau-Your-Data-Analysis-Software/dp/1118612043/ref=zg_bs_549646_25%0A>
Scraped from <200 http://www.amazon.com/Microsoft-Access-2010-Step/dp/0735626928/ref=zg_bs_549646_26%0A>
{'ASIN': [], 'ISBN_10': [u'0735626928'], 'ISBN_13': [u'978-0735626928'], 'Language': [u'English'], 'OutOf5stars': [u'4.0'], 'Pages': [u'<ul><li><b>Paperback:</b> 448'], 'Product_Dimensions': [], 'Publisher': [u'Microsoft Press; Pap/Psc edition (July 20, 2010)'], 'Shipping_Weight': [u'1.7 pounds'], 'Title': []}
Scraped from <200 http://www.amazon.com/Doing-Data-Science-Straight-Frontline-ebook/dp/B00FRSNHDC/ref=zg_bs_549646_27%0A>
{'ASIN': [u'B00FRSNHDC'], 'ISBN_10': [], 'ISBN_13': [], 'Language': [u'English'], 'OutOf5stars': [u'4.4'], 'Pages': [u'<li><b>Print Length:</b> 406'], 'Product_Dimensions': [], 'Publisher': [u"O'Reilly Media; 1 edition (October 10, 2013)"], 'Shipping_Weight': [], 'Title': []}
Scraped from <200 http://www.amazon.com/Tableau-Your-Data-Analysis-Software/dp/1118612043/ref=zg_bs_549646_25%0A>
{'ASIN': [], 'ISBN_10': [u'1118612043'], 'ISBN_13': [u'978-1118612040'], 'Language': [u'English'], 'OutOf5stars': [u'4.3'], 'Pages': [u'<ul><li><b>Paperback:</b> 528'], 'Product_Dimensions': [], 'Publisher': [u'Wiley; 1 edition (November 11, 2013)'], 'Shipping_Weight': [u'2.6 pounds'], 'Title': []}
Scraped from <200 http://www.amazon.com/Sams-Teach-Yourself-Minutes-Edition-ebook/dp/B009XDGF2C/ref=zg_bs_549646_22%0A>
{'ASIN': [u'B009XDGF2C'], 'ISBN_10': [], 'ISBN_13': [], 'Language': [u'English'], 'OutOf5stars': [u'4.5'], 'Pages': [u'<li><b>Print Length:</b> 288'], 'Product_Dimensions': [], 'Publisher': [u'Sams Publishing; 4 edition (October 25, 2012)'], 'Shipping_Weight': [], 'Title': []}

8

Tools to gather HTML source code
• Numerous methods to get/view HTML code
  – Web Browser -> View Source
  – curl
  – wget
  – telnet <host> 80 (see the raw request sketch after this list)
• HTTP GET
  – Firebug or Inspector in Firefox or Chrome
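For the telnet approach, a minimal sketch of the raw HTTP GET you would type by hand (assuming the host serves plain HTTP on port 80):

telnet mis6330.go.usu.edu 80
GET /~jweeks/simple.html HTTP/1.1
Host: mis6330.go.usu.edu

The blank line after the Host header ends the request; the server replies with a status line, response headers, and the HTML body.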

9

Web Browser -> View Source

10

Programs to grab HTML code

How do we write a program that will grab and parse multiple web pages nightly?
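One minimal sketch, using Python 2's standard urllib2 (matching the Python 2 code later in this deck); the URL is the course example page, and nightly scheduling would be handled outside the script, e.g. by cron:

# fetch_pages.py -- fetch several pages and print their HTML
import urllib2

urls = ['http://mis6330.go.usu.edu/~jweeks/simple.html']

for url in urls:
    html = urllib2.urlopen(url).read()  # issues an HTTP GET
    print html                          # parsing would happen here

A crontab entry such as "0 2 * * * python fetch_pages.py" would run it at 2:00 AM each night. The rest of the module builds up Scrapy as a more robust answer to the same question.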

11

curl

jweeks@server2:~$ curl mis6330.go.usu.edu/~jweeks/simple.html
<html>
  <head>
    <title>This is a very simple web page</title>
  </head>
  <body>
    <h1>Simple h1 header</h1>
    <table>
      <tr> <td>Row1</td> <td>Column2</td> <td>Column3</td> </tr>
      <tr> <td>Row2</td> <td>Column2</td> <td>Column3</td> </tr>
    </table>
  </body>
</html>
jweeks@server2:~$

12

wget

jweeks@server2:~$ wget mis6330.go.usu.edu/~jweeks/simple.html
--2014-03-10 11:17:26--  http://mis6330.go.usu.edu/~jweeks/simple.html
Resolving mis6330.go.usu.edu (mis6330.go.usu.edu)... 129.123.55.48
Connecting to mis6330.go.usu.edu (mis6330.go.usu.edu)|129.123.55.48|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 349 [text/html]
Saving to: `simple.html.1'

100%[======================================>] 349 --.-K/s in 0s

2014-03-10 11:17:26 (40.2 MB/s) - `simple.html.1' saved [349/349]

jweeks@server2:~$ more simple.html
<html>
  <head>
    <title>This is a very simple web page</title>
  </head>
  <body>
    <h1>Simple h1 header</h1>
    <table>
      <tr> <td>Row1</td> <td>Column2</td> <td>Column3</td> </tr>
      <tr> <td>Row2</td> <td>Column2</td> <td>Column3</td> </tr>
    </table>
  </body>
</html>
jweeks@server2:~$

13

Firefox + Firebug
Firebug Add-On being replaced by built-in developer tools
https://www.getfirebug.com/

14

Firefox -> Tools -> Web Developer -> Inspector

15

Chrome -> Tools -> Developer Tools

16

Problem: Extract Amazon Book Prices
How do we extract 100+ book prices, titles, and ISBNs daily?
What if you want to extract all product prices daily?

Amazon Best Sellers
http://www.amazon.com/gp/bestsellers/books/ref=sv_b_2

Amazon Best Sellers in Databases
Rank 1-20:

http://www.amazon.com/Best-Sellers-Books-Databases/zgbs/books/549646/ref=zg_bs_nav_b_2_5

Rank 21-40:
http://www.amazon.com/Best-Sellers-Books-Databases/zgbs/books/549646/ref=zg_bs_nav_b_2_5#2

17

Problem: Extract Amazon Book Prices

18

Amazon Product Advertising API
What if the Amazon API is not available or restricts access?

Query Restrictions: One query per second
• 60 seconds * 60 minutes * 24 hours = 86,400 queries per day
• Add 1 query per second for every $4,600 of item revenue per month (see the sketch below)

https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html
http://docs.aws.amazon.com/AWSECommerceService/latest/DG/TroubleshootingApplications.html
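The rate-limit arithmetic is easy to check in Python; the monthly revenue figure below is a made-up example, not a real account:

# queries/day allowed under the Product Advertising API limits on the slide
base_qps = 1                              # everyone gets 1 query per second
seconds_per_day = 60 * 60 * 24            # 86,400
monthly_revenue = 46000.0                 # hypothetical: $46,000 item revenue per month
bonus_qps = int(monthly_revenue / 4600)   # +1 query/sec per $4,600 of revenue
print (base_qps + bonus_qps) * seconds_per_day   # 950,400 queries/day in this example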

19

Solution: Scrapy

• Web scraping
• Utility to pull data from websites
• Automated HTML gathering and parsing
• Developed in Python

20

Scrapy Minimalist Example

A quick run-through, followed by detailed examples

Steps to scrape data from websites
1. Identify web page(s)

http://mis6330.go.usu.edu/~jweeks/simple.html

2. Identify data you wish to scrape
3. Parse HTML to extract data
4. Report the extracted data

21

Simple Example Web Page
http://mis6330.go.usu.edu/~jweeks/simple.html

Identify data you wish to scrape

22

Scrapy Minimalist Example
SSH via MobaXterm to: server2.go.usu.edu
Login: hadoop
Password: mis6110
Port: 22 (default)

23

Sample runs on Amazon.com

cd
cd scrapy/amazon/amazon/spiders/
scrapy crawl amazon -o `date +amazon_spider.%Y.%m.%d.json` -t json --nolog
head `date +amazon_spider.%Y.%m.%d.json`

cd ~/scrapy/amazonr/amazonr/spiders/
scrapy crawl amazonr -o `date +amazonr_spider.%Y.%m.%d.json` -t json --nolog
head -3 `date +amazonr_spider.%Y.%m.%d.json`
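The backticks ask the shell to substitute the output of date(1), so each run writes to a file stamped with the current date. For example, on 2014-03-10:

$ date +amazon_spider.%Y.%m.%d.json
amazon_spider.2014.03.10.json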

24

{"Publisher": ["Sams Publishing; 4 edition (November 4, 2012)"], "Language": ["English"], "Title": ["\n ", "\n \n \n \n \n \n \n\t\t ", "\n\t \n "], "OutOf5stars": [], "Product_Dimensions": [], "Shipping_Weight": ["12.8 ounces (<a href=\"/gp/help/seller/shipping.html?ie=UTF8&amp;asin=0672336073&amp;seller=ATVPDKIKX0DER\">View shipping rates and policies</a>)"], "PageTitle": ["SQL in 10 Minutes, Sams Teach Yourself (4th Edition): Ben Forta: 0752063336076: Amazon.com: Books"], "ASIN": [], "ISBN_10": ["0672336073"], "Link": "http://www.amazon.com/Minutes-Sams-Teach-Yourself-Edition/dp/0672336073", "ISBN_13": ["978-0672336072"], "Pages": ["<li><b>Paperback:</b> 288"]},

25

{"Publisher": ["Packt Publishing - ebooks Account (September 2015)"], "Language": ["English"], "Title": ["\n ", "\n \n \n \n \n ", "\n \n \n \n \n \n \n \n \n \n ", "\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n ", "\n \n \n \n \n "], "OutOf5stars": [], "Product_Dimensions": [], "Shipping_Weight": ["2.1 pounds (<a href=\"/gp/help/seller/shipping.html?ie=UTF8&amp;asin=1783555130&amp;seller=ATVPDKIKX0DER\">View shipping rates and policies</a>)"], "PageTitle": ["Python Machine Learning: Sebastian Raschka: 9781783555130: Amazon.com: Books"], "ASIN": [], "ISBN_10": ["1783555130"], "Link": "http://www.amazon.com/Python-Machine-Learning-Sebastian-Raschka/dp/1783555130", "ISBN_13": ["978-1783555130"], "Pages": ["<ul><li><b>Paperback:</b> 454"]},

26

Scrapy Minimalist Example
Do not run these commands during the demo.
Replace "simple" with a project name of your selection.

cd scrapy
scrapy startproject simple
vi ~/scrapy/simple/simple/items.py
vi ~/scrapy/simple/simple/spiders/simple1_spiders.py
cd ~/scrapy/simple/simple/spiders
scrapy crawl simple1

27

Scrapy Minimalist Example

cat ~/scrapy/simple/simple/items.py
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class SimpleItem(Item):
    # define the fields for your item here like:
    link = Field()
    h1 = Field()
    h2 = Field()
    row1col3 = Field()
    row2col3 = Field()

28

cat ~/scrapy/simple/simple/spiders/simple1_spiders.py
from scrapy.spider import BaseSpider
from simple.items import SimpleItem
from scrapy.selector import Selector

class SimpleSpider(BaseSpider):
    name = "simple1"
    start_urls = ['http://mis6330.go.usu.edu/~jweeks/simple.html']

    def parse(self, response):
        sitem = SimpleItem()
        webpage = Selector(response)
        sitem['link'] = response.url
        sitem['h1'] = webpage.xpath('//h1/text()').extract()
        sitem['h2'] = webpage.re('<h2>(.*?)</h2>')
        sitem['row1col3'] = webpage.xpath('.//table/tr[1]/td[3]').extract()
        sitem['row2col3'] = webpage.xpath('//table/tr[2]').re('(<td>.*?3</td>)')
        print sitem
        return sitem

29

jweeks@server2:~/scrapy/simple/simple$ scrapy crawl simple1
2014-03-31 09:09:08-0600 [scrapy] INFO: Scrapy 0.20.0 started (bot: simple)
…
{'h1': [u'Simple h1 header'],
 'h2': [u'Hello World!'],
 'link': 'http://mis6330.go.usu.edu/~jweeks/simple.html',
 'row1col3': [u'<td>Column3</td>'],
 'row2col3': [u'<td>Column3</td>']}
2014-03-31 09:09:08-0600 [simple1] DEBUG: Scraped from <200 http://mis6330.go.usu.edu/~jweeks/simple.html>
{'h1': [u'Simple h1 header'], 'h2': [u'Hello World!'], 'link': 'http://mis6330.go.usu.edu/~jweeks/simple.html', 'row1col3': [u'<td>Column3</td>'], 'row2col3': [u'<td>Column3</td>']}
2014-03-31 09:09:08-0600 [simple1] INFO: Closing spider (finished)
2014-03-31 09:09:08-0600 [simple1] INFO: Dumping Scrapy stats:
…
2014-03-31 09:09:08-0600 [simple1] INFO: Spider closed (finished)
jweeks@server2:~/scrapy/simple/simple$

30

Scrapy Tutorial
http://doc.scrapy.org/en/latest/intro/tutorial.html

Steps to scrape data from websites
1. Identify web page(s)
2. Identify data you wish to scrape
3. Parse HTML to extract data
4. Report the extracted data

32

Scrapy Tutorial
2. Identify data you wish to scrape
  – Title
  – Price
  – ISBN-10
  – ISBN-13
  – Language
  – URL to page
  – 5 star rating
  – # of pages

33

Scrapy Tutorial
3. Parse HTML to extract data
  – Title
  – Price

34

Scrapy Tutorial

3. Parse HTML to extract data
  – ISBN-10
  – ISBN-13
  – Language
  – URL
  – 5 star rating
  – # of pages

35

Scrapy Tutorial
4. Report the extracted data

Crawled (200) <GET http://www.amazon.com/Tableau-Your-Data-Analysis-Software/dp/1118612043/ref=zg_bs_549646_25%0A>
Scraped from <200 http://www.amazon.com/Microsoft-Access-2010-Step/dp/0735626928/ref=zg_bs_549646_26%0A>
{'ASIN': [], 'ISBN_10': [u'0735626928'], 'ISBN_13': [u'978-0735626928'], 'Language': [u'English'], 'OutOf5stars': [u'4.0'], 'Pages': [u'<ul><li><b>Paperback:</b> 448'], 'Product_Dimensions': [], 'Publisher': [u'Microsoft Press; Pap/Psc edition (July 20, 2010)'], 'Shipping_Weight': [u'1.7 pounds'], 'Title': []}
Scraped from <200 http://www.amazon.com/Doing-Data-Science-Straight-Frontline-ebook/dp/B00FRSNHDC/ref=zg_bs_549646_27%0A>
{'ASIN': [u'B00FRSNHDC'], 'ISBN_10': [], 'ISBN_13': [], 'Language': [u'English'], 'OutOf5stars': [u'4.4'], 'Pages': [u'<li><b>Print Length:</b> 406'], 'Product_Dimensions': [], 'Publisher': [u"O'Reilly Media; 1 edition (October 10, 2013)"], 'Shipping_Weight': [], 'Title': []}
Scraped from <200 http://www.amazon.com/Tableau-Your-Data-Analysis-Software/dp/1118612043/ref=zg_bs_549646_25%0A>
{'ASIN': [], 'ISBN_10': [u'1118612043'], 'ISBN_13': [u'978-1118612040'], 'Language': [u'English'], 'OutOf5stars': [u'4.3'], 'Pages': [u'<ul><li><b>Paperback:</b> 528'], 'Product_Dimensions': [], 'Publisher': [u'Wiley; 1 edition (November 11, 2013)'], 'Shipping_Weight': [u'2.6 pounds'], 'Title': []}
Scraped from <200 http://www.amazon.com/Sams-Teach-Yourself-Minutes-Edition-ebook/dp/B009XDGF2C/ref=zg_bs_549646_22%0A>
{'ASIN': [u'B009XDGF2C'], 'ISBN_10': [], 'ISBN_13': [], 'Language': [u'English'], 'OutOf5stars': [u'4.5'], 'Pages': [u'<li><b>Print Length:</b> 288'], 'Product_Dimensions': [], 'Publisher': [u'Sams Publishing; 4 edition (October 25, 2012)'], 'Shipping_Weight': [], 'Title': []}

36

Scrapy Commands
• Create Scrapy project
  – scrapy startproject amazondb
• Crawl and parse the web site
  – scrapy crawl amazon1
  – scrapy crawl amazon1 -t json -o amazondb_outputA.json
• Interactive parse checker
  – scrapy shell "http://www.amazon.com"

37

Scrapy Files
• Create Scrapy Project
  – Start a putty (ssh) connection to the Linux server
  – mkdir scrapy
  – cd scrapy
  – scrapy startproject amazondb

Files created by startproject:

amazondb/scrapy.cfg                      Ignore: defines project name
amazondb/amazondb/items.py               Defines data to be captured
amazondb/amazondb/settings.py            Ignore: defines project settings
amazondb/amazondb/__init__.py            Ignore: empty file
amazondb/amazondb/pipelines.py           Ignore: defines pipeline class
amazondb/amazondb/spiders/__init__.py    Ignore: empty file

38

Scrapy Walk Through
1. Define data fields to be captured: items.py
2. Define spider to parse and capture data
3. Examine output
4. Identify how to parse HTML source code and capture data fields
   1. Scrapy shell
   2. Regular Expression parsing
   3. XPath parsing with FirePath

39

items.py

Edit amazondb/amazondb/items.py
For each data field "name" you wish to capture, define: <name> = Field()

40

items.py
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class AmazondbItem(Item):
    # define the fields for your item here like:
    # name = Field()
    Link = Field()
    Title = Field()
    aTitle = Field()
    PageTitle = Field()
    Pages = Field()
    Publisher = Field()
    Language = Field()
    ASIN = Field()
    ISBN_10 = Field()
    ISBN_13 = Field()
    # Price will be part of your assignment

41

Spider Configuration
Create a spider configuration file:
amazondb/amazondb/spiders/amazon1_spiders.py
• Define the name of the spider "amazon1"
• Import appropriate libraries
• Import the Amazondb item class
• Limit the scope of searches to only "amazon.com"
• Specify the URLs from which to start the search
• Parse/capture each data field's information from the HTML

42

Scrapy Spider Configuration
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from amazondb.items import AmazondbItem

class AmazondbSpider(BaseSpider):
    name = "amazon1"
    allowed_domains = ['amazon.com']
    start_urls = [
        "http://www.amazon.com/Cookbook-Cookbooks-OReilly-Anthony-Molinaro/dp/0596009763/ref=zg_bs_549646_81",
        "http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82"
    ]

    def parse(self, response):
        aitem = AmazondbItem()
        webpage = Selector(response)
        aitem['Link'] = response.url
        aitem['PageTitle'] = webpage.xpath('//title/text()').extract()
        aitem['Title'] = webpage.xpath('.//*[@id="productTitle"]/text()').extract()
        aitem['aTitle'] = webpage.xpath('.//*[@id="btAsinTitle"]/text()').extract()
        webpagetable = webpage.xpath('//table[@id="productDetailsTable"]')
        aitem['Pages'] = webpagetable.re('(.*) pages')
        aitem['Publisher'] = webpagetable.re('<li><b>Publisher:</b> (.*)<\/li>')
        aitem['Language'] = webpagetable.re('<li><b>Language:</b> (.*)<\/li>')
        aitem['ASIN'] = webpagetable.re('<li><b>ASIN:</b> (.*)<\/li>')
        aitem['ISBN_10'] = webpagetable.re('<li><b>ISBN-10:</b> (.*)<\/li>')
        aitem['ISBN_13'] = webpagetable.re('<li><b>ISBN-13:</b> (.*)<\/li>')
        print '======'
        print '======'
        return aitem
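Price is deliberately left for the assignment. Purely as a hedged sketch of the shape of the answer (the regex below is illustrative; Amazon's actual price markup would have to be inspected with the tools above and will differ):

# items.py -- add the field:
Price = Field()

# in parse() -- capture it; this pattern is an assumption, not Amazon's real markup:
aitem['Price'] = webpage.re('List Price:.*?\$([0-9,.]+)')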

43

Scrapy Spider Configuration
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from amazondb.items import AmazondbItem

class AmazondbSpider(BaseSpider):
    name = "amazon1"
    allowed_domains = ['amazon.com']
    start_urls = [
        "http://www.amazon.com/Cookbook-Cookbooks-OReilly-Anthony-Molinaro/dp/0596009763/ref=zg_bs_549646_81",
        "http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82"
    ]

44

Scrapy Spider Configuration
    def parse(self, response):
        aitem = AmazondbItem()
        webpage = Selector(response)
        aitem['Link'] = response.url
        aitem['PageTitle'] = webpage.xpath('//title/text()').extract()
        aitem['Title'] = webpage.xpath('.//*[@id="productTitle"]/text()').extract()
        aitem['aTitle'] = webpage.xpath('.//*[@id="btAsinTitle"]/text()').extract()
        webpagetable = webpage.xpath('//table[@id="productDetailsTable"]')
        aitem['Pages'] = webpagetable.re('(.*) pages')
        aitem['Publisher'] = webpagetable.re('<li><b>Publisher:</b> (.*)<\/li>')
        aitem['Language'] = webpagetable.re('<li><b>Language:</b> (.*)<\/li>')
        aitem['ASIN'] = webpagetable.re('<li><b>ASIN:</b> (.*)<\/li>')
        aitem['ISBN_10'] = webpagetable.re('<li><b>ISBN-10:</b> (.*)<\/li>')
        aitem['ISBN_13'] = webpagetable.re('<li><b>ISBN-13:</b> (.*)<\/li>')
        print '======'
        print '======'
        return aitem

45

Scrapy Spider Configuration
Special Parse/Capture Commands

• URL of the scraped web page
  – response.url
• XPath commands to select specific HTML
  – xpath('//title/text()')
  – xpath('//table[@id="productDetailsTable"]')
  – xpath('.//*[@id="productTitle"]')
• Regular Expression parsing to capture "(.*)" fields
  – re('(.*) pages')
  – re('<li><b>Publisher:</b> (.*)<\/li>')
  – re('<li><b>ISBN-10:</b> (.*)<\/li>')

Additional Details:
http://doc.scrapy.org/en/latest/topics/selectors.html#topics-selectors

46

Amazon Crawl Run
scrapy crawl <spider_name> \
    -t <output_type> \
    -o <output_file_name>

scrapy crawl amazon2 -t json -o amazon2.json

47

Scrapy Spider Output
jweeks@server2:~/scrapy/amazondb/amazondb/spiders$ scrapy crawl amazon1 -t json -o output1.json
…
============
2014-03-10 18:25:47-0600 [amazon1] DEBUG: Scraped from <200 http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82>
{'ASIN': [],
 'ISBN_10': [u'0735668019'],
 'ISBN_13': [u'978-0735668010'],
 'Language': [u'English'],
 'Link': 'http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82',
 'PageTitle': [u'Microsoft Visual C# 2012 Step by Step (Step By Step (Microsoft)): John Sharp: 9780735668010: Amazon.com: Books'],
 'Pages': [u'<li><b>Paperback:</b> 842'],
 'Publisher': [u'Microsoft Press (January 11, 2013)'],
 'Title': [u'Microsoft Visual C# 2012 Step by Step (Step By Step (Microsoft))'],
 'aTitle': []}
2014-03-10 18:25:48-0600 [amazon1] DEBUG: Crawled (200) <GET http://www.amazon.com/Cookbook-Cookbooks-OReilly-Anthony-Molinaro/dp/0596009763/ref=zg_bs_549646_81> (referer: None)
…
jweeks@server2:~/scrapy/amazondb/amazondb/spiders$

48

Scrapy Spider Output
jweeks@server2:~/scrapy/amazondb/amazondb/spiders$ more output1.json
[{"Publisher": ["Microsoft Press (January 11, 2013)"], "Language": ["English"], "Title": ["Microsoft Visual C# 2012 Step by Step (Step By Step (Microsoft))"], "PageTitle": ["Microsoft Visual C# 2012 Step by Step (Step By Step (Microsoft)): John Sharp: 9780735668010: Amazon.com: Books"], "ASIN": [], "ISBN_10": ["0735668019"], "Link": "http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82", "aTitle": [], "ISBN_13": ["978-0735668010"], "Pages": ["<li><b>Paperback:</b> 842"]},
{"Publisher": ["O'Reilly Media; 1 edition (December 23, 2005)"], "Language": ["English"], "Title": [], "PageTitle": ["SQL Cookbook (Cookbooks (O'Reilly)): Anthony Molinaro: 9780596009762: Amazon.com: Books"], "ASIN": [], "ISBN_10": ["0596009763"], "Link": "http://www.amazon.com/Cookbook-Cookbooks-OReilly-Anthony-Molinaro/dp/0596009763/ref=zg_bs_549646_81", "aTitle": ["SQL Cookbook (Cookbooks (O'Reilly)) "], "ISBN_13": ["978-0596009762"], "Pages": ["<li><b>Paperback:</b> 636"]}]
jweeks@server2:~/scrapy/amazondb/amazondb/spiders$

JSON Pretty Print Validator
http://jsonformatter.curiousconcept.com/
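The same pretty-printing can be done locally with Python's standard library (a quick sketch), or from the shell with python -m json.tool output1.json:

import json
import pprint

with open('output1.json') as f:
    items = json.load(f)       # the feed is a JSON list of item dicts
pprint.pprint(items)           # indented, one key per line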

49

Scrapy Spider Output
JSON Pretty Print Validator
http://jsonformatter.curiousconcept.com/

50

Scrapy Spider Verbose Output
jweeks@server2:~/scrapy/amazondb/amazondb/spiders$ scrapy crawl amazon1 -t json -o output1.json
2014-03-10 18:25:46-0600 [scrapy] INFO: Scrapy 0.20.0 started (bot: amazondb)
2014-03-10 18:25:46-0600 [scrapy] DEBUG: Optional features available: ssl, http11
2014-03-10 18:25:46-0600 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'amazondb.spiders', 'FEED_FORMAT': 'json', 'SPIDER_MODULES': ['amazondb.spiders'], 'FEED_URI': 'output1.json', 'BOT_NAME': 'amazondb'}
2014-03-10 18:25:46-0600 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-03-10 18:25:46-0600 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-03-10 18:25:46-0600 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-03-10 18:25:46-0600 [scrapy] DEBUG: Enabled item pipelines:
2014-03-10 18:25:46-0600 [amazon1] INFO: Spider opened
2014-03-10 18:25:46-0600 [amazon1] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-03-10 18:25:46-0600 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-03-10 18:25:46-0600 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-03-10 18:25:47-0600 [amazon1] DEBUG: Crawled (200) <GET http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82> (referer: None)
…..
2014-03-10 18:25:48-0600 [amazon1] INFO: Closing spider (finished)
2014-03-10 18:25:48-0600 [amazon1] INFO: Stored json feed (2 items) in: output1.json
2014-03-10 18:25:48-0600 [amazon1] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 580,
         'downloader/request_count': 2,
         'downloader/request_method_count/GET': 2,
         'downloader/response_bytes': 183580,
         'downloader/response_count': 2,
         'downloader/response_status_count/200': 2,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2014, 3, 11, 0, 25, 48, 141281),
         'item_scraped_count': 2,
         'log_count/DEBUG': 10,
         'log_count/INFO': 4,
         'response_received_count': 2,
         'scheduler/dequeued': 2,
         'scheduler/dequeued/memory': 2,
         'scheduler/enqueued': 2,
         'scheduler/enqueued/memory': 2,
         'start_time': datetime.datetime(2014, 3, 11, 0, 25, 46, 743234)}
2014-03-10 18:25:48-0600 [amazon1] INFO: Spider closed (finished)
jweeks@server2:~/scrapy/amazondb/amazondb/spiders$

51

Scrapy Shell

• Use scrapy shell to test individual field name parsing commands
• Connect to a specified sample URL
• Use correct "double quotes" to reference the URL
• scrapy shell "http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82"

52

Scrapy Shell
scrapy shell "http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82"

• response.url
• XPath commands:
  – sel.xpath('//title/text()')
  – sel.xpath('//title/text()').extract()
  – sel.xpath('//table[@id="productDetailsTable"]')
  – sel.xpath('.//*[@id="productTitle"]').extract()
• Regular Expression parsing to capture "(.*)" fields
  – sel.re('<li><b>ISBN-10:</b> (.*)<\/li>')
  – sel.re('<li><b>Publisher:</b> (.*)<\/li>')

53

Scrapy Shell

• scrapy shell "http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82"

• Print all href for links on page
  – sel.xpath('//a/@href').extract()

• Print href for all images
  – sel.xpath('//a[contains(@href, "image")]/@href').extract()

• Pretty Print
  – import pprint
  – pp = pprint.PrettyPrinter(indent=4)
  – pp.pprint(sel.xpath('//a/@href').extract())
  – pp.pprint(sel.xpath('//a[contains(@href, "image")]/@href').extract())
  – pp.pprint(sel.xpath('//title/text()'))

54

Scrapy Shell
Extract the Title of the Web Page

>>> sel.xpath('//title')
[<Selector xpath='//title' data=u'<title>Microsoft Visual C# 2012 Step by '>]

>>> sel.xpath('//title').extract()
[u'<title>Microsoft Visual C# 2012 Step by Step (Step By Step (Microsoft)): John Sharp: 9780735668010: Amazon.com: Books</title>']

>>> sel.xpath('//title/text()')
[<Selector xpath='//title/text()' data=u'Microsoft Visual C# 2012 Step by Step (S'>]

>>> sel.xpath('//title/text()').extract()
[u'Microsoft Visual C# 2012 Step by Step (Step By Step (Microsoft)): John Sharp: 9780735668010: Amazon.com: Books']

55

Scrapy Shell
Split parsing into two commands. Why would you do this?

>>> sel.xpath('//ul/li/a/text()').extract()
[u'Your Amazon.com', u"Today's Deals", u'Gift Cards', u'Sell', u'Help', u'Books', u'Advanced Search', u'New Releases', u'Best Sellers', u'The\xa0New\xa0York Times\xae\xa0Best\xa0Sellers', u"Children's Books", u'Textbooks', u'Textbook Rentals', u'Sell\xa0Us Your\xa0Books', u'Best\xa0Books of\xa0the\xa0Month', u'Deals in\xa0Books', u'View shipping rates and policies', u'See Top 100 in Books', u'Careers', u'Investor Relations', u'Press Releases', u'Amazon and Our Planet', u'Amazon in the Community', u'Sell on Amazon', u'Become an Affiliate', u'Advertise Your Products', u'Independently Publish with Us', u'See all', u'Amazon.com Rewards Visa Card', u'Amazon.com Store Card', u'Shop with Points', u'Credit Card Marketplace', u'Amazon Currency Converter', u'Your Account', u'Shipping Rates & Policies', u'Amazon Prime', u'Returns & Replacements', u'Manage Your Kindle', u'Help', u'Australia', u'Brazil', u'Canada', u'China', u'France', u'Germany', u'India', u'Italy', u'Japan', u'Mexico', u'Spain', u'United Kingdom', u'Conditions of Use', u'Privacy Notice', u'Interest-Based Ads']

>>> sample = sel.xpath('//ul/li')
>>> sample.xpath('a/text()').extract()
[u'Your Amazon.com', u"Today's Deals", u'Gift Cards', u'Sell', u'Help', u'Books', u'Advanced Search', u'New Releases', u'Best Sellers', u'The\xa0New\xa0York Times\xae\xa0Best\xa0Sellers', u"Children's Books", u'Textbooks', u'Textbook Rentals', u'Sell\xa0Us Your\xa0Books', u'Best\xa0Books of\xa0the\xa0Month', u'Deals in\xa0Books', u'View shipping rates and policies', u'See Top 100 in Books', u'Careers', u'Investor Relations', u'Press Releases', u'Amazon and Our Planet', u'Amazon in the Community', u'Sell on Amazon', u'Become an Affiliate', u'Advertise Your Products', u'Independently Publish with Us', u'See all', u'Amazon.com Rewards Visa Card', u'Amazon.com Store Card', u'Shop with Points', u'Credit Card Marketplace', u'Amazon Currency Converter', u'Your Account', u'Shipping Rates & Policies', u'Amazon Prime', u'Returns & Replacements', u'Manage Your Kindle', u'Help', u'Australia', u'Brazil', u'Canada', u'China', u'France', u'Germany', u'India', u'Italy', u'Japan', u'Mexico', u'Spain', u'United Kingdom', u'Conditions of Use', u'Privacy Notice', u'Interest-Based Ads']
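One answer: the intermediate SelectorList can be reused for several relative queries, e.g. to keep each link's text and its href together instead of extracting two unrelated flat lists. A sketch in the same shell session:

>>> for link in sel.xpath('//ul/li/a'):        # one Selector per <a>
...     text = link.xpath('text()').extract()  # relative to this <a>
...     href = link.xpath('@href').extract()
...     print text, href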

56

Scrapy Shell
• Selector Tutorial
  – http://doc.scrapy.org/en/latest/topics/selectors.html

• Regular Expression How To
  – http://docs.python.org/2/howto/regex.html
  – http://docs.python.org/2/library/re.html

• XPath
  – Easiest to determine using Firefox with the Firebug and FirePath add-ons

57

Firefox + Firebug + FirePath
FirePath XPath Extension to Firebug
https://addons.mozilla.org/en-US/firefox/addon/firepath/

Firebug Add-On
https://www.getfirebug.com/

FirePath displays the XPath to an element, such as: html/body/table/tr[2]/td[3]
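The absolute path FirePath reports usually works as-is, but it can be shortened to the // form used in the spiders above; both of these select the same cell of the simple.html table from earlier:

sel.xpath('/html/body/table/tr[2]/td[3]/text()').extract()   # absolute, as FirePath reports it
sel.xpath('//table/tr[2]/td[3]/text()').extract()            # equivalent // form, less brittle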

58

Regular Expression Parsing
^         matches start of the string
$         matches end of the string
[5b-d]    matches any chars '5', 'b', 'c' or 'd'
[^a-c6]   matches any char except 'a', 'b', 'c' or '6'
\         escapes special characters

\w        Alphanumeric: [0-9a-zA-Z_], or is LOCALE dependent
\W        Non-alphanumeric
\d        Digit
\D        Non-digit
\s        Whitespace: [ \t\n\r\f\v] (space, tab, newline, …)
\S        Non-whitespace
.         matches any character
*         0 or more occurrences
+         1 or more occurrences
?         0 or 1 occurrences
{m}       exactly 'm' occurrences
R|S       matches either regex R or regex S

()              Creates a capture group, and indicates precedence
(?P<name>...)   Creates a named capturing group
(?P=name)       Matches whatever the previously named group matched
(?(id)yes|no)   Match 'yes' if group 'id' matched, else 'no'

Complete Cheat Sheet with additional details:
http://cloud.github.com/downloads/tartley/python-regex-cheatsheet/cheatsheet.pdf

59

Regular Expression Parsing
Online real-time Regular Expression test: https://pythex.org/

60

Regular Expression Parsing
Test Code:

<html>
  <head>
    <title>This is a very simple web page</title>
  </head>
  <body>
    <h1>Simple h1 header</h1>
    <table>
      <tr> <td>Row1</td> <td>Column2</td> <td>Column3</td> </tr>
      <tr> <td>Row2</td> <td>Column2</td> <td>Column3</td> </tr>
    </table>
  </body>
</html>

Regular Expressions:
<title>.*</title>
<title>(.*)</title>
\w\d
\w(\d)
[\s<>]\w{2}[\s<>]
[\s<>](\w{4})[\s<>]
[\s<>](?P<first>\w{4})[\s<>](?P<second>\w{6})[\s<>]
[\s<](?P<first>\w+)[\s>](?P<middle>.+)[\s>]</(?P=first)>
<(?P<first>\w+)>(?P<middle>.+)</(?P=first)>
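These patterns can also be tested without leaving Python; a short sketch running a few of them against one line of the test code with the standard re module:

import re

html = '<title>This is a very simple web page</title>'

print re.findall('<title>.*</title>', html)    # whole tag, including <title>...</title>
print re.findall('<title>(.*)</title>', html)  # capture group only: ['This is a very simple web page']
print re.findall('<(?P<first>\w+)>(?P<middle>.+)</(?P=first)>', html)
# named groups come back as tuples: [('title', 'This is a very simple web page')]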

61

Scrapy Shell XPath and RE Example
XPath: //title/text()
RE: Extract the elements before a ":"

>>> sel.xpath('//title/text()').re('(\w+):')
[u'Sharp', u'9780735668010', u'com']

>>> sel.xpath('//title/text()').re('(.+):')
[u'Microsoft Visual C# 2012 Step by Step (Step By Step (Microsoft)): John Sharp: 9780735668010: Amazon.com']

>>> sel.xpath('//title/text()').re('(\S+):')
[u'(Microsoft))', u'Sharp', u'9780735668010', u'Amazon.com']

Note how the three patterns differ: \w+ stops at non-word characters, the greedy .+ swallows everything up to the last colon as a single group, and \S+ stops only at whitespace.

62

Amazon Crawl
Modify the amazon spider from scraping two specified web pages to crawling a series of web pages showing the 100 top-ranked database books (ranks 1-20, 21-40, 41-60, 61-80, and 81-100) and scraping the associated 100 books.

63

Amazon Crawl
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from amazondb.items import AmazondbItem

class AmazondbSpider(CrawlSpider):
    name = "amazon2"
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com/Best-Sellers-Books-Databases/zgbs/books/549646/ref=zg_bs_nav_b_2_5#2']

    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//ol[@class="zg_pagination"]'))),
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="zg_title"]')), callback='parse_item')
    )

    def parse_item(self, response):
        aitem = AmazondbItem()
        webpage = Selector(response)
        aitem['Link'] = response.url
        aitem['Title'] = webpage.xpath('.//*[@id="productTitle"]/text()').extract()
        aitem['aTitle'] = webpage.xpath('.//*[@id="btAsinTitle"]/text()').extract()
        aitem['PageTitle'] = webpage.xpath('//title/text()').extract()
        webpagetable = webpage.xpath('//table[@id="productDetailsTable"]')
        aitem['Pages'] = webpagetable.re('(.*) pages')
        aitem['Publisher'] = webpagetable.re('<li><b>Publisher:</b> (.*)<\/li>')
        aitem['Language'] = webpagetable.re('<li><b>Language:</b> (.*)<\/li>')
        aitem['ASIN'] = webpagetable.re('<li><b>ASIN:</b> (.*)<\/li>')
        aitem['ISBN_10'] = webpagetable.re('<li><b>ISBN-10:</b> (.*)<\/li>')
        aitem['ISBN_13'] = webpagetable.re('<li><b>ISBN-13:</b> (.*)<\/li>')
        return aitem
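The two rules do different jobs (standard CrawlSpider behavior: a Rule without a callback simply follows the links it extracts; a Rule with a callback hands each fetched page to that method). The same rules, annotated:

rules = (
    # no callback: follow the pagination links (ranks 21-40, 41-60, ...) and keep crawling
    Rule(SgmlLinkExtractor(restrict_xpaths=('//ol[@class="zg_pagination"]'))),
    # callback: fetch each book-title link and pass the product page to parse_item
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="zg_title"]')), callback='parse_item')
)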

64

Amazon Crawl
jweeks@server2:~/scrapy/amazondb/amazondb/spiders$ diff amazon1_spider.py amazon2_spider.py

< from scrapy.spider import BaseSpider
> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
> from scrapy.contrib.spiders import CrawlSpider, Rule

< class AmazondbSpider(BaseSpider):
> class AmazondbSpider(CrawlSpider):

<     name = "amazon1"
>     name = "amazon2"

<     start_urls = [
<         "http://www.amazon.com/Cookbook-Cookbooks-OReilly-Anthony-Molinaro/dp/0596009763/ref=zg_bs_549646_81",
<         "http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82" ]
>     start_urls = ["http://www.amazon.com/Best-Sellers-Books-Databases/zgbs/books/549646/ref=zg_bs_nav_b_2_5#2"]

>     rules = (
>         Rule(SgmlLinkExtractor(restrict_xpaths=('//ol[@class="zg_pagination"]'))),
>         Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="zg_title"]')), callback='parse_item')
>     )

<     def parse(self, response):
>     def parse_item(self, response):

65

Amazon Crawl Changes
• Import additional libraries
• BaseSpider -> CrawlSpider
• Change the name of the spider
• Change the start_urls
• Add "rules" to either crawl or parse
• parse -> parse_item (CrawlSpider uses parse() internally, so the callback needs a different name)
• Add additional fields, if desired
• Update items.py to include the additional fields

66

Amazon Crawl Run
scrapy crawl <spider_name> \
    -t <output_type> \
    -o <output_file_name>

scrapy crawl amazon2 -t json -o amazon2.json