Register at Utah Geek Events: http://www.utahgeekevents.com/
Free event with lunch provided, dozens of speakers, giveaways, sponsors, and all things data!
MIS6330 - Spring 2015
• Database Implementation
• David Paper
• Python
• Object-Oriented Programming
• PL/SQL on Oracle
• PyMongo
• MongoDB
http://webextract.net/

http://scrapy.org/
Data Problem:
300,000 books published annually
129,864,880 books published
About 10-15% still in print
Module Goal

Crawled (200) <GET http://www.amazon.com/Tableau-Your-Data-Analysis-Software/dp/1118612043/ref=zg_bs_549646_25%0A>
Scraped from <200 http://www.amazon.com/Microsoft-Access-2010-Step/dp/0735626928/ref=zg_bs_549646_26%0A>
{'ASIN': [], 'ISBN_10': [u'0735626928'], 'ISBN_13': [u'978-0735626928'], 'Language': [u'English'], 'OutOf5stars': [u'4.0'], 'Pages': [u'<ul><li><b>Paperback:</b> 448'], 'Product_Dimensions': [], 'Publisher': [u'Microsoft Press; Pap/Psc edition (July 20, 2010)'], 'Shipping_Weight': [u'1.7 pounds'], 'Title': []}
Scraped from <200 http://www.amazon.com/Doing-Data-Science-Straight-Frontline-ebook/dp/B00FRSNHDC/ref=zg_bs_549646_27%0A>
{'ASIN': [u'B00FRSNHDC'], 'ISBN_10': [], 'ISBN_13': [], 'Language': [u'English'], 'OutOf5stars': [u'4.4'], 'Pages': [u'<li><b>Print Length:</b> 406'], 'Product_Dimensions': [], 'Publisher': [u"O'Reilly Media; 1 edition (October 10, 2013)"], 'Shipping_Weight': [], 'Title': []}
Scraped from <200 http://www.amazon.com/Tableau-Your-Data-Analysis-Software/dp/1118612043/ref=zg_bs_549646_25%0A>
{'ASIN': [], 'ISBN_10': [u'1118612043'], 'ISBN_13': [u'978-1118612040'], 'Language': [u'English'], 'OutOf5stars': [u'4.3'], 'Pages': [u'<ul><li><b>Paperback:</b> 528'], 'Product_Dimensions': [], 'Publisher': [u'Wiley; 1 edition (November 11, 2013)'], 'Shipping_Weight': [u'2.6 pounds'], 'Title': []}
Scraped from <200 http://www.amazon.com/Sams-Teach-Yourself-Minutes-Edition-ebook/dp/B009XDGF2C/ref=zg_bs_549646_22%0A>
{'ASIN': [u'B009XDGF2C'], 'ISBN_10': [], 'ISBN_13': [], 'Language': [u'English'], 'OutOf5stars': [u'4.5'], 'Pages': [u'<li><b>Print Length:</b> 288'], 'Product_Dimensions': [], 'Publisher': [u'Sams Publishing; 4 edition (October 25, 2012)'], 'Shipping_Weight': [], 'Title': []}
Tools to gather HTML source code
• Numerous methods to get/view HTML code
  – Web Browser -> View Source
  – curl
  – wget
  – telnet <host> 80
• HTTP GET
  – Firebug or Inspector in Firefox or Chrome
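The telnet method works because an HTTP GET is just plain text sent over port 80. As a rough sketch (the helper name is mine, not from the slides), the lines you would type after `telnet <host> 80` can be built in Python:

```python
def build_get_request(host, path="/"):
    """Build the raw text of a minimal HTTP/1.1 GET request,
    i.e. the same lines you would type after `telnet <host> 80`."""
    return ("GET %s HTTP/1.1\r\n"
            "Host: %s\r\n"
            "Connection: close\r\n"
            "\r\n") % (path, host)

print(build_get_request("mis6330.go.usu.edu", "/~jweeks/simple.html"))
```

Pasting the printed request into a telnet session should return the page's HTML, the same bytes curl and wget retrieve.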
Web Browser -> View Source
Programs to grab HTML code
How do we write a program that will grab and parse multiple web pages nightly?
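One answer, before introducing Scrapy: a small script scheduled nightly (for example from cron). A minimal sketch, with illustrative function names that are not part of the course code; the fetcher is injected so the loop can be tested without the network:

```python
import re

def parse_title(html):
    """Pull the <title> text out of an HTML page with a regex
    (good enough for simple pages such as simple.html)."""
    m = re.search(r'<title>(.*?)</title>', html, re.DOTALL)
    return m.group(1) if m else None

def grab_all(urls, fetch):
    """fetch(url) -> HTML string (e.g. urllib in production, a stub in tests).
    Returns a dict mapping each URL to its parsed title."""
    return dict((url, parse_title(fetch(url))) for url in urls)
```

Run nightly with a crontab entry along the lines of `0 2 * * * python grab.py`. Scrapy, covered below, replaces this hand-rolled loop with a proper crawler.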
curl

jweeks@server2:~$ curl mis6330.go.usu.edu/~jweeks/simple.html
<html>
  <head>
    <title>This is a very simple web page</title>
  </head>
  <body>
    <h1>Simple h1 header</h1>
    <table>
      <tr> <td>Row1</td> <td>Column2</td> <td>Column3</td> </tr>
      <tr> <td>Row2</td> <td>Column2</td> <td>Column3</td> </tr>
    </table>
  </body>
</html>
jweeks@server2:~$
wget

jweeks@server2:~$ wget mis6330.go.usu.edu/~jweeks/simple.html
--2014-03-10 11:17:26--  http://mis6330.go.usu.edu/~jweeks/simple.html
Resolving mis6330.go.usu.edu (mis6330.go.usu.edu)... 129.123.55.48
Connecting to mis6330.go.usu.edu (mis6330.go.usu.edu)|129.123.55.48|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 349 [text/html]
Saving to: `simple.html.1'

100%[======================================>] 349         --.-K/s   in 0s

2014-03-10 11:17:26 (40.2 MB/s) - `simple.html.1' saved [349/349]

jweeks@server2:~$ more simple.html
<html>
  <head>
    <title>This is a very simple web page</title>
  </head>
  <body>
    <h1>Simple h1 header</h1>
    <table>
      <tr> <td>Row1</td> <td>Column2</td> <td>Column3</td> </tr>
      <tr> <td>Row2</td> <td>Column2</td> <td>Column3</td> </tr>
    </table>
  </body>
</html>
jweeks@server2:~$
Firefox + Firebug
Firebug Add-On being replaced by built-in developer tools
https://www.getfirebug.com/
Firefox -> Tools -> Web Developer -> Inspector
Chrome -> Tools -> Developer Tools
Problem: Extract Amazon Book Prices
How do we extract 100+ book prices, titles, and ISBNs daily?
What if you want to extract all product prices daily?
Amazon Best Sellers
http://www.amazon.com/gp/bestsellers/books/ref=sv_b_2
Amazon Best Sellers in Databases
Rank 1-20:
http://www.amazon.com/Best-Sellers-Books-Databases/zgbs/books/549646/ref=zg_bs_nav_b_2_5
Rank 21-40:
http://www.amazon.com/Best-Sellers-Books-Databases/zgbs/books/549646/ref=zg_bs_nav_b_2_5#2
Problem: Extract Amazon Book Prices
Amazon Product Advertising API
What if the Amazon API is not available or restricts access?

Query restrictions: one query per second
• 60 seconds * 60 minutes * 24 hours = 86,400 queries per day
• Add 1 query per second for every $4,600 of item revenue per month

https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html
http://docs.aws.amazon.com/AWSECommerceService/latest/DG/TroubleshootingApplications.html
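The quota arithmetic above can be sanity-checked in a few lines (the function names are mine, for illustration only):

```python
def queries_per_second(monthly_item_revenue):
    """1 baseline query per second, plus 1 more for every $4,600
    of item revenue per month (per the API restrictions above)."""
    return 1 + int(monthly_item_revenue // 4600)

def queries_per_day(monthly_item_revenue):
    # 60 seconds * 60 minutes * 24 hours = 86,400 seconds per day
    return queries_per_second(monthly_item_revenue) * 60 * 60 * 24

print(queries_per_day(0))  # -> 86400, the baseline daily quota
```

So an affiliate generating $9,200 of monthly item revenue would get 3 queries per second, or 259,200 per day.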
Solution: Scrapy
• Web scraping
• Utility to pull data from websites
• Automated HTML gathering and parsing
• Developed in Python
Scrapy Minimalist Example
Run-through followed by detailed examples

Steps to scrape data from websites
1. Identify web page(s)
   http://mis6330.go.usu.edu/~jweeks/simple.html
2. Identify data you wish to scrape
3. Parse HTML to extract data
4. Report the extracted data
Simple Example Web Page
http://mis6330.go.usu.edu/~jweeks/simple.html
Identify the data you wish to scrape
Scrapy Minimalist Example
SSH via MobaXterm to: server2.go.usu.edu
Login: hadoop
Password: mis6110
Port: 22 (default)
Sample runs on Amazon.com

cd
cd scrapy/amazon/amazon/spiders/
scrapy crawl amazon -o `date +amazon_spider.%Y.%m.%d.json` -t json --nolog
head `date +amazon_spider.%Y.%m.%d.json`

cd ~/scrapy/amazonr/amazonr/spiders/
scrapy crawl amazonr -o `date +amazonr_spider.%Y.%m.%d.json` -t json --nolog
head -3 `date +amazonr_spider.%Y.%m.%d.json`
{"Publisher": ["Sams Publishing; 4 edition (November 4, 2012)"], "Language": ["English"], "Title": ["\n ", "\n \n \n \n \n \n \n\t\t ", "\n\t \n "], "OutOf5stars": [], "Product_Dimensions": [], "Shipping_Weight": ["12.8 ounces (<a href=\"/gp/help/seller/shipping.html?ie=UTF8&asin=0672336073&seller=ATVPDKIKX0DER\">View shipping rates and policies</a>)"], "PageTitle": ["SQL in 10 Minutes, Sams Teach Yourself (4th Edition): Ben Forta: 0752063336076: Amazon.com: Books"], "ASIN": [], "ISBN_10": ["0672336073"], "Link": "http://www.amazon.com/Minutes-Sams-Teach-Yourself-Edition/dp/0672336073", "ISBN_13": ["978-0672336072"], "Pages": ["<li><b>Paperback:</b> 288"]},
{"Publisher": ["Packt Publishing - ebooks Account (September 2015)"], "Language": ["English"], "Title": ["\n ", "\n \n \n \n \n ", "\n \n \n \n \n \n \n \n \n \n ", "\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n ", "\n \n \n \n \n "], "OutOf5stars": [], "Product_Dimensions": [], "Shipping_Weight": ["2.1 pounds (<a href=\"/gp/help/seller/shipping.html?ie=UTF8&asin=1783555130&seller=ATVPDKIKX0DER\">View shipping rates and policies</a>)"], "PageTitle": ["Python Machine Learning: Sebastian Raschka: 9781783555130: Amazon.com: Books"], "ASIN": [], "ISBN_10": ["1783555130"], "Link": "http://www.amazon.com/Python-Machine-Learning-Sebastian-Raschka/dp/1783555130", "ISBN_13": ["978-1783555130"], "Pages": ["<ul><li><b>Paperback:</b> 454"]},
Scrapy Minimalist Example
Do not run these commands during the demo.
Replace "simple" with a project name of your selection.

cd scrapy
scrapy startproject simple
vi ~/scrapy/simple/simple/items.py
vi ~/scrapy/simple/simple/spiders/simple1_spiders.py
cd ~/scrapy/simple/simple/spiders
scrapy crawl simple1
Scrapy Minimalist Example

cat ~/scrapy/simple/simple/items.py

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class SimpleItem(Item):
    # define the fields for your item here like:
    link = Field()
    h1 = Field()
    h2 = Field()
    row1col3 = Field()
    row2col3 = Field()
cat ~/scrapy/simple/simple/spiders/simple1_spiders.py

from scrapy.spider import BaseSpider
from simple.items import SimpleItem
from scrapy.selector import Selector

class SimpleSpider(BaseSpider):
    name = "simple1"
    start_urls = ['http://mis6330.go.usu.edu/~jweeks/simple.html']

    def parse(self, response):
        sitem = SimpleItem()
        webpage = Selector(response)
        sitem['link'] = response.url
        sitem['h1'] = webpage.xpath('//h1/text()').extract()
        sitem['h2'] = webpage.re('<h2>(.*?)</h2>')
        sitem['row1col3'] = webpage.xpath('.//table/tr[1]/td[3]').extract()
        sitem['row2col3'] = webpage.xpath('//table/tr[2]').re('(<td>.*?3</td>)')
        print sitem
        return sitem
jweeks@server2:~/scrapy/simple/simple$ scrapy crawl simple1
2014-03-31 09:09:08-0600 [scrapy] INFO: Scrapy 0.20.0 started (bot: simple)
…
{'h1': [u'Simple h1 header'], 'h2': [u'Hello World!'], 'link': 'http://mis6330.go.usu.edu/~jweeks/simple.html', 'row1col3': [u'<td>Column3</td>'], 'row2col3': [u'<td>Column3</td>']}
2014-03-31 09:09:08-0600 [simple1] DEBUG: Scraped from <200 http://mis6330.go.usu.edu/~jweeks/simple.html>
{'h1': [u'Simple h1 header'], 'h2': [u'Hello World!'], 'link': 'http://mis6330.go.usu.edu/~jweeks/simple.html', 'row1col3': [u'<td>Column3</td>'], 'row2col3': [u'<td>Column3</td>']}
2014-03-31 09:09:08-0600 [simple1] INFO: Closing spider (finished)
2014-03-31 09:09:08-0600 [simple1] INFO: Dumping Scrapy stats:
…
2014-03-31 09:09:08-0600 [simple1] INFO: Spider closed (finished)
jweeks@server2:~/scrapy/simple/simple$
Scrapy Tutorial
http://doc.scrapy.org/en/latest/intro/tutorial.html

Steps to scrape data from websites
1. Identify web page(s)
2. Identify data you wish to scrape
3. Parse HTML to extract data
4. Report the extracted data
Scrapy Tutorial
1. Identify web page(s)
   – Books:
     http://www.amazon.com/Cookbook-Cookbooks-OReilly-Anthony-Molinaro/dp/0596009763/ref=zg_bs_549646_81
     http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82
Scrapy Tutorial
2. Identify data you wish to scrape
   – Title
   – Price
   – ISBN-10
   – ISBN-13
   – Language
   – URL to page
   – 5 star rating
   – # of pages
Scrapy Tutorial
3. Parse HTML to extract data
   – Title
   – Price
Scrapy Tutorial
3. Parse HTML to extract data
   – ISBN-10
   – ISBN-13
   – Language
   – URL
   – 5 star rating
   – # of pages
Scrapy Tutorial
4. Report the extracted data
(Sample crawl output: the same records shown on the Module Goal slide.)
Scrapy Commands
• Create Scrapy project
  – scrapy startproject amazondb
• Crawl and parse the web site
  – scrapy crawl amazon1
  – scrapy crawl amazon1 -t json -o amazondb_outputA.json
• Interactive parse checker
  – scrapy shell "http://www.amazon.com"
Scrapy Files
• Create Scrapy project
  – Start a putty (ssh) connection to the Linux server
  – mkdir scrapy
  – cd scrapy
  – scrapy startproject amazondb

Files created by startproject:
amazondb/scrapy.cfg                      Ignore: defines project name
amazondb/amazondb/items.py               Defines data to be captured
amazondb/amazondb/settings.py            Ignore: defines project settings
amazondb/amazondb/__init__.py            Ignore: empty file
amazondb/amazondb/pipelines.py           Ignore: defines pipeline class
amazondb/amazondb/spiders/__init__.py    Ignore: empty file
Scrapy Walk-Through
1. Define data fields to be captured: items.py
2. Define spider to parse and capture data
3. Examine output
4. Identify how to parse HTML source code and capture data fields
   1. Scrapy shell
   2. Regular Expression parsing
   3. XPath parsing with FirePath
items.py

Edit amazondb/amazondb/items.py
For each data field "name" you wish to capture, define: <name> = Field()
items.py

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class AmazondbItem(Item):
    # define the fields for your item here like:
    # name = Field()
    Link = Field()
    Title = Field()
    aTitle = Field()
    PageTitle = Field()
    Pages = Field()
    Publisher = Field()
    Language = Field()
    ASIN = Field()
    ISBN_10 = Field()
    ISBN_13 = Field()
    # Price will be part of your assignment
Spider Configuration
Create a spider configuration file:
amazondb/amazondb/spiders/amazon1_spiders.py
• Define the name of the spider: "amazon1"
• Import appropriate libraries
• Import the AmazondbItem class
• Limit the scope of searches to only "amazon.com"
• Specify the URLs from which to start the search
• Parse/capture each data field's information from the HTML
Scrapy Spider Configuration

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from amazondb.items import AmazondbItem

class AmazondbSpider(BaseSpider):
    name = "amazon1"
    allowed_domains = ['amazon.com']
    start_urls = [
        "http://www.amazon.com/Cookbook-Cookbooks-OReilly-Anthony-Molinaro/dp/0596009763/ref=zg_bs_549646_81",
        "http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82"
    ]

    def parse(self, response):
        aitem = AmazondbItem()
        webpage = Selector(response)
        aitem['Link'] = response.url
        aitem['PageTitle'] = webpage.xpath('//title/text()').extract()
        aitem['Title'] = webpage.xpath('.//*[@id="productTitle"]/text()').extract()
        aitem['aTitle'] = webpage.xpath('.//*[@id="btAsinTitle"]/text()').extract()
        webpagetable = webpage.xpath('//table[@id="productDetailsTable"]')
        aitem['Pages'] = webpagetable.re('(.*) pages')
        aitem['Publisher'] = webpagetable.re('<li><b>Publisher:</b> (.*)<\/li>')
        aitem['Language'] = webpagetable.re('<li><b>Language:</b> (.*)<\/li>')
        aitem['ASIN'] = webpagetable.re('<li><b>ASIN:</b> (.*)<\/li>')
        aitem['ISBN_10'] = webpagetable.re('<li><b>ISBN-10:</b> (.*)<\/li>')
        aitem['ISBN_13'] = webpagetable.re('<li><b>ISBN-13:</b> (.*)<\/li>')
        print '======'
        print '======'
        return aitem
Scrapy Spider Configuration

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from amazondb.items import AmazondbItem

class AmazondbSpider(BaseSpider):
    name = "amazon1"
    allowed_domains = ['amazon.com']
    start_urls = [
        "http://www.amazon.com/Cookbook-Cookbooks-OReilly-Anthony-Molinaro/dp/0596009763/ref=zg_bs_549646_81",
        "http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82"
    ]
Scrapy Spider Configuration

    def parse(self, response):
        aitem = AmazondbItem()
        webpage = Selector(response)
        aitem['Link'] = response.url
        aitem['PageTitle'] = webpage.xpath('//title/text()').extract()
        aitem['Title'] = webpage.xpath('.//*[@id="productTitle"]/text()').extract()
        aitem['aTitle'] = webpage.xpath('.//*[@id="btAsinTitle"]/text()').extract()
        webpagetable = webpage.xpath('//table[@id="productDetailsTable"]')
        aitem['Pages'] = webpagetable.re('(.*) pages')
        aitem['Publisher'] = webpagetable.re('<li><b>Publisher:</b> (.*)<\/li>')
        aitem['Language'] = webpagetable.re('<li><b>Language:</b> (.*)<\/li>')
        aitem['ASIN'] = webpagetable.re('<li><b>ASIN:</b> (.*)<\/li>')
        aitem['ISBN_10'] = webpagetable.re('<li><b>ISBN-10:</b> (.*)<\/li>')
        aitem['ISBN_13'] = webpagetable.re('<li><b>ISBN-13:</b> (.*)<\/li>')
        print '======'
        print '======'
        return aitem
Scrapy Spider Configuration
Special Parse/Capture Commands

• URL of the scraped web page
  – response.url
• XPath commands to select specific HTML
  – xpath('//title/text()')
  – xpath('//table[@id="productDetailsTable"]')
  – xpath('.//*[@id="productTitle"]')
• Regular Expression parsing to capture "(.*)" fields
  – re('(.*) pages')
  – re('<li><b>Publisher:</b> (.*)<\/li>')
  – re('<li><b>ISBN-10:</b> (.*)<\/li>')

Additional details:
http://doc.scrapy.org/en/latest/topics/selectors.html#topics-selectors
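The behaviour of those .re() capture patterns can be checked outside Scrapy with Python's plain re module. A sketch under stated assumptions: the HTML below is a contrived stand-in for Amazon's product-details table (not a real page), and because it sits on one line the patterns here use non-greedy .*? where the slides' greedy (.*) relies on each <li> occupying its own line:

```python
import re

# Contrived stand-in for the productDetailsTable markup (one line).
html = ('<table id="productDetailsTable">'
        '<li><b>Publisher:</b> Microsoft Press (January 11, 2013)</li>'
        '<li><b>ISBN-10:</b> 0735668019</li>'
        '<li><b>Paperback:</b> 842 pages</li>'
        '</table>')

def capture(pattern, text):
    """Mimic Selector.re(): return every capture-group match as a list."""
    return re.findall(pattern, text)

publisher = capture(r'<li><b>Publisher:</b> (.*?)</li>', html)
isbn10 = capture(r'<li><b>ISBN-10:</b> (.*?)</li>', html)
pages = capture(r'(\d+) pages', html)
```

Each result is a list, matching the list-valued fields seen in the spider output.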
Amazon Crawl Run

scrapy crawl <spider_name> \
    -t <output_type> \
    -o <output_file_name>

scrapy crawl amazon2 -t json -o amazon2.json
Scrapy Spider Output

jweeks@server2:~/scrapy/amazondb/amazondb/spiders$ scrapy crawl amazon1 -t json -o output1.json
…
============
2014-03-10 18:25:47-0600 [amazon1] DEBUG: Scraped from <200 http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82>
{'ASIN': [], 'ISBN_10': [u'0735668019'], 'ISBN_13': [u'978-0735668010'], 'Language': [u'English'], 'Link': 'http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82', 'PageTitle': [u'Microsoft Visual C# 2012 Step by Step (Step By Step (Microsoft)): John Sharp: 9780735668010: Amazon.com: Books'], 'Pages': [u'<li><b>Paperback:</b> 842'], 'Publisher': [u'Microsoft Press (January 11, 2013)'], 'Title': [u'Microsoft Visual C# 2012 Step by Step (Step By Step (Microsoft))'], 'aTitle': []}
2014-03-10 18:25:48-0600 [amazon1] DEBUG: Crawled (200) <GET http://www.amazon.com/Cookbook-Cookbooks-OReilly-Anthony-Molinaro/dp/0596009763/ref=zg_bs_549646_81> (referer: None)
…
jweeks@server2:~/scrapy/amazondb/amazondb/spiders$
Scrapy Spider Output

jweeks@server2:~/scrapy/amazondb/amazondb/spiders$ more output1.json
[{"Publisher": ["Microsoft Press (January 11, 2013)"], "Language": ["English"], "Title": ["Microsoft Visual C# 2012 Step by Step (Step By Step (Microsoft))"], "PageTitle": ["Microsoft Visual C# 2012 Step by Step (Step By Step (Microsoft)): John Sharp: 9780735668010: Amazon.com: Books"], "ASIN": [], "ISBN_10": ["0735668019"], "Link": "http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82", "aTitle": [], "ISBN_13": ["978-0735668010"], "Pages": ["<li><b>Paperback:</b> 842"]},
 {"Publisher": ["O'Reilly Media; 1 edition (December 23, 2005)"], "Language": ["English"], "Title": [], "PageTitle": ["SQL Cookbook (Cookbooks (O'Reilly)): Anthony Molinaro: 9780596009762: Amazon.com: Books"], "ASIN": [], "ISBN_10": ["0596009763"], "Link": "http://www.amazon.com/Cookbook-Cookbooks-OReilly-Anthony-Molinaro/dp/0596009763/ref=zg_bs_549646_81", "aTitle": ["SQL Cookbook (Cookbooks (O'Reilly)) "], "ISBN_13": ["978-0596009762"], "Pages": ["<li><b>Paperback:</b> 636"]}]
jweeks@server2:~/scrapy/amazondb/amazondb/spiders$

JSON Pretty Print Validator: http://jsonformatter.curiousconcept.com/
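The feed can also be pretty-printed locally with Python's json module instead of the online validator. A minimal sketch on one abridged record (reading the full file would be `json.load(open('output1.json'))`):

```python
import json

# One record from the JSON feed, abridged for illustration.
record = {"ISBN_10": ["0735668019"],
          "ISBN_13": ["978-0735668010"],
          "Language": ["English"]}

# indent=4 expands nested dicts/lists onto their own lines.
pretty = json.dumps(record, indent=4, sort_keys=True)
print(pretty)
```

The round trip `json.loads(pretty)` returns the original record, which doubles as a validity check.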
Scrapy Spider Output
JSON Pretty Print Validator: http://jsonformatter.curiousconcept.com/
Scrapy Spider Verbose Output

jweeks@server2:~/scrapy/amazondb/amazondb/spiders$ scrapy crawl amazon1 -t json -o output1.json
2014-03-10 18:25:46-0600 [scrapy] INFO: Scrapy 0.20.0 started (bot: amazondb)
2014-03-10 18:25:46-0600 [scrapy] DEBUG: Optional features available: ssl, http11
2014-03-10 18:25:46-0600 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'amazondb.spiders', 'FEED_FORMAT': 'json', 'SPIDER_MODULES': ['amazondb.spiders'], 'FEED_URI': 'output1.json', 'BOT_NAME': 'amazondb'}
2014-03-10 18:25:46-0600 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-03-10 18:25:46-0600 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-03-10 18:25:46-0600 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-03-10 18:25:46-0600 [scrapy] DEBUG: Enabled item pipelines:
2014-03-10 18:25:46-0600 [amazon1] INFO: Spider opened
2014-03-10 18:25:46-0600 [amazon1] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-03-10 18:25:46-0600 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-03-10 18:25:46-0600 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-03-10 18:25:47-0600 [amazon1] DEBUG: Crawled (200) <GET http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82> (referer: None)
…..
2014-03-10 18:25:48-0600 [amazon1] INFO: Closing spider (finished)
2014-03-10 18:25:48-0600 [amazon1] INFO: Stored json feed (2 items) in: output1.json
2014-03-10 18:25:48-0600 [amazon1] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 580,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 183580,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2014, 3, 11, 0, 25, 48, 141281),
 'item_scraped_count': 2,
 'log_count/DEBUG': 10,
 'log_count/INFO': 4,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2014, 3, 11, 0, 25, 46, 743234)}
2014-03-10 18:25:48-0600 [amazon1] INFO: Spider closed (finished)
jweeks@server2:~/scrapy/amazondb/amazondb/spiders$
Scrapy Shell
• Use scrapy shell to test individual field name parsing commands
• Connect to a specified sample URL
• Use correct "double quotes" to reference the URL
• scrapy shell "http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82"
Scrapy Shell
scrapy shell "http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82"

• response.url
• XPath commands:
  – sel.xpath('//title/text()')
  – sel.xpath('//title/text()').extract()
  – sel.xpath('//table[@id="productDetailsTable"]')
  – sel.xpath('.//*[@id="productTitle"]').extract()
• Regular Expression parsing to capture "(.*)" fields
  – sel.re('<li><b>ISBN-10:</b> (.*)<\/li>')
  – sel.re('<li><b>Publisher:</b> (.*)<\/li>')
Scrapy Shell
• scrapy shell "http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82"
• Print all href for links on the page
  – sel.xpath('//a/@href').extract()
• Print href for all images
  – sel.xpath('//a[contains(@href, "image")]/@href').extract()
• Pretty print
  – import pprint
  – pp = pprint.PrettyPrinter(indent=4)
  – pp.pprint(sel.xpath('//a/@href').extract())
  – pp.pprint(sel.xpath('//a[contains(@href, "image")]/@href').extract())
  – pp.pprint(sel.xpath('//title/text()'))
Scrapy Shell
Extract the Title of the Web Page

>>> sel.xpath('//title')
[<Selector xpath='//title' data=u'<title>Microsoft Visual C# 2012 Step by '>]

>>> sel.xpath('//title').extract()
[u'<title>Microsoft Visual C# 2012 Step by Step (Step By Step (Microsoft)): John Sharp: 9780735668010: Amazon.com: Books</title>']

>>> sel.xpath('//title/text()')
[<Selector xpath='//title/text()' data=u'Microsoft Visual C# 2012 Step by Step (S'>]

>>> sel.xpath('//title/text()').extract()
[u'Microsoft Visual C# 2012 Step by Step (Step By Step (Microsoft)): John Sharp: 9780735668010: Amazon.com: Books']
Scrapy Shell
Split parsing into two commands
Why would you do this?

>>> sel.xpath('//ul/li/a/text()').extract()
[u'Your Amazon.com', u"Today's Deals", u'Gift Cards', u'Sell', u'Help', u'Books', u'Advanced Search', u'New Releases', u'Best Sellers', u'The\xa0New\xa0York Times\xae\xa0Best\xa0Sellers', u"Children's Books", u'Textbooks', u'Textbook Rentals', u'Sell\xa0Us Your\xa0Books', u'Best\xa0Books of\xa0the\xa0Month', u'Deals in\xa0Books', u'View shipping rates and policies', u'See Top 100 in Books', u'Careers', u'Investor Relations', u'Press Releases', u'Amazon and Our Planet', u'Amazon in the Community', u'Sell on Amazon', u'Become an Affiliate', u'Advertise Your Products', u'Independently Publish with Us', u'See all', u'Amazon.com Rewards Visa Card', u'Amazon.com Store Card', u'Shop with Points', u'Credit Card Marketplace', u'Amazon Currency Converter', u'Your Account', u'Shipping Rates & Policies', u'Amazon Prime', u'Returns & Replacements', u'Manage Your Kindle', u'Help', u'Australia', u'Brazil', u'Canada', u'China', u'France', u'Germany', u'India', u'Italy', u'Japan', u'Mexico', u'Spain', u'United Kingdom', u'Conditions of Use', u'Privacy Notice', u'Interest-Based Ads']

>>> sample = sel.xpath('//ul/li')
>>> sample.xpath('a/text()').extract()
[u'Your Amazon.com', u"Today's Deals", u'Gift Cards', u'Sell', u'Help', u'Books', u'Advanced Search', u'New Releases', u'Best Sellers', u'The\xa0New\xa0York Times\xae\xa0Best\xa0Sellers', u"Children's Books", u'Textbooks', u'Textbook Rentals', u'Sell\xa0Us Your\xa0Books', u'Best\xa0Books of\xa0the\xa0Month', u'Deals in\xa0Books', u'View shipping rates and policies', u'See Top 100 in Books', u'Careers', u'Investor Relations', u'Press Releases', u'Amazon and Our Planet', u'Amazon in the Community', u'Sell on Amazon', u'Become an Affiliate', u'Advertise Your Products', u'Independently Publish with Us', u'See all', u'Amazon.com Rewards Visa Card', u'Amazon.com Store Card', u'Shop with Points', u'Credit Card Marketplace', u'Amazon Currency Converter', u'Your Account', u'Shipping Rates & Policies', u'Amazon Prime', u'Returns & Replacements', u'Manage Your Kindle', u'Help', u'Australia', u'Brazil', u'Canada', u'China', u'France', u'Germany', u'India', u'Italy', u'Japan', u'Mexico', u'Spain', u'United Kingdom', u'Conditions of Use', u'Privacy Notice', u'Interest-Based Ads']
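One reason to split: select each node once, then run several *relative* queries against it, so related values (say, a link's text and its href) stay paired per item. Scrapy selectors support this directly; the idea can be sketched with the standard library (the sample HTML is mine, not from the page):

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for the page's <ul><li><a ...> navigation links.
html = ('<ul>'
        '<li><a href="/help">Help</a></li>'
        '<li><a href="/books">Books</a></li>'
        '</ul>')

root = ET.fromstring(html)
# First select every <li> node, then query relative to each node,
# keeping each link's href and text together as one tuple.
items = [(li.find('a').get('href'), li.find('a').text)
         for li in root.findall('.//li')]
print(items)  # -> [('/help', 'Help'), ('/books', 'Books')]
```

A single absolute query like `//ul/li/a/text()` flattens everything into one list, losing which href belongs to which text.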
Scrapy Shell
• Selector Tutorial
  – http://doc.scrapy.org/en/latest/topics/selectors.html
• Regular Expression How-To
  – http://docs.python.org/2/howto/regex.html
  – http://docs.python.org/2/library/re.html
• XPath
  – Easiest to determine using Firefox with the Firebug and FirePath add-ons
Firefox + Firebug + FirePath
FirePath: XPath extension to Firebug
https://addons.mozilla.org/en-US/firefox/addon/firepath/

Firebug Add-On
https://www.getfirebug.com/

FirePath displays the XPath to an element, such as: html/body/table/tr[2]/td[3]
Regular Expression Parsing

^             matches start of the string
$             matches end of the string
[5b-d]        matches any of the chars '5', 'b', 'c' or 'd'
[^a-c6]       matches any char except 'a', 'b', 'c' or '6'
\             escapes special characters
\w            Alphanumeric: [0-9a-zA-Z_], or is LOCALE dependent
\W            Non-alphanumeric
\d            Digit
\D            Non-digit
\s            Whitespace: [ \t\n\r\f\v] (space, tab, newline, …)
\S            Non-whitespace
.             matches any character
*             0 or more occurrences
+             1 or more occurrences
?             0 or 1 occurrences
{m}           exactly 'm' occurrences
R|S           matches either regex R or regex S
()            Creates a capture group, and indicates precedence
(?P<name>...) Creates a named capturing group
(?P=name)     Matches whatever the previously named group matched
(?(id)yes|no) Match 'yes' if group 'id' matched, else 'no'

Complete cheat sheet with additional details:
http://cloud.github.com/downloads/tartley/python-regex-cheatsheet/cheatsheet.pdf
Regular Expression ParsingOnline real-time Regular Expression test: https://pythex.org/
Regular Expression Parsing
Test Code:

<html>
  <head>
    <title>This is a very simple web page</title>
  </head>
  <body>
    <h1>Simple h1 header</h1>
    <table>
      <tr> <td>Row1</td> <td>Column2</td> <td>Column3</td> </tr>
      <tr> <td>Row2</td> <td>Column2</td> <td>Column3</td> </tr>
    </table>
  </body>
</html>

Regular Expressions:
<title>.*</title>
<title>(.*)</title>
\w\d
\w(\d)
[\s<>]\w{2}[\s<>]
[\s<>](\w{4})[\s<>]
[\s<>](?P<first>\w{4})[\s<>](?P<second>\w{6})[\s<>]
[\s<](?P<first>\w+)[\s>](?P<middle>.+)[\s>]</(?P=first)>
<(?P<first>\w+)>(?P<middle>.+)</(?P=first)>
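The last two patterns use named groups and a backreference. Here is how they behave on the test page's <title> line, run with plain re outside Scrapy (this sketch is mine, not part of the course code):

```python
import re

html = '<title>This is a very simple web page</title>'

# (?P<name>...) names a capture group; (?P=first) must match whatever
# the group named "first" matched, so the closing tag mirrors the opener.
pattern = r'<(?P<first>\w+)>(?P<middle>.+)</(?P=first)>'
m = re.search(pattern, html)
print(m.group('first'))   # -> title
print(m.group('middle'))  # -> This is a very simple web page
```

The backreference is what prevents `<title>...</body>` from matching: the closing tag name must equal the opening one.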
Scrapy Shell XPath and RE Example
XPath: //title/text()
RE: Extract the elements before a ":"

>>> sel.xpath('//title/text()').re('(\w+):')
[u'Sharp', u'9780735668010', u'com']

>>> sel.xpath('//title/text()').re('(.+):')
[u'Microsoft Visual C# 2012 Step by Step (Step By Step (Microsoft)): John Sharp: 9780735668010: Amazon.com']

>>> sel.xpath('//title/text()').re('(\S+):')
[u'(Microsoft))', u'Sharp', u'9780735668010', u'Amazon.com']
Amazon Crawl
Modify the amazon spider from scraping two specified web pages to crawling the series of web pages showing the 100 top-ranked database books (ranks 1-20, 21-40, 41-60, 61-80 and 81-100) and scraping the associated 100 books.
Amazon Crawl

from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from amazondb.items import AmazondbItem

class AmazondbSpider(CrawlSpider):
    name = "amazon2"
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com/Best-Sellers-Books-Databases/zgbs/books/549646/ref=zg_bs_nav_b_2_5#2']

    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//ol[@class="zg_pagination"]'))),
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="zg_title"]')), callback='parse_item')
    )

    def parse_item(self, response):
        aitem = AmazondbItem()
        webpage = Selector(response)
        aitem['Link'] = response.url
        aitem['Title'] = webpage.xpath('.//*[@id="productTitle"]/text()').extract()
        aitem['aTitle'] = webpage.xpath('.//*[@id="btAsinTitle"]/text()').extract()
        aitem['PageTitle'] = webpage.xpath('//title/text()').extract()
        webpagetable = webpage.xpath('//table[@id="productDetailsTable"]')
        aitem['Pages'] = webpagetable.re('(.*) pages')
        aitem['Publisher'] = webpagetable.re('<li><b>Publisher:</b> (.*)<\/li>')
        aitem['Language'] = webpagetable.re('<li><b>Language:</b> (.*)<\/li>')
        aitem['ASIN'] = webpagetable.re('<li><b>ASIN:</b> (.*)<\/li>')
        aitem['ISBN_10'] = webpagetable.re('<li><b>ISBN-10:</b> (.*)<\/li>')
        aitem['ISBN_13'] = webpagetable.re('<li><b>ISBN-13:</b> (.*)<\/li>')
        return aitem
Amazon Crawl

jweeks@server2:~/scrapy/amazondb/amazondb/spiders$ diff amazon1_spider.py amazon2_spider.py
< from scrapy.spider import BaseSpider
> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
> from scrapy.contrib.spiders import CrawlSpider, Rule

< class AmazondbSpider(BaseSpider):
> class AmazondbSpider(CrawlSpider):

< name = "amazon1"
> name = "amazon2"

< start_urls = [
<     "http://www.amazon.com/Cookbook-Cookbooks-OReilly-Anthony-Molinaro/dp/0596009763/ref=zg_bs_549646_81",
<     "http://www.amazon.com/Microsoft-Visual-2012-Step-By/dp/0735668019/ref=zg_bs_549646_82" ]
> start_urls = ["http://www.amazon.com/Best-Sellers-Books-Databases/zgbs/books/549646/ref=zg_bs_nav_b_2_5#2"]

> rules = (
>     Rule(SgmlLinkExtractor(restrict_xpaths=('//ol[@class="zg_pagination"]'))),
>     Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="zg_title"]')), callback='parse_item')
> )

< def parse(self, response):
> def parse_item(self, response):
Amazon Crawl Changes
• Import additional libraries
• BaseSpider -> CrawlSpider
• Change the name of the spider
• Change the start_urls
• Add "rules" to either crawl or parse
• parse -> parse_item
• Add additional fields, if desired
• Update items.py to include the additional fields