
Getting started with Scrapy in Python

Date post: 13-May-2015
Upload: virendra-rajput
Page 1: Getting started with Scrapy in Python

Web Scraping with Scrapy

Virendra Rajput

Hacker @Markitty


Agenda

● What is web scraping and why it's fun
● My experiments with web scraping
● Getting started with Scrapy
● How Scrapy works and a quick demo
● Why Scrapy
● Questions


What is Web Scraping?

● Extracting information from websites
● Problem:
    ○ Static websites
    ○ No access to APIs to extract the data you need
    ○ Need to extract data periodically
● Manual solution: go to the website and copy the required data
● Smarter solution: Web Scraping


My Experiments with Scraping


Web Scraping in Python

● Download the web page with urllib2 or requests
● Parse the page with BeautifulSoup or lxml
● Select elements with XPath or CSS selectors
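
The three steps above can be sketched in a few lines. To keep the sketch runnable without installing anything, it swaps in standard-library stand-ins: urllib.request (the Python 3 successor to urllib2) instead of requests, and xml.etree.ElementTree with its limited XPath subset instead of BeautifulSoup/lxml. The sample page, its class names, and the function names are made up for illustration, and ElementTree only accepts well-formed markup, so real-world HTML would normally go through lxml or BeautifulSoup instead.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used instead of a live download.
SAMPLE_PAGE = """<html><body>
<h1>Talks</h1>
<ul>
  <li class="talk"><a href="/talks/scrapy">Web Scraping with Scrapy</a></li>
  <li class="talk"><a href="/talks/lxml">Parsing with lxml</a></li>
  <li class="ad"><a href="/sponsor">Sponsor</a></li>
</ul>
</body></html>"""


def download(url):
    """Step 1: fetch the raw page (not called below, to avoid network access)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


def extract_talks(page):
    """Steps 2 and 3: parse the markup, then select nodes with an XPath-style query."""
    root = ET.fromstring(page)
    links = root.findall(".//li[@class='talk']/a")
    return [(a.text, a.get("href")) for a in links]


talks = extract_talks(SAMPLE_PAGE)
for title, href in talks:
    print(title, href)
```

Against a live site, the call would look like extract_talks(download(url)), provided the page is well-formed; otherwise a forgiving parser is needed.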


Scrapy - a fast, high-level screen scraping and web crawling framework

● Pick a website
● Define the data you want to scrape
● Write the spider to extract the data
● Run the spider
● Store the data


Demo


Why Scrapy

● Simple
● Fast
● Productive and extensible
● Portable
● Good documentation and a healthy community
● Commercial support


Advanced Features (built in)

● Interactive shell for trying out XPaths (useful for debugging)
● Selecting and extracting data from HTML sources
● Cleaning and sanitizing the scraped data
● Generating feed exports (JSON, CSV)
● Media pipeline for downloading files
● Middlewares for cookies, HTTP compression, caching, user-agent spoofing, etc.
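
To make the "cleaning and sanitizing" point concrete, here is a small item pipeline sketch. A Scrapy pipeline is a plain class with a process_item(item, spider) method, so the example runs without Scrapy installed. The class name and the sample item are invented for illustration, and it assumes items are plain dicts.

```python
class CleanTextPipeline:
    """Item pipeline sketch: collapse runs of whitespace in every string field."""

    def process_item(self, item, spider):
        for key, value in list(item.items()):
            if isinstance(value, str):
                # split() drops leading/trailing whitespace and splits on any run of it.
                item[key] = " ".join(value.split())
        return item


# Hypothetical scraped item; Scrapy would pass the running spider instead of None.
raw = {"title": "  Web   Scraping\n with Scrapy ", "views": 1244}
cleaned = CleanTextPipeline().process_item(raw, spider=None)
print(cleaned)
```

In a Scrapy project, a pipeline like this is switched on through the ITEM_PIPELINES setting, for example ITEM_PIPELINES = {"myproject.pipelines.CleanTextPipeline": 300}, where the module path is a placeholder.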


Questions?

