Post on 17-Aug-2015
“Getting information off the Internet is like taking a drink from a fire hydrant.”
– Mitchell Kapor, a variation of former MIT President Jerome Wiesner’s quote
Tons of information on the internet
๏ News / Rappler, ABS-CBN News, GMA News Online
๏ Social Media / Facebook, Twitter
๏ Transportation / MMDA, Waze, DOTC website
๏ Weather / Project NOAH, PAGASA
๏ E-commerce / Lazada, Zalora, eBay, OLX
๏ Government Data / PhilGEPS
Tons of information on the internet
๏ News / What’s trending? What’s happening?
๏ Social Media / What is people’s sentiment on subject X?
๏ Transportation / What will the traffic be like later?
๏ Weather / What’s the effect of weather on traffic?
๏ E-commerce / Who’s selling item X at the cheapest price?
๏ Government Data / Where are our taxes going?
Web scraping
๏ A computer software technique for extracting information from websites.
๏ Focuses on transforming unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet.
https://en.wikipedia.org/wiki/Web_scraping
Conventional Way
๏ Fetch the webpage using urllib, httplib or requests.
๏ Use beautifulsoup4, lxml or regular expressions to extract information.
๏ Analyze/store the information!
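The fetch-then-extract steps above can be sketched with nothing but the standard library. This is a minimal illustration, not the presenters' code: it assumes the information we want is the text of every `<h2>` element, and the names `extract_headlines` and `scrape` are made up for this example. A regular expression is enough for a toy page; a real parser such as beautifulsoup4 or lxml is usually more robust.

```python
import re
from urllib.request import urlopen

# Assumption for this sketch: headlines live in <h2> elements.
HEADLINE_RE = re.compile(r"<h2[^>]*>(.*?)</h2>", re.IGNORECASE | re.DOTALL)


def extract_headlines(html):
    """Return the text of every <h2>, without inner tags or extra whitespace."""
    headlines = []
    for raw in HEADLINE_RE.findall(html):
        text = re.sub(r"<[^>]+>", "", raw)      # drop nested tags such as <b>
        headlines.append(" ".join(text.split()))  # collapse whitespace
    return headlines


def scrape(url):
    """Fetch one page (blocking) and return its headlines."""
    with urlopen(url, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset, errors="replace")
    return extract_headlines(html)
```

Calling `scrape(url)` in a loop over many URLs handles them strictly one after another.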
Conventional Way
๏ Blocking! We have to wait for each request to finish before we move on to the next.
๏ If it encounters an error somewhere, we’re doomed. Everything will just halt!
Conventional Way
๏ We can use threading, gevent or other libraries to make it asynchronous.
๏ We can wrap parts of the code in try-except blocks to catch possible exceptions.
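Both workarounds can be combined in a few lines. The sketch below uses `concurrent.futures` (a thread pool from the standard library) as one possible choice among the "other libraries", and a try-except around each result so that one failed request does not halt the rest. The function names and the `fetcher` parameter are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch(url):
    """Download one page and return its raw bytes."""
    with urlopen(url, timeout=10) as resp:
        return resp.read()


def fetch_all(urls, fetcher=fetch, max_workers=8):
    """Fetch many URLs concurrently; collect failures instead of stopping."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetcher, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one bad page should not doom the run
                errors[url] = exc
    return results, errors
```

This works, but retries, delays, cookies and politeness rules still have to be written by hand.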
Why Scrapy?
๏ Processes requests and responses asynchronously.
๏ Customizable! You can override almost everything.
๏ Handles cookies, delays, timeouts, etc. so you won’t have to. No need to reinvent the wheel!
๏ Includes Selectors, a parsing library that can parse HTML and XML using XPath or CSS; or you can just use Beautiful Soup!
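As a rough idea of what "handles cookies, delays, timeouts" looks like in practice, these behaviours are switched on through a project's `settings.py`. The setting names below are real Scrapy settings; the values are arbitrary examples, not recommendations from the talk.

```python
# settings.py (fragment) -- values are illustrative assumptions
DOWNLOAD_DELAY = 1.0       # seconds to wait between requests to the same site
DOWNLOAD_TIMEOUT = 30      # seconds before a request is abandoned
COOKIES_ENABLED = True     # let Scrapy keep track of cookies per session
RETRY_TIMES = 2            # extra attempts for a failed request
CONCURRENT_REQUESTS = 16   # upper bound on requests in flight at once
```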
History of Scrapy
๏ An open source framework to scrape websites
๏ Scrapy was started by Pablo Hoffman and Shane Evans (2007)
๏ Originally a tool used by Shane’s company
๏ They saw the potential, and open sourced it.
Getting Started with scrapy
๏ As easy as pip install Scrapy.
๏ Start a project with scrapy startproject project_name.
๏ Creating your first spider!
At Scrapinghub
๏ Company that provides scraping-related services to clients around the globe.
๏ Distributed team of 105 people around the world.
๏ Active in contributing to open-source!
๏ Project owner of Scrapy!
Academe/Research
๏ A U.S. Department of Energy national laboratory operated by a university in California.
๏ Analyzes the relationship between product price, energy efficiency and other product features of typical home appliances.
๏ Partnered with Scrapinghub for academic research!
Market Analytics
๏ A UK company that provides price, promotion and online product positioning analytics.
๏ Helps consumers find the best prices!
๏ Helps online retailers compare their prices with other retailers.
๏ Helps brands check if retailers are providing accurate product information.
๏ Partnered with Scrapinghub for their scraping needs!
Government Research
๏ Scrapinghub is participating in DARPA’s Memex.
๏ Crawls the deep web.
๏ Aids in systematically tracking down criminal activity.
MRT Passenger Traffic
๏ Crawls the MRT3 website using Scrapy.
๏ Downloads the CCTV images for each station.
๏ Approximates the relative passenger traffic at a given moment using computer vision!
MiniBalita.com
๏ A news reader for Philippine news.
๏ Crawls Philippine news websites such as Rappler, ABS-CBN News, Inquirer, Spin.ph, etc.
๏ Integrated with TextTeaser to produce “mini” balita.
2013 General Elections
๏ Crawled the 2013 General Elections results to find trends.
๏ 70 clustered precincts registered 100% turnout, most of them in ARMM.
๏ One clustered precinct voted for only one senator. No one voted for anyone else, even though a voter may choose up to 12 candidates!
Is scraping legal?
๏ The legality of scraping is a gray area.
๏ Scraping public data is somewhat legal.
๏ Illegality may arise from how the data is used.
๏ Some websites explicitly prohibit scraping.
๏ Always obey robots.txt.
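Checking robots.txt before fetching a page can be sketched with the standard library's `urllib.robotparser`. The rules, the `my-bot` user agent and the `example.com` URLs below are made up for illustration; in a real script the rules would be downloaded from the site's `/robots.txt`. Scrapy projects can get the same behaviour by setting `ROBOTSTXT_OBEY = True`.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical rules: everything under /private/ is off limits to all bots.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

rules = build_parser(SAMPLE_RULES)
allowed_news = rules.can_fetch("my-bot", "http://example.com/news")
allowed_private = rules.can_fetch("my-bot", "http://example.com/private/data")
```

A scraper would skip any URL for which `can_fetch` returns `False`.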
End. Any Questions?
Jolo Balbin
Twitter: @mojojolo
http://www.summarizerman.com
Mikko Gozalo
Twitter: @mikkogozalo
http://www.mikkogozalo.com