Date post: | 30-Jul-2015 |
Category: | Data & Analytics |
Upload: | alteryx |
#inspire15
Building On-Demand Business Location Datasets
Or… How I Stopped Worrying about Bad Business Location Data and Learned to Love the Download Tool
Tuesday, May 19, 2015
John Hollingsworth, GIS Manager, Clear Channel Outdoor
Business Problem
Bad Data = Unhappy Clients
• We create maps and analyses that contain locations of our clients, their competitors, and other points of interest.
• The data need to be current and accurate.
• The data are constantly changing and therefore require a real-time source.
• Existing solutions all have downsides.
Existing Solutions
Comprehensive Business Dataset (Dun & Bradstreet, DatabaseUSA)
• Expensive
• Often outdated
• Often poor spatial accuracy
• Duplicates in some cases (Walmart has pharmacy, tire store, etc.)
Aggregators (AggData, Factual)
• Not comprehensive
• On-demand requests cost money and time
• Periodically refreshed
Data from client (spreadsheet of addresses)
• Requires geocoding/data quality checks
• Requires continual requests to ensure current data
• Not available in most cases
Alteryx-based Solution
Use the Alteryx Download tool to ‘scrape’ data from a website’s location tool.
Quick Demonstration
Yikes!!!
Is this legal?
Cuz it doesn’t feel legal.
Yes. This Is Legal.
• The US Supreme Court has ruled that “an author who claims infringement must prove ‘the existence of ... intellectual production, of thought, and conception,’” and, in reference to phone number listings, that “these bits of information are uncopyrightable facts” (Feist v. Rural, 1991).
• Terms of Service agreements on websites do not protect factual information.
• A company could theoretically bring a case for damages if the download process is so intense that it disrupts service on their servers. You may need to throttle your collection to avoid this kind of disruption.
• All that said, caveat metentis, meaning: consult your in-house legal staff for additional clarification.
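The throttling advice above can be sketched outside Alteryx as a small rate limiter. This is a minimal illustration, not the method used in the talk: the `Throttle` class and its parameters are hypothetical names, and the right interval depends on the site you are collecting from.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock    # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval_s has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Call `wait()` immediately before each download so that requests are spaced at least `min_interval_s` seconds apart.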
Overview Of How To Do This
• Analyze Web Page and Location App Web Traffic
• Determine Collection Method to Use Based on Website Architecture
• Configure Download Tool
• Parse Results
• Error Correct
• Troubleshoot
Analyze Web Traffic
Analyze Web Traffic: Best Practices
• Use web traffic debugging software such as Fiddler (http://www.telerik.com/download/fiddler)
• Set output to Raw in both windows
• Turn on cookies
• Determine whether you must use an iterative tool; sometimes all of the locations are listed on one page.
• Be rigorous: often there is an obvious, hard way and also a subtle, easy way.
• Experiment using trial and error by copying data from the Inspectors window and running it in the Composer window.
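When experimenting with a captured request, it can help to split the raw text into its parts so each one can be changed and replayed. The sketch below assumes the raw request was copied as plain text and that its first line carries the absolute URL; the function name and the sample request are hypothetical.

```python
def parse_raw_request(raw):
    """Split a raw HTTP request (as text) into method, URL, headers, and body."""
    text = raw.replace("\r\n", "\n").strip()
    head, _, body = text.partition("\n\n")      # blank line separates headers from body
    lines = head.split("\n")
    method, url, _version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")    # split on the first colon only
        headers[name.strip()] = value.strip()
    return {"method": method, "url": url, "headers": headers, "body": body}


# Hypothetical captured request, for illustration only.
RAW = """POST http://www.example.com/locator/search HTTP/1.1
Host: www.example.com
Content-Type: application/x-www-form-urlencoded
Cookie: session=abc123

zip=78701&radius=100"""

parsed = parse_raw_request(RAW)
```

From here you can drop one header at a time, or change one body parameter, and see which parts the server actually requires.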
Collection Methods: Single Request
• A single request returns all addresses and latitude/longitude data
• The data may arrive as JSON, XML, or within the main web page
• Hint: Look for a single Google Map with all points
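When a page draws every location on one map, the points are often embedded in the page's script. The sketch below is one way to pull them out, assuming the page assigns a JSON-compatible array to a JavaScript variable; the variable name `locations` and the field names are hypothetical and will differ from site to site.

```python
import json
import re


def extract_locations(html, var_name="locations"):
    """Return the list assigned to `var <var_name> = [...];` in a page's script."""
    pattern = r"var\s+" + re.escape(var_name) + r"\s*=\s*(\[.*?\])\s*;"
    match = re.search(pattern, html, re.DOTALL)
    if match is None:
        return []
    # Assumes the array literal is valid JSON (double-quoted keys and strings).
    return json.loads(match.group(1))


# Hypothetical page content, for illustration only.
PAGE = """<html><script>
var locations = [
  {"name": "Store 1", "address": "100 Main St", "lat": 30.27, "lng": -97.74},
  {"name": "Store 2", "address": "200 Oak Ave", "lat": 29.76, "lng": -95.37}
];
initMap(locations);
</script></html>"""

stores = extract_locations(PAGE)
```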
Collection Methods: Multi-Step
• List of store URLs on main page -> pull each page
• List of states -> list of stores -> pull file or each page
• List of states -> list of cities -> list of stores -> pull file or each page
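The states -> stores pattern can be sketched as two rounds of link extraction. This is an illustration under stated assumptions: `fetch` is a caller-supplied function that returns a page's HTML, and the URL prefixes are hypothetical. A real site may need a third level (cities) or different link filters.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def links_under(html, base_url, prefix):
    """Return absolute links from `html` that start with `prefix`."""
    parser = LinkCollector()
    parser.feed(html)
    absolute = [urljoin(base_url, href) for href in parser.links]
    return [url for url in absolute if url.startswith(prefix)]


def crawl_store_urls(fetch, start_url, state_prefix, store_prefix):
    """Main page -> state pages -> store URLs. `fetch(url)` returns HTML text."""
    store_urls = []
    for state_url in links_under(fetch(start_url), start_url, state_prefix):
        store_urls.extend(links_under(fetch(state_url), state_url, store_prefix))
    return store_urls
```

Passing `fetch` in as a parameter keeps the crawl logic separate from the download step, so throttling or retries can be added in one place.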
Collection Methods: Sequential IDs
• e.g. http://www.store.com/3829
• Iterate through a set range of integers for store IDs
• Can be tricky because there are sometimes huge gaps in the IDs
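One way to cope with gaps in sequential IDs is to stop only after a long run of consecutive misses rather than at the first missing ID. The sketch below assumes a caller-supplied `fetch(store_id)` that returns a record or `None`; the function name and the default threshold are hypothetical choices.

```python
def collect_sequential(fetch, start_id, max_id, max_consecutive_misses=500):
    """Try store IDs in order; stop early after a long run of missing IDs.

    `fetch(store_id)` returns a record, or None when the ID does not exist.
    """
    found = {}
    misses = 0
    for store_id in range(start_id, max_id + 1):
        record = fetch(store_id)
        if record is None:
            misses += 1
            if misses >= max_consecutive_misses:
                break          # assume we have run past the last real store
        else:
            misses = 0         # any hit resets the gap counter
            found[store_id] = record
    return found
```

Setting the threshold too low silently drops stores that sit beyond a large gap, so it is worth checking the final count against an independent source.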
Collection Methods: Spatial
• Use zip codes for search criteria instead of city/state
• Grid centroids based on search radius
• Grid MBR (minimum bounding rectangle) values based on search radius
• Tip: Experiment with enlarging the search radius. If there is no limit, you can get all locations in one request.
Common Spatial Searches
• Grid centroids as Lat/Long input values with 100 mile radius
• Zip codes nearest to grid centroids as input values with 100 mile radius
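A grid of search centers for a given radius can be approximated as follows. This is a rough sketch, not the workflow from the talk: it assumes about 69 miles per degree of latitude, treats each grid row as flat, and spaces centers at radius × √2 so that each square cell fits inside its search circle. The function name and bounding-box parameters are hypothetical.

```python
import math

MILES_PER_DEG_LAT = 69.0  # approximation; adequate for laying out a search grid


def grid_centroids(min_lat, max_lat, min_lon, max_lon, radius_miles):
    """Return (lat, lon) search centers whose circles roughly cover the box."""
    spacing = radius_miles * math.sqrt(2)   # side of the square inscribed in the circle
    lat_step = spacing / MILES_PER_DEG_LAT
    points = []
    lat = min_lat + lat_step / 2
    while lat - lat_step / 2 < max_lat:
        # A degree of longitude shrinks with latitude, so recompute the step per row.
        lon_step = spacing / (MILES_PER_DEG_LAT * math.cos(math.radians(lat)))
        lon = min_lon + lon_step / 2
        while lon - lon_step / 2 < max_lon:
            points.append((round(lat, 4), round(lon, 4)))
            lon += lon_step
        lat += lat_step
    return points
```

Because neighboring circles overlap, the same store will come back from more than one search, which is why deduplication is needed afterward.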
Configure Download Tool
• Determine GET or POST method
• Watch out for the Encode URL Text option
• Copy Headers
• Experiment using Fiddler Composer to see which Headers are necessary
• Try without the Cookie header, as cookies can expire and break your workflow.
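The same configuration choices (GET versus POST, which headers to send, whether to include the cookie) can be illustrated with Python's standard library. This sketch only builds the request object and does not send it; `build_request` and the example URL are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(method, url, params, headers, drop=("Cookie",)):
    """Build (but do not send) a GET or POST request, omitting headers in `drop`."""
    dropped = {name.lower() for name in drop}
    kept = {k: v for k, v in headers.items() if k.lower() not in dropped}
    query = urlencode(params)   # encode parameters exactly once
    if method.upper() == "GET":
        # GET: parameters travel in the URL's query string.
        full_url = url + ("&" if "?" in url else "?") + query if query else url
        return Request(full_url, headers=kept, method="GET")
    # POST: parameters travel in the request body as form data.
    return Request(url, data=query.encode("ascii"), headers=kept, method="POST")
```

Encoding the parameters in exactly one place avoids the double-encoding problem that can occur when an already-encoded URL is encoded again.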
Parse Results
Parse Results: Best Practices
• Sample: if you are iterating, run just a few iterations to test your parse logic.
• Look for meta property tags if on a store’s page
• Add a RecordID if iterating, as the JSON numbering will restart with each response
• Use the JSON/XML parsing tools in Alteryx
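The RecordID point can be illustrated in a few lines: when several JSON responses are combined, each response's items start counting from zero again, so a running ID is needed to tell rows apart. The sketch below assumes each response is an object with a list under a key such as `stores`; that key name and the function name are hypothetical.

```python
import json


def flatten_responses(responses, list_key="stores"):
    """Combine per-request JSON payloads into rows with one running RecordID."""
    rows = []
    for request_no, text in enumerate(responses, start=1):
        for item in json.loads(text).get(list_key, []):
            # RecordID keeps counting across responses; RequestNo records the source.
            row = {"RecordID": len(rows) + 1, "RequestNo": request_no}
            row.update(item)
            rows.append(row)
    return rows
```

To follow the sampling advice, pass only the first few responses (for example `responses[:3]`) while testing the parse logic.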
• Use Multi-Row Formula tool to parse HTML
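Row-by-row HTML parsing, in the spirit of a multi-row formula, finds a cue on one line and reads the value from a neighboring line. The Python sketch below is only an analogy for that idea, not Alteryx syntax; the cue text, the offset, and the sample HTML are hypothetical.

```python
import re

TAG = re.compile(r"<[^>]+>")   # matches any HTML tag so it can be stripped


def values_after_cue(html, cue, offset=1):
    """For each line containing `cue`, return the text of the line `offset` rows later."""
    lines = html.splitlines()
    values = []
    for i, line in enumerate(lines):
        if cue in line and i + offset < len(lines):
            values.append(TAG.sub("", lines[i + offset]).strip())
    return values


# Hypothetical store page fragment, for illustration only.
HTML = """<div class="store">
<span class="label">Address</span>
<span>100 Main St</span>
<span class="label">Phone</span>
<span>512-555-0100</span>
</div>"""

addresses = values_after_cue(HTML, ">Address<")
```

A fixed offset breaks when a page inserts an extra line between the cue and the value, so spot-check the output across several pages.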
Error Correct
• Deduplicate when the radius collection method is used (use the Unique tool)
• Bad geocodes: you are at the mercy of the geocoder that created the data
• Verify counts using Wikipedia or the company's annual report
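Deduplication after overlapping radius searches amounts to keeping one row per unique key. The sketch below assumes rows are dictionaries and that address fields identify a store; the field names are hypothetical, and a store ID is a better key when the site provides one.

```python
def dedupe_stores(rows, key_fields=("address", "city", "state", "zip")):
    """Keep the first row for each normalized key; later duplicates are dropped."""
    seen = set()
    unique = []
    for row in rows:
        # Normalize case and whitespace so trivial differences do not hide duplicates.
        key = tuple(" ".join(str(row.get(f, "")).lower().split()) for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

The length of the deduplicated list is the number to compare against the store count published by the company.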
Troubleshoot
• Lat/Lon values in the Google geocode string that are not real: sll=latitude,longitude is where the search originated, not the actual point
• IP timeouts: may need to throttle to solve
• Parse cues not in all pages, or extra lines, cause skips (e.g. address data includes shopping center name, etc.)
• Multiple pages in search results
• Some sites include closed stores
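To avoid mistaking the search origin for the store's location, parse the map URL's query string and refuse to read coordinates from `sll`. The sketch below assumes, purely for illustration, that the real point is carried in a parameter named `ll`; confirm which parameter holds it for each site. The function name and example URL are hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def coords_from_maps_url(url, point_param="ll"):
    """Return (lat, lon) from `point_param`, never from `sll` (the search origin)."""
    if point_param == "sll":
        raise ValueError("sll is the search origin, not the store location")
    query = parse_qs(urlparse(url).query)
    values = query.get(point_param)
    if not values:
        return None            # the expected parameter is absent from this URL
    lat, lon = (float(part) for part in values[0].split(","))
    return lat, lon
```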
Q & A
Free Stuff!!
Go to http://tinyurl.com/WebScrapingTools to download a zip file containing useful macros and a sample workflow.
THANK YOU!