This high-level presentation covers some of the most common ways to collect data within the standard analytical process, including scraping and using APIs, and how update frequency can influence which method you choose.
But how do I GET the data? Transparency Camp 2014
Transcript
1. But how do I GET the data? Transparency Camp 2014
2. Shooju is a Web-Based Data Platform. Consolidate your
internal and external data sources. Make all data searchable from
one place. Provide continuous updating. Seamlessly integrate with
tools and applications. Share data across your entire organization.
Save time and energy while reducing errors and problems with
version control. Shooju saves time, improves data quality and
enhances data sharing across your entire organization.
3. The Analytical Process (diagram). Elements: Data.
4. The Analytical Process (diagram). Elements: Data, some place.
5. The Analytical Process (diagram). Elements: Data, some place, your tool of choice.
6. The Analytical Process (diagram). Elements: Data, some place, your tool of choice, your product.
7. The Analytical Process (diagram). Elements: Data, some place, your tool of choice, your product. Label: The Fun Part.
8. The Analytical Process (diagram). Elements: Data, some place, your tool of choice, your product. Label: The Not Fun Part.
9. Big data vs. small data
10. A boring 2 x 2
11. The harsh 80/20 reality: Most organizations spend more
time collecting, cleaning, downloading, managing and wrangling data
than they do conducting analysis.
12. Three ways to get data: API, Scraping, Manual (rated on a Good/Bad scale).
Defined as an ETL (Extract, Transform, Load) process.
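The Extract, Transform, Load framing can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation: `RAW_CSV`, the `observations` table, and all function names are hypothetical, and the raw text stands in for whatever an API call, a scraper, or a manual download would return.

```python
import csv
import io
import sqlite3

# Hypothetical raw text standing in for the result of an API call,
# a scrape, or a manual download.
RAW_CSV = """date,value
2014-01-01, 10.5
2014-02-01, 11.0
2014-03-01,
"""


def extract(raw_text):
    """Extract: parse the raw text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: trim whitespace, drop rows without a value, convert to float."""
    cleaned = []
    for row in rows:
        value = (row["value"] or "").strip()
        if not value:
            continue  # skip rows where the value is missing
        cleaned.append((row["date"].strip(), float(value)))
    return cleaned


def load(records, connection):
    """Load: write the cleaned records into a SQLite table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS observations (date TEXT, value REAL)"
    )
    connection.executemany("INSERT INTO observations VALUES (?, ?)", records)
    connection.commit()


def run_etl(raw_text, connection):
    """Run all three steps and return the number of records loaded."""
    records = transform(extract(raw_text))
    load(records, connection)
    return len(records)
```

Calling `run_etl(RAW_CSV, sqlite3.connect(":memory:"))` loads the two complete rows and skips the one with a missing value. Only the `extract` step would change between the manual, scraping, and API routes.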
13. Method comparison (chart). Axes: Technical expertise required; Time (and
annoyance). Methods plotted: Manual, Scraping, API.
14. Average cost curve of data collection (chart). Axes: Average cost;
Number of times data is collected. Curve shown: Manual Collection.
15. Average cost curve of data collection (chart). Axes: Average cost;
Number of times data is collected. Curves shown: Manual Collection, Scraping.
16. Average cost curve of data collection (chart). Axes: Average cost;
Number of times data is collected. Curves shown: Manual Collection, Scraping, API.
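The charts do not survive in this transcript, so the actual shape of the curves is unknown. One common way to reason about an average cost curve is a fixed setup cost spread over the number of collections, plus a cost per collection. The sketch below uses that model; the formula and every number in `METHODS` are illustrative assumptions, not figures from the presentation.

```python
def average_cost(setup_cost, cost_per_run, runs):
    """Average cost per collection: (setup + per-run cost * runs) / runs."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    return (setup_cost + cost_per_run * runs) / runs


# Hypothetical costs in analyst-hours; chosen only to illustrate the idea
# that automation trades a higher setup cost for a lower per-run cost.
METHODS = {
    "manual": {"setup_cost": 0.0, "cost_per_run": 1.0},
    "scraping": {"setup_cost": 8.0, "cost_per_run": 0.1},
    "api": {"setup_cost": 4.0, "cost_per_run": 0.05},
}


def cheapest_method(runs):
    """Return the method name with the lowest average cost for this many runs."""
    return min(
        METHODS,
        key=lambda name: average_cost(runs=runs, **METHODS[name]),
    )
```

Under these assumed numbers, manual collection is cheapest for a one-off pull, while an automated method wins once the data has to be collected repeatedly.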
17. How do you get your data? What do you like? What don't you
like?
18. Once the data is scraped, where can it go? CSV, XLS, DBF, SQL,
NoSQL, many others.
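As a rough illustration of three of these destinations, the sketch below writes the same records to CSV text, to a SQL table (SQLite), and to JSON lines, a format often used to feed document-oriented NoSQL stores. The records, field names, and the `scraped` table are hypothetical.

```python
import csv
import io
import json
import sqlite3

# Hypothetical scraped records.
RECORDS = [
    {"country": "US", "year": 2013, "value": 1.5},
    {"country": "CA", "year": 2013, "value": 0.7},
]
FIELDS = ["country", "year", "value"]


def to_csv_text(records):
    """Serialize the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_sqlite(records, connection):
    """Insert the records into a SQLite table using named placeholders."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS scraped "
        "(country TEXT, year INTEGER, value REAL)"
    )
    connection.executemany(
        "INSERT INTO scraped VALUES (:country, :year, :value)", records
    )
    connection.commit()


def to_json_lines(records):
    """Serialize each record as one JSON document per line."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)
```

CSV is the easiest to open in a spreadsheet, while a SQL table makes repeated collections easier to query and append to.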
19. Where does your data go when you collect it?
20. Appendix
21. Shooju Value Added.
Cost Savings: By saving analyst time and
energy, Shooju allows analysts to do more with less, reducing data
management costs and putting more focus on high-value analysis.
Added Quality: Automating data processes internally will ensure that
your data is accurate, up-to-date and consistent across your entire
organization.
Enhanced Decision Making: Having more accurate data
available faster with more analyst time left for analysis leads to
enhanced decision making.
22. Sample Cost Savings (Shooju Value Added: Cost Savings).
Channels: Shooju Sources, Auto-Import, Excel Add-In & Other Tools, Custom BI Apps, Web Search.
Drivers: # of analysts retrieving, time saved in retrieval, # of sources, frequency of retrieval;
# of analysts refreshing, time saved in tool refresh, # of sources, frequency of refresh;
time to integrate data, analysts contributing data, # of tools created, analyst upload time;
# of analysts searching, time saved in search, # of sources, frequency of search.
Values: 5 analysts, 65 min / source, 22 sources, 18 times / year;
11 analysts, 74 min / source, 22 sources, 14 times / year;
13 analysts, 9 min / source, 22 sources, 32 times / year;
14 wk of dev. saved, 8 analysts contributing, 2 apps created, 40 min, 10 times / year.
Cost savings, USD (%): $97k (14%), $73k (10%), $248k (35%), $284k (41%). Total: $702k.
Shooju speeds up custom BI application development by making all data
natively accessible and continuously updated in any BI tool or
custom app.
$410k savings equivalent to 10% of HR spend*.
* Based on real 40-person organization. Assumed annual wages vary between $30k
and $140k.
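The arithmetic behind a table like this can be sketched as follows. The dollar amounts are the four listed on the slide; the `hours_saved` formula, which simply multiplies the drivers together, is an assumption about how such figures might be derived, since the slide does not state its method or the wage rates applied.

```python
# Savings figures from the slide, in thousands of USD.
SAVINGS_K_USD = [97, 73, 248, 284]


def total_savings(amounts):
    """Sum the individual savings figures."""
    return sum(amounts)


def share_percent(amount, total):
    """Share of one figure in the total, as a percentage."""
    return 100.0 * amount / total


def hours_saved(analysts, minutes_per_source, sources, times_per_year):
    """Analyst-hours saved per year.

    Assumption: the drivers multiply, i.e. every analyst saves the stated
    minutes on every source at every retrieval.
    """
    return analysts * minutes_per_source * sources * times_per_year / 60.0
```

For example, `total_savings(SAVINGS_K_USD)` reproduces the $702k total, and `hours_saved(5, 65, 22, 18)` applies the assumed formula to the first set of drivers. Converting hours to dollars would additionally require a wage rate, which the slide gives only as a range.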
23. Added Quality: The Three Cs.
Consistency: Shooju ensures that all
analysts are using the same data across all their tools and
applications. By allowing analysts to upload their own data to the
platform, internal as well as external data now flow
seamlessly, without messy spreadsheet links.
Currency: By
automatically pulling in the latest source data through the Shooju
importer layer, Shooju ensures that all of your spreadsheets and
models are populated with the latest data. Our native plugins for
Excel, Access and all your other tools allow data to flow through
directly without any need for the analyst to download or copy and
paste.
Correctness: The more data is touched by human hands, the
more prone it is to errors. By streamlining workflows and
automating work processes, Shooju eliminates most of these errors,
saving time and ensuring that the data you rely on is more
accurate.
24. We support any data source. Ask us about non-mainstream
data sources that traditional data providers don't carry.
25. Shooju Data Process
26. Shooju vs. Custom Data Warehouse. Data warehouse projects are
costly, time consuming and result in inflexible systems with low
adoption rates.
Attribute | Custom Data Warehouse | Shooju
Design | Custom | Plug-and-play
Cost | 7+ digits | 5-6 digits
Rollout timeline | Months / Years | Hours
Scalability | Minimal | Infinite
Flexibility | Low | High
Maintenance | High | Low
Stakeholders | IT controlled | Analyst run / IT maintained
Tool and app support | Clunky, requiring IT | Native tool support
27. Shooju vs. Off-the-shelf Data Management*. Data management* solutions
focus on generic data provision rather than process improvement and
limit analysts to a closed and inflexible data ecosystem.
Attribute | Off-the-shelf Data Management* | Shooju
Service focus | Data provision/management | Process improvement
Prepackaged data feeds | Many | None
Custom data feeds | None (not natively supported) | Included (all feeds are custom)
Internal data integration | Weeks (high consulting fees) | Days (included in service)
Process flexibility | Low | High
Analyst learning curve | Weeks | Hours
Ease of migrating off | Very difficult/impossible | Easy
Annual fee | 6-7 digits | 5-6 digits
* Top-ranked providers in the EnergyRisk Data Management category
include: Morningstar, ZE Power Group, SunGard, Allegro, Pioneer
Solutions, SAS, and InteractiveData. See
http://www.slideshare.net/Allegrodev/energy-risk-magazines-etrm-software-rankings-2013