ENERGY & ENVIRONMENT • NATIONAL SECURITY • HEALTH • CYBERSECURITY
© SAIC. All rights reserved.
Big Data and Analytics Capabilities and Challenges
Nancy Grady, SAIC Technical Fellow, Data Science
September 18-19, 2012
SAIC.com © SAIC. All rights reserved.
How to Approach Big Data Opportunities
• Computing Context
• Data Life Cycle
• Big Data Engineering
• Data Science
The Fifth Wave of Computing … meets …
• Servers and dumb terminals (1950s-1960s)
• Personal computers (1970s)
• Internet (1990s)
• Cloud (2000s)
  – Infrastructure, applications, data, and analytics
• Mobile devices (2010s)
  – Carry them everywhere
  – Connectivity
  – Sensors
  – Location
  – Books
  – Oh, …and phone service
The Fourth Paradigm
1. Experiments
2. Theory
3. Simulation
  – Big Iron
4. Data Analytics
  – Big Data
  – Sensors
In The Fourth Paradigm: Data-Intensive Scientific Discovery, pioneering computer scientist Jim Gray refers to discovery based on data-intensive science:
• Business Intelligence
  – tell me what’s happening
• Data Analytics
  – tell me what’s working
• Data Mining
  – tell me what’s unusual
• Modeling and Prediction
  – tell me what’s about to happen
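The four questions above can be made concrete with a toy example. This is a minimal sketch in Python, assuming a plain list of numbers as the data; the function names and the specific methods (a mean summary, a comparison of averages, a z-score cutoff, a straight-line extrapolation) are illustrative stand-ins chosen for brevity, not methods named in this presentation.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def whats_happening(series: Sequence[float]) -> dict:
    """Business intelligence: summarize the current state."""
    return {"latest": series[-1], "mean": mean(series)}


def whats_working(option_a: Sequence[float], option_b: Sequence[float]) -> str:
    """Data analytics: compare two options by their average outcome."""
    return "a" if mean(option_a) > mean(option_b) else "b"


def whats_unusual(series: Sequence[float], z_cutoff: float = 2.0) -> List[float]:
    """Data mining (simplified): flag values far from the mean.

    The z-score cutoff is an assumed rule used only for illustration.
    """
    center, spread = mean(series), pstdev(series)
    if spread == 0:
        return []
    return [x for x in series if abs(x - center) / spread > z_cutoff]


def whats_next(series: Sequence[float]) -> float:
    """Modeling and prediction: extend a least-squares line one step ahead.

    Requires at least two points.
    """
    n = len(series)
    x_mean = (n - 1) / 2
    y_mean = mean(series)
    slope_num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(series))
    slope_den = sum((x - x_mean) ** 2 for x in range(n))
    slope = slope_num / slope_den
    return y_mean + slope * (n - x_mean)
```

Each function answers one question with the simplest method that fits; in practice each step would use richer techniques and far larger data.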
Data Scaling
We’re feeling the disruption of powers-of-ten scaling
• Compute power
  – growing according to Moore’s Law
  – Data volumes are growing faster
• Storage capacity
  – growing according to Moore’s Law
  – Seek times are improving more slowly
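The gap between compute power and data volume can be illustrated with a small calculation. This is a minimal sketch in Python, not a measurement: the doubling periods (24 months for compute, 12 months for data volume) and the names `growth_factor` and `gap_after` are assumptions chosen only to show how two exponential curves pull apart.

```python
# Assumed doubling periods, for illustration only (not figures from the slides).
COMPUTE_DOUBLING_MONTHS = 24.0  # hypothetical Moore's-Law-like pace
DATA_DOUBLING_MONTHS = 12.0     # hypothetical faster pace for data volume


def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplicative growth after `months`, given a doubling period."""
    return 2.0 ** (months / doubling_months)


def gap_after(years: float) -> float:
    """Ratio of data growth to compute growth after `years`."""
    months = 12.0 * years
    data = growth_factor(months, DATA_DOUBLING_MONTHS)
    compute = growth_factor(months, COMPUTE_DOUBLING_MONTHS)
    return data / compute


if __name__ == "__main__":
    for years in (2, 4, 10):
        print(f"{years:>2} years: data has grown {gap_after(years):.0f}x more than compute")
```

Under these assumed rates the ratio itself doubles every two years, which is why a modest difference in growth rate becomes a powers-of-ten problem within a decade or two.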
[Figure: Powers of Ten]
Concepts
• Data Science
  – Data used as evidence through hypothesis and experiment
  – Opportunistic experiment versus designed experiment
• Data Engineering
  – Design and construction of software systems
• Data Product
  – Actionable result driven by data
• Big Data
  – Volume, Velocity, Variety
  – Complexity, Latency, Cleanliness, Completeness, Provenance
Data Life Cycle
Data Analysis Life Cycle (Data View)
[Diagram labels: Need, Collect, Curate, Analyze, Act & Monitor, Evaluate; Data, Information, Knowledge, Benefit; Goal]
Data Analysis Life Cycle (Practitioner View)
[Diagram: SAIC Data Science Life Cycle. Stages: Mission, Collect, Curate, Analyze, Visualize, Monitor, Act]
Data Analysis Life Cycle (Practitioner View)
[Diagram: SAIC Data Science Life Cycle. Stages: Mission, Collect, Curate, Analyze, Visualize, Monitor, Act. Additional labels: Hypothesis, Store; numbered steps 1 and 2]
Mission Drives Everything
[Figure: analytic approaches charted on axes of Value and Complexity]
Labels: Hindsight, Insight, Foresight
Labels: Description, Confirmation, Discovery, Prediction, Prescription, Optimization
Questions:
• What has just happened?
• What is happening now?
• Why is this happening?
• What am I missing?
• What will happen next?
• What will happen if I take this action?
• What’s the best that can happen?
• How do I make it happen?
Techniques: Reporting, Alerting, Correlation, Mining, Forecasting, Modeling, Maximization, Recommendation
Big Data Engineering
Big Data Characteristics
Engineering
• Volume
• Velocity
• Variety
Science
• Complexity
• Latency
• Cleanliness
• Completeness
• Provenance
Traditional Data Life Cycle
[Diagram: phases Capture, Curate, Analyze, Deploy. Labels: Staging, ETL, Cleanse, Transform, Warehouse, Summarized Data, Analytic Mart, Algorithm, Action, Domain]
Big Volume Engineering
[Diagram: phases Capture, Curate, Analyze, Deploy. Labels: Raw Data Cluster, Shard, Map/Reduce, Cleanse, Transform, Analyze, Summarized Data, Mart, Data Product, Domain; characteristics shown: Volume, Complexity]
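The shard and Map/Reduce steps named on this slide can be sketched in a few lines. This is a minimal single-process illustration in Python, assuming records of the hypothetical form (key, value) and an assumed cleansing rule that drops negative values; a real deployment would run the map step on separate cluster nodes using a distributed framework rather than these helper functions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, float]          # hypothetical record shape: (key, value)
Partial = Dict[str, Tuple[float, int]]  # per-key (sum, count)


def shard(records: List[Record], n_shards: int) -> List[List[Record]]:
    """Split raw records into n_shards partitions (round-robin)."""
    shards: List[List[Record]] = [[] for _ in range(n_shards)]
    for i, record in enumerate(records):
        shards[i % n_shards].append(record)
    return shards


def map_shard(records: Iterable[Record]) -> Partial:
    """Map: cleanse one shard and compute per-key partial sums and counts."""
    partial: Partial = defaultdict(lambda: (0.0, 0))
    for key, value in records:
        if value < 0:  # assumed cleansing rule: negative values are invalid
            continue
        total, count = partial[key]
        partial[key] = (total + value, count + 1)
    return dict(partial)


def reduce_partials(partials: Iterable[Partial]) -> Dict[str, float]:
    """Reduce: merge the partial results into a per-key mean (summarized data)."""
    merged: Partial = defaultdict(lambda: (0.0, 0))
    for partial in partials:
        for key, (total, count) in partial.items():
            merged_total, merged_count = merged[key]
            merged[key] = (merged_total + total, merged_count + count)
    return {key: total / count for key, (total, count) in merged.items()}


def summarize(records: List[Record], n_shards: int = 3) -> Dict[str, float]:
    """Shard the raw data, map each shard, then reduce to one summary."""
    return reduce_partials(map_shard(s) for s in shard(records, n_shards))
```

Because each shard produces only sums and counts, the reduce step gives the same answer however the records are partitioned, which is what lets the map work spread across a cluster.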
Big Data Analytics
[Diagram: Scale2Insight Big Data Analytics on public cloud or private cloud infrastructure]
• NoSQL Store, Model Building …ƒ(n)…, Models, Model Index, model cache, Scalable Model Analytics, A ⇒ B
• Mining, Spatial, Graph, Pattern, Learning, Classification, Prediction, Anomaly Detection
• Software API, Search Tools, Lucene, Batch, Stream
• Patterns, Relationships, Identity, Tactical Views, Alerts, Anomalies
Big Velocity Engineering
[Diagram: phases Capture, Curate, Analyze, Deploy. Labels: Cleanse, Transform, Shard, Enriched Data Cluster, Alerting, Domain; characteristics shown: Velocity, Volume, Complexity]
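With high-velocity data, cleansing and alerting happen as records arrive rather than after they are stored. The following is a minimal sketch of that idea in Python, assuming a stream of text records that should each hold one number; the sliding-window mean, the window size, and the threshold are hypothetical choices for illustration, not rules taken from this slide.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def cleanse(raw: Iterable[str]) -> Iterator[float]:
    """Drop records that do not parse as numbers; pass the rest on at once."""
    for line in raw:
        try:
            yield float(line)
        except ValueError:
            continue


def alerts(values: Iterable[float], window: int = 3,
           threshold: float = 10.0) -> Iterator[str]:
    """Yield an alert whenever the mean of the last `window` values exceeds `threshold`."""
    recent: Deque[float] = deque(maxlen=window)
    for value in values:
        recent.append(value)
        if len(recent) == window:
            window_mean = sum(recent) / window
            if window_mean > threshold:
                yield f"ALERT: window mean {window_mean:.1f} exceeds {threshold:.1f}"


if __name__ == "__main__":
    stream = ["1", "bad", "2", "30", "40"]  # made-up sample input
    for message in alerts(cleanse(stream)):
        print(message)
```

Both steps are generators, so each record flows through cleansing and alerting one at a time and only the last few values are kept in memory.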
Big Data Ingestion – Real Time Information Gateway
[Diagram: Real Time Information Gateway on public or private cloud infrastructure]
• Data Sources: Source 1, Source 2, Source 3, Source X
• Enrichment Sources
• Fusion and Enrichment + Real Time Analysis
• Data Storage
• Alerting, Query, Analysis
• Alerting Tools, Query Tools, Analysis Tools
Big Data Adds to Existing Infrastructure
[Diagram: big data components alongside the current environment]
• Flow labels: From Data, To Exploration, To Information, To Knowledge, To Insight
• Big data components: RTIG Big Data Ingestion, NoSQL Big Data Storage, S2i Big Data Analytics
• Ingestion, Fusion, Enrichment, Alerting
• Query, Modeling, Characterization, Prediction
• Enrichment Sources, Custom Enrichment, Enriched Data Storage
• Custom Analytics, Models, Analyst Tools, Browsing Modules, Query Analysis
• Current Environment, Data, External Data
SAIC Big Data Platform
[Diagram: SAIC Big Data Platform on public cloud or private cloud infrastructure; scalable, multi-tenant; component repository]
Big Data Ingestion (RTIG)
• Sources: xml, csv, others, custom
• Parse (name/val pairs)
• Translate (to data model)
• Process (add enrichment)
• Index, Alert Engine, Alert Consumers, Search Tools (e.g., Solr), Lucene, Custom
Big Data Analytics (Scale2Insight)
• NoSQL Store, Model Building …ƒ(n)…, Models, Model Index, model cache, Scalable Model Analytics, A ⇒ B
• Mining, Spatial, Graph, Pattern, Learning, Classification, Prediction, Anomaly Detection, Others
• Software API, Search Tools, Lucene, Batch, Stream
• Patterns, Relationships, Identity, Tactical Views, Alerts, Anomalies
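The ingestion steps named on this slide (parse to name/value pairs, translate to a data model, process to add enrichment) can be sketched as three small functions. This is a minimal illustration in Python for a CSV source only; the `Event` data model, the column names `id` and `reading`, and the region lookup table are all hypothetical, and nothing here represents the actual RTIG implementation.

```python
import csv
import io
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional


@dataclass
class Event:
    """Hypothetical data model for one ingested record."""
    source_id: str
    reading: float
    region: Optional[str] = None  # filled in by the enrichment step


def parse_csv(text: str) -> Iterator[Dict[str, str]]:
    """Parse: turn each CSV row into name/value pairs."""
    yield from csv.DictReader(io.StringIO(text))


def translate(pairs: Dict[str, str]) -> Event:
    """Translate: map name/value pairs onto the data model."""
    return Event(source_id=pairs["id"], reading=float(pairs["reading"]))


def enrich(event: Event, regions: Dict[str, str]) -> Event:
    """Process: add enrichment from a lookup table standing in for an enrichment source."""
    event.region = regions.get(event.source_id, "unknown")
    return event


def ingest(text: str, regions: Dict[str, str]) -> List[Event]:
    """Run parse, translate, and enrich over every row of the input."""
    return [enrich(translate(pairs), regions) for pairs in parse_csv(text)]


if __name__ == "__main__":
    sample = "id,reading\ns1,2.5\ns2,7\n"  # made-up sample input
    for event in ingest(sample, {"s1": "north"}):
        print(event)
```

Keeping parse separate from translate means a new source format (xml, custom) needs only a new parser that emits the same name/value pairs; the later steps stay unchanged.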
Data Science
Data Analysis Life Cycle (Practitioner View)
[Diagram: SAIC Data Science Life Cycle. Stages: Mission, Collect, Curate, Analyze, Visualize, Monitor, Act. Additional labels: Hypothesis, Store; numbered steps 1 and 2]
Big Data Analysis Life Cycle
[Diagram: SAIC Data Science Life Cycle. Stages: Mission, Collect, Curate, Analyze, Visualize, Monitor, Act. Additional labels: Hypothesize, Store, Explore; numbered steps 1, 2, and 3]
Data Science
[Diagram: Data Science. Labels: Statistics, Data Mining, Domain Expertise, Programming Skills, Research, Analytic Systems, Algorithms]
Data Science Emphasis
[Diagram: three stacks, each listed in the order shown]
• Traditional emphasis: Algorithms, System, Data
• Data engineering: System, Data, Algorithms
• Data science: Data, Algorithms, System
Agile Analytics
[Diagram labels: Data, Summarize, Explore, Display, Hypothesize, Features]
Summary
• The “Science” is what needs to be done
• The “Engineering” is how to do it
• Big Data is often re-purposed from data collected for other reasons
• Engineering enables Big Data Analysis to move into a rapid Science hypothesis-testing cycle for greater value
Recommendations
• Focus on building your team to cover all the Data Science skills
• Determine your dominant engineering characteristics to design your approach
  – Volume, Velocity, Variety
• Be aware of the other characteristics of the data
• Plan to spend 80% of the time on curation
• Work to add a Big Data capability to your existing infrastructure
Contact Information
Nancy Grady, Ph.D.
SAIC Technical Fellow, Data Science
Homeland and Civilian Solutions
[email protected] 865-604-6733