Challenges in Analytics for BIG Data
Dr. Prasant Misra W: https://sites.google.com/site/prasantmisra
Disclaimer:
The opinions expressed in this presentation and on the following slides are solely those of the presenter and not necessarily those of the organization that he works for.
A simple narrative to BIG DATA
8/26/2016 2
DATA whose characteristics exceed the capabilities of conventional algorithms, systems and techniques to derive useful value is considered BIG
datascience.berkeley.edu
The term is very fuzzy and means different things to different groups of people ...
How did we arrive at this stage ?
[Figure: timeline of Size versus Year, spanning 1960 - 70, 1980 - 90, and 2000 - 10 and beyond]
History of Computing
Accessibility to cyber end points has increased drastically ...
Device Proliferation
http://www.onethatmatters.com/wp-content/uploads/2015/12/Internet-of-Things-why.png
DATA Proliferation
Web & Social Media
Enterprises
Gov.
Example of a Killer APP: Navigation
https://www.microsoft.com/en-us/research/event/tutorial-mobile-location-sensing
Local Search - I
Context Service Example
Current Location
Local business
Local Search - II
Context Service Example
Current Location
Local business and directions
+ Time Tracks
Businesses in driving direction
Local Search - III
Context Service Example
Current Location
Local business and directions
+ Time Tracks
Businesses in driving direction
+ History
Personalized directions
Take 520 East
Local Search - IV
Context Service Example
Current Location
Local business and directions
+ Time Tracks
Businesses in driving direction
+ History
Personalized directions
+ Community
Tourist recommendation
35% people pick the scenic route
Local Search - V
Alert: Bad Traffic
Consider Alternate route
Context Service Example
Current Location
Local business and directions
+ Time Tracks
Businesses in driving direction
+ History
Personalized directions
+ Community
Tourist recommendation
+ Push
alerts, triggers, reminders
BIG Data for Location Analytics …
Analytics: Span across Verticals & Horizontals
Depending on the type and quality of analytics, systems could manifest themselves as:
User-centric Systems — Systems That Know/Aware
Adaptive Systems — Systems That Learn
Cognitive Systems — Systems That Reason
ENERGY
WATER
RETAIL
TELCOM
HEALTH
Time, Location Management
Sensor, Device Management
Network Management
Cloud Infra Management
Customer Management
The flavor of Data that is BIG
The 4 Dimensions of BIG Data
Analytics
Analytics : Category
[Figure: Value versus Skill, progressing from Information to Optimization]
Hindsight and Insight / Insights into the PAST
Descriptive: “WHAT has happened ?”
Diagnostic: “WHY did this happen ?”
Foresight / Insights into the FUTURE
Predictive: “WHAT could happen ?”
Prescriptive: “WHAT should we do ?”
Outputs: DASHBOARD; FORECAST; ACTIONS, RULES, RECOMMs
Example: Energy Analytics for a PV Microgrid
Descriptive: What is the total energy, the instantaneous energy and power, etc. ?
Diagnostic: Why is the panel temperature decreasing when the solar irradiance is high and the wind speed is very low ?
Predictive: Can I forecast the plant output for tomorrow, or can I generate 4 kWh net energy ?
Prescriptive: What actions should be undertaken for the plant to reach 4 kW energy generation capacity from its current 2 kW ?
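The descriptive question above (total energy from power readings) can be sketched in a few lines. This is a minimal illustration, not the presenter's method: the function name `total_energy_kwh`, the sample readings, and the use of the trapezoidal rule are all assumptions made for the example.

```python
def total_energy_kwh(timestamps_h, power_kw):
    """Integrate power samples (kW) over time (hours) with the trapezoidal rule."""
    if len(timestamps_h) != len(power_kw):
        raise ValueError("timestamps and power samples must have the same length")
    energy = 0.0
    for i in range(1, len(power_kw)):
        dt = timestamps_h[i] - timestamps_h[i - 1]
        # Area of the trapezoid between two consecutive samples.
        energy += 0.5 * (power_kw[i] + power_kw[i - 1]) * dt
    return energy


# Hypothetical readings: hours since sunrise and plant output in kW.
hours = [0.0, 1.0, 2.0, 3.0, 4.0]
power = [0.0, 1.0, 2.0, 1.0, 0.0]

energy = total_energy_kwh(hours, power)
# Compare against the 4 kWh net-energy target mentioned on the slide.
meets_target = energy >= 4.0
```

The same energy figure, computed over historical days, would be the input to a forecast for the predictive question.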
Analytics : Methodology
Reason and Plan with Uncertain Knowledge
Quantify uncertainty & Probabilistic reasoning: Bayesian networks, Conditional distributions
Probabilistic reasoning over time:
Hidden Markov models, Kalman filters, Dynamic Bayesian networks
Simple decisions: Utility theory, Decision networks
Complex decisions: Partially Observable Markov Decision Process (POMDP), Game theoretic models
Planning graphs
Learning and Data Mining:
[Supervised | Semi-supervised | Unsupervised | Reinforcement] learning – Classification, Clustering
Different types of ANN | Deep Learning Networks | Support Vector Machines
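As one concrete instance of probabilistic reasoning over time, a scalar Kalman filter can be sketched as below. The assumptions are loud ones: the hidden state is treated as a constant that drifts slowly, the noise variances are hypothetical values, and the function name `kalman_1d` is invented for this example.

```python
def kalman_1d(measurements, process_var, measurement_var, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a slowly drifting constant.

    Returns a list of (estimate, estimate_variance) pairs, one per measurement.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: the state is assumed unchanged, so only uncertainty grows.
        p = p + process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = p / (p + measurement_var)
        x = x + gain * (z - x)
        p = (1.0 - gain) * p
        estimates.append((x, p))
    return estimates


# Hypothetical sensor that keeps reporting 5.0; the filter starts from 0.0.
track = kalman_1d([5.0] * 20, process_var=1e-4, measurement_var=1.0)
final_estimate, final_variance = track[-1]
```

The variance returned with each estimate is what makes this a quantification of uncertainty rather than plain smoothing.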
Challenges
Data to Knowledge Pipeline
Cyber & Physical Space Entities
Edge
Global Infra
Data Ingestion
Data Analysis
Applications
Data source
“Big” data Infra
“Little” data Infra
Decision making with Knowledge
NATURE of INGESTED DATA
DATA @ REST (VOLUME): Archival/static data (TBs) in data stores
DATA @ MOTION (VELOCITY): Streaming data
DATA @ MANY FORMS (VARIETY): Structured/unstructured, text, multimedia, audio, video
DATA @ DOUBT (VERACITY): Data with uncertainty that may be due to incompleteness, missing points, etc.
NATURE of ANALYSIS
COGNITIVE: Learn dynamically ?
PRESCRIPTIVE: What are the best outcomes ?
PREDICTIVE: What could happen ?
DESCRIPTIVE: What has happened ?
DISCOVERY: What do we have ?
A first list of challenges derived from the V’s
Volume: How much data is really relevant to the problem solution & what is the cost of processing ?
Can you really afford to store and process all that data ?
Velocity: A lot of data is coming in at high speed
Need for streaming versus block approach to data analysis
How to analyze data in-flight and combine with data at-rest
Variety:
A small fraction is in structured formats (e.g., relational, XML, etc.)
A fair amount is semi-structured (e.g., web logs, etc.)
The rest of the data is unstructured (e.g., text, photographs, etc.)
No single data model can currently handle the diversity
Veracity: Cover term for accuracy, precision, reliability and integrity
What is it that you don’t know about the data ?
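The streaming versus block distinction under Velocity can be illustrated with a running mean and variance that never stores the stream. This is a sketch using Welford's online algorithm, chosen here as an example; the class name `RunningStats` and the readings are hypothetical.

```python
class RunningStats:
    """Mean and sample variance over a stream, updated one value at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        # Uses the deviation from both the old and the new mean (Welford).
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(reading)  # constant memory, regardless of stream length
```

A block approach would first collect all readings and then compute the same statistics, which is exactly what becomes unaffordable at high velocity.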
Top Challenges
Data acquisition
Is raw data of interest in totality ?
Challenge: design efficient filters and compression techniques that do not discard useful information; automatically generate the right metadata to describe the data
Data reduction
Will traditional data reduction approaches (via compression) become overwhelming ?
Challenge: introduction of new data collection practices and models as per analytical needs; compact (space, time) representations/dictionary/basis; parsimonious models (low-dimensionality, compressed sensing and sparse data capture models)
“Big-Little” Data
Device cloud vs. Conventional cloud; Distributed data and Peer-to-Peer Federation
Challenge: how to combine Big and Little data for meaningful analytics (often in real time)
Analytics from the Edge to the Cloud
Will the current model of pushing all data to a central cloud for analytics scale, remain efficient, and alleviate privacy concerns ?
Challenge: how to automate distributed analytics and decision making on subsets of “Little” and “Big” data, within the constraints of device capability, privacy needs, energy and network costs, and application QoS
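One simple way to picture filtering at the data source is a dead-band (send-on-delta) filter, which forwards a reading only when it differs enough from the last forwarded one. This is an illustrative sketch, not a technique named on the slides; the function name `deadband_filter`, the threshold, and the readings are assumptions.

```python
def deadband_filter(samples, threshold):
    """Keep a sample only if it moved at least `threshold` from the last kept one.

    Returns (index, value) pairs so the original timing can be reconstructed.
    """
    kept = []
    last = None
    for index, value in enumerate(samples):
        if last is None or abs(value - last) >= threshold:
            kept.append((index, value))
            last = value
    return kept


# Hypothetical temperature readings taken on an edge device.
readings = [20.0, 20.1, 20.2, 21.5, 21.6, 19.0, 19.1]
reported = deadband_filter(readings, threshold=1.0)
# reported == [(0, 20.0), (3, 21.5), (5, 19.0)]
```

The filter is lossy by design, so the threshold encodes the trade-off the challenge describes: too large and useful information is discarded, too small and nothing is saved.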
Top Challenges
Handling inconsistent/incomplete/missing data and outliers
Is this critical ?
Challenge: design robust imputation algorithms
Heterogeneous Data Fusion
Is there a need to analyze the relationship between heterogeneous data objects/streams ?
Challenge: extract the right amount of semantics; sequential data fusion via transform spaces
Scalability with multi-level hierarchy
Will traditional methods of data navigation and search in deep hierarchies be scalable ?
Challenge: design newer alternatives
Data summarization for interactive Query
Will examination of datasets (all at once) become difficult ?
Data summarization lets users request data with particular characteristics
Data summarization: organize data based on the presence/type of feature
Scientific data features: geometrical, topological, statistical
Non-scientific data features: related to semantic/syntactic components of the data
Challenge: extraction of meaningful features from both high- and low-dimensional data; data storage and indexing in an I/O-efficient format for rapid runtime retrieval
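For the missing-data challenge on this slide, the simplest baseline is linear interpolation between the nearest observed neighbours. The sketch below is a baseline only and is not a robust imputation algorithm in the sense the slide asks for; the function name `interpolate_missing` and the sample series are hypothetical.

```python
def interpolate_missing(values):
    """Fill None entries by linear interpolation between observed neighbours.

    Gaps at either end are filled with the nearest observed value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("need at least one observed value")
    filled = list(values)
    for i in range(len(filled)):
        if filled[i] is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]   # leading gap: copy first observation
        elif right is None:
            filled[i] = values[left]    # trailing gap: copy last observation
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


# Hypothetical sensor series with dropped samples marked as None.
series = [None, 10.0, None, None, 16.0, None]
filled = interpolate_missing(series)
```

Interpolation assumes the signal varies smoothly across the gap; outliers and long gaps would need a model-based method instead.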
Top Challenges
Analytics of temporally/spatially evolving features
Do data features occur at different spatial and temporal scales ?
Challenge: effective visual techniques that are computationally practical and that can take advantage of humans' unique cognitive ability to track those feature changes
Representation of evidence and uncertainty
Interpretation of evidence is subject to the person performing this task, and depends on their prior knowledge, subjective settings and viewpoint
Uncertainty quantification models the consequence based on the presented evidence and then predicts the qualities of the corresponding outcome
Challenge: how to represent evidence and uncertainty clearly and without bias through visualization
Sense making to users/decision makers
Involves examining all the assumptions made and retracing the analysis
There can be many sources of error: computer systems can have bugs, models almost always have assumptions, and results can be based on erroneous data. For all of these reasons, users will try to understand, and verify, the results produced by the computer.
Challenge: what should the man-machine interface for this look like ?
Platforms and Tools
Scale and Size DO matter !!!
References
Stephen H. Kaisler et al., “Big data and analytics: challenges and issues”
Pak Chung Wong, Han-Wei Shen, Chaomei Chen, “Top Ten Interaction Challenges in Extreme-Scale Visual Analytics”
http://link.springer.com/chapter/10.1007/978-1-4471-2804-5_12#page-1
Other infographics from the web