COMP9321 Web Application Engineering
Semester 2, 2015
Dr. Amin Beheshti
Service Oriented Computing Group, CSE, UNSW Australia
Week 11 (Part II)
http://webapps.cse.unsw.edu.au/webcms2/course/index.php?cid=2411
http://www.cse.unsw.edu.au/~sbeheshti/
COMP9321, 15s2, Week 11 Tuesday, 13 October 2015
Big Data: Challenges and Opportunities
http://www.intelli3.com/
We are Generating Vast Amounts of Data !!
• Healthcare: remote patient monitoring
• Manufacturing: product sensors
• Location-Based Services: real-time location data
• Retail: social media…
• Digitalization of Artefacts: books, music, videos, etc.
Airbus A380: generates 10 TB every 30 min.
Twitter: generates approximately 12 TB of data per day.
Facebook: data grows by over 500 TB daily.
New York Stock Exchange: 1 TB of data every day.
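To compare these figures on a common scale, the quoted rates can be normalised to TB per day. This is only a rough sketch: it assumes each source sustains its quoted rate for a full 24 hours (an aircraft is obviously not in flight all day), and the names `quoted_rates` and `tb_per_day` are hypothetical.

```python
# Quoted rates from the slide, as (terabytes, period in minutes).
# Assumption for illustration: each rate is sustained for a full 24 hours.
MINUTES_PER_DAY = 24 * 60

quoted_rates = {
    "Airbus A380": (10, 30),
    "Twitter": (12, MINUTES_PER_DAY),
    "Facebook": (500, MINUTES_PER_DAY),   # "over 500 TB", so a lower bound
    "New York Stock Exchange": (1, MINUTES_PER_DAY),
}


def tb_per_day(terabytes, period_minutes):
    """Scale a quoted volume to a 24-hour period."""
    return terabytes * MINUTES_PER_DAY / period_minutes


daily = {name: tb_per_day(tb, mins) for name, (tb, mins) in quoted_rates.items()}

# Print the sources from largest to smallest daily volume.
for name, volume in sorted(daily.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {volume:.0f} TB/day")
```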
We are Generating Vast Amounts of Meta-data !!
Data
Versioning
Provenance
Security
Privacy
…
We are tracing everything: Who did what? When? Where? …
e.g. Twitter handles ~1.6 billion search queries per day.
Beheshti S.M.R. et al. "Enabling the Analysis of Cross-Cutting Aspects in Ad-hoc Processes", CAiSE Conference (2013)
We are Generating Vast Amounts of Meta-data !!
• Reading a book, e.g. Kindle tracks: what you are reading, when you are reading it, how often you read it, etc.
• Listening to music, e.g. MP3 player tracks: what you are listening to, when and how often, in what order, etc.
• Smart phones, e.g. iPhone tracks: our location, our speed, what apps we are using, who we are ringing, etc.
Projects: Smart TV
Big Data and Big Meta-Data
share, comment, review, crowdsource, etc.
So, What is Big Data?
Big data refers to our ability to collect and analyse the ever-expanding amounts of data and meta-data that we are generating every second!
Challenges: Capture, Storage, Search, Sharing, Transfer, Analysis, Visualization, etc.
Big Data != Large Datasets
The big data problem can be seen as a massive number of small data islands from personal, shared and business data.
Linking and analyzing this data is of high interest.
What Makes it Big Data?
• Volume: the vast amounts of data generated every second.
• Velocity: the speed at which new data is generated and moves around.
• Variety: the increasingly different types of data.
• Veracity: the quality of data, e.g. the messiness of the data. Noisy and inconsistent data needs to be detected and corrected.
• Value: statistical, events, correlation, hypothetical.
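The veracity point (detecting and correcting noisy and inconsistent data) can be illustrated with a small sketch. The readings, the valid range of 30 to 220, and the choice to replace outliers with the median are all assumptions made for this example, not something prescribed in the lecture.

```python
import statistics

# Hypothetical sensor readings as (patient_id, heart_rate); not from the lecture.
readings = [
    ("p1", 72),
    ("p1", 75),
    ("p1", None),   # missing value
    ("p1", 74),
    ("p1", 740),    # noisy value, probably a typing or unit error
    ("P1 ", 73),    # inconsistent key formatting
]


def clean(rows, low=30, high=220):
    """Normalise keys, drop missing values, replace out-of-range values with the median."""
    normalised = [(pid.strip().lower(), hr) for pid, hr in rows if hr is not None]
    valid = [hr for _, hr in normalised if low <= hr <= high]
    median = statistics.median(valid)
    return [(pid, hr if low <= hr <= high else median) for pid, hr in normalised]


cleaned = clean(readings)
print(cleaned)
```

Replacing an outlier with the median is only one possible correction; dropping the record or flagging it for review are equally valid choices depending on the application.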
Challenges: How to Store and Process?
Big data is high-volume, high-velocity, and/or high-variety information assets that require new forms of storage and processing.
On-hand database management tools?
Traditional data processing applications?
Challenges: Big Data Storage
NoSQL databases:
• Employ less constrained consistency models.
• Simple retrieval and appending operations.
• Significant performance benefits.
Examples:
• Key–value Store
• Document Store
• Graph Database
• …
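The "simple retrieval and appending operations" of a key-value store can be sketched with a toy in-memory class. `KeyValueStore` and its methods are hypothetical names for illustration only and do not correspond to the API of any real NoSQL product.

```python
from collections import defaultdict


class KeyValueStore:
    """Toy in-memory key-value store offering only put, append and get."""

    def __init__(self):
        self._data = defaultdict(list)

    def put(self, key, value):
        """Overwrite everything stored under the key with a single value."""
        self._data[key] = [value]

    def append(self, key, value):
        """Add a value under the key without reading or rewriting the others."""
        self._data[key].append(value)

    def get(self, key):
        """Return a copy of all values stored under the key (empty if unknown)."""
        return list(self._data.get(key, []))


store = KeyValueStore()
# Values are schema-less documents: the two records carry different fields.
store.append("user:42", {"event": "login"})
store.append("user:42", {"event": "search", "query": "big data"})

print(store.get("user:42"))
print(store.get("user:unknown"))
```

There are no joins, no schema and no multi-key transactions here, which is the kind of restriction that lets real systems relax consistency in exchange for performance.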
Challenges: Big Data Storage (Graphs are Everywhere)
• Users and Movies (Netflix): Collaborative Filtering
• Docs and Words (Wiki): Text Analysis
• Social Network: Probabilistic Analysis
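As a minimal sketch of the user-movie case, collaborative filtering can be expressed as a walk over a bipartite graph: from a user to the movies they watched, to other users who watched them, to movies not yet seen. The data and the scoring rule (overlap count) are assumptions for illustration, not the method used by any particular service.

```python
from collections import Counter

# Hypothetical bipartite graph: user -> set of movies watched.
watched = {
    "alice": {"Alien", "Blade Runner"},
    "bob": {"Alien", "Blade Runner", "Contact"},
    "carol": {"Contact", "Dune"},
}


def recommend(user, graph):
    """Rank unseen movies by how many watched movies their viewers share with the user."""
    seen = graph[user]
    scores = Counter()
    for other, movies in graph.items():
        if other == user:
            continue
        overlap = len(seen & movies)   # number of shared neighbours in the graph
        if overlap == 0:
            continue                   # no path between the two users
        for movie in movies - seen:
            scores[movie] += overlap
    return [movie for movie, _ in scores.most_common()]


print(recommend("alice", watched))
```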
Beheshti, et al. “Large Scale Graph Processing Systems: Survey and An Experimental Evaluation”, Cluster Computing Journal, 2015.
Beheshti, et al. “On Characterizing the Performance of Distributed Graph Computation Platforms”, TPCTC Conference, 2014.
Beheshti S.M.R. et al. "DREAM: Distributed RDF Engine with Adaptive Query Planner and Minimal Communication", VLDB (2015)
Challenges: Big Data Processing
Apache Hadoop: Hadoop is an open source framework that uses a simple programming model to enable distributed processing of large data sets on clusters of computers.
Who Uses Hadoop?
Amazon, Facebook, Google, IBM, New York Times, Yahoo!, …
Apache Hadoop solution:
• Distributed File System (HDFS)
• MapReduce
• Pig
• HCatalog
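The "simple programming model" behind MapReduce can be sketched in plain Python with the classic word count. This runs in a single process and does not use Hadoop; the function names `map_phase`, `shuffle` and `reduce_phase` are illustrative, and on a real cluster each phase would run in parallel across machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key."""
    return key, sum(values)


# Two hypothetical input splits, as if stored in different file blocks.
splits = ["big data is big", "data about data"]

mapped = chain.from_iterable(map_phase(split) for split in splits)
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

The programmer writes only the map and reduce functions; splitting the input, shuffling and rerunning failed tasks are left to the framework.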
Challenges: Big Data Processing
Apache Spark: Fast and Expressive Cluster Computing Engine Compatible with Apache Hadoop
• Efficient: in-memory storage. Up to 10× faster on disk, 100× in memory.
• Usable: rich APIs in Java, Scala, Python. 2-5× less code.
Difference between Spark and MapReduce
• Spark stores data in-memory whereas Hadoop stores data on disk.
• The Resilient Distributed Dataset (RDD), Spark's data storage model, uses a clever way of guaranteeing fault tolerance that minimizes network I/O.
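The slide does not spell out the mechanism. In Spark it is lineage: an RDD remembers the transformations used to build it, so lost in-memory data can be recomputed instead of being replicated over the network. The toy class below imitates that idea in a single process; `ToyRDD` and `lose_cache` are hypothetical names and this is not the PySpark API.

```python
class ToyRDD:
    """Toy stand-in for an RDD: it remembers how it was derived, not just its data."""

    def __init__(self, source=None, parent=None, transform=None):
        self._source = source        # base data, only set on the root dataset
        self._parent = parent        # lineage: the dataset this one was derived from
        self._transform = transform  # lineage: how to derive it from the parent
        self._cache = None           # in-memory copy, which may be lost

    def map(self, fn):
        return ToyRDD(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        if self._cache is None:
            # First use, or the cache was lost: rebuild from the lineage.
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = self._transform(self._parent.collect())
        return self._cache

    def lose_cache(self):
        """Simulate a node failure that wipes the in-memory data."""
        self._cache = None


numbers = ToyRDD(source=range(10))
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.collect()
evens_squared.lose_cache()            # the data is gone, the lineage is not
recovered = evens_squared.collect()   # recomputed from the recorded transformations
print(first, recovered)
```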
Challenges: Big Data Integration
Example Scenario: Business Processes (BPs)
• People, Web Services, IT Systems, Workflows, ...
• BPs Execution Log
• Query and Explore
• Various Perspectives and Goals
Beheshti S.M.R. et al. "A query language for analyzing business processes execution", BPM Conference (2011)
Challenges: Big Data Integration
Messy, schema-less and complex Big Data world.
Less than 10% of the Big Data world is genuinely relational.
e.g. Linked Data
Challenges: Big Data Integration
Big Data-as-a-Service:
• Effective processing of big data within acceptable processing time
• Easy access to the big data and the big data analysis results
API Engineering
• ProgrammableWeb - APIs, Mashups and the Web as Platform; www.programmableweb.com/
• DataSift … open data sources
Reminder
Seminars: API Engineering and Micro-Services
When: Thursday, 15 October from 15:00-17:00. Where: UNSW, Mathews Theatre D.
Two interesting talks:
• API Engineering (Scientia Prof. Boualem Benatallah).
• Micro-services (Mr. Graham Lea).
Challenges: Big data requires a broad set of skills
• Math and Operations Research Expertise: develop analytic algorithms
• Visualization Expertise: interpret data sets, determine correlations and present in meaningful ways
• Tool Developers: mask complexity and analytics to lower skills boundaries
• Industry Vertical Domain Expertise: develop hypotheses, identify relevant business issues, ask the right questions
• Data Experts: data architecture, management, governance, policy
• Decision Making, Executive and Management: apply information to solve business issues
Data Scientist
Challenges: Big Data Analytics
Analytics can be defined in many ways, but what matters is the purpose of analytics.
Most definitions agree on the following: Analytics is used to gain insights from data in order to make better decisions, using mathematical or scientific methods.
Data → Analyse → Insight → Decide → Action
Manage the Data, Understand the Data, Act on the Data
• Reporting is the most widely used analytic capability.
• Gather data from multiple sources and create standard summarizations of the data.
• Visualizations are created to bring the data to life and make it easy to interpret.
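A minimal sketch of the reporting step, gathering records from several sources and producing standard summarizations, might look as follows. The two sources, their field names and the chosen statistics are hypothetical.

```python
import statistics
from collections import defaultdict

# Hypothetical sales records coming from two separate sources.
store_sales = [{"region": "NSW", "amount": 120.0}, {"region": "VIC", "amount": 80.0}]
online_sales = [{"region": "NSW", "amount": 60.0}, {"region": "QLD", "amount": 40.0}]


def summarise(*sources):
    """Merge all sources and report count, total and mean amount per region."""
    by_region = defaultdict(list)
    for source in sources:
        for row in source:
            by_region[row["region"]].append(row["amount"])
    return {
        region: {
            "count": len(amounts),
            "total": sum(amounts),
            "mean": statistics.mean(amounts),
        }
        for region, amounts in sorted(by_region.items())
    }


report = summarise(store_sales, online_sales)
for region, stats in report.items():
    print(f"{region}: {stats}")
```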
Challenges: Big Data Analytics
Cognitive computing systems learn and interact naturally with people to extend what either humans or machines could do on their own.
Self-learning systems that use:
• Data Mining
• NLP
• Machine Learning
• Pattern Recognition
• Crowdsourcing
• …
e.g. IBM Watson Q&A
http://www.research.ibm.com/cognitive-computing/
Challenges: Big Data Analytics
Example:
• Beheshti et al., “Scalable Graph-based OLAP Analytics over Process Execution Data”, DAPD Journal (2015).
• Beheshti et al., “A Framework and a Language for On-Line Analytical Processing on Graphs”, WISE Conference (2012).
OLAP is an approach to answering multi-dimensional analytical queries swiftly.
Problem:
• Extension of existing OLAP techniques to analysis of graphs is not straightforward.
• Key business insights remain hidden in the interactions among objects.
Solution:
• On-Line Analytical Processing on Graphs
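For orientation, a classic (non-graph) multi-dimensional query can be sketched as a roll-up over a small fact table. This illustrates ordinary OLAP aggregation only, not the graph-based approach of the cited papers; the fact table and the `roll_up` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact table: three dimensions (year, region, product) and one measure.
facts = [
    {"year": 2014, "region": "NSW", "product": "book", "sales": 10},
    {"year": 2014, "region": "VIC", "product": "book", "sales": 5},
    {"year": 2015, "region": "NSW", "product": "music", "sales": 7},
    {"year": 2015, "region": "NSW", "product": "book", "sales": 3},
]


def roll_up(rows, dimensions, measure="sales"):
    """Sum the measure over every combination of the chosen dimensions."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dimensions)] += row[measure]
    return dict(totals)


by_year = roll_up(facts, ["year"])
by_year_region = roll_up(facts, ["year", "region"])
print(by_year)
print(by_year_region)
```

Each record here is independent of the others. When the data is a graph, the interesting measures sit on the relationships between objects, which is why the flat roll-up above does not carry over directly.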
Challenges: Big Data Analytics
Big Data Analytics benefits from:
• NLP
• Machine Learning
• Pattern recognition, Learning, Extraction, Classification, Enrichment, Linking, etc.
Examples:
• Healthcare
• Social Networks, e.g. Twitter
• Education
• Finance
• …
Beheshti, et al., “Big data and cross-document coreference resolution: Current state and future opportunities”...
Big Data Leadership !!
Industry has been in the lead: Google, Amazon, Yahoo!, etc.
University researchers have been left behind, due to lack of access to large-scale cluster computing facilities.
Government agencies are making heavy investments:
• Investments in big-data computing will have extraordinary near-term and long-term benefits.
• Cloud computing must be considered a strategic resource.
Big Data: Opportunities
• Varieties of Data: Text, Social Media, Networks, Multimedia, Machine Data, Sensors
• Analytics: Organizing Big Data, Navigating through data, Summarizing Big Data, Process Data Analytics, Support decision-making
• Integration: Integrating enterprise and public data, Linking data/context, Entity Extraction and Integration, Knowledge Graph
• Big Data Performance: In memory, New Benchmarks and Architecture
• User Experience: Automation and intelligent guidance, Visualizing with Analytics, Interacting with Analytics, Storytelling
Book:
Beheshti S.M.R., Boualem Benatallah, et al., “Process Analytics: Concepts and techniques for querying and analysing big process data”, Springer, ISBN 978-3-319-25037-3, (2015)
http://www.springer.com/us/book/9783319250366
Conclusion
Why is Big Data different from past Very Large Datasets? Meta-Data!
Having the ability to analyse Big Data is of limited value if users cannot understand the analysis.
How can industry and academia collaborate towards solving Big Data challenges?
What is big today may not be big tomorrow!
Thank you!