Copyright © 2011 by ScaleOut Software, Inc.
WSTA Seminar, September 14, 2011
Bill Bain ([email protected])
Using Distributed, In-Memory
Computing for Fast Data Analysis
Agenda
• The Need for Memory-Based, Distributed Storage
• What Is a Distributed Data Grid (DDG)?
• Performance Advantages and Architecture
• Migrating Data to the Cloud and Across Global Sites
• Parallel Data Analysis
• Comparison of DDG to File-Based Map/Reduce
The Need for Memory-Based Storage
[Diagram: Web server farm — Internet traffic passes through a load-balancer to Web servers, app servers, and a database server with a RAID disk array, all connected by Ethernet; the database server is the bottleneck, relieved by a distributed, in-memory data grid spanning the server farms.]
Example: Web server farm:
• Load-balancer directs incoming client requests to Web servers.
• Web and app. server farms build Web pages and run business logic.
• Database server holds all mission-critical, LOB data.
• Server farms share fast-changing data using a DDG to avoid bottlenecks and maximize scalability.
The Need for Memory-Based Storage
[Diagram: cloud application running as multiple virtual servers (App VS), backed by a distributed data grid of virtual servers (Grid VS) in front of cloud-based storage.]
Example: Cloud application:
• Application runs as multiple virtual servers (VS).
• Application instances store and retrieve LOB data from a cloud-based file system or database.
• Applications need fast, scalable storage for fast-changing data.
• Distributed data grid runs as multiple virtual servers to provide “elastic,” in-memory storage.
What is a Distributed Data Grid?
• A new “vertical” storage tier:
– Adds missing layer to boost performance.
– Uses in-memory, out-of-process storage.
– Avoids repeated trips to backing storage.
[Diagram: the storage hierarchy — processor L2 cache, “in-process” application memory, the “out-of-process” distributed cache, and backing storage.]
• A new “horizontal” storage tier:
– Allows data sharing among servers.
– Scales performance & capacity.
– Adds high availability.
– Can be used independently of backing storage.
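Concretely, the grid's basic read/write interface reduces to a keyed put/get store shared by every server in the farm. The sketch below is illustrative only — DataGrid is not a real product API, and a ConcurrentHashMap stands in for the out-of-process, replicated store — but it shows the access pattern an application sees:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative stand-in for a distributed data grid client.
// A real grid would serialize each value and ship it to an
// out-of-process cache server on the network; here a
// ConcurrentHashMap plays that role.
class DataGrid<K, V> {
    private final Map<K, V> store = new ConcurrentHashMap<>();

    public void put(K key, V value) { store.put(key, value); }

    public V get(K key) { return store.get(key); }

    public void remove(K key) { store.remove(key); }

    public int size() { return store.size(); }
}
```

An application server might call grid.put("session:42", cart) so that any other server in the farm can read the same session state without touching the database.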
Distributed Data Grids: A Closer Look
• Incorporates a client-side, in-process cache (“near cache”):
– Transparent to the application.
– Holds recently accessed data.
• Boosts performance:
– Eliminates repeated network data transfers & deserialization.
– Reduces access times to near “in-process” latency.
– Is automatically updated if the distributed grid changes.
– Supports various coherency models (coherent, polled, event-driven).
[Diagram: “in-process” application memory and client-side cache in front of the “out-of-process” distributed cache.]
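The near-cache behavior described above can be sketched as follows. NearCache and remoteFetch are illustrative names (not a vendor API): a local in-process map absorbs repeated reads, and invalidate() models the grid notifying the client that an object changed under the coherent model:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative near cache: recently fetched objects stay in-process,
// so repeated reads avoid the network transfer and deserialization
// of a round trip to the distributed cache. `remoteFetch` stands in
// for the call to the out-of-process grid.
class NearCache<K, V> {
    private final Map<K, V> local = new ConcurrentHashMap<>();
    private final Function<K, V> remoteFetch;

    NearCache(Function<K, V> remoteFetch) { this.remoteFetch = remoteFetch; }

    public V get(K key) {
        // Hit: served at in-process latency; miss: one remote fetch.
        return local.computeIfAbsent(key, remoteFetch);
    }

    // Called when the grid reports that a key changed (coherent model).
    public void invalidate(K key) { local.remove(key); }
}
```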
Performance Benefit of Client-side Cache
• Eliminates repeated network data transfers.
• Eliminates repeated object deserialization.
[Chart: average response time in microseconds for 10KB objects at a 20:1 read/update ratio, DDG vs. DBMS.]
Top 5 Benefits of Distributed Data Grids
1. Faster access time for business logic state or database data.
2. Scalable throughput to match a growing workload and keep response times low.
3. High availability to prevent data loss if a grid server (or network link) fails.
4. Shared access to data across the server farm.
5. Advanced capabilities for quickly and easily mining data using scalable “map/reduce” analysis.

[Chart: access latency (msec) vs. throughput (accesses/sec), Grid vs. DBMS.]
Scaling the Distributed Data Grid
• Distributed data grid must deliver scalable throughput.
• To do so, its architecture must eliminate bottlenecks to scaling:
– Avoid centralized scheduling to eliminate hot spots.
– Use data partitioning and maintain load-balance to allow scaling.
– Use fixed vs. full replication to avoid n-fold overhead.
– Use low-overhead heart-beating.
• Example of linear throughput scaling:
[Chart: read/write throughput in accesses/second for 10KB objects, scaling linearly with node counts from 4 to 64 and object counts from 16,000 to 256,000.]
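The partitioning and fixed-replication ideas above can be sketched as follows. The node count and single-replica scheme are assumptions for the example, not taken from the slides: each key hashes to a fixed partition (spreading load to avoid hot spots), each partition has a primary node, and one replica lives on the next node so a single server failure loses no data, without the n-fold cost of full replication:

```java
// Illustrative partitioning scheme for a distributed data grid.
class Partitioner {
    // Hash each key to a fixed partition to spread keys evenly.
    static int partitionOf(Object key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Each partition is owned by one primary node.
    static int primaryNode(int partition, int numNodes) {
        return partition % numNodes;
    }

    // Fixed replication: one replica on the next node, rather than
    // full replication to every node (which costs n-fold overhead).
    static int replicaNode(int partition, int numNodes) {
        return (primaryNode(partition, numNodes) + 1) % numNodes;
    }
}
```

Adding a node then only requires rebalancing a fraction of the partitions, which is what allows throughput to scale linearly.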
Typical Commercial Distributed Data Grids
• Partition objects to scale throughput and avoid hot spots.
• Synchronize access to objects across all servers.
• Dynamically rebalance objects to avoid hot spots.
• Replicate each cached object for high availability.
• Detect server or network failures and self-heal.
[Diagram: a client application uses a client library to retrieve a cached copy of an object from one of several cache services connected by Ethernet; a replica copy resides on another cache service.]
Wide Range of Applications
Financial Services
• Portfolio risk analysis
• VaR calculations
• Monte Carlo simulations
• Algorithmic trading
• Market message caching
• Derivatives trading
• Pricing calculations
Other Applications
• Edge servers: chat, email
• Online gaming servers
• Scientific computations
• Command and control
E-commerce
• Session-state storage
• Application state storage
• Online banking
• Loan applications
• Wealth management
• Online learning
• Hotel reservations
• News story caching
• Shopping carts
• Social networking
• Service call tracking
• Online surveys
Importance for Cloud Computing
• Cloud computing:
– Makes elastic resources readily available, but…
– Clouds have relatively slow interconnects.
• Distributed data grids add significant value in the cloud:
– Allow data sharing across a group of virtual servers.
– Elastically scale throughput as needed.
– Provide low-latency, object-oriented storage.
• Clouds provide the elastic platform for parallel data analysis.
• DDGs provide the efficiency and scalability needed to overcome the cloud’s limited interconnect speed.
DDGs Simplify Data Migration to the Cloud
• Distributed data grids can automatically bridge on-premise and cloud-based data grids to unify access.
• This enables seamless access to data across multiple sites.
[Diagrams: data automatically migrates between the user’s on-premise distributed data grid (SOSS hosts) and a cloud-hosted distributed data grid (SOSS virtual servers), giving the cloud application and a second on-premise application unified access; a backing store sits behind the cloud-hosted grid.]
DDGs Enable Seamless Global Access
[Diagram: a global distributed data grid connecting per-site distributed data grids (SOSS servers) across mirrored data centers and satellite data centers.]
Introducing Parallel Data Analysis
• The goal:
– Quickly analyze a large set of data for patterns and trends.
– How? Run a method E (“eval”) across a set of objects D in parallel.
– Optionally merge the results using method M (“merge”).
• Evolution of parallel analysis:
– '80s: “SIMD/SPMD” (Flynn, Hillis)
– '90s: “Domain decomposition” (Intel, IBM)
– '00s: “Map/reduce” (Google, Hadoop, Dryad)
• Applications:
– Search, financial services, business intelligence, simulation

[Diagram: method E evaluates each data object D in parallel; method M merges the per-object results into a single result.]
Example in Financial Services
Analyze trading strategies across stock histories:
Why?
• Back-testing systems help guard against risks in deploying new trading strategies.
• Performance is critical for “first to market” advantage.
• Uses a significant amount of market data and computation time.
How?
• Write method E to analyze trading strategies across a single stock history.
• Write method M to merge two sets of results.
• Populate the data store with a set of stock histories.
• Run method E in parallel on all stock histories.
• Merge the results with method M to produce a report.
• Refine and repeat…
Stage the Data for Analysis
• Step 1: Populate the distributed data grid with objects, each of which represents a price history for a ticker symbol:
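Step 1 might be sketched as follows. A plain Java Map stands in for the distributed data grid, and the StockHistory class, its fields, and the sample tickers and prices are made-up illustrations, not data from the slides:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative staging step: one price-history object per ticker
// symbol, keyed by the symbol. A HashMap stands in for the grid.
class StageData {
    static class StockHistory {
        final String ticker;
        final double[] closingPrices;
        StockHistory(String ticker, double[] closingPrices) {
            this.ticker = ticker;
            this.closingPrices = closingPrices;
        }
    }

    static Map<String, StockHistory> populate() {
        Map<String, StockHistory> grid = new HashMap<>();
        // Sample data for illustration only.
        grid.put("AAA", new StockHistory("AAA", new double[]{10, 11, 12}));
        grid.put("BBB", new StockHistory("BBB", new double[]{20, 19, 21}));
        return grid;
    }
}
```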
Code the Eval and Merge Methods
• Step 2: Write a method to evaluate a stock history based on parameters:

    Results EvalStockHistory(StockHistory history, Parameters params)
    {
        <analyze trading strategy for this stock history>
        return results;
    }

• Step 3: Write a method to merge the results of two evaluations:

    Results MergeResults(Results results1, Results results2)
    {
        <merge both results>
        return results;
    }

• Notes:
– This code can be run as a sequential calculation on in-memory data.
– No explicit accesses to the distributed data grid are used.
Run the Analysis
• Step 4: Invoke parallel evaluation and merging of results:

    Results Invoke(EvalStockHistory, MergeResults, querySpec, params);
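The eval/merge pattern of Steps 2–4 can be sketched in Java. A parallel stream stands in for the grid's distributed Invoke(), and the “analysis” (counting up-days in each history) is a made-up placeholder for a real trading-strategy back-test; note that the merge method must be associative so results can be combined pairwise in any order:

```java
import java.util.List;

// Illustrative sketch of the Invoke() pattern: run eval on each
// stock history in parallel, then pairwise-merge the results.
class ParallelAnalysis {
    static class Results {
        final long upDays;
        Results(long upDays) { this.upDays = upDays; }
    }

    // Method E: evaluate one stock history (placeholder analysis).
    static Results evalStockHistory(double[] closingPrices) {
        long up = 0;
        for (int i = 1; i < closingPrices.length; i++)
            if (closingPrices[i] > closingPrices[i - 1]) up++;
        return new Results(up);
    }

    // Method M: merge two result sets (must be associative).
    static Results mergeResults(Results a, Results b) {
        return new Results(a.upDays + b.upDays);
    }

    // Stand-in for Invoke(EvalStockHistory, MergeResults, ...):
    // a parallel stream evaluates and reduces across histories.
    static Results invoke(List<double[]> histories) {
        return histories.parallelStream()
                .map(ParallelAnalysis::evalStockHistory)
                .reduce(new Results(0), ParallelAnalysis::mergeResults);
    }
}
```

In a real grid, each eval would run on the server already holding that history, so only the small Results objects cross the network.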
[Diagram: .eval() runs on each stock history in parallel; the per-history results are pairwise .merge()d in a tree until a single result is returned to the client.]
DDG Minimizes Data Motion
• File-based map/reduce must move data to memory for analysis:
• Memory-based DDG analyzes data in place:
[Diagrams: in file-based map/reduce, data D moves from the file system/database server into each M/R server’s memory before eval (E) runs; in a memory-based DDG, eval runs on each grid server against data already resident in the distributed data grid.]
[Diagram: the same parallel .eval()/.merge() tree, but each eval must first perform file I/O to load its stock history before analysis begins.]
Performance Impact of Data Motion
Measured random access to DDG data to simulate file I/O.
Comparison of DDGs and File-Based M/R
                        DDG                                        File-Based M/R
Data set size           Gigabytes to terabytes                     Terabytes to petabytes
Data repository         In-memory                                  File / database
Data view               Queried object collection                  File-based key/value pairs
Development time        Low                                        High
Automatic scalability   Yes                                        Application dependent
Best use                Quick-turn analysis of memory-based data   Complex analysis of large datasets
I/O overhead            Low                                        High
Cluster mgt.            Simple                                     Complex
High availability       Memory-based                               File-based
Walk-Away Points
• Developers need fast, scalable, highly available, and sharable memory-based storage for scaled-out applications.
• Distributed data grids (DDGs) address these needs with:
– Fast access time & scalable throughput
– Highly available data storage
– Support for parallel data analysis
• Cloud-based and globally distributed applications need DDGs to:
– Support scalable data access for “elastic” applications.
– Efficiently and easily migrate data across sites.
– Avoid relatively slow cloud I/O storage and interconnects.
• DDGs offer simple, fast “map/reduce” parallel analysis:
– Make it easy to develop applications and configure clusters.
– Avoid file I/O overhead for datasets that fit in memory-based grids.
– Deliver automatic, highly scalable performance.
Distributed Data Grids for Server Farms & High Performance Computing
www.scaleoutsoftware.com