Date post: | 20-Jan-2016 |
Upload: | tamsyn-warren |
Aurora
Group 19 :
Chu Xuân Tình
Trần Nhật Tuấn
Huỳnh Thái Tâm
Lecturer:
Associate Professor Dr.techn. Dang Tran Khanh
A new model and architecture for data stream management
Outline
Introduction
Aurora system architecture
The Aurora stream query algebra
Run-time architecture
Aurora is a new system for managing data streams for monitoring applications.
The fact that a software system must process and react to continual inputs from many sources (e.g., sensors) rather than from human operators requires one to rethink the fundamental architecture of a DBMS for this application area.
Aurora is a new DBMS currently under construction at Brandeis University, Brown University, and M.I.T.
Currently used DB systems
Classical DBMS:
• Passive repository storing data (HADP: human-active, DBMS-passive model)
• Only the current state of the data is important
• Data is synchronized; queries have exact answers (no support for approximation)
Monitoring applications are difficult to implement in a traditional DBMS:
• The basic computation model is wrong: DBMSs follow the HADP model, while monitoring applications often require a DAHP (DBMS-active, human-passive) model
• Triggers and alerters are second-class citizens
• Required data is hard to get from historical time series
• Developing dedicated middleware is expensive
Conclusion: these systems are ill suited to applications that alert a human when an abnormal situation occurs, which expect the DAHP model
Aurora – main assumptions
• Data comes from various uniquely identified data sources (data streams)
• Each incoming tuple is timestamped
• Aurora is expected to process incoming streams
• Tuples are transferred through a loop-free, directed graph
• Outputs from the system are presented to applications
• The system maintains historical storage
Aurora system overview
• Any box can filter a stream (select operation)
• A box can compute stream aggregates by applying an aggregate function across a window of values in the stream
• The output of any box can be an input for several other boxes (split operation)
• Each box can gather tuples from many inputs (union operation)
Aurora query model
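[Figure: Aurora query model. Labels shown: boxes b1 to b7, applications (Appl), connection points, storage S1, S2 and S3, a continuous query path, a view path, an ad-hoc query path, a "Keep 2 hr" persistence annotation, and a QoS spec on each of three outputs.]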
Each connection point (CP) and view should have a persistence specification (e.g., "keep data for 2 hr")
Each output is associated with a QoS specification (which helps allocate processing elements along the path)
Queries in Aurora
Continuous queries:
• The query continuously processes tuples
• Output tuples are delivered to an application
Ad-hoc queries:
• The system processes data and delivers answers starting from the earliest time stored in the connection point
• The semantics are the same as for a continuous query that started execution at tnow - (persistence specification)
• The query continues until explicitly terminated
Views:
• Similar to materialized or partially materialized views in classical DB systems
• An application may connect to the end of this path whenever there is a need
Connection points:
• Support dynamic modification of the network
• Support data caching (persistence specification), which helps with ad-hoc queries
• A connection point without an upstream node can be used as a stored data set (as in a classical DBMS)
• Tuples from a connection point can be pushed through the system (e.g., when the connection point is "materialized" and the stored tuples are passed as a stream to the downstream nodes)
• Alternatively, a downstream node can pull the data (helpful when executing filtering or joining operations)
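The persistence specification of a connection point can be pictured with a small sketch. This is not Aurora's storage code: the class name `ConnectionPoint`, the `keep` parameter and the dict-based tuples with a `TS` field are assumptions made only for illustration.

```python
from collections import deque


class ConnectionPoint:
    """Hypothetical connection point that keeps the last `keep` time units of tuples."""

    def __init__(self, keep):
        self.keep = keep          # persistence specification, e.g. 120 for "keep 2 hr" in minutes
        self.history = deque()    # stored tuples, oldest first

    def push(self, t):
        """Store an incoming tuple and discard tuples older than the persistence window."""
        self.history.append(t)
        cutoff = t["TS"] - self.keep
        while self.history and self.history[0]["TS"] < cutoff:
            self.history.popleft()

    def replay(self):
        """What an ad-hoc query attached now would see first: the stored history."""
        return list(self.history)
```

An ad-hoc query attached to such a connection point would first consume `replay()` and then keep receiving new tuples, which matches the "continuous query started at tnow - (persistence specification)" semantics described above.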
Application Domains
• Online Auctions
• Network Traffic Management
• Habitat Monitoring
• Military Logistics
• Immersive Environments
• Road Traffic Monitoring
• System Monitoring
SQuAl
The Aurora [S]tream [Qu]ery [Al]gebra
7 operators:
• Order-agnostic (Filter, Map, Union)
• Order-sensitive (BSort, Aggregate, Join, Resample)
Model:
A stream is an append-only sequence of tuples with uniform type
A stream type has the form: (TS, A1,…, An)
Stream tuples have the form: (ts, v1,…, vn)
Ai: application-specific data fields
ts: timestamp
Order-agnostic operators
Input tuples have the form:
t = (TS = ts, A1 = v1,…, Ak = vk)
3 operators:
Filter:
• similar to relational selection
• filter on multiple predicates
• route tuples according to which predicates they satisfy
Map:
• similar to relational projection
• apply arbitrary functions to tuples (including user-defined functions)
Union:
• merge 2 or more streams of common schema
Filter
Acts much like a case statement
Can be used to route input tuples to alternative streams
Form:
Filter(P1,…,Pm)(S)
• Pi: predicates over tuples on the input stream S
Its output consists of m + 1 streams (one per predicate, plus one for tuples that satisfy none of them)
Output tuples have the same schema and values as input tuples, including their QoS timestamps
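A minimal sketch of the routing behaviour, assuming tuples are Python dicts and that a tuple goes to the stream of the first predicate it satisfies (the case-statement reading). The function name `filter_box` is hypothetical.

```python
def filter_box(predicates, stream):
    """Route each tuple to one of len(predicates) + 1 output streams.

    Output i receives tuples whose first satisfied predicate is predicates[i];
    the last output receives tuples that satisfy no predicate.
    Tuples are passed through unchanged, timestamp included.
    """
    outputs = [[] for _ in range(len(predicates) + 1)]
    for t in stream:
        for i, p in enumerate(predicates):
            if p(t):
                outputs[i].append(t)
                break
        else:
            # no predicate matched: route to the extra (m + 1)-th stream
            outputs[-1].append(t)
    return outputs
```

With two predicates such as `A < 10` and `A < 20`, a tuple with `A = 5` lands only in the first output even though it satisfies both.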
Map
Is a generalized projection operator
Form:
Map(B1 = F1,…, Bm = Fm)(S)
• Bi: name of attribute
• Fi: function over tuple on the input stream S
Output tuple for each input tuple t has the form:
(TS = t.TS, B1 = F1(t),…, Bm = Fm(t))
The resulting stream can have a different schema than the input stream, but the timestamps of input tuples are preserved in the corresponding output tuples
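The output form of Map can be sketched as follows, assuming dict-based tuples; `map_box` and the attribute names in the usage note are illustrative only.

```python
def map_box(functions, stream):
    """Apply Map(B1 = F1, ..., Bm = Fm) to a stream of dict tuples.

    `functions` maps each output attribute name Bi to a function Fi of the input tuple.
    The timestamp TS of the input tuple is carried over unchanged.
    """
    for t in stream:
        out = {"TS": t["TS"]}
        for name, fn in functions.items():
            out[name] = fn(t)
        yield out
```

For example, `map_box({"total": lambda t: t["price"] * t["qty"]}, stream)` produces tuples with schema (TS, total) from tuples with schema (TS, price, qty).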
Union
Is used to merge 2 or more streams into a single output stream
Form:
Union(S1,…,Sn)
• Si: stream, common schema
Union can output tuples in any order
Output tuples have the same schema and values as input tuples, including their QoS timestamps
Order-sensitive operators
Require order specification arguments
Order specification: describes the tuple arrival order the operator expects
Order specifications have the form:
Order(On A, Slack n, GroupBy B1,…,Bm)
• A, Bi: attributes
• n: non-negative integer
4 operators:
BSort:
• an approximate sort operator with semantics equivalent to a bounded-pass bubble sort
Aggregate:
• applies a window function to sliding windows over its input stream
Join:
• a binary operator that resembles a band join
• applied to infinite streams
Resample:
• an interpolation operator used to align streams
BSort
Is an approximate sort operator
Form:
BSort(Assuming O)(S)
• O = Order(On A, Slack n, GroupBy B1,…,Bm) is a specification of the assumed ordering over the output stream
Performs a buffer-based approximate sort
Equivalent to n passes of a bubble sort, where n is the slack
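A buffer-based approximate sort can be sketched with a small heap: hold up to n + 1 tuples and, whenever the buffer is full, emit the one with the smallest sort key. The sketch ignores GroupBy and flushes the buffer when a finite input ends; both are simplifications, and the name `bsort` is hypothetical.

```python
import heapq


def bsort(stream, key, slack):
    """Approximate sort with a buffer of slack + 1 tuples (GroupBy omitted).

    Each time the buffer holds slack + 1 tuples, the tuple with the minimal
    key is evicted and emitted. Remaining tuples are flushed at end of input.
    """
    buf = []  # min-heap of (key, arrival number, tuple); arrival number breaks ties
    for seq, t in enumerate(stream):
        heapq.heappush(buf, (key(t), seq, t))
        if len(buf) > slack:
            yield heapq.heappop(buf)[2]
    while buf:
        yield heapq.heappop(buf)[2]
```

A tuple that arrives more than `slack` positions late cannot be placed correctly, which is why the result is only approximately sorted: with slack 1, the input 3, 2, 1 comes out as 2, 1, 3.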
Aggregate
Applies "window functions" to sliding windows over its input stream
Form:
Aggregate(F, Assuming O, Size s, Advance i)(S)
• F: "window function" (a SQL-style aggregate operation or a Postgres-style user-defined function)
• O = Order(On A, Slack n, GroupBy B1,…,Bm) is an order specification over input stream S
• s: size of the window (measured in terms of values of A)
• i: integer or predicate that specifies how to advance the window when it slides
Output tuples have the form:
(TS = ts, A = a, B1 = u1,…, Bm = um) ++ (F(W))
• W: "window" of tuples from the input stream with values of A between a and a + s - 1
• ts: the smallest timestamp associated with tuples in W
• ++: denotes concatenation of 2 tuples
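The windowing rule can be illustrated on a finite, already ordered input. The sketch below assumes an integer attribute A, an integer Advance, no GroupBy and no slack, and it also emits the partial windows at the tail of the input; the function name `aggregate` and the output field `F` are illustrative choices.

```python
def aggregate(stream, f, size, advance, on="A"):
    """Sliding-window aggregate over a finite list of dict tuples ordered on `on`.

    Each window W covers values of `on` in [a, a + size - 1]; the window start a
    moves forward by `advance`. The output timestamp is the smallest TS in W.
    """
    stream = list(stream)
    if not stream:
        return []
    out = []
    a = stream[0][on]        # first window starts at the first value of A
    last = stream[-1][on]
    while a <= last:
        window = [t for t in stream if a <= t[on] <= a + size - 1]
        if window:
            out.append({"TS": min(t["TS"] for t in window), on: a, "F": f(window)})
        a += advance
    return out
```

With Size 3 and Advance 2, consecutive windows overlap by one value of A, so a tuple at the boundary contributes to two output tuples.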
Slack = 1 or more
Blocking: waiting for lost or late tuples to arrive in order to finish window calculations
Optional Timeout argument:
• Aggregate(F, Assuming O, Size s, Advance i, Timeout t)
Join
Is a binary join operator
Form:
Join(P, Size s, Left Assuming O1, Right Assuming O2)(S1, S2)
• P: predicate over pairs of tuples from input streams S1 and S2
• s: integer
• O1: order specification on some numeric or time-based attribute of S1 (A)
• O2: order specification on some numeric or time-based attribute of S2 (B)
For every in-order tuple t in S1 and u in S2, the concatenation of t and u (t ++ u) is output if:
• |t.A - u.B| ≤ s
• P holds of t and u
The QoS timestamp for the output tuple is the minimum timestamp of t and u
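The two output conditions can be checked with a nested-loop sketch over finite inputs. A real stream join would bound its state using the order specifications; that part is left out here, and the function name `band_join` and the `left`/`right` output layout are assumptions.

```python
def band_join(s1, s2, pred, size, a="A", b="B"):
    """Band join of two finite lists of dict tuples.

    A pair (t, u) is output when |t[a] - u[b]| <= size and pred(t, u) holds.
    The output timestamp is the minimum of the two input timestamps.
    """
    out = []
    for t in s1:
        for u in s2:
            if abs(t[a] - u[b]) <= size and pred(t, u):
                out.append({"TS": min(t["TS"], u["TS"]), "left": t, "right": u})
    return out
```

The `size` argument is what makes this a band join: pairs whose ordering attributes are more than `size` apart never match, regardless of the predicate.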
Resample
Is an asymmetric, semijoin-like synchronization operator
Can be used to align pairs of streams
Form:
Resample(F, Size s, Left Assuming O1, Right Assuming O2)(S1, S2)
• F: "window function" over S1
• s: integer
• O1: order specification on some numeric or time-based attribute of S1 (A)
• O2: order specification on some numeric or time-based attribute of S2 (B)
For every tuple t from S1, output tuple:
(B1 : u.B1,…, Bm : u.Bm, A : t.A) ++ F(W(t))
• W(t) = {u ∈ S2 | u in order wrt O2 in S2 ∧ |t.A − u.B| ≤ s}
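A rough sketch of the window W(t), assuming finite inputs, no GroupBy attributes, and skipping tuples whose window is empty. The name `resample` and the output field `F` are hypothetical; the window function is supplied by the caller (for instance an average, as a crude interpolation).

```python
def resample(s1, s2, f, size, a="A", b="B"):
    """For each t in s1, apply f to the tuples of s2 whose `b` value lies within `size` of t[a]."""
    out = []
    for t in s1:
        # W(t): tuples u of s2 with |t.A - u.B| <= size
        window = [u for u in s2 if abs(t[a] - u[b]) <= size]
        if window:
            out.append({a: t[a], "F": f(window)})
    return out
```

The operator is asymmetric: s1 only supplies the points at which a value is wanted, while the values themselves come from s2.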
Run-time architecture
[Figure: Aurora run-time architecture. Components shown: inputs, router, queues Q1 … Qn, storage manager with buffer manager and persistent storage, scheduler, box processors, QoS monitor, load shedder, outputs.]
Quality of Service (QoS)
QoS is, in general, a multidimensional function of several attributes of an Aurora system:
• Response times (production of output tuples)
• Tuple drops
• Values produced (importance of produced values)
The administrator specifies QoS graphs for each output based on one or more of these functions
Other types of QoS functions can be defined too
QoS graphs
• Graphs are expected to be normalized
• Graphs should allow a properly sized network to operate with all outputs in a "good zone"
• Graphs should be convex (the value-based graph is an exception)
[Figure: three QoS graphs, each with a vertical axis from 0 to 1, plotted against delay, % tuples delivered, and output value; a "good zone" is marked.]
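A normalized delay-based QoS graph can be sketched as a piecewise-linear function. The shape (flat at 1 inside the good zone, then falling linearly to 0) and the parameter names `good_until` and `zero_at` are assumptions chosen for illustration, not values from the slides.

```python
def latency_qos(delay, good_until, zero_at):
    """Hypothetical normalized QoS (0..1) as a function of output delay.

    Delays up to `good_until` are in the good zone (QoS 1.0); QoS then falls
    linearly and reaches 0.0 at `zero_at`.
    """
    if delay <= good_until:
        return 1.0
    if delay >= zero_at:
        return 0.0
    return (zero_at - delay) / (zero_at - good_until)
```

A drop-based graph could be written the same way with the percentage of delivered tuples on the horizontal axis.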
Aurora Storage Manager (ASM) – Queues management
There is one queue at the output of each box; this queue is shared by all successor boxes
Queues are stored in memory and on disk
Queues may change length
[Figure: queue organization. A queue read by boxes b1 and b2, laid out along a time axis, with the processed tuples marked.]
Scheduling in Aurora
The scheduler (and Aurora as a whole) aims to reduce overall tuple execution cost
It exploits two nonlinearities in tuple processing:
Interbox nonlinearity:
• Minimize tuple thrashing (if buffer space is insufficient, tuples have to be shuttled between memory and disk)
• Avoid copying data from output to buffer (the ASM can be bypassed when one box is scheduled right after another)
Intrabox nonlinearity:
• The cost of tuple processing may decrease as the number of available tuples in the queue increases
Aurora's approach: (1) have boxes queue as many tuples as possible, (2) process them at once (train scheduling), and (3) pass them to subsequent boxes without going to disk (superbox scheduling)
Two goals: (1) minimize the number of I/O operations and (2) minimize the number of box calls per tuple
[Figure: scheduler performance. Chart of time (ms, axis from 0 to 300) split into execution costs and scheduling overhead for three strategies: tuple at a time, trains, and superboxes.]
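The "box calls per tuple" goal can be made concrete with a toy comparison. Boxes are modelled as functions from a list of tuples to a list of tuples, which is an assumption of this sketch; pushing a whole train through the chain in memory also borrows the idea behind superbox scheduling.

```python
def run_tuple_at_a_time(boxes, tuples):
    """Push each tuple through every box separately; returns (results, box calls)."""
    calls = 0
    results = []
    for t in tuples:
        for box in boxes:
            t = box([t])[0]   # one box call per tuple per box
            calls += 1
        results.append(t)
    return results, calls


def run_trains(boxes, tuples):
    """Push the whole train of tuples through each box in one call."""
    calls = 0
    batch = list(tuples)
    for box in boxes:
        batch = box(batch)    # one box call per box for the whole train
        calls += 1
    return batch, calls
```

Both functions compute the same outputs; only the number of box invocations differs (tuples × boxes versus boxes).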
Priorities assignment in Scheduler
The latency of each output tuple is the sum of the tuple's processing delay and its waiting delay (the latter is primarily a function of scheduling)
The goal of the scheduler: to assign priorities to box outputs so as to maximize the overall QoS
The scheduler's approach has two aspects:
• state-based analysis, which assigns priorities to outputs and picks the output with the highest utility for scheduling
• feedback-based analysis, which observes the overall system and increases the priorities of outputs that are not doing well (based on the QoS graphs)
Load shedding
Reaction to overload
Drop is a system-level operator that randomly drops tuples from a stream at a specified rate
Two approaches:
1. Load shedding by dropping tuples
2. Load shedding by filtering tuples
Load shedding by dropping tuples:
• Reduces the amount of Aurora processing by dropping randomly selected tuples at strategic points in the network
Load shedding by filtering tuples:
• Idea: remove less important tuples rather than randomly chosen ones
• It uses value-based QoS information
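The two shedding styles can be contrasted in a short sketch. The function names `drop` and `shed_by_value`, the `utility` callback and the threshold are hypothetical; in Aurora the utility would come from the value-based QoS graph.

```python
import random


def drop(stream, rate, rng):
    """Random load shedding: discard each tuple with probability `rate`."""
    return [t for t in stream if rng.random() >= rate]


def shed_by_value(stream, utility, threshold):
    """Semantic load shedding: keep only tuples whose utility reaches the threshold."""
    return [t for t in stream if utility(t) >= threshold]
```

Passing an explicit `random.Random` instance, for example `drop(stream, 0.25, random.Random(42))`, keeps an experiment repeatable.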
Questions
1. Which of the following operators output tuples that have the same schema and values as input tuples?
a. Aggregate
b. BSort (x)
c. Filter (x)
d. Join
e. Map
f. Resample
g. Union (x)
2. What does Aurora's primary run-time architecture include?
a. Router
b. Storage manager (x)
c. Scheduler (x)
d. Box processor
e. QoS monitor (x)
f. Resample
g. Load shedder (x)
Three broad application types
Aurora addresses three broad application types in a single, unique framework:
1. Real-time monitoring applications continuously monitor the present state of the world and are, thus, interested in the most current data as it arrives from the environment. In these applications, there is little or no need (or time) to store such data.
2. Archival applications are typically interested in the past. They are primarily concerned with processing large amounts of finite data stored in a time-series repository.
3. Spanning applications involve both the present and past states of the world, requiring combining and comparing incoming live data and stored historical data. These applications are the most demanding as there is a need to balance real-time requirements with efficient processing of large amounts of disk-resident data.