Post on 13-Dec-2015
By:
Joseph M. Hellerstein
Peter J. Haas
Helen J. Wang
Presented by:
Calvin R Noronha (1000578539)
Deepak Anand (1000603813)
AGENDA
• Motivation
• Online Aggregation
• Basic Approach
• Goals
• Building an Online Aggregation system
• Optimization
• Running confidence intervals
• Conclusion
• Future work
Motivation
Aggregation in traditional databases
• Query execution involves a long delay, and the user is forced to wait without feedback until the query completes.
• Users want to see the aggregate information right away.
• Aggregation queries are typically used to get a "rough picture", yet they are computed with painstaking precision.
This paper suggests the following change: perform aggregation online so that:
• Progress can be observed.
• Execution of the queries can be controlled on the fly.
An Example
Consider the following example:
SELECT AVG(final_grade)
FROM grades
WHERE course_name = 'CS186'
If there is no index on the course_name attribute, then this query scans the entire grades table before returning the result.
AVG
--------------
| 2.631046 |
--------------
An alternative approach
• Running aggregate: an estimate of the final result based on the records retrieved so far.
• Running confidence interval: e.g., 2.6336 +/- 0.0652539 with 95% probability.
• Progress bar
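A minimal Python sketch of how a running aggregate and its running confidence interval could be produced one record at a time. It assumes a large-sample (normal approximation) interval and records arriving in random order; the function name `running_avg` and the made-up grade data are illustrative only, not the paper's implementation.

```python
import math
import random
from statistics import NormalDist

def running_avg(values, p=0.95):
    """Yield (n, running_mean, half_width) after every record.

    half_width comes from a normal approximation, so it is only
    meaningful for reasonably large n and random-order input.
    """
    z = NormalDist().inv_cdf((1 + p) / 2)   # about 1.96 for p = 0.95
    n, s1, s2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        s1 += x          # running sum
        s2 += x * x      # running sum of squares
        mean = s1 / n
        # Sample variance from the two sums; clamp tiny negative rounding errors.
        var = max(0.0, (s2 - n * mean * mean) / (n - 1)) if n > 1 else 0.0
        yield n, mean, z * math.sqrt(var / n)

random.seed(0)
grades = [random.choice([1.0, 2.0, 3.0, 4.0]) for _ in range(10_000)]  # made-up data
for n, mean, eps in running_avg(grades):
    if n % 2000 == 0:
        print(f"{n:>6} records: {mean:.4f} +/- {eps:.4f} with 95% probability")
```

The half-width shrinks roughly as 1/sqrt(n), which is why an early estimate is often good enough to stop on.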
Online Aggregation Interface with Groups
• If the records are retrieved in random order, a good approximate result can be obtained.
• We can stop sampling once the length of the confidence interval becomes sufficiently small.
• Consider a GROUP BY query with 6 groups in the output.
• The user is presented with 6 outputs and 6 "stop-sign" buttons.
• The stopping condition can be set on the fly.
• Easy to understand for a non-statistical user.
Stop Button
Usability goals
• Continuous observation: users can observe the processing in the GUI and get a sense of the current level of precision.
• Control of time/precision: users can terminate processing at any time, at a fine granularity (a trade-off between time and precision).
• Control of fairness/partiality: users can control the relative rate at which different running aggregates are updated.
Performance goals
• Minimum time to accuracy: minimize the time required to produce a useful estimate of the final answer.
• Minimum time to completion: minimize the time required to produce the final answer.
• Pacing: the running aggregates are updated at a regular rate, to guarantee a smooth and continuously improving display.
Building an Online Aggregation System
There are two approaches that can be taken:
1. A naive approach:
• Trivial to implement, with no modification to POSTGRES.
• User-defined functions can be written in C.
• Cannot be used with a GROUP BY clause.
SELECT running_avg(final_grade), running_confidence(final_grade),
running_interval(final_grade)
FROM grades
2. Modifying the DBMS:
• Online aggregation is difficult to implement as a user-level addition.
• Instead, modify the database engine to support online aggregation.
Random Access to Data
Estimates of the running aggregates are accurate when records are retrieved in random order.
1. Heap Scans
• Simple heap scans can be effective with traditional heap-file access methods, where records are stored in unspecified order.
• A different access method is needed when the aggregate attributes are correlated with the order in which the heap was built.
2. Index Scans
• Can be used if the aggregate attributes are not the indexed attributes.
3. Sampling from Indices
• Techniques for pseudo-random sampling from various index structures can be used. [Olken's work]
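To illustrate why retrieval order matters, here is a small Python sketch comparing an early running average over a heap whose physical order is correlated with the aggregate attribute against the same data in random order. The data and the helper `running_means` are made up for this illustration.

```python
import random

def running_means(values):
    """Yield the running mean after each value."""
    total = 0.0
    for n, x in enumerate(values, start=1):
        total += x
        yield total / n

random.seed(1)
# Hypothetical heap whose storage order is correlated with the aggregate
# attribute: low grades were inserted first, high grades last.
clustered = sorted(random.uniform(0.0, 4.0) for _ in range(10_000))
shuffled = clustered[:]
random.shuffle(shuffled)

true_mean = sum(clustered) / len(clustered)
early_clustered = list(running_means(clustered))[999]  # estimate after 1,000 rows
early_shuffled = list(running_means(shuffled))[999]

print(f"true mean          {true_mean:.3f}")
print(f"clustered, n=1000  {early_clustered:.3f}")
print(f"shuffled,  n=1000  {early_shuffled:.3f}")
```

With the correlated order, the first 1,000 rows contain only the lowest values, so the early estimate is far from the final answer; the shuffled scan is already close.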
Non-blocking/Fair access GROUP BY and DISTINCT
• Groups should receive updates in a fair manner.
• Solution: sorting? No, because sorting blocks.
• Must use hash-based techniques.
  Pros: non-blocking.
  Cons: does not perform well as the number of groups grows.
• Solution: hybrid hashing.
Optimized version: Hybrid cache
For DISTINCT columns, a similar hashing technique can be used.
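A minimal sketch of hash-based, non-blocking grouping in Python: every input row updates its group's entry in a hash table and an updated estimate is emitted immediately, with no sort step. The function name `online_group_avg` and the sample rows are assumptions for illustration; the sketch keeps the whole table in memory and does not show hybrid hashing.

```python
from collections import defaultdict

def online_group_avg(rows):
    """Hash-based GROUP BY AVG that emits an update after every input row.

    rows: iterable of (group_key, value) pairs. Nothing is sorted or
    buffered, so the first estimate appears after the first row.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for key, value in rows:
        sums[key] += value
        counts[key] += 1
        yield key, sums[key] / counts[key], counts[key]

rows = [("CS186", 3.0), ("CS162", 2.0), ("CS186", 4.0), ("CS162", 3.0)]
for key, avg, n in online_group_avg(rows):
    print(key, avg, n)
```

The table holds one entry per group, which is where the cost grows as the number of groups grows.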
Index Striding
Updates for the groups with few members will be very infrequent.
• For a fair GROUP BY, read tuples in round-robin fashion (a tuple from group 1, a tuple from group 2, …).
• This is supported by a technique called index striding.
What is Index Striding?
Additional advantages
• The group updating rate can be controlled.
• Processing of a particular group can be stopped.
POSTGRES with index striding
Speed control
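The round-robin idea behind index striding can be sketched in Python as follows. Each group's iterator stands in for a range scan over an index on the grouping column; the function name `index_stride` and the toy data are hypothetical.

```python
from collections import deque

def index_stride(group_cursors):
    """Yield (group, value) pairs round-robin, one tuple per group per pass.

    group_cursors maps a group key to an iterable of that group's values,
    standing in for a per-group range scan on an index.
    """
    active = deque((key, iter(cursor)) for key, cursor in group_cursors.items())
    while active:
        key, cursor = active.popleft()
        try:
            value = next(cursor)
        except StopIteration:
            continue              # group exhausted: drop it from the rotation
        yield key, value
        active.append((key, cursor))

cursors = {"big": [1, 2, 3, 4], "small": [10]}
print(list(index_stride(cursors)))
# [('big', 1), ('small', 10), ('big', 2), ('big', 3), ('big', 4)]
```

Even though "small" has a single member, it is visited on the first pass instead of waiting behind the larger group.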
Non-blocking Join Algorithms
For interactive display of online aggregation, avoid algorithms that block.
• Sort-merge join: unacceptable, as sorting is a blocking operation.
• Merge join: OK, but produces sorted output.
• Hybrid hash join: not good if the inner relation is large.
• Nested-loops join: never blocks, but it is too slow when the inner relation is large and unindexed.
An optimizer must be used to choose between these strategies.
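As a rough illustration of a join that does not block, here is a pipelined nested-loops join written as a Python generator: each match is emitted as soon as it is found. The function and the toy relations are assumptions for this sketch, not the POSTGRES implementation.

```python
def nested_loops_join(outer, inner, key_outer, key_inner):
    """Pipelined nested-loops join: yields each match as soon as it is found."""
    for o in outer:          # outer may be a lazy, random-order scan
        for i in inner:      # inner must be re-scannable (here: a list)
            if key_outer(o) == key_inner(i):
                yield o, i

students = [(1, "Ann"), (2, "Bob")]
grades = [(1, 3.5), (2, 2.0), (1, 4.0)]
for pair in nested_loops_join(students, grades, lambda s: s[0], lambda g: g[0]):
    print(pair)
```

The inner list is rescanned for every outer tuple, which is the cost the slide warns about when the inner relation is large and has no index.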
Optimization
Avoid sorting unless explicitly requested by the user.
Blocking sub-operations should be charged an appropriate cost.
Cost function = f(t_o) + g(t_d)
There are 2 components in the cost function:
• dead time (t_d): time spent doing "invisible" work
• output time (t_o): time spent producing output
Preference is given to plans that maximize user control (e.g., index striding).
Extended aggregate functions
Standard set of aggregate functions must be extended
Aggregate functions must be written that provide running estimates.
Running computation:
• SUM, COUNT, AVG: straightforward
• VAR, STD DEV: can be computed incrementally with one-pass algorithms
Aggregate functions returning running confidence intervals must also be defined.
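One way to keep a running VAR and STD DEV is Welford's one-pass update, sketched below in Python. The slides do not say which algorithm the authors used, so treat this as one reasonable choice; the class name `RunningStats` is hypothetical.

```python
import math

class RunningStats:
    """Running COUNT, AVG, VAR and STD DEV using Welford's one-pass update."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; defined as 0.0 until two values have been seen.
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

    @property
    def stddev(self):
        return math.sqrt(self.variance)

stats = RunningStats()
for grade in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(grade)
    print(stats.n, round(stats.mean, 3), round(stats.stddev, 3))
```

Each update touches only three numbers, so the estimate can be refreshed after every tuple without rescanning earlier data.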
API
The current API uses built-in methods, e.g.:
stopGroup(cursor, groupval)
speedUpGroup(cursor, groupval)
slowDownGroup(cursor, groupval)
setSkipFactor(cursor_name, integer)
Skip Factor
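To make the control calls concrete, here is a hypothetical Python stand-in for such a cursor. It is not the POSTGRES API: the class `OnlineCursor`, its snake_case method names, and the integer per-group weights are assumptions, and the sketch assumes that a skip factor of k means the display is refreshed once every k tuples.

```python
class OnlineCursor:
    """Hypothetical online-aggregation cursor with per-group controls."""

    def __init__(self, group_cursors):
        self._cursors = {g: iter(rows) for g, rows in group_cursors.items()}
        self._weights = {g: 1 for g in self._cursors}  # tuples fetched per pass
        self._skip = 1                                  # tuples per display update

    # The methods below mirror stopGroup, speedUpGroup, slowDownGroup, setSkipFactor.
    def stop_group(self, group):
        self._cursors.pop(group, None)

    def speed_up_group(self, group):
        self._weights[group] += 1

    def slow_down_group(self, group):
        self._weights[group] = max(1, self._weights[group] - 1)

    def set_skip_factor(self, k):
        self._skip = max(1, k)

    def updates(self):
        """Yield {group: (count, running_avg)} once every skip-factor tuples."""
        sums, counts, seen = {}, {}, 0
        while self._cursors:
            for group in list(self._cursors):
                for _ in range(self._weights[group]):
                    cursor = self._cursors.get(group)
                    if cursor is None:       # stopped by the caller mid-pass
                        break
                    try:
                        value = next(cursor)
                    except StopIteration:
                        del self._cursors[group]
                        break
                    sums[group] = sums.get(group, 0.0) + value
                    counts[group] = counts.get(group, 0) + 1
                    seen += 1
                    if seen % self._skip == 0:
                        yield {g: (counts[g], sums[g] / counts[g]) for g in counts}

cur = OnlineCursor({"A": [1.0, 2.0, 3.0, 4.0], "B": [10.0, 20.0, 30.0, 40.0]})
cur.set_skip_factor(2)          # refresh the display every 2 tuples
cur.speed_up_group("B")         # B now receives two tuples per pass
for snapshot in cur.updates():
    print(snapshot)
    if snapshot["A"][0] == 2:
        cur.stop_group("A")     # the user presses A's stop button
```

In this sketch a group at weight 1 cannot be slowed further; slowing it relative to the others would mean speeding the others up.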
Statistical Issues
Running confidence interval
• Given an estimate, with probability p we are within ε of the right answer μ.
• A large value of ε means that the records seen so far may not be sufficiently representative of the entire database, so the current estimate may be far from the final result.
Types of running confidence intervals:
• Conservative confidence intervals: valid for any n >= 1 (n = number of tuples retrieved); the interval contains the final answer with probability at least p [based on Hoeffding's inequality].
• Large-sample confidence intervals
• Deterministic confidence intervals
Running confidence interval can be dynamically adjusted depending on the value of n.
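For a running AVG over values known to lie in a range [lo, hi], Hoeffding's inequality gives a conservative half-width of (hi - lo) * sqrt(ln(2 / (1 - p)) / (2n)). The Python sketch below evaluates that bound; it assumes independent random samples and known bounds, and the function names are illustrative.

```python
import math

def hoeffding_half_width(n, p, lo, hi):
    """Conservative half-width for the mean of n values bounded in [lo, hi].

    From P(|mean - mu| >= eps) <= 2 * exp(-2 * n * eps**2 / (hi - lo)**2),
    solved for eps at confidence level p.
    """
    return (hi - lo) * math.sqrt(math.log(2.0 / (1.0 - p)) / (2.0 * n))

def tuples_needed(eps, p, lo, hi):
    """Smallest n for which the conservative half-width is at most eps."""
    return math.ceil((hi - lo) ** 2 * math.log(2.0 / (1.0 - p)) / (2.0 * eps ** 2))

# Grades assumed to lie in [0, 4]; 95% confidence.
for n in (100, 1_000, 10_000):
    print(n, round(hoeffding_half_width(n, 0.95, 0.0, 4.0), 4))
print("tuples for +/- 0.05:", tuples_needed(0.05, 0.95, 0.0, 4.0))
```

Because the bound ignores the observed variance, it is usually wider than a large-sample interval at the same n, which is the price of holding for every n >= 1.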
Performance Issues
Conclusion
An interactive, intuitive and user-controllable approach to aggregation is needed.
This can be achieved by significant extensions to the database engine.
These extensions satisfy the usability and performance goals.
They also make it possible to produce statistical confidence intervals for running aggregates.
Future work
Better UI
Nested Queries
Control without Indices
Checkpointing / Continuation
TIME TO ASK QUESTIONS