BUILDING A DATABASE SYSTEM FOR ORDER
New England Database Seminars April 2002
Alberto Lerner – ENST Paris
Dennis Shasha – NYU
{lerner,shasha}@cs.nyu.edu
NEDS April 2002 – Lerner and Shasha
Agenda
Motivation
SQL + Order
Transformations
Conclusion
Motivation: The need for ordered data
Some queries rely on order
Examples:
Moving averages
Top N
Rank
“SQL can handle it.” Can it really?
Motivation: Moving Averages (algorithmically linear)
Sales(month, total)
SELECT t1.month+1 AS forecastMonth,
       (t1.total + t2.total + t3.total)/3 AS 3MonthMovingAverage
FROM Sales AS t1, Sales AS t2, Sales AS t3
WHERE t1.month = t2.month - 1
  AND t1.month = t3.month - 2
Can optimizer make a 3-way (in general, n-way) join linear time?
Ref: Data Mining and Statistical Analysis Using SQL, Trueblood and Lovett, Apress, 2001
Motivation: Top N
Employee(Id, salary)
SELECT DISTINCT count(*), t1.salary
FROM Employee AS t1, Employee AS t2
WHERE t1.salary <= t2.salary
GROUP BY t1.salary
HAVING count(*) <= N
How many elements of cross-product have salaries at least as large as t1.salary? Will optimizer see essential sort-count trick?
Ref: SQL for Smarties, Joe Celko, Morgan Kaufmann, 1995
Motivation: Problems Extending SQL with Order
Queries are hard to read
Cost of execution is often non-linear (would not pass a basic algorithms course)
Few operators preserve order, so optimization is hard.
Agenda
Motivation
SQL + Order
Transformations
Conclusion
SQL + Order: Desirable Features
Express order-dependent predicates and clauses in a readable, clear way
Make optimization opportunities explicit (by getting rid of complex idioms, see above)
Execution in linear (or n log n) time when possible
SQL + Order: Three steps in solution
1. Give SQL vector-oriented semantics: the database is a set of array-tables (“arrables”); variables in the queries no longer refer to a single tuple at a time, but to a whole column vector
2. Provide new vector-to-vector functions that support order-based manipulations of column vectors
3. Streaming: new data may need special treatment.
SQL + Order: Moving Averages
Sales(month, total)
SELECT month, avgs(8, total)
FROM Sales ASSUMING ORDER month
Execution (Sales is an arrable):
1. FROM clause – enforces the order in the ASSUMING clause
2. SELECT clause – for each month yields the moving average (window size 8) ending at that month. No 8-way join.
avgs: vector-to-vector function, order-dependent and size-preserving
ASSUMING ORDER: the order to be used by vector-to-vector functions
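The execution steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the name avgs mirrors the slide, while the handling of the first few positions (averaging over however many values are available) is an assumption, since the slide does not say what happens before a full window exists.

```python
from collections import deque
from typing import List, Sequence


def avgs(window: int, values: Sequence[float]) -> List[float]:
    """Moving average ending at each position (size-preserving, one pass)."""
    if window <= 0:
        raise ValueError("window must be positive")
    out: List[float] = []
    buf: deque = deque()
    running = 0.0
    for v in values:
        buf.append(v)
        running += v
        if len(buf) > window:
            running -= buf.popleft()  # slide the window forward
        # Assumption: early positions average over the values seen so far.
        out.append(running / len(buf))
    return out


# Rows of Sales(month, total), deliberately out of order.
sales = [(3, 30.0), (1, 10.0), (2, 20.0), (4, 40.0)]

ordered = sorted(sales)  # step 1: ASSUMING ORDER month
months = [m for m, _ in ordered]
totals = [t for _, t in ordered]
result = list(zip(months, avgs(2, totals)))  # step 2: SELECT month, avgs(2, total)
```

Each value enters and leaves the window once, so the work after the initial sort is linear in the number of rows, with no self-join.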
SQL + Order: Top N
Employee(ID, salary)
SELECT first(N, salary)
FROM Employee ASSUMING ORDER Salary
first: vector-to-vector function, order-dependent and non size-preserving
Execution:
1. FROM clause – orders the arrable by Salary
2. SELECT clause – applies first() to the ‘salary’ vector, yielding the first N values of that vector given the order. Could get the top earning IDs by saying first(N, ID).
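A rough Python sketch of these two steps follows. The sample rows are made up, and the descending sort direction is an assumption: the slide's ASSUMING ORDER clause names no direction, but a top-N reading needs the largest salaries first.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def first(n: int, values: Sequence[T]) -> List[T]:
    """Return the first n entries of an already ordered vector."""
    return list(values[: max(n, 0)])


# Hypothetical rows of Employee(ID, salary).
employees = [(101, 5200), (102, 7100), (103, 6400), (104, 4800)]

# Step 1: order the arrable by salary (descending is an assumption).
ordered = sorted(employees, key=lambda row: row[1], reverse=True)
ids = [emp_id for emp_id, _ in ordered]
salaries = [salary for _, salary in ordered]

# Step 2: apply first() to whichever column vector is wanted.
top_salaries = first(2, salaries)
top_ids = first(2, ids)
```

Because every column of the arrable shares the same order, first(N, ID) and first(N, salary) line up position by position.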
SQL + Order: Ranking
SalesReport(salesPerson, territory, total)
SELECT territory, salesPerson, total, rank(total)
FROM SalesReport
WHERE rank(total) < N
rank: vector-to-vector function, non order-dependent and size-preserving
Execution:
1. FROM clause – ASSUMING is NOT needed.
2. rank is applied to the ‘total’ vector and maps each position into an integer.
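One way to picture rank as a size-preserving vector function is the sketch below. The slide does not define the numbering, so two assumptions are made here: rank 0 goes to the largest total, and ties are broken by input position. The sample rows are hypothetical.

```python
from typing import List, Sequence


def rank(values: Sequence[float]) -> List[int]:
    """Map each position to its rank; 0 = largest value (assumed convention)."""
    # Python's sort is stable, so equal values keep their input order.
    by_value = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for r, position in enumerate(by_value):
        ranks[position] = r
    return ranks


# Hypothetical rows of SalesReport(salesPerson, territory, total).
report = [
    ("ann", "east", 300),
    ("bob", "west", 500),
    ("cy", "east", 100),
    ("dee", "north", 400),
]

totals = [total for _, _, total in report]
ranks = rank(totals)

N = 2
# WHERE rank(total) < N, keeping the arrable's existing row order.
selected = [
    (territory, person, total, r)
    for (person, territory, total), r in zip(report, ranks)
    if r < N
]
```

The output vector has one entry per input position, which is why the result can be used both in the SELECT list and in the WHERE clause without any prior ordering of the arrable.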
SQL + Order: Vector-to-Vector Functions
order-dependent, size-preserving: prev, next, $, [], avgs(*), prds(*), sums(*), deltas(*), ratios(*), reverse, …
order-dependent, non size-preserving: drop, first, last
non order-dependent, size-preserving: rank, tile
non order-dependent, non size-preserving: min, max, avg, count
SQL + Order: Complex queries: Best spread
In a given day, what would be the maximum difference between a buying and selling point of each security?
Ticks(ID, price, tradeDate, timestamp, …)
SELECT ID, max(price - mins(price))
FROM Ticks ASSUMING ORDER timestamp
WHERE tradeDate = '99/99/99'
GROUP BY ID
Execution:
1. For each security, compute the running minimum vector for price and then subtract it from the price vector itself; the result is a vector of spreads.
2. Note that max(price) - min(price) would overstate the spread, because the minimum may occur after the maximum and a security must be bought before it is sold.
[Figure: chart with the labels max, min, running min, and best spread]
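The best-spread computation can be sketched in Python as follows. The tick values are invented for illustration, and mins is written here as a running minimum, which is how the execution notes describe it.

```python
from collections import defaultdict
from itertools import accumulate
from typing import Dict, List, Sequence


def mins(values: Sequence[int]) -> List[int]:
    """Running minimum: entry i is the smallest value in values[0..i]."""
    return list(accumulate(values, min))


def best_spread(prices: Sequence[int]) -> int:
    """max(price - mins(price)) for one security, prices in time order."""
    return max(p - m for p, m in zip(prices, mins(prices)))


# Hypothetical rows of Ticks(ID, price, timestamp) for a single trade date.
ticks = [
    ("IBM", 12, 1), ("IBM", 15, 2), ("IBM", 9, 3), ("IBM", 11, 4),
    ("HP", 7, 1), ("HP", 5, 2), ("HP", 8, 3),
]

# ASSUMING ORDER timestamp, then GROUP BY ID (order is kept inside each group).
prices_by_id: Dict[str, List[int]] = defaultdict(list)
for sec_id, price, _ts in sorted(ticks, key=lambda t: t[2]):
    prices_by_id[sec_id].append(price)

spreads = {sec_id: best_spread(p) for sec_id, p in prices_by_id.items()}
naive_ibm = max(prices_by_id["IBM"]) - min(prices_by_id["IBM"])
```

For the IBM prices the running-minimum version gives 3 (buy at 12, sell at 15), whereas max minus min gives 6, a spread that would require selling at 15 before buying at 9.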
SQL + Order: Complex queries: Crossing averages part I
When does the 21-day average cross the 5-month average?
Market(ID, closePrice, tradeDate, …)
TradedStocks(ID, Exchange, …)
INSERT INTO temp FROM
SELECT ID, tradeDate,
       avgs(21 days, closePrice) AS a21,
       avgs(5 months, closePrice) AS a5,
       prev(avgs(21 days, closePrice)) AS pa21,
       prev(avgs(5 months, closePrice)) AS pa5
FROM TradedStocks NATURAL JOIN Market
     ASSUMING ORDER tradeDate
GROUP BY ID
SQL + Order: Complex queries: Crossing averages part I
Execution:
1. FROM clause – order-preserving join
2. GROUP BY clause – groups are defined based on the value of the ID column
3. SELECT clause – functions are applied; non-grouped columns become vector fields so that target cardinality is met. Violates first normal form.
[Figure: before grouping, groups in ID and a non-grouped column; after grouping, the grouped ID and the non-grouped column as a vector field, giving two columns with the same cardinality]
SQL + Order: Complex queries: Crossing averages part II
Get the result from temp, which is not in first normal form
SELECT ID, tradeDate
FROM flatten(temp)
WHERE a21 > a5 AND pa21 <= pa5
Execution:
1. FROM clause – flatten transforms temp into a first normal form relation (for row r, every vector field in r MUST have the same cardinality). Could have been placed at the end of the previous query.
2. Standard query processing after that.
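The flatten step and the crossing test can be sketched in Python. Everything below is illustrative: the averages are hand-written numbers rather than real 21-day and 5-month averages, rows are modeled as dictionaries, and prev is assumed to yield no value (None) at the first position, which the filter treats like an SQL NULL.

```python
from typing import Any, Dict, List, Sequence


def prev(values: Sequence[Any]) -> List[Any]:
    """Shift a vector by one position; the first entry has no predecessor."""
    return [None, *values][: len(values)]


def flatten(nested: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Expand rows with vector fields into first normal form rows."""
    flat = []
    for row in nested:
        vectors = {k: v for k, v in row.items() if isinstance(v, list)}
        lengths = {len(v) for v in vectors.values()}
        if len(lengths) > 1:
            raise ValueError("vector fields in a row must have the same cardinality")
        n = lengths.pop() if lengths else 1
        for i in range(n):
            flat.append({k: (v[i] if k in vectors else v) for k, v in row.items()})
    return flat


# Hypothetical contents of temp: one row per ID, the other fields are vectors.
a21 = [10, 11, 13, 12]
a5 = [12, 12, 12, 12]
temp = [{
    "ID": "ACME",
    "tradeDate": ["2002-04-01", "2002-04-02", "2002-04-03", "2002-04-04"],
    "a21": a21, "a5": a5, "pa21": prev(a21), "pa5": prev(a5),
}]

# WHERE a21 > a5 AND pa21 <= pa5, skipping rows without a previous value.
crossings = [
    (r["ID"], r["tradeDate"])
    for r in flatten(temp)
    if r["pa21"] is not None and r["pa5"] is not None
    and r["a21"] > r["a5"] and r["pa21"] <= r["pa5"]
]
```

After flattening, the filter is ordinary row-at-a-time processing: a crossing is a row where the short average is above the long one today but was not above it on the previous row.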
SQL + Order: Related Work: Research
SEQUIN – Seshadri et al.
  Sequences are first-class objects
  Difficult to mix tables and sequences.
SRQL – Ramakrishnan et al.
  Elegant algebra and language
  No work on transformations.
SQL-TS – Sadri et al.
  Language for finding patterns in sequence
  But: Not everything is a pattern!
SQL + Order: Related Work: Products
RISQL – Red Brick
  Some vector-to-vector, order-dependent, size-preserving functions
  Low-hanging fruit approach to language design.
Analytic Functions – Oracle 9i
  Quite complete set of vector-to-vector functions
  But: Can only be used in the select clause; poor optimization (our preliminary study)
KSQL – Kx Systems
  Arrable extension to SQL but syntactically incompatible.
  No cost-based optimization.
Agenda
Motivation
SQL + Order
Transformations
Conclusion
Transformations: Early sorting + order preserving operators
SELECT ts.ID, ts.Exchange, avgs(10, hq.ClosePrice)
FROM TradedStocks AS ts NATURAL JOIN HistoricQuotes AS hq
     ASSUMING ORDER hq.TradeDate
GROUP BY ID
[Figure: four alternative query plans built from the operators sort, op (order-preserving join), g-by, and avgs]
(1) Sort then join preserving order
(2) Preserve existing order
(3) Join then sort before grouping
(4) Join then sort after grouping
Transformations: Early sorting + order preserving operators
[Figure: time in milliseconds (y-axis, 0 to 140) versus number of traded securities (x-axis; total of 581 securities and 127062 quotes). Series: Sort before op join; Existing order; Sort after a reg join; Sort after reg join and g-by]
Transformations: UDFs evaluation order
Gene(geneId, seq)
SELECT t1.geneId, t2.geneId, dist(t1.seq, t2.seq)
FROM Gene AS t1, Gene AS t2
WHERE dist(t1.seq, t2.seq) < 5
  AND posA(t1.seq, t2.seq)
posA asks whether the sequences have nucleotide A in the same position. dist gives the edit distance between two sequences.
[Figure: three plans. (1) and (2) apply posA and dist in opposite orders; (3) switches dynamically between (1) and (2) depending on the execution history]
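Switching predicate order based on execution history could look roughly like the Python sketch below. It is not the authors' mechanism: the warm-up phase, the fixed per-call cost figures, and the ranking formula (observed pass rate minus one, divided by cost, lowest first) are all assumptions made for illustration, and a Hamming distance on equal-length strings stands in for the real edit distance.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Sequence, Tuple


@dataclass
class TrackedPredicate:
    name: str
    fn: Callable[[str, str], bool]
    cost: float       # assumed relative cost per call (hypothetical figure)
    calls: int = 0
    passes: int = 0

    def __call__(self, a: str, b: str) -> bool:
        self.calls += 1
        ok = self.fn(a, b)
        self.passes += int(ok)
        return ok

    def rank(self) -> float:
        # Lower is better: cheap predicates that reject a lot go first.
        selectivity = self.passes / self.calls if self.calls else 1.0
        return (selectivity - 1.0) / self.cost


def adaptive_filter(pairs: Sequence[Tuple[str, str]],
                    preds: List[TrackedPredicate],
                    warmup: int = 3) -> List[Tuple[str, str]]:
    kept = []
    for n, (a, b) in enumerate(pairs):
        if n < warmup:
            keep = all([p(a, b) for p in preds])   # sample every predicate
        else:
            ordered = sorted(preds, key=TrackedPredicate.rank)
            keep = all(p(a, b) for p in ordered)   # short-circuits on failure
        if keep:
            kept.append((a, b))
    return kept


def hamming(a: str, b: str) -> int:  # stand-in for edit distance
    return sum(x != y for x, y in zip(a, b))


def pos_a(a: str, b: str) -> bool:   # both have 'A' at some same position
    return any(x == y == "A" for x, y in zip(a, b))


seqs = ["ACGTAC", "ACGTTT", "TTTTTT", "CCGTAC", "GGGGGA"]
pairs = list(combinations(seqs, 2))
dist_pred = TrackedPredicate("dist", lambda a, b: hamming(a, b) < 5, cost=10.0)
posa_pred = TrackedPredicate("posA", pos_a, cost=1.0)
result = adaptive_filter(pairs, [dist_pred, posa_pred])
```

On this toy input the three warm-up pairs show both predicates passing equally often, so the cheaper posA is ranked first afterwards and the expensive dist stand-in is skipped for every remaining pair that posA rejects.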
Transformations: UDFs evaluation order
[Figure: time in milliseconds (y-axis, ticks from 1 to 1000000 in powers of ten) versus number of sequences (x-axis: 10, 100, 1000). Series: dist then pos; pos then dist]
Transformations: Order preserving joins
select lineitem.orderid, avgs(10, lineitem.qty), lineitem.lineid
from order, lineitem assuming order lineid
where order.date > 45
  and order.date < 55
  and lineitem.orderid = order.orderid
• Basic strategy 1: restrict based on date. Create hash on order. Run through lineitem, performing the join and pulling out the qty.
• Basic strategy 2: Arrange for lineitem.orderid to be an index into order. Then restrict order based on date giving a bit vector. The bit vector, indexed by lineitem.orderid, gives the relevant lineitem rows. The relevant order rows are then fetched using the surviving lineitem.orderid.
Strategy 2 is often 3-10 times faster.
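Strategy 2 can be sketched with plain Python lists. The data is invented, and the layout (order stored as arrays positioned by orderid, lineitem already sorted by lineid) is an assumption that mirrors "arrange for lineitem.orderid to be an index into order". The moving average over qty is left out to keep the focus on the join.

```python
from typing import List, Tuple

# order table stored column-wise; position in the array is the orderid.
order_date: List[int] = [40, 50, 60, 47, 52]

# lineitem arrable, already in lineid order: (lineid, orderid, qty).
lineitem: List[Tuple[int, int, int]] = [
    (1, 3, 5), (2, 0, 7), (3, 1, 2), (4, 4, 9), (5, 2, 1), (6, 1, 4),
]

# Step 1: restrict order on date, producing one bit per order row.
bits: List[bool] = [45 < d < 55 for d in order_date]

# Step 2: probe the bit vector with lineitem.orderid; a single pass over
# lineitem keeps the surviving rows in their original lineid order.
surviving = [row for row in lineitem if bits[row[1]]]

# Step 3: fetch the matching order rows by direct indexing.
joined = [
    (orderid, qty, lineid, order_date[orderid])
    for lineid, orderid, qty in surviving
]
qty_vector = [qty for _, qty, _, _ in joined]  # input to avgs(10, qty)
```

No hash table is built and no sort is needed afterwards, since the scan over lineitem never disturbs its order.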
Transformations: Building Blocks
Order optimization
  Simmen et al. '96 – push-down sorts over joins, and combining and avoiding sorts
Order preserving operators
  KSQL – joins on vector
  Claussen et al. '00 – OP hash-based join
Push-down aggregating functions
  Chaudhuri and Shim '94, Yan and Larson '94 – evaluate aggregation before joins
UDF evaluation
  Hellerstein and Stonebraker '93 – evaluate UDF according to its ((output/input) - 1)/cost per tuple
  Porto et al. '00 – take correlation into account
Agenda
Motivation
SQL + Order
Transformations
Conclusion
Conclusion
Arrable-based approach to ordered databases may be scary (dependency on order, vector-to-vector functions) but it’s expressive and fast.
SQL extension that includes order is possible and reasonably simple.
Optimization possibilities are vast.