CS561 - XJoin 1
XJoin: A Reactively-Scheduled Pipelined Join Operator
IEEE Data Engineering Bulletin, 2000, by Tolga Urhan and Michael J. Franklin
Goal of XJoin
Efficiently evaluate equi-join in online query processing over distributed data sources
Optimization objectives:
- Small memory footprint
- Fast initial result delivery
- Hiding intermittent delays in data arrival
Outline
Hash Join History
Motivation of XJoin
Challenges in Developing XJoin
Three Stages of XJoin
Preventing Duplicates
Experimental Results
Conclusion
Classic Hash Join
[Diagram: 1. Build — R tuples are hashed into an in-memory table by key (key1 … key5); 2. Probe — S tuples 1-5 probe the table.]
Two phases: build, then probe. Only one table (the build input) is hashed in memory.
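The two phases can be sketched in Python (an illustrative sketch, not the paper's code; `classic_hash_join` and its arguments are hypothetical names):

```python
from collections import defaultdict

def classic_hash_join(build_rows, probe_rows, key):
    # 1. Build: hash the (smaller) build table entirely in memory.
    table = defaultdict(list)
    for r in build_rows:
        table[key(r)].append(r)
    # 2. Probe: stream the other table past the in-memory hash table.
    for s in probe_rows:
        for r in table.get(key(s), ()):
            yield (r, s)

# Join R and S on their first attribute.
R = [(1, "a"), (2, "b")]
S = [(1, "x"), (3, "y")]
matches = list(classic_hash_join(R, S, key=lambda t: t[0]))  # [((1, 'a'), (1, 'x'))]
```

Note that no output can appear until the build table has been read in full, which is exactly the blocking behavior the later pipelined variants remove.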
Hybrid Hash Join
One table is hashed into partitions, kept partly in memory and partly on disk.
G. Graefe, "Query Evaluation Techniques for Large Databases", ACM Computing Surveys, 1993.
[Diagram: R tuples are hashed into buckets; buckets i … j reside on disk while buckets n … m stay in memory. S tuples probe the memory-resident buckets.]
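A minimal sketch of the hybrid idea, with disk I/O simulated by ordinary dicts; the names and the one-resident-partition policy are my assumptions, not Graefe's formulation:

```python
from collections import defaultdict

NUM_PARTS = 4      # number of hash partitions (illustrative choice)
IN_MEMORY = {0}    # partitions kept memory-resident; the rest spill to "disk"

def hybrid_hash_join(r_rows, s_rows, key):
    mem = defaultdict(list)      # in-memory build buckets, keyed by join key
    disk_r = defaultdict(list)   # spilled R partitions (disk simulated by dicts)
    disk_s = defaultdict(list)   # spilled S partitions
    for r in r_rows:             # build: hash R, spilling non-resident partitions
        p = key(r) % NUM_PARTS
        if p in IN_MEMORY:
            mem[key(r)].append(r)
        else:
            disk_r[p].append(r)
    out = []
    for s in s_rows:             # probe: memory-resident partitions join now
        p = key(s) % NUM_PARTS
        if p in IN_MEMORY:
            out += [(r, s) for r in mem.get(key(s), ())]
        else:                    # spilled partitions are joined in a second pass
            disk_s[p].append(s)
    for p, rs in disk_r.items(): # second pass: classic hash join per partition
        table = defaultdict(list)
        for r in rs:
            table[key(r)].append(r)
        for s in disk_s.get(p, ()):
            out += [(r, s) for r in table.get(key(s), ())]
    return out
```

Matches from the memory-resident partition are emitted during the first pass; only the spilled partitions pay the cost of a second pass.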
Symmetric Hash Join (Pipelined)
Both tables are hashed (both kept in main memory only).
A. Wilschut and P. M. G. Apers, "Dataflow Query Execution in a Parallel Main-Memory Environment", DPD 1991.
[Diagram: each arriving R tuple is inserted (BUILD) into R's hash table and probes (PROBE) S's hash table; S tuples are handled symmetrically; every match goes straight to OUTPUT.]
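The symmetric insert-then-probe pattern can be sketched as follows (illustrative only; the `(side, tuple)` event encoding is my own):

```python
from collections import defaultdict

def symmetric_hash_join(events, key):
    """events: iterable of (side, tuple) with side in {'R', 'S'}, in arrival order."""
    tables = {"R": defaultdict(list), "S": defaultdict(list)}
    other = {"R": "S", "S": "R"}
    for side, t in events:
        k = key(t)
        tables[side][k].append(t)            # BUILD into this side's hash table
        for m in tables[other[side]][k]:     # PROBE the opposite hash table
            # Emit matches immediately: results pipeline out as tuples arrive.
            yield (t, m) if side == "R" else (m, t)
```

Because every arrival both builds and probes, the first result can be produced as soon as the first matching pair has arrived, regardless of input order.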
Problems of SHJ:
Rather memory intensive
Won’t work for large input streams.
Won’t allow for many joins to be processed in a pipeline (or even in parallel).
New Problems in Online Query Processing over Distributed Data Sources
Unpredictable data access due to link congestion, load imbalances, etc.
Three classes of delays:
Initial Delay: first tuple arrives from remote source more slowly than usual
Slow Delivery: data arrives at a constant, but slower than expected rate
Bursty Arrival: data arrives in a fluctuating manner
Question: Why are delays undesirable?
- Delays prolong the time to first output
- Processing slows if we wait for data to arrive before acting
- If data arrives too fast, we must avoid losing any of it
- Time is wasted sitting idle while no data is coming
- Delays are unpredictable, so no single strategy works
Motivation of XJoin
- Produce results incrementally when available: tuples are returned as soon as they are produced
- Exploit available main memory as long as possible: favor the main-memory join when possible
- Allow progress when one or more sources experience delays: background processing on previously received tuples produces results even when both inputs are stalled
XJoin Design
Tuples are stored in partitions (Hash Join):
A memory-resident (m-r) portion
A disk-resident (d-r) portion
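A partition with its two portions might be modeled as below; this is a sketch, the class and field names are mine rather than the paper's, and the "disk" is just a list:

```python
from dataclasses import dataclass, field

@dataclass
class Partition:
    """One XJoin partition: a memory-resident portion plus a disk-resident
    portion (the 'disk' is simulated by a plain list in this sketch)."""
    memory: list = field(default_factory=list)  # m-r portion
    disk: list = field(default_factory=list)    # d-r portion

    def insert(self, tup):
        self.memory.append(tup)

    def flush(self):
        # Append the whole m-r portion to the end of the d-r portion.
        self.disk.extend(self.memory)
        self.memory.clear()
```

Keeping the flushed tuples appended in arrival order is what later lets the timestamps detect which pairs were already joined.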
[Diagram: Tuple A with hash(Tuple A) = 1 goes to memory-resident partition 1 of source A; Tuple B with hash(Tuple B) = n goes to memory-resident partition n of source B; when memory fills, partitions are flushed to their disk-resident counterparts on disk.]
Challenges in Developing XJoin
- Manage the flow of tuples between memory and secondary storage (when and how to do it)
- Control background processing when inputs are delayed (the reactive scheduling idea)
- Provide both a quick initial result and good overall throughput
- Ensure the full answer is produced
- Ensure duplicate tuples are not produced
XJoin Stages
XJoin proceeds in 3 stages (separate threads)
M:M (memory-to-memory)
M:D (memory-to-disk)
D:D (disk-to-disk)
1st Stage: Memory-to-Memory Join
[Diagram: Tuple A with hash(record A) = i is inserted into partition i of source A and probes partition i of source B; Tuple B with hash(record B) = j is handled symmetrically; matches go to Output.]
1st Stage: Memory-to-Memory Join
Join processing continues as long as:
- Memory permits, and
- One of the inputs is producing tuples
If memory is full, one partition is picked, flushed to disk, and appended to the end of its disk-resident portion.
If there is no new input, stage 1 blocks and stage 2 starts.
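Stage 1 can be sketched as below. This is a simplification, not the paper's code: the victim-selection policy (flush the largest memory-resident partition) and the tiny memory limit are placeholders of my own.

```python
from collections import defaultdict

MEMORY_LIMIT = 4  # max tuples held in memory across both sources (tiny, for illustration)

def stage1(events, n_partitions=2):
    """events: iterable of (source, tuple) pairs, source in {'A', 'B'};
    tuples are keyed (and hashed) by their first field."""
    mem = {"A": defaultdict(list), "B": defaultdict(list)}   # m-r partitions
    disk = {"A": defaultdict(list), "B": defaultdict(list)}  # d-r partitions ("disk")
    other = {"A": "B", "B": "A"}
    in_memory = 0
    for src, t in events:
        p = t[0] % n_partitions
        if in_memory >= MEMORY_LIMIT:
            # Memory full: flush one m-r partition (here: the largest),
            # appending it to the end of its d-r portion.
            v_src, v_p = max(
                ((s, q) for s in mem for q in list(mem[s])),
                key=lambda sq: len(mem[sq[0]][sq[1]]),
            )
            disk[v_src][v_p].extend(mem[v_src][v_p])
            in_memory -= len(mem[v_src][v_p])
            mem[v_src][v_p].clear()
        mem[src][p].append(t)                 # insert into own m-r partition
        in_memory += 1
        for m in mem[other[src]][p]:          # probe the other source's partition
            if m[0] == t[0]:
                yield (t, m) if src == "A" else (m, t)
```

Note that flushed tuples stop participating in stage 1: joining them is deferred to stages 2 and 3.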
Why Stage 1?
In-memory operations are much faster and cheaper than on-disk operations, so this ensures results are produced as soon as possible.
Question:
What does the 2nd Stage do? When does the 2nd Stage start?
Hint: what happens when the data input (tuples) is too large for memory?
Answer:
The 2nd Stage joins memory-to-disk. It starts when both inputs are blocked.
Stage 2
[Diagram: the disk-resident portion of partition i of source A (DP_i^A) probes the memory-resident portion of partition i of source B (MP_i^B); matches go to Output.]
2nd Stage: Memory-to-Disk Join
Activated when the 1st Stage is blocked. Performs 3 steps:
1. Choose a partition from one source according to its throughput and size
2. Use tuples from its d-r portion to probe the m-r portion of the other source, outputting matches, until the d-r portion is completely processed
3. Check whether either input has resumed producing tuples. If yes, resume the 1st Stage; if no, choose another d-r portion and continue the 2nd Stage.
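Step 2 above, one stage-2 round, can be sketched as follows (a sketch only; duplicate filtering via timestamps, covered later, is omitted here):

```python
def stage2_round(disk_part, mem_part_other):
    """One 2nd-stage round: probe the other source's memory-resident
    partition with every tuple of a chosen disk-resident portion.
    Tuples are equi-joined on their first field."""
    out = []
    for d in disk_part:              # d-r tuples read back from "disk"
        for m in mem_part_other:     # m-r tuples of the other source
            if d[0] == m[0]:
                out.append((d, m))
    return out
```

Because the round runs to completion over the whole d-r portion, new input arriving mid-round is only noticed afterwards, which is the overhead discussed on the next slide.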
Controlling the 2nd Stage
The cost of the 2nd Stage is hidden when both inputs experience delays.
Tradeoffs? What are the benefits of using the second stage?
- Produces results when the input sources are stalled
- Accommodates varying input rates
What is the disadvantage?
- The second stage must completely process a d-r portion before checking for new input (overhead)
To address the tradeoff, use an activation threshold: pick a partition likely to produce many tuples right now.
3rd Stage: Disk-to-Disk Join
Clean-up stage:
- Assumes all data for both inputs has arrived
- Assumes the 1st and 2nd stages have completed
Why is this step necessary?
- Completeness of the answer: make sure all result tuples are produced
- Reason: some tuples in the disk-resident portions may never have had a chance to join with each other
Preventing Duplicates
When could duplicates be produced?
- In both the 2nd and 3rd stages, which may perform overlapping work
How is it addressed?
- XJoin prevents duplicates with timestamps
When is it addressed?
- During processing, when trying to join two tuples
Time Stamping: Part 1
Two fields are added to each tuple:
- Arrival TimeStamp (ATS): the time when the tuple first arrived in memory
- Departure TimeStamp (DTS): the time when the tuple was flushed to disk
[ATS, DTS] indicates when the tuple was in memory.
When were two tuples joined in the 1st stage?
- If tuple A's DTS is within tuple B's [ATS, DTS]
Tuples that meet this overlap condition are not considered for joining in the 2nd or 3rd stage.
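The overlap condition is a two-line check; this sketch tests it in both directions, since either tuple may have been flushed first:

```python
def joined_in_stage1(a, b):
    """True if tuples a = (ATS, DTS) and b = (ATS, DTS) were co-resident in
    memory, i.e. already joined by the 1st stage: either tuple's DTS falls
    inside the other's [ATS, DTS] window."""
    a_ats, a_dts = a
    b_ats, b_dts = b
    return b_ats <= a_dts <= b_dts or a_ats <= b_dts <= a_dts

# The slide's example: A = [102, 234] overlaps B1 = [178, 198]
# but not B2 = [348, 601].
```

Pairs for which this returns True are skipped by stages 2 and 3.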
Detecting Tuples Joined in the 1st Stage
- Overlapping: Tuple A [ATS 102, DTS 234] and Tuple B1 [ATS 178, DTS 198] were joined in the first stage: B1 arrived after A and before A was flushed to disk.
- Non-overlapping: Tuple A [ATS 102, DTS 234] and Tuple B2 [ATS 348, DTS 601] were not joined in the first stage: B2 arrived after A was flushed to disk.
Time Stamping: Part 2
For each partition, keep track of:
- ProbeTS: the time when a 2nd-stage probe was done
- DTSlast: the DTS of the last tuple of the disk-resident portion
Several such probes may occur, so keep an ordered history of probe descriptors.
Meaning: all tuples up to and including DTSlast were joined in stage 2 with all tuples that were in main memory at time ProbeTS.
Detecting Tuples Joined in the 2nd Stage
[Diagram: Partition 2 keeps an ordered history list of (DTSlast, ProbeTS) probe descriptors. All A tuples in Partition 2 up to DTSlast = 350 were joined with the m-r tuples that arrived before that probe's ProbeTS; Tuple A [ATS 100, DTS 200] and Tuple B [ATS 500, DTS 600] overlap such a descriptor, so the pair was already joined in stage 2.]
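The history-based check can be sketched as below (my own encoding of the probe descriptors; the numeric example is illustrative, not the slide's):

```python
def joined_in_stage2(history, a_dts, b):
    """True if a d-r tuple with departure timestamp a_dts was already joined
    with m-r tuple b = (ATS, DTS) by some earlier 2nd-stage probe.
    history is the partition's ordered list of (dts_last, probe_ts) descriptors."""
    b_ats, b_dts = b
    for dts_last, probe_ts in history:
        # That probe covered every d-r tuple up to dts_last, joined against
        # every tuple resident in memory at time probe_ts.
        if a_dts <= dts_last and b_ats <= probe_ts <= b_dts:
            return True
    return False
```

Stage 3 runs this check (together with the stage-1 overlap check) before emitting any pair, which is how completeness is achieved without duplicates.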
Experiments
HHJ (Hybrid Hash Join)
XJoin (with 2nd stage and with caching)
XJoin (without 2nd stage)
XJoin (with aggressive usage of 2nd stage)
Case 1: Slow Network, Both Sources Are Slow
Case 1: Slow Network, Both Sources Are Slow (Bursty)
- XJoin improves the delivery time of initial answers -> interactive performance
- Reactive background processing is an effective way to exploit intermittent delays and maintain a continued output rate
- Shows that the 2nd stage is very useful if there is time for it
Case 2: Fast Network, Both Sources Are Fast
Case 2: Fast Network, Both Sources Are Fast
- All XJoin variants deliver initial results earlier
- XJoin can also deliver the overall result in time equal to HHJ's
- HHJ delivers the 2nd half of the result faster than XJoin
- The 2nd stage cannot be used too aggressively if new data arrives continuously
Conclusion
- Can be conservative on space (small footprint)
- Can produce an initial result as early as possible
- Can hide intermittent data delays
- Can be used in conjunction with online query processing to manage data streams (limited)
How to Further Optimize XJoin?
- Resume Stage 1 as soon as data arrives
- Remove no-longer-joining tuples in a timely manner
- Other ideas? …
References
Urhan, Tolga and Franklin, Michael J. "XJoin: Getting Fast Answers From Slow and Bursty Networks."
Urhan, Tolga and Franklin, Michael J. "XJoin: A Reactively-Scheduled Pipelined Join Operator." IEEE Data Engineering Bulletin, 2000.
Hellerstein, Franklin, Chandrasekaran, Deshpande, Hildrum, Madden, Raman, and Shah. "Adaptive Query Processing: Technology in Evolution." IEEE Data Engineering Bulletin, 2000.
Avnur, Ron and Hellerstein, Joseph M. "Eddies: Continuously Adaptive Query Processing."
Babu, Shivnath and Widom, Jennifer. "Continuous Queries over Data Streams."
Stream: New Query Context
Challenges faced by XJoin:
- Potentially unbounded, growing join state
- Indefinite delay of some join results
Solutions:
- Exploit semantic constraints to remove no-longer-joining data in a timely manner
- Constraints: sliding windows, punctuations
Punctuation
A punctuation is a predicate on stream elements that evaluates to false for every element following the punctuation.

ID      | Name   | Age
9961234 | Edward | 17
9961235 | Justin | 19
9961238 | Janet  | 18
<*, *, (0, 18]>   -- no more tuples for students whose age is less than or equal to 18!
9961256 | Anna   | 20
…
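A punctuation pattern like `<*, *, (0, 18]>` can be matched against tuples as sketched below; the encoding (`"*"` wildcard, `(lo, hi)` for a half-open range `(lo, hi]`) is my own illustration, not a fixed format from the literature:

```python
def matches_punctuation(tup, punct):
    """True if the tuple matches the punctuation pattern: '*' is a wildcard,
    a (lo, hi) pair denotes the half-open range (lo, hi], and anything else
    matches by equality. A matching punctuation promises no later tuple
    will match the same pattern."""
    for value, pat in zip(tup, punct):
        if pat == "*":
            continue
        if isinstance(pat, tuple):          # (lo, hi] range constraint
            lo, hi = pat
            if not (lo < value <= hi):
                return False
        elif value != pat:
            return False
    return True

# The punctuation <*, *, (0, 18]>: no more tuples with 0 < age <= 18.
punct = ("*", "*", (0, 18))
```

Tuples that match an arrived punctuation can never join with future input and so become candidates for purging from the join state.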
An Example
Query: For each item that has at least one bid, return its bid-increase value.

Select O.item_id, Sum(B.bid_price - O.open_price)
From Open O, Bid B
Where O.item_id = B.item_id
Group by O.item_id

Open Stream (item_id | seller_id | open_price | timestamp):
1080 | jsmith  | 130.00 | Nov-10-03 9:03:00
<1080, *, *, *>
1082 | melissa | 20.00  | Nov-10-03 9:10:00
<1082, *, *, *>
…

Bid Stream (item_id | bidder_id | bid_price | timestamp):
1080 | pclover  | 175.00 | Nov-14-03 8:27:00
1082 | smartguy | 30.00  | Nov-14-03 8:30:00
1080 | richman  | 177.00 | Nov-14-03 8:52:00
<1080, *, *, *>   -- no more bids for item 1080!
…

[Plan: Join on item_id over the Open and Bid streams (Out1: item_id), then Group-by item_id with sum (Out2: item_id, sum).]
PJoin Execution Logic
[Diagram 1: tuple ta from Stream A arrives and is hashed (Hash(ta) = 1) into the memory-resident hash table of Stream A's join state (Sa); each stream's state has a memory-resident portion, a disk-resident portion, a purge candidate pool, and a punctuation set (PSa / PSb).]
[Diagram 2: punctuation pa from Stream A arrives, is hashed (Hash(pa) = 1), and is recorded in punctuation set PSa; tuples that can no longer join are moved to the purge candidate pool.]
PJoin vs. XJoin: Memory Overhead
[Chart: number of tuples in the join states over time (milliseconds); XJoin's join state grows far larger than PJoin's. Tuple inter-arrival: 2 milliseconds; punctuation inter-arrival: 40 tuples/punctuation.]
PJoin vs. XJoin: Tuple Output Rate
[Chart: number of output tuples over time (milliseconds) for PJoin and XJoin. Tuple inter-arrival: 2 milliseconds; punctuation inter-arrival: 30 tuples/punctuation.]
Conclusion
- The memory requirement for PJoin's state is almost insignificant compared to XJoin's
- XJoin's growing join state increases probe cost, which hurts the tuple output rate
- Eager purge is the best strategy for minimizing join state
- Lazy purge with an appropriate purge threshold significantly increases the tuple output rate