Parallel Index-based Stream Join on a Multicore CPU

Amirhesam Shahvarani, Hans-Arno Jacobsen

Technical University of Munich

[email protected]

ABSTRACT

There is increasing interest in using multicore processors to accelerate stream processing. For example, indexing sliding window content to enhance the performance of streaming queries is greatly improved by utilizing the computational capabilities of a multicore processor. However, designing an effective concurrency control mechanism that addresses the problem of concurrent indexing in highly dynamic settings remains a challenge. In this paper, we introduce an index data structure, called the Partitioned In-memory Merge-Tree, to address the challenges that arise when indexing highly dynamic data, which are common in streaming settings. To complement the index, we design an algorithm to realize a parallel index-based stream join that exploits the computational power of multicore processors. Our experiments using an octa-core processor show that our parallel stream join achieves up to 5.5 times higher throughput than a single-threaded approach.

1. INTRODUCTION

For a growing class of data management applications, such as social network analysis [1], fraud detection [2], algorithmic trading [3], and real-time data analytics [4], an information source is available as a transient, in-memory, real-time, and continuous sequence of tuples (also known as a data stream) rather than as a persistently disk-stored dataset [5]. In these applications, processing is mostly performed using long-running queries known as continuous queries [6]. Although its capacity is steadily increasing, system memory remains too limited to hold potentially infinite data streams. To address this problem, the scope of continuous queries is typically limited to a sliding window that limits the number of tuples to process at any one point in time. The window is either defined over a fixed number of tuples (count based) or is a function of time (time based).

Indexing the content of the sliding window is necessary to eliminate memory-intensive scans during searches and to enhance the performance of window queries, as in conventional databases [7]. In terms of indexing data structures, hash tables are generally faster than tree-based data structures for both update and search operations. However, hash-based indexes are applicable only for operations that use equality predicates since the logical order of indexed values is not preserved by a hash table. Consequently, tree-based indexing is essential for applications that analyze continuous variables and employ nonequality predicates [8]. Thus, in this paper, we focus on tree-based indexing approaches, which are also applicable to operators that use nonequality predicates.

Due to the distinct characteristics of the data flow in streaming settings, the indexing data structures designed for conventional databases, such as B+-Tree, are not efficient for indexing streaming data. Data in streaming settings are highly dynamic, and the underlying indexes must be continuously updated. In contrast to indexing in conventional databases, where search is among the most frequent and critical operations, support for an efficient index update is vital in a streaming setting. Moreover, tuple movement in sliding windows follows a specific pattern of arrival and departure that could be utilized to improve indexing performance.

In addition to the index maintenance overhead arising from data dynamics, proposing a concurrency control scheme for multithreaded indexing that handles frequent updates is also a challenging endeavor. In conventional databases, the index update rate is lower than the index lookup rate, and concurrency control schemes are designed accordingly. Therefore, these approaches are suboptimal for indexing highly dynamic data, such as sliding windows, for which they have not been designed. Thus, dedicated solutions are desired to coordinate dynamic workloads with highly concurrent index updates. These issues are further exacerbated as leveraging the computational power of multicore processors becomes inevitable in high-performance stream processing. The shift in processor design from the single-core to the multicore paradigm has initiated widespread efforts to leverage parallelism in all types of applications to enhance performance, and stream processing is no exception [9].

In terms of the underlying hardware, stream processing systems (SPSs) are divided into two categories, multi-node and single-node. Single-node SPSs are designed to exploit the computation power of a single high-performance machine and are optimized for scale-up execution, such as Trill [10], StreamBox [11] and Saber [12]. In contrast, multi-node SPSs are intended to exploit a multi-node cluster and are optimized for scale-out execution, such as Storm [13], Spark [14] and Flink [15]. In general, a multi-node SPS relies on massive parallelism in the workload and the producer-consumer pattern to distribute tasks among nodes. As a consequence, multi-node SPSs achieve sub-optimal single-node performance and require a large cluster to match the performance of a scale-up optimized solution using a single machine. With advances in modern single-node servers, scale-up optimized solutions become an interesting alternative for high-throughput and low-latency stream processing for many applications [16].

Thus, in this paper, we address the challenges of parallel tree-based sliding window indexing, which is designed to exploit a multicore processor on the basis of uniform memory access. The distinct characteristics of streaming data motivated us to reconsider how to parallelize a stream index and design a novel mechanism dedicated to a streaming setting. We propose a two-stage data structure based on two known techniques, data partitioning and delta update, called the Partitioned In-memory Merge-Tree (PIM-Tree), that consists of a mutable component and an immutable component to address the challenges inherent to concurrent indexing in highly dynamic settings. The mutable component in PIM-Tree is partitioned into multiple disjoint ranges, which dynamically adapt to the range of the streaming tuple values.



Figure 1: Index-based window join.

This multi-partition design enables PIM-Tree to benefit from the queries' distribution to reduce potential conflicts among queries and to support parallel index lookup and update through a simple and low-cost concurrency control method. Moreover, leveraging a coarse-grained tuple disposal scheme based on this two-stage design, PIM-Tree reduces the amortized cost of sliding window updates significantly compared to individual tuple updates in conventional indexes such as a B+-Tree. By combining these two techniques, PIM-Tree outperforms state-of-the-art indexing approaches in both single- and multi-threaded settings.

To validate our indexing approach, we evaluate it in the context of performing a window band join. Stream join is a fundamental operation for performing real-time analytics by correlating the tuples of two streams, and it is among the most computationally intensive operations in stream processing systems. Nonetheless, our indexing approach is generic and applies equally well to other streaming operations.

To complement our data structure, we develop a parallel window band join algorithm based on dynamic load balancing and shared sliding window indexes. These features enable our join algorithm to perform a parallel window join using an arbitrary number of available threads. Thus, the number of threads assigned for a join operation can be adjusted at run time based on the workload and the hardware available. Moreover, our join algorithm preserves the order of the result tuples such that if tuple t1 arrives before t2, the join result of tuple t1 will be propagated into the output stream before that for t2.

The evaluation results indicate that utilizing an octa-core processor, our multithreaded join algorithm using PIM-Tree achieves up to 5.6 times higher throughput than our single-threaded implementation. Moreover, a single-threaded stream band join using PIM-Tree is 60% faster on average than that using B+-Tree, which demonstrates the efficiency of our data structure for stream indexing applications. Compared with a stream band join using the state-of-the-art parallel tree index Bw-Tree [17], using PIM-Tree improves the system performance by a factor of 2.6 on average.

In summary, the contributions of this paper are fourfold: (1) We propose PIM-Tree, a novel two-stage data structure designed to address the challenges of indexing highly dynamic data, which outperforms state-of-the-art indexing methods in the application of window join in both single- and multi-threaded settings. (2) We develop an analytical model to compare the costs of window joins using the indexing approaches studied in this paper in order to provide better insight into our design decisions. (3) We propose a parallel index-based window join (IBWJ) algorithm that addresses the challenges arising from using a shared index in a concurrent manner. (4) We conduct an extensive experimental study of IBWJ employing PIM-Tree and provide a detailed quantitative comparison with state-of-the-art approaches.

2. INDEX-BASED WINDOW JOIN

In this section, we define the stream join operator semantics and study Index-Based Window Join (IBWJ) using three existing indexing approaches, including B+-Tree, chained index and round-robin partitioning, in order to point out the challenges of sliding window indexing and the shortcomings of existing methods. We also provide an analytical comparison of processing a tuple using each approach to provide clearer insight into each mechanism and to highlight their differences from our approach. The notation that we use in this paper is as follows.

w : Size of the sliding window.
τc : Time complexity of comparing two tuples.
σ : Join selectivity (0 ≤ σ ≤ 1).
σs : Match rate (w × σ).
fT : Inner node fan-out of a tree of type 'T'.
λ^O_T : Time complexity of performing an operation (O: Insert, Search, Delete) on a node of a tree of type 'T'.

Throughout the remainder of this paper, λsb, λib and λdb denote the time complexities of search, insert and delete operations at each node of B+-Tree, respectively, and fb denotes the inner node fan-out of B+-Tree.

2.1 Window Join

The common types of sliding windows are tuple-based and time-based sliding windows. The former defines the window boundary based on the number of tuples, also referred to as the count-based window semantic, and the latter uses time to delimit the window. We present our approach based on tuple-based sliding windows, although there is no technical limitation for applying our approach to time-based sliding windows.

We denote a two-way window θ-join as WR ⋈θ WS, where WR and WS are the sliding windows of streams R and S, respectively. The join result contains all pairs of the form (r, s) such that r ∈ WR and s ∈ WS, where θ(r, s) evaluates to true. A join operator processes a tuple r arriving at stream R as follows. (1) Lookup r in WS to determine matching tuples and propagate the results into the output stream. (2) Delete expired tuples from WR. (3) Insert tuple r into WR. The cost of each step depends on the choice of the join algorithm and index data structure used. To simplify the time complexity analysis for different join implementations, we assume that the lengths of the sliding windows of both streams, R and S, are identical, denoted by w. Additionally, we ignore the cost of the sliding window update in our analysis since it is identical when using different join algorithms and indexing approaches. Let CS, CD and CI represent the time complexities of search, delete, and insert operations, respectively; then, the time complexity of processing a single tuple (CT) is given by Equation 1.

$$C_T = \overbrace{C_S}^{\text{Step 1}} + \overbrace{C_D}^{\text{Step 2}} + \overbrace{C_I}^{\text{Step 3}} \qquad (1)$$
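
To make the three steps concrete, here is a minimal single-threaded sketch (not the paper's implementation) of processing one tuple of stream R against a count-based window, using std::multimap as a stand-in for the ordered index; the Tuple and Window types and the band predicate are illustrative assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <map>
#include <vector>

// Illustrative stand-ins: a count-based window plus an ordered index over the key.
struct Tuple { std::int64_t key; std::int64_t id; };

struct Window {
    std::deque<Tuple>                         tuples; // arrival order, at most w entries
    std::multimap<std::int64_t, std::int64_t> index;  // key -> id
};

// Process tuple r arriving at stream R against windows W_R and W_S
// (band predicate |r.key - s.key| <= diff); returns the matching tuples of W_S.
std::vector<Tuple> process(Window& wR, const Window& wS, const Tuple& r,
                           std::size_t w, std::int64_t diff) {
    std::vector<Tuple> matches;

    // Step 1: index lookup in W_S.
    auto lo = wS.index.lower_bound(r.key - diff);
    auto hi = wS.index.upper_bound(r.key + diff);
    for (auto it = lo; it != hi; ++it) matches.push_back({it->first, it->second});

    // Step 2: delete the expired tuple from W_R once the window is full.
    if (wR.tuples.size() == w) {
        const Tuple old = wR.tuples.front();
        auto range = wR.index.equal_range(old.key);
        for (auto it = range.first; it != range.second; ++it)
            if (it->second == old.id) { wR.index.erase(it); break; }
        wR.tuples.pop_front();
    }

    // Step 3: insert r into W_R.
    wR.tuples.push_back(r);
    wR.index.emplace(r.key, r.id);
    return matches;
}
```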

2.2 Index-Based Window Join

IBWJ accelerates window lookup by utilizing an index data structure. Although maintaining an extra data structure along the sliding window increases the update cost, the performance gain achieved during lookup offsets this extra cost and results in higher overall throughput.



The general idea of IBWJ is illustrated in Figure 1. Tuples in WR and WS are indexed into two separate index structures called IR and IS, respectively. Upon the arrival of a new tuple r into stream R, IBWJ searches IS for matching tuples. In addition, IR must be updated based on the changes in the sliding window. Here, we examine IBWJ using B+-Tree, chained index and context-insensitive partitioning.

2.2.1 IBWJ using B+-Tree

We now derive the time complexity of IBWJ based on B+-Tree. Let Hb be the height of the B+-Tree storing w records (Hb ≈ log_fb w). The join algorithm processes a given tuple r from stream R as follows. (1) Search IS to reach a leaf node (Hb · λsb); then, linearly scan the leaf node to determine all matching tuples (σs · τc). (2) Delete the expired tuple from IR (Hb · λdb). (3) Insert the new tuple, r, into IR (Hb · λib). The time complexity of processing a tuple using IBWJ based on a B+-Tree (CBJ) is given in Equation 2.

$$C_{BJ} = \overbrace{H_b \cdot \lambda_b^s + \sigma_s \cdot \tau_c}^{\text{Step 1}} + \overbrace{H_b \cdot \lambda_b^d}^{\text{Step 2}} + \overbrace{H_b \cdot \lambda_b^i}^{\text{Step 3}} \qquad (2)$$

2.2.2 IBWJ using Chained Index

Lin et al. [20] and Ya-xin et al. [21] proposed the chained index to accelerate stream join processing. The basic idea of the chained index is to partition the sliding window into discrete intervals and construct a distinct index for each interval. Figure 2 depicts the basic idea of the chained index. As new tuples arrive in the sliding window, they are inserted into the active subindex until the size of the active subindex reaches its limit. When this situation occurs, the active subindex is archived and pushed into the subindex chain, and an empty subindex is initiated as the new active subindex. Using this method, there is no need to delete expired tuples incrementally; rather, the entire subindex is released from the chain when it expires.
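
A rough sketch of this bookkeeping follows, assuming count-based subindexes and using std::multiset as a stand-in for each per-interval B+-Tree; the capacity and class names are illustrative.

```cpp
#include <cstddef>
#include <deque>
#include <iterator>
#include <set>

// Sketch of chained-index maintenance: new tuples go into the active subindex,
// a full subindex is archived, and the oldest one is dropped as a whole once
// the chain exceeds length L (coarse-grained tuple disposal).
struct ChainedIndex {
    std::size_t capacity;                     // tuples per subindex, e.g. w / L
    std::size_t max_length;                   // chain length L
    std::deque<std::multiset<int>> chain;     // front = oldest, back = active

    ChainedIndex(std::size_t cap, std::size_t L)
        : capacity(cap), max_length(L), chain(1) {}

    void insert(int key) {
        if (chain.back().size() == capacity) {
            chain.emplace_back();             // archive the active subindex
            if (chain.size() > max_length)
                chain.pop_front();            // release the expired subindex whole
        }
        chain.back().insert(key);
    }

    // A range lookup has to probe every subindex in the chain.
    std::size_t count_in_range(int lo, int hi) const {
        std::size_t n = 0;
        for (const auto& sub : chain)
            n += std::distance(sub.lower_bound(lo), sub.upper_bound(hi));
        return n;
    }
};
```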

We now derive the time complexity of IBWJ when both IR and IS are set to a chained index of length L (L ≥ 2) and all subindexes are B+-Trees. Let Hc be the height of each subindex (Hc ≈ Hb − log_fb L; to simplify the equations, we also consider the height of the active subindex to be equal to that of the archived subindexes). The join algorithm processes a given tuple r from stream R as follows. (1) Search all subindexes of IS to their leaf nodes (L · Hc · λsb) and linearly scan leaf nodes to find matching tuples and filter out expired tuples during the scan. The number of expired tuples that need to be removed from the result set is σs/(2 · (L − 1)) on average. (2) Check whether the latest subindex of IR is expired and discard the entire subindex. The cost of this step is negligible, and we consider it to be zero. (3) Insert the new tuple, r, into the active subindex of IR (Hc · λib). The time complexity of processing a tuple using IBWJ on a chained index (CCJ) is given in Equation 3.

$$C_{CJ} = \overbrace{L \cdot H_c \cdot \lambda_b^s + \sigma_s \cdot \tau_c \left(1 + \frac{1}{2 \cdot (L-1)}\right)}^{\text{Step 1}} + \underbrace{0}_{\text{Step 2}} + \underbrace{H_c \cdot \lambda_b^i}_{\text{Step 3}} \qquad (3)$$

Comparing the cost of the index operations using chained index and B+-Tree indicates that using chained index to index sliding windows is more efficient in terms of index update costs than using a single B+-Tree, whereas range queries are more costly using chained index because it needs to search multiple individual subindexes.

Figure 2: Chained index.

2.2.3 IBWJ using Round-robin Partitioning

A group of parallel stream join solutions, such as handshake join [22], SplitJoin [23] and BiStream [20], are based on context-insensitive partitioning. In all these approaches, a sliding window is divided into disjoint partitions using round-robin partitioning, which is based on the arrival order of tuples rather than tuple values, and each join-core is associated with a single window partition. To accelerate the lookup operation, each thread may maintain a local index for its associated partition. Because indexes are local to each thread, there is no need for a concurrency control mechanism to access indexes. In fact, the parallelism in these approaches is achieved by dividing a tuple execution task into a set of independent subtasks rather than utilizing a shared index data structure and distributing tuples among threads. As a drawback of approaches based on context-insensitive partitioning, all joining threads are required to be available to generate the join result of a single tuple because each thread can only generate a portion of the join result.

Here, we explain the cost of IBWJ using the low-latency variant of handshake join (LHS) employing P threads. Figure 3 illustrates the join-core arrangement and the flow of streams in LHS. In LHS, join-cores are linked as a linear chain such that each thread only communicates with its two neighbors, and data streams R and S propagate in two opposite directions. In the original handshake join, tuples arrive and leave each join-core in sequential order, and tuples may have to queue for a long period of time before moving to the next join-core. This results in significant latency in join result generation and in higher computational complexity because all tuples are required to be inserted and deleted from each local index. In LHS, however, tuples are fast-forwarded toward the end of the join-core chain to meet all join-cores faster. Moreover, each tuple is only indexed by a single join-core, which is assigned in a round-robin manner. Consequently, LHS results in higher throughput and lower latency than the original handshake join.

We now derive the time complexity of the index operations required to process a single tuple using round-robin partitioning with P join-cores. Let all join-cores use B+-Tree as local indexes and Hp be the height of each local index (Hp ≈ Hb − log_fb P). The cost of processing a given tuple r from stream R is as follows. (1) Tuple r is propagated among all join-cores, and all cores search their local IS until the leaf nodes (P · Hp · λsb) and linearly scan leaf nodes to find matching tuples (σs · τc). (2) The join-core that is assigned to index tuple r deletes the expired tuple from its IR (Hp · λdb). (3) The same join-core as in Step 2 inserts the new tuple, r, into its IR (Hp · λib). The time complexity of processing a tuple using round-robin partitioning (CRRJ) is given in Equation 4.

$$C_{RRJ} = \overbrace{P \cdot H_p \cdot \lambda_b^s + \sigma_s \cdot \tau_c}^{\text{Step 1}} + \overbrace{H_p \cdot \lambda_b^d}^{\text{Step 2}} + \overbrace{H_p \cdot \lambda_b^i}^{\text{Step 3}} \qquad (4)$$



Figure 3: Low-latency handshake join.

Comparing the cost of the index operations using round-robin partitioning with the cost of IBWJ using B+-Tree results in the following: Using round-robin partitioning is more efficient for inserting or deleting a tuple from the sliding window than using a single B+-Tree because the height of the local index for each partition is less than that of a single B+-Tree indexing w tuples (Hp < Hb). However, because it is necessary to search multiple local indexes using round-robin partitioning to find matching tuples, using a single B+-Tree is more efficient in terms of range querying. Generally, as the number of join-cores increases, the total cost of searching local indexes using round-robin partitioning also increases, which is a consequence of context-insensitive window partitioning. This redundant index search limits the efficiency of approaches based on round-robin partitioning in the application of IBWJ.

3. CONCURRENT WINDOW INDEXING

In this section, we present the design of our indexing data structures for join processing.

3.1 Overview

We propose a novel two-stage indexing mechanism to accelerate parallel stream join by combining two previously known techniques, delta merging and data partitioning, resulting in a highly efficient indexing solution for both single- and multi-threaded sliding window indexing. Our indexing solution consists of a mutable component and an immutable component. The mutable component is an insert-efficient indexing data structure into which all new tuples are initially inserted. The immutable component is a search-efficient data structure where updates are applied using delta merging. Utilizing the strength of each indexing component and a coarse-grained tuple disposal method, our two-stage data structure results in more efficient sliding window indexing compared with a single-component indexing data structure. Moreover, we extend our indexing solution by splitting the mutable component into multiple mutable partitions, where partitions are assigned to disjoint ranges. Consequently, operations on different value ranges can be performed concurrently. This technique enables our indexing solution to leverage the queries' distribution to support efficient task parallelism with a lightweight concurrency control mechanism.

Throughout this section, we first study the effect of delta merging in the application of sliding window indexing, and then we extend the delta merging method with index partitioning to support parallel sliding window indexing.

In this work, we use two different B+-Tree designs that have distinct performance characteristics. The first design is the classic B+-Tree design, where each node explicitly stores the references to its children. This design, which we simply refer to as B+-Tree, supports efficient incremental updates. In contrast, as an immutable data structure, B+-Tree nodes can be arranged into an array in a breadth-first fashion. In this representation, given a node position, it is possible to retrieve the location of its children implicitly without needing to store actual references. By eliminating child references, more space is available in inner nodes for keys, and it is feasible to achieve a higher fan-out and decrease the tree depth. Therefore, lookup operations in this design, which we call immutable B+-Tree, are faster than in the classic design based on node referencing. As a drawback, it is inefficient to perform individual updates in an immutable B+-Tree since the entire tree must be reconstructed; however, this type of access is not required in our use of the index.
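
As an illustration of the implicit addressing, here is a simplified view assuming a complete tree with a fixed fan-out stored level by level; the actual node layout of the immutable B+-Tree is not detailed here, so the formulas below are illustrative only.

```cpp
#include <cstddef>

// Simplified view of an array-packed, immutable tree with fan-out f: nodes are
// stored level by level with the root at position 0, so child positions are
// computed from the parent position instead of being stored as references.
struct ImplicitLayout {
    std::size_t fanout;  // inner-node fan-out (f_ib)

    // Position of the k-th child (0 <= k < fanout) of the node at position pos.
    std::size_t child(std::size_t pos, std::size_t k) const {
        return pos * fanout + 1 + k;
    }

    // Position of the parent of a non-root node.
    std::size_t parent(std::size_t pos) const {
        return (pos - 1) / fanout;
    }
};
```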

Throughout this paper, λsib denotes the time complexity of search at each node of the immutable B+-Tree, and fib denotes the inner node fan-out of the immutable B+-Tree.

3.2 In-memory Merge-Tree

We now describe our In-memory Merge-Tree (IM-Tree), which is designed to accelerate sliding window indexing. IM-Tree consists of two separate indexing components (TI and TS). TI is a regular B+-Tree that is capable of performing individual updates, and TS is an immutable B+-Tree that is only efficient for bulk updates. All new tuples are initially indexed by TI. When the size of TI reaches a predefined threshold, the entire TI is merged into TS, and simultaneously, all expired tuples in TS are discarded. The merging threshold is defined as m × w, where m is a parameter between zero and one (0 < m ≤ 1), referred to as the merge ratio. To query a range of tuples, it is necessary to search both components, TI and TS, separately. Additionally, it is necessary to filter out expired tuples of TS from the result set. When a tuple expires, it is flagged in the sliding window as expired but not eliminated. To drop expired tuples from the index search results, every result tuple is checked in the sliding window to determine whether it is flagged as expired. At the end, all expired tuples are eliminated from both the sliding window and the index data structure during the merge operation.
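
A condensed sketch of this life cycle, with std::multimap standing in for the mutable TI and a sorted std::vector standing in for the immutable TS; the expired-tuple flags and the bottom-up tree construction are only indicated by comments, and all names are illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <map>
#include <vector>

// Condensed IM-Tree sketch: T_I is a mutable ordered index, T_S an immutable
// sorted array that is rebuilt on every merge (stand-in containers only).
struct IMTreeSketch {
    std::multimap<int, int> ti;   // mutable component T_I (key -> payload)
    std::vector<int>        ts;   // immutable component T_S (sorted keys)
    std::size_t             w;    // window size
    double                  m;    // merge ratio, 0 < m <= 1

    IMTreeSketch(std::size_t window, double merge_ratio) : w(window), m(merge_ratio) {}

    void insert(int key, int payload) {
        ti.emplace(key, payload);
        if (ti.size() >= static_cast<std::size_t>(m * static_cast<double>(w)))
            merge();                            // delta update: fold T_I into a new T_S
    }

    void merge() {
        std::vector<int> fresh(ts);             // keep T_S keys (expired ones would be dropped here)
        for (const auto& kv : ti) fresh.push_back(kv.first);
        std::sort(fresh.begin(), fresh.end());  // last level of the new T_S; upper levels built bottom-up
        ts.swap(fresh);
        ti.clear();
    }

    // A range query probes both components; hits from T_S still need the
    // expired-tuple check against the sliding window described in the text.
    std::size_t count_in_range(int lo, int hi) const {
        std::size_t n = std::distance(ti.lower_bound(lo), ti.upper_bound(hi));
        n += std::upper_bound(ts.begin(), ts.end(), hi) -
             std::lower_bound(ts.begin(), ts.end(), lo);
        return n;
    }
};
```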

Both chained index and IM-Tree utilize a coarse-grained tuple disposal technique to alleviate the overhead of tuple removal, but the tuple disposal techniques differ between these indexing approaches. Chained index disposes of an entire subtree, whereas IM-Tree eliminates expired tuples periodically during the merge operation. The periodic merge enables IM-Tree to maintain all indexed tuples in only two index components and to provide better search performance than chained index.

Although both LSM-Tree [24] and IM-Tree are multi-component indexing solutions that use the delta update mechanism to transfer data between their components, the two data structures are designed to tackle distinct problems. The components of LSM-Tree are configured to be used in different storage media, and LSM-Tree applies the delta update to alleviate the cost of write operations in low-bandwidth storage media. In contrast, IM-Tree consists of two in-memory components specialized for different operations, and IM-Tree applies periodic merges in order to enhance the performance of range queries. Moreover, LSM-Tree is based on incremental merging between its components, which is not applicable to immutable data structures such as the immutable B+-Tree used in our IM-Tree.

3.2.1 IBWJ using IM-Tree

Let HI and HS be the heights of TI and TS, respectively. The time complexity of processing a tuple s arriving at stream S for IBWJ using IM-Tree is as follows. (1) Search both TI and TS of the opposite stream to the leaf nodes (HI · λsb + HS · λsib) and perform a linear scan of the leaf node to determine matching tuples (σs · τc) and filter out expired tuples (σs · τc · m/2).



Figure 4: Structure of PIM-Tree (blue and red sections are TS and TI components, respectively).

(2) Tuples in IM-Tree are deleted in a batch during a TI and TS merge. Let M be the time complexity of the merge; then, the average cost per tuple is M/(m · w). (3) Insert the new tuple into the index of stream S (HI · λib). The time complexity of processing a single tuple using IBWJ based on an IM-Tree (CMJ) is given by Equation 5.

The stepwise comparison between the window join using B+-Tree and IM-Tree is controlled by the merge ratio m. Assigning a proper value for m is subject to various trade-offs. A late merge creates a larger TI on average and results in a more expensive insert and search of TI. Additionally, it increases the average number of expired tuples in TS and results in an inefficient lookup in TS. Meanwhile, merge operations are costly, and overdoing such operations results in a significant performance loss. Generally, increasing the value of m causes the costs of Steps 1 and 3 to increase and the cost of Step 2 to decrease.

$$C_{MJ} = \overbrace{H_S \cdot \lambda_{ib}^s + H_I \cdot \lambda_b^s + \sigma_s \cdot \tau_c \cdot \left(1 + \frac{m}{2}\right)}^{\text{Step 1}} + \underbrace{M/(m \cdot w)}_{\text{Step 2}} + \underbrace{H_I \cdot \lambda_b^i}_{\text{Step 3}} \qquad (5)$$

3.3 Partitioned In-memory Merge-Tree

The Partitioned In-memory Merge-Tree (PIM-Tree) is an extended variant of IM-Tree that is designed to address the challenges of parallel sliding window indexing. Similar to IM-Tree, PIM-Tree is also composed of two components in which recently inserted tuples are periodically merged into a lookup-efficient index. In fact, the key difference is in the design of the insert-efficient component TI. Rather than using a single B+-Tree for all incoming tuples, we opt to use a set of B+-Trees that are associated with disjoint tuple value ranges. To provide a uniform workload among trees, these ranges periodically adapt to the distribution of values in the sliding window. To handle parallel updates and lookups, each B+-Tree is associated with a lock that allows only a single thread to access the tree. Unlike approaches that resolve concurrency at the tree node level, such as Bw-Tree [17] or B-link [25], parallelism in PIM-Tree is based on concurrent operations over disjoint partitions and relies on the distribution of incoming tuples. An advantage of our approach is that the routines for performing operations are as efficient as those of the single-threaded approach, and their only overhead is to obtain a single lock per tree traversal.

3.3.1 PIM-Tree Structure

Figure 4 provides an overview of the PIM-Tree structure. PIM-Tree consists of two separate components, TS and TI. TS is an immutable B+-Tree; it is similar to our IM-Tree, which stores static data. TI represents a set of subindexes named B0, .., Bn attached to TS at depth DI (insertion depth), where each Bi is associated with the same range of values as the i-th node of TS at the insertion depth. Each Bi is an independent B+-Tree, where the tail leaf node of each Bi (0 ≤ i < n) is connected to the head leaf node of the successor B+-Tree (Bi+1) to create a single sorted linked list of all elements in TI.

To insert a new record, the update routine first searches TS until the depth of DI to identify the matching Bi that is associated with the range that includes the given value. Then, the routine inserts the record into Bi using the B+-Tree insert algorithm. Similar to IM-Tree, the two components of PIM-Tree need to be periodically merged for maintenance. This maintenance occurs when the total number of tuples in TI equals m × w. Merging eliminates expired tuples in TS and arranges the remaining tuples to be combined with those from TI into a sorted array that is taken as the last level of the new TS. Subsequently, TS is built from the bottom up, and every Bi is initialized as an empty B+-Tree.
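
The insert path can be sketched as follows; this is a simplified, hypothetical view in which a sorted array of range bounds plays the role of the TS nodes down to depth DI and std::multimap stands in for each Bi.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <mutex>
#include <vector>

// Simplified PIM-Tree insert routing: the upper D_I levels of T_S act as a
// static router that maps a key to its partition B_i; each B_i is an
// independent mutable index guarded by its own lock.
struct PimTreeSketch {
    std::vector<int>                     upper_bounds;  // sorted upper bound of each partition's range
    std::vector<std::multimap<int, int>> partitions;    // B_0 .. B_n (stand-in for the subindexes)
    std::vector<std::mutex>              locks;         // one lock per B_i

    explicit PimTreeSketch(std::vector<int> bounds)
        : upper_bounds(std::move(bounds)),
          partitions(upper_bounds.size()),
          locks(upper_bounds.size()) {}

    // Equivalent to traversing T_S down to depth D_I (assumes at least one partition).
    std::size_t route(int key) const {
        auto it = std::lower_bound(upper_bounds.begin(), upper_bounds.end(), key);
        return it == upper_bounds.end() ? upper_bounds.size() - 1
                                        : static_cast<std::size_t>(it - upper_bounds.begin());
    }

    void insert(int key, int payload) {
        std::size_t i = route(key);
        std::lock_guard<std::mutex> guard(locks[i]);   // single lock per tree traversal
        partitions[i].emplace(key, payload);
    }
};
```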

3.3.2 IBWJ Using PIM-Tree

Let H'I be the average height of Bi, 0 ≤ i ≤ n. The join algorithm processes a given tuple r from stream R as follows. (1) Search the index of stream S to identify matching tuples, which requires first searching TS (HS · λsib) and the corresponding Bi (H'I · λsb) to the leaf nodes and then performing a leaf node scan to determine matching tuples and filter out expired tuples (σs · τc · (1 + m/2)). (2) Similar to IM-Tree, tuples are deleted in a batch during the merge of TI and TS; thus, the average cost per tuple is M'/(m · w), where M' is the cost of merging TI and TS in PIM-Tree. (3) Insert the new tuple, r, into TI, which requires first traversing TS to depth DI (DI · λsib) and then inserting the tuple into the corresponding Bi (H'I · λib). The total cost of IBWJ using PIM-Tree per tuple (CPJ) is given by Equation 6.

$$C_{PJ} = \overbrace{H_S \cdot \lambda_{ib}^s + H'_I \cdot \lambda_b^s + \sigma_s \cdot \tau_c \cdot \left(1 + \frac{m}{2}\right)}^{\text{Step 1}} + \underbrace{M'/(m \cdot w)}_{\text{Step 2}} + \underbrace{D_I \cdot \lambda_{ib}^s + H'_I \cdot \lambda_b^i}_{\text{Step 3}} \qquad (6)$$

Comparing the costs of IBWJ using IM-Tree and PIM-Tree, we obtain the following. Searching in PIM-Tree is faster because the average height of a subindex in PIM-Tree is less than the height of TI in IM-Tree. The costs of merging TI and TS in both trees are almost identical (M = M'), and consequently, the overall cost of tuple deletion is the same in both trees. The insertion costs in PIM-Tree and IM-Tree are controlled by the number of tuples in TI. Let the number of tuples in TI be represented by |TI|. For |TI| = 0 (after a merge), the constant overhead of traversing TS to depth DI in PIM-Tree is dominant and results in slower insertion in PIM-Tree. As |TI| increases, the cost of insertion in IM-Tree increases faster and eventually surpasses the insertion cost in PIM-Tree.

3.3.3 Concurrency Control in PIM-Tree



Figure 5: Shared task queue with task size of 2.

To protect the PIM-Tree structure during concurrent indexing, each subindex (Bi) is associated with a lock that coordinates the accesses of the threads to the subindex. Moreover, a searching thread may move from a Bi to its successor (Bi+1) during the leaf node scan to determine matching tuples. To address this issue, the last leaf node of each Bi is flagged such that the searching thread recognizes the movement from one subindex to another. In this case, the searching thread releases the lock and acquires the one associated with the successor.
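
A sketch of this lock hand-over, with hypothetical types; the point is that a range scan holds at most one partition lock at a time.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Each partition B_i owns a mutex; a range scan that crosses from B_i into its
// successor releases the current lock before acquiring the next one.
struct Partition {
    std::mutex lock;
    // ... leaf pages of this B_i
};

void scan_range(std::vector<Partition*>& parts, std::size_t first, std::size_t last) {
    for (std::size_t i = first; i <= last && i < parts.size(); ++i) {
        std::unique_lock<std::mutex> guard(parts[i]->lock);
        // Scan the leaves of parts[i] that intersect the query range; hitting the
        // flagged tail leaf means the scan continues in the successor B_{i+1}.
    }   // guard is released here, before the successor's lock is acquired
}
```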

Traversing TS is completely lock-free since its structure never changes, and there is no need for a concurrency control mechanism to avoid race conditions.

4. PARALLEL STREAM JOIN USING SHARED INDEXES

In this section, we present our parallel window join algorithm, which addresses the challenges of using shared indexes in a multi-threaded setting. During a concurrent join, tuples might be inserted into the indexes in an order different from their arrival order, depending on the scheduling of threads in the system. We design a join algorithm that is aware of the indexing status of tuples in order to avoid duplicate or missing results. Moreover, our join algorithm is based on an asynchronous parallel model, which enables threads to join or leave the operator dynamically depending on the system load.

4.1 Concurrent Stream Join Algorithm

Our parallel join algorithm processes incoming tuples in four steps: (1) task acquisition, (2) result generation, (3) index update, and (4) result propagation.

Task acquisition – A task represents a unit of work to schedule, which is a set of incoming tuples. The task size is the number of tuples assigned to a thread per task acquisition round, and it determines the trade-off between maximizing throughput and minimizing response time. Large tasks reduce scheduling and lock acquisition overhead but simultaneously increase system response time, whereas small tasks result in the opposite. In our join algorithm, tasks are distributed among threads based on dynamic scheduling; thus, a thread is assigned a task whenever the thread becomes available. This method enables our join algorithm to utilize an arbitrary number of threads without stalling when threads are unavailable.

We arrange incoming tuples into a shared work queue according to their arrival order, regardless of which stream they belong to, and we protect the accesses to this queue using a shared mutex. Each tuple in the work queue is assigned a status flag: available indicates that the tuple is ready to be processed but not yet assigned to any thread, active indicates that the tuple is assigned to a thread but the join results are not ready, and completed indicates that processing of the tuple is completed and the join results are ready but not yet propagated. When a tuple arrives in the queue, its status is initialized to available. Figure 5 illustrates the status of the work queue during a window join with a task size of 2.
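
A sketch of the work queue and the task acquisition step under these rules; the entry payload, the status flags and the recorded boundaries follow the description above, and all names are illustrative.

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

enum class Status { Available, Active, Completed };

struct QueueEntry {
    int         tuple;                   // incoming tuple (placeholder payload)
    Status      status = Status::Available;
    std::size_t tl = 0, te = 0;          // opposite-window boundaries at acquisition time
};

struct WorkQueue {
    std::deque<QueueEntry> entries;
    std::mutex             guard;        // shared mutex protecting the queue

    // Acquire up to task_size available entries: mark them active and record the
    // current boundaries of the opposite sliding window for each of them.
    std::vector<std::size_t> acquire(std::size_t task_size,
                                     std::size_t latest, std::size_t earliest) {
        std::lock_guard<std::mutex> lock(guard);
        std::vector<std::size_t> task;
        for (std::size_t i = 0; i < entries.size() && task.size() < task_size; ++i) {
            if (entries[i].status == Status::Available) {
                entries[i].status = Status::Active;
                entries[i].tl = latest;
                entries[i].te = earliest;
                task.push_back(i);
            }
        }
        return task;
    }
};
```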

Figure 6: Sliding window during parallel stream join.

During a concurrent stream join, sliding windows must store all tuples that are required to process active tuples of the opposite stream, which generally results in windows larger than w. In the case of a time-based sliding window, it is possible to filter out unrelated tuples using timestamps; however, for count-based sliding windows, it is necessary to record the boundaries of the opposite window at the point in time when a tuple is assigned to a thread. We refer to these boundaries as tl (latest tuple) and te (earliest tuple). When a thread acquires a task, it changes the status of the tuples to active and saves tl and te for each tuple.

Result generation – To avoid duplicate or missing results, we keep references to the earliest nonindexed tuple of each sliding window, referred to as the edge tuple. This tuple declares that all tuples before it are already indexed, whereas the statuses of the subsequent tuples are undetermined. When a thread starts to process a tuple, it stores the position of the edge tuple in a local variable since the value might be updated during processing. Using an old value of the edge tuple might increase the computational cost slightly, but it is safe in terms of result correctness. The lookup algorithm determines matching tuples in two steps. First, it queries the index for matching tuples and filters out those after the edge tuple or before tl. Second, it linearly searches the sliding window from the edge tuple to te and adds any results to the previously found results. Figure 6 illustrates the sliding window during the join operation. When a thread finishes processing a tuple, it stores the results in shared memory and updates the task status to completed in the shared queue but does not yet propagate the results into the output stream at this step.

Index update – After a thread generates the join results for a tuple, it inserts the tuple into the index and marks the tuple in the sliding window as indexed. Subsequently, the thread attempts to update the edge tuple accordingly. To avoid a race condition, a shared mutex coordinates write accesses to the edge tuple. Using a test-and-set operation, the thread checks whether the mutex is held by another thread. If so, it skips the edge tuple update and continues to the next step. Otherwise, it advances the edge tuple to the next nonindexed tuple in the sliding window and releases the mutex.
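
A sketch of the edge-tuple advance with the test-and-set behavior; the globals below are hypothetical, and try_lock plays the role of the test-and-set on the shared mutex.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// `indexed` marks which sliding-window slots have already been inserted into the
// index; `edge` is the position of the earliest non-indexed tuple.
std::mutex        edge_mutex;
std::vector<bool> indexed;
std::size_t       edge = 0;

void try_advance_edge() {
    if (!edge_mutex.try_lock()) return;   // held by another thread: skip the update
    while (edge < indexed.size() && indexed[edge])
        ++edge;                           // advance to the next non-indexed tuple
    edge_mutex.unlock();
}
```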

Result propagation – In the final step, a thread attempts to propagate the results of completed tuples. Similar to the edge tuple update routine, a shared mutex coordinates threads during result propagation. The thread checks the status of the mutex. In the case that the mutex is already held by another thread, the thread skips this step and begins to process another task. Otherwise, it verifies whether the results for the tuple at the work queue head are completed. If so, it propagates the results into the output stream and removes the tuple from the work queue. This routine is repeated until the status of the tuple at the work queue head is either active or available. Finally, the thread releases the mutex and starts to process another task.
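
And a matching sketch of the propagation step: only the thread that wins the try_lock drains completed entries from the queue head, which preserves the arrival order of the results. The types are again illustrative.

```cpp
#include <deque>
#include <mutex>
#include <vector>

enum class EntryStatus { Available, Active, Completed };

struct Entry {
    EntryStatus      status = EntryStatus::Available;
    std::vector<int> results;             // buffered join output for this tuple
};

std::mutex        head_mutex;
std::deque<Entry> work_queue;

void try_propagate() {
    if (!head_mutex.try_lock()) return;   // another thread is already propagating
    while (!work_queue.empty() && work_queue.front().status == EntryStatus::Completed) {
        // emit work_queue.front().results to the output stream, then drop the entry
        work_queue.pop_front();
    }
    head_mutex.unlock();
}
```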

4.2 Nonblocking Merge and Indexing



Figure 7: Nonblocking merge.

Performing merging as a blocking operation negatively impacts system availability and latency, which are both often critical concerns for stream processing applications. To address this challenge, we propose a nonblocking merge method. Our approach enables the stream join processing threads to continue the join without significant interruption during merge processing. Figure 7 illustrates the overall scheme of performing a nonblocking merge. The operation consists of two phases: first, creating a new PIM-Tree, and second, applying pending updates.

Whenever merging is needed, a thread called the merging thread is assigned to perform the merge operation. At the beginning of each phase, the merging thread blocks the assignment of new tasks until all active threads finish their currently processing tasks. During the first phase, the merging thread creates a new PIM-Tree without modifying the previous index tree. Concurrently, other threads resume performing tasks without index updates. When the merging thread finishes creating the updated PIM-Tree, it starts the next phase. At the beginning of the second phase, the merging thread swaps the old index with the new one before it unblocks the task assignment process. During the second phase, the merging thread applies pending updates, and other threads begin to perform the join operation with index updates. When the pending updates are finished, the merging thread leaves the merge operation and begins to perform the join operation.

During the first phase of the nonblocking merge, the index data are not updated; therefore, the position of the edge tuple does not change during this phase. Consequently, the linear search in the nonindexed portion of the sliding window becomes more expensive.
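
The two phases can be sketched roughly as follows. This is a hypothetical simplification: an atomic flag suspends index updates during phase 1, the new index is swapped in atomically, and the deferred inserts are applied in phase 2; the task-assignment barriers at the phase boundaries are omitted.

```cpp
#include <atomic>
#include <memory>
#include <vector>

struct Index { /* immutable T_S plus partitions B_i */ };

std::shared_ptr<Index> live_index = std::make_shared<Index>();
std::atomic<bool>      updates_suspended{false};   // phase-1 signal to the join threads

void nonblocking_merge(const std::vector<int>& pending_inserts) {
    // Phase 1: join threads keep producing results but defer their index updates.
    updates_suspended.store(true);
    auto fresh = std::make_shared<Index>();         // build the new T_S from old T_S + T_I
    std::atomic_store(&live_index, fresh);          // swap the old index with the new one
    updates_suspended.store(false);

    // Phase 2: join threads resume index updates while the merging thread applies
    // the inserts that were deferred during phase 1, then rejoins the join work.
    for (int key : pending_inserts) { (void)key; /* insert into the new index */ }
}
```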

5. EVALUATION

In this section, we present a set of experiments to benchmark the efficiency of the approaches introduced in this paper and empirically determine the corresponding parameters, such as merge ratio and insertion depth. Moreover, we study the influence of join selectivity and skewed value distribution on the performance of our parallel window join design. As the query workload, two streams, R and S, are joined via the following band join.

SELECT * FROM R, S
WHERE ABS(R.x - S.x) <= diff

The join attributes (R.x and S.x) are assumed to be random integers generated according to a uniform distribution, and the input rates of streams R and S are symmetric unless otherwise stated.

Because we evaluate each experiment for different window lengths (w), considering a fixed value for diff results in various join match rates (i.e., the match rate of a band join with w = 2^25 will be 2^15 times higher than that with w = 2^10), which influences the overall join performance. For a more comprehensive comparison, the value of diff is adjusted according to the window length such that the match rate (σs) is always two, except for the experiment that exclusively studies the influence of join selectivity.

We used two forms of band join: two-way join and self-join. In the former, R and S are two distinct streams, and in the latter, an identical stream is used as both R and S. The experiments are generally based on two-way join, except for those where we explicitly declare that self-join is used.

We evaluate our approaches on an octa-core (16 CPU threads, hyper-threading enabled) Intel Xeon E5-2665. For all multithreaded experiments, we utilize all 16 threads unless otherwise stated. We employ the STX-B+-Tree implementation, which is a set of C++ template classes for an in-memory B+-Tree [26], and we use our own CSS-Tree implementation as the immutable B+-Tree [27].

5.1 Comparison of Existing Approaches

Round-robin window partitioning – The purpose of this experiment is to study the efficiency of round-robin partitioning based approaches, such as low-latency handshake join, SplitJoin and BiStream, in the application of index-accelerated stream join. We evaluate five implementations of the window join: (1) single-threaded Nested-Loop Window Join (NLWJ), (2) multithreaded NLWJ based on round-robin partitioning, (3) single-threaded IBWJ using B+-Tree, (4) multithreaded IBWJ based on round-robin partitioning, and (5) multithreaded IBWJ using Bw-Tree. Figure 8a presents the results for varying window sizes.

Comparing the join algorithms, we observe that NLWJ is more vulnerable to the sliding window size because its performance linearly decreases as the window size increases. In contrast, the performance of IBWJ is less sensitive to the sliding window size. Multithreaded join using round-robin partitioning improves the performance of NLWJ and IBWJ by factors of 8 and 2.5, respectively. This result implies that although approaches based on round-robin window partitioning are effective for NLWJ, they cannot efficiently exploit the computational power of multicore processors for IBWJ.

Moreover, the performance results of parallel IBWJ using Bw-Tree indicate that the efficiency of concurrent operations in Bw-Tree improves as the size of the Bw-Tree increases. The larger the indexing tree, the lower the probability of different threads accessing the same node at the same time; consequently, the multithreading efficiency increases. For the smallest sliding window size (w = 2^14), parallel IBWJ using Bw-Tree results in 65% lower throughput than parallel IBWJ using round-robin partitioning, but for the largest window size evaluated (w = 2^25), parallel IBWJ using Bw-Tree outperforms the round-robin based method and results in 75% higher throughput.

Chained index – Figure 8b shows the throughput of IBWJ using chained index [20] for varying chain lengths. We propose and evaluate two different designs for chained index, referred to as B+-Tree chain (B-chain) and Immutable B+-Tree chain (IB-chain). In the former design, all subindexes are B+-Trees, including the active subindex (the one where newly arriving tuples are inserted) and all archived subindexes. In the latter design, only the active subindex is a B+-Tree, and before archiving an active subindex, it is converted into an immutable B+-Tree; thus, all archived subindexes are immutable B+-Trees.

We observe that the IB-chain results in 50% higher throughput than the B-chain on average, which indicates that the immutable B+-Tree vastly outperforms the regular B+-Tree for search queries in this scenario.



Figure 8: a) Performance evaluation of multi-threaded window join using round-robin partitioning. b) Throughput comparison of IBWJ using chained index and B+-Tree (w = 2^20). c) Throughput vs. insertion depth (DI) for single-threaded IBWJ using PIM-Tree. d) Throughput vs. insertion depth (DI) for parallel IBWJ using PIM-Tree.

For both the B-chain and IB-chain, the shortest chain length, which is two, results in the best throughput. However, the performance noticeably decreases when the chain length increases. The main drawback of chained index is the higher search complexity, which increases almost linearly with the chain length. Although the index chain reduces the overhead of tuple removal using coarse-grained data discarding, the higher search overhead degrades its overall performance.

5.2 IBWJ using PIM-Tree and IM-Tree

Insertion depth – In this experiment, we study the impact of the insertion depth (DI) on the performance of PIM-Tree. Increasing DI results in smaller subindexes (Bi's), which accelerates the operations on the subindexes, but it simultaneously increases the overhead of searching TS to find the corresponding Bi. Figures 8c and 8d show the throughputs of single-threaded and parallel IBWJ, respectively, using PIM-Tree for different DI values ranging from one to four, considering that the root node is at depth zero. For window sizes of 2^16 to 2^19, there are only four levels of inner nodes (including the root node); thus, the maximum feasible DI is three.

The results for DI = 1 reveal that the number of inner nodes at depth DI highly influences the performance of parallel IBWJ. If the number of subindexes in TI (which is equal to the number of inner nodes at depth DI) is not sufficient, then the performance significantly decreases due to high partition locking congestion. From w = 2^16 to 2^20, the system throughput rapidly increases since the number of inner nodes at DI = 1 also increases. At w = 2^21, the number of inner nodes at DI = 1 decreases since the tree depth is incremented by one, which also causes a decrease in the IBWJ throughput. For larger values of DI (three and four), the IBWJ throughput does not improve, which suggests that multithreading is no longer bounded by the number of subindexes.

For the case of single-threaded IBWJ, the achieved throughput for different DI values is less dependent on the window size. However, setting DI to the highest feasible value results in a higher overhead for searching TS and lowers the overall performance.

Merge ratio (m) – To determine the empirically optimal merge ratio for IM-Tree and PIM-Tree, we conduct an experiment for each data structure. Figures 9c and 9d illustrate the throughputs of single-threaded IBWJ using IM-Tree and PIM-Tree, respectively, with merge ratios ranging from 2^-6 to 1. The results for both data structures follow a similar pattern, but the average throughput employing PIM-Tree is higher than that using IM-Tree. Additionally, the system does not perform efficiently for either very low or very high values of the merge ratio. This underperformance is a consequence of the excessive overhead imposed by the frequent merges when the merge ratio is set very low and by the inefficient insert and search operations when the merge ratio is set very high. The results suggest that the choice of the merge ratio is more influential for smaller sliding windows, and the empirically optimal ratio is not identical for all window sizes. For the largest evaluated sliding window (2^23), setting the merge ratio to 1/2^4 results in the highest throughput, whereas for the smallest one (2^16), 1/2^3 is the best merge ratio.

Figure 9a illustrates the throughput of the parallel IBWJ using PIM-Tree for varying merge ratios ranging from 2^-6 to 1. In contrast to the single-threaded implementation, setting the merge ratio to the highest value always results in the best performance in the multithreaded setting, regardless of the window size. This result indicates that the cost of merge operations during a parallel window join is higher than the cost in a single-threaded setup. Hence, minimizing the number of merges results in the highest throughput. We also observe that the choice of the merge ratio is more influential for smaller window sizes. Henceforth, we set the value of the merge ratio for the multithreaded setup to one.

B+-Tree vs. IM-Tree vs. PIM-Tree – In this experiment, we compare the performances of IBWJ using B+-Tree, IM-Tree and PIM-Tree. For a more comprehensive comparison, we divide the process of finding matching tuples into two steps: traversing the index tree for the tuple with the lowest value, referred to as searching, and linearly checking tuples in leaf nodes, referred to as scanning. For each data structure, we measure the costs of the different steps of performing IBWJ, including insert, delete, search, scan, and merge. Figure 9b shows the results for sliding window sizes of 2^17 and 2^23.

The merging overhead is almost identical for both IM-Tree and PIM-Tree, and it constitutes 7% and 11% of the total processing for 2^17 and 2^23 windows, respectively. Regarding the tuple insertion performance, PIM-Tree and IM-Tree perform nearly identically, and they are 1.5 and 2.6 times faster than B+-Tree for 2^17 and 2^23 windows, respectively. For the smaller window size (2^17), searching in B+-Tree is 75% faster than searching in IM-Tree and PIM-Tree. However, for the larger window size (2^23), the search performances corresponding to PIM-Tree and B+-Tree are nearly identical, and both are slightly faster than IM-Tree.

Figure 10a presents the throughput of single-threaded IBWJ using B+-Tree, IM-Tree, and PIM-Tree for varying window sizes. We observe that employing PIM-Tree and B+-Tree results in the best and the worst performances, respectively. Considering IBWJ using B+-Tree as the baseline, average improvements in system performance of 50% and 63% are achieved by employing IM-Tree and PIM-Tree, respectively.




Figure 9: a) Throughput vs. merge ratio for parallel IBWJ using PIM-Tree. b) Cost comparison of the different steps of IBWJ for a single tuple using various indexing data structures. c) Throughput vs. merge ratio for IBWJ using IM-Tree. d) Throughput vs. merge ratio for IBWJ using PIM-Tree.


Figure 10: a) Performance comparison of single-threaded IBWJ using different indexing data structures. b) Throughput vs. match rate for IBWJ (w = 2^20). c) Throughput vs. task size for parallel IBWJ using PIM-Tree. d) Latency vs. task size for parallel IBWJ using PIM-Tree.

Match rate (σs) – Figure 10b shows the throughputs of four different implementations of IBWJ for the window size of 2^20 and match rates varying from 2^-4 to 2^10. These implementations are single-threaded IBWJ using B+-Tree, IM-Tree, and PIM-Tree, and multithreaded IBWJ using PIM-Tree. The join performance varies negligibly for match rates between 2^-4 and 2^4, which indicates that the join performance in this range is bounded by index traversal rather than by the linear leaf node scans. As the match rate increases beyond 2^4, the join performance noticeably decreases for all implementations. This result implies that for higher match rates, i.e., 2^5 ≤ σs ≤ 2^10, the join performance is bounded by the system memory bandwidth due to extensive leaf node scans. Consequently, multithreading loses its advantage for IBWJ with high selectivities, and its performance becomes closer to that of the single-threaded implementations. Additionally, the result indicates that single-threaded IBWJ using IM-Tree and PIM-Tree for joins with high selectivity results in better performance than using B+-Tree, which is because of the more efficient leaf node scan in the immutable B+-Tree (TS) than in the regular B+-Tree.

Task size – In this experiment, we study the influence of the task size on our parallel window join algorithm. Increasing the task size decreases the overhead of task acquisition while simultaneously increasing the system latency (task processing time). Figures 10c and 10d illustrate the performance of IBWJ using PIM-Tree over different task sizes ranging from 1 to 10 in terms of throughput and latency, respectively. Increasing the task size to four steadily improves the performance, which suggests that very small task sizes lead to significant task scheduling overhead. For task sizes from five to eight, a minor improvement is achieved, and for task sizes larger than eight, the performance does not significantly vary. The evaluation results shown in Figure 10d indicate that the task size greatly influences the system latency: increasing the task size leads to higher latencies. Additionally, we observe that the latency of parallel IBWJ is higher for larger sliding windows. As the window size increases, the PIM-Tree merge becomes more costly because it leads to longer linear window scans during the nonblocking merge and consequently causes higher latency. In the remainder of the evaluation, we use tasks of size eight.

Memory consumption – Figure 11a compares the memory space required for the different components of PIM-Tree and B+-Tree storing varying numbers of elements. Each element is a pair of 4 bytes for the key and 4 bytes for the sliding window reference. The storage required for PIM-Tree consists of the search-efficient component (TS), the insert-efficient component (TI), and a buffer that is required during the nonblocking merge. For this experiment, the merge ratio is set to one such that TI is at the largest possible size. The results reveal that the space required for PIM-Tree is almost double the space required for B+-Tree, regardless of window size.

Asymmetric sliding windows – In contrast to our other experiments, where we considered the lengths of both sliding windows to be equal, here we set different sizes for the sliding windows of streams R (wr) and S (ws), and we examine whether asymmetric window sizes impact the performance of IBWJ. Figure 11c presents the throughput of parallel IBWJ using PIM-Tree for various combinations of wr and ws. In general, the system performance for asymmetric window sizes follows the same pattern as for symmetric window sizes. Considering a fixed window size for one stream, increasing the size of the other window decreases system performance, although the magnitude of the performance decrease is smaller than when both sliding window sizes increase.

Asymmetric tuple distribution – Here, we examine the impact of asymmetric input rates on the performance of the parallel window join using PIM-Tree. An asymmetric input rate skews the distribution of search and insert operations among the two indexing data structures (i.e., inserts in one index vs. search in the other index). Thus, there are more insert operations in the index of the stream with the higher input rate and, at the same time, more search operations in its companion index. Figure 11b illustrates the throughput of the parallel window join using PIM-Tree for various input rates and window sizes. The results show that the throughput increases marginally as the input rate skew increases. This indicates that the parallel window join algorithm is resilient against input rate fluctuations.

Memory Bandwidth – The purpose of this experiment is to examine the impact of the system memory bandwidth on the performance of the parallel window join.




Figure 11: a) Memory footprint comparison of B+-Tree and PIM-Tree. b) Evaluation of IBWJ using PIM-Tree with an asymmetric input rate. c) Evaluation of IBWJ using PIM-Tree with asymmetric window sizes. d) Effective memory bandwidth of parallel IBWJ.


Figure 12: a) Comparison of parallel IBWJ using PIM-Tree utilizing a varying number of threads against the single-threaded implementation without concurrency control (CC) (w = 2^20). b) Evaluation of parallel IBWJ using PIM-Tree for different tuple value distributions. c) Performance comparison of single-threaded and multithreaded index-based self-join.

The maximum system memory bandwidth is 43 GB/s. Figure 11d illustrates the effective system memory bandwidth of the parallel window join using PIM-Tree (w = 2^20). The results indicate that 22% of the total memory bandwidth is due to store operations for the case of single-threaded execution. This ratio decreases to 16% as we increase the number of threads. The higher ratio of load operations for multithreaded executions is the result of a less efficient sliding window search during the multithreaded window join compared with the single-threaded execution. The parallel window lookup consists of two parts: (1) a linear scan between the edge-tuple and the sliding window head and (2) querying the index for the remaining window portion. In the case of a single-threaded window join, there is no need for the linear scan since the entire window content is always indexed; thus, the sliding window lookup is at its most efficient operating point. As we increase the number of threads, the number of active tasks in the system also increases. Consequently, the gap between the edge-tuple and the sliding window head increases, which causes more costly linear scans and consequently a less efficient window lookup.

Scalability – The objectives of this experiment are to first study the overhead of the concurrency control mechanisms and to then examine the scalability of our join algorithm using multiple threads. Figure 12a compares the resulting throughputs corresponding to self-join and two-way join using PIM-Tree under a varying number of threads against the single-threaded implementation without concurrency control (CC).

The results show that enforcing CC causes performance degradations of nearly 40% and 26% for two-way join and self-join, respectively, mainly as a result of the locking overhead. As we increase the number of threads from one to eight, the performance of both two-way join and self-join increases to 4.6 and 4 times that of the single-threaded implementation with CC, respectively. Moreover, the results reveal that enabling hyper-threading (16 threads) increases the throughput by 24%, and the mentioned improvements increase to 5.7 and 5 times, respectively.

Multithreading efficiency – In this experiment, we study the efficiency of our multithreading approach and nonblocking merge, and we also compare PIM-Tree to the state-of-the-art parallel indexing tree, Bw-tree. Figure 13c shows the throughputs of five different implementations of the two-way IBWJ: (1) single-threaded IBWJ using B+-Tree, (2) single-threaded IBWJ using PIM-Tree, (3) parallel IBWJ using Bw-tree, (4) parallel IBWJ using PIM-Tree, and (5) parallel IBWJ using PIM-Tree with blocking merge.

The results of parallel IBWJ using PIM-Tree show that the blocking and nonblocking merge techniques result in similar performance, with blocking merge being slightly faster than the nonblocking one because of the less complicated mechanism used to perform blocking merge operations. Moreover, the results reveal that our parallel approach is effective for window sizes larger than 2^14. For the smaller evaluated window sizes (2^10 to 2^13), merge operations occur very often, which leads to frequent linear window scans during merge operations, and thus the system performance declines. For window sizes between 2^15 and 2^25, our parallel IBWJ using PIM-Tree results in on average 7.5 and 3.7 times higher throughput than the single-threaded IBWJ using B+-Tree and PIM-Tree, respectively. The biggest improvement is achieved for the largest evaluated window size (2^25), with gains of 12 and 5.3 times, respectively. The evaluation results of IBWJ using Bw-tree reveal that Bw-tree is also not effective for the smaller evaluated window sizes (2^10 to 2^13), which is because of the high conflict between threads during index operations. For window sizes between 2^14 and 2^25, parallel IBWJ using Bw-tree results in 1.8 times higher throughput than our single-threaded IBWJ using PIM-Tree, on average. For the same range of window sizes, our parallel IBWJ using PIM-Tree outperforms the Bw-tree-based implementation by a factor of 2.2 on average. Although our PIM-Tree achieves better performance than Bw-tree, we do not aim to challenge Bw-tree in this work, since Bw-tree is designed as a generic parallel indexing tree that is highly efficient for OLTP systems where the majority of queries are read accesses (more than 80% [28]), whereas our design is specifically tuned for highly dynamic systems such as data stream indexing with a significantly higher rate of data modification.




Figure 13: a) Distribution of inserts among subindexes during drifting Gaussian distributions. b) Evaluation of multithreaded index-based self-join using PIM-Tree for shifting Gaussian distributions. c) Throughput comparison of single-threaded and multithreaded two-way join.


Figure 12c presents the performance comparison of the parallel and single-threaded IBWJ implementations for self-join. Similar to the experiment on two-way window joins, parallel self-join using PIM-Tree is not effective for the smaller evaluated window sizes (2^10 to 2^15). For window sizes between 2^16 and 2^25, parallel self-join using PIM-Tree achieves 7 and 4 times higher throughput than single-threaded self-join using B+-Tree and PIM-Tree, respectively.

Impact of skewed data – We now study the impact of the tuple value distribution on the performance of parallel IBWJ using PIM-Tree in two experiments. First, we examine IBWJ using three differently skewed distributions, including a Gaussian distribution (µ = 0.5, σ = 0.125) and two differently parameterized Gamma distributions (k = 3, θ = 3 and k = 1, θ = 5), and we compare them with the result of using a uniform distribution. For each evaluation, we adjust the band join predicate to keep the average match rate equal to two. Figure 12b presents the evaluation results (w = 2^20). The uniform distribution of the join attributes always results in the highest throughput, although the differences are not significant. On average, the resulting throughput of IBWJ using PIM-Tree for uniformly distributed join attributes is between 2% and 4% higher than for the Gaussian and Gamma distributions, respectively.

In the second experiment, we examine the impact of a dynamic tuple value distribution on the performance of IBWJ using PIM-Tree. In the case of a fixed tuple value distribution, the insert operations are spread uniformly across subindexes, even though the tuple value distribution is skewed. The reason is that B+-Tree nodes naturally adapt to the indexed values such that the subtrees of two inner nodes at the same depth have almost an equal number of indexed values. Because TI's subindexes are adjusted according to TS's inner nodes, the load among subindexes is uniformly distributed regardless of the value distribution. However, when the distribution changes, the range assignment is no longer optimal and causes skew in the insert operations among subindexes.

In contrast to the previous experiment, where the distribution of values was fixed, we now study PIM-Tree performance under a dynamic value distribution, which results in a skewed distribution of inserts among subindexes. For this purpose, we create a tuple sequence in which tuple values are generated based on a shifting Gaussian distribution, and we then evaluate the performance of parallel index-based self-join using PIM-Tree with this tuple sequence (w = 2^20). The tuple sequence consists of three phases. In the first phase, the tuples are generated according to the fixed Gaussian distribution N(0.5, 0.125) (µ = 0.5, σ² = 0.125). During the middle phase, the distribution of tuple values linearly shifts from N(0.5, 0.125) to N(r + 0.5, 0.125), where the constant value r defines the speed of the distribution change; thus, the larger r is, the faster the mean value of the Gaussian distribution shifts. In the last phase, the tuples are generated according to the Gaussian distribution N(r + 0.5, 0.125). We set the lengths of these three phases to 4M (4 × 2^20), 10M and 4M tuples, respectively. DI is set to 4, which results in 1024 subindexes considering fib = 32 and w = 2^20. Figure 13a illustrates the normalized distribution of insert operations among TI's subindexes during the distribution shift (second phase) for different values of r ranging from 0 to 1. It follows that inserts are spread among subindexes equally when the tuple value distribution is fixed (r = 0), and as r increases, the distribution of inserts becomes more skewed. For the highest value of r (r = 1), the insert distribution is highly skewed such that 77% of all inserts are assigned to a single subindex, and there are almost no inserts assigned to the other 70% of subindexes. Figure 13b presents the evaluation results for multiple values of r ranging from 0 to 1. The join performance during the distribution change depends on how fast the distribution shifts: slow, moderate or fast. During slow distribution shifts (r = 0.1, 0.2), there is almost no decrease in the stream join performance, which indicates that PIM-Tree is able to gracefully tolerate slow changes in the tuple value distribution. For moderate distribution shifts (r = 0.4, 0.6), the system performance decreases to 35% on average, which is due to high partition locking congestion. The lowest performance results from fast distribution shifts (r = 0.8, 1.0), where the performance decreases to 16%. The join performances for r = 0.8 and r = 1.0 are nearly identical, which indicates that partition locking congestion is close to its peak. Additionally, the results imply that regardless of how fast the distribution shifts during the second phase, as the distribution becomes stationary again in the third phase, the partitions in PIM-Tree are adjusted accordingly, and the stream join performance recovers.

6. RELATED WORK

Work related to our approach can be classified as follows: tree indexing, parallel B+-Tree, sliding window indexing, and parallel window join. We review these categories in this section.



Tree indexing – Due to advances in main memory technology, many databases are currently able to store indexing information in main memory and eliminate the expensive I/O overhead of disk storage. Consequently, a large body of work has explored tree-based in-memory indexing. B+-Tree is a popular modification of B-Tree, which provides better range query performance [29, 30]. T-Tree is a balanced binary tree specifically designed to index data for in-memory databases [31]. Although B-Tree was originally designed as a disk-stored indexing data structure, when properly configured, B-Tree outperforms T-Tree while enforcing concurrency control [32]. Rao et al. [33] extended CSS-Tree [27] to the cache-sensitive B+-Tree (CSB+-Tree), which supports update operations, although B+-Tree outperforms CSB+-Tree in applications that require incremental updates. LSM-Tree is a multi-level data structure which stores each component on a different storage medium [24]. All new tuples are inserted into the lowest-level component, and whenever the size of a component exceeds a predefined threshold, a part of the component merges into the higher-level one. LSM-Tree improves system performance in write-intensive applications using delta merging; however, it does not provide a solution for multi-threaded indexing. The adaptive radix tree (ART) is a high-speed in-memory indexing data structure that exhibits a better memory footprint than a conventional radix tree and better point query performance than B+-Tree [34]. However, B+-Tree outperforms ART in executing range queries [35]. We use B+-Tree as the baseline to evaluate our PIM-Tree since it supports incremental updates and range queries better than the other approaches.

Parallel B+-Tree – As we enter the multicore era, concurrent in-memory indexing is essential for databases to exploit the computational resources of a modern server. Bayer and Schkolnick [36] proposed a concurrency control method for supporting concurrent access in B-Trees based on coupled latching, in which threads are required to obtain the associated latch for each index node in every tree traversal. B-link is a B+-Tree with a relaxed structure that requires fewer latch acquisitions to handle concurrent operations [25]. However, concurrency control methods based on coupled latching are known to suffer from high latching overhead and poor scalability for in-memory systems [37].

PALM is a parallel latch-free B+-Tree based on bulk synchronous processing [38]. To avoid potential conflicts, PALM sorts all queries in bulk at each level of the tree traversal to guarantee that the operations on each node are assigned to a single thread. Although this approach is scalable and handles data distribution changes, it requires processing queries in large groups (the authors suggest groups of 8,000 queries to achieve a reasonable scale-up). This requirement negatively affects the system response time, which is an important criterion for data stream processing applications. Although PALM could excel at supporting batch-oriented processing engines, such as Apache Spark [39], it does not meet the requirements of real-time, event-by-event stream processing frameworks, such as Apache Storm [40], or the sliding window indexing considered in this paper. In contrast to query batching, Pandis et al. [41] proposed physiological partitioning (PLP) of indexing data structures on the basis of a multirooted B+-Tree. Using PLP, the index structure is partitioned into disjoint intervals, and each interval is assigned exclusively to a single thread. Although both PLP and our PIM-Tree employ range partitioning to provide concurrent indexing operations, the task distribution method and the concurrency control mechanism differ between the two. PLP is a latch-free partitioning technique where only one dedicated thread accesses each sub-index, whereas sub-indexes in PIM-Tree are uniformly accessible by all operating threads, and concurrent accesses to each sub-index are synchronized using locks. In PLP, a partition manager assigns queries to threads and ensures that all work given to a thread involves only data that it owns. If a query execution requires accessing multiple intervals, then the partition manager breaks the query into multiple subtasks and assembles the subresults to finish the query. Although the overhead of the partition manager is negligible for transaction processing in a database, in the case of processing a single streaming tuple, this overhead causes a significant performance decrease.

Rastogi et al. [42] introduced a multiversion concurrency control and recovery method in which update transactions create a new version of a node to avoid conflicting with lookup transactions rather than using locks. Since building a new node is required for every node modification, this method suffers from high node creation and garbage collection overhead for use cases with many update operations. Optimistic latch-free index traversal (OLFIT) is based on node versioning to ensure data consistency during tree traversal, but it does not require the creation of a new physical node to avoid conflicts [37]. In this approach, each node is assigned a version number and a lock. To update a node, it is necessary to obtain the associated lock and increment the version before releasing the lock. Node lookups are performed in an optimistic fashion. The reader thread compares the node version before and after the read operation; if the versions are identical and the node lock is not held, the read operation is successful. Otherwise, it repeats the entire operation. However, this approach does not provide an efficient node-merging algorithm, which is critical for preserving an efficient tree structure when the data distribution of tuples in the sliding window changes. Bw-Tree is another optimistic latch-free parallel indexing data structure that utilizes atomic compare-and-swap (CAS) operations to avoid race conditions [17]. Bw-Tree is designed to simultaneously exploit the computational power of multicore processors and the memory bandwidth of underlying storage, such as flash memories. Among the aforementioned approaches, Bw-Tree is the best choice for use cases with frequent incremental updates, which is why we use it as the baseline for our multithreaded indexing approach using PIM-Tree.

Sliding window indexing – A class of related work proposes accelerating window queries by utilizing an index. Golab et al. [7] evaluated different sliding window indexing approaches, such as hash-based and tree-based indexing, for different types of stream operators. Kang et al. [43] evaluated the performance of an asymmetric sliding stream join using different algorithms, such as nested loop join, hash-based join, and index-based join. Lin et al. [20] and Ya-xin et al. [21] proposed the chained index to accelerate index-based stream join utilizing coarse-grained tuple disposal. However, all of these approaches considered only single-threaded sliding window indexing, thus avoiding the concurrency issues resulting from parallel update processing, which is central to the focus of our work.

Parallel window join – Window join processing has received considerable attention in recent years due to its computational complexity and importance in various data management applications. Several approaches explore parallel window join processing. Cell-join is a parallel stream join operator designed to exploit the computing power of the Cell processor [18]. Handshake join is a scalable stream join that propagates stream tuples along a linear chain of cores in opposing directions [19]. Roy et al. [22] enhanced the handshake join by proposing a fast-forward tuple propagation to attain lower latency. SplitJoin is based on a top-down data flow model that splits the join operation into independent store and process steps to reduce the dependency among processing units [23]. Lin et al. [20] proposed a real-time and scalable join model for a computing cluster by organizing processing units into a bipartite graph to reduce memory requirements and the dependency among processing units.

All these approaches are based on context-insensitive window partitioning. Although these methods are effective for nested loop join or for memory-bounded joins with high selectivity, context-insensitive window partitioning causes redundant index operations in IBWJ, which limits the system efficiency.

7. CONCLUSIONS

In this paper, we presented a novel indexing structure called PIM-Tree to address the challenges of concurrent sliding window indexing. Stream join using PIM-Tree outperforms the well-known indexing data structure B+-Tree by a margin of 120%. Moreover, we introduced a concurrent stream join approach based on PIM-Tree, which is, to the best of our knowledge, one of the first parallel index-based stream join algorithms. Our concurrent solution improves the performance of IBWJ by up to 5.5 times when using an octa-core (16-thread) processor.

The directions for our future work are twofold: (1) developing a distributed stream band join and (2) extending PIM-Tree to support the indexing of multidimensional data. In this paper, we focused on parallelism within a uniform shared memory architecture. A further challenge, but altogether a different problem, is to develop a parallel IBWJ algorithm for nonuniform memory access (NUMA) architectures, which requires addressing two main concerns. First, a range partitioning technique that distributes the workload uniformly among operating cores is needed. Second, a repartitioning scheme that alleviates the overhead of data transfer between memory nodes in a NUMA system is needed. Although PIM-Tree discretizes tuples into disjoint intervals, these intervals are adjusted only according to the number of input tuples, which does not necessarily lead to a uniform distribution of the workload across all intervals. In the solution for uniform memory access presented in this paper, we used a shared work queue to distribute the workload among operating cores; in a NUMA system, however, an efficient range partitioning scheme is needed, which considers the numbers of both input and output tuples of each interval. Such a partitioning is not needed for the approach presented in this paper. Moreover, with respect to supporting multidimensional data, PIM-Tree is designed to index one-dimensional data. Multidimensional indexing is a vital requirement for many applications, specifically those that utilize spatiotemporal datasets. Thus, a further direction is the design of a multidimensional PIM-Tree.

8. REFERENCES

[1] X. Gao, E. Ferrara, and J. Qiu, “Parallel clustering of high-dimensional social media data streams,” in CCGrid, 2015, pp. 323–332.

[2] L. Zhang and Y. Guan, “Detecting click fraud in pay-per-click streams of online advertising networks,” in ICDCS, 2008, pp. 77–84.

[3] G. Montana, K. Triantafyllopoulos, and T. Tsagaris, “Data stream mining for market-neutral algorithmic trading,” in Proceedings of the 2008 ACM Symposium on Applied Computing, pp. 966–970.

[4] M. Stonebraker, U. Cetintemel, and S. Zdonik, “The 8 requirements of real-time stream processing,” SIGMOD, pp. 42–47, 2005.

[5] D. Dell'Aglio, E. Della Valle, F. van Harmelen, and A. Bernstein, “Stream reasoning: A survey and outlook: A summary of ten years of research and a vision for the next decade,” Data Science, pp. 59–83, 2017.

[6] S. Babu and J. Widom, “Continuous queries over data streams,” ACM SIGMOD Record, pp. 109–120, 2001.

[7] L. Golab, S. Garg, and M. T. Ozsu, “On indexing sliding windows over online data streams,” in International Conference on Extending Database Technology, 2004, pp. 712–729.

[8] H. Zhang, G. Chen, B. C. Ooi, K.-L. Tan, and M. Zhang, “In-memory big data management and processing: A survey,” IEEE Transactions on Knowledge and Data Engineering, pp. 1920–1948, 2015.

[9] P. Gepner and M. F. Kowalik, “Multi-core processors: New way to achieve high system performance,” in PARELEC, 2006, pp. 9–13.

[10] B. Chandramouli, J. Goldstein, M. Barnett et al., “Trill: A high-performance incremental query processor for diverse analytics,” VLDB, pp. 401–412, 2014.

[11] H. Miao, H. Park, M. Jeon, G. Pekhimenko et al., “StreamBox: Modern stream processing on a multicore machine,” in USENIX ATC 17, 2017, pp. 617–629.

[12] A. Koliousis, M. Weidlich, R. Castro Fernandez et al., “Saber: Window-based hybrid stream processing for heterogeneous architectures,” in SIGMOD, 2016, pp. 555–569.

[13] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy et al., “Storm@twitter,” in SIGMOD, 2014, pp. 147–156.

[14] M. Zaharia, R. S. Xin, P. Wendell, T. Das et al., “Apache Spark: A unified engine for big data processing,” Communications of the ACM, pp. 56–65, 2016.

[15] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl et al., “Apache Flink: Stream and batch processing in a single engine,” Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2015.

[16] S. Zeuch, B. D. Monte, J. Karimov et al., “Analyzing efficient stream processing on modern hardware,” VLDB, pp. 516–530, 2019.

[17] J. J. Levandoski, D. B. Lomet, and S. Sengupta, “The Bw-Tree: A B-tree for new hardware platforms,” in ICDE, 2013, pp. 302–313.



[18] B. Gedik, R. R. Bordawekar, and S. Y. Philip, “CellJoin: A parallel stream join operator for the Cell processor,” The VLDB Journal, pp. 501–519, 2009.

[19] J. Teubner and R. Mueller, “How soccer players would do stream joins,” in SIGMOD, 2011, pp. 625–636.

[20] Q. Lin, B. C. Ooi, Z. Wang, and C. Yu, “Scalable distributed stream join processing,” in SIGMOD, 2015, pp. 811–825.

[21] Y. Ya-xin, Y. Xing-hua, Y. Ge, and W. Shan-shan, “An indexed non-equijoin algorithm based on sliding windows over data streams,” pp. 294–298, 2006.

[22] P. Roy, J. Teubner, and R. Gemulla, “Low-latency handshake join,” VLDB, pp. 709–720, 2014.

[23] M. Najafi, M. Sadoghi, and H.-A. Jacobsen, “SplitJoin: A scalable, low-latency stream join architecture with adjustable ordering precision,” in USENIX Annual Technical Conference, 2016, pp. 493–505.

[24] P. O'Neil, E. Cheng, D. Gawlick, and E. O'Neil, “The log-structured merge-tree (LSM-tree),” Acta Informatica, pp. 351–385, 1996.

[25] P. L. Lehman et al., “Efficient locking for concurrent operations on B-trees,” ACM Transactions on Database Systems, pp. 650–670, 1981.

[26] T. Bingmann, “STX B+Tree C++ template classes,” http://panthema.net/2007/stx-btree, 2008.

[27] J. Rao and K. A. Ross, “Cache conscious indexing for decision-support in main memory,” in VLDB, 1999, pp. 78–89.

[28] J. Krueger, C. Kim, M. Grund et al., “Fast updates on read-optimized databases using multi-core CPUs,” VLDB, pp. 61–72, 2011.

[29] R. Elmasri, Fundamentals of Database Systems. Pearson Education India, 2008.

[30] R. Bayer and E. McCreight, “Organization and maintenance of large ordered indices,” in SIGFIDET, 1970, pp. 107–141.

[31] T. J. Lehman and M. J. Carey, “A study of index structures for main memory database management systems,” in VLDB, 1986, pp. 294–303.

[32] H. Lu, Y. Y. Ng, and Z. Tian, “T-tree or B-tree: Main memory database index structure revisited,” in ADC, 2000, pp. 65–73.

[33] J. Rao and K. A. Ross, “Making B+-trees cache conscious in main memory,” in SIGMOD, 2000, pp. 475–486.

[34] V. Leis, A. Kemper, and T. Neumann, “The adaptive radix tree: ARTful indexing for main-memory databases,” in ICDE, 2013, pp. 38–49.

[35] V. Alvarez, S. Richter, X. Chen, and J. Dittrich, “A comparison of adaptive radix trees and hash tables,” in ICDE, 2015, pp. 1227–1238.

[36] R. Bayer and M. Schkolnick, “Concurrency of operations on B-trees,” Acta Informatica, pp. 1–21, 1977.

[37] S. K. Cha, S. Hwang, K. Kim et al., “Cache-conscious concurrency control of main-memory indexes on shared-memory multiprocessor systems,” VLDB, pp. 181–190, 2001.

[38] J. Sewall, J. Chhugani, C. Kim et al., “PALM: Parallel architecture-friendly latch-free modifications to B+ trees on many-core processors,” VLDB, pp. 795–806, 2011.

[39] H. Karau, A. Konwinski, P. Wendell, and M. Zaharia, Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media, Inc., 2015.

[40] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy et al., “Storm@twitter,” in SIGMOD, 2014, pp. 147–156.

[41] I. Pandis, P. Tozun, R. Johnson, and A. Ailamaki, “PLP: Page latch-free shared-everything OLTP,” VLDB, pp. 610–621, 2011.

[42] R. Rastogi, S. Seshadri, P. Bohannon et al., “Logical and physical versioning in main memory databases,” in VLDB, 1997, pp. 86–95.

[43] J. Kang, J. F. Naughton, and S. D. Viglas, “Evaluating window joins over unbounded streams,” in International Conference on Data Engineering, 2003, pp. 341–352.

APPENDIX

A. PIM-Tree OPERATIONS

Here, we provide more detail about the implementation of the immutable B+-Tree and PIM-Tree.

A.1 PIM-Tree insertion

Algorithm 1 describes the process of inserting a new record into PIM-Tree. The first part (Lines 1-7) is to search TS to depth DI to find the corresponding subindex in TI for the given record. For each inner node, the algorithm linearly searches its keys (Lines 5-7) and then calculates the relative location of the next inner node (Line 7).

Algorithm 1: PIM-Tree insertion
Input: Inners : Array of TS's inner node keys in BFS order
Input: Roots : Array of TI's root nodes
Input: Data : A key-value pair to be inserted in PIM-Tree

1  p ← 0
2  for i ← 0 to DI do
3      k ← 0
4      node ← Level_Offsets[i]§ + p × sib‡
5      while k < fib† and Data.key ≤ Inners[node + k] do
6          k ← k + 1
7      p ← p × fib + k
8  mutex[p].lock()
9  node ← Roots[p]
10 B+Tree_insert(node, Data)
11 mutex[p].unlock()

† Fan-out of inner nodes in immutable B+-Tree
‡ Size of inner nodes in immutable B+-Tree (sib = fib − 1)
§ Level_Offsets[i] refers to the first node at depth i in breadth-first order

The second part (Lines 8-11) is to insert the record into the corresponding subindex. First, the insert routine acquires the mutex that is associated with the targeted subindex (Line 8), and then it fetches the root node of the subindex (Line 9). Next, it inserts the record into the subindex using the B+-Tree insert algorithm. The insert algorithm also takes care of correctly setting the flag of the last leaf node in the case that the last leaf node needs to be split. Finally, the associated mutex is released and the operation is terminated.
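To make the locking discipline concrete, the following C++ sketch mirrors the two parts of Algorithm 1 under simplifying assumptions: the names (PimTree, SubIndex, level_offsets, fan_out, insert_depth) are illustrative rather than taken from the actual implementation, std::map stands in for a B+-Tree subindex, and the key routing is simplified to the usual "advance while the key is larger than the separator" convention.

#include <cstdint>
#include <deque>
#include <map>
#include <mutex>
#include <vector>

// Hypothetical sketch of PIM-Tree insertion: route through TS's immutable
// inner nodes without any locks, then lock only the target TI subindex.
struct SubIndex {
    std::mutex lock;                     // one mutex per subindex (Lines 8 and 11)
    std::map<uint32_t, uint32_t> tree;   // placeholder for a B+-Tree subindex
};

struct PimTree {
    std::vector<uint32_t> inners;        // TS inner-node keys in BFS order
    std::vector<size_t> level_offsets;   // first key of each depth within 'inners'
    std::deque<SubIndex> sub_indexes;    // TI; a deque keeps mutexes in place as it grows
    size_t fan_out = 32;                 // fib
    size_t insert_depth = 2;             // DI

    void insert(uint32_t key, uint32_t value) {
        size_t p = 0;                                            // node index within its depth
        for (size_t d = 0; d < insert_depth; ++d) {              // routing, cf. Lines 1-7
            size_t node = level_offsets[d] + p * (fan_out - 1);  // sib = fib - 1 keys per node
            size_t k = 0;
            while (k < fan_out - 1 && key > inners[node + k]) ++k;
            p = p * fan_out + k;                                 // child position at the next depth
        }
        SubIndex& si = sub_indexes[p];                           // p now names the target subindex
        std::lock_guard<std::mutex> guard(si.lock);              // cf. Lines 8 and 11
        si.tree.emplace(key, value);                             // stand-in for B+Tree_insert (Line 10)
    }
};

Because the routing step only reads the immutable TS, no synchronization is needed until the target subindex is reached; the per-subindex mutex then serializes writers on that subindex alone.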

A.2 PIM-Tree search

Next, we describe the search in PIM-Tree for a given range of values, which is described in Algorithm 2.



The process starts with searching TS's inner nodes for the minimum value of the given range (Range.min) (Lines 1-9). At depth DI, the search process detects which of TI's subindexes stores records with value equal to Range.min, which we refer to as the min sub-index. At TS's leaf nodes, the search process first searches for Range.min (Lines 10-12) and then linearly traverses through all matching records (Lines 13-15).

Algorithm 2: PIM-Tree search
Input: Leaves : Array of TS's leaf node elements
Input: Inners : Array of TS's inner node keys in BFS order
Input: Roots : Array of TI's root nodes
Input: Range : Range of values to search
Output: Results : Search results

1  p ← 0
2  for i ← 0 to N − 1 do
3      if i = DI then
4          sub_index ← p
5      k ← 0
6      node ← Level_Offsets[i] + p × sib
7      while k < fib and Range.min ≤ Inners[node + k] do
8          k ← k + 1
9      p ← p × fib + k
10 p ← p × lib†
11 while p < Leaves.length and Range.min < Leaves[p].key do
12     p ← p + 1
13 while p < Leaves.length and Leaves[p].key ≤ Range.max do
14     Results.add(Leaves[p])
15     p ← p + 1
16 mutex[sub_index].lock()
17 node ← Roots[sub_index]
18 (leaf, pos) ← B+Tree_search(node, Range.min)
19 while finish_flag = False do
20     if pos < leaf.size then
21         if leaf[pos].key ≤ Range.max then
22             Results.add(leaf[pos])
23             pos ← pos + 1
24         else
25             finish_flag ← True
26     else
27         if leaf.last_leaf = True then
28             mutex[sub_index + 1].lock()
29             leaf ← leaf.next
30             mutex[sub_index].unlock()
31             if Range.max < Inners[Level_Offsets[DI − 1] + sub_index] then
32                 finish_flag ← True
33             sub_index ← sub_index + 1
34         else
35             leaf ← leaf.next
36         if leaf = NULL then
37             finish_flag ← True
38         pos ← 0
39 mutex[sub_index].unlock()

† Size of leaf nodes in immutable B+-Tree

The next step is to look for matching tuples in TI (Lines 16-39). The search process acquires the mutex associated with the min sub-index (Line 16), and then it fetches the min sub-index's root node (Line 17). Using the B+-Tree search routine, it locates the first matching record in ascending order in the min sub-index that is equal to or greater than Range.min (Line 18). The return values of the B+-Tree search are leaf and pos, which indicate that the desired record is in the pos-th slot of node leaf. If all records in the min sub-index are less than Range.min, then the search result is the first empty slot of the last leaf node.

The final step is to scan TI's leaf nodes to find matching tuples (Lines 19-38). Whenever the search process switches from a node to its successor, it checks whether it also switches to a new subindex (Line 27). If so, it acquires the mutex of the successor subindex before switching to the successor leaf node and releases the mutex of the current subindex afterwards (Lines 28-30). Moreover, in the case that the range of the new subindex does not overlap with the given range, the search process terminates (Lines 31-32). This range checking helps avoid searching through chains of empty subindexes. At the end, the process releases the current subindex's mutex and terminates (Line 39).
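The lock hand-over performed when the scan crosses a subindex boundary (Lines 27-33) can be sketched as follows; the Leaf type, the per-subindex mutex container, and the boolean range-check parameter are hypothetical simplifications, not the paper's code.

#include <cstddef>
#include <deque>
#include <mutex>

// Hypothetical leaf node of a TI subindex; keys and values are omitted.
struct Leaf {
    Leaf* next;      // successor leaf, possibly belonging to the next subindex
    bool last_leaf;  // true if this is the last leaf of its subindex
};

// Advance the scan cursor to the successor leaf. 'locks' holds one mutex per
// subindex and the caller already owns locks[sub_index]; the boolean parameter
// stands in for the range check of Lines 31-32. Returns false when the scan
// should stop.
bool advance_leaf(Leaf*& leaf, std::size_t& sub_index,
                  std::deque<std::mutex>& locks,
                  bool next_range_starts_beyond_max) {
    if (leaf->last_leaf) {
        locks[sub_index + 1].lock();   // acquire the successor subindex first...
        leaf = leaf->next;
        locks[sub_index].unlock();     // ...then release the current one
        ++sub_index;
        if (next_range_starts_beyond_max) return false;  // early termination
    } else {
        leaf = leaf->next;             // same subindex: no lock transfer needed
    }
    return leaf != nullptr;
}

Acquiring the successor's mutex before releasing the current one mirrors the hand-over-hand order described above, so the scan always holds the lock of the subindex it is currently reading.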

A.3 Immutable B+-Tree creation

Nodes in the immutable B+-Tree are arranged in a breadth-first fashion. The relation among elements is deduced implicitly based on their position rather than explicitly through pointers or references. Using this node organization, if node N is the i-th node at level d in breadth-first order, then the j-th child of N is at position Offset[d + 1] + i × fib + j, where Offset[d] points to the beginning of the d-th level and fib is the fan-out of inner nodes. Since it is not required to explicitly store references to child nodes with this representation, it is possible to achieve a higher fan-out using the same amount of space compared with the regular B+-Tree. Consequently, the depth of the immutable B+-Tree is smaller than the depth of a regular B+-Tree storing the same number of elements, which results in better search performance.
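As a small illustration of this pointer-free layout (the helper name and parameters are illustrative, not part of the implementation), the position of a child can be computed directly from the formula above:

#include <cstddef>
#include <vector>

// Position, within the flat BFS array of inner nodes, of child j of the i-th
// node at depth d; 'offset' holds the start of each level and 'fib' is the
// inner-node fan-out.
inline std::size_t child_position(const std::vector<std::size_t>& offset,
                                  std::size_t fib, std::size_t d,
                                  std::size_t i, std::size_t j) {
    return offset[d + 1] + i * fib + j;
}

For example, with fib = 32, child j = 5 of node i = 3 at depth d = 1 sits at offset[2] + 3 × 32 + 5, so no per-node child pointers need to be stored.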

Algorithm 3: Immutable B+-Tree creation
Input: Leaves : Array of leaf node elements
Output: Inners : Array of inner node keys in BFS order

1  depth ← CalculateInnerDepth(Leaves.length)
2  Length[depth − 1] ← Leaves.length / lib§
3  for i ← depth − 2 to 0 do
4      Length[i] ← Length[i + 1] / fib
5  Offset[0] ← 0
6  for i ← 1 to depth − 1 do
7      Offset[i] ← Offset[i − 1] + Length[i − 1]
8  for i ← 0 to depth − 1 do
9      Node_Size[i] ← 0
10     Current_Slot[i] ← 0
11 for i ← 0 to Leaves.length / lib do
12     k ← depth − 1
13     status ← true
14     repeat
15         if Node_Size[k] ≠ fib then
16             Inners[Offset[k] + Current_Slot[k]] ← Leaves[i × lib + lib − 1]
17             Node_Size[k] ← Node_Size[k] + 1
18             Current_Slot[k] ← Current_Slot[k] + 1
19             status ← false
20         else
21             Node_Size[k] ← 0
22             k ← k − 1
23     until status = false

§ Size of leaf nodes in immutable B+-Tree

Algorithm 3 describes how inner nodes are created for a given sorted array of leaf node elements. First, the algorithm calculates the tree depth (Line 1) and the number of inner nodes at each depth (Lines 2-4). In the next step (Lines 5-7), it determines the address of each tree level relative to the start of the inner nodes array (Offset[d]).




Figure 14: The cost of a PIM-Tree merge operation for various window sizes.

Then, the size of the current inner node and the current slot at each level are initialized to zero (Lines 8-10). For each leaf node, the algorithm starts from the deepest level of inner nodes (k = depth − 1). At each level, it checks whether there is an empty slot in the current inner node. If so, it assigns the largest key of the current leaf node to the next available slot in the inner node and increments the inner node size; then, it resumes tree creation by advancing to the next leaf node. If there is no empty slot left in the inner node (Lines 21-22), the algorithm initializes a new node at the current level (resetting the node size to zero), moves to the parent level, and repeats the same procedure. Considering that l is the total number of elements in leaf nodes and d is the depth of the tree, Equation 7 shows the computational complexity of the immutable B+-Tree creation.

    ∑_{k=1}^{d} (k · l) / fib^k = O(l)        (7)
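As a quick sanity check of the O(l) bound (this step is ours, not part of the original derivation), the finite sum can be bounded by the corresponding infinite series:

    ∑_{k=1}^{d} (k · l) / fib^k  ≤  l · ∑_{k≥1} k / fib^k  =  l · fib / (fib − 1)²,

which is O(l) for any fixed fan-out fib ≥ 2; for instance, fib = 32 gives roughly 0.033 · l.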

Figure 14 illustrates the cost of the PIM-Tree merge operation, including merging nonexpired tuples of TS and TI into a single sorted array and creating a new immutable B+-Tree. As shown, the cost of a merge operation increases linearly in the number of elements in the tree.


