Date post: | 20-Mar-2017 |
Youngstown State University
Write Optimization using Asynchronous Update on Out-of-Core Column-Store Databases in Map-Reduce
Feng Yu, Eric S. Jones
Youngstown State University, Youngstown, OH
[email protected], [email protected]
Wen-Chi Hou
Southern Illinois University, Carbondale, IL
Column-Store Databases
• The column-store database is also known as a columnar database or column-oriented database.
• The column-store database fits well into the write-once-and-read-many environment.
  – It retrieves only the attributes a query references, without reading entire tuples.
  – It works especially well for OLAP and data mining queries.
  – It can reach a higher compression rate and higher reading speed than row-based databases.
Challenge
• Optimizing write operations in a column-store database has always been a challenge.
• Data is vertically decomposed into BATs (Binary Association Tables) and randomly distributed over the storage.
• Writes to a column-store database are significantly delayed by ad hoc access to large BATs across multiple pages.
• Existing work focuses mainly on write optimization in main-memory column-store databases.
BAT Example
[Fig. 1: customer data in row-based (relational) format and in column-store (BAT) format, with the mapping rules between the two]
A BUN consists of (oid, value).
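The vertical decomposition can be illustrated with a small Python sketch that splits row-based customer records into one BAT per attribute. The sample names, the function `decompose`, and the rule `oid = 100 + id` are assumptions made for illustration (the rule is chosen only so that the oids match the 101, 102, 103 used in the TBAT example); the actual mapping rules are the ones shown in Fig. 1.

```python
from typing import Any, Dict, List, Tuple

BUN = Tuple[int, Any]  # a BUN is an (oid, value) pair


def decompose(rows: List[Dict[str, Any]],
              key: str = "id",
              oid_base: int = 100) -> Dict[str, List[BUN]]:
    """Split row-based records into one BAT (list of BUNs) per attribute.

    Assumption: oid = oid_base + row[key]. This is a hypothetical
    mapping rule, not necessarily the one used in the slides.
    """
    bats: Dict[str, List[BUN]] = {}
    for row in rows:
        oid = oid_base + row[key]
        for attr, value in row.items():
            if attr == key:
                continue  # the key is encoded in the oid
            bats.setdefault(attr, []).append((oid, value))
    return bats


# Hypothetical sample data for the customer table.
customers = [
    {"id": 1, "name": "Alice", "balance": 100.00},
    {"id": 2, "name": "Bob", "balance": 200.00},
    {"id": 3, "name": "Carol", "balance": 300.00},
]

bats = decompose(customers)
print(bats["balance"])
```

Each attribute ends up in its own BAT, so a query that touches only `balance` never reads the `name` column.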
Update on BAT in Map-Reduce
• In a Map-Reduce environment, we assume the update list of OIDs is collected and submitted in a batch.
1. Map-Reduce Join
   BAT LEFT OUTER JOIN UPDATE_LIST ON OID => (BAT combine UPDATE_LIST)
   • Map-side join: when UPDATE_LIST is small enough to fit into memory
   • Reduce-side join: when UPDATE_LIST is too large to fit into memory
2. Projection (Map-Only)
   FOR each record in (BAT combine UPDATE_LIST)
       IF UPDATE_LIST attribute is not NULL: output updated value
       ELSE: output original value
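The two steps above can be sketched in plain Python for the map-side case. This is a single-process illustration, not Hadoop code: the function names `map_side_join` and `project` are hypothetical, and the update list is assumed to be a dictionary from oid to new value that fits in memory.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

BUN = Tuple[int, float]                       # (oid, value)
Joined = Tuple[int, float, Optional[float]]   # (oid, original, updated or None)


def map_side_join(bat: Iterable[BUN],
                  update_list: Dict[int, float]) -> Iterator[Joined]:
    """Step 1: BAT LEFT OUTER JOIN UPDATE_LIST ON oid.

    Every BUN is kept; the third field is None when no update exists.
    """
    for oid, value in bat:
        yield oid, value, update_list.get(oid)


def project(joined: Iterable[Joined]) -> Iterator[BUN]:
    """Step 2: map-only projection choosing the updated or original value."""
    for oid, original, updated in joined:
        yield oid, (updated if updated is not None else original)


bat = [(101, 100.00), (102, 200.00), (103, 300.00)]
update_list = {102: 201.00}

new_bat = list(project(map_side_join(bat, update_list)))
print(new_bat)
```

Note that the whole BAT is read and rewritten even though only one BUN changed, which is the cost the rest of the slides aim to avoid. Because None stands for "no update" here, this sketch cannot express an update that sets a value to NULL.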
Motivation
• Focus: Write-optimization on column-store in Map-Reduce
• Principle: avoid seeking and writing on every change
• Solution: Timestamp the newly updated data (TBAT)
  – multi-version
  – no index needed
• Update: AMO (Asynchronous Map-Only) update
  – the newly updated data is appended to the end of a TBAT slip in a map-only manner
TBAT (Timestamped BAT)
• TBAT in HDFS:
  struct TBUN {
      TIMESTAMP optime,
      ROWID oid,
      USER_DEFINED_TYPE attrv
  }
  struct TBAT_slip {
      TBUN[max_size_per_HDFS_slip] tbuns
  }
  – No need for any global pre-sorting or indexing
  – ‘attrv’ can be any user-defined type, which allows arbitrary kinds of schema to be defined flexibly
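One possible in-memory rendering of these two structures in Python is sketched below. The class names, the integer timestamp, and the capacity constant `MAX_TBUNS_PER_SLIP` (a stand-in for `max_size_per_HDFS_slip`) are assumptions for illustration; the slides do not specify how a slip is laid out on HDFS.

```python
from dataclasses import dataclass, field
from typing import Any, List

# Hypothetical capacity standing in for max_size_per_HDFS_slip.
MAX_TBUNS_PER_SLIP = 4


@dataclass(frozen=True)
class TBUN:
    optime: int   # timestamp of the operation (assumed to be an integer)
    oid: int      # row identifier
    attrv: Any    # attribute value of any user-defined type


@dataclass
class TBATSlip:
    tbuns: List[TBUN] = field(default_factory=list)

    def is_full(self) -> bool:
        return len(self.tbuns) >= MAX_TBUNS_PER_SLIP

    def append(self, tbun: TBUN) -> None:
        """Append at the end; no sorting or index maintenance is performed."""
        if self.is_full():
            raise ValueError("slip is full; a new slip would be needed")
        self.tbuns.append(tbun)


slip = TBATSlip()
for oid, balance in [(101, 100.00), (102, 200.00), (103, 300.00)]:
    slip.append(TBUN(optime=1, oid=oid, attrv=balance))

print(len(slip.tbuns), slip.is_full())
```

Making `TBUN` immutable mirrors the multi-version idea: an existing version is never modified, only superseded by a later one.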
TBAT Example (logical view)
BAT customer_balance
oid   float
101   100.00
102   200.00
103   300.00

TBAT customer_balance
optime   oid   float
time1    101   100.00
time1    102   200.00
time1    103   300.00
Suppose the existing records were inserted in one batch at time1.
AMO Update (logical)
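Example: Update query on customer table:
update customer set balance=201.00 where id=2
Current timestamp is time2 (>time1).
The newest TBUN for 201.00 is appended to the end of TBAT customer_balance.
optime   oid   float
time1    101   100.00   (old data)
time1    102   200.00   (old data)
time1    103   300.00   (old data)
time2    102   201.00   (new data)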
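The append-only behaviour of this example can be sketched as follows. The function name `amo_update` and the use of integers for time1 and time2 are assumptions; in a real deployment the append would be performed by map tasks writing to the end of a TBAT slip in HDFS rather than by extending a Python list.

```python
from typing import Any, Dict, List, Tuple

TBUN = Tuple[int, int, Any]  # (optime, oid, attrv)


def amo_update(tbat: List[TBUN], updates: Dict[int, Any], optime: int) -> None:
    """Append one timestamped TBUN per update.

    Existing TBUNs are never located or overwritten, so no seek
    into the old data is required.
    """
    tbat.extend((optime, oid, value) for oid, value in updates.items())


TIME1, TIME2 = 1, 2  # assumed integer timestamps with TIME2 > TIME1

tbat: List[TBUN] = [
    (TIME1, 101, 100.00),
    (TIME1, 102, 200.00),
    (TIME1, 103, 300.00),
]

# update customer set balance=201.00 where id=2  (id 2 corresponds to oid 102)
amo_update(tbat, {102: 201.00}, optime=TIME2)
print(tbat)
```

After the call the TBAT holds four TBUNs: the three old ones untouched and the new version of oid 102 at the end.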
Selection after AMO Update
• The data consistency is intact in a TBAT after AMO update.
• Example:
  – Selection after AMO update:
    SELECT balance FROM customer WHERE id=2
  – Two tuples will be retrieved:
    t1=(time1, 102, 200.00)
    t2=(time2, 102, 201.00)
  – Compare the timestamps: time2 > time1, so 201.00 is returned, which is consistent with the last updated value.
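The timestamp comparison can be sketched as a scan that keeps the newest version of the requested oid. The function name `select_latest` and the integer timestamps are assumptions, and the sketch further assumes that no two versions of the same oid share a timestamp.

```python
from typing import Any, List, Optional, Tuple

TBUN = Tuple[int, int, Any]  # (optime, oid, attrv)


def select_latest(tbat: List[TBUN], oid: int) -> Optional[Any]:
    """Return the value of the newest TBUN for `oid`, or None if absent."""
    versions = [t for t in tbat if t[1] == oid]
    if not versions:
        return None
    # The version with the largest optime wins.
    return max(versions, key=lambda t: t[0])[2]


TIME1, TIME2 = 1, 2  # assumed integer timestamps with TIME2 > TIME1

tbat = [
    (TIME1, 101, 100.00),
    (TIME1, 102, 200.00),
    (TIME1, 103, 300.00),
    (TIME2, 102, 201.00),  # appended by the update
]

# SELECT balance FROM customer WHERE id=2  (id 2 corresponds to oid 102)
print(select_latest(tbat, 102))
```

The scan is unindexed, so every TBUN is examined; this is the read-side cost that the write-side saving trades against.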
Preliminary Experiment
• Performed on a Cloudera Distributed Hadoop (CDH) version 5.3 cluster
  – 1 master and 3 slaves
  – Total HDFS capacity = 310GB (block size = 64MB)
  – Interconnection is Gigabit Ethernet
• Data sets: 1GB and 10GB random synthetic data in BAT and TBAT.
• Update queries: from 10% to 30% of the original data.
Preliminary Experiment Results (cont.)
[Figure: 1GB Update Running Time]
Preliminary Experiment Results (cont.)
[Figure: 10GB Update Running Time]
Preliminary Experiment Results (cont.)
[Figure: Overhead Changing over Data Sets]
Resource Usage
Conclusion
• We introduce a new method, AMO update, for write optimization on out-of-core (OOC) column-store databases in Map-Reduce.
• AMO update employs TBAT to improve update performance while guaranteeing data atomicity.
• Preliminary experiment results show a significant improvement in the running speed of AMO update.
Future Works
• Study how the performance of the Map-Reduce selection algorithm on TBAT varies after different percentages of the file are updated.
• Introduce distributed local indexing on each TBAT slip in HDFS to improve the global data retrieval performance.
THANK YOU!
Feng “George” Yu
Computer Science and Information Systems
Youngstown State University, Youngstown, OH
[email protected]