Sougata Bhattacharjee
Caching for Flash-Based Databases
Summer Semester 2013
OUTLINE
• MOTIVATION
• FLASH MEMORY: flash characteristics, flash SSD architecture, flash translation layer
• PAGE REPLACEMENT ALGORITHMS: adaptive replacement policy
• FLASH-AWARE ALGORITHMS: clean-first LRU (CFLRU), clean-first dirty-clustered (CFDC), AD-LRU, CASA
• CONCLUSION
• REFERENCES
Data Explosion
The worldwide data volume is growing at an astonishing speed: in 2007 it was 281 EB; by 2011 it had reached 1,800 EB.
Motivation
Flash Memory
Page Replacement
Algorithm
Flash-Aware Algorithms
Conclusion
Data storage technology: HDDs and DRAM. HDDs suffer from HIGH LATENCY; DRAM comes at a HIGHER PRICE.
Energy consumption: in 2005, the total power used by servers in the USA was 0.6% of the country's total annual electricity consumption.
We need to find a memory technology which may overcome these
limitations.
http://faculty.cse.tamu.edu/ajiang/Server.pdf
BACKGROUND
In 1980, Dr. Fujio Masuoka invented Flash memory.
In 1988, Intel Corporation introduced Flash chips.
In 1995, M-Systems introduced flash-based solid-state drives.
What is flash? Flash memory is an electronic non-volatile semiconductor storage device that can be electrically erased and reprogrammed.
Three operations: Program (write), Erase, and Read.
Two major forms: NAND flash and NOR flash.
NAND is newer and much more popular.
FLASH AND MEMORY HIERARCHY
[Memory hierarchy diagram: Registers → Cache → RAM → HDD, with speed and cost increasing toward the top and size toward the bottom; flash sits between RAM and HDD.]
Flash is faster, has lower latency, and is more reliable than hard disks, but is more expensive.
NAND flash: READ ≈ 50 μs, WRITE ≈ 200 μs, ERASE very slow (about 2 ms).
Why is flash popular?
Benefits over magnetic hard drives
Offers lower access latencies.
Semi-conductor technology, no mechanical parts.
High data transfer rate.
Higher reliability (no moving parts).
Lower power consumption.
Small in size and light in weight. Longer life span.
Benefits over RAM
Lower power consumption.
Lower price.
Flash SSD is widening its range of applications
Embedded devices
Desktop PCs and Laptops
Servers and Supercomputers
USE OF FLASH
http://www.flashmemorysummit.com/English/Collaterals/Proceedings/2011/20110811_S308_Cooke.pdf , Page 2
FLASH OPERATIONS
• Three operations: Read, Write, Erase.
• Reads and writes are done at the granularity of a page (2 KB or 4 KB).
• A flash block is much larger than a disk block: it contains p (typically 32–128) fixed-size flash pages of 512 B – 2 KB.
• Erasures are done at the granularity of a block; a block endures only 10,000–100,000 erasures.
• Block erase is the slowest operation, requiring about 2 ms.
• In-place update of flash pages is not possible; only overwrite of an entire block, which must be erased first.
[Diagram: blocks 1 … n, each consisting of data pages.]
FLASH OPERATIONS
• In-place update of flash pages is not possible; only overwrite of an entire block, which must be erased first.
• Modified DB pages are therefore written to free pages (in a new block), and the old page versions are invalidated.
• A full block whose pages are no longer valid can be erased as a whole and becomes free again.
[Diagram: steps of writing modified DB pages — updates go to new pages in a new block; the full old block is erased.]
FLASH CONSTRAINTS
• Write/Erase granularity asymmetry (Cons1)
• Erase-before-write rule (Cons2)
• Limited cell lifetime (Cons3)
Cons1 + Cons2 lead to out-of-place updates with page invalidation, a logical-to-physical mapping, and garbage collection.
Cons3 additionally requires wear leveling.
FLASH MEMORY STRUCTURE
[Diagram: File System → FTL (mapping, garbage collection, wear leveling, other) → Flash Device.]
Various operations need to be carried out to ensure correct operation of a flash device: mapping, garbage collection, and wear leveling.
The Flash Translation Layer (FTL) controls flash management:
• Hides the complexities of device management (garbage collection and wear leveling) from the application.
• Enables mobility – flash becomes plug and play.
MAPPING TECHNIQUES (1/2)
3 types of basic mapping: Page-Level Mapping, Block-Level Mapping, Hybrid Mapping.
Page-Level Mapping
• Each page is mapped independently.
• Mapping table (LPN → PPN): 0→8, 1→4, 2→3, 3→11, 4→9, 5→3, 6→7, 7→0, 8→1, 9→2, 10→6, 11→5; the lookup @7 returns PPN 0.
• Highest performance potential.
• Highest resource use: large mapping table.
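To make the lookup concrete, here is a minimal sketch of a page-level mapping table. The entry 7 → 0 reproduces the "@7" lookup above; the remaining entries are illustrative only.

```python
# Illustrative page-level mapping table (LPN -> PPN).
# Only the entry 7 -> 0 reproduces the "@7" lookup above; the rest is made up.
page_map = {0: 8, 1: 4, 2: 3, 3: 11, 4: 9, 6: 7, 7: 0}

def translate(lpn: int) -> int:
    # O(1) lookup, but the table needs one entry per logical page --
    # this is why page-level mapping has the highest resource use.
    return page_map[lpn]

print(translate(7))  # -> 0
```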
MAPPING TECHNIQUES (2/2)
3 types of basic mapping: Page-Level Mapping, Block-Level Mapping, Hybrid Mapping.
Block-Level Mapping
• Only block numbers are kept in the mapping table (LBN → PBN): 0→3, 1→0, 2→1, 3→2.
• Page offsets remain unchanged: for the lookup @7 with 4 pages per block, LBN = 7 / 4 = 1 and offset = 7 mod 4 = 3.
• Small mapping table.
• Bad performance for write updates.
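The address arithmetic above can be sketched directly, using the slide's table and a block size of 4 pages:

```python
PAGES_PER_BLOCK = 4
# Block-level mapping table from the example above (LBN -> PBN)
block_map = {0: 3, 1: 0, 2: 1, 3: 2}

def translate(lpn: int) -> int:
    # Only the block number is translated; the page offset is unchanged.
    lbn, offset = divmod(lpn, PAGES_PER_BLOCK)   # @7: 7 // 4 = 1, 7 mod 4 = 3
    return block_map[lbn] * PAGES_PER_BLOCK + offset

print(translate(7))  # LBN 1 -> PBN 0, offset 3 -> physical page 3
```

The table stays small (one entry per block instead of one per page), which is exactly the trade-off against update performance named above.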
FTL BLOCK-LEVEL MAPPING (BEST CASE)
Setup: k flash blocks B1 … Bk, g log blocks L1 … Lg, and free blocks F.
Switch: L1 becomes B1, and the old B1 is erased → 1 erase operation.
FTL BLOCK-LEVEL MAPPING (GENERAL CASE)
Setup: k flash blocks B1 … Bk, g log blocks L1 … Lg, and free blocks F.
Merge: B1 and L1 are merged into a free block Fi; afterwards B1 and L1 are erased → 2 erase operations.
In general, a merge of n flash blocks and one log block into Fi costs n + 1 erasures.
GARBAGE COLLECTION
Garbage collection moves the valid pages out of blocks containing invalid data and then erases those blocks: it removes invalid pages and increases the number of free pages.
[Diagram: full blocks with valid and invalid pages are erased and become FREE.]
Wear leveling decides where to write new data: it picks the most frequently erased blocks and the least worn-out blocks and swaps their content to equalize overall usage; this enhances the lifespan of the flash device.
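A minimal sketch of the garbage-collection step described above; the page representation as (state, data) pairs is illustrative:

```python
def garbage_collect(block, free_block):
    """Relocate the valid pages of `block` into `free_block`,
    then erase `block` (erase is a whole-block operation on flash)."""
    for state, data in block:
        if state == 'valid':
            free_block.append(('valid', data))   # copy live data out
    block.clear()                                 # ERASE: block becomes free
    return free_block

victim = [('valid', 'a'), ('invalid', 'b'), ('valid', 'c'), ('invalid', 'd')]
target = garbage_collect(victim, [])
print(target)   # -> [('valid', 'a'), ('valid', 'c')]
print(victim)   # -> []  (erased)
```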
BASICS OF PAGE REPLACEMENT
1. Find the location of the desired page on disk.
2. Find a free frame:
   - If a free frame exists, use it.
   - Otherwise, use a page replacement algorithm to select a victim page.
3. Load the desired page into the (freed) frame.
4. Update the page allocation table (page mapping in the buffer).
5. Upon the next page replacement, repeat the whole process in the same way.
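The steps above can be sketched as follows; `select_victim` is a placeholder for any replacement policy, and `read_from_disk` stands in for the actual I/O:

```python
def fetch_page(page_id, buffer, capacity, select_victim, read_from_disk):
    """Page-fault handling: hit check, free-frame check, victim selection,
    page load, and mapping update (the steps listed above)."""
    if page_id in buffer:
        return buffer[page_id]                    # buffer hit
    if len(buffer) >= capacity:                   # no free frame
        victim = select_victim(buffer)            # replacement policy decides
        del buffer[victim]                        # (a real DB flushes it if dirty)
    buffer[page_id] = read_from_disk(page_id)     # load page, update mapping
    return buffer[page_id]

buf = {}
oldest_first = lambda b: next(iter(b))            # toy FIFO policy for the demo
for p in [1, 2, 3]:
    fetch_page(p, buf, 2, oldest_first, lambda p: f"data{p}")
print(sorted(buf))   # -> [2, 3]  (page 1 was the victim)
```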
Cache is FAST but EXPENSIVE; HDDs are SLOW but CHEAP.
THE REPLACEMENT CACHE PROBLEM
How to manage the cache? Which page should be replaced? How can the hit ratio be maximized?
How does LRU work?
PAGE REPLACEMENT ALGORITHMS (1/2)
Least Recently Used (LRU)
- Removes the least recently used items first.
- Constant time and space complexity; simple to implement.
- Expensive to maintain statistically significant usage statistics.
- Does not exploit "frequency".
- Not scan-resistant.
[Example: buffer of 4 frames initially holding A, B, C, D; reference string C A B D E F D G E at times 1–9. The references to C, A, B, and D are hits; E, F, and G each cause a page fault, evicting the least recently used pages C, A, and B in turn.]
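The trace above can be reproduced with a minimal LRU buffer built on an ordered dictionary:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU buffer: evicts the least recently used page on a fault."""
    def __init__(self, frames):
        self.frames = frames
        self.pages = OrderedDict()   # oldest (LRU) first, newest (MRU) last

    def access(self, page):
        """Returns the evicted page on a fault, else None."""
        if page in self.pages:
            self.pages.move_to_end(page)                 # hit: page becomes MRU
            return None
        victim = None
        if len(self.pages) >= self.frames:
            victim, _ = self.pages.popitem(last=False)   # evict the LRU page
        self.pages[page] = None
        return victim

# The example trace: frames preloaded with A, B, C, D, then C A B D E F D G E
cache = LRUCache(4)
for p in "ABCD":
    cache.access(p)
evictions = [cache.access(p) for p in "CABDEFDGE"]
print([v for v in evictions if v])   # -> ['C', 'A', 'B']
```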
PAGE REPLACEMENT ALGORITHMS (2/2)
Least Frequently Used (LFU)
- Removes the least frequently used items first.
- Scan-resistant.
- Logarithmic time complexity (per request).
- Stale pages can remain in the buffer for a long time.
LRU + LFU = LRFU (Least Recently/Frequently Used)
- Exploits both recency and frequency.
- Better performance than LRU and LFU.
- Logarithmic time complexity; space and time overhead.
Adaptive Replacement Cache (ARC) is a solution.
ARC (ADAPTIVE REPLACEMENT CACHE) CONCEPT
General double-cache structure (cache size is 2C):
• The cache is partitioned into two lists, L1 and L2, each managed from MRU to LRU end.
• L1 contains recently seen pages: the recency list.
• L2 contains pages seen at least twice recently: the frequency list.
• If L1 contains exactly C pages, replace the LRU page of L1; otherwise, replace the LRU page of L2.
ARC CONCEPT
ARC structure (cache size is C):
• Divide L1 into T1 (MRU end) and B1 (LRU end); divide L2 into T2 (MRU end) and B2 (LRU end).
• The combined size of T1 and T2 is C; the combined size of T1 and B1 is C, and the same holds for T2 and B2 (total directory size 2C).
• Upon a page request: if the page is found in T1 or T2, move it to the MRU position of T2.
• On a cache miss, the new page is added at the MRU position of T1; if T1 is full, the LRU page of T1 is moved to the MRU position of B1.
ARC PAGE EVICTION RULE
ARC structure (cache size is C):
• ARC adapts the parameter P according to the observed workload; P determines the target size of T1.
• If the requested page is found in B1, P is increased and the page is moved to the MRU position of T2.
• If the requested page is found in B2, P is decreased and the page is moved to the MRU position of T2.
HOW DOES ARC WORK? (1/2)
[Example: reference string A B C A D E E F G D at times 1–10, shown as list states B1 | T1 | T2 | B2 (recency side and frequency side, each of size C). Pages enter T1 on their first reference and move to T2 when re-referenced: after time 10, the re-referenced pages A, E, and D sit on the frequency side.]
HOW DOES ARC WORK? (2/2)
[Example continued with reference string H I J G K H L D at times 11–18. A hit in the history list B1 increases the target size of T1 and shrinks B1 (self-tuning); page B eventually drops off the list entirely; one-time references pass through T1 without displacing T2 (scan-resistance); a hit in B2 increases the target size of T2 and shrinks B2 (self-tuning).]
ARC ADVANTAGE
ARC is scan-resistant.
ARC is self-tuning and empirically universal.
Stale pages do not remain in memory; better than LFU.
ARC consumes about 10–15% more time than LRU, but its hit ratio is almost twice that of LRU.
Low space overhead for the 'B' lists.
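The behavior described above can be condensed into a compact sketch of ARC (after Megiddo & Modha); T1/T2 hold cached pages, B1/B2 keep only page history, and p is the adaptive target size of T1. This is a simplified illustration, not a production implementation:

```python
from collections import OrderedDict

class ARC:
    def __init__(self, c):
        self.c, self.p = c, 0                            # cache size, target of T1
        self.t1, self.t2 = OrderedDict(), OrderedDict()  # cached pages
        self.b1, self.b2 = OrderedDict(), OrderedDict()  # history ("ghost") lists

    def _replace(self, page):
        # Evict from T1 if it exceeds its target p, otherwise from T2;
        # the victim's id is remembered in the matching history list.
        if self.t1 and (len(self.t1) > self.p or
                        (page in self.b2 and len(self.t1) == self.p)):
            victim, _ = self.t1.popitem(last=False)
            self.b1[victim] = None
        else:
            victim, _ = self.t2.popitem(last=False)
            self.b2[victim] = None

    def request(self, page):
        """Returns True on a cache hit."""
        if page in self.t1 or page in self.t2:           # hit: move to MRU of T2
            self.t1.pop(page, None); self.t2.pop(page, None)
            self.t2[page] = None
            return True
        if page in self.b1:                              # history hit: favor recency
            self.p = min(self.c, self.p + max(len(self.b2) // len(self.b1), 1))
            self._replace(page); del self.b1[page]
            self.t2[page] = None
            return False
        if page in self.b2:                              # history hit: favor frequency
            self.p = max(0, self.p - max(len(self.b1) // len(self.b2), 1))
            self._replace(page); del self.b2[page]
            self.t2[page] = None
            return False
        # complete miss: make room, then insert at MRU of T1
        if len(self.t1) + len(self.b1) == self.c:
            if len(self.t1) < self.c:
                self.b1.popitem(last=False); self._replace(page)
            else:
                self.t1.popitem(last=False)
        else:
            total = len(self.t1) + len(self.b1) + len(self.t2) + len(self.b2)
            if total >= self.c:
                if total == 2 * self.c:
                    self.b2.popitem(last=False)
                self._replace(page)
        self.t1[page] = None
        return False

cache = ARC(2)
print([cache.request(p) for p in [1, 2, 1, 3, 1]])
# -> [False, False, True, False, True]: both re-references of page 1 hit,
#    because page 1 was promoted to the frequency side T2.
```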
FLASH-AWARE BUFFER TECHNIQUES
Minimize the number of physical write operations: the cost of a page write is much higher than that of a page read.
The buffer manager decides HOW and WHEN to write.
• CFLRU (Clean-First LRU)
• LRUWSR (LRU Write Sequence Reordering)
• CCFLRU (Cold-Clean-First LRU)
• AD-LRU (Adaptive Double LRU)
Techniques that read/write entire flash blocks (addressing the FRW problem):
• FAB (Flash-Aware Buffer)
• REF (Recently-Evicted-First)
CLEAN-FIRST LRU ALGORITHM (1/3)
One of the earliest proposals of flash-aware buffer techniques; CFLRU is based on the LRU replacement policy.
The LRU list is divided into two regions:
• Working region: recently accessed pages (MRU end).
• Clean-first region: pages for eviction (LRU end), with window size W.
[Example: pages P1 (MRU) … P8 (LRU); the clean-first region is the window W = 4 covering P5–P8; pages are marked clean or dirty.]
CFLRU always selects clean pages to evict from the clean-first region first, to save flash write costs. If there is no clean page in this region, a dirty page at the end of the LRU list is evicted.
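The eviction rule can be sketched in a few lines; the clean/dirty assignment below is assumed so that P7 is the first victim, matching the eviction example:

```python
def cflru_victim(lru_list, dirty, w):
    """CFLRU victim selection. lru_list: page ids ordered MRU -> LRU;
    dirty: set of dirty page ids; w: window size of the clean-first region."""
    clean_first = lru_list[-w:]              # the w pages at the LRU end
    for page in reversed(clean_first):       # search from the LRU end
        if page not in dirty:
            return page                      # evict a clean page: no flash write
    return lru_list[-1]                      # none clean: evict the LRU dirty page

pages = ["P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8"]   # P1 = MRU
print(cflru_victim(pages, dirty={"P6", "P8"}, w=4))        # -> P7
```

Note the cost of this search: on every buffer fault, up to w pages must be scanned, which is one of CFLRU's drawbacks discussed below.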
CLEAN-FIRST LRU ALGORITHM (2/3)
P1 P2 P3 P4 P5 P6 P7 P8Working Region
Clean-First Region LR
U
MRU
CFLRU always selects clean pages to evict from the clean- first region first to save flash write costs. If there is no clean page in this region, a dirty page at the end of the LRU list is evicted.
CleanDirtyEvicted Pages :
P7
P5
P8
P6
CLEAN-FIRST LRU ALGORITHM (3/3)
Disadvantages:
• CFLRU has to search a long list in case of a buffer fault.
• Keeping dirty pages in the clean-first region can strain memory resources.
• The window size W of the clean-first region must be determined.
CFDC (Clean-First, Dirty-Clustered) addresses these issues.
CFDC (CLEAN-FIRST, DIRTY-CLUSTERED) ALGORITHM
CFDC implements a two-region scheme; the buffer is divided into two regions:
1. Working region: keeps hot pages.
2. Priority region: assigns priorities to pages.
The clean-first region of CFLRU is divided into two queues — a clean queue and a dirty queue — separating clean and dirty pages. Dirty pages are grouped into clusters according to spatial locality, and the clusters are ordered by priority.
Clean pages are always chosen first as victim pages. Otherwise, a dirty page is evicted from the LRU end of the cluster with the lowest priority.
[Example: clean queue holding pages 54, 1, 45, 33, 44; dirty queue holding the clusters formed from pages 39, 69, 48, 7, 11, 20, 6, 4, 13, 8, 15, 27, 28, 29; the victim is taken from the clean queue first.]
CFDC ALGORITHM – PRIORITY FUNCTION
For a cluster c with n pages, its priority P(c) is computed according to Formula 1:

P(c) = IPD(c) / (n² · (globaltime − timestamp(c)))

where IPD(c) (inter-page distance) = Σ |p(i+1) − p(i)| for i = 0 … n−2, and p0, …, pn−1 are the page numbers ordered by their time of entering the cluster; IPD is defined as 1 for a single-page cluster.

Example (globaltime = 10):
• Cluster {8, 13, 20}, timestamp 4 → P = 12 / (9 · 6) = 2/9
• Cluster {15}, timestamp 2 → P = 1 / (1 · 8) = 1/8
• Cluster {4, 6}, timestamp 3 → P = 2 / (4 · 7) = 1/14
• Cluster {29, 28, 27}, timestamp 6 → P = 2 / (9 · 4) = 1/18 → lowest priority: victim cluster

Large, sequentially clustered sets of dirty pages thus get low priority and are evicted first — flushing them is cheap on flash.
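The priority computation can be checked directly with exact fractions; the clustering of the example's pages into {8, 13, 20}, {15}, {4, 6}, and {29, 28, 27} is an assumption that reproduces the slide's priorities 2/9, 1/8, 1/14, and 1/18:

```python
from fractions import Fraction

def cluster_priority(pages, timestamp, globaltime):
    """CFDC cluster priority (Formula 1): the lowest-priority cluster
    is the victim. `pages` are ordered by their time of entering the cluster."""
    n = len(pages)
    # inter-page distance; a single-page cluster gets IPD 1
    ipd = sum(abs(pages[i + 1] - pages[i]) for i in range(n - 1)) or 1
    return Fraction(ipd, n * n * (globaltime - timestamp))

# Clusters from the example (globaltime = 10):
print(cluster_priority([8, 13, 20], 4, 10))   # -> 2/9
print(cluster_priority([15], 2, 10))          # -> 1/8
print(cluster_priority([4, 6], 3, 10))        # -> 1/14
print(cluster_priority([29, 28, 27], 6, 10))  # -> 1/18 : victim cluster
```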
CFDC ALGORITHM – EXPERIMENTS
• Cost of page flushes: clustered writes are efficient (CFDC improves over CFLRU by 41%; CFLRU over LRU by 6%).
• Number of page flushes: CFDC's write count is close to CFLRU's.
• Influence of increasing update ratios: CFDC is on par with LRU for update-intensive workloads.
CFDC ALGORITHM – CONCLUSION
• Reduces the number of physical writes.
• Improves the efficiency of page flushing.
• Keeps a high hit ratio.
The size of the priority window remains a concern for CFDC.
CASA: dynamically adjusts the buffer list sizes.
AD-LRU (ADAPTIVE DOUBLE LRU) ALGORITHM
AD-LRU integrates the properties of recency, frequency, and cleanness into the buffer replacement policy.
The buffer is split into two LRU queues:
• Cold LRU: keeps pages referenced once.
• Hot LRU: keeps pages referenced at least twice (frequency).
An FC (first-clean) pointer in each queue indicates the victim page; min_lc bounds the minimum size of the cold queue.
• If a page miss occurs, the size of the cold queue is increased.
• If the buffer is full, cold clean pages are evicted from the cold LRU queue.
• If no cold clean page is found, cold dirty pages are evicted using a second-chance algorithm.
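The cold-queue victim search can be sketched as follows; the queue contents and flag sets in the demo are chosen to match the eviction example (victim 6 if clean pages exist, otherwise 4):

```python
def adlru_victim(cold_queue, dirty, referenced):
    """AD-LRU victim search over the cold queue (ordered MRU -> LRU).
    dirty: set of dirty pages; referenced: pages whose second-chance
    reference bit is set."""
    for page in reversed(cold_queue):       # 1) least-recently-used clean page
        if page not in dirty:
            return page
    while True:                             # 2) second chance over dirty pages
        page = cold_queue[-1]
        if page in referenced:
            referenced.discard(page)              # clear the bit...
            cold_queue.insert(0, cold_queue.pop())  # ...and move to the MRU end
        else:
            return page

cold = [8, 9, 5, 6, 4]                                        # MRU -> LRU
print(adlru_victim(cold, dirty={4, 8, 9}, referenced=set()))  # -> 6 (clean cold)
print(adlru_victim(cold, dirty={4, 5, 6, 8, 9}, referenced=set()))  # -> 4 (dirty cold)
```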
AD-LRU ALGORITHM EVICTION POLICY
Example: buffer size 9 pages.
Hot queue: 3 (dirty), 7 (dirty), 2 (clean), 1 (dirty).
Cold queue: 4 (dirty), 6 (clean), 5 (clean), 9 (dirty), 8 (dirty).
When the new page 10 (dirty, cold) arrives, AD-LRU evicts the clean cold page 6 as victim. If no clean cold page were found, a dirty cold page (here 4) would be chosen as victim using the second-chance algorithm.
AD-LRU ALGORITHM EXPERIMENTS
Write count vs. buffer size for various workload patterns (random, read-most, write-most, Zipf): AD-LRU has the lowest write count.
AD-LRU ALGORITHM - CONCLUSION
AD-LRU considers reference frequency, an important property of reference patterns that is more or less ignored by CFLRU, and it frees the buffer from cold pages as soon as appropriate.
AD-LRU is self-tuning.
AD-LRU is scan-resistant.
CASA (COST-AWARE SELF-ADAPTIVE) ALGORITHM
CASA makes a trade-off between physical reads and physical writes and adapts automatically to varying workloads.
• The buffer pool is divided into two dynamic lists: a clean list Lc and a dirty list Ld, with b = |Lc| + |Ld|.
• Both lists are ordered by reference recency.
• CASA continuously adjusts a parameter τ, 0 ≤ τ ≤ b: τ is the dynamic target size of Lc, so the target size of Ld is b − τ.
• In case of a buffer fault, τ decides from which list the victim page will be chosen.
CASA ALGORITHM
CASA considers both read and write costs, as well as the status (read/write) of each requested page:
• Case 1: a logical read request hits Lc → τ is increased.
• Case 2: a logical write request hits Ld → τ is decreased.
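The two adaptation cases and the victim decision can be sketched as follows; the step size is simplified to 1 here, whereas the actual cost-aware policy derives the step from the device's read/write cost ratio:

```python
def adjust_tau(tau, b, hit_list, op):
    """Adapt the target size tau of the clean list Lc (0 <= tau <= b).
    Case 1: logical read hit in Lc  -> tau increased.
    Case 2: logical write hit in Ld -> tau decreased.
    Step size simplified to 1 for this sketch."""
    if hit_list == 'Lc' and op == 'read':
        return min(b, tau + 1)
    if hit_list == 'Ld' and op == 'write':
        return max(0, tau - 1)
    return tau

def victim_list(len_lc, tau):
    # On a buffer fault, tau decides which list yields the victim:
    # shrink Lc if it exceeds its target size, else shrink Ld.
    return 'Lc' if len_lc > tau else 'Ld'

b, tau = 13, 6
tau = adjust_tau(tau, b, 'Lc', 'read')    # Case 1, as in the example below
print(tau)                                 # -> 7
tau = adjust_tau(tau, b, 'Ld', 'write')   # Case 2
print(tau)                                 # -> 6
```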
CASA ALGORITHM – EXAMPLE (1/2)
Total buffer size b = 13, τ = 6: target size of Lc = 6, Ld = 7.
Lc: 24, 13, 19, 16, 21, 33 — Ld: 11, 22, 34, 4, 5, 7, 8.
Incoming page 14 (read) hits Lc (Case 1), so τ is increased to 7: target size of Lc = 7, Ld = 6; page 8 is evicted from the LRU end of Ld.
Lc: 24, 13, 19, 16, 21, 33, 14 — Ld: 11, 22, 34, 4, 5, 7.
CASA ALGORITHM – EXAMPLE (2/2)
Incoming page 15 (write) hits Ld (Case 2), so τ is decreased from 7 to 6: target size of Lc = 6, Ld = 7; page 24 is evicted from the LRU end of Lc.
Before — Lc: 24, 13, 19, 16, 21, 33, 14; Ld: 11, 22, 34, 4, 5, 7.
After — Lc: 13, 19, 16, 21, 33, 14; Ld: 15, 11, 22, 34, 4, 5, 7.
CASA ALGORITHM - CONCLUSION
CASA is implemented for two-tier storage systems based on homogeneous storage devices with asymmetric R/W costs. CASA can detect cost ratio dynamically.
CASA is self-tuning. It adapts itself to varying cost ratios and workloads
CONCLUSION
Flash memory is a widely used, reliable, and flexible non-volatile memory to store software code and data in a microcontroller.
However, the performance behavior of flash devices remains unpredictable due to the complexity of FTL implementations and their proprietary nature; to gain more efficient performance, we need to implement a flash device simulator.
We addressed issues of buffer management for two-tier storage systems (caching for a flash DB); ARC and CASA are two of the better approaches.
Phase-change memory (PCM) is a promising next-generation memory technology, which can be used for database storage systems.
REFERENCES
1. Yi Ou: Caching for flash-based databases and flash-based caching for databases, Ph.D. Thesis, University of Kaiserslautern, Verlag Dr. Hut, Online August 2012
2. Nimrod Megiddo, Dharmendra S. Modha: ARC: A Self-Tuning, Low Overhead Replacement Cache. FAST 2003: (115-130)
3. Nimrod Megiddo, Dharmendra S. Modha: Outperforming LRU with an Adaptive Replacement Cache Algorithm. IEEE Computer 37(4): 58-65 (2004)
4. Yi Ou, Theo Härder: Clean first or dirty first?: a cost-aware self-adaptive buffer replacement policy. IDEAS 2010: 7-14
5. Seon-Yeong Park, Dawoon Jung, Jeong-Uk Kang, Jinsoo Kim, Joonwon Lee: CFLRU: a replacement algorithm for flash memory. CASES 2006: 234-241
6. Yi Ou, Theo Härder, Peiquan Jin: CFDC: a flash-aware replacement policy for database buffer management. DaMoN 2009: 15-20
7. Peiquan Jin, Yi Ou, Theo Härder, Zhi Li: AD-LRU: An efficient buffer replacement algorithm for flash-based databases. Data Knowl. Eng. 72: 83-102 (2012)
8. Suman Nath, Aman Kansal: FlashDB: dynamic self-tuning database for NAND flash. IPSN 2007: 410-419
9. Kyoungmoon Sun, Seungjae Baek, Jongmoo Choi, Donghee Lee, Sam H. Noh, Sang Lyul Min: LTFTL: lightweight time-shift flash translation layer for flash memory based embedded storage. EMSOFT 2008: 51-58
10. Nimrod Megiddo, Dharmendra S. Modha: System and method for implementing an adaptive replacement cache policy, US 6996676 B2, 2006
11. Wikipedia: Flash memory
12. Wikipedia: Page replacement algorithm
13. N. Megiddo, D. S. Modha: Adaptive Replacement Cache, IBM Almaden Research Center, April 2003
14. Yang Hu, Hong Jiang, Dan Feng, Lei Tian, Shu Ping Zhang, Jingning Liu, Wei Tong, Yi Qin, Liuzheng Wang: Achieving page-mapping FTL performance at block-mapping FTL cost by hiding address translation. MSST 2010: 1-12
THANK YOU