
From Embedded World to High Performance Computing using STT-MRAM


Lionel Torres, Sophiane Senni

Paris, France May 29, 2017

30-May-17 Workshop NVRAM


OUTLINE


1. Motivation

2. Spintronics

   2.1. Basics

   2.2. STT-MRAM technology

3. STT-MRAM exploration at system level

   3.1. Embedded systems & High Performance Computing

4. Conclusions and Future Work


Motivation


• CMOS scaling issues are observed:

– Heat dissipation

– Performance saturation

• Due to:

– High leakage current

– High power density

• Thermal constraints force parts of the system to be turned off

• When the memory is turned off, the execution state is lost

[Figure: "Current system-on-chip" (CPU, GPU, eFPGA, cache and on-chip SRAM on a high-performance bus, with a DDR controller to external DRAM and a Flash controller to external Flash) compared with a "Non-volatile system-on-chip" (non-volatile CPU, GPU and non-volatile FPGA, NV cache and embedded STT-MRAM on a high-performance bus, with DDR and memory controllers to external STT-MRAM). Caption: Need to go beyond CMOS.]


Spintronics


• Electron properties: mass, electric charge, spin

• Electronics: electrons are moved (current) by acting on their charge

• Spintronics: motion by acting on the spin!

• Phenomena related to spin:

– Magnetoresistance

– Spin-transfer torque


Spintronics


• Anisotropic magnetoresistance (William Thomson, 1824–1907)

– The electrical resistance of a magnetic metal varies in the presence of an external magnetic field

– Resistance variation of 2%–5% at room temperature

• Giant magnetoresistance (Peter Grünberg and Albert Fert, 2007 Nobel Prize in Physics)

– Large increase in conductance in structures alternating ferromagnetic and non-magnetic layers, e.g. (Fe/Cr)n

• Tunnel magnetoresistance (T. Miyazaki, J. Moodera; M. Jullière, not pictured)

– CoFe/Al2O3/Co (J. S. Moodera, 1995)

– CoFeB/MgO/CoFeB (S. Ikeda, 2008)

– Unlike GMR, the barrier is an insulator

– With MgO, a TMR of 608% was reached at room temperature


Spintronics


Tunnel magnetoresistance principle

• The transport of electrons through the ferromagnetic / insulator / ferromagnetic stack is spin-dependent (spin-up vs. spin-down electrons)

• Parallel configuration: low resistance RMIN, logic '0'

• Antiparallel configuration: high resistance RMAX, logic '1'
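For reference, the TMR ratio is conventionally expressed relative to the low-resistance (parallel) state, with RMIN and RMAX as defined above. Under that definition, the 608% quoted for MgO barriers means the antiparallel resistance is roughly seven times the parallel resistance:

$$\mathrm{TMR} = \frac{R_{MAX} - R_{MIN}}{R_{MIN}} \quad\Rightarrow\quad R_{MAX} = (1 + \mathrm{TMR})\,R_{MIN} = (1 + 6.08)\,R_{MIN} \approx 7.1\,R_{MIN}$$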


Spintronics


Applications

• GMR read head [Figure: read head with coil and ferromagnet]


STT-MRAM technology


• STT-MRAM can be used to build:

– Flip-Flops

– Cache memories

– Main memories

Bit cell structure

[Figure: bit cell with an access transistor, a reference ("Ref"), and a magnetic tunnel junction made of a storage layer, a tunnel oxide and a reference layer, traversed by the sensing/writing current (- / +). Parallel state: RMIN, '0'; antiparallel state: RMAX, '1'.]

Examples: NVFF (non-volatile flip-flop) STT-MRAM [1]; 4Gb LPDDR2 STT-MRAM [2]

[1] B. Jovanovic et al., "A hybrid magnetic/complementary metal oxide semiconductor three-context memory bit cell for non-volatile circuit design," AIP Journal of Applied Physics, April 2014.

[2] K. Rho et al., "A 4Gb LPDDR2 STT-MRAM with compact 9F2 1T1MTJ cell and hierarchical bitline architecture," IEEE International Solid-State Circuits Conference (ISSCC), February 2017.
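A minimal behavioural sketch of the 1T-1MTJ bit cell helps make the read and write paths concrete. It is an illustration under stated assumptions, not a device model. The resistances, the read current, the critical switching current `I_CRIT` and the polarity convention are all hypothetical values chosen for readability, and the access transistor is treated as an ideal switch. A read compares the cell against a reference halfway between RMIN and RMAX. A write flips the storage layer only when the current exceeds the critical current, which is why the access transistor must be sized to deliver that current.

```python
from dataclasses import dataclass

# All electrical values are illustrative assumptions, not device data.
R_P = 2_000.0             # ohms, parallel state (RMIN) -> stores 0
R_AP = 4_000.0            # ohms, antiparallel state (RMAX) -> stores 1
R_REF = (R_P + R_AP) / 2  # assumed reference resistance between the two states
I_READ = 20e-6            # A, small read current assumed not to disturb the cell
I_CRIT = 100e-6           # A, assumed critical current for an STT write


@dataclass
class MTJCell:
    bit: int = 0  # 0 = parallel, 1 = antiparallel

    def resistance(self) -> float:
        return R_AP if self.bit else R_P


def read(cell: MTJCell) -> int:
    """Sense the cell by comparing its voltage under I_READ with the reference."""
    v_cell = I_READ * cell.resistance()
    v_ref = I_READ * R_REF
    return 1 if v_cell > v_ref else 0


def write(cell: MTJCell, current: float) -> None:
    """Spin-transfer-torque write: the current polarity selects the state.

    Assumed sign convention: positive -> antiparallel (1), negative -> parallel (0).
    Below the critical current the storage layer does not switch.
    """
    if abs(current) < I_CRIT:
        return
    cell.bit = 1 if current > 0 else 0
```

For example, a 150 µA write sets a fresh cell to 1. A subsequent -50 µA pulse is below `I_CRIT`, so it leaves the cell unchanged.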


• The main objectives are to:

– Evaluate the impact of using STT-MRAM at system level

– Explore new applications

• Non-volatile working memories (registers, cache…)

• In-memory computing

• This talk focuses on:

– A non-volatile processor for embedded applications

– An STT-MRAM exploration framework for High Performance Computing

STT-MRAM exploration

[Figure: non-volatile system-on-chip with a non-volatile CPU, NV cache and embedded STT-MRAM on a high-performance bus, and a memory controller to external STT-MRAM]


Non-volatile processor based on STT-MRAM


Non-volatile processor based on STT-MRAM

• Two applications under study:

– Normally-off computing

• The system is normally off

• The execution state is preserved after a shutdown

• Fast wake-up, near-zero leakage power in sleep mode

– Checkpoint/Rollback

• Restore a safe state of the processor, for instance after an execution error or a power failure

• Two 32-bit RISC processors are considered:

– Secretblaze (MIPS-like)

– Amber (ARM-like)


Non-volatile processor based on STT-MRAM

[Figure: non-volatile processor architecture. Pipeline stages fetch, decode, execute (ALU), memory and write back, with the register file and pipeline registers implemented as an NV register file and NV registers; instruction and data caches, each with an address decoder and a memory bus interface on the instruction and data buses; an STT-MRAM main memory and an STT-MRAM checkpoint memory.]

• Hybrid CMOS/STT-MRAM flip-flop

– Speed of CMOS

– Non-volatility of STT-MRAM

• STT-MRAM main memory

– Data are preserved after a shutdown

• Checkpoint memory for the rollback

– Stores a valid state of the system, making it tolerant to execution errors and power failures


Non-volatile processor based on STT-MRAM

Normally-off computing

1. Back up the registers' state

2. POWER DOWN: the MRAM-based main memory preserves its data

3. POWER UP: the data in the MRAM-based main memory are available

4. Restore the registers' state
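The normally-off sequence above (back up, power down, power up, restore) can be sketched as a toy model. The class and method names are hypothetical. The model assumes a volatile CMOS register file paired with a non-volatile STT-MRAM shadow copy, which stands in for the hybrid flip-flops. It represents the lost CMOS state as zeros.

```python
from dataclasses import dataclass, field


@dataclass
class NormallyOffCore:
    """Toy model (hypothetical names) of a core with NV register backup."""
    regs: list[int] = field(default_factory=lambda: [0] * 32)     # volatile CMOS registers
    nv_regs: list[int] = field(default_factory=lambda: [0] * 32)  # STT-MRAM shadow copy
    main_memory: dict[int, int] = field(default_factory=dict)     # MRAM: survives power-off
    powered: bool = True

    def power_down(self) -> None:
        self.nv_regs = list(self.regs)    # 1. back up the registers' state
        self.regs = [0] * len(self.regs)  # 2. CMOS state is lost (modelled as zeros);
        self.powered = False              #    main_memory is left untouched

    def power_up(self) -> None:
        self.powered = True               # 3. MRAM main memory data are available
        self.regs = list(self.nv_regs)    # 4. restore the registers' state
```

Suppose register 3 holds 42 before `power_down()`. After `power_up()` it holds 42 again, and the contents of `main_memory` are unchanged.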


Non-volatile processor based on STT-MRAM


• Conventional system

– Leakage power during sleep mode

• Non-volatile system with instant-on/off

– Near-zero leakage during sleep mode

– Backup energy

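[Figure: conventional system vs. non-volatile system]

What is the minimum sleep time Tsleep for the non-volatile system to be more energy efficient? The non-volatile system wins when the leakage energy avoided during sleep exceeds the cost of the backup:

$$T_{sleep}\,P_{leakage} > E_{backup} + T_{backup}\,P_{leakage} \quad\Longleftrightarrow\quad T_{sleep} > \frac{E_{backup} + T_{backup}\,P_{leakage}}{P_{leakage}}$$

• Amber processor (synthesis in an industrial 40nm CMOS low-power process): Pleakage = 973 µW, Ebackup = 1 nJ, Tbackup = 20 ns, giving Tsleep > 1.05 µs

• Secretblaze processor (synthesis in an industrial 40nm CMOS low-power process): Pleakage = 775 µW, Ebackup = 1 nJ, Tbackup = 20 ns, giving Tsleep > 1.32 µs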

* D. Chabi et al., "Ultra low power magnetic flip-flop based on checkpointing/power gating and self-enable mechanisms," IEEE Transactions on Circuits and Systems I, January 2014.
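The break-even condition reduces to Tsleep > Tbackup + Ebackup / Pleakage. The short sketch below evaluates it. The function name is illustrative, and the inputs are the rounded values from the slide. It gives about 1.05 µs for Amber, which matches the slide. For Secretblaze it gives about 1.31 µs, slightly below the reported 1.32 µs, presumably because the slide's inputs are rounded.

```python
def min_sleep_time(p_leakage_w: float, e_backup_j: float, t_backup_s: float) -> float:
    """Smallest T_sleep for which backup + power-off beats a leaky sleep mode.

    Derived from T_sleep * P_leakage > E_backup + T_backup * P_leakage.
    """
    return t_backup_s + e_backup_j / p_leakage_w


# Rounded inputs from the slide (industrial 40nm CMOS low-power process).
amber = min_sleep_time(p_leakage_w=973e-6, e_backup_j=1e-9, t_backup_s=20e-9)
secretblaze = min_sleep_time(p_leakage_w=775e-6, e_backup_j=1e-9, t_backup_s=20e-9)

print(f"Amber:       {amber * 1e6:.2f} us")        # Amber:       1.05 us
print(f"Secretblaze: {secretblaze * 1e6:.2f} us")  # Secretblaze: 1.31 us
```

The ratio Ebackup / Pleakage dominates. Halving the leakage roughly doubles the break-even time, while the 20 ns backup latency is almost negligible.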


Non-volatile processor based on STT-MRAM


Checkpoint/Rollback

• NORMAL EXECUTION (main memory ON, checkpoint memory OFF)

– Only the main memory contents are modified

– The checkpoint memory is turned off

• CHECKPOINT (backup; both memories ON)

– Back up the registers

– Back up the memory: main memory → checkpoint memory

• ROLLBACK (restore; both memories ON)

1. Stall the processor

2. Restore the checkpoint

3. Resume execution


Non-volatile processor based on STT-MRAM


Checkpoint/Rollback (memory part)

• NORMAL EXECUTION (main memory ON, checkpoint memory OFF)

– Only the main memory contents are modified

– A 128-entry buffer records the addresses of the modified memory locations

• CHECKPOINT (backup; both memories ON)

– Only the modified memory locations are copied

• ROLLBACK (restore; both memories ON)

– Only the modified memory locations are restored
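The incremental checkpoint/rollback scheme can be sketched in software as follows. This is a simplified model that relies on assumptions not taken from the slides. Memories are dictionaries, and the checkpoint memory is assumed to start as an exact copy of main memory. The 128-entry buffer is modelled as a set of addresses. A full buffer is assumed to force a checkpoint, since the slides do not say what happens on overflow. All names are hypothetical.

```python
from dataclasses import dataclass, field

BUFFER_ENTRIES = 128  # size of the modified-address buffer on the slide


@dataclass
class CheckpointedMemory:
    """Sketch of incremental checkpoint/rollback (hypothetical names)."""
    main: dict[int, int] = field(default_factory=dict)
    checkpoint: dict[int, int] = field(default_factory=dict)  # assumed equal to main initially
    dirty: set[int] = field(default_factory=set)              # the address buffer
    regs: list[int] = field(default_factory=lambda: [0] * 32)
    saved_regs: list[int] = field(default_factory=lambda: [0] * 32)

    def write(self, addr: int, value: int) -> None:
        # Normal execution: only main memory changes; the address is logged.
        if addr not in self.dirty and len(self.dirty) == BUFFER_ENTRIES:
            self.take_checkpoint()  # assumed policy when the buffer is full
        self.main[addr] = value
        self.dirty.add(addr)

    def take_checkpoint(self) -> None:
        self.saved_regs = list(self.regs)  # back up the registers
        for addr in self.dirty:            # copy only the modified locations
            self.checkpoint[addr] = self.main[addr]
        self.dirty.clear()

    def rollback(self) -> None:
        # 1. stall (implicit here), 2. restore the checkpoint, 3. resume
        for addr in self.dirty:            # restore only the modified locations
            if addr in self.checkpoint:
                self.main[addr] = self.checkpoint[addr]
            else:
                del self.main[addr]        # location was unwritten at checkpoint time
        self.dirty.clear()
        self.regs = list(self.saved_regs)
```

Consider writing address 0 and taking a checkpoint, then writing addresses 0 and 4 and rolling back. Main memory ends up as `{0: 1}` again, and the rollback touched only the two logged addresses.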


Non-volatile processor based on STT-MRAM


• Validation of the backup/recovery of the system [Figures: Blowfish application, DES application]

• Evaluation of the cost

– Register level (data from a real flip-flop design)

• Backup: ≈1 nJ (<20 ns)

• Restore: <25 pJ (≈1 ns)

– Main memory level (data from NVSim), 1MB main memory / 4kB checkpoint memory

• Backup: <100 nJ (<20 µs)

• Restore: <100 nJ (<20 µs)


High Performance Computing using STT-MRAM


High Performance Computing using STT-MRAM


• A simulation framework has been developed to:

– Explore the impact of STT-MRAM at system level

– Provide essential feedback to guide the development of STT-MRAM devices

– Explore different memory technologies

• A cross-layer investigation is performed:

– Device level: Physical Design Kit

– Circuit level: bit cell

– Memory level: cache, main memory…

– System level: multi-core architectures


• Case study

– Architecture considered

• 4-core out-of-order processor (ARMv7 ISA)

• 32kB L1 instruction cache (SRAM)

• 32kB L1 data cache (SRAM)

• 1MB shared L2 cache, in two scenarios: SRAM or STT-MRAM

• 512MB DDR3 DRAM main memory

– Benchmarks

• PARSEC

• SPLASH-2

[Figure: four cores (Core 0 to Core 3), each with private L1 I/D caches, sharing an L2 cache connected to DDR3 main memory]

High Performance Computing using STT-MRAM
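For readers who want to reproduce the case study, the two configurations can be written down as plain data. The sketch below only describes the parameters listed above. It is not the input format of the simulation framework, which the slides do not show, and the class and field names are hypothetical.

```python
from dataclasses import dataclass, replace

KB = 1024
MB = 1024 * KB


@dataclass(frozen=True)
class Cache:
    size: int            # bytes
    tech: str            # "SRAM" or "STT-MRAM"
    shared: bool = False


@dataclass(frozen=True)
class System:
    """Case-study parameters (hypothetical field names)."""
    cores: int = 4
    isa: str = "ARMv7"
    out_of_order: bool = True
    l1i: Cache = Cache(32 * KB, "SRAM")
    l1d: Cache = Cache(32 * KB, "SRAM")
    l2: Cache = Cache(1 * MB, "SRAM", shared=True)
    dram: str = "DDR3"
    dram_size: int = 512 * MB


# The two scenarios differ only in the technology of the shared L2.
baseline = System()
stt_variant = replace(baseline, l2=replace(baseline.l2, tech="STT-MRAM"))
```

Using frozen dataclasses and `dataclasses.replace` keeps the two scenarios provably identical except for the field being studied.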


• Circuit-level analysis…

– Area

High Performance Computing using STT-MRAM

[Figure: area (mm², log scale) versus cache capacity from 8kB to 4MB, SRAM vs. STT-MRAM]

• STT-MRAM is denser for large cache capacities: the STT-MRAM cell is smaller than the SRAM cell

• For small capacities this advantage is lost, because STT-MRAM needs large transistors for write operations

Area in a 45nm process:

Technology    1MB L2 (mm²)    32kB L1 (mm²)
SRAM          2.7             0.091
STT-MRAM      1.12            0.116
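The two table points quantify the crossover. The small sketch below uses an illustrative dictionary layout to compute the STT-MRAM / SRAM area ratio for each cache. The ratio is about 0.41 for the 1MB L2, roughly 59% smaller, and about 1.27 for the 32kB L1, roughly 27% larger.

```python
# Cache areas in mm² (45nm), copied from the table above.
AREA_MM2 = {
    "1MB L2": {"SRAM": 2.7, "STT-MRAM": 1.12},
    "32kB L1": {"SRAM": 0.091, "STT-MRAM": 0.116},
}


def area_ratio(cache: str) -> float:
    """STT-MRAM area divided by SRAM area (< 1 means STT-MRAM is denser)."""
    areas = AREA_MM2[cache]
    return areas["STT-MRAM"] / areas["SRAM"]


for name in AREA_MM2:
    print(f"{name}: {area_ratio(name):.2f}")  # "1MB L2: 0.41", then "32kB L1: 1.27"
```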


• Circuit-level analysis

– 1MB cache performance

• Based on NVSim

High Performance Computing using STT-MRAM

1MB cache, 45nm process:

Technology   Read latency (ns)   Read energy (nJ)   Write latency (ns)   Write energy (nJ)   Standby leakage (mW)
SRAM         10.6                0.51               10.6                 0.05                630
STT-MRAM     7.6                 0.15               16.7                 0.65                24 / 26

• STT-MRAM < SRAM for reads, thanks to the small area of STT-MRAM

• STT-MRAM > SRAM for writes

• STT-MRAM << SRAM for static power
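A first-order energy model shows why a cache with more expensive writes can still save energy. The model adds the dynamic energy of reads and writes to the leakage energy over the runtime. The sketch below plugs in the NVSim values from the table. The access counts and runtime in the loop are hypothetical, and the model ignores everything except per-access energy and leakage. It also assumes 24 mW for the STT-MRAM leakage entry, which is listed as "24 /26".

```python
# First-order L2 energy model. Per-access energies (nJ) and leakage (mW)
# are the 1MB, 45nm NVSim values; STT-MRAM leakage assumed to be 24 mW.
PARAMS = {
    "SRAM": {"read_nj": 0.51, "write_nj": 0.05, "leak_mw": 630.0},
    "STT-MRAM": {"read_nj": 0.15, "write_nj": 0.65, "leak_mw": 24.0},
}


def l2_energy_mj(tech: str, reads: int, writes: int, runtime_s: float) -> float:
    """Dynamic (reads + writes) plus static (leakage x time) energy, in mJ."""
    p = PARAMS[tech]
    dynamic_mj = (reads * p["read_nj"] + writes * p["write_nj"]) * 1e-6  # nJ -> mJ
    static_mj = p["leak_mw"] * runtime_s                                 # mW x s = mJ
    return dynamic_mj + static_mj


# Hypothetical workload: 1M reads, 200k writes, 0.1 s runtime for both caches.
for tech in PARAMS:
    print(f"{tech}: {l2_energy_mj(tech, 1_000_000, 200_000, 0.1):.2f} mJ")
# SRAM: 63.52 mJ
# STT-MRAM: 2.68 mJ
```

With these made-up counts, leakage dominates the SRAM total, so the STT-MRAM cache's higher write energy barely matters. A write-heavy, short-running workload would narrow the gap.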


• Set of results…

– Runtime: similar performance when using STT-MRAM

[Figure: normalized runtime, SRAM vs. STT-MRAM, for the PARSEC benchmarks and for the SPLASH-2 benchmarks (barnes, fmm, fft, lu1, lu2, ocean1, ocean2, radix, water)]

High Performance Computing using STT-MRAM


• Set of results…

– L2 cache energy: an STT-MRAM-based L2 cache consumes more than 80% less energy than an SRAM-based L2 cache

[Figure: normalized L2 cache energy, SRAM vs. STT-MRAM, for the PARSEC benchmarks and for the SPLASH-2 benchmarks (barnes, fmm, fft, lu1, lu2, ocean1, ocean2, radix, water)]

High Performance Computing using STT-MRAM


• Set of results…

– System energy: evaluate the impact of the memory part compared with the rest of the system

[Figure: system energy (J), broken down into cores, Icache, Dcache, L2, buses and memory controller, SRAM vs. STT-MRAM, for the SPLASH-2 workload (Water) and the PARSEC workload (Canneal)]

High Performance Computing using STT-MRAM
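How much of the L2 saving reaches the whole system depends on the L2's share of total energy. The sketch below shows the arithmetic on a hypothetical breakdown. Only the component names come from the figure legend; the joule values are made up for illustration. Cutting an L2 that accounts for 0.6 J out of 2.5 J by 80% saves about 19% of system energy.

```python
def system_energy_saving(breakdown_j: dict[str, float], component: str, reduction: float) -> float:
    """Fraction of total energy saved when one component's energy drops by `reduction`."""
    total = sum(breakdown_j.values())
    return breakdown_j[component] * reduction / total


# Hypothetical breakdown in joules (illustrative numbers, not measured data).
breakdown = {
    "cores": 1.5, "icache": 0.1, "dcache": 0.1,
    "l2": 0.6, "buses": 0.1, "memory_controller": 0.1,
}
print(f"{system_energy_saving(breakdown, 'l2', 0.8):.1%}")  # 19.2%
```

The same L2 reduction therefore matters less when cores or other components dominate the total.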


• Set of results…

– System energy: impact of the number of cores

[Figure: system energy (J), broken down into cores, Icache, Dcache, L2, buses and memory controller, SRAM vs. STT-MRAM, for the PARSEC workload (Canneal) with 1, 2 and 4 cores]

High Performance Computing using STT-MRAM


Conclusions

• STT-MRAM is promising for:

– Energy-efficient and reliable embedded systems

• Normally-off computing

• Checkpoint/Rollback

– Cache memories for High Performance Computing

• A system-level simulation framework has been developed to support the development of STT-MRAM and other memory technologies


Future Work

• Strengthen the results by designing a real system-on-chip based on STT-MRAM

– Ongoing work (European project GREAT)

• Explore STT-MRAM at main memory level

– Ongoing work

• Extension of the simulation framework

• Explore other memory technologies

– Spin-Orbit-Torque MRAM

– Voltage-Controlled MRAM

