Date post: 01-Mar-2020
Introduction to ACPI Based Memory Hot-Plug

Tang Chen<[email protected]>

Agenda

1. Why we need Memory Hot-Plug
2. ACPI & Memory Hot-Plug
3. Memory hot-add
4. Memory hot-remove
5. Movable node
6. Bootmem handling
7. Future work

Why we need memory Hot-Plug

1. Balance the load.
2. Reduce power consumption.
3. Handle hardware error.

[Figure labels: Load; memory (in use); memory (idle); memory (in error)]

Why we need memory Hot-Plug

4. Guest OS should support memory hotplug.
• VMware supports virtual memory device hotplug for the virtual machine.
• A similar feature is being developed for KVM.

5. ACPI provides sufficient conditions.
• With the help of ACPI, hardware and firmware are now able to support memory hotplug physically.

[Figure labels: memory device virtualization (Guest OS instances on VMware) + Hardware (ACPI registers) and Firmware (ACPI BIOS)]

Agenda

1. Why we need Memory Hot-Plug
2. ACPI & Memory Hot-Plug
3. Memory hot-add
4. Memory hot-remove
5. Movable node
6. Bootmem handling
7. Future work

ACPI & Memory Hot-Plug

• ACPI: Advanced Configuration and Power Interface

ACPI is an interface specification of Operating System-directed motherboard device configuration and Power Management.
-- ACPI Specification 5.0

• Kernel (Software)
– ACPI Driver: OS layer framework. Event handling API.
• ACPI BIOS (Firmware)
– ACPI Tables (static), boot time: Static info used only at boot time. DSDT, SRAT, ...
– Methods (dynamic), run time: Dynamic methods used at run time. _EJ0, _STA, ...
• Hardware
– ACPI Registers: Event driven model. Event registers, control registers, ..., SCI (System Control Interrupt).

ACPI & Memory Hot-Plug

• ACPI and Memory Hot-Plug

[Figure: boot time process and run time process across Kernel (Memory Hot-Plug Subsystem, Memory Device Driver, ACPI Driver), ACPI BIOS (Methods (dynamic), ACPI Tables (static)), and Hardware (ACPI Registers). Boot time arrow labels: Read ACPI Tables; Install event handler. Run time arrow labels: Hot-Plug happens; Generate SCI (System Control Interrupt); Event info; Call event handler; Call device dependent code; Call API; Call ACPI Method; Hardware operation.]

ACPI & Memory Hot-Plug

• Main jobs of Memory Hot-Plug
– Node data
– Direct mapping
– Virtual memory mapping
– Page online & offline
– Page migration
– Event handler

[Figure labels: Kernel (Memory Hot-Plug Subsystem, Memory Device Driver, ACPI Driver); ACPI BIOS (Methods (dynamic), ACPI Tables (static)); Hardware (ACPI Registers)]

ACPI & Memory Hot-Plug

• Things associated with Memory Hot-Plug
1. Memory block to be hot-plugged.
2. User processes' pagetable.
3. Kernel direct mapping pagetable.
4. Virtual memory mapping pages.
5. Virtual memory mapping pagetable.

[Figure labels: physical space (block X with movable pages (used); page_structs of block X; other page_structs); user space (process, 128TB); kernel space (direct mapping (64TB), virtual memory map (1TB), vmalloc/ioremap space, kernel text mapping, module mapping space, holes); callouts 1-5 mark the items above.]

Agenda

1. Why we need Memory Hot-Plug
2. ACPI & Memory Hot-Plug
3. Memory hot-add
4. Memory hot-remove
5. Movable node
6. Bootmem handling
7. Future work

Memory hot-add

• Generic memory definition & sparse memory model

Generic memory definition divides a memory range into several blocks. (128MB per block by default on x64)

Sparse memory model divides a memory range into several sections, within each of which the memory is contiguous. (128MB per section, one section per block by default on x64)

[Figure labels: physical memory range from start to end, divided into blocks; under the sparse memory model each block holds a section, and each section holds pages.]
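The block and section sizes quoted above reduce to simple shift arithmetic. The following is a minimal sketch in Python, not kernel code: the constant and function names are hypothetical, and it assumes the defaults stated on the slide (128MB sections, one section per block).

```python
# Illustrative model only; names are hypothetical, sizes are the slide's defaults.
SECTION_SIZE_BITS = 27                  # 2**27 bytes = 128MB per section
SECTION_SIZE = 1 << SECTION_SIZE_BITS
SECTIONS_PER_BLOCK = 1                  # one section per block by default


def section_nr(phys_addr: int) -> int:
    """Index of the section that contains a physical address."""
    return phys_addr >> SECTION_SIZE_BITS


def block_nr(phys_addr: int) -> int:
    """Index of the memory block that contains a physical address."""
    return section_nr(phys_addr) // SECTIONS_PER_BLOCK


def blocks_in_range(start: int, end: int) -> list:
    """Block indices covering the half-open physical range [start, end)."""
    if end <= start:
        return []
    return list(range(block_nr(start), block_nr(end - 1) + 1))
```

Under these assumptions, a 1GB range starting at 4GB spans eight blocks, which is why a hot-added memory device usually shows up as several blocks rather than one.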

Memory hot-add

• Add memory (1)

Blocks in the memory range are hot-added one by one.

[Figure labels: physical space (block X, pages (invalid); free pages; page_structs); kernel space (direct mapping (64TB), virtual memory map (1TB), vmalloc/ioremap space, kernel text mapping, module mapping space, holes); the regions for block X are marked "unmapped" and "empty".]

Memory hot-add

• Add memory (2)

1. Initialize direct mapping pagetable.
2. Allocate virtual memory mapping pages.
3. Initialize virtual memory mapping pagetable.

[Figure labels: physical space (block X, pages (invalid); page_structs); kernel space (direct mapping (64TB), virtual memory map (1TB), vmalloc/ioremap space, kernel text mapping, module mapping space, holes); the regions created in steps 1-3 are marked "NEW".]

Memory hot-add

• Add memory (3)

[Figure labels: physical space (block X (offline), pages; page_structs; page_structs of block X); kernel space (direct mapping (64TB), virtual memory map (1TB), vmalloc/ioremap space, kernel text mapping, module mapping space, holes).]

The newly added pages are offline and not present.

echo online > /sys/devices/system/memory/memoryX/state
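The echo command can also be issued from a script. The sketch below is a hedged Python equivalent: the helper names are hypothetical, the `root` parameter exists only so the logic can be tried against a scratch directory, and writing to the real sysfs file requires root privileges and a kernel built with memory hotplug support.

```python
from pathlib import Path

# Real location of the per-block state files; tests can pass another root.
SYSFS_MEMORY = Path("/sys/devices/system/memory")


def block_state(block: int, root: Path = SYSFS_MEMORY) -> str:
    """Read the current state string of memory block `block`."""
    return (root / f"memory{block}" / "state").read_text().strip()


def online_block(block: int, root: Path = SYSFS_MEMORY) -> str:
    """Write "online" to the block's state file unless it is already online."""
    state_file = root / f"memory{block}" / "state"
    if state_file.read_text().strip() != "online":
        state_file.write_text("online")   # same request as the echo command
    return state_file.read_text().strip()
```

Reading the state back after the write matters on a real system, because the kernel may refuse the request and leave the block offline.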

Memory hot-add

• Online pages

[Figure labels: physical space (block X (online), pages; page_structs; page_structs of block X); kernel space (direct mapping (64TB), virtual memory map (1TB), vmalloc/ioremap space, kernel text mapping, module mapping space, holes); user space (process 1 ... process n, 128TB each).]

Memory hot-add

• Configuration
– mm/Kconfig

config MEMORY_HOTPLUG
	bool "Allow for memory hot-add"
	depends on SPARSEMEM || X86_64_ACPI_NUMA
	depends on HOTPLUG && ARCH_ENABLE_MEMORY_HOTPLUG
	depends on (IA64 || X86 || PPC_BOOK3S_64 || SUPERH || S390)

Agenda

1. Why we need Memory Hot-Plug
2. ACPI & Memory Hot-Plug
3. Memory hot-add
4. Memory hot-remove
5. Movable node
6. Bootmem handling
7. Future work

Memory hot-remove

• Remove memory (1)

[Figure labels: physical space (block X (online) with movable pages (used); pages (free); page_structs; page_structs of block X); kernel space (direct mapping (64TB), virtual memory map (1TB), vmalloc/ioremap space, kernel text mapping, module mapping space, holes); user space (process 1 ... process n, 128TB each).]

Memory hot-remove

• Remove memory (2)

1. Unmap user pages.
• The kernel generates a page fault for each process that accesses these pages, and the process waits until the migration is over.
2. Allocate new pages.
3. Copy data from old pages to new pages.

[Figure labels: physical space (block X (online) with movable pages (used); pages (used); page_structs; page_structs of block X); user space (process 1 ... process n, 128TB each); callouts 1-3 mark the steps above.]

Memory hot-remove

• Remove memory (3)

4. Update user processes' pagetable.
• Also wake up all the processes waiting for these pages.
5. Isolate old pages.
• Pages are not in the buddy system, and won't be allocated to anyone.
6. Set the block state to offline.

[Figure labels: physical space (block X (offline) with movable pages (isolated); pages (used); page_structs; page_structs of block X); user space (process 1 ... process n, 128TB each) with pagetables marked "updating"; callouts 4-6 mark the steps above.]
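Steps 1 to 6 of the offline path can be walked through in a toy model. This is a sketch under loud assumptions: `Page`, `Process`, and `migrate_and_offline` are hypothetical names, page tables are plain dictionaries, and nothing here mirrors the kernel's actual migration code or its locking.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    pfn: int
    data: bytes = b""
    isolated: bool = False


@dataclass
class Process:
    name: str
    pagetable: dict = field(default_factory=dict)  # virtual address -> Page or None


def migrate_and_offline(block_pages, free_pages, processes):
    """Move every used page out of a block, then report the new block state."""
    for old in block_pages:
        # 1. Unmap user pages: an access would now fault and wait.
        users = [(p, va) for p in processes
                 for va, page in p.pagetable.items() if page is old]
        for proc, va in users:
            proc.pagetable[va] = None
        # 2. Allocate a new page outside the block.
        new = free_pages.pop()
        # 3. Copy data from the old page to the new page.
        new.data = old.data
        # 4. Update user processes' pagetables (and wake the waiters).
        for proc, va in users:
            proc.pagetable[va] = new
        # 5. Isolate the old page so it is never allocated again.
        old.isolated = True
    # 6. Set the block state to offline.
    return "offline"
```

The model makes one point visible: user mappings are the only references that have to be rewritten, which is why only movable (user) pages can be handled this way.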

Memory hot-remove

• Remove memory (4)

6. Free kernel direct mapping pagetable.
7. Free virtual memory mapping pages.
8. Free virtual memory mapping pagetable.

[Figure labels: physical space (block marked "removing"; pages (used); page_structs); kernel space (direct mapping (64TB), virtual memory map (1TB), vmalloc/ioremap space, kernel text mapping, module mapping space, holes); the regions released in steps 6-8 are marked "freeing".]

Memory hot-remove

• Remove memory (5)

[Figure labels: physical space (block marked "removed"; pages (used); page_structs); kernel space (direct mapping (64TB), virtual memory map (1TB), vmalloc/ioremap space, kernel text mapping, module mapping space, holes) with the released regions marked "freed"; user space (process 1 ... process n, 128TB each) with pagetables marked "updated".]

Memory hot-remove

• Post work: automatically remove the node

After CPUs or memory are removed from node i, check: are all CPUs and memory of the node removed?
– NO: the node stays online.
– YES: node hot-remove.
1. Set node state to offline.
2. Remove /sys files.
3. Free wait_table.
4. Clear pgdat.

[Figure labels: node i with CPUs (CPU in use, CPU idle, CPU removed) and memory (unmovable memory, movable memory in ZONE_MOVABLE, memory removed); actions "Remove CPUs" and "Remove memory".]
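The decision on this slide is a single emptiness check followed by a fixed teardown sequence. A minimal sketch, assuming a hypothetical `Node` record; the step strings only echo the slide and do not call any kernel function.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    cpus: set = field(default_factory=set)
    memory_blocks: set = field(default_factory=set)
    state: str = "online"


# Teardown steps in the order the slide lists them.
TEARDOWN_STEPS = [
    "set node state to offline",
    "remove /sys files",
    "free wait_table",
    "clear pgdat",
]


def try_remove_node(node: Node) -> list:
    """Return the teardown steps performed, or [] if the node must stay."""
    if node.cpus or node.memory_blocks:
        return []                     # "NO" branch: something is still present
    node.state = "offline"
    return list(TEARDOWN_STEPS)
```

The check would run after every CPU or memory removal, so the node disappears as a side effect of removing its last device.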

Memory hot-remove

• Merged into Linux 3.9
• Configuration
– mm/Kconfig

config MEMORY_HOTREMOVE
	bool "Allow for memory hot remove"
	select MEMORY_ISOLATION
	select HAVE_BOOTMEM_INFO_NODE if X86_64
	depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
	depends on MIGRATION

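Memory hot-remove

• Kernel pages cannot be hot-removed
– User pages are reached through the user mapping (128TB). They can be migrated (1. migrate) and then hot-removed (2. hot-remove).
– Kernel pages are reached through the direct mapping (64TB), where va = pa + offset (1-1 mapped). Moving a kernel page would change the virtual address of every variable in it, so kernel pages are not migratable and therefore not hot-removable.

[Figure labels: physical space (user pages, hot-removable; kernel pages, not hot-removable); kernel space direct mapping (64TB); user space user mapping (128TB); a kernel variable inside a kernel page.]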
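The relation va = pa + offset is what pins kernel pages in place. The sketch below illustrates it in Python; the base address `0xFFFF880000000000` is an assumption taken from the x86_64 layout of 3.x-era kernels, and the function names merely imitate the kernel's helpers.

```python
# Assumed base of the 64TB direct mapping on x86_64 (3.x-era layout).
PAGE_OFFSET = 0xFFFF880000000000
DIRECT_MAP_SIZE = 64 << 40


def phys_to_virt(pa: int) -> int:
    """Kernel virtual address of a physical address in the direct mapping."""
    if not 0 <= pa < DIRECT_MAP_SIZE:
        raise ValueError("physical address outside the direct mapping")
    return PAGE_OFFSET + pa


def virt_to_phys(va: int) -> int:
    """Inverse of phys_to_virt; valid only for direct-mapping addresses."""
    pa = va - PAGE_OFFSET
    if not 0 <= pa < DIRECT_MAP_SIZE:
        raise ValueError("virtual address outside the direct mapping")
    return pa


def pointer_survives_move(old_pa: int, new_pa: int) -> bool:
    """A stored kernel pointer stays valid only if the page does not move."""
    return phys_to_virt(old_pa) == phys_to_virt(new_pa)
```

A user page avoids this problem because its virtual address comes from a per-process pagetable entry that can be rewritten to point at the new physical page.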

Agenda

1. Why we need Memory Hot-Plug
2. ACPI & Memory Hot-Plug
3. Memory hot-add
4. Memory hot-remove
5. Movable node
6. Bootmem handling
7. Future work

Movable node

• NUMA: Non-Uniform Memory Access
– A node consists of a set of CPUs and memory.
– CPUs access memory in the same node faster.
– Meaningful on 64-bit platforms only. A 32-bit platform has only one node.

[Figure labels: node0, node1, node2, each with CPUs and memory; accesses within a node are marked "Fast".]

Movable node

• Zone: Different types of memory in a node
– ZONE_DMA / ZONE_DMA32 (64-bit only): used for DMA.
– ZONE_NORMAL: Memory directly mapped, used by kernel and user space.
– ZONE_HIGHMEM: For 32-bit kernels to access memory not directly mapped.
– ZONE_MOVABLE: Memory can be migrated. (user space only)

[Figure labels: a node with CPUs and the zones ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE.]

Movable node

• Problem
– Kernel allocates ZONE_NORMAL on each node evenly.
– Each node has ZONE_NORMAL.
– Kernel may use ZONE_NORMAL.
– No node is hot-removable.

[Figure labels: node 0 (ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE), node 1 and node i (ZONE_NORMAL, ZONE_MOVABLE), each with CPUs; legend: CPU in use, CPU idle, unmovable memory, movable memory.]

Movable node

• Solution
– Configure a node to have only ZONE_MOVABLE.
– Movable node has no ZONE_NORMAL.
– Kernel cannot use ZONE_MOVABLE.
– The node is hot-removable.

[Figure labels: node 0 (ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE), node 1 (ZONE_NORMAL, ZONE_MOVABLE), node i (ZONE_MOVABLE only), each with CPUs; legend: CPU in use, CPU idle, unmovable memory, movable memory.]

Movable node

• Static configuration
– SRAT: System Resource Affinity Table

Static information of NUMA architecture. Mainly useful information:
– Processor Local APIC/SAPIC Affinity: APIC ID or SAPIC ID/EID, PXM (proximity domain), ...
– Processor Local x2APIC Affinity: Local x2APIC ID, PXM (proximity domain), ...
– Memory Affinity: Memory range, PXM (proximity domain), Hotpluggable flag, ...

[Figure labels: Kernel (ACPI Driver); ACPI BIOS (Methods, Tables); Hardware (Registers).]
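The Memory Affinity entry carries exactly the three fields the slide names. The sketch below decodes one such entry in Python; the 40-byte field layout is an assumption based on the ACPI specification's SRAT Memory Affinity Structure (type 1) and should be checked against the specification revision in use, and `parse_memory_affinity` is a hypothetical helper.

```python
import struct

# Assumed layout (little-endian, packed): type, length, proximity domain,
# reserved, base low/high, length low/high, reserved, flags, reserved.
MEM_AFFINITY = struct.Struct("<BBIHIIIIIIQ")   # 40 bytes in total
FLAG_ENABLED = 1 << 0
FLAG_HOT_PLUGGABLE = 1 << 1


def parse_memory_affinity(raw: bytes) -> dict:
    """Decode one SRAT Memory Affinity entry into a small dictionary."""
    (etype, length, pxm, _r1, base_lo, base_hi,
     len_lo, len_hi, _r2, flags, _r3) = MEM_AFFINITY.unpack(raw[:MEM_AFFINITY.size])
    if etype != 1 or length != MEM_AFFINITY.size:
        raise ValueError("not a Memory Affinity entry")
    return {
        "pxm": pxm,
        "base": (base_hi << 32) | base_lo,
        "length": (len_hi << 32) | len_lo,
        "enabled": bool(flags & FLAG_ENABLED),
        "hotpluggable": bool(flags & FLAG_HOT_PLUGGABLE),
    }
```

The hotpluggable bit is the piece of information the rest of this section relies on: it tells the kernel which ranges must stay free of kernel data.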

Movable node

• movablecore = acpi (New)
– Use SRAT to arrange ZONE_MOVABLE.
– Only for memory hotplug users.
– Still being pushed upstream.

[Figure labels: SRAT memory affinities mark node 0 (ZONE_DMA, ZONE_DMA32, ZONE_NORMAL) and node n (ZONE_NORMAL) as unhotpluggable, and node 1 and node i (ZONE_MOVABLE) as hotpluggable; legend: unmovable memory, movable memory.]

Movable node

• The old way (no performance loss)
– kernelcore / movablecore = nn{G|M|K} (Old)
– Allocate ZONE_NORMAL in each node evenly.
– For regular users.

[Figure labels: node 0 (ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE), node 1 ... node i ... node n (ZONE_NORMAL, ZONE_MOVABLE); every ZONE_NORMAL is marked "same size"; legend: unmovable memory, movable memory.]
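The even split described for kernelcore can be illustrated with a short calculation. This is a simplified sketch, not the kernel's algorithm: `parse_size` and `split_kernelcore` are hypothetical names, and the model ignores the redistribution a real implementation needs when a node is smaller than its share.

```python
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text: str) -> int:
    """Parse a size of the form nn, nnK, nnM, or nnG into bytes."""
    text = text.strip().upper()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def split_kernelcore(kernelcore: int, node_sizes: list) -> list:
    """Per-node (unmovable, movable) byte counts with an even kernelcore share."""
    share = kernelcore // len(node_sizes)
    result = []
    for size in node_sizes:
        normal = min(share, size)     # a small node cannot hold a full share
        result.append((normal, size - normal))
    return result
```

With kernelcore=4G on four 8GB nodes, this model gives every node 1GB of unmovable memory, which matches the slide's point that no node ends up fully movable.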

Movable node

• Dynamic configuration

echo COMMAND > /sys/devices/system/node/nodei/memoryXXX/state

COMMAND is one of:
1. online_kernel (NEW): Set to ZONE_NORMAL.
2. online_movable (NEW): Set to ZONE_MOVABLE.
3. online (Improved): Keep the previous state. (ZONE_NORMAL for the first time)

Rule: ZONE_MOVABLE should always be after ZONE_NORMAL and never overlap it.

[Figure labels: node i with CPUs (memory offline, CPU offline) and mem_section XXX, mem_section XXX+1; offline memory becomes online memory in ZONE_NORMAL or ZONE_MOVABLE; transitions labeled "online_kernel", "online_movable", "offline", "online", "offline and online again"; legend: CPU in use, CPU idle, unmovable memory, movable memory.]
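The ordering rule can be checked before a section is brought online. The sketch below is a simplified model: a node is a list of sections in address order, each "normal", "movable", or "offline"; both function names are hypothetical, and a real kernel applies stricter conditions than this check.

```python
def keeps_zone_order(zones: list) -> bool:
    """True if no ZONE_MOVABLE section precedes a ZONE_NORMAL section."""
    seen_movable = False
    for zone in zones:
        if zone == "movable":
            seen_movable = True
        elif zone == "normal" and seen_movable:
            return False
    return True


def online_section(zones: list, index: int, command: str) -> list:
    """Return the new layout after onlining one section, or raise ValueError."""
    target = {"online_kernel": "normal", "online_movable": "movable"}[command]
    if zones[index] != "offline":
        raise ValueError("section is already online")
    candidate = zones[:index] + [target] + zones[index + 1:]
    if not keeps_zone_order(candidate):
        raise ValueError("ZONE_MOVABLE must stay after ZONE_NORMAL")
    return candidate
```

In this model an offline section between the two zones can join either one, while a section that lies after movable memory can only be onlined as movable.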

Movable node

• Drawback
– Performance down!
– There is no good enough way to solve this problem now.

[Figure labels: unmovable nodes (ZONE_NORMAL, ZONE_MOVABLE) and a movable node (ZONE_MOVABLE), each with CPUs; legend: CPU in use, CPU idle, CPU in use by kernel, unmovable memory, movable memory.]

Movable node

• Merged into Linux 3.8
• Configuration
– mm/Kconfig

config MOVABLE_NODE
	boolean "Enable to assign a node which has only movable memory"
	depends on HAVE_MEMBLOCK
	depends on NO_BOOTMEM
	depends on X86_64
	depends on NUMA
	default n

Agenda

1. Why we need Memory Hot-Plug
2. ACPI & Memory Hot-Plug
3. Memory hot-add
4. Memory hot-remove
5. Movable node
6. Bootmem handling
7. Future work

Bootmem handling

• Memblock: A bootmem allocator
– Consists of two arrays.
– memblock.memory[]: All the present memory in the system.
– memblock.reserve[]: Allocated memory.
– Unused memory is freed to the buddy system at the end.

[Figure labels: low memory and memory1 ... memory7 spread over node0 ... node3; buddy system free lists by order (0, 1, ..., MAX).]

Bootmem handling

• Problem at boot time
– The bootmem allocator, memblock, may allocate hotpluggable memory for the kernel at boot time.

Boot time:
1. Memblock is ready. Hotpluggable memory ranges are unknown.
2. Parse SRAT (too late).
3. Initialize ZONEs.

[Figure labels: hotpluggable memory of a movable node (ZONE_MOVABLE) containing kernel data allocated by memblock; the node is marked "Not hot-removable".]

Bootmem handling

• Solution
1. Parse SRAT earlier, before memblock starts to work.
2. Introduce flags (DEFAULT, HOTPLUGGABLE) into memblock, and reserve hotpluggable memory with a special flag in memblock.

Boot time:
1. Memblock is ready.
2. Parse SRAT earlier. No memory allocation.
3. Initialize ZONEs.

[Figure labels: hotpluggable memory of a movable node (ZONE_MOVABLE) with kernel data kept outside it; the node is marked "Hot-removable".]

Bootmem handling

Boot time

1. Memblock is ready.
• All the memory ranges in the system are put into memory[].
• Allocated memory ranges will be put into reserve[].

[Figure labels: memblock.memory[] holding low memory and memory1 ... memory7 across node0 ... node3 (unhotpluggable memory, hotpluggable memory); memblock.reserve[] is empty.]

Bootmem handling

Boot time

2. Before parsing SRAT.
• Reserve kernel _data, _text, setup data, ..., with flag DEFAULT.
• Any node the kernel resides in is unhotpluggable. (Not necessarily node 0)
• No new memory allocation, so no hotpluggable memory can be used by the kernel.

[Figure labels: memblock.memory[] holding low memory and memory1 ... memory7 across node0 ... node3 (unhotpluggable memory, hotpluggable memory); memblock.reserve[] holds one DEFAULT entry.]

Bootmem handling

Boot time

3. Parsing SRAT.
• Reserve hotpluggable memory with flag HOTPLUGGABLE.

[Figure labels: memblock.memory[] holding low memory and memory1 ... memory7 across node0 ... node3 (unhotpluggable memory, hotpluggable memory); memblock.reserve[] holds DEFAULT and HOTPLUGGABLE entries.]

Bootmem handling

Boot time

4. After parsing SRAT, hotpluggable memory has been reserved.
• No hotpluggable memory used by kernel.

[Figure labels: memblock.memory[] holding low memory and memory1 ... memory7 across node0 ... node3 (unhotpluggable memory, hotpluggable memory); memblock.reserve[] holds DEFAULT and HOTPLUGGABLE entries.]

Bootmem handling

Boot time

5. Memory initialization has been finished.
• Free hotpluggable memory to buddy system. (NEW)

[Figure labels: memblock.reserve[] holding DEFAULT and HOTPLUGGABLE entries (unhotpluggable memory, hotpluggable memory); buddy system free lists by order (0, 1, ..., MAX).]
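The five boot-time steps can be tried out in a toy allocator. The sketch below is an assumption-laden model in Python: the `Memblock` class, its method names, and the first-fit policy are hypothetical, and only the two arrays and the two flags come from the slides.

```python
DEFAULT, HOTPLUGGABLE = "DEFAULT", "HOTPLUGGABLE"


class Memblock:
    def __init__(self, memory):
        self.memory = list(memory)    # step 1: all ranges, as (start, end)
        self.reserve = []             # allocated ranges, as (start, end, flag)

    def reserve_range(self, start, end, flag=DEFAULT):
        """Steps 2 and 3: record a range as taken, tagged with a flag."""
        self.reserve.append((start, end, flag))

    def alloc(self, size):
        """First-fit allocation that skips every reserved range."""
        for start, end in self.memory:
            addr = start
            while addr + size <= end:
                clash = next((r for r in self.reserve
                              if addr < r[1] and r[0] < addr + size), None)
                if clash is None:
                    self.reserve_range(addr, addr + size)
                    return addr
                addr = clash[1]       # retry just past the reserved range
        raise MemoryError("no unreserved range is large enough")

    def release_hotpluggable(self):
        """Step 5: hand the HOTPLUGGABLE ranges over to the buddy system."""
        freed = [(s, e) for s, e, flag in self.reserve if flag == HOTPLUGGABLE]
        self.reserve = [r for r in self.reserve if r[2] != HOTPLUGGABLE]
        return freed
```

Because the hotpluggable ranges are reserved before any ordinary allocation happens, boot-time kernel data can only land in unhotpluggable memory in this model.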

Agenda

1. Why we need Memory Hot-Plug
2. ACPI & Memory Hot-Plug
3. Memory hot-add
4. Memory hot-remove
5. Movable node
6. Bootmem handling
7. Future work

Future work

• Node local pagetable and vmemmap.
– Improve performance.
• Migrate user pages pinned in memory.
– For those who pin pages for a long time.
• User space tools, like libnuma and numactl.
– A library of functions.
– Commands.

Thank you!

Q&A

Movable node

• Performance tests

[Figure: bar chart, "Time of accessing 20GB memory (s)", comparing an unmovable node with a movable node for alloc, read, and write.]

– Alloc: 40% down
– Read: 33% down
– Write: 12% down

