Precision Timed Embedded Systems Using TickPAD Memory
Matthew M. Y. Kuo*, Partha S. Roop*, Sidharta Andalam†, Nitish Patel*
* University of Auckland, New Zealand   † TUM CREATE, Singapore
Slide 2
Introduction
Hard real-time systems
- Must meet real-time deadlines; catastrophic events may occur when a deadline is missed
Synchronous execution approach
- Well suited to hard real-time systems: deterministic and reactive
- Aids static timing analysis: programs are well bounded, with no unbounded loops or recursion
Slide 3
Synchronous Languages
- Execute in logical time (ticks): sample inputs, compute, emit outputs
- Synchronous hypothesis: ticks are instantaneous; the system is assumed to execute infinitely fast, i.e. faster than the environment it responds to
- Worst-case reaction time: the time between two logical ticks
- Languages: Esterel, SCADE, PRET-C (an extension to C)
Slide 5
PRET-C
- Light-weight multithreading in C
- Provides thread-safe memory access
- C extension implemented as C macros

Statement               Meaning
ReactiveInput I         Declares I as a reactive environment input
ReactiveOutput O        Declares O as a reactive environment output
PAR(T1, ..., Tn)        Synchronously executes n threads in parallel, where thread Ti has a higher priority than Ti+1
EOT                     Marks the end of a tick
[weak] abort P when C   Preempts P when C is true
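A minimal sketch of how these constructs fit together, modelled on the example used later in this talk; the thread bodies are illustrative placeholders, and the reactive I/O declaration form simply follows the table above rather than any particular compiler version:

    ReactiveInput I;          /* I is sampled from the environment at each tick  */
    ReactiveOutput O;         /* O is emitted to the environment at each tick    */

    void thread t1() {        /* higher priority: passed to PAR before t2        */
        sample(I);            /* computation for this tick (placeholder)         */
        EOT;                  /* end of tick: synchronization boundary           */
        sample(I);
        EOT;
    }

    void thread t2() {        /* lower-priority thread                           */
        O = control(I);       /* placeholder computation                         */
        EOT;
    }

    int main() {
        init();
        PAR(t1, t2);          /* run t1 and t2 synchronously, t1 before t2       */
        return 0;
    }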
Slide 6
Introduction
- Practical systems require larger memories: not all applications fit in on-chip memory
- This requires a memory hierarchy (processor-memory gap [1])
[1] Hennessy, John L., and David A. Patterson. Computer Architecture: A Quantitative Approach. San Francisco, CA: Morgan Kaufmann, 2011.
Slide 7
Introduction
- Traditional approaches: caches and scratchpads
- However, there is scant research on memory architectures tailored for synchronous execution and concurrency
Slide 8
Caches (figure: CPU and Main Memory)
Slide 9
Caches
- Traditionally a small, fast piece of memory
- Exploits temporal and spatial locality
- Hardware-controlled replacement policy
(Figure: CPU - Cache - Main Memory)
Slide 10
Caches in hard real-time systems
- The architecture must be modelled to compute the WCRT
- Cache models trade off analysis time against tightness: very tight worst-case estimates are not scalable
(Figure: CPU - Cache - Main Memory)
Slide 11
Scratchpad Memory (SPM)
- Software controlled, statically allocated
- Statically or dynamically loaded
- Requires an allocation algorithm, e.g. ILP or greedy
(Figure: CPU - SPM - Main Memory)
Slide 12
Scratchpads in hard real-time systems
- Easy to compute a tight WCRT
- Reload overheads can reduce worst-case performance: a balance is needed between the number of reload points and their overheads
- May perform worse than a cache in worst-case performance
(Figure: CPU - SPM - Main Memory)
Slide 13
TickPAD
(Figure: CPU, Cache / SPM, Main Memory)
- Cache: good overall performance; hardware controlled
- SPM: good worst-case performance; enables fast and tight static analysis
Slide 14
TickPAD
(Figure: CPU, TPM, Main Memory)
- The TickPAD Memory (TPM) combines both: hardware control for good overall performance, with the good worst-case performance and fast, tight static analysis of a scratchpad
Slide 15
TickPAD Memory (TPM)
- TickPAD: Tick Precise Allocation Device
- A memory controller; a hybrid between caches and scratchpads, combining hardware-controlled features with static software allocation
- Tailored for synchronous languages
- Used as instruction memory
(Figure: CPU - TPM - Main Memory)
Slide 16
TickPAD Design flow
Slide 17
PRET-C example (figure: thread hierarchy, main spawning t1, t2, t3)

    int main() {
        init();
        PAR(t1, t2, t3);
        ...
    }

    void thread t1() {
        compute;
        EOT;
        compute;
        EOT;
    }
Slide 18
PRET-C example (same code as above). Highlighted: the compute statements - the computation performed within a tick.
Slide 19
PRET-C example (same code as above). Highlighted: PAR(t1, t2, t3) - spawns the child threads.
Slide 20
PRET-C example (same code as above). Highlighted: EOT - the end of tick, marking the synchronization boundaries.
Slide 21
PRET-C example (same code as above). Highlighted: the child threads terminate.
Slide 22
PRET-C example (same code as above). Highlighted: the main thread resumes.
Slides 23-30
PRET-C execution (figure: tick timeline for main, t1, t2, t3)
- Inputs are sampled at the start of the tick
- Threads execute within the tick in priority order: main, then t1, then t2, and so on
- Outputs are emitted at the end of the tick
- One tick (the reaction time) spans from sampling the inputs to emitting the outputs
- Each thread's portion of the tick, up to its EOT, is its local tick
Slide 31
Assumptions
- A cache line holds 4 instructions (e.g. at 0x00, 0x04, 0x08, 0x0C) and takes one burst transfer from main memory
- A cache miss takes 38 clock cycles [2]
- Each instruction takes 2 cycles to execute
- Buffers are 1 cache line in size
[2] J. Whitham and N. Audsley. The Scratchpad Memory Management Unit for Microblaze: Implementation, Testing, and Case Study. Technical Report YCS-2009-439, University of York, 2009.
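Under these assumptions the cost of the common cases can be written out directly; the constants below simply restate the figures above, and the helper name line_cost is illustrative:

    /* Figures from the assumptions above ([2]). */
    #define LINE_WORDS    4      /* instructions per cache line         */
    #define MISS_CYCLES   38     /* one burst transfer from main memory */
    #define EXEC_CYCLES   2      /* execution time per instruction      */

    /* Executing a full line fetched on demand: 38 + 4*2 = 46 cycles.
       Executing a line already held in a buffer:      4*2 =  8 cycles. */
    static int line_cost(int fetched_from_main_memory) {
        return (fetched_from_main_memory ? MISS_CYCLES : 0)
               + LINE_WORDS * EXEC_CYCLES;
    }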
Slide 32
TickPAD - Overview
Slide 33
TickPAD - Overview
- Spatial memory pipeline: accelerates linear code
Slide 34
TickPAD - Overview
- Associative loop memory: for predictable temporal locality; statically allocated and dynamically loaded
Slide 35
TickPAD - Overview
- Tick address queue: stores the resumption addresses of the active threads
Slide 36
TickPAD - Overview
- Tick instruction buffer: stores the instructions at the resumption point of the next active thread, to reduce context-switching overhead at state/tick boundaries
Slide 37
TickPAD - Overview
- Command table: stores a set of commands to be executed by the TickPAD controller
Slide 38
TickPAD - Overview
- Command buffer: a buffer that stores operands fetched from main memory, for commands requiring two or more operands
Slide 39
Spatial Memory Pipeline
- Cache: on a miss, the line is fetched from main memory into the cache; the first instruction misses and subsequent instructions on that line hit. Timing analysis requires the cache history.
- Scratchpad (unallocated code): executes from main memory; every instruction pays the miss cost, but timing analysis is simple.
Slide 40
Spatial Memory Pipeline
- The memory controller holds a single line buffer
- Analysis is simple: only the previous instruction needs to be analysed; the first instruction of a line misses and subsequent instructions on that line hit
(Figure: CPU - line buffer - Main Memory)
Slide 41
Spatial Memory Pipeline
- Computation typically requires many lines of instructions
- To exploit spatial locality, predictably prefetch the next line of instructions into a second buffer
Slide 42
Spatial Memory Pipeline
- To preserve determinism, the prefetch is only active when the current line contains no branch (see the sketch below)
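A sketch of that prefetch rule, assuming (as above) one line buffer for the current line and one for the prefetched line; the helper functions line_contains_branch() and fetch_line() are illustrative, not part of the actual controller interface:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint32_t base_addr;    /* address of the first instruction in the line */
        uint32_t words[4];     /* 4 instructions per line (see assumptions)    */
        bool     valid;
    } line_buffer_t;

    /* Illustrative helpers: decode a line and perform a burst fetch. */
    bool line_contains_branch(const line_buffer_t *line);
    void fetch_line(uint32_t addr, line_buffer_t *dest);

    /* Prefetch the next sequential line only when the current line is
       branch-free, so the prefetched line is guaranteed to be used and
       the timing stays deterministic. */
    static void maybe_prefetch(const line_buffer_t *current, line_buffer_t *next) {
        if (!line_contains_branch(current)) {
            fetch_line(current->base_addr + 16, next);  /* next 4-word line      */
        } else {
            next->valid = false;                        /* demand fetch on branch */
        }
    }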
Slides 43-52
Spatial Memory Pipeline (figure-only animation of the pipeline in operation)
Slide 53
Spatial Memory Pipeline: timing analysis
- Simple to analyse: examine the next instruction line
- If the current line ends in a branch, the next target line will miss, e.g. 38 clock cycles
- Otherwise the next line will have been prefetched, e.g. 38 - 8 = 30 clock cycles
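The per-line latency rule can then be sketched as below; the 30-cycle figure assumes the 38-cycle prefetch overlaps the 8 cycles (4 instructions x 2 cycles) spent executing the current line, consistent with the assumptions on Slide 31, and the function name is illustrative:

    #define MISS_CYCLES  38          /* burst transfer from main memory            */
    #define LINE_EXEC    (4 * 2)     /* executing one 4-instruction line: 8 cycles */

    /* Memory latency charged to the next instruction line:
       - the target of a branch cannot have been prefetched, so the full
         miss cost applies;
       - otherwise the prefetch has been running while the current line
         executed, leaving 38 - 8 = 30 cycles visible. */
    static int next_line_latency(int current_line_has_branch) {
        return current_line_has_branch ? MISS_CYCLES : (MISS_CYCLES - LINE_EXEC);
    }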
Slide 56
Tick Address Queue and Tick Instruction Buffer
- Reduce the cost of context switching
- Maintain a priority queue of thread resumption addresses, in thread execution order
- Instructions from the next thread are prefetched, making context-switching points appear as linear code
- Paired with the Spatial Memory Pipeline (a sketch of the bookkeeping follows below)
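A sketch of that bookkeeping, assuming one resumption address per active thread kept in execution (priority) order; all names and sizes here are illustrative rather than the actual hardware layout:

    #include <stdint.h>

    #define MAX_THREADS 8

    /* Resumption addresses of the active threads, kept in execution
       (priority) order: the head is the next thread to run. */
    typedef struct {
        uint32_t resume_addr[MAX_THREADS];
        int      count;
    } tick_address_queue_t;

    /* One cache line of instructions prefetched from the head of the queue. */
    typedef struct {
        uint32_t base_addr;
        uint32_t words[4];
    } tick_instruction_buffer_t;

    /* Illustrative helper: burst-fetch one line from main memory. */
    void fetch_line(uint32_t addr, tick_instruction_buffer_t *dest);

    /* At a context-switching point the controller fills the tick
       instruction buffer from the next thread's resumption address, so
       the switch costs the same as falling through linear code. */
    static void prefetch_next_thread(const tick_address_queue_t *q,
                                     tick_instruction_buffer_t *tib) {
        if (q->count > 0) {
            fetch_line(q->resume_addr[0], tib);
        }
    }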
Slides 57-62
Tick Address Queue and Tick Instruction Buffer (figure-only animation of a context switch)
- The context-switching memory cost is the same as for linear code
Slides 63-67
Tick Address Queue and Tick Instruction Buffer (figure-only animation, continued)
- Timing analysis: context-switching points allocated in the TickPAD are analysed with the same prefetched-line rule as linear code
Slide 68
Associative Loop Memory
- Statically allocated by a greedy algorithm that allocates the innermost loops first (sketched below)
- Loops are fetched before they execute, which is predictable and easy to model tightly
- Exploits temporal locality
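A sketch of that greedy pass, assuming the compiler knows each loop's footprint and nesting depth; the record layout, capacity parameter and allocated flag are illustrative:

    #include <stdlib.h>

    typedef struct {
        int start_addr;    /* address of the first instruction of the loop   */
        int size_bytes;    /* footprint of the loop body                     */
        int depth;         /* nesting depth: larger means more deeply nested */
        int allocated;     /* set by the allocator                           */
    } loop_info_t;

    /* Sort deepest loops first so the innermost loops win the capacity. */
    static int by_depth_desc(const void *a, const void *b) {
        return ((const loop_info_t *)b)->depth - ((const loop_info_t *)a)->depth;
    }

    /* Greedy pass: allocate innermost loops into the associative loop
       memory while they still fit in the remaining capacity. */
    static void allocate_loops(loop_info_t *loops, size_t n, int capacity_bytes) {
        qsort(loops, n, sizeof *loops, by_depth_desc);
        int remaining = capacity_bytes;
        for (size_t i = 0; i < n; i++) {
            if (loops[i].size_bytes <= remaining) {
                loops[i].allocated = 1;
                remaining -= loops[i].size_bytes;
            }
        }
    }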
Slide 69
Command Table
- Statically allocated
- A look-up table used to dynamically load the Tick Instruction Buffer, the Tick Address Queue and the Associative Loop Memory
- A command is executed when the PC matches the address stored in the command
- Allows the TickPAD to function without modification to the source code (libraries, proprietary programs)
Slide 70
Command Table: three fields
- Address: the PC address at which to execute the command
- Command: one of Discard Loop Associative Memory, Store Loop Associative Memory, Fill Tick Instruction Buffer, Load Tick Address Queue
- Operand: data used by the command
(A possible record layout is sketched below.)
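These three fields map naturally onto a record such as the one below; the enum member names and field widths are illustrative, not the actual hardware encoding:

    #include <stdint.h>

    /* Commands the TickPAD controller can execute when the PC matches. */
    typedef enum {
        CMD_DISCARD_LOOP_MEMORY,   /* Discard Loop Associative Memory */
        CMD_STORE_LOOP_MEMORY,     /* Store Loop Associative Memory   */
        CMD_FILL_TICK_BUFFER,      /* Fill Tick Instruction Buffer    */
        CMD_LOAD_TICK_QUEUE        /* Load Tick Address Queue         */
    } tpm_command_t;

    /* One command table entry: trigger address, command and operand. */
    typedef struct {
        uint32_t      address;     /* PC value that triggers the command */
        tpm_command_t command;
        uint32_t      operand;     /* data used by the command           */
    } tpm_table_entry_t;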
Slide 71
Command Table Allocation

Node    Command                                                           Address
FORK    Load Tick Address Queue x N; Fill Tick Instruction Buffer         Address of FORK
EOT     Load Tick Address Queue; Fill Tick Instruction Buffer             Address of EOT
KILL    Fill Tick Instruction Buffer                                      Address of KILL
Loops   Discard Loop Associative Memory; Store Loop Associative Memory    Address at start of loop
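As a usage example, for the PAR(t1, t2, t3) program earlier in the talk the FORK row of this table would expand to entries along the following lines (addresses are symbolic placeholders, and the record type is the illustrative one sketched above):

    /* Symbolic placeholder addresses, for illustration only. */
    #define FORK_ADDR       0x0100u
    #define T1_RESUME_ADDR  0x0200u
    #define T2_RESUME_ADDR  0x0300u
    #define T3_RESUME_ADDR  0x0400u

    /* FORK: one Load Tick Address Queue per child thread, plus a
       Fill Tick Instruction Buffer, all triggered at the FORK's address. */
    static const tpm_table_entry_t fork_entries[] = {
        { FORK_ADDR, CMD_LOAD_TICK_QUEUE,  T1_RESUME_ADDR },
        { FORK_ADDR, CMD_LOAD_TICK_QUEUE,  T2_RESUME_ADDR },
        { FORK_ADDR, CMD_LOAD_TICK_QUEUE,  T3_RESUME_ADDR },
        { FORK_ADDR, CMD_FILL_TICK_BUFFER, 0 /* operand unused */ },
    };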
Conclusion
- Presented a new memory architecture tailored for synchronous programs
- Better worst-case performance
- Analysis time is scalable: between scratchpad analysis and abstract cache analysis
- The presented architecture is also suitable for other synchronous languages
Future work
- A data TickPAD
- TickPAD on multicores