FSW Workshop 2011
RBSP Mission Ops Flight Software Simulator – Saving & Restoring Sessions
Christopher A. Monaco JHU/APL
[email protected] 240-228-5387
Overview
2 FSW-11 October 19-21, 2011
• FAST is a Mission Operations tool that simulates the state of the spacecraft and validates command sequences against operational constraints in “faster-than-real-time”.
• The platform for the flight component of the simulator is a quad-core x86 architecture running 64-bit SUSE Linux with a real-time preemption patch.
• The simulator executes the RBSP flight software with targeted modifications, plus an additional simulation framework and models.
• It will be used on a daily basis and terminated between sessions.
• Technical challenges encountered with FAST: Endian, Time, and Save/Restore.
RBSP FSW Architecture
[Figure: FSW architecture layers: HW components, Mission Libraries, Mission Applications (Apps), cFE, Platform Support Package (PSP), Operating System Abstraction Layer (OSAL)]
• FSW uses cFE middleware
• Event/Message based architecture
• Applications are loosely coupled
• FSW applications interface only with mission libraries (and cFE, PSP and OSAL)
• Mission libraries may interface with devices – EEPROM, SSR, Interface Card, etc
Simulator FSW-Component Architecture
• Simulator also uses cFE, with a different OSAL and PSP for Linux and x86
• Minor changes to cFE
• Minor changes to applications
• HW-independent (FSW) libraries are used by the simulator
• HW-dependent libraries are replaced by libraries with emulations of the HW
• Applications have no other dependencies outside of cFE, OSAL, PSP and a few libraries
[Figure: simulator stack on a Linux workstation: Mission Applications, Emulation & Simulation Libraries]
Save/Restore Requirements
• The state file needs to be human readable, in the event that key elements need to be modified.
  - Modification is expected to be infrequent and done with the assistance of simulator developers.
• The restore process must be able to load state into a new release of the simulator.
  - Not necessarily releases that take new FSW: releases of new FSW to the SC would be accompanied by a CPU reset.
  - A minor change to a model should not prevent loading the simulator state.
• Constraint: the system has significant performance requirements, so the save/restore solution cannot impact the run-time of the system.
Off the Shelf Solutions
• Cryopid2
  - Open source application checkpointing software
  - Handles sockets, pipes, open files, data
  - Available via SourceForge
  - Only works for 32-bit kernels; we are using a 64-bit kernel
  - Checkpoint is NOT human readable/editable
• DMTCP (Distributed MultiThreaded CheckPointing)
  - Handles sockets, pipes, open files, data
  - Available via SourceForge
  - Able to build and execute on our 64-bit Linux distro
  - Checkpoint is NOT human readable
  - No noticeable performance impact
  - Several warning messages during execution
  - Checkpoints are NOT compatible across minor builds: recompiling will invalidate a checkpoint
• Berkeley Lab Checkpoint/Restart (BLCR): planning to look at this
Custom Solution
• The state file references internal data the same way the source code does.
  - Use the real variable names: human readable.
• Build the save/restore functions into the simulation software.
  - Use the compiler to translate the xxx.yyy variable mnemonic into the relocatable address.
  - Variable xxx.yyy in the state file (if the name does not change) can be restored across releases.
• Across minor releases of the simulator the FSW does not change.
  - Simulator executive and models may change.
  - FSW variable names do not change, and FSW state can persist across minor releases of the simulator.
• The simulation can pause, with each thread returning to a home state; this provides a good opportunity to perform save and restore operations.
• Challenge: how to identify all of the data in the system to save/restore?
Options
• Hand pick elements that are important to save and restore, and write code for these elements?
  - Very manual approach that needs to be revisited with each version of the SW.
  - Did we save everything that is important?
• Or try to save everything and hand pick elements that should NOT be saved/restored?
  - Did we exclude something that should be saved?
  - Can some things NOT be restored?
  - How do we save everything?
• Instead of choosing specific state information that we want to save/restore, let's list everything and then find the things that we know we don't want to (or cannot) save/restore.
How to Identify All of the Internal Data?
• The highest priority state information is contained within global data and function-static data.
• Intermediate stage of the compilation process: the Translation Unit (TU)
  - Generated by the compiler front end and passed to the compiler; contains an abstract parse tree / abstract syntax tree for the entity being compiled
  - Can be dumped into an ASCII file using -fdump-translation-unit
  - The TU covers the entire context of the source file: all include files have been pulled in
  - Contains information about all variables and their types in the given context
Mining the Translation Unit
• Perl library GCC::TranslationUnit (Ashley Winters, 2003)
  - Reads the translation unit output and stores it internally so it can easily be traversed
• Start at the root of the TU parse tree and follow the chained nodes
• Find all the variable declarations (var_decl); this gives all of the global variable declarations.
[Figure: TU tree for the declaration "int a;". From: “GCC Front-End Internals”, Andi Hellmund, March 6 2011]
Complex Data Structures?
Complex data structures are also described using a tree in the Translation Unit.
Example: a[i].b.c[j].e[k].f. Everything ultimately terminates with a primitive type (a leaf).
We want the leaf items in each data structure and for every array entry.
So we need to un-nest complex data structures and ultimately unroll arrays.
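A minimal sketch of what un-nesting and unrolling produce, assuming a small hypothetical structure (the types, the global `a`, and `list_leaves` are invented here and are not from the RBSP code). Each nested member is flattened into a dotted name and each array index is expanded, so every output line names exactly one primitive leaf:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical nested types, invented for illustration only. */
#define N_OUTER 2
#define N_INNER 3

struct inner { int f; };
struct outer { struct inner e[N_INNER]; unsigned char flag; };

struct outer a[N_OUTER];

/* Write one line per leaf of 'a': nested members are flattened and
 * arrays are unrolled. Returns the number of leaves listed. */
int list_leaves(FILE *out)
{
    int count = 0;

    assert(out != NULL);
    for (size_t i = 0; i < N_OUTER; i++) {
        for (size_t k = 0; k < N_INNER; k++) {
            fprintf(out, "a[%zu].e[%zu].f\n", i, k);
            count++;
        }
        fprintf(out, "a[%zu].flag\n", i);
        count++;
    }
    return count;
}
```

Calling `list_leaves(stdout)` lists eight leaves for this example: three `f` entries plus one `flag` for each of the two outer elements.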
Input to the Code Generator
The product of mining the translation unit output is a data file that contains:
• The regular expression used in the C code to match a line in the state file with the internal C variable name
• The actual variable name with the array component “[ ]” specifying the size of the array: foo_array[100].boo_array[50].leaf
• The actual variable name with the array component filled in with loop indices starting with “i”: foo_array[i].boo_array[j].leaf
• Signedness, primitive type, bit length, min and max values
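One way to picture a single entry of that data file is as a C record, sketched below. The struct layout, the field names, and the choice of a signed 16-bit `short` for the leaf are assumptions made for illustration; the source does not describe the actual file format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory form of one leaf entry; field names are invented. */
typedef struct {
    const char *match_regex;    /* matches a state-file name for this leaf */
    const char *sized_name;     /* array components give the array sizes   */
    const char *indexed_name;   /* array components use loop indices i, j  */
    bool        is_signed;
    const char *primitive_type;
    unsigned    bit_length;
    long long   min_value;
    long long   max_value;
} leaf_record;

/* Example entry; the type and range are assumed, not taken from the source. */
const leaf_record example_record = {
    "^foo_array\\[([0-9]+)\\]\\.boo_array\\[([0-9]+)\\]\\.leaf$",
    "foo_array[100].boo_array[50].leaf",
    "foo_array[i].boo_array[j].leaf",
    true, "short", 16, -32768, 32767
};

/* True when a candidate value fits the range recorded for the leaf. */
bool leaf_value_in_range(const leaf_record *r, long long v)
{
    assert(r != NULL);
    return v >= r->min_value && v <= r->max_value;
}
```

Recording the range alongside the name would let generated restore code reject a hand-edited value that cannot fit the target type.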
Output of the Code Generator - Save
• The code generator is run for each source file that has state that should be maintained.
• It produces a save function and a restore function specific to that file.
• The “Save” function serially writes each leaf name and value to a specified file:
    array_var[0].leaf = 6;
    array_var[0].another_leaf = 3;
    array_var[1].leaf = 7;
    array_var[1].another_leaf = 2;
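A generated save function could look roughly like the sketch below. The global `array_var`, its type, and the function name `example_save` are assumptions for illustration; the real generated code may be fully unrolled rather than looped.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical global state, using the values from the sample output. */
struct entry { int leaf; int another_leaf; };
struct entry array_var[2] = { { 6, 3 }, { 7, 2 } };

/* Sketch of a per-file save function: writes "name = value;" for each
 * leaf. Returns the number of leaves written, or -1 on a write error. */
int example_save(FILE *out)
{
    int written = 0;

    assert(out != NULL);
    for (int i = 0; i < 2; i++) {
        if (fprintf(out, "array_var[%d].leaf = %d;\n",
                    i, array_var[i].leaf) < 0)
            return -1;
        if (fprintf(out, "array_var[%d].another_leaf = %d;\n",
                    i, array_var[i].another_leaf) < 0)
            return -1;
        written += 2;
    }
    return written;
}
```

Because each line carries the full variable name, the file stays readable and editable by hand, which is the stated requirement.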
Output of the Code Generator - Restore
The “Restore” function takes the variable name string and the value string as arguments. It attempts to match the name against the regular expressions that correspond to the variables it knows how to restore.
Regular expressions created for each leaf while mining the TU file
When it finds a match it performs any conversions/casts necessary and does the assignment.
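The matching step can be sketched with POSIX regular expressions (`regcomp`/`regexec`). This is an assumption about the mechanism: the source does not name the regex library, and `array_var` and `example_restore` are hypothetical. Only one leaf is handled here to keep the sketch short.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>

/* Hypothetical global state to restore into. */
struct entry { int leaf; int another_leaf; };
struct entry array_var[2];

/* Sketch of a per-file restore function for a single leaf.
 * Returns 0 on success, -1 if the name is unknown or out of range. */
int example_restore(const char *name, const char *value)
{
    /* The capture group stands in for the array index. */
    static const char *pattern = "^array_var\\[([0-9]+)\\]\\.leaf$";
    regex_t re;
    regmatch_t m[2];
    int rc = -1;

    assert(name != NULL && value != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, name, 2, m, 0) == 0) {
        /* Parse the captured index, then convert and assign the value. */
        long idx = strtol(name + m[1].rm_so, NULL, 10);
        if (idx >= 0 && idx < 2) {
            array_var[idx].leaf = (int)strtol(value, NULL, 10);
            rc = 0;
        }
    }
    regfree(&re);
    return rc;
}
```

For example, `example_restore("array_var[1].leaf", "7")` sets `array_var[1].leaf` to 7, while an unknown name or an out-of-range index returns -1 and changes nothing.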
What about Function-Static Data?
Function-static data cannot be referenced outside of the specified function – not by name
Developed a C preprocessor that replaces each function-static declaration with a unique global declaration <function_name>_<original_variable_name>
Replace all references within the scope with the new name.
Do this prior to the previously outlined process
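The effect of that rewrite on one function might look like the sketch below. The function `next_sequence` and its counter are hypothetical; only the `<function_name>_<original_variable_name>` naming convention comes from the text above.

```c
#include <assert.h>
#include <stdio.h>

/* Before the rewrite (kept as a comment so this block compiles):
 *
 *   int next_sequence(void)
 *   {
 *       static int counter = 0;
 *       return ++counter;
 *   }
 */

/* After the rewrite: the function-static becomes a uniquely named global. */
int next_sequence_counter = 0;

int next_sequence(void)
{
    return ++next_sequence_counter;
}

/* Save code can now refer to the former static by name. */
int save_next_sequence(FILE *out)
{
    assert(out != NULL);
    return fprintf(out, "next_sequence_counter = %d;\n",
                   next_sequence_counter);
}
```

The behavior of `next_sequence` is unchanged; the only difference is that its state now appears among the global variable declarations that the translation unit mining finds.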
State File
Data structures un-nested
Arrays are unrolled
Organized by source file; each source file's section maps to a single restore function
Coordinating Save/Restore
• The FSW is an event-driven system: all tasks and applications pend on queues, pipes or semaphores.
• We’ve modified the OSAL for the simulator to be aware of the simulator state: PAUSED, RT, FASTER_THAN_RT
• The system can be paused in one of two ways:
  1. During nominal execution: any attempt to get data on queues, pipes or semaphores, or to perform a delay operation, gets blocked until the simulator returns from PAUSE.
  2. Pause the events that cause the system to run (data on queues, pipes, semaphores); this allows tasks and applications to return to a “home” state. Used prior to save/restore operations.
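The first pause mode can be sketched as a gate that every wrapped blocking call passes through. The sketch assumes POSIX threads; the names `sim_set_state` and `sim_wait_while_paused` are hypothetical and are not the actual OSAL modifications.

```c
#include <assert.h>
#include <pthread.h>

typedef enum { SIM_PAUSED, SIM_RT, SIM_FASTER_THAN_RT } sim_state_t;

static pthread_mutex_t sim_lock    = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  sim_resumed = PTHREAD_COND_INITIALIZER;
static sim_state_t     sim_state   = SIM_RT;

/* Change the simulator state and wake any tasks held at the gate. */
void sim_set_state(sim_state_t s)
{
    assert(s == SIM_PAUSED || s == SIM_RT || s == SIM_FASTER_THAN_RT);
    pthread_mutex_lock(&sim_lock);
    sim_state = s;
    if (s != SIM_PAUSED)
        pthread_cond_broadcast(&sim_resumed);
    pthread_mutex_unlock(&sim_lock);
}

/* Gate intended for the top of each wrapped queue, pipe, semaphore or
 * delay call: blocks while the simulator is PAUSED, then returns the
 * state it observed when released. */
sim_state_t sim_wait_while_paused(void)
{
    sim_state_t seen;

    pthread_mutex_lock(&sim_lock);
    while (sim_state == SIM_PAUSED)
        pthread_cond_wait(&sim_resumed, &sim_lock);
    seen = sim_state;
    pthread_mutex_unlock(&sim_lock);
    return seen;
}
```

A task calling `sim_wait_while_paused()` while the state is PAUSED sleeps until another thread calls `sim_set_state(SIM_RT)` or `sim_set_state(SIM_FASTER_THAN_RT)`.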
Run-Time Performance
• Save operation takes < 1 second to generate 20 MB of ASCII state
• Initial implementation of the Restore process
  - Single task matching up an ASCII string to an internal variable using a regular expression
  - Linear search across all internal variables until a match is found
  - Each leaf (and its array indices) treated as a separate item to match against
  - Restore operation took > 1 hour
• Second implementation of the Restore process
  - Four tasks (quad-core processor); each data point can be restored independently
  - Segregate the state file by source file (and Save/Restore function): 85 source files
  - Still a linear search, but over 1/85 of the original domain
  - Arrays use wild cards for array indices
  - Restore operation takes 3 minutes on a 20 MB file
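Segregating by source file amounts to a dispatch step in front of the per-file restore functions, sketched below. The file names, the stub restore functions, and `dispatch_restore` are hypothetical, and the four-task parallelism is left out to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*restore_fn)(const char *name, const char *value);

/* Stub per-file restore functions; real ones would match and assign. */
int restored_a = 0, restored_b = 0;

static int restore_file_a(const char *name, const char *value)
{
    (void)name; (void)value;
    restored_a++;
    return 0;
}

static int restore_file_b(const char *name, const char *value)
{
    (void)name; (void)value;
    restored_b++;
    return 0;
}

/* One row per source file that has generated save/restore code. */
static const struct {
    const char *source_file;
    restore_fn  restore;
} restore_table[] = {
    { "file_a.c", restore_file_a },
    { "file_b.c", restore_file_b },
};

/* Route one state-file entry to the restore function for its section,
 * so the name is only matched against that file's variables.
 * Returns -1 when no restore function is registered for the file. */
int dispatch_restore(const char *source_file,
                     const char *name, const char *value)
{
    assert(source_file != NULL);
    for (size_t i = 0; i < sizeof restore_table / sizeof restore_table[0]; i++)
        if (strcmp(restore_table[i].source_file, source_file) == 0)
            return restore_table[i].restore(name, value);
    return -1;
}
```

Since entries for different files touch different variables, sections could be handed to separate tasks, which is consistent with the four-task arrangement described above.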
The Save/Restore Code Generator Apparatus (Courtesy of Rube Goldberg)
Run from within the makefile, which calls external scripts & tools:
• (Step 1) Create a redirect.c (includes the file to be compiled); output: redirect.c
• (Steps 2 & 10) Preprocess to get one big file; output: Preprocessed file 1
• (Steps 3 & 4, 11 & 12) Parse: replace and rename function-static variables with globals; output: Preprocessed file 3
• (Step 5) Compile and dump the translation unit; output: Translation Unit
• (Step 6) Parse the TU, find all global data, get types, unroll all nests, get leaf type info; output: Global data file
• (Step 7) Remove entries specified in the prebuilt removals list; output: Global data file
• (Steps 8 & 9) Code Generator; output: <filename>_save_restore.c
• (Step 13) Compile, this time for the object file; output: Object file with save/restore functions
Hybrid Approach
• Considering using both the COTS and custom approaches
  - COTS process checkpointing could be used on a daily basis
  - Custom approach could be used to “dump state” to:
    - View it (human readable)
    - Carry state across minor releases of FSW
• Programmatically generate a core dump file during a save operation.
  - Can be used as a back-up if the COTS product fails, or if the custom approach does not contain all of the state information.
  - The core dump can be examined using gdb.
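One common way to produce a core dump without stopping the running process is to fork and have the child abort. The sketch below shows that technique as an assumption; the source does not say how the dump is generated, and `dump_core_snapshot` is a hypothetical name. Whether a core file is actually written depends on system settings such as the core size limit, and the forked child contains only the calling thread.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a copy of the process and abort the copy, so the kernel may write
 * a core file while the parent keeps running. Returns 0 if the child was
 * terminated by SIGABRT, -1 otherwise. */
int dump_core_snapshot(void)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        abort();            /* child: SIGABRT, default action may dump core */

    assert(pid > 0);        /* parent from here on */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return (WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT) ? 0 : -1;
}
```

The resulting core file, if one is produced, can be opened with gdb together with the simulator executable.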
Thank You
Comments / Questions