7th Computers in Aerospace Conference, Monterey, CA, U.S.A., 3-5 October 1989

VALIDATION OF REAL-TIME, DISTRIBUTED SOFTWARE

Abstract

Carolyn B. Boettcher, Scientist/Engineer
Alice H. Muntz, Scientist/Engineer

Hughes Aircraft Company, PO Box 92326

Los Angeles, CA 90009, Bldg. R11, M.S. 11001

(mail) ahmuntz@ h:lc2arpn,hac.com

Validation of software that executes in parallel on distributed processors introduces new problems not addressed by traditional testing tools. These problems are exacerbated when the software under test must execute within hard real-time deadlines and is embedded within a larger system such as an avionics system. This paper describes a methodology and set of tools that is currently being used to monitor the distributed processing, to validate the correctness of timing and synchronization, and to analyze the resource utilization of radar application software that executes in a distributed processing environment. The set of validation tools is applied starting at the top level design phase and continuing through the integration test phase of the software development cycle. The methodology is generally applicable to tightly coupled multi-processor environments which include both general purpose and special purpose processors.

I. Introduction

A wide range of current and future military avionics applications (e.g., tactical and strategic radar, electronic warfare, and sensor fusion) need to adapt and respond to changes in the operational environment within a time deadline and to provide complex functionality, high performance, and high reliability. The software in such systems is distributed and consists of a collection of programs which execute in a cooperative fashion (i.e., partially synchronized) toward a common goal. The execution of the distributed software can be characterized as follows:

1. Multiple Asynchronous Processes: Synchronization points may exist not only between two successive instructions within a single process (pipelined execution), but also among different processes. The fact that the distributed system has multiple threads of control makes it difficult to predict the runtime behavior of the distributed software.

2. Multiple Distributed Processors: The multiple asynchronous processes run on physically distributed processors. Even though these processors may be of the same type, it is impossible to guarantee that they run at the same speed. In a real-time system, the occurrences of events are frequently synchronized at fixed time intervals. Identifying timing problems caused by hardware speed is difficult even in a single-processor system; multiple processors intensify the difficulty of locating such problems.

3. Nondeterministic Behavior: The behavior of a process is determined not only by its internal state, but also by information derived from communicating with other processes. If one process is slower than the others, it influences their execution. Even though the instructions in the programs are deterministic, factors such as I/O access time and external events are nondeterministic, so the timing relationship among the processes is not deterministic. When programs are executed on a distributed system, it is impossible to deduce, a priori, the execution order of the instruction sequences for all processes (a small illustration follows this list).

4. Shared Resources: The multiple processors may share a common bus, memory, or other resources. Moreover, the sequence of requests for the shared resources may not correspond to the actual sequence in which accesses are granted. The nondeterministic sequence of resource allocation intensifies the nondeterministic nature of the distributed processes.

Copyright © 1989 by Hughes Aircraft Co. Published by the American Institute of Aeronautics and Astronautics with permission.

5. Distributed Control Mechanisms: Distributed control decisions are more prone to errors. The errors are in many cases subtle or sporadic, caused by improper synchronization among the processes (race conditions) or by violations of time deadlines. As an added complication, the results observed in the system are often hard to reproduce, because the computation depends not only on the system input, but also on the relative timing of the processes and processors.

6. Delayed Propagation of Exception Conditions: The propagation of an exception from one processor to all processors can only be "as soon as possible," with no guarantee of how long it will take. By the time all processors are informed, the critical information needed to determine the cause of the exception condition may have been destroyed.
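The nondeterminism described in item 3 can be reproduced in a few lines. The sketch below is our illustration, not from the paper: two threads stand in for asynchronous processes, random delays model I/O access times and external events, and the traced event order differs from run to run.

```python
# Illustrative sketch (ours, not the paper's): two threads stand in for
# asynchronous processes; random delays model I/O access times and
# external events. The traced event order differs from run to run.
import random
import threading
import time

events = []
lock = threading.Lock()

def process(name):
    for step in range(3):
        time.sleep(random.uniform(0, 0.01))  # variable I/O and external-event delays
        with lock:                           # serialize access to the shared trace
            events.append(f"{name}:{step}")

threads = [threading.Thread(target=process, args=(name,)) for name in ("P1", "P2")]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(events)  # a different interleaving on different runs
```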

The increased functional complexity of avionics software (which stems from the application requirements and the employment of multiple processing elements) introduces new problems not addressed by traditional testing tools. These problems are exacerbated when the software under test must satisfy adaptability and critical timing constraints and is embedded within a larger system. A methodology and a set of tools have been developed at Hughes Aircraft Company to monitor the distributed processing, to validate the correctness of timing and synchronization, and to analyze the resource utilization of radar application software that executes in a distributed processing environment.

The set of validation tools is applied starting at the top level design phase and continuing through the integration test phase of the software development cycle. The existing tool set is targeted to a distributed processing architecture, as illustrated in Figure 1, that includes both a dual-CPU data processor and an array of programmable signal processing elements which share a global memory and are controlled by a single-CPU data processor.

Figure 1: Distributed Processing Architecture (dual-CPU data processor, I/O devices, controller, shared memory, and a multi-element signal processor)

However, the methodology is generally applicable to tightly coupled distributed processor environments which include both general purpose and special purpose processors.

In the next section, we will describe a validation methodology which has been demonstrated to be very successful on several avionic radar programs. In Section 3, we will describe the tool set which was designed to support our validation methodology. We will describe how the tools are used with high level and instruction level simulators, as well as how they are used with the embedded processors while they are executing either in a controlled, real-time, simulated environment or in flight test. The tools include an editor for describing the software as a directed graph and special purpose hardware and software for capturing the data in real-time. Also included in the tools is data reduction software executing on an engineering workstation that can be used to create either interactive, animated, graphic displays of the software as it consumes processing and memory resources or hardcopy validation reports.


Section 4 contains a summary and description of ongoing work.

II. Validation Methodology

In non-distributed systems we can always assume that the global status of the system is easily accessible, but in distributed architectures the global status must be constructed by assembling information collected from all processors. This is not a trivial task, because events must be collected by communicating with every processor in the system. There are at least two problems with this approach: 1) the clocks on the processors cannot be completely synchronized, and 2) message transmission may be delayed. Thus, although software test and validation in a distributed real-time system has much in common with performing these functions in a non-distributed real-time system, the multiple asynchronous processes, multiple processors, nondeterministic behavior, and shared resources make performing these functions more difficult and challenging. It is therefore worthwhile to develop special strategies for finding bugs in distributed systems and to create a controllable environment which supports those strategies.
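To make the assembly of global status concrete, the following is a minimal sketch (our illustration; the trace layout and names are assumptions) that merges per-processor event lists into one approximate global history ordered by local timestamp. Both problems noted above survive the merge: events closer together than the worst-case clock skew cannot be reliably ordered, and a delayed message may be logged late.

```python
# Minimal sketch (layout and names are assumptions): merge per-processor
# traces, each ordered by its own local clock, into one approximate
# global event history.
import heapq

def merge_traces(traces):
    """traces: {processor_id: [(local_time, event), ...]}, each list
    already in local-time order. Returns one merged history."""
    streams = (
        ((t, pid, ev) for (t, ev) in trace)
        for pid, trace in traces.items()
    )
    return list(heapq.merge(*streams))

history = merge_traces({
    "PE0": [(374.1, "job A start"), (374.9, "job A end")],
    "PE1": [(374.3, "job B start"), (375.2, "job B end")],
})
# Caveat: events separated by less than the worst-case clock skew cannot
# be reliably ordered, and a delayed message may be logged late.
print(history)
```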

In the following, we describe a methodology for validating distributed software that is based on the traditional bottom-up approach to test and validation. This approach is widely used as a proven, effective method for debugging a complex software system. That is, each sequential program is first thoroughly tested to check the correctness of its program logic using traditional testing methods. Then the distributed and asynchronous programs are exercised together to verify interfaces, timing, communication, and synchronization at the system level. The system is exercised repeatedly to uncover as many bugs as possible. The last step of testing is fine-tuning the system by conducting performance measurements. We have developed a specialized set of tools for performing this system level validation.

Three-Level Debugging:

The classical bottom-up approach is the motivation for tool designers to categorize debugging into three levels. Each level provides programmers with different views of how programs are executed. This three-level debugging is outlined as follows:

1. The Processor Level: to examine (and change) information in registers, memories, the PSW (program status word), I/O ports, etc. The processor level debugger does not help the programmer interpret information. Low level software such as the operating system or I/O drivers is usually debugged at this level.

2. The Process Level: to symbolically set breakpoints and traces in source programs, and to examine the variables, status, and traces of processes. Although the debugging information obtained at this level is closer to the programmer's conceptual view of how the processes are built from the program modules, it still does not provide information about how the processes on the multiple processors interact with each other, or about the overall performance of the system. To derive this system level information, the programmer must manually collect information from different executions of each process, correlate it, and synthesize the correlated information.

3. The System Level: to directly observe the interactions among processes, the timing constraints of each process, and the overall system performance. Trace data containing snapshots of the system's activities are usually used to construct a behavior abstraction model. Examining the model exposes the processes' interactions, the timing constraints, and the system's performance. Frequently, traces are the only way to discover the subtle synchronization and timing problems which commonly occur in distributed real-time systems.

Since each level of debugging has its strengths and weaknesses, an ideal test environment should provide tools to support all three levels of debugging. Further, this environment should allow the programmer to switch among the levels easily. In this paper, we will focus on system level debugging, in which the trace (i.e., history) recorded during execution of the software can be used for verifying the validity of execution behavior, discovering system bottlenecks and anomalies, and observing the throughput of the system.

Methods for Producing Traces

"Tracing" produces a history of events recorded without halting the execution of software. Each entry in the trace represents an event in the system and the system state when that event occurred.. From the trace data, the programmer can determine the sequence of events that occurred at runtime. Capturing detailed trace statistics consumes massive amounts of resources (CPU, 10, and storage.) Furthermore, the programmer must have a thorough understanding of the entire system to be able to analyze the trace data to extract the "important" and "significant" activities in the system from the masses of recorded data. Nevertheless, when subtle synchronization and timing problems occur, tracing has proven to be the most successful technique for determining their cause.

Currently at Hughes, to collect trace data in real time from the embedded processors we employ a dedicated hardware unit. To buffer and store the trace data, we developed dedicated real-time software executing on a general purpose host computer. To analyze the trace data and display it in an easy-to-understand graphical form, we developed an interactive Graphic Display System (GDS) which is hosted on a SUN 3 engineering workstation.

Three methods of specifying trace data are employed:

1. Each user selects events for the dedicated hardware to monitor.

2. The operating system keeps track of a set of commonly occurring, predefined events.

3. The dedicated hardware monitors a set of commonly occurring, predefined events.

Using the first method to set up events requires an in-depth understanding of the application code. On the other hand, the trace data produced by the latter two methods represents an application-independent abstraction of the execution behavior of the distributed system. We defined a behavioral abstraction language for describing a set of events and the sequencing relations among those events, which is used to interpret the application-independent trace data. In particular, in a distributed system there is a set of commonly occurring events which are critical to understanding the system's execution (e.g., dispatching a job for execution).

Event                    Event Types                          Parameters
PE Jobs                  0 = job start, 1 = job end,          Job ID
                         2 = segment start, 3 = segment end
Data Controller          2 = type 2 job start                 ...
Interrupts               interrupt types                      interrupt parameters
Mode Change              mode type                            table data
Bulk Memory Allocation   ...                                  ...

Table 1: An Example Set of Traced Events

Table 1 gives an example set of important events currently defined for signal processing. For each of these events, the trace file contains some standard parameters and some event-specific parameters.
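As an illustration of what such trace entries might look like, the sketch below uses the PE job event types from Table 1 to recover per-job execution times from a trace. The field names and record layout are our assumptions; the paper says only that each entry carries standard parameters plus event-specific ones.

```python
# Sketch of a trace record for the PE job events in Table 1; field names
# and layout are our assumptions.
from dataclasses import dataclass

JOB_START, JOB_END, SEGMENT_START, SEGMENT_END = 0, 1, 2, 3  # per Table 1

@dataclass
class TraceEvent:
    timestamp: float   # standard parameter: when the event occurred
    processor: str     # standard parameter: PE or AC that logged it
    event_type: int    # 0-3, per Table 1
    job_id: int        # event-specific parameter for PE job events

def job_durations(trace):
    """Pair job-start and job-end events to recover execution times."""
    starts, durations = {}, {}
    for ev in trace:
        if ev.event_type == JOB_START:
            starts[(ev.processor, ev.job_id)] = ev.timestamp
        elif ev.event_type == JOB_END:
            t0 = starts.pop((ev.processor, ev.job_id), None)
            if t0 is not None:
                durations.setdefault(ev.job_id, []).append(ev.timestamp - t0)
    return durations
```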

To facilitate understanding of the trace data, a "monitor" of the events is provided. The monitor provides the programmer with an overall picture of the execution of the system. This overall picture graphically presents the information in terms of a conceptual model of the system behavior which is derived from the design specification. Automatic detection and presentation of anomalies are essential capabilities for a system level debugger. From the graphical display, the programmer can immediately see anomalies such as timeline overruns or processes out of synchronization, can easily discover system bottlenecks, and can review details of the trace at critical points.


Methods for Detecting Anomalies


The intended model of the system is constructed by the programmer from design specifications of the relationships among processes, interrupts, resource utilization, etc., and of the timing and concurrency constraints which must be met. The execution model of the system is automatically constructed by the test environment from the trace database. To detect anomalies automatically, the intended model is compared against the execution model; if the two models differ, anomalies have occurred. There are two basic approaches to modeling for automatic anomaly detection:

1. Define all the possible anomalies, and look for these anomalies in the trace database, and

2. Define all the normal behavior of the software. When the execution of the software deviates from the specification, an anomaly is reported.

The first approach resembles what experienced programmers typically do in a debugging session when selecting events to be monitored. Based on a priori knowledge and experience, the programmer sets up an anomaly database. During the debugging session, when a sequence of events matches an entry in the database, anomaly detection is triggered. If system errors are detected but no matches are found, either the database must be expanded to include additional anomaly conditions, or the second approach must be adopted.
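A minimal sketch of this first approach (the event names and database layout are hypothetical): known-bad event sequences are stored in an anomaly database, and detection fires wherever a stored sequence matches the traced event stream.

```python
# Hypothetical sketch of the anomaly-database approach: detection fires
# wherever a stored event sequence matches the traced event stream.
def detect_anomalies(trace, anomaly_db):
    """trace: list of event names; anomaly_db: {anomaly_name: [event, ...]}."""
    hits = []
    for i in range(len(trace)):
        for name, pattern in anomaly_db.items():
            if trace[i:i + len(pattern)] == pattern:
                hits.append((i, name))
    return hits

# Example entry: a job dispatched twice with no intervening completion.
db = {"double dispatch": ["dispatch A", "dispatch A"]}
print(detect_anomalies(["dispatch A", "dispatch A", "job A end"], db))
# -> [(0, 'double dispatch')]
```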

There is a "closed-world" assumption associated with the first approach, in the sense that only a fixed number of anomalies can be known to the anomaly detection mechanism. This is sufficient if the database can contain all possible anomaly conditions. However, as experienced programmers know, software anomalies come in strange flavors; it's virtually impossible to itemize all potential anomalies.

The second approach avoids the "closed-world" assumption, but is suitable only where the operation of the target system is well defined, so that the behavior of the system can be modelled concisely. Fortunately, avionic radar signal processing satisfies this criterion, since it is characterized by repetitive performance of a set of pre-specified functions. The normal behavior of signal processing software can be described by our graph methodology, in which partially ordered graphs are used to describe the processing behavior. In this methodology, a graph is defined as a collection of jobs and the relationships they have with one another. The job graph adapts the concept of a task graph from [1]. Figure 2 illustrates an example signal processing graph which includes six jobs. Directed arcs between the jobs represent the data flow relationships between jobs: the job at the tail of an arc must complete before the job at the head can execute. A job is defined as a set of functions, or nodes, which execute sequentially in a single PE or in the Array Controller shown in Figure 1. A job is the smallest dispatchable entity. A node is a signal processing function which must be performed sequentially (e.g., an FFT on one range bin, or pulse compression on one pulse repetition interval).
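The job graph definition suggests a simple automatic check for the second approach. The sketch below (the representation is ours) encodes a graph as predecessor sets and flags a traced job-start order in which a job begins before all of its predecessors; for brevity, the sketch assumes a predecessor's end event precedes the next start event in the trace.

```python
# Sketch (representation ours): a job graph as predecessor sets, checked
# against a traced job-start order. For brevity, this assumes each job's
# end event precedes the next start event in the trace.
def violates_precedence(predecessors, start_order):
    """predecessors: {job: set of jobs that must complete first};
    start_order: jobs in the order their start events were traced."""
    finished = set()
    for job in start_order:
        missing = predecessors[job] - finished
        if missing:
            return job, missing  # anomaly: job started before these predecessors
        finished.add(job)
    return None

graph = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
print(violates_precedence(graph, ["A", "B", "D", "C"]))  # ('D', {'C'})
```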

Figure 2: An Example Job Graph

III. Integrated Tool Set

The existing Hughes graph methodology tool set supports validation of distributed signal processing software whose normal behavior has been specified in the form of a directed graph. One of the tools, the Resource Monitor, also supports validation of data processing software tasks. As shown in Figure 3, the existing tool set includes the following:


Figure 3: Graph Methodology Integrated Tool Set (graph editor, initial scheduler, system level simulator, trace databases, and the embedded processors)

1. Graph Editor for describing the timing and sequencing relationships which must be preserved between distributed, asynchronous software jobs.

2. Initial scheduler to automatically schedule the partially ordered software graphs on the multi-processor architecture and predict their execution time and processor utilization for nominal cases.

3. Monte Carlo simulator for system level performance analysis based on expected, data-dependent variations in the time-to-complete signal processing graphs.

4. Resource monitor for graphical display of system timing and synchronization information and for validation of the actual timing behavior against the specified precedence relationships and temporal constraints. The trace data processed by the resource monitor may be obtained from the system level simulation, from non-real-time, instruction level simulation, from real-time execution on the flight processors driven by a simulated environment, or from flight test in the operational environment.

Graph Editor

The Graph Editor tool allows a user to create acyclic graphs, as described in Section II, which represent the parallel signal processing algorithms required for each radar sensor mode, such as high pulse repetition frequency track, velocity search, etc. Several graphs may be used to represent a single sensor mode. The Graph Editor provides an interactive editing capability to name jobs, define each job's predecessors, define the expected execution time of each job, and, optionally, assign jobs to signal processing elements in the distributed architecture.

A directed graph representing the user's input is automatically drawn in a separate window. It graphically illustrates the precedence order, processing element assignment, and timeline for each job. While viewing the graphical representation, the user can edit the text describing the graph. The graphical representation is then automatically redrawn so that the user can easily see the changes in the graph precedences and timeline. An example textual editing window and graphical output window are illustrated in Figure 4.

Initial Scheduler

Each job in a graph has associated with it a timing estimate and a partially ordered precedence relationship. In addition, the graph has an overall elapsed time that must be met by the schedule. The Initial Scheduler (IS) tool produces a list schedule and assigns jobs to processing elements so that all precedence constraints are satisfied and the elapsed time to complete the graph processing is acceptable.

Since the precedence relationship of jobs is only partially ordered, additional criteria must be used to decide which job to schedule when several jobs simultaneously have all their predecessors satisfied. It is well known that determining a list schedule that minimizes processing time is an NP-complete problem [1][3]. However, a number of heuristic algorithms are known to yield sub-optimal schedules in linear time. The IS tool allows the user to try any of four heuristic scheduling algorithms to choose the one yielding the best results. A completely ordered list schedule and an assignment of jobs to PEs are produced, with the results displayed to the user as shown by the timeline inset in Figure 4. One of the scheduling algorithms, Largest Descendant First Weight Determination (LDFWD) [2], is illustrated in Figure 5. Using the LDFWD algorithm, the priority, or "weight", of a job is computed as the sum of the "weights" of all its immediate successors plus its own estimated execution time. The "weight" of a terminal job (i.e., a job with no successor jobs) is its estimated execution time. Ties are resolved arbitrarily to arrange the jobs in a list.

Figure 4: Graph Editor Screen Showing Graph Command Language Inputs (upper large window) and the Generated Graph (lower window)

The user must enter the expected execution time for each job through the Graph Editor before running the Initial Scheduler. Then a list schedule is determined by calculating the job weights according to the algorithm selected by the user. Next, the jobs are allocated to PEs by assigning the highest priority ready job (i.e., a job whose predecessors have all completed) to the first available PE. Then the elapsed time required to execute the graph is computed, as also shown in Figure 5.
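A compact sketch of LDFWD weight assignment and the list-scheduling step described above follows; the data structures and function names are ours, and ties are broken by sort order rather than arbitrarily.

```python
# Sketch of LDFWD weights and list scheduling (data structures and names
# are ours; ties are broken by sort order rather than arbitrarily).
import heapq

def ldfwd_weights(exec_time, successors):
    """weight(job) = own estimated time + weights of immediate successors."""
    weights = {}
    def weight(job):
        if job not in weights:
            weights[job] = exec_time[job] + sum(weight(s) for s in successors[job])
        return weights[job]
    for job in exec_time:
        weight(job)
    return weights

def list_schedule(exec_time, successors, num_pes):
    preds = {job: set() for job in exec_time}
    for job, succs in successors.items():
        for s in succs:
            preds[s].add(job)
    weights = ldfwd_weights(exec_time, successors)
    order = sorted(exec_time, key=lambda j: -weights[j])  # the list schedule
    free_at = [(0.0, pe) for pe in range(num_pes)]        # (time PE frees up, PE id)
    heapq.heapify(free_at)
    done_at, assignment = {}, []
    while order:
        # highest-priority job whose predecessors have all completed
        job = next(j for j in order if preds[j] <= done_at.keys())
        order.remove(job)
        pe_time, pe = heapq.heappop(free_at)              # first available PE
        start = max([pe_time] + [done_at[p] for p in preds[job]])
        done_at[job] = start + exec_time[job]
        heapq.heappush(free_at, (done_at[job], pe))
        assignment.append((job, pe, start, done_at[job]))
    return assignment, max(done_at.values())              # schedule, elapsed time

times = {"A": 2.0, "B": 3.0, "C": 1.0, "D": 2.0}
succ = {"A": {"B", "C"}, "B": {"D"}, "C": {"D"}, "D": set()}
schedule, elapsed = list_schedule(times, succ, num_pes=2)
print(schedule, elapsed)
```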

System Level Simulator

The System Level Simulator (SLS) performs a high level timing simulation of one or more graphs executing according to a user-input scenario. A predicted execution time or a range of execution times for each job must be supplied by the user. Where a range of execution times is supplied, a uniformly distributed random number is used to compute an actual execution time whenever the job is executed in the simulation.


Figure 5: The Job Graph (upper left corner) Can Be Scheduled According to the Schedule Chart (lower left corner), as Determined by the Largest Descendant First Weight Determination Algorithm. The "weight" of a job is the sum of the "weights" of all its successors plus its own execution time; the steps are (1) assign "weights", (2) arrange the jobs in a list, and (3) allocate the jobs to PEs A, B, and C.



Prior to executing the SLS, a list schedule must have been produced for the graph, either automatically by using the IS tool or manually by the user. The SLS is most useful during the software design phase, or during maintenance when changes must be made to the graphs. By using the SLS tool, potential timing problems can be resolved for data-dependent processing which cannot be adequately addressed by static scheduling alone.
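A minimal sketch of the SLS timing model (ours, not Hughes code): where the user supplies a range of execution times, each simulated execution of a job draws a uniformly distributed time from that range. For brevity the sketch models only precedence constraints, not contention for processing elements.

```python
# Sketch of the SLS timing model (ours, not Hughes code): jobs with an
# execution-time range draw a uniform random time on each simulated run.
# Models precedence only; PE contention is ignored for brevity.
import random

def simulate_graph(jobs, exec_range, preds, trials=1000):
    """jobs: jobs in a valid (predecessor-respecting) order;
    exec_range: {job: (min_time, max_time)}; preds: {job: set of preds}."""
    elapsed = []
    for _ in range(trials):
        done = {}
        for job in jobs:
            start = max((done[p] for p in preds[job]), default=0.0)
            lo, hi = exec_range[job]
            done[job] = start + random.uniform(lo, hi)
        elapsed.append(max(done.values()))
    return elapsed

times = simulate_graph(
    ["A", "B", "C", "D"],
    {"A": (1, 2), "B": (2, 4), "C": (1, 3), "D": (1, 1)},
    {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}},
)
print(min(times), max(times))  # spread of data-dependent completion times
```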


Resource Monitor

The Resource Monitor (RM) graphically displays the execution status of signal processing graphs and data processing tasks executing in the distributed processing architecture. It monitors resource utilization, inter-processor communication, and process execution time. This information can be used to determine whether there are any discrepancies between the actual behavior of the software running on the distributed architecture and the behavior of the system predicted by a priori or simulated methods. The RM significantly reduces the time required for testing and integration of distributed software.

Special code in the real-time operating system, which controls the execution of both software tasks and the signal processing graphs on the distributed architecture, is used to generate the data for the RM. The overhead of this special code has been shown to be insignificant (on the order of 1%). However, the special OS code is selected via a compile-time switch, so that operational versions of the OS do not incur this overhead.


Figure 6: Resource Monitor Screen Showing Job Execution Sequence, Time Line, and Synchronization (timelines for PE0 through PE3, with trace selection, watchpoint, frame control, and watch select menus)


In the signal processor, the RM data is sent over a central communication bus which is part of the operational hardware and is collected in real time through a special port on the bus. Since the bus is generally lightly loaded, the extra data has minimal impact on the timeline. During ground-based testing of the system, special hardware monitors the bus to choose selected data for collection; it is connected to a general purpose computer which buffers and stores the RM data to disk. During flight test, the special port is connected to a flight tape which stores the data. In the data processor, the RM data is collected by the operating system in a smart memory that employs cycle stealing to output the data to an interface control unit with minimal impact on the timeline.

The RM software, which executes as part of the Graphical Display System on a graphical engineering workstation (currently a SUN 3 system), analyzes and displays the data generated by the real-time software. As illustrated in Figure 6, jobs are shown in the order in which they executed by placing them on the timeline for the PE or AC to which they were assigned. The extent of a job along the horizontal axis indicates the duration of its execution; idle time for a PE is shown by the extent of the horizontal axis with no jobs.

The RM's graphical interface allows the user to scroll backward or forward along the timeline by clicking mouse buttons on left and right arrow symbols. Within the timeline area of the screen, the mouse is represented by a cross-hair cursor. The user can click one of three mouse buttons with the cross-hair at the beginning of a job to mark a time; as the mouse is moved horizontally, the time from the mark to the cursor is displayed, allowing the exact execution time for a job to be easily obtained. A second mouse button can be clicked on any job to obtain a full textual description of the job.


Capabilities exist to jump forward or backward in the timeline by specifying a time, to adjust the display speed of the timeline as it scrolls past, and to set the scale of the horizontal time axis so that more jobs can be seen on one screen or so that the user can zoom in on a selected segment of the timeline. Statistical windows can be selected to show the operating mode of the system, the graph IDs, and the job IDs for the activities during a specified time period. Statistics include the minimum, maximum, and average execution times for jobs, the number of times each job ran, and the total time and percentage each job required for its executions. These statistics can be calculated for a given PE or the AC, or for all PEs, on a per-job or per-graph basis over a specified time interval.
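The per-job statistics described here reduce to a small computation over the traced job records. A sketch (the field layout and names are our assumptions):

```python
# Sketch of the Resource Monitor's per-job statistics (field layout and
# names are our assumptions).
def job_statistics(runs, interval):
    """runs: list of (job_id, start, end) inside the displayed window;
    interval: (t0, t1), the window over which percentages are computed."""
    t0, t1 = interval
    by_job = {}
    for job, start, end in runs:
        by_job.setdefault(job, []).append(end - start)
    report = {}
    for job, times in by_job.items():
        total = sum(times)
        report[job] = {
            "count": len(times),
            "min": min(times),
            "max": max(times),
            "avg": total / len(times),
            "total": total,
            "percent": 100.0 * total / (t1 - t0),
        }
    return report

print(job_statistics([("FFT", 0.0, 2.5), ("FFT", 4.0, 6.0)], (0.0, 10.0)))
```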

IV. Summary

System level resource monitoring of signal and data processing has proven to be extremely valuable during test and integration of real-time radar software that executes on distributed processors. In addition, validation that actual system execution behavior corresponds to an intended system behavior model has been achieved for signal processing by representing its intended behavior as a directed graph.

In currently operational distributed avionic processing, software is statically allocated to processors. In the future, dynamic allocation of software to processors will be employed to achieve improved processor utilization, fault tolerance, load balancing, and adaptation of processing to the tactical situation. We have already demonstrated that dynamic allocation of signal processing software is feasible for our current systems. The RM software proved invaluable in validating the timing and synchronization of software which employs dynamic allocation and in verifying the improved processor utilization and load balancing.

Our current research is exploring new validation techniques for modeling the intended behavior of data processing software. Modeling the intended behavior of data processing software is considerably more complex than modeling signal processing, because the data processing is data-dependent and pre-emptive and has periodic, sporadic, and adaptive tasks [4].

In addition to exploring methods for modeling software behavior, we are also extending our graph methodology into the software requirements phase of the software life cycle. Our goal is to develop an integrated tool set that (a) supports automated validation of the software specification at each phase of the life cycle, and (b) helps to automate the analysis of the feasibility of meeting performance requirements with specified software executing on a particular hardware architecture.

REFERENCES

[1] E. G. Coffman and P. J. Denning, Operating System Theory, Chapter 3, pp. 83-144, Prentice Hall, 1973.

[2] H. F. Li, "Scheduling Trees in Parallel Pipelined Processing Environments", IEEE Transactions on Computers, November 1977.

[3] S. Sahni, "Scheduling Multipipeline and Multiprocessor Computers", IEEE Transactions on Computers, May 1984.

[4] A. Muntz, "A Framework for Specification and Design of Software for Advanced Sensor Systems", Hughes Aircraft Co. Technical Report RSG.SEL233002/1452, El Segundo, Cal.

