
The RTES Project – BTeV, and Beyond

Michael J. Haney¹

Shikha Ahuja², Ted Bapty², Harry Cheung³, Zbigniew Kalbarczyk⁴, Akhilesh Khanna⁴, Jim Kowalkowski³, Derek Messie⁵, Daniel Mossé⁶, Sandeep Neema², Steve Nordstrom², Jae Oh⁵, Paul Sheldon⁷, Shweta Shetty², Dmitri Volper⁵, Long Wang⁴, Di Yao²

¹High Energy Physics, University of Illinois, 1110 W. Green Street, Urbana, IL 61801 USA
²Institute for Software Integrated Systems, Vanderbilt University, Nashville, TN 37235 USA
³Fermi National Accelerator Laboratory, Batavia, IL 60510 USA
⁴Electrical and Computer Science, University of Illinois, Urbana, IL 61801 USA
⁵Electrical Engineering and Computer Science, Syracuse University, Syracuse, NY 13244 USA
⁶Computer Science, University of Pittsburgh, Pittsburgh, PA 15250 USA
⁷Physics and Astronomy Department, Vanderbilt University, Nashville, TN 37235 USA

M. Haney; RT 2005 The RTES Project - BTeV, and Beyond

Outline

• Real Time Embedded System Project
  – BTeV => RTES
• Prototypes
  – SuperComputing 2003
  – Demo System 2004
• Beyond BTeV

BTeV - High Energy Physics

• Input: 500 GB/s (2.5 MHz)
• Level 1 processing: 190 μs (factor of 50 event reduction)
  – one beam crossing every 396 ns
  – 528 “8 GHz” G5 CPUs
  – high performance interconnects
• Level 2/3 processing: 5+135 ms (factor of 10+2 event reduction)
  – 1536 “12 GHz” CPUs, commodity networking
• Output: 200 MB/s (4 kHz) = 1-2 Petabytes/year
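The rates above can be cross-checked with simple arithmetic. The sketch below derives the input size per crossing, the crossing period implied by 2.5 MHz, the Level 1 accept rate, the output size per event, and the yearly volume. The live time per year is an assumption (a common rule of thumb), not a figure from the slide.

```python
# Back-of-envelope check of the BTeV trigger figures quoted on the slide.
# ASSUMPTION: 1e7 seconds of live time per year (rule of thumb, not on the slide).

INPUT_RATE_BYTES_PER_S = 500e9    # 500 GB/s into Level 1
CROSSING_RATE_HZ = 2.5e6          # 2.5 MHz
L1_REDUCTION = 50                 # Level 1 event-reduction factor
OUTPUT_RATE_BYTES_PER_S = 200e6   # 200 MB/s to storage
OUTPUT_EVENT_RATE_HZ = 4e3        # 4 kHz
LIVE_SECONDS_PER_YEAR = 1e7       # assumed

bytes_per_crossing = INPUT_RATE_BYTES_PER_S / CROSSING_RATE_HZ           # 200 kB
crossing_period_ns = 1e9 / CROSSING_RATE_HZ                              # 400 ns (nominal 396 ns)
l1_accept_rate_hz = CROSSING_RATE_HZ / L1_REDUCTION                      # 50 kHz into L2/3
bytes_per_output_event = OUTPUT_RATE_BYTES_PER_S / OUTPUT_EVENT_RATE_HZ  # 50 kB
petabytes_per_year = OUTPUT_RATE_BYTES_PER_S * LIVE_SECONDS_PER_YEAR / 1e15

print(f"input per crossing: {bytes_per_crossing / 1e3:.0f} kB")
print(f"crossing period:    {crossing_period_ns:.0f} ns")
print(f"L1 accept rate:     {l1_accept_rate_hz / 1e3:.0f} kHz")
print(f"output per event:   {bytes_per_output_event / 1e3:.0f} kB")
print(f"output per year:    {petabytes_per_year:.1f} PB")
```

The nominal 396 ns crossing interval corresponds to about 2.53 MHz, which the slide rounds to 2.5 MHz. With 1e7 live seconds, 200 MB/s gives 2 PB, the upper end of the quoted 1-2 PB/year; a smaller live fraction lands toward the lower end.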

BTeV’s Need

• “Given the very complex nature of this system where thousands of events are simultaneously and asynchronously cooking, issues of data integrity, robustness, and monitoring are critically important and have the capacity to cripple a design if not dealt with at the outset… BTeV [needs to] supply the necessary level of ‘self-awareness’ in the trigger system.”
  – [June 2000 Project Review]

Thus, RTES

• The Real Time Embedded System (RTES) Group
  – University of Illinois
  – University of Pittsburgh
  – Syracuse University
  – Vanderbilt University (PI)
  – Fermilab
• Physicists and computer scientists/electrical engineers with expertise in
  – High performance, real-time system software and hardware
  – Reliability and fault tolerance
  – System specification, generation, and modeling tools
• NSF ITR grant ACI-0121658

[Figure: RTES architecture. Design and Analysis: system models (algorithm, fault behavior, resources), synthesis, performance simulation, diagnosability analysis, reliability analysis, and an experiment control interface, with modeling feedback from the runtime. Runtime: a Global Operations Manager and Global Fault Manager above Region Operations and Region Fault Managers; each node runs a Local Operations Manager, a Local Fault Manager, and trigger algorithms, on ARMOR/RTOS for the L1/DSP nodes and on ARMOR/Linux for the L2,3 CISC/RISC nodes, linked by logical control and data networks. Real-time requirements span soft to hard; reconfiguration behavior is also modeled.]

The RTES Solution

• Model Integrated Computing
  – Graphical representation of a complex system, with modeling (simulation) resources
• ARMORs
  – To protect Linux processes
    • and subprocessors
• VLAs
  – To monitor/mitigate at every level
    • embedded, supervisory Linux, Linux trigger farm, etc.
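The runtime half of the architecture pairs operations and fault managers at local, regional, and global level. A minimal sketch of one way such a hierarchy could route a fault to the lowest level allowed to mitigate it, escalating otherwise, follows. The class names, fault kinds, and per-level policies are hypothetical, not taken from the RTES code.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class Fault:
    node: str
    kind: str   # hypothetical fault categories, e.g. "process_crash"

@dataclass
class FaultManager:
    """One level of a local -> regional -> global fault-manager hierarchy (hypothetical)."""
    name: str
    handles: Set[str]                        # fault kinds this level may mitigate itself
    parent: Optional["FaultManager"] = None
    log: List[str] = field(default_factory=list)

    def report(self, fault: Fault) -> str:
        if fault.kind in self.handles:
            action = f"{self.name} mitigated {fault.kind} on {fault.node}"
        elif self.parent is not None:
            self.log.append(f"{self.name} escalated {fault.kind}")
            return self.parent.report(fault)     # let the next level up decide
        else:
            # Top of the hierarchy and still unhandled: hand over to a human operator.
            action = f"{self.name} raised {fault.kind} on {fault.node} to operator"
        self.log.append(action)
        return action

global_fm = FaultManager("global", {"region_unreachable"})
region_fm = FaultManager("region1", {"repeated_crash"}, parent=global_fm)
local_fm = FaultManager("local1.1", {"process_crash"}, parent=region_fm)

r1 = local_fm.report(Fault("worker1.1", "process_crash"))   # fixed locally
r2 = local_fm.report(Fault("worker1.1", "repeated_crash"))  # escalated once
r3 = local_fm.report(Fault("worker1.1", "disk_full"))       # no level owns it
print(r1, r2, r3, sep="\n")
```

Handling a fault at the lowest capable level keeps most recoveries local and fast, while the global manager only sees what the levels below could not resolve.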

Modeling Environment: GME*

[Figure: GME screenshots of fault handling, process dataflow, and HW configuration models]

* GME is an Open-Source, Meta-configurable, multi-aspect graphical modeling tool

ARMOR: Adaptive Reconfigurable Mobile Objects of Reliability

• Heartbeat ARMOR: detects and recovers FTM failures
• Fault Tolerant Manager (FTM): highest-ranking manager in the system
• Daemons: detect ARMOR crash and hang failures
• ARMOR processes: provide a hierarchy of error detection and recovery; ARMORs are protected through checkpointing and internal self-checking
• Execution ARMOR: oversees an application process (e.g. the various Trigger Supervisor/Monitors)

[Figure: three networked nodes, each with a daemon, hosting the Fault Tolerant Manager (FTM), the Heartbeat ARMOR, and an Exec ARMOR with its application process]
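The slide divides detection work: daemons catch ARMOR crash and hang failures, and the Heartbeat ARMOR watches the FTM. The difference between a crash (the process is gone) and a hang (the process exists but has stopped making progress) can be sketched with a heartbeat timeout. This is a generic pattern under assumed names (`HeartbeatMonitor`, `mark_exited`), not the ARMOR protocol; a real daemon would learn about exits from the operating system rather than from an explicit call.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Watched:
    last_beat: float
    alive: bool = True   # in a real system, from an OS process-existence check

class HeartbeatMonitor:
    """Generic heartbeat-timeout detector (hypothetical, not the ARMOR protocol).

    crashed: the watched process is known to have exited
    hung:    the process exists but sent no heartbeat within `timeout` seconds
    """

    def __init__(self, timeout: float, clock: Callable[[], float]) -> None:
        self.timeout = timeout
        self.clock = clock               # injectable so the example is deterministic
        self.watched: Dict[str, Watched] = {}

    def register(self, name: str) -> None:
        self.watched[name] = Watched(last_beat=self.clock())

    def beat(self, name: str) -> None:
        self.watched[name].last_beat = self.clock()

    def mark_exited(self, name: str) -> None:
        self.watched[name].alive = False

    def check(self) -> Dict[str, str]:
        now = self.clock()
        status: Dict[str, str] = {}
        for name, w in self.watched.items():
            if not w.alive:
                status[name] = "crashed"
            elif now - w.last_beat > self.timeout:
                status[name] = "hung"
            else:
                status[name] = "ok"
        return status

# Simulated time: the FTM keeps beating, the Exec ARMOR goes silent, the Heartbeat ARMOR exits.
now = [0.0]
mon = HeartbeatMonitor(timeout=2.0, clock=lambda: now[0])
for name in ("ftm", "exec_armor", "heartbeat_armor"):
    mon.register(name)
now[0] = 1.5
mon.beat("ftm")
mon.mark_exited("heartbeat_armor")
now[0] = 3.0
status = mon.check()
print(status)  # {'ftm': 'ok', 'exec_armor': 'hung', 'heartbeat_armor': 'crashed'}
```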

Very Lightweight Agents

• Minimal footprint
• Platform independence
  – Employable everywhere in the system!
• Monitors hardware and software
• Handles fault detection & communications with higher-level entities

[Figure: a VLA running alongside the physics application above the OS kernel (Linux) and hardware, on both Level 2/3 farm nodes (Linux) and L2/L3 manager nodes (Linux), communicating through a network API]
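Per the slide, a VLA watches hardware and software with a minimal footprint and reports faults to higher-level entities. A minimal sketch of that shape, assuming a software metric (a processed-event counter) and a caller-supplied `send` function standing in for the network API; the class name, message fields, and threshold are illustrative only.

```python
from typing import Callable, List, Optional, Tuple

class VeryLightweightAgent:
    """Hypothetical VLA-style monitor: sample a progress counter, report stalls upward.

    Deliberately small: no threads, no buffering, one message per low-rate sample.
    """

    def __init__(self, node: str, read_counter: Callable[[], int],
                 min_rate: float, send: Callable[[dict], None]) -> None:
        self.node = node
        self.read_counter = read_counter   # e.g. events processed by the physics app
        self.min_rate = min_rate           # events/s below which the agent reports
        self.send = send                   # channel to a higher-level manager
        self._last: Optional[Tuple[float, int]] = None

    def sample(self, t: float) -> None:
        count = self.read_counter()
        if self._last is not None and t > self._last[0]:
            t0, c0 = self._last
            rate = (count - c0) / (t - t0)
            if rate < self.min_rate:
                self.send({"node": self.node, "type": "low_rate", "rate": rate, "time": t})
        self._last = (t, count)

counter = [0]
outbox: List[dict] = []
vla = VeryLightweightAgent("worker1.1", lambda: counter[0], min_rate=100.0, send=outbox.append)
vla.sample(0.0)
counter[0] = 500
vla.sample(1.0)   # 500 events/s: healthy, nothing sent
counter[0] = 520
vla.sample(2.0)   # 20 events/s: reported
print(outbox)
```

Keeping the decision to a single threshold comparison is one way to honor the "minimal footprint" goal; anything smarter (trend analysis, correlation across nodes) can live in the manager that receives the message.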

RTES view of the BTeV L1 Trigger

[Figure: RTES view of the BTeV L1 trigger. Components shown: Muon Detector; FPGA and DSP processing stages with L1 buffers; 4:1 and 80:1 concentrators; Level 1 Switch; FIFO; GL1 and GL1 Manager; FPGA Manager; Buffer Manager; DAQ Switch feeding the L2/3 nodes; ITCH; Slow/Run Control; Farmlet Manager; 60+ (or 600+) farmlets; and the Global GL1SM, Global PTSM, Global and Regional MTSMs, and Global L2/3SM.]

GL1SM – Global L1 Trigger Supervisor/Monitor
MTSM – Muon Trigger Supervisor/Monitor
PTSM – Pixel Trigger Supervisor/Monitor
L2/3SM – Level 2/Level 3 Trigger Supervisor/Monitor

SC2003 Prototype

[Figure: SC2003 prototype. Components shown: DSP nodes (DSP/BIOS) running the physics application with a Very Light Monitor Agent; gateways; a Windows PC and a Linux PC connected over TCP/IP, carrying data and commands; two Local Manager ARMORs, each an ARMOR microkernel with Recovery Policy and Msg Parser elements plus a DSP Interface or an EPICS Interface, with daemons; and the EPICS Graphical Display System.]
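In the SC2003 prototype each Local Manager ARMOR is drawn as an ARMOR microkernel hosting separate elements (Recovery Policy, Msg Parser, and a DSP or EPICS interface). A minimal sketch of that plug-in structure: the kernel only routes messages by type, and behavior lives in the elements. The message format, type names, and restart policy are hypothetical.

```python
from collections import defaultdict, deque
from typing import Callable, Dict, List

Message = dict
Element = Callable[[Message], List[Message]]   # an element may emit follow-up messages

class Microkernel:
    """Hypothetical ARMOR-style microkernel that routes messages to pluggable elements."""

    def __init__(self) -> None:
        self._elements: Dict[str, List[Element]] = defaultdict(list)

    def install(self, msg_type: str, element: Element) -> None:
        self._elements[msg_type].append(element)

    def deliver(self, msg: Message) -> List[Message]:
        """Process msg and every follow-up it triggers, in FIFO order; return the trace."""
        queue = deque([msg])
        trace: List[Message] = []
        while queue:
            m = queue.popleft()
            trace.append(m)
            for element in self._elements.get(m["type"], []):
                queue.extend(element(m))
        return trace

def msg_parser(msg: Message) -> List[Message]:
    # Turn a raw text line (e.g. from a DSP or EPICS interface) into a typed message.
    kind, _, arg = msg["text"].partition(" ")
    return [{"type": kind, "arg": arg}]

restarts: List[str] = []

def recovery_policy(msg: Message) -> List[Message]:
    # Simplest possible policy: restart whatever was reported as failed.
    restarts.append(msg["arg"])
    return [{"type": "restarted", "arg": msg["arg"]}]

kernel = Microkernel()
kernel.install("raw", msg_parser)
kernel.install("failed", recovery_policy)
trace = kernel.deliver({"type": "raw", "text": "failed physics_app"})
print([m["type"] for m in trace])  # ['raw', 'failed', 'restarted']
```

Because elements never call each other directly, swapping the DSP interface for the EPICS interface, or one recovery policy for another, only changes which elements are installed.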

EPICS GUI

[Figure: EPICS GUI screenshot with numbered callouts]

Independent Review

• Following SuperComputing 2003, a software review was conducted
  – GME needs to coherently address multiple, differing domains
    • System modeling, messaging, fault mitigation, Run Control function, GUI, other
  – ARMORs need to be easily customized
    • Via GME
  – Overall packaging and version control: vital

Domain-specific languages

• GME models, metamodels, and interpreters for
  – system description, messaging, state machine (run control, ARMOR), GUI
• Each language generates appropriate artifacts
  – C++, Python, Matlab M-files, Elvin config files, etc.
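The slide describes each modeling language producing its own artifacts. A toy version of that idea, assuming a run-control state machine expressed as plain data: a generator checks the model and emits Python source for a transition function. The states, events, and generator are hypothetical; in RTES this role is played by GME interpreters, which emit C++, Python, Matlab, and Elvin files.

```python
from typing import Dict, List, Tuple

Model = Dict[str, List[Tuple[str, str]]]   # state -> [(event, next_state), ...]

# Hypothetical run-control model; a GME model would hold the same information.
RUN_CONTROL_MODEL: Model = {
    "Idle":       [("configure", "Configured")],
    "Configured": [("start", "Running"), ("reset", "Idle")],
    "Running":    [("stop", "Configured"), ("fault", "Error")],
    "Error":      [("reset", "Idle")],
}

def check_model(model: Model) -> None:
    """Reject transitions whose target state is not declared in the model."""
    for state, transitions in model.items():
        for event, target in transitions:
            if target not in model:
                raise ValueError(f"{state} --{event}--> undeclared state {target}")

def generate_python(model: Model, func_name: str = "next_state") -> str:
    """Emit Python source for a transition function (one possible 'artifact')."""
    check_model(model)
    lines = [f"def {func_name}(state, event):", "    table = {"]
    for state, transitions in model.items():
        for event, target in transitions:
            lines.append(f"        ({state!r}, {event!r}): {target!r},")
    lines += [
        "    }",
        "    try:",
        "        return table[(state, event)]",
        "    except KeyError:",
        "        raise ValueError(f'event {event!r} not allowed in state {state!r}') from None",
        "",
    ]
    return "\n".join(lines)

source = generate_python(RUN_CONTROL_MODEL)
print(source)
namespace: dict = {}
exec(source, namespace)             # stand-in for compiling the generated artifact
next_state = namespace["next_state"]
print(next_state("Idle", "configure"))  # Configured
```

Catching an undeclared target state at generation time, rather than at run time, is the main payoff of generating code from a checked model.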

Versioning/Build System

[Figure: versioning/build pipeline. Stages shown: Metamodels (language specification), Domain Models, Canonical XML models, UDM Translators, a Build Tree (source files, domain artifacts, compiler/linker, objects), and a Run Tree (system executables), connected by IN/OUT arrows.]
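The versioning/build slide chains metamodels, domain models, canonical XML models, UDM translators, a build tree, and a run tree. Treating those stages as a dependency graph gives both a build order and the set of stages to regenerate after an edit. A sketch, assuming one reading of the figure's arrows (the dependency sets below are an interpretation, not taken from the RTES build scripts); it needs Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter   # Python 3.9+
from typing import Dict, List, Set

# stage -> stages whose outputs it consumes (assumed reading of the figure)
PIPELINE: Dict[str, Set[str]] = {
    "metamodels": set(),
    "domain_models": {"metamodels"},
    "canonical_xml": {"domain_models"},
    "udm_translators": {"metamodels"},
    "domain_artifacts": {"canonical_xml", "udm_translators"},
    "source_files": set(),
    "objects": {"domain_artifacts", "source_files"},
    "executables": {"objects"},
    "run_tree": {"executables"},
}

def build_order(pipeline: Dict[str, Set[str]]) -> List[str]:
    """Any order in which every stage follows all of its inputs."""
    return list(TopologicalSorter(pipeline).static_order())

def stale_stages(pipeline: Dict[str, Set[str]], changed: Set[str]) -> Set[str]:
    """Every stage that must be regenerated when the stages in `changed` change."""
    dependents: Dict[str, Set[str]] = {s: set() for s in pipeline}
    for stage, deps in pipeline.items():
        for d in deps:
            dependents[d].add(stage)
    stale, stack = set(changed), list(changed)
    while stack:
        for nxt in dependents[stack.pop()]:
            if nxt not in stale:
                stale.add(nxt)
                stack.append(nxt)
    return stale

order = build_order(PIPELINE)
stale = stale_stages(PIPELINE, {"domain_models"})
print(order)
print(sorted(stale))
```

Editing a domain model leaves the metamodels, translators, and hand-written sources untouched, which is exactly the information a version-controlled build needs to avoid regenerating everything.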

Demo System 2004 - L2/3 Trigger

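[Figure: 2004 demo L2/3 trigger. Components shown: FTM; Global Manager on the heartbeat (HB)/source node; Region 1 with Regional Manager 1 and Workers 1.1 and 1.2; Regional Manager 2 with Worker 2.1; Exec ARMORs; Filter 1, Filter 2, and an Event Builder; Event Source; GUI; and an Elvin Router. Links are marked as either Elvin messages or ARMOR messages.]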
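The 2004 demo carries status and control traffic as Elvin messages through an Elvin router. Elvin is a content-based notification service: subscribers state which messages they want by their contents rather than by a fixed channel name. The sketch below mimics that idea in-process with predicate subscriptions; it does not use or reproduce Elvin's API, and the field names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Notification = Dict[str, object]
Predicate = Callable[[Notification], bool]
Callback = Callable[[Notification], None]

class ContentRouter:
    """In-process stand-in for a content-based router (illustrative; not Elvin's API)."""

    def __init__(self) -> None:
        self._subs: List[Tuple[Predicate, Callback]] = []

    def subscribe(self, predicate: Predicate, callback: Callback) -> None:
        self._subs.append((predicate, callback))

    def publish(self, note: Notification) -> int:
        """Deliver note to every subscriber whose predicate matches; return the count."""
        delivered = 0
        for predicate, callback in self._subs:
            if predicate(note):
                callback(note)
                delivered += 1
        return delivered

router = ContentRouter()
gui_log: List[Notification] = []
region1_faults: List[Notification] = []
# The GUI wants everything from Region 1; Regional Manager 1 only wants its faults.
router.subscribe(lambda n: n.get("region") == 1, gui_log.append)
router.subscribe(lambda n: n.get("region") == 1 and n.get("kind") == "fault",
                 region1_faults.append)

router.publish({"region": 1, "node": "worker1.2", "kind": "status", "rate": 480})
router.publish({"region": 1, "node": "worker1.1", "kind": "fault", "what": "filter crashed"})
router.publish({"region": 2, "node": "worker2.1", "kind": "fault", "what": "timeout"})
print(len(gui_log), len(region1_faults))  # 2 1
```

Publishers never need to know who is listening, so a GUI or a new monitor can be attached to a running system without touching the workers.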

Demo System 2004

[Figure: 2004 demo deployment. Hosts shown: "Iron" (Ganglia; public and private networks) and "Boulder" (Elvin); two laptops running Matlab and Elvin; a Global node (run control (RC), ARMOR); three Regional nodes (RC, ARMOR); five Worker nodes (RC, VLA, ARMOR, FilterApp); and a DataSource (file reader).]

Matlab GUI

Beyond BTeV - CMS

• GME modeling for XDAQ
  – System descriptions, state machines, messaging…
  – Work in progress
• Fault tolerance for HLT
  – ARMORs and VLAs
    • Being discussed
• Balancing CMS needs and RTES goals
  – Adding value, without requiring changes

Beyond BTeV - LQCD

• Lattice Gauge Theory Computation
  – farm at Fermilab
• Single-point sensitivities
  – A single process fault can compromise the entire farm computation
  – Checkpointed; can be restarted, but…
• ARMORs and VLAs
  – Batch/autonomous protection
    • No operator
  – Dynamic mix of protection requirements
    • Not a (quasi)static L2/3 Trigger
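The LQCD slide notes that the farm computation is checkpointed and can be restarted, but that a single process fault still costs the progress made since the last checkpoint. A minimal sketch of checkpoint/restart with an atomic write; the toy "work", file name, and fault injection are hypothetical stand-ins for a real lattice code.

```python
import json
import os
import tempfile

def save_checkpoint(path: str, state: dict) -> None:
    # Write to a temporary file in the same directory, then atomically replace the
    # old checkpoint, so a crash mid-write never leaves a half-written file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)

def load_checkpoint(path: str) -> dict:
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "acc": 0}           # fresh start

def run(path: str, total_steps: int, every: int, fail_at: int = -1) -> dict:
    """Toy long-running computation; `fail_at` injects a process fault (hypothetical)."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if state["step"] == fail_at:
            raise RuntimeError(f"injected fault at step {fail_at}")
        state["acc"] += state["step"]      # stand-in for one unit of real work
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state

with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "lqcd_state.json")
    try:
        run(ckpt, total_steps=100, every=10, fail_at=57)
    except RuntimeError:
        pass                                        # the farm job died
    resumed_step = load_checkpoint(ckpt)["step"]    # 50: steps 50-56 are lost
    final = run(ckpt, total_steps=100, every=10)    # restart from the checkpoint
print(resumed_step, final["acc"])  # 50 4950
```

The checkpoint interval trades write overhead against lost work; autonomous protection in the ARMOR/VLA sense would add detecting the fault and triggering the restart without an operator.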

Beyond BTeV - Grid, Other

• Grid Projects
  – Load balancing and network studies
    • Nodes-in-farm => farms-in-grid
    • Resource driven, deadline driven, other
  – Extension of studies done for BTeV/partitioning
• Other: Dark Energy Survey (astro camera)
  – “Simple” system (few nodes)
  – Not hard real-time (can reacquire image)
    • But it will be a good case study for the “cost” of incorporating RTES (GME, ARMORs, VLAs)
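The grid studies list resource-driven and deadline-driven load balancing. One textbook deadline-driven heuristic, shown as an illustration rather than as the policy RTES studied: take jobs in earliest-deadline-first order and place each on the node that becomes free first, flagging jobs that would finish late. Job names, durations, and node names are made up.

```python
import heapq
from typing import Dict, List, Tuple

Job = Tuple[str, float, float]   # (name, duration, deadline), hypothetical units

def schedule(jobs: List[Job], nodes: List[str]) -> Tuple[Dict[str, List[str]], List[str]]:
    """Earliest-deadline-first dispatch onto the earliest-free node.

    Returns the per-node job order and the jobs that would miss their deadline.
    """
    free_at = [(0.0, node) for node in nodes]   # min-heap of (time node is free, node)
    heapq.heapify(free_at)
    plan: Dict[str, List[str]] = {node: [] for node in nodes}
    late: List[str] = []
    for name, duration, deadline in sorted(jobs, key=lambda job: job[2]):
        start, node = heapq.heappop(free_at)
        finish = start + duration
        plan[node].append(name)
        if finish > deadline:
            late.append(name)
        heapq.heappush(free_at, (finish, node))
    return plan, late

jobs = [("a", 3.0, 4.0), ("b", 2.0, 2.0), ("c", 2.0, 5.0), ("d", 4.0, 6.0)]
plan, late = schedule(jobs, ["farm1", "farm2"])
print(plan)  # {'farm1': ['b', 'c'], 'farm2': ['a', 'd']}
print(late)  # ['d']
```

EDF with greedy placement is simple and fast but not optimal across multiple nodes; a resource-driven policy would instead weigh node capacity or data locality when choosing where a job runs.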

Conclusions

• The RTES project developed two prototypes (L1 and L2/3) for BTeV
  – Demonstrated at conferences
• RTES is now applying its design-time modeling and runtime middleware to several high performance, heterogeneous embedded application environments
