0087_03

ED 03 Released MFS Defence and Investigation facilities

0087_03.doc 07/12/2007

3BK 10204 0087 DTZZA 1/74

All Rights Reserved © Alcatel-Lucent

Site

VELIZY GSM / WiMAX Business Division

Originators

MFS Specifications team

MFS Defence and Investigation Facilities

Release B11

System: ALCATEL-LUCENT GSM BSS

Sub-system: SYS-TLA

Document Category: PRODUCT DEFINITION

ABSTRACT

This SFD describes a list of features that are proposed to improve the MFS Telecom SW defence and SW failures investigation means.

The SW defence improvements aim at:

- Confining the impacts of any remaining bug in the delivered SW, mainly by means of consistency checks, so that the SW can continue or restart after a local clean-up. This should avoid most of the GPU resets and SW blocking states observed in the field.

- Providing different hierarchical levels of SW cleaning and restart, so that the defence action executed is always the one best suited to the scale of the detected inconsistency.

- Improving the SW restart procedures, so that most of them can be executed almost transparently from the end-user point of view, and do not impact the product's external quality.

The investigation means improvements aim at:

- Giving SW testers and developers more or improved traces, dumps, historical and/or contextual data attached to SW failures detected in the field, to help them in their analysis.

Approvals

Name App.

R.MICHEL/ Y.GORJUX SYT Manager / OSY Manager

R. MAUGER TPL

J.-P. GRUAU PM

REVIEW

Ed. 01 Proposal 01 2006/10/20 Review report EVOLIUM/R&D/TD/MFS/2006-5063

Ed. 01 Proposal 03 2006/11/27 Review report EVOLIUM/R&D/TD/MFS/2006-5104

Ed 01 Released 2007/03/29 Review report EVOLIUM/R&D/TD/MFS/2007-5200

Ed 02 Released 2007/07/12 Review report Wireless/GSM/R&D/SYT/207150

Ed 03 Released 2007/12/07 Review report Wireless/GSM/R&D/MFS/2007-5366


HISTORY

Ed. 01 Proposal 01 2006/10/12 First delivered issue after internal review of the draft doc.

Ed. 01 Proposal 02 2006/10/27 Takes into account review report EVOLIUM/R&D/TD/MFS/2006-5063

Ed. 01 Released 2007/03/29 Takes into account review report EVOLIUM/R&D/TD/MFS/2007-5200

Ed 02 Released 2007/07/12 Takes into account review report Wireless/GSM/R&D/SYT/207150

The restart of the PTU without software reloading is replaced by better DSP KO detection with PTU reload from the PMU.

Ed 03 Proposal 01 2007/11/21 The impacts on BTP and the PTU checksum are introduced.

Some sub-features are renamed to avoid confusion.

Ed 03 Released 2007/12/07 Takes into account review report Wireless/GSM/R&D/MFS/2007-5366


TABLE OF CONTENTS

1 INTRODUCTION
1.1 Scope
1.2 Document Structure
1.3 Definitions and pre-requisite
2 HIGH-LEVEL DESCRIPTION
2.1 Functional Requirements
2.1.1 Proposed PMU defence features
2.1.2 Proposed PTU defence features
2.1.2.1 RLC containers integrity
2.1.2.2 Recall of "Fatal Alarm" mechanism
2.1.2.3 Recovery of fatal cases
2.1.2.4 TRX defence reset
2.1.2.5 PMU PTU interfaces dynamic traces
2.1.2.6 PTU Traces improvement
2.2 Compliance to the Marketing Requirements
2.3 Compliance to 3GPP standard
2.4 Working Assumptions
2.5 Dependencies
2.6 HW Coverage
2.7 Decision criteria
2.7.1 Standardisation
2.7.2 Competition
2.7.3 Customer
2.7.4 Gains
2.7.5 Risks
3 SYSTEM IMPACTS
3.1 Telecom
3.1.1 Basic MFS Defence Principles
3.1.1.1 Overview of Defence Actors
3.1.1.2 Defence Agents
3.1.1.3 GPU Defence Manager
3.1.2 PMU defence mechanisms
3.1.2.1 Buffers Management
3.1.2.1.1 Defence against Leaks of GPU Buffers
3.1.2.1.2 Prevention of Buffers Congestion
3.1.2.2 Defence against leaks of GPU-Signals
3.1.2.3 Memory access control
3.1.2.3.1 Principle
3.1.2.3.2 PPC 750 MMU
3.1.2.3.3 Application to the PMU SW
3.1.2.3.4 Impacts
3.1.2.3.4.1 MMU initialisation
3.1.2.3.4.2 Exception handlers
3.1.2.4 Stack overflow
3.1.2.5 Defence against Timers corruption
3.1.2.6 Defence against Sleeping Cells
3.1.2.6.1 Introduction
3.1.2.6.2 TRX-Level Defence
3.1.2.6.2.1 Problems detection and Defensive response
3.1.2.6.2.2 TRX-Defence-Reset procedure
3.1.2.6.3 Cell-Level Defence


3.1.2.6.3.1 Problems detection and defensive response
3.1.2.6.3.2 GPU-level Cell-Defence-Reset procedure
3.1.2.6.3.3 BSS-level Cell-Defence-Reset procedure
3.1.2.7 Defence against frozen MS-Contexts
3.1.2.8 Defence against lost Radio-Resources
3.1.2.9 Detection of blocked automatons
3.1.2.10 Recovery of a DSP KO: reload only one DSP
3.1.2.11 Recovery of frozen GPU
3.1.2.12 Fast GPU restart (without GPU SW code reload)
3.1.2.12.1 Description
3.1.2.12.2 Impacts
3.1.2.13 Defence against PTU access, Trx or Tbf index leak in PTU proxy
3.1.2.14 PMU IPGCH Defence tasks
3.1.2.15 Defence against others GPU tasks
3.1.3 PTU defence mechanisms
3.1.3.1 Recovery of RLC fatal case related to one TBF
3.1.3.2 Recovery of other easy fatal case
3.1.3.2.1 Easy fatal cases in GCH
3.1.3.2.2 Easy fatal cases in MAC:
3.1.3.2.3 Easy fatal cases in RLC:
3.1.3.2.4 Easy fatal cases in SCH:
3.1.3.2.5 Easy fatal cases in SDL:
3.1.3.3 Recovery of fatal alarm on memory operation
3.1.3.3.1 Recall of current implementation:
3.1.3.3.2 Separation of data and pointer zones
3.1.3.3.3 Impacts
3.1.3.4 Recovery of critical fatal cases
3.1.3.5 Reaction of PMU command after DSP KO
3.1.3.6 Forced TRX delete procedure
3.1.3.7 Recovery of a DSP KO (fast detection & reload only one DSP)
3.1.3.8 PTU Code corruption detection (Checksum)
3.1.4 PMU debugging facilities improvements
3.1.4.1 Debug information on important events
3.1.4.1.1 Debug information for GPU Crash
3.1.4.1.2 Debug information for a GPU Reset or Restart
3.1.4.1.3 Debug information for a TRX-Reset
3.1.4.1.4 Debug information for a Cell-Reset
3.1.4.1.5 Debug information for autonomous clean-up
3.1.4.1.6 Debug information for a DSP KO
3.1.4.1.7 Debug information for Problem involving the PTU
3.1.4.1.8 Debug information for anomaly detection
3.1.4.2 Traces improvement (trace dictionary)
3.1.4.3 Dynamic trace level set
3.1.4.4 Miscellaneous
3.1.4.5 Traces service improvement
3.1.4.5.1 Traces Encoding (traces size reduction)
3.1.4.5.1.1 Mechanism
3.1.4.5.1.2 Table of indexes management
3.1.4.5.1.3 Encoded traces exploitation
3.1.4.5.2 Buffering before sending to the Control Station
3.1.4.5.3 Review of current traces
3.1.4.6 HTML pages improvement
3.1.5 PTU debugging facilities improvements
3.1.5.1 Trace filters of PMU-PTU interface trace
3.1.5.2 Dynamic PMU-PTU interface trace
3.1.5.3 Dynamic trace of DSP alarm indication
3.1.5.4 Improvement of PTU internal Debug trace

3.1.5.5 Partly memory dump for critical alarm
3.1.5.6 PTU internal trace for DSP alarm indication
3.1.5.7 Forced DSP KO
3.1.5.7.1 Critical DSP alarm
3.1.5.7.2 Fake DSP KO
3.1.5.8 Improvement of DSP alarm mechanism
3.1.5.9 Synthesis debug information to be shown in html page
3.1.5.10 Dedicated trace file for the specific PTU traces
3.1.5.11 Other improvements
3.1.5.12 PTU restrictions
3.1.6 Features priorities
3.1.7 Features per release
3.1.8 MFS HW coverage
3.1.9 Interfaces
3.1.9.1 Radio interface (05.02, 04.06, 04.60, 04.18, 24.008, etc)
3.1.9.2 Abis interface (08.58)
3.1.9.3 A interface (08.08)
3.1.9.4 Gb interface (08.18)
3.1.9.5 BSCGP interface
3.1.9.6 GCH interface
3.1.10 Simulations
3.2 Operation and maintenance
3.2.1 OMC-R parameters
3.2.2 Modelisation of OMC-R parameters
3.2.3 Other parameters
3.2.4 PM counters
3.2.5 PM indicators
3.2.6 Migration
3.2.7 Java scripts
3.2.8 Fault Management
3.2.9 O&M Specification impacts
3.3 Validation
3.3.1 Testing tools
3.3.2 Test strategy
3.3.2.1 System tests coverage
3.3.2.2 Overall strategy for system tests
3.4 Methods
3.5 GCDs
3.6 Engineering rules
4 SUBSYSTEM IMPACTS
4.1 BTS
4.2 BSC
4.3 Transcoder
4.4 MFS
4.5 OMC-R
4.6 LASER
4.7 MPM/NPA/RNO
4.8 Polo
4.9 OEF
5 PERFORMANCE & SYSTEM DIMENSIONING
5.1 Traffic model
5.2 Performance
5.3 Load constraints
6 OPEN POINTS
7 IMPACTS SUMMARY

8 GLOSSARY
8.1 Abbreviations
8.2 SHT: SSD Signal Historical trace inside the PTU Terminology


INTERNAL REFERENCED DOCUMENTS

Not applicable

REFERENCED DOCUMENTS

Alcatel-Lucent references

[1] 3BK 11203 0091 DSZZA GPRS BSS load Model and Performance Objectives B8

3GPP references

Not applicable

RELATED DOCUMENTS

Alcatel-Lucent documents

Not applicable

3GPP CRs

Not applicable

PREFACE

This document is the input paper for the feature “MFS Telecom SW defence and SW investigation improvements” inside TD. It will subsequently be used as the reference for the development of that feature in each subsystem.


1 INTRODUCTION

1.1 Scope

The present document is intended as the basis for deciding on a proposed change to the BSS system. It provides all necessary information related to the functional description, the gains, and the system and subsystem impacts.

This document describes the new defence mechanisms and investigation means to be implemented in the MFS. It is applicable for the B10, B11 or >B11 releases.

It is mainly intended for:

• The product.

• Designers and developers of the GPU software.

• The integration & validation team.

1.2 Document Structure

Section 2 of this document presents:

• The functional requirements,

• An overall description of the features,

• The compliance to marketing requirements and to the 3GPP standard,

• Which working assumptions have been made,

• The dependencies,

• As well as some decision criteria such as the risk, the gain, etc. associated with the features addressed by the present SFD.

Section 3 identifies the system impacts: it gives the principles and presents the functional split of the feature between subsystems. The interactions within the BSS between the various modules, layers, etc. are shown, as well as the interaction with the other Network Elements. Impacts on telecom and O&M Step2 specifications are also given in this section. The validation strategy is presented, as well as the impacts on GCD, methods and engineering rules.

Section 4 recaps the impacts for each subsystem.

Section 5 addresses the performance and system dimensioning concerns.

Section 6 identifies and describes the open points that have been raised in the various reviews.

Section 7 summarises the system impacts.

Section 8 contains the glossary of the present document.

1.3 Definitions and pre-requisite


2 HIGH-LEVEL DESCRIPTION

2.1 Functional Requirements

At GPU level, experience from the field shows that blocking problems happen without any automatic recovery. Besides, they are difficult and costly to investigate and fix. This has damaging consequences:

- Too much effort and too many people are needed for maintenance and problem follow-up.

- The GPRS service for end users is impacted (no or degraded GPRS traffic), and customers are dissatisfied with our product quality.

The new functionalities defined hereafter aim at:

- Enhancing the MFS reliability to better ensure the GPRS service continuity and improve the overall perceived quality thanks to automatic defence mechanisms whenever possible.

- Improving the debugging means to facilitate investigation and problem solving.

This encompasses PMU and PTU features. The PMU features mainly monitor internal PMU aspects (both in resource management and dynamic parts), but some of them must also trigger / monitor PTU features, so as to restore a consistent state across the whole GPU.

During recent years, we have been facing more and more GPU auto-restarts in the field, due to PTU "Fatal Alarms" (DSP KO). Thus the PTU defence features focus on this "DSP KO" mechanism.

2.1.1 Proposed PMU defence features

Defence mechanisms address two main topics:

1) Automatic prevention or recovery of problems affecting GPU system-level resources :

1. Leak of Buffers

2. Leak of Signals

3. Memory access control

4. Stack overflow

5. Timers corruption

6. Prevention of buffers congestion

2) Automatic recovery of applicative severe faults or anomalies :

1. Sleeping Cells

2. Frozen Mobile contexts

3. Lost radio resources

4. Loss of remote Ptu Access


5. Blocked automatons

6. DSP KO

7. Frozen GPU

8. GPU restart

Improved Debugging facilities include :

- More Debug-Information in case of GPU reset or GPU crash.

- Ability to capture and manage traces on site, without impacting the ongoing Telecom Traffic.

- More appropriate traces at the right time.

- Keeping most recent exchanges at critical interfaces for automatic dump just before reset or on some error occurrences.

- Debug messages on PMU-PTU interface to be issued on PMU request to get some PTU reporting information.

- Improved Traces.

- Dynamic trace level set

- Tools for Off-Line traces analysis.

- Improved HTML pages.

2.1.2 Proposed PTU defence features

Defence mechanisms address three main topics:

1) Automatic recovery of problems affecting PTU resources. This relates to the fatal alarms linked to RLC containers.

2) Recovery of some fatal alarms without using the DSP KO mechanism.

3) Processing of the TRX defence reset coming from PMU.

Improved Debugging facilities include:

1) The dynamic trace level set by the PTU for the PMU-PTU interface trace mechanism.

2) The improvement of the PTU traces with the modification of the DSP alarm mechanism.

2.1.2.1 RLC containers integrity

To make container operations more robust, the partition of RLC containers is split into two parts: a data table and an index table. The container is managed as in the previous implementation, except that the pointer to the next element is replaced by an index stored in a separate area.
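The separation described above can be illustrated with a minimal sketch. It is not the PTU implementation: the class name `IndexedContainerPool`, the `NIL` marker and the table size are hypothetical, and Python is used only for readability. The sketch shows why moving the "next element" link out of the data area helps: a link becomes a small index that can be range-checked before use, and the payloads no longer share storage with the chain.

```python
NIL = -1  # hypothetical end-of-chain marker (assumption, not from the SFD)


class IndexedContainerPool:
    """Free-list pool whose links live in a separate index table."""

    def __init__(self, size):
        if size < 1:
            raise ValueError("pool size must be at least 1")
        self.size = size
        self.data = [None] * size                       # data table: payloads only
        self.next_index = list(range(1, size)) + [NIL]  # index table: chain of free slots
        self.free_head = 0

    def _check_link(self, idx):
        # An index can be validated before use; a raw pointer cannot.
        if idx != NIL and not 0 <= idx < self.size:
            raise ValueError(f"corrupted link {idx}")

    def alloc(self, payload):
        """Take the first free container; return its index, or NIL if none is left."""
        idx = self.free_head
        if idx == NIL:
            return NIL
        nxt = self.next_index[idx]
        self._check_link(nxt)
        self.free_head = nxt
        self.next_index[idx] = NIL
        self.data[idx] = payload
        return idx

    def free(self, idx):
        """Return a container to the head of the free chain."""
        if not 0 <= idx < self.size:
            raise ValueError(f"invalid container index {idx}")
        self.data[idx] = None
        self.next_index[idx] = self.free_head
        self.free_head = idx
```

In this sketch a corrupted link is reported as an exception at the point of use, which is the kind of local detection that allows a recovery limited to the affected container rather than a DSP restart.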



2.1.2.2 Recall of "Fatal Alarm" mechanism

When a fatal alarm is triggered on the DSP, the S/W is trapped in an infinite loop. A timer monitors the PPC-DSP interface. When the DSP can no longer respond on this interface, the PPC assumes that the DSP has encountered a fatal error, and reads the DSP fatal alarm information via direct access to the DSP memory (the so-called DSP Post-Mortem dump, including PTU file name, error info, debug info, etc.).

The GPU is then reset, and all traffic on the GPU is broken. The GPU reset is done through BAM fatal_error (FErr, cause DSP KO) with the brief traces of the DSP Post-Mortem dump. In B8, an improvement was introduced: BAM can dump the whole DSP memory, which greatly improves the efficiency of PTU FR correction.

GPU auto-restart is an effective recovery mechanism, but it is not reasonable to reset the whole GPU and suspend all services on this board when a logical bug occurs on a single TBF/TRX/MEGCH instance.

2.1.2.3 Recovery of fatal cases

There are many fatal alarms (390 occurrences) in the PTU S/W, most of them used as a logical check against a potential bug in the PTU S/W. Experience in PTU S/W maintenance shows that most of these cases can be recovered. It is therefore strongly proposed to reduce the number of unnecessary DSP fatal alarms.

We propose to remove those fatal alarms, except the following ones, which cannot be recovered at PTU software level and therefore require a DSP restart:

1. Fatal alarm on hardware interface operation.

2. Fatal alarm related to SDL operation.

3. Other conditions that PTU can’t recover.

Recovery of current fatal alarms is spread over PTU modules, RLC, MAC, GCH, and "PTU - DSP system modules" (HPI, SDL, and Scheduler).

2.1.2.4 TRX defence reset

When the TRX defence agent detects a cell-blocking case or a PMU-PTU inconsistency case, the PTU is notified with a TBF release (no reply). Following the TBF release, the blocked TRX is deleted in the PTU in forced mode (no reply).

2.1.2.5 PMU PTU interfaces dynamic traces

The PMU-PTU interface can be traced by the PMU entity with a filter. To improve this trace filter facility, the PTU can dynamically ask the PMU to filter on one of the following entities: DSP class, TRX class, GCH class and TBF class. Once the PMU gets the indication, only the PMU-PTU interface messages of this class, for one DSP, are traced. If the trace level indication is not repeated by the PTU within half an hour, the trace is stopped.
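The filter behaviour described above (one class, one DSP, stop after half an hour without a repeated indication) can be sketched as follows. The class and method names are hypothetical and time is passed in explicitly in seconds; only the four entity classes and the 30-minute timeout come from the text.

```python
TRACE_TIMEOUT_S = 30 * 60  # "half an hour", from the text


class TraceFilter:
    """PMU-side filter for PMU-PTU interface traces (illustrative only)."""

    def __init__(self):
        self.active = None  # (dsp_id, entity_class, expiry_time) or None

    def on_trace_level_indication(self, dsp_id, entity_class, now):
        """The PTU asks for (or refreshes) filtering on one class for one DSP."""
        if entity_class not in ("DSP", "TRX", "GCH", "TBF"):
            raise ValueError(entity_class)
        self.active = (dsp_id, entity_class, now + TRACE_TIMEOUT_S)

    def should_trace(self, dsp_id, entity_class, now):
        """True if a message of this class on this DSP must be traced."""
        if self.active is None:
            return False
        f_dsp, f_class, expiry = self.active
        if now >= expiry:
            self.active = None  # indication not repeated in time: stop tracing
            return False
        return dsp_id == f_dsp and entity_class == f_class
```

Each repeated indication simply pushes the expiry date forward, so no separate timer object is needed in this sketch.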

2.1.2.6 PTU Traces improvement

DSP alarm mechanism is improved in the following manner:

• Redefine the cause of DSP Alarm type

• Introduce more debug info in DSP alarm indication

• Redefine the filtering mask



2.2 Compliance to the Marketing Requirements

Not applicable

2.3 Compliance to 3GPP standard

Not applicable

2.4 Working Assumptions

None

2.5 Dependencies

No dependencies

• requires i.e. identify the features which are required by the present one

Feature Dependency

• is enhanced by i.e. identify the features which bring an enhancement of the present feature

Feature Dependency

• is incompatible with i.e. identify the features which are technically incompatible with the present feature

Feature Dependency

• will be enhanced by the following future evolutions

Feature Dependency

• decreases the interest of i.e. identify the features (already existing or candidate for implementation) which become less interesting or even useless if the present feature is implemented

Feature Dependency

Each dependency has to be described and the functionality required/enhanced/incompatible/less interesting/useless has to be clearly identified.

2.6 HW Coverage

System:



GSM 900 Y

DCS 1800 / DCS1900 Y

GSM 850 Y

Network element:

In the following tables, indicate :

Xsw if feature supported by the NE with software impact,

Xhw if feature supported by the NE with hardware impact

Xsw+hw if feature supported by the NE with software and hardware impact.

X: The feature is supported on the NE without hardware or software impacts.

- : The feature is not supported by the NE or the NE is not concerned by the feature

Network element: Impact

BTS Generation:
A9100 (Evolium standard): -
A9110 (M4M): -
A9110-E (M5M): -

BSC Generation:
BSC G2: -

MFS:
MFS: Xsw

Transmission:
Alcatel-Lucent TSC: -

Transcoder:
TRAU G2 with DT16: -
TRAU G2 with MT120
TRAU G2.5 (with MT120): -

MSC:
MSC: -

SGSN:
SGSN: -

Data IWF:
Data IWF: -

HLR:
HLR: -

O&M:
OMC-R: -
POLO: -
OEF: -
LASER: -
MPM/NPA: -
RNO: -

MSTS:
MSTS: -



2.7 Decision criteria

2.7.1 Standardisation

Not applicable

2.7.2 Competition

Not applicable

2.7.3 Customer

Service continuity and product quality improvement required.

2.7.4 Gains

Less instability and better responsiveness for problem solving expected.

2.7.5 Risks

Performance degradation.



3 SYSTEM IMPACTS

3.1 Telecom

3.1.1 Basic MFS Defence Principles

3.1.1.1 Overview of Defence Actors

Inside the PMU, Defence-dedicated entities supervise functional code and perform defence actions when problems happen. They are distinct from functional entities and organised in a hierarchical model.

Local entities called Defence Agents are subordinated (directly or through an intermediate layer) to a central GPU-level Defence Manager. This Defence Manager receives the problem notifications that require a Defence response and works out the appropriate Defence action. Its scope is PMU plus PTU.

Execution of the defence actions is then delegated to the Defence Agents. Local Defence Agents include BSS-level, Cell-level, TRX-level, MS-level, Abis-level and PTU-level agents.

Defence Agents are organised in a Hierarchical model:

o TRX and MS Defence Agents are logically contained in Cell Defence Agents. Cell Defence Agents are logically contained in the BSS Defence Agent.

o PTU Defence Agents are logically contained in Abis Defence Agents. Abis Defence Agents are logically contained in the BSS Defence Agent. PTU and Abis Defence Agents are introduced for the IP mode.

o The BSS Defence Agent is directly contained in the GPU Defence Manager.

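[Figure: Defence Agents hierarchy. The GPU Defence Manager contains the BSS Defence Agent, which contains the Cell Defence Agents (themselves containing the TRX and MS Defence Agents) and the Abis Defence Agents (themselves containing the PTU Defence Agents).]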



Particular Defence Agents autonomously perform local defence actions: e.g. each Garbage Collector periodically checks contexts or resources.

The overall Defence process consists in the following steps:

• Problem Detection:

o In the MFS SW any detection can be performed locally as functional processes are going on. But defence-dedicated entities can also detect problems, through periodic checks or on specific events.

- For example, handlers can autonomously detect that their queue is full, or

- Existing Garbage Collectors periodically check the validity of radio-resources sets.

• Problem Notification:

o Detected abnormal conditions result in anomalies currently called Data-Err.

- Anomalies are characterised by a severity, the problem identification, the originator identifier, a date plus a text to be dumped in traces.

- A dictionary of anomalies is maintained in the PMU.

o Severity is determined by the originator and can take the following values:

- Critical: An abnormal condition has been met which makes the entity unable to continue to run and provide its service. A defence action is required from the GPU Defence Manager.

• Typical examples are: Automatons blocked, Handlers blocked, Inconsistent states, Inconsistency between automatons, Inconsistency between cooperating entities, Full lists or queues, Null pointers, Object or Signal allocation impossible, etc.

- Major: An abnormal condition has been met which leaves the entity able to continue to work without internal inconsistency. No additional defence action is necessary.

• Typical examples are: Lost resources, frozen contexts recycled by Garbage Collectors.

- Minor: An unexpected condition that is not directly harmful has been detected. Dumping it in traces could ease the investigation of other problems.

o Anomalies are notified to local Defence Agents that process them as hereafter described.

• Defence Decision:

o Garbage Collectors recycling lost resources or contexts autonomously release and clean them up. MS Defence Agents can also make defence decisions at MS-level.

o For other defence issues the decision is made by the GPU-Defence-Manager hereafter described.



• Defence Action:

o Defence treatments are as simple as possible: synchronous and without automatons whenever possible. They are performed by specific code, kept distinct from the functional code as much as possible.

o Defence treatments are performed by Defence Agents except the GPU-Reset.

3.1.1.2 Defence Agents

The Defence Agents operating mode is defined by:

o MS Defence Agents operate autonomously and can perform MS-Resets to confine problems at MS-level. However they may be instructed to do MS-Resets by TRX Defence Agents as part of TRX-Resets.

o Other Agents do not make Defence decisions and are monitored by the GPU Defence Manager.

Defence agents are associated with functional entities (MS, Cell, BSS, TRX) and supervise them. They have access to all their internal data.

Defence agents collect all anomalies raised from their supervised entities:

o Minor anomalies are forwarded to the CS to be traced in the Trace file. Major Anomalies are forwarded to the CS to be traced in the Defence Log file.

o Critical anomalies are forwarded to the higher-level Defence Agent up to the GPU-Defence-Manager.

- Only MS-Defence-Agents operate autonomously without notifying the higher level.

o Depending on the anomaly, Defence Agents may dump any data helpful for debug.

o Defence Agents may maintain simple local statistics about anomalies.

o Defence Agents may issue Major anomalies in case of an excessive influx of minor anomalies.

o Defence Agents perform non-configurable filtering of critical anomalies, in order to prevent the GPU-Defence-Manager from triggering redundant or useless Defence Actions. Such filtered anomalies are still dumped in the CS Defence Log file.

o Autonomous Garbage Collectors and Contexts Allocators may also notify major or critical anomalies.
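The routing of anomalies by severity can be sketched as below. This is an illustration under assumptions, not the PMU design: the class names are hypothetical, and the per-agent `trace_file` and `defence_log` lists stand in for the Trace file and Defence Log file kept on the CS. Filtering, statistics and dumps are left out.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


@dataclass
class Anomaly:
    severity: Severity
    problem_id: str
    originator: str
    text: str = ""


@dataclass
class DefenceAgent:
    name: str
    parent: "DefenceAgent | None" = None   # None for the GPU Defence Manager
    autonomous: bool = False               # True only for MS Defence Agents
    trace_file: list = field(default_factory=list)
    defence_log: list = field(default_factory=list)
    received_critical: list = field(default_factory=list)

    def notify(self, anomaly):
        if anomaly.severity is Severity.MINOR:
            self.trace_file.append(anomaly)          # traced only
        elif anomaly.severity is Severity.MAJOR:
            self.defence_log.append(anomaly)         # logged, no defence action
        elif self.parent is None or self.autonomous:
            self.received_critical.append(anomaly)   # decided at this level
        else:
            self.parent.notify(anomaly)              # forwarded one level up
```

With a chain TRX, Cell, BSS, GPU Defence Manager, a critical anomaly raised at TRX level ends up in the manager, while one raised at MS level stays in the MS agent.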

3.1.1.3 GPU Defence Manager

The Defence Manager is created at GPU initialisation. Its mission is to collect critical anomalies whatever their origin (Task, Package) and make Defence decisions. When receiving critical anomalies, the Defence Manager:

o Updates a maintained Defence history in memory.



- This Defence history keeps all defence decisions and critical anomalies that occurred from the GPU start.

- For each update, the Defence log file on the Control Station disk is refreshed too. It keeps defence-oriented information embracing a long period of time just like the current GPU Log file. In particular GPU Restarts are visible.

- Defence data is accessible via the HTML pages.

- The Defence History is structured and synthesised in the form of Defence Contexts that contain any useful information to make defence decisions such as dates and counters.

o Makes a Defence decision. There are 5 levels of Defence (TRX, Cell, DSP, PTU, GPU). The available Defence responses include:

- TRX-Reset, PTU-Reset, GPU-level Cell-Reset, BSS-level Cell-Reset, DSP-Restart, GPU Restart (1), GPU Reset (1).

- The Defence Manager is the only entity in the GPU allowed to request a GPU Reset.

- GPU Fast Restart and GPU Reset are the only available Defence responses for critical anomalies arising in the interface layers BSCGP, Gb, TRM, IP-GCH.

o Instructs (except for (1)) the appropriate Defence Agent to perform the decided Defence procedure.

- TRX Defence Agent for the TRX-Reset, Cell-Defence-Agent for the GPU Cell-Defence-Reset, Ptu-Defence Agent for the PTU-Reset and BSS-Defence-Agent for the DSP-RESTART.

o Waits for the completion of the Defence procedure, with a guard timer.

When receiving an End-Of-Defence-Action indication, the Defence Manager:

o Updates its Defence history.

o In case of failure, possibly requests a heavier Defence action (GPU Reset if needed).

If an anomaly requiring a GPU Reset is received during a defence action, it is taken into account immediately.

3.1.2 PMU defence mechanisms

3.1.2.1 Buffers Management

3.1.2.1.1 Defence against Leaks of GPU Buffers

Today there is no mechanism in the GPU Kernel to detect and fix any buffer loss. If a Buffer starvation occurs a GPU-Reset is performed.

To detect a buffer leak resulting in the loss of buffers a Garbage Collector for buffers is introduced. Its purpose is to release outdated buffers considered as being not used any more.

This service is required by the Defence against blocked automatons to be able to release buffers associated with signals after automatons deletion.



• Each buffer is assigned a maximum life-time. The system descriptor of each buffer is enriched with the date corresponding to its maximum life-time. This information is built from the current GPU internal clock.

o The maximum lifetime is 30 minutes due to MS-related constraints.

• A list of allocated buffers sorted from the oldest to the youngest whatever their size is maintained.

o The oldest allocated buffer holds the head of the list. This list does not contain the buffers allocated with an infinite life-time.

• At each buffer allocation, the Buffers-Manager checks if the maximum life-time of the oldest allocated buffer has expired.

o If yes, a certain amount of allocated buffers whose maximum life-time has expired is autonomously released by the Buffers-Manager, the GPU-Defence-Manager is informed, and some Debug-information (particularly the data size and buffer owner) is dumped in the GPU Trace and Defence-Log file.

o This is done by screening the items of the list until the first buffer whose maximum life-time has not been reached is met.

• At buffer allocation or when a buffer is passed from a software entity to another one, the current owner of the buffer has to put its identity in an Owner field.

• In the GPU life, most buffers actually contain PDUs. The Garbage Collector must not release PDUs still held by MS-level entities. Consequently, a periodic audit of PDUs is to be done at MS-level to release PDUs whose maximum lifetime has expired. A lifetime for UL PDUs must be added for the IP part.

o There is a dependency between a buffer life-time and the time necessary for the Garbage Collector MS to audit PDUs queues of all MS of the GPU. Buffers life-time must be higher than this time to avoid a release while the contained PDU is still present in a Queue.

• Furthermore the Buffers-Manager periodically checks for a possible overwrite in the memory dedicated to Buffers management. It then screens its internal lists of free and allocated buffers and checks their consistency with its counters.

o If an inconsistency is detected an anomaly report is notified to the GPU-Defence-Manager. This one performs a GPU-Reset.

• Two Congestion Thresholds are defined. The Allocation policy in congestion is defined hereafter. Ultimately, if despite this policy a Buffer starvation occurs, GPU Restart (fast restart without SW reload or restart with SW reload) has to be performed.

• Communication with the Trace service does not use applicative buffers but statically allocated memory.
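The lifetime-based Garbage Collector for buffers can be sketched as follows. The names are hypothetical and the sketch makes two simplifying assumptions: every buffer gets the same 30-minute lifetime, so allocation order is also expiry order, and all expired buffers are released in one pass (the text only says "a certain amount"). The `released_by_gc` list stands in for the dump to the GPU Trace and Defence-Log file.

```python
from collections import OrderedDict

MAX_LIFETIME_S = 30 * 60   # 30 minutes, from the text


class BuffersManager:
    def __init__(self):
        self.allocated = OrderedDict()   # buffer_id -> (expiry, owner), oldest first
        self.next_id = 0
        self.released_by_gc = []         # stand-in for the Defence-Log dump

    def allocate(self, owner, now, infinite=False):
        self._collect(now)               # check the oldest buffer at each allocation
        buf_id = self.next_id
        self.next_id += 1
        if not infinite:                 # infinite-lifetime buffers stay off the list
            self.allocated[buf_id] = (now + MAX_LIFETIME_S, owner)
        return buf_id

    def release(self, buf_id):
        self.allocated.pop(buf_id, None)

    def _collect(self, now):
        """Release expired buffers from the head until a live one is met."""
        while self.allocated:
            buf_id, (expiry, owner) = next(iter(self.allocated.items()))
            if expiry > now:
                break
            del self.allocated[buf_id]
            self.released_by_gc.append((buf_id, owner))
```

Keeping the list sorted by age means the check at allocation time normally costs one comparison against the head of the list.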

3.1.2.1.2 Prevention of Buffers Congestion

A policy is defined to prevent Buffers usage congestion or starvation. It is similar to the one already defined for CPU Load control:

- Since most buffers are used to store LLC PDUs, a first threshold (BUFFER_LOW_XOFF) with restriction affecting only new incoming traffic is defined:

o All incoming accesses leading to new TBFs are rejected.



- A second threshold (BUFFER_XOFF) is defined, from which any new incoming TBF is rejected and some currently established TBFs are forcibly released so that the Buffers Usage decreases. The choice of the TBFs to release does not take into account the type of PFC supporting the TBF (RT or NRT, TBF concurrency). A percentage of TBFs to release must be defined.

- When there is no more buffer, a GPU restart (fast restart without SW reload or restart with SW reload) has to be performed.
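The two-threshold policy can be summarised in a small decision function. The threshold values and the release percentage below are placeholders chosen for illustration; the source names the thresholds but does not give their values.

```python
BUFFER_LOW_XOFF = 0.70    # hypothetical usage ratio, not from the source
BUFFER_XOFF = 0.90        # hypothetical usage ratio, not from the source
TBF_RELEASE_PERCENT = 10  # hypothetical percentage of TBFs to force-release


def congestion_action(used, total, established_tbfs):
    """Return (accept_new_tbf, number_of_tbfs_to_release, restart_needed)."""
    if used >= total:
        return (False, 0, True)          # no more buffers: GPU restart
    usage = used / total
    if usage >= BUFFER_XOFF:
        # ceiling of established_tbfs * percent / 100
        to_release = -(-established_tbfs * TBF_RELEASE_PERCENT // 100)
        return (False, to_release, False)
    if usage >= BUFFER_LOW_XOFF:
        return (False, 0, False)         # reject new TBFs only
    return (True, 0, False)
```

For example, with these placeholder values, 95 buffers used out of 100 and 25 established TBFs gives `(False, 3, False)`: new TBFs are rejected and 3 established TBFs are released.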

3.1.2.2 Defence against leaks of GPU-Signals

Today there is no mechanism in the PMU to detect and fix any GPU Signal loss. Signals leaks result more or less rapidly in a Signal starvation. No Defence action is performed and a GPU Reset is done.

A GPU-level Pool of Signals created at GPU Start is maintained. Lists of allocated or free Signals are maintained.

To detect any Signal leak resulting in the loss of Signal the following mechanism is introduced:

• Each Signal is assigned a maximum life-time. Signals are very rapidly consumed except those which are postponed. The descriptor of each Signal is enriched with the date corresponding to its maximum life-time. This information is built from the current GPU internal clock.

o The maximum life-time of any created and not postponed signal is 10 minutes to be able to fix a rapid leak.

o The maximum life-time of any created and postponed signal is 30 minutes

• At each Signal allocation, if no free signal is available, the Pool-Manager checks the maximum life-time of all allocated but not postponed Signals.

o Allocated Signals whose maximum life-time has expired are autonomously released, the GPU-Defence-Manager is informed, and some Debug-information about them is dumped in the GPU Trace and Defence-Log file.

• At automaton-level some checks are added:

o A maximum number of postponed signals per automaton is defined and cannot be exceeded. The local defence agent is in charge of unblocking the faulty automaton.

o At the automaton deletion all internal signals issued by this automaton are released and its queue of postponed signals purged.

However buffers associated with signals are retrieved and released by the Garbage Collector dedicated to Buffers.
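The Signal pool check can be sketched in the same spirit. Names are hypothetical, the life-time is assumed to be counted from the creation date, and the `leaked` list stands in for the dump to the GPU Trace and Defence-Log file. Only non-postponed signals are scanned, and only when no free signal is left, as described above; the 30-minute life-time of postponed signals is not handled here.

```python
NORMAL_LIFETIME_S = 10 * 60   # created and not postponed signals, from the text


class SignalPool:
    def __init__(self, size):
        self.free = list(range(size))
        self.allocated = {}          # signal_id -> {"created": t, "postponed": bool}
        self.leaked = []             # stand-in for the Trace / Defence-Log dump

    def allocate(self, now):
        if not self.free:
            self._reclaim(now)       # scan only when the pool is exhausted
        if not self.free:
            return None              # Signal starvation
        sig = self.free.pop()
        self.allocated[sig] = {"created": now, "postponed": False}
        return sig

    def postpone(self, sig):
        self.allocated[sig]["postponed"] = True

    def release(self, sig):
        del self.allocated[sig]
        self.free.append(sig)

    def _reclaim(self, now):
        """Release non-postponed signals whose maximum life-time has expired."""
        for sig, info in list(self.allocated.items()):
            if not info["postponed"] and now - info["created"] >= NORMAL_LIFETIME_S:
                self.leaked.append(sig)
                self.release(sig)
```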

3.1.2.3 Memory access control

3.1.2.3.1 Principle

A bugged SW may write where it is not supposed to, not only corrupting the program data, but also its code if the code is loaded in RAM, which is the PMU case.

Such situations are very difficult to manage, and cannot be completely mastered by the program itself, since any included control or defence SW can also be overwritten due to some bug. Two memory areas should not be changed during program execution, in order to keep SW consistency: code and constant sections.



Two solutions can be foreseen:

- CRC-check code and constant sections, periodically and/or on identified events (e.g. reset), and trigger a "reload + restart" if the check fails. This is very insufficient, since the inconsistency is not avoided, and the check/defence SW itself may be overwritten. This can be kept only as a fallback if the MMU controlled access cannot be implemented.

- Use the MMU HW access control, which can forbid CPU write access to some memory areas. The PPC 750 includes such a MMU.

Advantages of the MMU solution:

- It guarantees that neither the code nor the constants are changed (the CRC control detects the change, but does not prevent it).

- A MMU exception can be raised at the very moment the CPU tries to execute a forbidden write access. Since the code is safe, any defence action is possible (specific check, trace, memory dump, ...), and if a SW restart is decided, a fast restart (i.e. without reloading the code and constant data) can be executed from the exception procedure.

Possible drawbacks include a higher CPU usage, since the MMU executes some HW checks on each CPU memory access. A very simple architecture must therefore be chosen; in particular, dynamic use of paged memory must be avoided.

3.1.2.3.2 PPC 750 MMU

The PPC 750 offers 3 different memory access mechanisms that can be used separately for instruction access and data access:

- Real addressing mode: the program address (sometimes called effective address, which results from the linker and loader) is identical to the physical address (sometimes called real address) in memory, without translation nor check.

- Segmentation + paging mode: the program address is interpreted as a segment number (highest part), a page number in the segment, and an offset in the (4 KB) page. Segments and pages are described by descriptors in memory, loaded in MMU registers through caching mechanisms for memory access optimisation.

- Block address translation mode: the following description is extracted from the "MPC750 user's manual" (note that the whole PMU SW executes in supervisor mode, so that only supervisor mode controls can be used, such as the Vs bit) :

"(…)

7.5 Block Address Translation

The block address translation (BAT) mechanism in the OEA provides a way to map ranges of effective addresses larger than a single page into contiguous areas of physical memory. Such areas can be used for data that is not subject to normal virtual memory handling (paging), such as a memory-mapped display buffer or an extremely large array of numbers.

7.5.1 BAT Array Organization

The block address translation mechanism is implemented as a software-controlled BAT array. The BAT array maintains the address translation information for eight blocks of memory. The BAT array is maintained by the system software and is implemented as a set of 16 special-purpose registers (SPRs). Each block is defined by a pair of SPRs called upper and lower BAT registers containing the effective and physical addresses for the block.



Four pairs of BAT registers are provided for translating instruction addresses and four pairs of BAT registers are used for translating data addresses. These eight pairs of BAT registers comprise two four-entry fully-associative BAT arrays (each BAT array entry corresponds to a pair of BAT registers). The BAT array is fully-associative in that any address can reside in any BAT. In addition, the effective address field of all four corresponding entries (instruction or data) is simultaneously compared with the effective address of the access to check for a match.

Each pair of BAT registers defines the starting address of a block in the effective address space, the size of the block, and the start of the corresponding block in physical address space. If an effective address is within the range defined by a pair of BAT registers, its physical address is defined as the starting physical address of the block plus the low-order effective address bits.

Blocks are restricted to a finite set of sizes, from 128 Kbytes (2^17 bytes) to 256 Mbytes (2^28 bytes). The starting address of a block in both effective address space and physical address space is defined as a multiple of the block size.

It is an error for system software to program the BAT registers such that an effective address is translated by more than one valid IBAT pair or more than one valid DBAT pair. If this occurs, the results are undefined and may include a spurious violation of the memory protection mechanism, a machine check exception, or a checkstop condition. The following equation determines whether a BAT entry is valid for a particular access:

BAT_entry_valid = (Vs & ¬MSR[PR]) | (Vp & MSR[PR])

If a BAT entry is not valid for a given access, it does not participate in address translation for that access. Two BAT entries may not map an overlapping effective address range and be valid at the same time.

(…)

7.5.4 Block Memory Protection

After an effective address is determined to be within a block defined by the BAT array, the access is validated by the memory protection mechanism. If this mechanism prohibits the access, a block protection violation exception condition (DSI or ISI exception) is generated. The memory protection mechanism allows selectively granting read access, granting read/write access, and prohibiting access to areas of memory based on a number of control criteria. The block protection mechanism provides protection at the granularity defined by the block size (128 Kbyte to 256 Mbyte). For block address translation, the memory protection mechanism is controlled by the PP bits located in the lower BAT register. The PP bits define access options for the block.

Table 7-11. Access Protection Control for Blocks

PP    Accesses Allowed
00    No access
x1    Read only
10    Read/write

Thus, any access attempted (read or write) when PP = 00 results in a protection violation exception condition. When PP = x1, an attempt to perform a write access causes a protection violation exception condition, and when PP = 10, all accesses are allowed. When the memory protection mechanism prohibits a reference, one of the following occurs, depending on the type of access that was attempted:

• For data accesses, a DSI exception is generated and bit 4 of DSISR is set.

• For instruction accesses, an ISI exception is generated and SRR1 bit 4 is set.

(…)"
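The validity equation and the PP protection table quoted above can be modelled in a few lines. This is a software model for illustration only, with hypothetical names; the real checks are done by the MMU hardware, and the exception class simply stands in for the DSI exception.

```python
from dataclasses import dataclass


@dataclass
class BatEntry:
    start: int        # effective start address, a multiple of size
    size: int         # block size in bytes
    phys_start: int   # physical start address
    vs: bool          # valid in supervisor mode
    vp: bool          # valid in user mode
    pp: int           # 0b00 no access, 0bx1 read only, 0b10 read/write


class ProtectionViolation(Exception):
    """Stands in for the DSI exception."""


def bat_entry_valid(entry, msr_pr):
    """BAT_entry_valid = (Vs & not MSR[PR]) | (Vp & MSR[PR])."""
    return (entry.vs and not msr_pr) or (entry.vp and msr_pr)


def data_access(dbat, address, write, msr_pr=False):
    """Return the physical address, None if no BAT entry matches, or raise on violation."""
    for e in dbat:
        if bat_entry_valid(e, msr_pr) and e.start <= address < e.start + e.size:
            if e.pp == 0b00 or (write and e.pp & 0b01):
                raise ProtectionViolation(hex(address))
            return e.phys_start + (address - e.start)
    return None  # not translated by the BAT array
```

With a read-only entry (PP = x1) covering a code block, a read returns the translated address while a write raises the exception, which is the behaviour relied upon for code and constant protection.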

3.1.2.3.3 Application to the PMU SW

It is proposed to apply the Block Address Translation (BAT) mode to the PMU SW, and define the code and constant sections as "read-only", so that any write access in this area results in an exception.



As an example, the figure below represents the B9 GPU memory map (memory address values may be out of date), and the possible Data BAT (DBAT) register array to be used.

Since the goal is only to prevent data write access in code+constant sections, instruction (execution) access does not need to be protected, thus Real Addressing Mode can still be used for instruction access. This preserves execution performance, since the MMU is by-passed for all instruction accesses, and the current makefile should not need to be modified (link+map commands) for code addresses. Performance is also impacted by data access times (e.g. data cache enabled or not).

In a later step, BAT addressing could also be used for instructions (through IBAT registers), in order to prevent instruction execution attempts outside the code section. This can happen e.g. in case of return instruction through a corrupted CPU stack. In a first step, it can be assumed that such a spurious instruction flow will quickly raise a data access exception.



Figure: B9 GPU memory map (columns: B9 GPU2, B9 GPU3, Used MMU DBAT Register)

0x0000 0000 : Reserved for Firmware & boot (1 MB). [ DBAT0 (Read-Only) ], 1 MB

0x0010 0000 : Buffer Pool 1 (Start Addr bufferpool 1 + Length Buffer pool 1 (0x0050 0000) defined in BAM.cpp) (3 MB). DBAT1 (Read-Write), 3 MB

0x0040 0000 : Download DSP Sw (< 1 MB). DBAT2 (Read-Only), 19 MB

0x004E FFE0 : (> 1 MB)

0x0060 0000 : PMU software All Sections

FreeMemStart 0x016d 8a20 : OS & Driver Initializations. 0x015f fae0

0x0170 0000 : DBAT3 (Read-Write), 105 MB

0x018A 0080 : Region 0. 0x01C0 0080

0x03B0 0000 : Region 1. 0x03C0 0000

0x05AF FC00 : Buffer Pool 2 (Start Addr bufferpool 2 + Length Buffer pool 2 (0x0250 0000) defined in BAM.cpp)

0x07FF FC00

Since the whole PMU code executes in Supervisor mode, the defined memory access modes should be : Both (i.e. User + Supervisor) Read-Write or Both Read-Only.



The definition of DBAT0 as Read-Only is optional if read attempts are not needed in this section (To Be Confirmed) : if no DBAT register is defined, a memory access exception will also be triggered on any read or write access attempt in this area.

DBAT2 describes a (read-only) common memory block for Psos code and constant data, PMU code and constant data, and PTU code and constant data. This protects the PTU code image from spurious write accesses, as well as PMU and Psos. The PTU SW can thus be downloaded on a DSP without a GPU code reload (read access to the PTU code section is allowed), provided that this memory section is not re-used as a data section by the PMU after the PTU SW is first downloaded (To Be Confirmed).

This memory access control scheme also raises a memory access exception in case of data access outside the CPU physical memory.

Psos should be transparent to this mechanism, thanks to the following points :

- Psos does not make any use of the MMU features, and does not suppose any particular MMU usage,

- All Psos tasks (Gb, RRM, BSCGP, IPGCH) use the same memory image (and the same one as Psos), so that nothing must be changed in the MMU when switching tasks,

- The proposed memory map and DBAT register contents describe both application (PMU) addressing needs and Psos addressing needs,

- The whole SW (Psos + PMU) executes in Supervisor mode, avoiding any problem with privileged instructions which manipulate BAT registers.

Note that the Psos code will be protected as well as the PMU code, and also Psos data accesses will be controlled as well as PMU data accesses. This is needed since in the current SW, Psos itself can overwrite any memory address due to a bugged parameter in a system call from PMU (despite Psos controls on application parameters).

Note that, since the whole SW executes in Supervisor mode, the system is not protected against spurious Supervisor instructions (mtspr) that could change the DBAT registers.

Some memory addresses must be adapted to cope with the 128 KB BAT granularity: for example, from the address map above, FreeMemStart should be increased to 0x016e0000 or 0x01700000, as the start address of DBAT3. FreeMemStart is a dynamic value computed during the link phase.
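The two candidate values quoted for FreeMemStart follow from rounding the current value up to a boundary. A minimal check, with a hypothetical helper name:

```python
BAT_GRANULARITY = 128 * 1024   # smallest BAT block size (0x20000)


def align_up(address, granularity=BAT_GRANULARITY):
    """Round address up to the next multiple of granularity (a power of two)."""
    return (address + granularity - 1) & ~(granularity - 1)


# Current FreeMemStart from the memory map example
free_mem_start = 0x016D8A20
next_128k = align_up(free_mem_start)            # next 128 KB boundary
next_1m = align_up(free_mem_start, 0x100000)    # next 1 MB boundary
```

Here `next_128k` is 0x016E0000 and `next_1m` is 0x01700000. Note that the quoted manual text also requires a block to start at a multiple of its own size, which constrains the choice further than the 128 KB granularity alone.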

The DBAT registers are initialised by SW. Their contents determine which memory access is allowed (Valid / NotValid, Supervisor / User / Both, NoAccess / ReadWrite / ReadOnly) and the address translation which is processed on any allowed memory access, through the Block Start Address and Block Size.

It is proposed to initialise the Block Start Address and Block Size so as to match the current memory map, so that no real address translation takes place, only access control : physical address (when access is granted) = program address. This should allow the makefile to be kept unchanged (link+map commands) for data addresses.

If it appears that DBAT0 (first memory pages from physical address 0x0) is not needed, one DBAT register can be used to define an additional memory block in the DBAT3 R/W area. For example, with one block for Cell and/or Ms contexts, and a separate one (not contiguous) for buffer pool, any attempt to address a context (with a wrong index) out of the dedicated area will raise an exception, instead of silently reading wrong data or overwriting a buffer.

In the same way, the PPC750GX of the Mx GP board contains 8 DBAT registers instead of 4. This also allows more memory blocks to be defined to improve the data access control.



3.1.2.3.4 Impacts

The CPU addressing mode and DBAT registers are managed through special-purpose registers (SPR), which can only be accessed by means of dedicated privileged instructions (mtspr, mfspr). These instructions probably require that a part of the source code be written in assembly code.

3.1.2.3.4.1 MMU initialisation

At SW start and after restart, the MPC 750 is set by default in Real Addressing Mode, and the BAT registers are not initialised. The PMU SW must initialise the DBAT registers and switch the CPU to Block Address Translation mode.

This initialisation must be done only after the GPU code has been loaded from GEM, since write access in the code section is needed for the download.

3.1.2.3.4.2 Exception handlers

Several exceptions may be raised following such a controlled data access : DSI exception, page access fault, machine check exception or checkstop. An exception handler must be implemented for each possible exception type.

Several dedicated CPU registers automatically save the machine state on exception occurrence, in order to let the exception handler retrieve the execution context of the instruction which caused the exception.

In principle, it is possible for the exception handler to return from the exception and restart the program execution (from the instruction which raised the exception, or another code address), after restoring a consistent execution context. In PMU, this would mean identifying the faulty context (in case of exception in RRM code : Cell or Ms or automaton or …), deleting it, modifying the interrupted CPU stack so as to restart execution at a well-known identified point (e.g. RRM entry point), and returning from the exception. Note that the exception handler executes in real addressing mode (unless it changes it).

To simplify this aspect, it is proposed to trigger a PMU SW restart (probably in BAM code : To Be Confirmed) in case of exception. Of course, the exception occurrence must be traced and/or registered somewhere for reporting and later analysis.

3.1.2.4 Stack overflow

The memory region 0 (please refer to the memory map example in the preceding §) is defined by PMU, and used by Psos for :

- System resource control blocks,

- System stack (the stack is switched by Psos from the task user stack to the system stack on each system call),

- Kernel data,

- Task stacks (allocated by Psos on task creation),

- …

Since the task and system stacks are allocated by Psos, it may be difficult to identify the stack area inside region 0 (To Be Checked).

In a first step, it is proposed to initialise region 0 with an "easily identified" pattern before granting the area to Psos, so that stack usage limits can be checked against overflow in case of problem (e.g. in each exception handler).
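A minimal sketch of this pattern-based check is given below, in Python for brevity. It assumes that the bounds of one stack are known and that the stack grows towards lower addresses; the pattern value and the guard size are hypothetical. The check is heuristic, since legitimate stack data may happen to equal the pattern.

```python
FILL_PATTERN = 0xA5  # hypothetical "easily identified" byte value


def fill_region(size: int) -> bytearray:
    """Initialise a region with the pattern before it is handed to the kernel."""
    return bytearray([FILL_PATTERN]) * size


def untouched_bytes(stack) -> int:
    """Count the pattern bytes still intact at the low end of a descending stack."""
    count = 0
    for byte in stack:
        if byte != FILL_PATTERN:
            break
        count += 1
    return count


def overflow_suspected(stack, guard_bytes: int = 16) -> bool:
    """Report a suspected overflow when fewer than guard_bytes remain untouched."""
    return untouched_bytes(stack) < guard_bytes


# Model a 256-byte stack of which the top 56 bytes have been used.
stack = fill_region(256)
stack[200:] = bytes(56)
```

The same scan gives the stack high-water mark, which could be added to the data dumped by an exception handler.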


In a second step, segment protection control will be used if possible.

3.1.2.5 Defence against Timers corruption

Integrity words are added to the timer task internal data and their validity is periodically checked.

To check that the timer task is still operational, artificial timers are periodically set up to verify that they elapse normally. This control is managed by a redundant timer that does not rely on the timer task.

According to the general defence principles, the Timer task is in charge of checking its data integrity. If it detects an error, it notifies the central GPU-Defence-Manager, which is responsible for the defence response.

As the Timer task is a critical part of the GPU, the GPU Defence Manager saves the event in its Defence history and performs a fast warm GPU Restart (or GPU Reset if the restart without SW reload is not supported).
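The integrity-word principle can be sketched as follows (Python, for illustration only). The marker value, the class layout and the notification callback are assumptions; the actual timer task data structures are not described here.

```python
INTEGRITY_WORD = 0x5AFEC0DE  # hypothetical marker value


class TimerTable:
    """Timer task internal data framed by two integrity words."""

    def __init__(self, slots: int):
        self.head_guard = INTEGRITY_WORD
        self.timers = [0] * slots
        self.tail_guard = INTEGRITY_WORD

    def is_intact(self) -> bool:
        return (self.head_guard == INTEGRITY_WORD
                and self.tail_guard == INTEGRITY_WORD)


def periodic_check(table: TimerTable, notify_defence_manager) -> bool:
    """Run by the timer task itself; report to the central manager on error."""
    if table.is_intact():
        return True
    notify_defence_manager("timer-task", "integrity word corrupted")
    return False
```

The defence decision itself (fast warm GPU Restart or GPU Reset) stays with the GPU Defence Manager; the check only detects and reports.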

3.1.2.6 Defence against Sleeping Cells

3.1.2.6.1 Introduction

“Sleeping cells” refer to situations where incoming traffic cannot be served because of TRX-level or cell-level SW problems. From the customer point of view this results in GPRS service unavailability or serious QOS degradation on these cells.

It has been observed that TRX-level misbehaviour may lead to cell-level blocking or only QOS degradation without total cell turn-off. So far sleeping cells come most of the time from TRX-level misbehaviour.

Therefore in the MFS, two levels of defence against “sleeping cells” are identified:

- TRX-level defence: The defensive action is a TRX-Defence-Reset confining the defence at TRX-level.

- Cell-level defence: The defensive action is a Cell-Defence-Reset confining the problem at cell-level.

3.1.2.6.2 TRX-Level Defence

3.1.2.6.2.1 Problems detection and Defensive response

As described in the general defence principles, abnormal conditions occurring at TRX-level are first detected locally by TRX-level entities:

- By the functional code, while functional treatments are in progress.

- By an Audit of the consistency between the TRX-level entities in the PMU (Pdch-Groups and their BSS counterparts).

o This audit would typically verify the consistency of Pdchs states and the transmission resources status.


o It may be requested when a stable state is achieved (typically at the end of bandwidth increase/decrease) on both sides (BSS and Cell Traffic packages).

o It is performed by the TRX Defence agent.

- By Asynchronous automatons blocking detection with the help of an external GPU-level supervisor as described earlier. For more details on this feature see § 3.1.2.9 Detection of blocked automatons

The TRX Defence Agent is first informed of the problem.

- It issues traces as described above and, in case of critical anomalies, notifies the central GPU-Defence-Manager through higher-level agents with a report including the originator identity and the problem identification. The GPU-Defence-Manager then updates its defence History.

According to Decision Rules previously stated, the GPU-Defence-Manager may try to confine the Defence within this TRX by requesting a TRX-Defence-Reset on it.

- Its execution is delegated to the local Cell and TRX Defence Agents.

- It involves all TRX-level entities in the PMU plus PTU. For cross-connections and transmission resources the BTS and BSC will be aligned at the next transmission-related action.

B8/B9 experience has shown that TRX-Defence-Resets are likely to cover most cell blocking cases.

3.1.2.6.2.2 TRX-Defence-Reset procedure

The Cell and TRX-level defence agents are in charge of the procedure, which consists of the following steps:

o Any traffic on the TRX is sharply aborted. This service is the same as the one used by the MS Garbage-Collector to abort frozen MS-Contexts (see § 3.1.2.7 Defence against frozen MS-Contexts). Warning: only the resources used by the TRX are deleted.

▪ MS-level contexts, including all their PFCs, are aborted and deleted via the MS defence agent, regardless of TCH usage for DTM. PTU is notified with Tbf-Releases and the used Tbf indexes are released. These Tbf-Releases are not acknowledged by PTU.

• From this time, the interface layer with PTU no longer forwards messages received from PTU on these indexes. In particular, when messages cross on the PMU-PTU interface, PMU discards the primitives received for these indexes. This requires the Proxy to check the signature of all received PTU messages.

▪ Aborted Tbfs release their granted Throughput and Radio-Resources by sending a MsDeletionIind message to CellT; CellT then releases all the resources (Throughput and Radio-Resources) and deletes the pending requests.

▪ The LSP used for the UL Tbf is released.

▪ MS-level contexts purge all their LLC-PDUs queues.

▪ MS-level contexts abort any Allocation or Reallocation request pending in Cell-level Handlers.


o The TRX defence agent clears all Pdch-level or TRX-level Allocators of Radio-Resources or Throughput related to this TRX.

o Entities (automatons, queues, handlers, contexts) dedicated to that TRX are reinitialised and cleared in both Packages (Cell-Traffic and BSS Mngt ). In particular, transmission requests are cancelled and negative replies sent to cell Handlers.

o Any Gch used for that TRX is released. Additionally since TRX-level entities will be re-created with an empty radio-allocation, corresponding basic Nibbles are deleted too.

o A message Abis-Nibbles-List-Allocate-Req is sent to the BTS notifying that no more Abis nibble is used on the TRE attached to the TRX. As a result the BTS will stop ongoing traffic on the TRE.

o The TRX is deleted in PTU in forced mode and, from the PMU point of view, un-mapped from the DSP. Its index in the interface layer with PTU is released.

o The BSC is notified with the message TRX-RESET-IND indicating that any PS traffic is lost on the TRX. The BSC removes the Pdchs used for TCH from the PS Zone at the next Radio-Res-Alloc-Ind, to prevent the MFS from allocating PS traffic on TCH.

▪ Incoming RR-Allocation-Indications are then dropped and Cell-State-Change-Indications postponed until the end of the procedure.

o Any message awaited by a Cell-level entity from the TRX-level entities is sent.

▪ This depends on which operation was ongoing on the TRX (Configuration change, Shutdown, New allocation treatment with T1 Reallocations or Fast-Pre-emption ongoing, Deletion) and will unblock these automatons.

o All Entities related to that TRX are deleted and recreated with the last TRX Radio Configuration (HW and O&M) but an empty Radio-Allocation.

o For any deleted automaton the following treatments are performed:

▪ The Manager of GPU-Signals is notified of their deletion. It deletes all the signals created for these automatons. The RRM queue of internal signals is screened to remove all pending signals issued by these automatons.

▪ Queues of postponed signals are purged when these MS-level contexts are deleted.

▪ Pending Timers are cancelled.

▪ Buffers linked to pending signals will be released later by the Garbage-Collector of Buffers.

In TDM mode, at receipt of the TRX-DEFENCE-RESET-CNF from the BSC, the procedure is completed and a completion report is sent to the GPU defence manager. In IP mode, at receipt of the TRX-DEFENCE-RESET-CNF from the BSC, if the TRX is the only one of the remote PTU, the BSS Defence agent sends a PTU-Defence-Reset to IPGCH. At receipt of the PTU-DEFENCE-RESET-CNF from IPGCH, the procedure is completed and a completion report is sent to the GPU defence manager.


A TRX-Defence-Reset in a cell has no impact on other TRXs of the cell except for used Abis nibbles and is likely to unblock possibly blocked cell-level automatons.

3.1.2.6.3 Cell-Level Defence

3.1.2.6.3.1 Problems detection and defensive response

Similarly, abnormal conditions occurring at cell-level may first be detected locally by cell-level software entities possibly with the help of the external GPU-level automatons supervisor.

o For instance, a cell automaton is blocked due to an error in a Pdch-Group (e.g. a notification is missing) while there is no inconsistency or blocking in this Pdch-Group. This is why a defence at cell-level is necessary.

The cell defence agent is first informed of the problem. For critical anomalies it then notifies through higher-level Defence Agents the central GPU-Defence-Manager with a report. This one then updates its defence history.

o Besides for critical anomalies the cell defence agent issues a comprehensive dump of any useful cell internal data as described above.

To give the TRX-level Defence a chance to fix the problem, supervision timers at cell-level are substantially higher than the timers at TRX-level.

o Therefore, if a TRX-Defence-Reset fails to unblock a Cell, the blocking at cell-level is subsequently detected, and a critical anomaly is issued and received by the Cell Defence Agent.

Sleeping Cells alarms issued from PM can also be used to trigger a Cell-level defence action, in order to detect some sleeping cells due to entities other than the MFS. If the cell-lock turns out to be a failure, the BSS-level entity notifies the GPU-Defence-Manager of the problem.

The GPU-Defence-Manager may then request a GPU-level Cell-Defence-Reset to fix and confine the problem at cell-level with no impact on the BSC and SGSN:

o This GPU-confined Cell-Reset extends the previous TRX-Reset to all TRX of the cell and reinitialises the TRX-dependent part of the cell.

o Its goal is to naturally unblock the cell control part thereby resuming the ongoing operation on the cell (Cell-Reset, Cell-Stop, Cell-State-Change) if any.

o It does not affect the cell life-cycle control and is transparent for the BSC and SGSN.

If this GPU-level Cell-Defence-Reset fails to unblock the cell, a heavier and more complex procedure is to be performed: either a BSS-level Cell-Defence-Reset if available (see § 3.1.2.6.3.3 BSS-level Cell-Defence-Reset procedure) or a GPU reset.

o The BSS Cell-Defence-Reset includes the whole cell deletion and re-creation and guarantees that the cell has completely been reinitialised.

o The cell is also stopped and restarted at the BSC and SGSN sides.

o This procedure is monitored from the Control Station.


Furthermore at BSS-level, Cell-Locks are secured because their supervision timer is higher than the cell-level guard timers.

o Consequently, any blockage in Cell-level entities has already been detected, and is likely to have been fixed by a synchronous GPU-level Cell-Defence-Reset, before the BSS-level guard timer elapses.

Besides, the guard timers used in the Control Station to supervise ongoing cell operations must be substantially higher than the cell-level guard timers used to detect anomalies.

o Otherwise cell-level Defence mechanisms would not have the chance to fix the problem before a GPU Reset triggered by the Control Station.
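The ordering constraints between guard timers stated in this section can be verified at configuration time, for example as sketched below (Python). The timer names, the sample values and the ratio used to model "substantially higher" are all assumptions.

```python
# (lower timer, higher timer) pairs, as required by the text of this section.
REQUIRED_ORDER = [
    ("trx_supervision", "cell_supervision"),
    ("cell_supervision", "bss_cell_lock"),
    ("cell_supervision", "control_station"),
]


def violated_pairs(timers_s: dict, min_ratio: float = 2.0) -> list:
    """Return the pairs whose higher timer is less than min_ratio times the lower."""
    return [
        (low, high)
        for low, high in REQUIRED_ORDER
        if timers_s[high] < min_ratio * timers_s[low]
    ]


# Hypothetical values in seconds, for illustration only.
example_timers = {
    "trx_supervision": 30,
    "cell_supervision": 120,
    "bss_cell_lock": 300,
    "control_station": 600,
}
```

An empty result means that each lower-level defence has time to act before the next level reacts.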

3.1.2.6.3.2 GPU-level Cell-Defence-Reset procedure

It is a light procedure extending the previous TRX-Reset to all TRX of the cell and re-initialising the TRX-dependent automaton of the cell. It aims at resuming any blocked operation ongoing on the cell (Stop, Reset, State Change) without impact on the BSC and SGSN, and consists of the following steps:

o Defence-Reset of all the TRX of the cell. One single TRX-RESET-IND including a list of TRX is sent to the BSC.

▪ Incoming RR-Allocation-Indications are dropped and Cell-State-Change-Indications postponed until the end of the procedure.

o Re-initialisation/Clean-up of the entities of the cell interacting with the Pdch-Groups (Radio-Resource-Allocation-Handler automaton in charge of dealing with RAE4 and of serializing cell-level operations, Reallocation or Allocation Handlers).

▪ The automaton in charge of the cell life cycle control and interactions with the BSC or SGSN is not affected.

▪ The Radio-Resource-Handler keeping the radio configuration of the cell is not affected.

o Any message awaited by the cell-controller in charge of the cell life-cycle is sent to it. Consequently, ongoing operations on this cell will be unblocked.

o The procedure is completed at receipt of the TRX-RESET-CNF from the BSC.

Its benefits are:

- This GPU Cell-Defence-Reset has no impact at the BSC and SGSN sides (No change of the external Cell state) and will unblock a pending cell-lock. On-going interactions with the SGSN or BSC are not affected.

- This light procedure is likely to naturally unblock any operation ongoing on this cell (Cell State Change to Disable, Cell-Reset, Cell-Stop).

- It covers most blocking cases, since they are actually all related to Traffic or TRX issues.


- It does not introduce any interaction problem with O&M because it is a local process.

If, after the procedure, the Cell-Controller responsible for the cell life cycle and interactions with the BSC/SGSN is still blocked, the procedure has failed and the supervision of this automaton will issue a new critical anomaly.

3.1.2.6.3.3 BSS-level Cell-Defence-Reset procedure

It is a comprehensive procedure that extends the GPU-level Cell-Defence-Reset and involves stopping the cell at the BSC and SGSN sides, then deleting and re-creating it with a new configuration downloaded from the Control Station. The BSS Defence agent performs this procedure. It involves synchronisation with the Control Station and consists of the following steps:

o Any incoming UL access or DL PDU is dropped. Any ongoing treatment on the cell is stopped.

o The Cell operational state is set to disabled and the state change is sent to the Control Station.

o A Re-Initialisation request is sent to the Control Station. The Control Station is responsible for aborting on-going O&M operation on the cell and postponing any subsequent O&M operation on the cell.

o The GPU Cell-Defence-Reset is synchronously performed.

o The Control Station then monitors the cell lock, deletion, re-creation with its up to date configuration and restart.

This procedure is complete although complex and guarantees that the cell restarts in a clean and safe state. Collisions with on-going operations on the Control Station are to be further studied.

3.1.2.7 Defence against frozen MS-Contexts

Per cell, the activity of at least n Mobile Contexts (active or inactive) is checked each second by a Garbage Collector.

n is computed so that every mobile in the cell is checked at least once every 4 minutes. If no traffic has occurred for 10 minutes in both UL and DL directions, the MS does not use TCH resources for DTM, no Flush or Rerouting procedure is running, and the MS is not related to an RT PFC, the contexts related to this MS are considered as frozen and an automatic cleaning is performed (same behaviour as MsDefenceReset):

o TBFs with the connection to the MS are aborted. The MS-Session is aborted too. As a result, Tbfs-Release (in forced mode) are sent to PTU. These requests are not acknowledged by PTU.

o LLC-PDUs queues are purged.

o Buffers linked to pending signals will be later released by the Garbage-Collector of Buffers.

o Pending Timers are cancelled.

o The cell is notified with this deletion so that it can autonomously release forgotten radio-resources or throughput.


o The Manager of GPU-Signals is notified with their deletion. All created signals for these automatons are deleted.

o All contexts and automatons related to this MS are deleted.

Besides checking the inactivity of existing MS-Contexts, this Garbage Collector also requests the audit of the queued LLC-PDUs. PDUs with their maximum life-time expired are released.
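The frozen-context criterion and the computation of n can be sketched as follows (Python, illustrative only). The formula for n is one possible reading of the rule above, namely that the whole set of contexts of a cell is scanned within 4 minutes; the field and function names are assumptions.

```python
import math
from dataclasses import dataclass

FULL_SCAN_PERIOD_S = 4 * 60   # every context checked at least once per 4 minutes
FROZEN_AFTER_S = 10 * 60      # inactivity threshold in both directions


def contexts_per_second(context_count: int) -> int:
    """Number n of MS contexts to check each second in one cell."""
    return max(1, math.ceil(context_count / FULL_SCAN_PERIOD_S))


@dataclass
class MsContext:
    last_ul_traffic_s: float
    last_dl_traffic_s: float
    uses_tch_for_dtm: bool = False
    flush_or_rerouting_running: bool = False
    has_rt_pfc: bool = False


def is_frozen(ms: MsContext, now_s: float) -> bool:
    """True when the context qualifies for the automatic cleaning."""
    idle = (now_s - ms.last_ul_traffic_s >= FROZEN_AFTER_S
            and now_s - ms.last_dl_traffic_s >= FROZEN_AFTER_S)
    exempt = (ms.uses_tch_for_dtm
              or ms.flush_or_rerouting_running
              or ms.has_rt_pfc)
    return idle and not exempt
```

With this reading, a cell holding 1000 contexts leads to 5 checks per second, which keeps the CPU cost of the Garbage Collector bounded and evenly spread.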

3.1.2.8 Defence against lost Radio-Resources

Per cell, every 10 minutes, a Garbage Collector checks the validity of allocated Radio-Resources per direction and per MS. This 10-minute period ensures that a defence action is not taken too early, when it is not really necessary.

If the MS client does not exist any more or if it considers the Radio-Resources as not valid any more for this direction the Garbage Collector:

o Retrieves the complete allocated Throughput and Radio-Resources that was granted to the MS.

o Autonomously releases this Resources-Set in the same way as if it had been released by the MS.

3.1.2.9 Detection of blocked automatons

The PMU software is a mix of entities using synchronous or asynchronous interactions. Asynchronous interactions take place between automatons and external actors.

To detect the blocking of automatons inside the PMU, a periodic control of each automaton is introduced, without taking the automaton state into account:

o This periodic check must not be too time consuming, due to the potentially high number of supervised automatons. Therefore the period for each automaton is 10 minutes.

o However to smooth CPU load induced by this monitoring a limited number of automatons are checked every minute.

o Consequently this is only a last-resort defence mechanism and automatons must still manage their own short supervision timers to be able to quickly detect a blockage in a temporary state if necessary.

The Subscription policy is defined by:

o By default, at its creation, each automaton automatically subscribes to a central supervisor (which could be the GPU Defence manager).

o At its deletion it automatically un-subscribes from it.

o A service to unsubscribe on demand is offered, so as not to supervise automatons already supervised by existing Garbage Collectors.

Each automaton offers an interface with a synchronous method to check its internal state. The point is that this check does not use the asynchronous automaton (i.e. the check shall not send an event to the automaton).

Periodically the automatons supervisor invokes this method to check some automatons so that CPU usage is better spread. The purpose of this method is:


o To find out whether it is normal that an automaton is still in its current state. This is why each automaton maintains the date of its last state change. The maximum delay during which an automaton may remain in a state naturally depends on that state, whatever its nature: stable or temporary.

▪ For instance a Pdch-Group automaton may stay in its “Available” state forever because it is a stable condition whereas it should not stay in “Wait-For-Trans-Release” for too long because this is a temporary condition.

▪ The maximum delay during which an automaton is in a temporary state is specific to that state.

o If a blocking case is detected, the automaton notifies its assigned Defence agent, which is responsible for taking appropriate measures:

▪ The Defence agent may be a Ms Defence agent, Cell Defence Agent, TRX Defence Agent, or the GPU-Defence-Manager.
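The supervision scheme of this section can be sketched as follows (Python, illustrative only). The class names, the 60 s limit for the temporary state and the round-robin batching are assumptions; the sketch only shows a synchronous check that sends no event to the automaton, and a batch size chosen so that every subscriber is visited within 10 one-minute ticks.

```python
import math

CHECK_PERIOD_MIN = 10  # each automaton is checked at least every 10 minutes


class Automaton:
    # State name -> maximum allowed duration in seconds.
    # A state absent from the table is stable and has no limit.
    MAX_STATE_DURATION_S = {}

    def __init__(self, name: str, state: str, now_s: float):
        self.name = name
        self.state = state
        self.last_state_change_s = now_s

    def set_state(self, state: str, now_s: float) -> None:
        self.state = state
        self.last_state_change_s = now_s

    def is_blocked(self, now_s: float) -> bool:
        """Synchronous check: reads internal data only, sends no event."""
        limit = self.MAX_STATE_DURATION_S.get(self.state)
        return limit is not None and now_s - self.last_state_change_s > limit


class PdchGroup(Automaton):
    MAX_STATE_DURATION_S = {"Wait-For-Trans-Release": 60}  # hypothetical limit


class Supervisor:
    def __init__(self):
        self.subscribers = []
        self.cursor = 0

    def subscribe(self, automaton: Automaton) -> None:
        self.subscribers.append(automaton)

    def unsubscribe(self, automaton: Automaton) -> None:
        self.subscribers.remove(automaton)

    def tick_minute(self, now_s: float) -> list:
        """Check about one tenth of the subscribers; return the blocked ones."""
        if not self.subscribers:
            return []
        batch = math.ceil(len(self.subscribers) / CHECK_PERIOD_MIN)
        blocked = []
        for _ in range(batch):
            self.cursor %= len(self.subscribers)
            automaton = self.subscribers[self.cursor]
            self.cursor += 1
            if automaton.is_blocked(now_s):
                blocked.append(automaton)
        return blocked
```

In a real implementation the blocked automaton would notify its assigned Defence agent; here the supervisor simply returns the list so that the behaviour can be observed.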

3.1.2.10 Recovery of a DSP KO: reload only one DSP

A DSP KO event is reported by the DSP meaning that the PTU has met an unrecoverable anomaly. To avoid that this DSP anomaly leads to a GPU reset, it is proposed to modify the PMU and PTU behaviour: a “DSP Restart” is introduced. This feature is only supported on GPU3 and MX GP.

The advantage of this solution is that the 3 other DSP are not impacted by the “DSP restart”. There is no GPRS traffic interruption for them.

The “DSP restart” uses the following mechanisms:

- The DSP provides the address of its internal counter (ms timer) to the DSP driver. The counter is updated each millisecond by PTU. When a DSP KO happens, the counter is no longer updated. The DSP driver detects the DSP KO by polling this counter periodically (default 100ms).

- At receipt of the DSP KO event, the PMU internally releases all resources related to all Cell/TRX mapped on that DSP and resets this DSP with PTU SW reload.

- The PMU is able to autonomously reload the PTU SW. Therefore it has to keep the PTU SW in read only memory or memory controlled by checksum. Today memory containing the PTU SW is reused for buffers after GPU start for GPU2 type
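The counter-polling detection can be sketched as follows (Python, illustrative only). The class and callback names are assumptions, and the sketch reports the DSP KO on the first poll that finds the counter unchanged; whether several consecutive stalled polls should be required is not specified here.

```python
class DspWatch:
    """Detects a DSP KO by polling the DSP millisecond counter."""

    def __init__(self, read_counter, on_dsp_ko):
        self.read_counter = read_counter  # callable returning the DSP ms counter
        self.on_dsp_ko = on_dsp_ko        # callable informing the PMU
        self.last_value = read_counter()
        self.ko_reported = False

    def poll(self) -> bool:
        """Called at each polling period (100 ms by default).

        Returns True while the counter is progressing.
        """
        value = self.read_counter()
        stalled = value == self.last_value
        self.last_value = value
        if stalled and not self.ko_reported:
            self.ko_reported = True
            self.on_dsp_ko()
        return not stalled
```

Since the counter is updated every millisecond and polled every 100 ms, an unchanged value between two polls is a strong indication that the PTU has stopped.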

The GPU Defence Manager is notified of the DSP KO critical anomaly by the BSS Defence Agent and decides to request a “DSP Reset” procedure with PTU SW Reload.

This procedure is performed by the BSS-Level Defence Agent and consists of the following steps:

o First, GPU-level Cell-Defence-Resets of all the cells mapped on that DSP are performed. Thus all these cells are cleanly stopped from the PTU point of view.

▪ As all TRX of a Cell are mapped on the same DSP, other DSP and cells will not be affected.

o Optionally, a whole DSP dump can be made before the DSP reset (warning: this dump can take about 30 seconds).


o Reload the PTU SW on the DSP.

o Restart of the DSP. This can be done cleanly by setting DSPINT in the HPIC register, which triggers the start of the DSP software at address 0 of the external SBSRAM.

o Before the DSP initialisation sequence, all incoming TBF requests that would lead to the choice of a new DSP are rejected by the GPU. Here is the rationale of this restriction:

▪ During the time between the reception of the DSP KO and the end of the DSP initialisation sequence, only 3 DSP in the GPU are operational for traffic. The choice of the DSP which will handle the GPRS traffic is done at BTS level:

• Indeed, when the traffic starts for the first time in one TRX in a given BTS, one DSP is chosen, and afterwards all other TRX within this BTS will use the same DSP.

▪ In case of DSP KO, all cells are unmapped from the DSP (as all their traffic is stopped). Therefore, when the traffic restarts, the choice of the DSP will be done again.

▪ It is important to note that, due to the DSP choice policy, the assignment of one DSP to all cells within the same BTS is done for a long time. A fair balancing of all BTS over all DSP is therefore crucial. The only occasion to do that is the occurrence of the first traffic in a BTS.

▪ The traffic of all cells previously mapped on the troubled DSP is likely to restart very soon and at the same time. If only 3 DSP are available at that time, all these cells would be mapped on them; once the faulty DSP is re-initialised, all the BTS would be spread over the other 3 DSP and the restarted DSP would be unused.

▪ To avoid this, the simplest and safest means is to forbid the choice of a new DSP during that period of time.

▪ Importantly, for other BTS already mapped on other DSP, there is no impact: incoming traffic on a new TRX in one of these BTS is allowed. TRXs newly opened to traffic are mapped on the currently used DSP.

o Start the DSP initialisation sequence. The choice of a new DSP is allowed again and load will be balanced between all DSP.
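The DSP choice restriction described in the steps above can be sketched as follows (Python, illustrative only). The class and method names are assumptions, and the load criterion (number of BTS mapped per DSP) is a simplification of the real balancing policy.

```python
class DspSelector:
    """Maps each BTS to one DSP, with the choice frozen during a DSP restart."""

    def __init__(self, dsp_count: int = 4):
        self.bts_to_dsp = {}
        self.load = [0] * dsp_count  # number of BTS mapped on each DSP
        self.restarting = set()

    def on_dsp_ko(self, dsp: int) -> None:
        """Unmap every BTS from the faulty DSP and forbid new choices."""
        self.restarting.add(dsp)
        for bts in [b for b, d in self.bts_to_dsp.items() if d == dsp]:
            del self.bts_to_dsp[bts]
        self.load[dsp] = 0

    def on_dsp_init_started(self, dsp: int) -> None:
        """The choice of a new DSP is allowed again."""
        self.restarting.discard(dsp)

    def select(self, bts: str):
        """Return the DSP for traffic on bts, or None if the request is rejected."""
        if bts in self.bts_to_dsp:
            return self.bts_to_dsp[bts]  # BTS already mapped: no impact
        if self.restarting:
            return None                  # no new DSP choice during the restart
        dsp = min(range(len(self.load)), key=self.load.__getitem__)
        self.bts_to_dsp[bts] = dsp
        self.load[dsp] += 1
        return dsp
```

In this model the BTS unmapped from the faulty DSP are rejected until the initialisation sequence starts, after which the least loaded DSP, normally the restarted one, is chosen first.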

To support the feature, the following is needed:

o The PTU offers the address of the DSP counter (ms timer) in the contexts shared with the DSP Driver.

o A new monitoring timer shall be introduced in the DSP driver to poll the status of the DSP. If the DSP is KO, PMU shall be informed.

o Warning: the DSP must re-establish the DSP-PPC communication before waiting for the DSP init sequence. The DSP driver must therefore implement a new function for it.

This feature relies on two mechanisms already needed for other Defence purposes in the GPU:

- The PTU SW is kept in read only memory or memory controlled by checksum : This is needed for the Fast GPU Restart.


- The GPU Cell-Defence-Reset.

3.1.2.11 Recovery of frozen GPU

In some field cases, the GPU appears as "sleepy", i.e. any OAM command is ineffective (e.g. a GPU SW reset request), although the GPU is seen as "Operational" by the Tomas HW supervision, or at least not reported as KO.

The hypothesis (To Be Confirmed) is that the Tomas supervision (NMA distributed KeepAlive test) is still OK, i.e. NMA answers the supervision request, although the upper layer (PMU) is not Operational, and is not able to execute any OAM request.

The proposal is to add a Keep Alive test from GOM to PMU, which uses the same interface as any OAM command to PMU. GOM then triggers a forced GPU HW reset if the KeepAlive test is NOK. The Keep Alive is received by RRM. RRM must interrogate all the PMU tasks before answering the Keep Alive. This mechanism will address the blocking case in which a PMU task is blocked although not dead. In that case the GPU PSOS watchdog cannot detect the problem and the GPU is frozen forever.

3.1.2.12 Fast GPU restart (without GPU SW code reload)

3.1.2.12.1 Description

On a GPU2 the time needed to become operational again after a reboot is about 2 minutes and 30 seconds.

On a GPU2 the time needed to become operational again after a switchover to the spare board is about 35 seconds.

A Fast Restart (no code reload) would substantially reduce the delay for the GPU to be operational again. The expected gain is about 2 minutes.

The Fast Restart requires:

• Either to guarantee the GPU code and constants integrity after download, thanks to MMU access control, or to check this integrity by means of a checksum,

• Not to reuse the memory containing the PTU code (as of today), after download on the DSPs, in order to allow for DSP restart (refer to §3.1.2.10).

o This may reduce the number of GPU buffers and therefore probably the GPU traffic capacity.

• The GPU Defence Manager to provide the requested GPU starting type (complete or fast).

• That the restart type is "complete" (i.e. with code reload) by default (set just after the PMU has been downloaded).

Two options should be evaluated for Fast Restart :

• PROM restart: After the GPU reset the Fast Loader is downloaded and gets the restart type. If a Fast Restart is requested and the GPU (PMU+PTU) code integrity is correct, the GPU code is not downloaded again but restarted. The code memory must not be reinitialised (by the PROM code) during this kind of restart. This option is abandoned because it requires modifying the GPU PROM.

• PMU restart: A full GPU restart sequence (equivalent to the PROM one) is included in the PMU defence code. If a Fast Restart is requested and the GPU (PMU+PTU) code integrity is correct, the PMU code is restarted from its initial entry point, without involving the PROM code and the fast loader. The CPU context must be identical to the one resulting from a PROM reset (stack, interrupt vector, and so on). Otherwise, a complete restart is triggered as of today.

In both cases, the remaining steps are the same as for a classical GPU reboot. In particular, the O&M BSS and Cells configuration is downloaded from the Control Station.

In both cases, the following restart steps contain a DSP reset with PTU SW reload.

The PMU restart option does not require any change in the PROM code. It should be faster and more flexible, since independent from the PROM contents. But the feasibility and complexity of both options must be assessed.

A Warm Restart (no configuration download) can be envisaged too if:

- The GPU OAM configuration data can be clearly isolated from dynamic contexts and data,

- The GPU OAM configuration data integrity is controlled by means of a checksum,

- This checksum is updated on each configuration value set,

- If the checksum check is OK on reset, the configuration download may be skipped.

The gain on traffic stop duration could be up to 30s with respect to fast restart.
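The checksum control required for a Warm Restart can be sketched as follows (Python, illustrative only). The use of CRC-32, the serialisation of the configuration and all names are assumptions; only the principle comes from the conditions above: the checksum is updated on each configuration value set and verified on reset.

```python
import zlib


class ConfigStore:
    """OAM configuration data, isolated from dynamic contexts."""

    def __init__(self):
        self._values = {}
        self.checksum = self._compute()

    def _compute(self) -> int:
        blob = repr(sorted(self._values.items())).encode()
        return zlib.crc32(blob)

    def set(self, key: str, value) -> None:
        """Every configuration value set refreshes the checksum."""
        self._values[key] = value
        self.checksum = self._compute()

    def is_intact(self) -> bool:
        return self.checksum == self._compute()


def config_download_needed(store: ConfigStore) -> bool:
    """On reset: the configuration download may be skipped only if the check is OK."""
    return not store.is_intact()
```

Any write that bypasses the set operation, such as a memory corruption, leaves the stored checksum stale and forces the configuration download.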

OP: TO BE STUDIED GPU Reset requested by the Control Station.

3.1.2.12.2 Impacts

1. All options:

GPU buffers can no longer be defined in the PMU memory area used to first download the PTU code.

A fallback to complete restart (with code download) must be implemented in case of several unsuccessful fast restarts.

The impact on the OAM SW components (GEM and/or GOM) must be evaluated, since a new GPU restart scenario is added, as seen from the CTL station. This scenario case is functionally equivalent to a spare GPU restart after GPU switch-over : restart without code reload.

2. PROM restart:

If the memory area used to download the GPU code is reset to zero at early restart time (i.e. before download) by the PROM code, this option cannot be used without modifying this part of the PROM code.

3. PMU restart:

A complete restart sequence (similar to the PROM one) must be implemented in the PMU.

4. WARM restart:

A checksum control of the GPU configuration data must be implemented.

A fallback to fast restart (with configuration download) or complete restart must be implemented in case of several unsuccessful warm restarts.
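The fallbacks between restart types can be sketched as an escalation ladder (Python, illustrative only). The failure threshold and the choice to fall back from warm to fast before complete are assumptions.

```python
RESTART_LADDER = ["warm", "fast", "complete"]
MAX_FAILURES = 3  # hypothetical number of unsuccessful attempts before fallback


def next_restart_type(requested: str, failures: dict) -> str:
    """Return the restart type to execute.

    failures maps a restart type to its count of consecutive unsuccessful
    attempts; the complete restart is always the last resort.
    """
    index = RESTART_LADDER.index(requested)
    while (index < len(RESTART_LADDER) - 1
           and failures.get(RESTART_LADDER[index], 0) >= MAX_FAILURES):
        index += 1
    return RESTART_LADDER[index]
```

The failure counters would have to be kept in a memory area that survives the restart itself, which is not modelled here.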


The impact on the OAM SW components (GEM and/or GOM) must be evaluated, since a new GPU restart scenario is added, as seen from the CTL station. This scenario case is functionally equivalent to the GPU re-connection after CTL station switch-over : the GPU OAM state is recovered without code or configuration download.

3.1.2.13 Defence against PTU access, Trx or Tbf index leak in PTU proxy

An audit of the PTU access, Tbf and Trx indexes is added in case of saturation of these tables.

A PTU Defence Agent is created at the receipt of the Cell configuration. There is one instance of the PTU Defence Agent per PTU access. It is in charge of:

• Executing the PTU-Defence-Reset

• Processing the anomalies received from PTU proxy or IPGCH

The Critical anomalies received from PTU Proxy are forwarded to the higher-level Defence Agent up to the GPU-Defence Manager, which makes a defence decision. In this case the decision is a PTU-DEFENCE-RESET sent to the BSS-Defence-Agent. The BSS-Defence-Agent uses:

• The TRX-DEFENCE-RESET procedure already described, to reinitialise all TRXs of the PTU access

• The PTU-DEFENCE-RESET procedure, sent to the PTU Defence Agent

The PTU-Defence-Agent:

• Asks PMU IPGCH to abort the TCP connection with the remote PTU and to release all associated contexts

• Waits for the answer from PMU IPGCH before answering the BSS-Defence-Agent

3.1.2.14 PMU IPGCH Defence tasks

An Abis Defence Agent is created at the receipt of the Cell configuration of the first cell of the Abis. There is one instance of Abis Defence Agent per Abis.

It is in charge of processing the anomalies received from MFS IPGCH linked to an Abis. No specific action is foreseen for the Abis-Defence-Agent itself. The Critical anomalies received from MFS IPGCH are forwarded to the higher-level Defence Agent up to the GPU-Defence Manager, which makes a defence decision. In this case the decision is a PTU-DEFENCE-RESET sent to the BSS-Defence-Agent for each Trx of the Abis.

For defence purposes, a new interface is needed in IPGCH PMU to release a PTU access without interaction with the remote PTU IPGCH.

This interface will be used when all Trx of a PTU access are reset.

3.1.2.15 Defence against other GPU tasks

Not foreseen, except a blocking automaton if needed. The existing mechanisms are sufficient for BSCGP and GB.



3.1.3 PTU defence mechanisms

3.1.3.1 Recovery of RLC fatal case related to one TBF

About 65% of the DSP fatal alarms are located in the RLC module. Most crashes in the RLC module come from defensive checks that detect logical bugs.

A recovery example:

When PTU detects an unrecoverable TBF, the recovery solution is to have RLC release the TBF. This is a very special request in the RLC module:

1. The TBF releases all the containers it has allocated.

2. A TBF_RELEASE_IND is sent to RRM.

3. All timers are reset.

4. An unmaskable critical alarm is reported.

When such a semi-fatal alarm happens (En_DspReset_CriticalError is "off"), the TBF is released abnormally, with a new cause "PTU Internal Error". The DSP does not crash and does not need any restart.

This defence is applicable to both TDM mode and IP mode (at least for G4/G5 TRE). By default, it is not applicable to G3 TRE because of its memory restriction.

3.1.3.2 Recovery of other easy fatal case

The fatal cases that can be recovered easily shall be changed to critical alarms, and recovery code shall be added for these cases. When the En_DspReset_CriticalError flag is disabled, the DSP KO will not happen in these easy fatal cases. This defence is applicable to both TDM mode and IP mode.

The list of these fatal errors follows. It is based on the TDM SW; for IP mode, the fatal cases related to GCH and HPI are removed because RLC/MAC is ported to the BTS.

3.1.3.2.1 Easy fatal cases in GCH



File name | Line | Description of the case | Defence

l2egch.c | 803 | subgchid out of range | This alarm should be removed because it was added for debug.

l2egch.c | 1140 | wrong cs | Issue a critical alarm and discard the PDU on this PDCH.

3.1.3.2.2 Easy fatal cases in MAC:


File name | Line | Description of the case | Defence

mac_sys.c | 2011 | The number of TRX exceeds the range in the DSP-TRX-PM-GAUGE-req message. | A critical alarm, and check the first TRX id and the last TRX id at reception of DSP-TRX-PM-GAUGE-req

mac_ulr.c | 183 | wrong UBN in PDCH context | A critical alarm and return P_ULR_NIL

MAC_ULReordering.spd | | P_ULR_NIL, wrong input from MEGCH | A critical alarm, clean the container and return

MAC_UnitDataInd.spd | | P_ULR_NIL, wrong input from MEGCH | A critical alarm, clean the container and return

MAC_AccessInd.spd | | P_ULR_NIL, wrong input from MEGCH | A critical alarm, clean the container and return

MAC_NodataInd.spd | | P_ULR_NIL, wrong input from MEGCH | A critical alarm, discard this message and return

3.1.3.2.3 Easy fatal cases in RLC:


File name | Line | Description of the case | Defence

rlclksup.c | 160 | TX_EFFICIENCY_TOO_LOW: error while calculating tx_efficiency, the denominator is zero. | A critical alarm and return TX_EFFICIENCY_TOO_LOW, which leads to an abnormal TBF release

rlc_lla.c | 133 | In DL Ack mode, -1 is raised if no LLC PDU has been acknowledged | A critical alarm and return.

rlc_nacc.c | 423 | check 4 byte alignment | This alarm can now happen only on the UT platform, so it is left unmodified.

rlc_rra.c | 1080 | While checking that the CS reduction is complete, a defensive test detects that the TBF is an unacknowledged mode TBF, which leads to a fatal alarm. | The check has already been done before this function, so remove the alarm.

rlc_tbf.c | 1050 | While increasing error counters, it is detected that NstagnatingWindowDL is not an error, which leads to a fatal alarm. | A critical alarm and return.

rlc_tbf.c | 1097 | While getting the value of NcReportingPeriodT in 20ms units from the last TRX_SYS_DEFINE_REQ message, it is detected that NcReportingPeriodT>7, which leads to a fatal alarm. | A critical alarm and return the default value.

rlc_tbf.c | 1118 | While getting the value of NcReportingPeriodT in 20ms units from the last TRX_SYS_DEFINE_REQ message, it is detected that NcReportingPeriodT is not in the range from 0 to 7, which leads to a fatal alarm. | A critical alarm and return the default value.

rlc_tbf.c | 1497 | While setting the Polling Period in delayed state according to the NC2 timer, it is detected that tDlDelayed_20ms + NcReportingPeriodT_20ms=0, which leads to a fatal alarm. | A critical alarm; set tDlDelayed_20ms and NcReportingPeriodT_20ms to their default values and do not return.

rltdlaso.c | 362 | Check of the anticipated radio block polling request (MAC constraint: "Enable_Poll will be set if the poll is required by anticipation") | A defence has been added in which RLC sends NO_data to MAC, so remove the alarm here.

rltdlaso.c | 5144 | Unexpected case. If the FATAL alarm occurs, it must be a logical error. | A critical alarm and return N_polling.

rltdlaso.c | 5238 | Unexpected case. | A critical alarm and set T_Polling_interval_20ms to 60ms (Min value).

rltdlaso.c | 8219 | If the FATAL alarm occurs, it means that the ARQ status is wrong. It is a logical error. | A critical alarm.

3.1.3.2.4 Easy fatal cases in SCH:




File name | Line | Description of the case | Defence

schenvso.c | 1267 | In the xinenv function, an unexpected signal is received | A critical alarm, then release the buffer. Some containers carried with the signal may be lost.

schtskso.c | 2593 | 800 us restriction | A critical alarm

schtskso.c | 3256 | An unexpected message comes from PPC: either an unexpected internal message, or a message from PPC that was not assigned here (implicit consumption) | A critical fatal alarm, then release the HPI buffer

3.1.3.2.5 Easy fatal cases in SDL:


File name | Line | Description of the case | Defence

mk_cpu.c | 676 | SDL wants to send a signal, but the length (header+body) is larger than 1620 (size of partition CA_TYPAR_SDT_LARGE) | Replaced by critical-fatal, and the signal is truncated.

3.1.3.3 Recovery of fatal alarm on memory operation

About 23% of the DSP fatal alarms are related to wrong operations on RLC containers, which are the hardest to prevent and the hardest to trace. The implementation of RLC container operations therefore needs to be improved.

3.1.3.3.1 Recall of current implementation:

In the current implementation, the length and address of all containers are fixed at DSP initialisation. As the 4 bytes at the head of every container are used as a pointer, containers of the same class are linked together at initialisation. There are six descriptors in the lsd struct to describe the state of the six classes of containers. (For more detail, see function f_PAD_init.)

When a container is needed, the corresponding descriptor in the lsd struct is used to get a pointer named p_first_free, which points to the first free container. This pointer is then moved to the next container in the chain. (For more detail, see function f_LSD_containers_get.)

When a container or several linked containers are no longer needed, they are linked into the free containers' chain. Note: no check is done on these containers; they are always assumed to be linked together before they are chained. So this action only adds the first container to the free containers' chain and sets the last container's pointer to NULL. (For more detail, see function f_LSD_containers_release.)



In the current implementation, if the pointer is overwritten with a wrong value (e.g. by crossing the container boundary when putting data into a container), the whole chain is broken and a DSP KO is raised. As this container may be relevant to only one TBF or one TRX, this is bad for robustness. It is therefore proposed to separate the data part and the pointer field of the container pool.

3.1.3.3.2 Separation of data and pointer zones

In this solution, the partition of RLC containers is split into two parts: a data table and an index table. The management of the containers is the same as in the old implementation, except that the pointer to the next element is replaced by an index located in another area.

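[Figure: container partition layout, old and new structure.

Old structure: the partition descriptor holds P_first_free, P_last_free, P_begin, P_end, NumberofFreeContainer, NumberMaxOfContainer and ContainerSize. Each container holds a pointer to the next element (@Next container), a Status (to_be_free) and the Data.

New structure: the partition descriptor holds Index_first_free, Index_last_free, P_begin, P_end, NumberofFreeContainer, NumberMaxOfContainer and ContainerSize. The DATA table holds a Status and the Data for each container. The Index table holds the index of the next element (instead of the pointer).]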



The data table is declared as an array. Each element contains a “status” and the pure “data” part of the old container.

The index table is declared as an array of 16-bit integers. There are two kinds of index chains in the index table: one for free containers and one for used containers. In the chain of free containers, each element is the index of the next free container in the data table. In the chain of used containers, each element is the index of the next container used by this instance (RLC/MAC/GCH).

Scenario of getting a free container:

− The user gets Index_first_free from the chain of free containers, then locates the data area in the data table with this index. Index_first_free is set to the index of the next free element.

Scenario of releasing containers:

− The index of the container that is no longer needed is linked into the chain of free containers, just as in the old implementation.

Advantages:

− There is no pointer in the data area, so a write that crosses the boundary of the data part does not break the chain.

− No extra memory is needed compared with the old structure.

Note: this defence also makes sense for the SDL_large container used for the messages “TRX_PM_gauge_cnf”, “TRX_PM_counter_cnf” and “TRX_sample_cnf”. In B10, a problem was encountered that was caused by a write crossing the boundary of the data part of TRX_PM_counter_cnf. Hence, the defence will also be introduced for the SDL_large container.
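
The separation of the data table and the index table can be illustrated with the following minimal sketch. The pool size, container size, the IDX_NIL end-of-chain marker and all function names are assumptions made for the example, and only the free chain is modelled; this is not the PTU implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define POOL_SIZE      4u      /* hypothetical number of containers */
#define CONTAINER_SIZE 16u     /* hypothetical data size in bytes */
#define IDX_NIL        0xFFFFu /* hypothetical "end of chain" marker */

typedef struct {
    uint8_t status;               /* 0 = free, 1 = used */
    uint8_t data[CONTAINER_SIZE]; /* pure data part, no pointer inside */
} container_t;

typedef struct {
    container_t data_table[POOL_SIZE];
    uint16_t    index_table[POOL_SIZE]; /* next-element index, kept apart from the data */
    uint16_t    index_first_free;
    uint16_t    index_last_free;
    uint16_t    number_of_free;
} pool_t;

/* Links all containers into the free chain. */
void pool_init(pool_t *p)
{
    assert(p != NULL);
    memset(p, 0, sizeof *p);
    for (uint16_t i = 0; i < POOL_SIZE; i++)
        p->index_table[i] = (uint16_t)((i + 1u < POOL_SIZE) ? i + 1u : IDX_NIL);
    p->index_first_free = 0;
    p->index_last_free = (uint16_t)(POOL_SIZE - 1u);
    p->number_of_free = (uint16_t)POOL_SIZE;
}

/* Takes the first free container; returns its index, or IDX_NIL if none is left. */
uint16_t pool_get(pool_t *p)
{
    assert(p != NULL);
    uint16_t idx = p->index_first_free;
    if (idx == IDX_NIL)
        return IDX_NIL;
    p->index_first_free = p->index_table[idx];
    if (p->index_first_free == IDX_NIL)
        p->index_last_free = IDX_NIL;
    p->index_table[idx] = IDX_NIL;
    p->data_table[idx].status = 1;
    p->number_of_free--;
    return idx;
}

/* Appends a used container to the free chain. Returns 0, or -1 if idx is invalid or not in use. */
int pool_release(pool_t *p, uint16_t idx)
{
    assert(p != NULL);
    if (idx >= POOL_SIZE || p->data_table[idx].status != 1)
        return -1;
    p->data_table[idx].status = 0;
    p->index_table[idx] = IDX_NIL;
    if (p->index_last_free == IDX_NIL)
        p->index_first_free = idx;
    else
        p->index_table[p->index_last_free] = idx;
    p->index_last_free = idx;
    p->number_of_free++;
    return 0;
}
```

Because the chain lives only in index_table, the release function can also reject an out-of-range or already free index instead of silently corrupting the chain.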

3.1.3.3.3 Impacts

Many code modifications are needed (at both container operation level and user level), which may imply some risks. The effort to introduce this improvement is huge.

Note: this improvement is applicable to both TDM mode and IP mode.

3.1.3.4 Recovery of other critical fatal cases

A few DSP fatal alarms (10%) are related to hardware/interface operations. These fatal alarms are not recoverable.

A few DSP fatal alarms (3%) are related to SDL operations. These DSP fatal alarms can hardly be removed unless the SDT scheduler is abandoned. As the SDT scheduler is the foundation of the PTU SW, removing it would imply a great effort and a high risk. Therefore, these fatal alarms will be kept.

The remaining DSP fatal alarms (3%) can be downgraded to critical DSP alarms, even though removing these fatal alarms is risky from the system point of view (these error cases have impacts on QoS or on the PMU-PTU exchanges). This defence is applicable to both TDM mode and IP mode. The relevant WA is described in the following table.



File name | Line | Description of the case | Defence | Comments

schtskso.c | 1032 | Failed to get a resource of a specified type from the container (for TRX_PM_gauge_cnf) | Don't send this message to PMU, raise an unmaskable alarm | Note: PMU will not receive the TRX_PM_gauge_cnf in this case.

schtskso.c | 1087 | Failed to get a resource of a specified type from the container (for TRX_PM_counter_cnf) | As above | PMU will not receive the TRX_PM_counter_cnf in this case.

schtskso.c | 1038 | Failed to get a resource of a specified type from the container (for TRX_sample_cnf) | As above | PMU will not receive the TRX_sample_cnf in this case.

schtskso.c | 1786 | Unexpected MAC or RLC message attempts to fill the HPI buffer | Discard this message and raise a critical alarm |

schtskso.c | 1926 | Failed to release one container for some UL message in task_hpi_up | Ignore this error and raise a critical alarm |

schtskso.c | 2448 | Failed to get container (for DSP_load_ind) | Don't send the DSP-load-ind to PMU in this case |

schtskso.c | 3385 | Failed to get container (for DSP_congestion_ind) | Don't send the DSP-congestion-ind to PMU in this case |

l1_dlfnc.c | 379 | SW error? | Raise a critical alarm and return |

l1_dlfnc.c | 388 | wrong PDCH number | Raise a critical alarm and return |



3.1.3.5 Reaction of PMU command after DSP KO

According to their treatment, the 390 DSP fatal alarms are sorted in the following table (based on the PTU SW in TDM mode):

Type | Occurrence | Treatment | Comments

Easy fatal cases (less risk) | 24 | One fatal alarm is not necessary and is changed to a non_fatal alarm; the remaining 23 fatal alarms are changed to critical alarms after introducing the relevant WA | Already one in B10 MR1

Fatal cases related to the logic check in RLC | 206 | These fatal alarms are changed to critical DSP alarms, and the relevant TBF will be released by RLC internally | Already one in B10 MR1

Fatal cases relevant to the RLC container operation | 93 | Introduction of the new container management mechanism | Need about 30 Kbytes to implement this WA

Fatal cases that cannot be recovered easily (high risk) | 13 | These fatal alarms will be changed to critical DSP alarms; the WA shall be implemented carefully one by one. | Removing these fatal cases is risky from the systemic viewpoint

Fatal alarm in SDL module | 14 | Keep, unless the SDT scheduler is removed from the PTU SW |

Fatal alarm relevant to the HW problem or the interface | 40 | Not recoverable, keep |

As the table above shows, a DSP KO may still happen after all the above defences are implemented. To improve the robustness of the software, an improvement of the KO mechanism is proposed:

− New shared variables are introduced in PTU memory. PMU can use these variables to send an urgent command to PTU after a DSP KO.

− When a fatal error is detected by the PTU SW, PTU enters a specific loop. In this loop, PTU polls the shared variables.

− When an urgent command is detected in this loop (i.e. the shared variables have been changed by a direct access of PMU), PTU may react to it.

This facility allows PTU to react by itself to some commands from PMU after a DSP KO.
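
One iteration of such a KO loop can be sketched as follows. The mailbox layout, the command codes and the function name are purely illustrative assumptions; the actual shared variables and urgent commands are not specified in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical urgent commands that PMU could post after a DSP KO. */
typedef enum {
    URGENT_NONE       = 0,
    URGENT_DUMP_DEBUG = 1,
    URGENT_RESTART    = 2
} urgent_cmd_t;

/* Hypothetical shared variables located in PTU memory. */
typedef struct {
    volatile uint32_t command; /* written by PMU through direct memory access */
    volatile uint32_t ack;     /* last command accepted by PTU */
} ko_mailbox_t;

/* One polling iteration of the KO loop.
 * Returns the accepted command, or URGENT_NONE if nothing valid is pending. */
urgent_cmd_t ko_loop_poll_once(ko_mailbox_t *mb)
{
    assert(mb != NULL);
    uint32_t cmd = mb->command;
    if (cmd == URGENT_NONE)
        return URGENT_NONE;
    mb->command = URGENT_NONE; /* consume the request */
    if (cmd != URGENT_DUMP_DEBUG && cmd != URGENT_RESTART)
        return URGENT_NONE;    /* unknown code: ignore it */
    mb->ack = cmd;
    return (urgent_cmd_t)cmd;
}
```

In a real KO loop this function would be called repeatedly, for example until a restart command is accepted.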



3.1.3.6 Forced TRX delete procedure

When a cell-blocking case or a PMU-PTU inconsistency case is detected by the TRX defence agent, the PTU is notified with the TBF release (no reply). Following the TBF release, the blocked TRX is deleted in PTU in forced mode (no reply). This defence is applicable to both TDM mode and IP mode.

In IP mode, PTU applies the forced TRX delete procedure when it detects a TCP error. As stated in § 3.1.2.6.2.2 TRX-Level Defence, the PMU cuts the TCP connection after sending the TRx Defence reset and receiving the answer from the BSC. In this case the TrxDeleteRequest in forced mode can be lost in the IPGCH queues.

In PTU, the MAC layer is responsible for the relevant clean-up.

The forced TRX delete defence consists of the following steps in PTU:

− The MAC module checks whether some TBFs are still established on this TRX. If so, the following treatments shall be performed at TBF level:

• MAC sends the TBF-release-req-INT to RLC to delete the contexts of TBFn in RLC.

• After the contexts of TBFn are cleaned in RLC, RLC shall send a message to MAC to clean its contexts in the MAC layer. (After this step, the contexts of TBFn are cleaned in PTU.)

• MAC performs the above 2 steps again for any other TBF in the same situation.

− When all the relevant TBF contexts are cleaned from this TRX, the MAC module checks whether some GCHs are still established on this TRX (TDM only). If so, the MAC layer shall build a special message to the MEGCH layer to release all these GCHs in MEGCH (no reply). Afterwards, all the GCH/PDCH contexts in the MAC layer shall also be cleaned.

− After all the TBF contexts and GCH (TDM only)/PDCH contexts are cleaned, MAC finally deletes the TRX contexts in PTU.

− In this forced TRX delete procedure, PTU does not need to send the following messages to PMU:

• TRX-delete-cnf,

• TBF-release-ind,

• GCH-release-ind (TDM only)

In TDM mode, after the TRX is forcibly deleted by PMU, the TRX can be recreated. As the BTS is not notified of the TRX recreation, some messages sent to the old TRX may be received by the new TRX. The MEGCH layer shall ignore such messages from the BTS to avoid any potential impact on the recreated TRX.

3.1.3.7 Recovery of a DSP KO (fast detection & reload only one DSP)

3.1.3.7 DSP restart (with PTU SW reload)

To prevent a DSP KO from leading to a GPU reset, the DSP fast restart is introduced for TDM mode. It quickly recovers the PS service on the faulty DSP and avoids impacting the 3 other DSPs.



To support the DSP fast restart, PTU shall provide the address of the DSP timer (ms timer) to the DSP driver. If a fatal error is detected by the PTU SW, the DSP timer is frozen. The DSP driver shall read the DSP timer every 100 ms and notify PMU when a DSP KO has happened.

Once a DSP KO is detected by the driver, PMU optionally dumps the DSP memory first, then reloads the PTU SW. Afterwards, PMU sets DSPINT in the HPIC register, and a DSP restart procedure is performed in the driver. After that, PMU sends the DSP-init-req to this DSP. The DSP restart procedure may need PMU to send a primitive to the driver.
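
The frozen-timer detection on the driver side can be sketched as below. The structure, the function names and the rule of two consecutive unchanged reads are assumptions for the example; the text only states that the driver reads the timer every 100 ms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FROZEN_POLLS_FOR_KO 2u /* assumption: consecutive unchanged reads before declaring KO */

typedef struct {
    uint32_t last_seen_ms; /* DSP timer value seen at the previous poll */
    unsigned frozen_polls; /* consecutive polls without progress */
} dsp_watch_t;

void dsp_watch_init(dsp_watch_t *w, uint32_t timer_ms)
{
    assert(w != NULL);
    w->last_seen_ms = timer_ms;
    w->frozen_polls = 0;
}

/* To be called every 100 ms with the current DSP timer value.
 * Returns 1 when the DSP should be reported KO to PMU, 0 otherwise. */
int dsp_watch_poll(dsp_watch_t *w, uint32_t timer_ms)
{
    assert(w != NULL);
    if (timer_ms != w->last_seen_ms) {
        w->last_seen_ms = timer_ms; /* the timer is still running */
        w->frozen_polls = 0;
        return 0;
    }
    w->frozen_polls++;
    return w->frozen_polls >= FROZEN_POLLS_FOR_KO;
}
```

Requiring more than one unchanged read is a design choice of this sketch to tolerate a single poll that lands on the same timer value.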

The defence is only applicable to TDM mode. In IP mode, it makes no sense because:

• In IP, RLC/MAC is located in the BTS. Hence there is no dedicated PTU SW, but an SCP SW (CS code + RLC/MAC code).

• In IP, if the fatal case occurs in the PTU task, OAMsys restarts the PS tasks. The CS tasks are not impacted. In this case, the SW reload is not needed. The IPGCH handler closes the TCP connection, so PMU is aware of the PTU reset.

• In IP, SCP has a "RUN time checksum" and a "download checksum". During normal operation, SCP calculates a checksum (CS code + PS code) and compares it with a previously calculated one. A checksum error is a fatal error for the SCP, which leads to a TRE restart.

3.1.3.8 PTU Code corruption detection (Checksum)

In IP mode, RLC/MAC is located in the BTS. SCP is responsible for protecting the CS code, the PS code and the constant area (with the MMU supervision). Hence, there is no need to control the checksum.

In TDM mode, a runtime checksum is calculated in the background task of PTU. To avoid any impact on the PS service, the runtime checksum is divided into many parts:

− At PTU initialisation, a checksum of the whole PTU code is calculated and memorized in the DSP context.

− Each time the background task is invoked, one part is calculated.

− If a code corruption is detected by the runtime checksum, a DSP KO is triggered.

This sub-feature can be activated/deactivated by a new parameter “En_Runtime_PTUCheckSum”. (Please refer to 3.2.3 Other parameters.)

Warning: when PTU is highly loaded, the time to compute the checksum of the entire PTU code may be very long.
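
The split of the runtime checksum into parts can be sketched as follows. The chunk size, the polynomial checksum and all names are assumptions made for the illustration; the actual algorithm used by the PTU is not described here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_BYTES 4u /* hypothetical amount of code checked per background invocation */

typedef struct {
    const uint8_t *code;      /* start of the code area */
    size_t         size;      /* size of the code area in bytes */
    uint32_t       reference; /* checksum of the whole code, computed at initialisation */
    uint32_t       running;   /* partial checksum of the pass in progress */
    size_t         offset;    /* next byte to check */
} code_check_t;

/* Illustrative polynomial checksum, continued from a previous accumulator value. */
static uint32_t sum_bytes(uint32_t acc, const uint8_t *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        acc = acc * 31u + p[i];
    return acc;
}

/* Initial state: checksum of the whole code, memorized as the reference. */
void code_check_init(code_check_t *c, const uint8_t *code, size_t size)
{
    assert(c != NULL && code != NULL);
    c->code = code;
    c->size = size;
    c->reference = sum_bytes(0u, code, size);
    c->running = 0u;
    c->offset = 0u;
}

/* One background-task invocation: checks one part.
 * Returns 1 if a full pass just ended with a mismatch (corruption), 0 otherwise. */
int code_check_step(code_check_t *c)
{
    assert(c != NULL);
    size_t n = c->size - c->offset;
    if (n > CHUNK_BYTES)
        n = CHUNK_BYTES;
    c->running = sum_bytes(c->running, c->code + c->offset, n);
    c->offset += n;
    if (c->offset < c->size)
        return 0; /* pass still in progress */
    int corrupted = (c->running != c->reference);
    c->running = 0u; /* start a new pass on the next invocation */
    c->offset = 0u;
    return corrupted;
}
```

A return value of 1 corresponds to the point where the text triggers a DSP KO. The smaller the part checked per invocation, the longer a full pass takes, which is the trade-off behind the warning above.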



3.1.4 PMU debugging facilities improvements

3.1.4.1 Debug information on important events

3.1.4.1.1 Debug information for GPU Crash

Today, when the GPU crashes, there is no means to get Debug information before the automatic restart by the Control Station, except through a manual action on site to install a link with the GPU.

o If an exception is raised (e.g. Access Violation), the OS should, before crashing, give control to a user-defined handler that dumps some helpful information: the content of the execution stack, to retrieve the last active piece of code and the chain of calls.

o This information should be saved on a non-volatile medium so that it can be read after the GPU restart.

3.1.4.1.2 Debug information for a GPU Reset or Restart

A GPU Reset or Restart is exclusively triggered by the GPU-Defence-Manager. Before this action the GPU-Defence-Manager issues some Debug information:

o It may request from the local Defence agents a comprehensive dump of data for their supervised entities, depending on the cause of the Reset.

o It guarantees that traces already issued are sent to the Control Station, by waiting a few seconds.

o It may request from the local Defence agents a dump of the last interactions at interface level, depending on the cause of the Reset.

3.1.4.1.3 Debug information for a TRX-Reset

Before undertaking a TRX-Reset, the TRX Defence Agent dumps in traces any helpful internal data of the functional entities related to that TRX.

In addition, the following traces are issued:

o Last exchanges between that TRX and the Transmission-resources management sub-system.

o Last exchanges between that TRX and the cell.

o Last exchanges between that TRX and its supported MS.

o Last exchanges between that TRX and PTU.

o History of automatons and other entities.

This will require new, more interface-oriented histories at TRX level.

3.1.4.1.4 Debug information for a Cell-Reset

Before undertaking a Cell-Reset the Cell Defence Agent performs a comprehensive dump of any helpful internal data of the functional entities related to that cell in traces.




o Last operations in the cell life.

o Last exchanges between all TRX and their cell.

o History of automatons and other entities.

This will require new, more interface-oriented histories at Cell level.

3.1.4.1.5 Debug information for autonomous clean-up

Before undertaking an autonomous Defence-Abort a MS-Context performs a comprehensive dump of any helpful internal data in traces.


o Last interactions with its TRX.

o Last interactions with PTU excluding Data.

Before undertaking an autonomous release of forgotten radio-resources the Cell Garbage-Collector performs the dump of the released resources and any other helpful information.

3.1.4.1.6 Debug information for a DSP KO

In TDM mode, before notifying a DSP KO to PMU, the PTU sends Debug Information including encoded data to PMU. These messages contain any information useful to identify the problem and facilitate investigation.

This Debug Information shall be issued by PTU only for fatal errors and is dumped by the PMU in traces. To facilitate the investigation, the length of the debug information shall be enlarged. PTU shall notify its length and address to PMU in DSP-init-cnf.

In IP mode, when a fatal error is met, PTU shall save the useful information in the post mortem area in SCP. Then the PTU task is restarted by OAMsys. After the PTU task comes back, the debug information stored in the post mortem area is sent to PMU through the alarm indication.

3.1.4.1.7 Debug information for Problem involving the PTU

When the PMU meets a problem where PTU-level debugging data is likely to facilitate investigation, it may request PTU to issue a few Debug Information messages containing encoded data related to a specific TBF or TRX.

These messages are issued only at the request of PMU and are dumped by the PMU in traces. This is applicable to both TDM mode and IP mode.

When the PTU meets a problem and has suspicious GCH Frames at hand, it issues Debug Information messages (existing Dsp-Alarms) containing these frames, to be dumped in GPU Traces by the PTU-PMU Interface layer. This is only applicable to TDM mode.



3.1.4.1.8 Debug information for anomaly detection

When receiving an anomaly indication (current “Data-Err”) from functional entities a Defence agent dumps any helpful internal data related to the supervised entities.

3.1.4.2 Traces improvement (trace dictionary)

A dictionary of major and critical anomalies is provided along with each PMU delivery to allow test teams to detect potential problems. Automatic search of a predefined set of anomalies can be done by specific tools.

On each line, besides the trace number, the Trace server dumps the Date (instead of the task id) and the Time.

Problems are categorized in classes and for each class a predefined mask of traces is provided at each PMU delivery.

Traces are enriched with the following identification data:

o Cell-identity, possibly the TRX-identity in the cell, possibly MS-Reference.

3.1.4.3 Dynamic trace level set

The PTU or PMU software is given the possibility to change the trace level dynamically. This is useful when the software detects a problem.

It is applicable to both TDM mode and IP mode.

Two PMU APIs are added to set and reset the trace levels dynamically, for a period defined in a BTP parameter (see § 3.2.3 Other parameters for more detail).

3.1.4.4 Miscellaneous

Possibility to stop traces when some events have been received.

Possibility to choose the trace directory and the maximum number of files to generate.

3.1.4.5 Traces service improvement

3.1.4.5.1 Traces Encoding (traces size reduction)

3.1.4.5.1.1 Mechanism

Capturing and managing traces in the field with a minimum impact on CPU load, and therefore on Telecom traffic, requires reducing the volume of traces.

This is achieved without loss of information by replacing character strings with indexes:

o This encoding is performed before PMU generation as a pre-processing phase by a Source-Analyser.

o It is applied to both Traces and Anomalies. Consequently the Defence agents collecting anomalies are to use these indexes as well.

o Parameters are not impacted by this encoding.



Along with the executable file, the Source-Analyser generates a Table of correspondence between character strings and indexes. This table allows automatic decoding by a Trace-Analyser.
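
The principle can be sketched as follows: the software emits only an index and its parameter, and the decoding side restores the text from the table. The table content, the record layout and the function names are invented for the illustration; the real table is generated by the Source-Analyser.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical table of correspondence between indexes and character strings. */
static const char *const trace_table[] = {
    "TBF released, cause=",      /* index 0 */
    "TRX reset requested, trx="  /* index 1 */
};
#define TRACE_TABLE_LEN (sizeof trace_table / sizeof trace_table[0])

/* Encoded trace: the string is replaced by its index, the parameter is kept as is. */
typedef struct {
    uint16_t index;
    uint32_t param;
} encoded_trace_t;

/* Encoding side (PMU): builds the compact record. */
encoded_trace_t trace_encode(uint16_t index, uint32_t param)
{
    encoded_trace_t t;
    t.index = index;
    t.param = param;
    return t;
}

/* Decoding side (Trace-Analyser): restores the readable trace.
 * Returns the text length as reported by snprintf, or -1 for an unknown index. */
int trace_decode(const encoded_trace_t *t, char *out, size_t out_len)
{
    assert(t != NULL && out != NULL);
    if (t->index >= TRACE_TABLE_LEN)
        return -1;
    return snprintf(out, out_len, "%s%u", trace_table[t->index], (unsigned)t->param);
}
```

In this sketch the encoded record is a few bytes long whatever the length of the text, which is where the volume reduction comes from.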

3.1.4.5.1.2 Table of indexes management

The table is part of the delivery and labelled with the PMU label. It is therefore part of the PMU configuration management.

Each time the GPU reboots the PMU Marker is dumped in traces with a special code. The Trace server of the Control Station memorises it as the current PMU marker.

The Trace server systematically inserts the current PMU marker at the head of the current Trace file before encoded traces.

o This may happen at a new trace file creation or at rewind to the beginning of a full trace file

3.1.4.5.1.3 Encoded traces exploitation

Two consultation modes are available:

o Decoded traces may be exploited on-line in Alcatel-Lucent Lab.

o Encoded Traces may be exploited off-line from customer site or in Alcatel-Lucent Lab:

• The Trace-Analyser can then perform Filtering at Cell, TRX or MS-level before Decoding, by taking as an input a Cell and/or TRX Identity or a MS-Reference.

• Thus Traces encoding allows Post-Test Filtering. This will ease problem investigation.

3.1.4.5.2 Buffering before sending to the Control Station

Instead of sending many small Traces on the fly to the Control Station, Traces are first stored in a circular buffer. They are then sent in batches through UDP to the CS Trace server, with an amount of data equal to the maximum IP datagram size to avoid segmentation at UDP level.

Fewer losses are expected as fewer disk accesses will be done.
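
The batching principle can be sketched as below. For brevity the sketch uses a linear staging buffer rather than a circular buffer, a deliberately small BATCH_MAX in place of the real maximum datagram size, and a callback in place of the UDP send; all names are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BATCH_MAX 32u /* stand-in for the maximum datagram payload (assumption) */

/* Callback that sends one batch; a real one would send a UDP datagram. */
typedef void (*send_fn)(const uint8_t *data, size_t len, void *ctx);

typedef struct {
    uint8_t buf[BATCH_MAX];
    size_t  used;
    send_fn send;
    void   *ctx;
} trace_batch_t;

/* Example sender that only counts what would have been sent. */
typedef struct {
    unsigned datagrams;
    size_t   bytes;
} send_stats_t;

void counting_send(const uint8_t *data, size_t len, void *ctx)
{
    send_stats_t *s = (send_stats_t *)ctx;
    (void)data;
    s->datagrams++;
    s->bytes += len;
}

void trace_batch_init(trace_batch_t *b, send_fn send, void *ctx)
{
    assert(b != NULL && send != NULL);
    b->used = 0;
    b->send = send;
    b->ctx = ctx;
}

/* Sends the pending traces, if any, as one batch. */
void trace_batch_flush(trace_batch_t *b)
{
    assert(b != NULL);
    if (b->used > 0) {
        b->send(b->buf, b->used, b->ctx);
        b->used = 0;
    }
}

/* Stores one trace; flushes first when it would not fit. Returns -1 if the trace is too large. */
int trace_batch_add(trace_batch_t *b, const uint8_t *trace, size_t len)
{
    assert(b != NULL && trace != NULL);
    if (len > BATCH_MAX)
        return -1;
    if (b->used + len > BATCH_MAX)
        trace_batch_flush(b);
    memcpy(b->buf + b->used, trace, len);
    b->used += len;
    return 0;
}
```

With 10-byte traces and a 32-byte batch, three traces share one send in this sketch instead of causing three separate sends.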

3.1.4.5.3 Review of current traces

The traces of the anomalies most often observed in the field or in the lab must be improved to ease problem investigation.

3.1.4.6 HTML pages improvement

A new Html page is provided at cell-level to dump histories of exchanges between that cell and BSS-Trans. PTU Debug information will be readable from Html Pages (applicable to both TDM mode and IP mode).

Synthesis values based on information sent by PTU are added in the Html pages.

A new script is provided to push out Html pages for a list of cells.



Looking only for existing entities (e.g. Pdch-Groups) would somewhat speed up the capture of Html pages.

The protocol used to capture Html Pages is to be optimised:

o For example, by using fewer requests or more data per request. This should not consume too much CPU load (delay between requests).

o Or by merging Html Pages to reduce the number of pages.

3.1.5 PTU debugging facilities improvements

Although the "Fatal Alarm" causes a DSP KO and a GPU reset in TDM, it still has some advantages: it signals that something went wrong in the PTU SW and, as mentioned above, BAM can dump the whole DSP memory, which improves the efficiency of PTU FR correction. It is therefore proposed that "Fatal Alarms" can be turned on or off for debug purposes.

MFS should define a new parameter En_DspReset_CriticalError as follows:

- When En_DspReset_CriticalError is "off" (default value), the "Fatal Alarm" infinite loop is switched off. When PTU runs into the branch of a previous KO case due to an internal / external error, the recovery / defence code takes over instead of a DSP crash (in TDM mode) or a PTU task restart (in IP mode). At the same time, a DspAlarmInd (alarm type: "Critical_Alarm", with a readable string, e.g. “Err: MCS>MCS9”, plus the long debug information) is sent to PMU. PMU then prints this information through a Data_Err and also stores it in xpu0xx.log (just like fatal_error).

- When En_DspReset_CriticalError is "on" (to be used by MFS test or VAL test, or for debug purposes in the field), the behaviour is the same as before. In TDM mode, the DSP goes "KO", so that the bug is signalled and the whole DSP memory dump is still available to speed up the problem investigation. In IP mode, the PTU task is restarted and the short post mortem information is sent to PMU via Abis. If the whole memory dump is necessary for the investigation, the tester can dump the PTU memory with the SCP/PTU IP debug and monitor tool.

3.1.5.1 Trace filters of PMU-PTU interface trace

To capture the most valuable PMU-PTU interface traces and reduce the impact on the PMU load, a few new trace filters for PMU-PTU interface messages will be introduced in the PMU entity.

− To facilitate the investigation, the PMU-PTU messages can be filtered on several classes, e.g.:

o DSP class, including all the messages at DSP level

o TRX class, including all the messages at TRX level

o GCH class, including all the messages related to the GCH establishment or release

o PDCH class, including all the control messages related to one PDCH

o TBF class, including all the control messages to the TBF entity

o Data class, including the data messages to the TBF entity

o PM counter class, including the messages relevant to PM counters

o Etc.




Moreover, to investigate a problem related to a specific DSP/TRX, the tester may ask PMU to capture the traces dedicated to one DSP (e.g. only the traces on DSP 0 in TDM mode) or one TRE (in IP mode). PMU shall be able to dump this kind of PMU-PTU interface trace.

3.1.5.2 Dynamic PMU-PTU interface trace

Generally, the trace filters are configured in the xxx.ini file. The tester can use the IMT or the HTML pages to change the trace level online (without reboot).

To capture the helpful PMU-PTU interface traces, PTU may dynamically ask PMU, on a traffic event or a bug detection, to turn off the filters of specific message classes. When PTU detects a major error, it shall notify PMU to turn off the filters related to this error. Once PMU gets this indication, it traces the relevant PMU-PTU messages, if any.

For instance: in TDM mode, when PTU detects a GCH error on DSP 2, it may ask PMU to trace the GCH class PMU-PTU interface messages on DSP 2. In IP mode, if PTU detects a memory congestion error, it may ask PMU to trace the DATA class PMU-PTU interface messages on this TRE.

If PTU does not repeat the trace level indication within half an hour, the trace filter is turned on again.

This activation by PTU must be controlled by an option given in the DSP_INI_Req message, probably in the same way as a trace is set.
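The half-hour expiry described above can be pictured with the following Python sketch. It is an assumption-laden illustration, not the PMU design: the class and method names are hypothetical, time is passed in explicitly as seconds, and treating exactly 30 minutes as expired is an arbitrary choice.

```python
TRACE_MAX_LIFETIME_S = 30 * 60  # half an hour, as stated in the text


class DynamicTraceFilters:
    """Tracks which (message class, DSP) pairs PTU has asked PMU to trace."""

    def __init__(self, lifetime_s: float = TRACE_MAX_LIFETIME_S) -> None:
        self.lifetime_s = lifetime_s
        self._opened_at: dict[tuple[str, int], float] = {}

    def on_ptu_indication(self, msg_class: str, dsp: int, now: float) -> None:
        # A repeated indication refreshes the timestamp and keeps the filter off.
        self._opened_at[(msg_class, dsp)] = now

    def is_traced(self, msg_class: str, dsp: int, now: float) -> bool:
        opened = self._opened_at.get((msg_class, dsp))
        if opened is None:
            return False
        if now - opened >= self.lifetime_s:
            # Not repeated in time: the filter is turned on again.
            del self._opened_at[(msg_class, dsp)]
            return False
        return True
```

With this sketch, an indication for the GCH class on DSP 2 enables tracing for that class and DSP only, and tracing stops once 30 minutes pass without a new indication.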


3.1.5.3 Dynamic trace of DSP alarm indication

According to the experience of B9 maintenance, the DSP-alarm-indication filters are normally turned on in the field. To improve the investigation, PTU may dynamically ask PMU to turn off the DSP alarm filters when it detects a major error on one DSP.

Besides, the PTU issues on one DSP are independent of the other DSPs. Hence, such a PTU indication applies only to the associated DSP.

This activation by PTU must be controlled by another option given in the DSP_INI_Req message, probably in the same way as a trace is set.

E.g.: in TDM mode, when PTU detects a CRC error on DSP 1, it may ask PMU to trace the GCH class DSP-alarm-ind on DSP 1. In IP mode, when RLC detects an error due to the loss of a UL block, PTU may ask PMU to trace the MAC class alarm indication on this TRE.

If PTU does not repeat the alarm trace level indication within one hour, the alarm trace filter returns to the default value defined in the xxx.ini file.


3.1.5.4 Improvement of PTU internal Debug trace

In PTU, an SHT ("perf.sht") area is used to save the history of the SDL signals processed. The current SHT area can keep around 5 seconds of traces in PTU. These traces are very important for the DSP KO investigation.



For some particular PTU errors, the last second of traces with more debug information may be more helpful for the DSP KO investigation. For this reason, dynamic SHT traces will be implemented.

At PTU start, the default short SHT traces are saved in PTU. When PTU detects a critical error (with En_DspReset_CriticalError set to "off"), it may adjust the SHT trace type according to the specific error.

For instance: when PTU detects a TBF internal release, it may use the long SHT traces from this moment on. The SHT trace level is kept until another critical alarm that needs the short SHT level is raised.

Furthermore, to have a complete history of PTU signals, the non-SDL signals (e.g. TRX-sample-req) shall also be traced in SHT.

Note: in IP mode, the PTU memory is not dumped by default, given the cost on Abis. The PTU memory dump is only activated on request of the remote tools.

3.1.5.5 Partly memory dump for critical alarm

When En_DspReset_CriticalError is "off", the DSP does not go KO when a critical DSP alarm is met. Consequently, PMU does not dump the complete DSP core. To facilitate the investigation of the critical DSP error, the related PTU contexts can be sent to PMU and dumped by PMU in traces when the relevant trace level is activated.

It is applicable for both TDM and IP mode. In IP mode, the PTU memory is dumped by OAMsys and charged by the SCP/PTU IP debug and monitor tool. Hence, when a memory dump is necessary, the entire PTU memory is dumped by the remote tools (or through the USB port).

3.1.5.6 PTU internal trace for DSP alarm indication

From the experience of B9 maintenance, the history of non-fatal DSP alarm indications may help to locate the root cause of a PTU issue quickly. Hence, the history of non-fatal DSP alarm indications will now be traced in PTU memory.

Around 12 Kbytes of memory are needed to save the history of small-sized non-fatal DSP alarm indications. Each item consumes 4 bytes. Hence, PTU may hold at most 512 256 historic DSP alarm indications when a fatal/critical error occurs.

For a critical DSP error, in both TDM and IP mode, PTU sends the critical alarm indication to PMU. In addition, the history of non-fatal DSP alarms can be sent to PMU as long as the relevant trace level is activated.

For a fatal error, in TDM mode, PMU dumps the entire DSP memory.

In IP mode, by default, PTU saves part of the non-fatal alarm history and the latest SHT traces to the post mortem area and provides them to SIDMO.
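A bounded alarm history of this kind is commonly kept in a ring buffer, where the newest entry overwrites the oldest once the area is full. The Python sketch below only illustrates that idea: the class name, the integer alarm codes and the capacity are hypothetical, and the real capacity depends on the memory figures given above.

```python
from collections import deque


class AlarmHistory:
    """Keeps only the most recent non-fatal alarm indications."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen drops the oldest entry automatically when full.
        self._items: deque[int] = deque(maxlen=capacity)

    def record(self, alarm_code: int) -> None:
        self._items.append(alarm_code)

    def snapshot(self) -> list[int]:
        """Oldest-first copy, e.g. to attach to a critical alarm indication."""
        return list(self._items)
```

When a critical or fatal error occurs, snapshot() gives the alarms that preceded it, which is the information the section proposes to make available for investigation.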

3.1.5.7 Forced DSP KO

3.1.5.7.1 Critical DSP alarm

To prevent QoS from being heavily impacted by critical DSP errors over a long period, PTU will maintain a new Critical_alarm_counter.

When the number of DSP critical alarm indications exceeds the relevant threshold (e.g. 10), PTU may forcibly turn on "En_DspReset_CriticalError". As a result, when PTU encounters the critical alarm again, the DSP goes KO (in IP mode, the PTU task is restarted).



To avoid DSP KOs of little use, a new parameter can be introduced for the threshold (in DSP-init-req):

Max_Nb_Of_Critical_Alarm.

A specific value (i.e. 0x00) may be used to deactivate this option.

It is applicable for both TDM mode and IP mode. In IP mode, to avoid impacts on CS traffic, En_DspReset_CriticalError shall be hardcoded to true in PTU SW.
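The forced DSP KO logic of this section can be sketched as follows in Python. This is a simplified illustration under stated assumptions: the class and attribute names are hypothetical, and the conditions under which the counter is cleared are not modelled because the text does not specify them.

```python
class CriticalAlarmCounter:
    """Forces En_DspReset_CriticalError on once too many critical alarms occur."""

    def __init__(self, max_nb_of_critical_alarm: int) -> None:
        self.max_nb = max_nb_of_critical_alarm  # 0 deactivates the forced DSP KO
        self.count = 0
        self.reset_forced = False

    def on_critical_alarm(self) -> bool:
        """Count one critical alarm; return True if the DSP must now go KO."""
        if self.reset_forced:
            # The flag was forced on earlier: this alarm leads to a DSP KO.
            return True
        self.count += 1
        if self.max_nb != 0 and self.count > self.max_nb:
            # Threshold exceeded: the next critical alarm will cause a DSP KO.
            self.reset_forced = True
        return False
```

With a threshold of 2, the first three critical alarms are handled by the defence code and the fourth one leads to a DSP KO; with a threshold of 0, no DSP KO is ever forced.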

3.1.5.7.2 Fake DSP KO

In B9, many fake DSP KOs were observed on VAL or in the field. For such a DSP KO, the memory dump of the DSP core is useless for the investigation, as the PTU is still running while PMU is dumping the DSP core.

To facilitate the investigation of a fake DSP KO detected by the DSP-PPC driver, the driver shall notify PTU to raise a real DSP KO before PMU dumps the DSP core. The driver does this by setting an internal flag "Stop_PTU" in the DSP internal memory through a PCI write access.

Meanwhile, the DSP KO counter is not increased. As we know, the so-called fake DSP KO is more serious than a real DSP KO. The fake DSP KO must lead to a GPU reset.

It is only applicable for TDM mode.

No defence is provided against fake DSP KOs detected by PMU, as most of them can be recovered by the TRX reset procedure.

3.1.5.8 Improvement of DSP alarm mechanism

The DSP alarm indication can be improved in the following manner:

− All non-error DSP alarm indications shall be traced in a new PMU-PTU interface message "DSP-trace-ind" in TDM mode. In IP mode, this depends on the configuration of SIDMO (SCP/PTU IP debug and monitor tool): when the PTU IP traces are enabled, the PTU trace indication is sent to SIDMO over Abis. (Note: the SIDMO terminal can be connected from any point in the BSS IP network.) Hence, in IP mode, the trace indication is out of the scope of the PMU-PTU interface.

Note: In IP mode, we can also get the traces from USB, LA. For detailed information, please refer to the "RLC/MAC in BTS step3" document.

− The remaining DSP alarm indications shall be redefined according to their severity and module, e.g. minor internal error in GCH module, or minor external error.

− PTU may ask PMU to provide different alarm trace filters based on the severity.

− PTU may also ask PMU to provide different alarm trace filters based on the module.

− A new mixed trace type (legible traces + normal debug information) shall be introduced. In TDM mode, the long DSP alarm type is recommended for all critical/fatal DSP alarms. In IP mode, as no PTU memory dump is available by default, both the alarm indication with long DSP alarm type and the new message introduced to report DSP contexts can be used to report the debug info. When the PTU task is restarted, the debug info saved in the SCP post mortem area is sent to PMU. For the mixed trace type used by a critical or fatal error, PMU may be asked to print the legible traces in the log file.



3.1.5.9 Synthesis debug information to be shown in html page

To have a global view of QoS and PTU status, PTU will report a PTU-Status-indication to PMU every 30 seconds.

The PTU-status-indication message contains a few new debug counters in the RLC/MAC/GCH modules, e.g. a new counter of GCH UL losses of synchronisation and a new counter of NACK confirmations of TBF reallocation requests.

PMU will provide a new Html page to dump the history of PTU debug information. This will be done via BssTrans.


3.1.5.10 Dedicated trace file for the specific PTU traces

In B9, the DSP-init-req and DSP-modify-req messages are hardly ever available when investigating a slow QoS degradation problem, even though tracemon can be used to trace to different trace files. The DSP-init/modify-req messages are also very useful to investigate migration issues. Hence, it is proposed to print these traces to a separate trace file.

Besides, the memory dump files may be overwritten by new ones because they use the same trace filename. This means that if a DSP KO happens twice on DSP 0, only the latest memory dump files are available. To improve this, PMU may be asked to introduce a timestamp in the filename of the COFF file (TDM only).
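As an illustration of the timestamp proposal, the Python sketch below builds a dump filename that differs for each DSP KO. The naming pattern (prefix, timestamp layout, ".coff" extension) is purely hypothetical; the text only asks for a timestamp in the filename.

```python
from datetime import datetime


def dump_filename(dsp_index: int, when: datetime) -> str:
    """Build a per-event dump filename (hypothetical naming scheme)."""
    # One-second resolution is enough to separate successive KOs on one DSP.
    return f"dsp{dsp_index}_{when:%Y%m%d_%H%M%S}.coff"
```

Two DSP KOs on DSP 0 at different times then produce two distinct files, so the first dump is no longer overwritten by the second.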

3.1.5.11 Other improvements

− Due to the Dynamic Abis feature, K12 can hardly track all GCHs linked to a cell. To improve this, K12 will be asked to capture the traces dynamically. This is a requirement on K12.

− It is often difficult to align the GPU traces and the K12 traces. To facilitate the investigation of QoS issues, it is proposed to synchronise the clock of K12 with the trace server before capturing the K12 trace. Aligning the K12 trace with the GPU traces is very helpful to understand the PTU behaviour and to locate the problem. This is a requirement on the tester.

3.1.5.12 PTU restrictions

The main restriction in PTU is the DSP memory.

According to the latest PTU version (B10-MR1-17), there are 156k bytes left for further CRs/FRs/improvements.

Open point: can we sacrifice dimensioning to gain stability and powerful traces?

Answer: Yes in the PTU, if the GPU dimensioning (PMU part) is kept.

Another restriction is the CPU performance. We must be careful when we plan to introduce a new defence which has negative impacts on the PTU performance.



3.1.6 Features priorities

FEATURE | PRIORITY
Defence against leaks of GPU Buffers | Mandatory
Defence against leaks of GPU-Signals | Mandatory
Prevention of buffers congestion | Recommended
Memory Access Control | Recommended
PMU Code corruption detection (checksum) | RecommendedMandatory
PTU Code corruption detection (Checksum) | Recommended
Stack overflow | Nice to have
Defence against Timers corruption | Nice to have
TRX-Level Defence and GPU defence manager | Mandatory
GPU-confined Cell-Reset | Mandatory
BSS-level Cell-Reset | Nice to have
Defence against frozen MS-Contexts | Mandatory
Defence against forgotten Radio-Resources | Mandatory
Detection of blocked automatons | Nice to have
Recovery after a DSP KO (fast detection & reload only one DSP) | Mandatory
Gpu frozen with detection from the Control Station | Recommended
Fast Restart without GPU reload from Control Station | Recommended
Fast Restart without GPU reload from Control Station and without configuration reload | Nice to have
PTU proxy and PMU IPGCH IP defence | Mandatory
Recovery of RLC fatal case related to one TBF | Mandatory
DSP easy fatal error recovery | Mandatory
DSP other fatal error recovery | Mandatory



Fatal alarm on container recovery | Recommended
DSP critical fatal error recovery | Mandatory
Reaction of PMU command after DSP KO | Nice to have
PMU Debugging facilities improvements (Debug information on important events) | Mandatory
PMU traces improvement with dynamic trace level set | Mandatory
PTU traces improvement with dynamic trace level set | Mandatory
Anomalies dictionary | Recommended
Traces size reduction | Mandatory
Buffering before sending to the Control Station | Mandatory
Review of current traces | Recommended
HTML pages improvement | Nice to have
PTU dynamic trace (new trace filters, dynamic trace level set, dynamic alarm indication) | Mandatory
PTU debugging facilities improvement (improvement of alarm mechanism, partly memory dump, alarm history, forced DSP KO) | Mandatory



3.1.7 Features per release



FEATURE B10 B11 >B11

Defence against leaks of GPU Buffers Yes

Defence against leaks of GPU-Signals Yes

Prevention of buffers congestion Yes

Memory Access Control Yes Yes

PMU Code corruption detection (checksum) Yes (1) Yes

PTU Code corruption detection (Checksum) Yes

Stack overflow Yes Yes

Defence against Timers corruption Yes Yes

TRX-Level Defence and GPU defence manager Yes

GPU-confined Cell-Reset Yes Yes

BSS-level Cell-Reset Yes

Defence against frozen MS-Contexts Yes

Defence against forgotten Radio-Resources Yes

Detection of blocked automatons Yes


Yes

Gpu frozen with detection from the Control Station

Yes Yes


Yes Yes


Yes

PTU proxy and PMU IPGCH IP defence Yes

Recovery of RLC fatal case related to one TBF Yes

DSP easy fatal error recovery Yes

DSP other critical fatal error recovery Yes

Fatal alarm on container recovery Yes Yes

Reaction of PMU command after DSP KO Yes

PMU Debugging facilities improvements (Debug information on important events)

Yes


Yes


Yes

Anomalies dictionary Yes Yes

Traces size reduction Yes

Buffering before sending to the Control Station Yes

Review of current traces Yes Yes

HTML pages improvement Yes Yes




Yes


Yes

(1) PTU at least



3.1.8 MFS HW coverage

FEATURE | GPU2 | GPU3 | MX_GP | IP
Defence against leaks of GPU Buffers | Yes | Yes | Yes | Yes
Defence against leaks of GPU-Signals | Yes | Yes | Yes | Yes
Prevention of buffers congestion | Yes | Yes | Yes | Yes
Memory Access Control | No (1) | Yes | Yes | Yes
PMU Code corruption detection (checksum) | Yes | Yes | Yes | Yes
PTU Code corruption detection (Checksum) | Yes | Yes | Yes | Yes
Stack overflow | Yes | Yes | Yes | Yes
Defence against Timers corruption | Yes | Yes | Yes | Yes
TRX-Level Defence and GPU defence manager | Yes | Yes | Yes | Yes
GPU-confined Cell-Reset | Yes | Yes | Yes | Yes
BSS-level Cell-Reset | Yes | Yes | Yes | Yes
Defence against frozen MS-Contexts | Yes | Yes | Yes | Yes
Defence against forgotten Radio-Resources | Yes | Yes | Yes | Yes
Detection of blocked automatons | Yes | Yes | Yes | Yes
Recovery after a DSP KO (fast detection & reload only one DSP) | No (1) | Yes | Yes | Yes(3)
Gpu frozen with detection from the Control Station | Yes | Yes | Yes | Yes
Fast Restart without GPU reload from Control Station | Yes | Yes | Yes | Yes
Fast Restart without GPU reload from Control Station and without configuration reload | Yes | Yes | Yes | Yes
PTU proxy and PMU IPGCH IP defence | Yes | Yes | Yes | Yes
Recovery of RLC fatal case related to one TBF | Yes | Yes | Yes | Yes
DSP easy fatal error recovery | Yes | Yes | Yes | Yes
DSP other critical fatal error recovery | Yes | Yes | Yes | Yes
Fatal alarm on container recovery | Yes | Yes | Yes | Yes
Reaction of PMU command after DSP KO | Yes | Yes | Yes | No
PMU Debugging facilities improvements (Debug information on important events) | No (1) | Yes | Yes | Yes(2)
PMU traces improvement with dynamic trace level set | Yes | Yes | Yes | Yes
PTU traces improvement with dynamic trace level set | Yes | Yes | Yes | Yes
Anomalies dictionary | Yes | Yes | Yes | Yes
Traces size reduction | Yes | Yes | Yes | Yes
Buffering before sending to the Control Station | Yes | Yes | Yes | Yes
Review of current traces | Yes | Yes | Yes | Yes
HTML pages improvement | Yes | Yes | Yes | Yes
PTU dynamic trace (new trace filters, dynamic trace level set, dynamic alarm indication) | Yes | Yes | Yes | Yes
PTU debugging facilities improvement (improvement of alarm mechanism, partly memory dump, alarm history, forced DSP KO) | Yes | Yes | Yes | Yes(4)

(1) Yes with Telecom GPU capacity reduction

(2) Except the improvements that make no sense in IP mode.

(3) No for G3 TRE if there is memory issue in G3 TRE

(4) No for “forced DSP KO”



3.1.9 Interfaces

No impact on Bss interfaces

3.1.9.1 Radio interface (05.02, 04.06, 04.60, 04.18, 24.008, etc)

3.1.9.2 Abis interface (08.58)

3.1.9.3 A interface (08.08)

3.1.9.4 Gb interface (08.18)

3.1.9.5 BSCGP interface

3.1.9.6 GCH interface

3.1.10 Simulations

None

3.2 Operation and maintenance

3.2.1 OMC-R parameters

To Be CompletedNone

Parameter name | Definition | Sub-system | Instance | Category1 / OMC-R access2 | Type3 | Range / default value

Provide here the extra O&M information: SC/PRC, import/export, multiple cell selection, filter, OMC screen... (to be filled in by O&M people). Indicate also if the default value of a parameter depends on the BCCH range, the cell type, the number of TRXs, etc. Indicate specific information, such as the need for a variable step size, etc.

3.2.2 Modelisation of OMC-R parameters

To Be CompletedNone

3.2.3 Other parameters

To Be Completed

1 The valid options are: Site (CAE), Network (CDE), System (CST), Not Used (NU)

2 The valid options are: changeable, set by create, displayed, OMC local display, Virtual – changeable, Virtual – displayed.

3 The valid options are: abstract, flag, list of numbers, number, reference, threshold, timer.



Parameter name: EN_DSPRESET_CRITICALERROR
Definition: Enables/disables the DSP reset in case of "Critical Error"
Sub-system: MFS
Instance: MFS
Category / OMC-R access (4): None (DLS)
Type (5): Flag
Range / default value: 0..1
0: No DSP reset is triggered in case of Critical Error.
1: DSP reset is triggered in case of Critical Error.
Default: 0
Note: It is significant only in TDM mode

Parameter name: MAX_NB_OF_CRITICALALARM
Definition: The maximum number of consecutive critical DSP alarms
Sub-system: MFS
Instance: MFS
Category / OMC-R access: None (DLS)
Type: Number
Range / default value: 0..50
0: deactivates the forced DSP KO
Default: 10
Note: It is significant only when enDspResetCriticalError = false

Parameter name: EN_Dynamic_TraceLevel
Definition: Enables/disables the dynamic modification of the trace level
Sub-system: MFS
Instance: MFS
Category / OMC-R access: None (DLS)
Type: Flag
Range / default value: 0..3
- 0: "off", DSP cannot ask PMU to change the trace level
- 1: DSP can ask PMU to change the trace level of the common PMU-PTU interface messages (excluding dsp-alarm-ind)
- 2: DSP can ask PMU to change the filters of dsp-alarm-ind
- 3: all filters can be changed by DSP
Default value: 2

Parameter name: Initial_PTU_TraceFilter
Definition: The initial PTU trace level of common PMU-PTU interface messages
Sub-system: MFS
Instance: MFS
Category / OMC-R access: None (DLS)
Type: Flag
Range / default value: 0x0000..0xffff
Bit[0]: corresponds to DSP class
- 0: "off", the trace of this class is not to be collected.
- 1: "on", the trace of this class is to be collected.
Bit[1]: corresponds to TRX class
Bit[2]: corresponds to GCH class
Bit[3]: corresponds to TBF class
Bit[4]: corresponds to DATA class
Bit[5]: corresponds to PM counter class
Bit[6]: corresponds to PDCH class
Bit[7]: corresponds to trace class
Bit[8]: corresponds to RRM Err class
Bit[9]: corresponds to Major_Err_Trace class
Bit[10..15]: reserved
Default: 0x0301

(4) The valid options are: None (in DLS), None (not in DLS)

(5) The valid options are: abstract, flag, list of numbers, number, reference, threshold, timer.

Parameter name: Initial_PTU_AlarmFilter
Definition: The initial PTU trace level of the dsp-alarm-ind
Sub-system: MFS
Instance: MFS
Category / OMC-R access: None (DLS)
Type: Flag
Range / default value: 0x0000..0xffff
Bit[0..4]: one bit per minor alarm type (Minor_GCH_err, Minor_MAC_err, Minor_RLC_err, Minor_Misc_err, Minor_Ext_err)
- 0: "off", the trace of this alarm type is not to be collected
- 1: "on", the trace of this alarm type is to be collected
Bit[5..15]: reserved
Default: 0x001F

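Parameter name: BufferMaxLifeTime
Definition: Max duration of a GPU buffer
Sub-system: MFS
Instance: MFS
Category / OMC-R access: None (DLS)
Type: Number
Range / default value: 10 to 100 minutes, default 30 minutes

Parameter name: SignalMaxLifeTime
Definition: Max duration of a Signal
Sub-system: MFS
Instance: MFS
Category / OMC-R access: None (DLS)
Type: Number
Range / default value: 10 to 255 minutes, default 10 minutes

Parameter name: PdchGroup blocking detection delay
Definition: Timer to detect the Pdch-Group automaton blocking
Sub-system: MFS
Instance: MFS
Category / OMC-R access: None (DLS)
Type: Number
Range / default value: 10 to 255 seconds, default 40 seconds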



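Parameter name: GPUDefense Guard Timer
Definition: GPU Defence Manager uses a dedicated Guard Timer to supervise Defence Actions. If the timer elapses, the action is a GPU reset.
Sub-system: MFS
Instance: MFS
Category / OMC-R access: None (DLS)
Type: Number
Range / default value: 10 to 255 minutes, default 10 minutes

Parameter name: PTU TraceMaxlifeTime
Definition: If the trace level indication is not repeated by PTU before PTU TraceMaxlifeTime, the trace filter is turned on.
Sub-system: MFS
Instance: MFS
Category / OMC-R access: None (DLS)
Type: Number
Range / default value: 10 to 255 minutes, default 30 minutes

Parameter name: PMU TraceMaxlifeTime
Definition: Period during which the PMU software can dynamically set the trace level with a specific API
Sub-system: MFS
Instance: MFS
Category / OMC-R access: None (DLS)
Type: Number
Range / default value: 10 to 255 minutes, default 30 minutes

Parameter name: PTU AlarmMaxlifeTime
Definition: If the alarm trace level indication is not repeated by PTU before PTU AlarmMaxlifeTime, the alarm trace filter returns to the default value defined in the xxx.ini file.
Sub-system: MFS
Instance: MFS
Category / OMC-R access: None (DLS)
Type: Number
Range / default value: 10 to 255 minutes, default 60 minutes

Parameter name: En_Runtime_PTUCheckSum
Definition: Enables/disables the control of PTU checksum
Sub-system: MFS
Instance: MFS
Category / OMC-R access: None (DLS)
Type: Flag
Range / default value: 0..1
0: the control of PTU checksum is disabled
1: the control of PTU checksum is enabled
Default: 1
Note: It is only significant in TDM mode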
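The bit assignments given for Initial_PTU_TraceFilter in the parameter list above can be read with a simple mask test. The Python sketch below decodes a filter value into the enabled trace classes; the dictionary and function names are hypothetical, only the bit positions and the default value 0x0301 come from the parameter definition.

```python
# Bit position -> trace class, as listed for Initial_PTU_TraceFilter.
TRACE_CLASS_BITS = {
    0: "DSP",
    1: "TRX",
    2: "GCH",
    3: "TBF",
    4: "DATA",
    5: "PM counter",
    6: "PDCH",
    7: "trace",
    8: "RRM Err",
    9: "Major_Err_Trace",
}


def enabled_trace_classes(mask: int) -> list[str]:
    """Return the classes whose bit is set to 1 ("on") in the filter value."""
    return [name for bit, name in TRACE_CLASS_BITS.items() if mask & (1 << bit)]
```

The default value 0x0301 has bits 0, 8 and 9 set, so enabled_trace_classes(0x0301) gives the DSP, RRM Err and Major_Err_Trace classes.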

3.2.4 PM counters



No new PM counter is defined for the moment.

In case of abnormal release, some new causes corresponding to the TRX reset have to be foreseen.

Counters in the BSC

Counter number

Mnemonic Definition Type Measured object

Counters in the MFS

Counter number

Mnemonic Definition Measured object

3.2.5 PM indicators

To Be Completed

Mnemonic Definition Formula

3.2.6 Migration

Not applicable

3.2.7 Java scripts

No impact

3.2.8 Fault Management

To Be Completed

3.2.9 O&M Specification impacts

None

3.3 Validation

To be filled in by VAL people, when the technical content of the SFD is stable.

3.3.1 Testing tools

This section gives the list of tools needed to validate the feature. For each tool, it is mentioned :

- if the tool is a new one



- if the tool is modified compared to the previous release

- if the tool is used without modification compared to the previous release

Note that "tools" should include everything that is external to the BSS product: this includes mobiles, SGSN, etc.

3.3.2 Test strategy

3.3.2.1 System tests coverage

This section gives a first estimate of the split between the subsystem tests and the system tests. The following points are addressed:

- what is and is not to be explicitly tested at system level?

- can the feature be validated completely at sub-system level?

- what is the added value of system tests compared to sub-system tests?

3.3.2.2 Overall strategy for system tests

This section gives the guidelines that will be followed when defining the test plan. It gives, for system tests, the categories of tests that are needed to validate the features.

- functional tests: tests performed with MS simulators and CORE network simulators. The purpose is to validate basic scenarios, error cases, etc. Note that the real added value of functional tests versus end to end tests needs to be addressed.

- end to end tests: tests performed with real mobiles and a real CORE network. The purpose is to validate the feature from an end user point of view.

- performance tests: this is a specific case of end to end tests with a specific purpose: measurement of the improvement brought by the feature. Note that the objective in terms of performance improvement has to be described in the technical part of the SFD.

- Telecom and O&M load tests

- migration tests: it should be mentioned if specific migration tests are needed because of this feature

- industrialisation tests: it should be addressed in co-operation with the SED team (see section on impact on methods)

3.4 Methods

To Be Completed

3.5 GCDs

No impact

3.6 Engineering rules

No impact

4 SUBSYSTEM IMPACTS



All impacts are on the MFS, in its PMU, PTU or OAM internal subsystems.

4.1 BTS

None

4.2 BSC

As a defence mechanism, the MFS can trigger a TRX reset, which consists of a complete clean-up of a TRX done in the MFS when a critical error has been detected.

The TRX reset ensures that all PS traffic is stopped on this TRX, that transmission and radio resources are released and that all corresponding traffic information is deleted.

A new message RESET TRX PS INDICATION is sent to the BSC. This message indicates the identity of a list of TRXs. The BSC answers with RESET TRX PS ACK.

4.3 Transcoder

None

4.4 MFS

Refer to system impacts description.

4.5 OMC-R

None

4.6 LASER

None

4.7 MPM/NPA/RNO

None

4.8 Polo

None

4.9 OEF

None



5 PERFORMANCE & SYSTEM DIMENSIONING

5.1 Traffic model

No impact on any traffic model, whatever it is.

5.2 Performance

Avoid big performance degradation (3% max can be acceptable). We have to consider and balance the impacts before their introduction. There are 2 kinds of impacts:

(1) GPRS processing speed degraded (more defence and check mechanisms introduced);

(2) GPRS capacity and dimensioning decreased.

5.3 Load constraints

To Be Completed

6 OPEN POINTS



7 IMPACTS SUMMARY

Equipments:

BTS BSC MFS TC OMC-R LASER MPM/NPA RNO OEF Polo

x X

Interfaces:

Telecom

Radio Abis A Ater BTS-TC MFS-BTS MFS-BSC Gb

X

O&M

Abis-O&M BSC-O&M TC-O&M MFS-O&M Q3

To be filled in by O&M experts.

List of Impacted Step2: List the Step2 documents which are impacted by the present feature.

8 GLOSSARY

8.1 Abbreviations

SIDMO: SCP/PTU IP debug and monitor tool

USB

SCP: Main board of TRE containing the PTU in IP mode I

SHT: SSD Signal Historical trace inside the PTU

8.2 Terminology

Give, when necessary an unambiguous definition of terms and concepts used in the present document. Reference to standards is allowed.



END OF DOCUMENT
