CEAS 2009Aug31 CopernicusTechnologyLimited NoFaultFound Paper(Final)


Copernicus Technology Ltd 2009

No Fault Found (NFF) occurrences and Intermittent Faults: improving

Availability of aerospace platforms/systems by refining Maintenance

Practices, Systems of Work and Testing Regimes to effectively identify

their root causes

J D Cockram BEng(Hons) CEng MRAeS, Copernicus Technology Limited

G M Huby BEng(Hons) CEng MRAeS, Copernicus Technology Limited

ABSTRACT

The adoption of preventive and corrective maintenance strategies that both provide

aircraft availability and assure safety, at minimum cost, is fundamental to aerospace

operations in all sectors. To provide aircraft availability with even greater success, a

change to traditional maintenance approaches is required: from assumptions-based

approaches and speculative component replacements, to knowledge-based strategies. One key area where knowledge-based approaches remain unexploited is the No Fault

Found (NFF)[a] scenario, for which intermittency in electrical and electronic component

circuitry is a major cause.

Tackling the NFF issue head on is, perhaps, what many maintenance managers would

like to do, but it is more complex than simply trying to eliminate problems by

speculative component changes and/or manpower resources alone. If it were that simple,

the issue would have long been consigned to the history books, but it has been estimated

that intermittency and NFFs account for a major proportion of fault occurrences in

aerospace maintenance organisations.

The key themes of this paper are as follows:

- Treating a NFF occurrence as a Diagnostic Failure, and the impact and causes of those Diagnostic Failures.

- Mitigation of human factors and culture issues in maintenance systems of work, as they pertain to NFF.

- The capture of maintenance fault data and how it can contribute to diagnosing root causes of intermittent faults.

- The contribution of test methodology to isolating intermittent fault root causes.

- How the outcomes of functional testing can be enhanced by proving the integrity of components.

- How all of these strands are brought together to define strategies to drive down NFF arisings, thus increasing aerospace platform/system availability.

[a] Also referred to as Cannot Reproduce Fault (CNR), Cannot Duplicate (CND), Unable to Reproduce Fault, Re-Test OK (RTOK), No Trouble Found (NTF), No Fault Indicated (NFI).




THE NO FAULT FOUND PHENOMENON

Removals of equipment from service for reasons that cannot be verified by the

maintenance process (shop or elsewhere) are a significant burden for aircraft

operators. This phenomenon is commonly referred to as No Fault Found [1].

People like to categorise or pigeon-hole problems with simple labels, from the Credit Crunch to GM Food, irrespective of the complexity of tangible and intangible

interactions within the scenario concerned. NFF, and its corresponding effects on

system-level and aircraft-level availability, is one such scenario. It is easy to label, easy

to define and easy to see the effects of, but the ease with which this definition can be

applied to a given scenario is at odds with the depth and breadth of the problem.

In plain language, a NFF is a reported fault for which the root cause cannot be found.

Note that this definition applies irrespective of whether the associated diagnostic and

maintenance activity succeeded in reproducing the symptom(s) experienced by the

person reporting the fault: whether a symptom is present or not at the time of diagnostic investigation is academic if the actual root cause of the fault cannot be isolated. It also

applies equally whether the root cause of the symptom, as experienced by the user,

resulted from a physical fault condition or from user error.

The primary elements of a NFF occurrence are defined below, along with a simple car

example.

- There is the fault itself. This is usually reported by an end-user, such as a pilot (for faults occurring during a phase of flight) or by a maintenance technician (for faults which manifest themselves during other maintenance activity, whether related or unrelated to the fault concerned). The fault is the inability of a component or system to fulfil its intended function.

- There is the symptom, or symptoms, of the fault. The symptoms are the set of circumstances that brought the fault to the attention of the end-user; they are the effect on the operation of the platform or system. Chronologically, although the symptom is a direct consequence of the fault, it is the symptom which provides the starting point of the corresponding maintenance/diagnostic activity.

- There is the root cause of the fault. This is the primary failure mechanism which both caused the specific fault and led to the corresponding manifestation of symptoms.

Car Example

Fault: the car engine will not start.

Symptom: when you turn the key in the ignition there is the sound of a click and then nothing else happens.

Root Cause: there is a corroded connector on the starter-relay. This created a high resistance, hence the relay could not operate.




There is a chain of events from the fault occurrence, to the report of the fault and/or its

symptoms by the end-user, to the point at which the maintenance task is either

completed successfully or categorised as No Fault Found. At that point the

airworthiness decision-maker must then direct what is to happen next, perhaps basing

their decision on a combination of: personal experience, assumptions and knowledge;

the advice of technician colleagues; information in maintenance manuals; or the

aircraft's fault history, whether individually and/or at fleet level.

If this was the first time that the fault had been reported on this specific aircraft, and

you were the airworthiness decision-maker, what would you do?

The aforementioned NFF chain of events culminates in an inability to identify and fix

the root cause of the reported aircraft fault. In other words, to apply a different label, an

NFF occurrence is a diagnostic failure.

These definitions are pivotal to the concepts expounded in this paper and to how the

problem of reducing NFF occurrences could be addressed. From these definitions, a simplistic conclusion would be to state that the way to stop a NFF occurrence or a

diagnostic failure is to achieve diagnostic success. Hence, one must achieve

diagnostic success in order to identify the root cause of the fault, and thus enable

implementation of the necessary corrective maintenance activity. To achieve this

effectively and efficiently necessitates a closed-loop system that can readily correlate

data pertaining to the symptom, the fault and the successful rectification solution: the

fix.
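
As an illustration only, the closed loop described above can be sketched as a single record that ties together the symptom, the fault, the root cause and the fix, with feedback from subsequent operation. This is a minimal sketch in Python, not a description of any particular maintenance information system: the class name FaultRecord, its fields and the outcome labels are assumptions made for this example, and the rectification text in the car example is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FaultRecord:
    """One reported fault, tracked from symptom to confirmed fix (hypothetical schema)."""
    symptom: str                      # what the end-user experienced
    fault: str                        # the function the system failed to perform
    root_cause: Optional[str] = None  # None until the failure mechanism is isolated
    fix: Optional[str] = None         # rectification action carried out, if any
    recurred: Optional[bool] = None   # feedback from later operation; None = not yet known

    def outcome(self) -> str:
        # No isolated root cause means the occurrence is an NFF, whatever was replaced.
        if self.root_cause is None:
            return "diagnostic failure (NFF)"
        if self.recurred is None:
            return "awaiting confirmation"
        return "repeat arising" if self.recurred else "diagnostic success"


car = FaultRecord(
    symptom="click on turning the key, then nothing",
    fault="engine will not start",
    root_cause="corroded connector on the starter relay",
    fix="connector cleaned and re-made",  # hypothetical rectification
    recurred=False,
)
speculative_swap = FaultRecord(
    symptom="click on turning the key, then nothing",
    fault="engine will not start",
    fix="starter motor replaced",  # hypothetical speculative replacement
)

print(car.outcome())               # diagnostic success
print(speculative_swap.outcome())  # diagnostic failure (NFF)
```

The point of the sketch is that a replacement action on its own does not close the loop: the record only counts as a diagnostic success once a root cause has been isolated and subsequent operation confirms that the fault has not recurred.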

THE IMPACT OF NFF: DIRECT AND INDIRECT

Aerospace statistics for NFF demonstrate that achieving diagnostic success is not

simple, so merely changing NFF terminology to that of diagnostic failure is not going

to solve the problem and improve aircraft availability.

Published statistics reveal a wide range of perception in terms of the extent of the

problem and the impact. Avionics constitute 75% of NFF occurrences in aerospace;

furthermore, avionics NFF rates are typically in the region of 30% or higher [2]. The

situation does not appear to have improved significantly in recent years when one

examines 1996 figures for Boeing, which showed a 40% rate of incorrect parts removal from the airframe [3].

The real financial cost of the problem is unclear. In 1997 the US Air Transport

Association estimated the cost of impact as equating to $100 000 per aircraft per year [4].

More recently, British Airways has estimated the financial impact at 20M per year [5].

Calculating the financial impact of NFF is highly complex, depending on how far the

effects are extrapolated in cost terms. Should the calculation only include the cost of

unnecessary NFF repair investigations at second line workshops? Should it include the

man-hour costs of unnecessary removals from aircraft, or include the additional spare

LRU purchase costs in response to arising rates, and so on? These indirect costs are discussed in more detail later in this paper.




The primary impact of NFF is on people. It prevents them from achieving their

operational business objectives, whether commercial or military, which puts additional

pressure and stress on the people that populate the operation. For example, there can be

few situations more frustrating than that in which military aircrew spend hours planning

and briefing for a complex training sortie as part of a formation, only to abandon it part

way through the sortie because of an intermittent fault which the

technicians are subsequently unable to diagnose successfully. The frustration of the aircrew is

reflected in equal measure by the frustration felt by the maintenance technicians who

apply their best endeavours to rectify the fault, only for it to result in a diagnostic

failure. So how do they react to these situations? They want something done that is

tangible in order to feel that the problem has been solved, putting a resultant

'Serviceable' tag back on the maintenance planner's whiteboard. Returning to the

earlier scenario of what the airworthiness decision-maker should do next, there are a

number of common options available to them and to their technicians:

- Firstly, they could rule out 'finger trouble', ie confirm that the end-user utilised the correct procedures to use the equipment.

- They could insist that the technicians complete all the relevant functional tests of the system(s) concerned and, if it was the first occurrence on the aircraft, they might sign it off as NFF on the basis that it was a one-off.

- They could insist that the technicians complete all the relevant functional tests of the system(s) concerned and then request that a limited flight test be carried out to see if the fault recurs in the same environmental conditions as the original occurrence.

- They could ask a different team of technicians to review the symptom, fault and investigation carried out so far to identify the potential for additional diagnostic options.

- They could have the fleet maintenance history examined for information on that fault type on other platforms of the same type, but only if this data is readily available.

- Similarly, they might seek the advice of colleagues or the Design/Type Certification Authority on whether they have experience and/or ideas of this fault and how to rectify it.

- They could examine the maintenance manual and then use their judgement to select the most likely component to replace, in the (calculated) hope that it rectifies the problem.

- They could opt to replace a component in the system that is quick to change and readily available in stock, in the (somewhat less calculated) hope that it rectifies the problem.

A typical outcome would be that, having ruled out user error, the decision is made to select what is deemed to be the most likely Line Replaceable Unit (LRU) and to

replace it. Having replaced the LRU, functional checks are carried out to confirm the




serviceability of the system; tools are returned to tool stores and the paperwork is

completed. The aircraft is signed off as serviceable once more. It then flies again on

another route or on another training sortie... and the fault does not immediately recur.

Success! Replacing the LRU fixed it! Or did it?

Did the LRU replacement fix the reported fault, by removing the actual source of the

fault's root cause (albeit undiagnosed) from the system? Or was the fault's actual root cause successfully rectified when the LRU was replaced, because the root cause was in

fact the intermittent integrity and security of the wiring connections which happened to

be re-seated and made more secure as a consequence of the replacement activity? Or

was the fault an intermittent fault that has yet to manifest itself again, perhaps owing to

a slightly different flight profile experienced on the subsequent flight?

The above scenario has outlined a typical set of circumstances where fault symptoms

experienced by the end-user lead to a diagnostic failure which does not have a black or

white solution. The resulting decision on how to deal with the diagnostic failure would

be made first and foremost with safety and airworthiness in mind, but there will be other influences on the decision concerning resources available, skills/experience available

and commercial or military priorities (slot times, time-on-target and the like). But in this

scenario there is often no clear diagnostic approach to opt for, hence

business/operational/resource/deadline pressures can have a disproportionate influence

on the diagnostic process. Ironically, if the fault recurs shortly afterwards on a

subsequent flight and the fault is reported to the same technician staff as before then, by

default, the options of what to do have been narrowed immensely and the diagnostic

process would (or should) be directed elsewhere. Alternatively, subject to anecdotal

evidence, assumptions or recent experience, the system might be perceived to be

unreliable, or the specific LRU might be perceived as a problem item, in which case

an assumptions-based decision could be made in which the replacement LRU is deemed

unserviceable on fit and might be replaced again. Or the maintenance organisation

may deal with the issue proactively, possibly by means of a quality occurrence

investigation.

What are the implications of these scenarios, which are played out day after day at

airfields all over the world?

Firstly, in this scenario, the maintenance staff cannot provide a high level of confidence

that the fault was fixed right first time and would therefore not recur during a flight in the immediate future, so there is a risk to business, whether that business is package

holidays or precision bombing. Moreover, depending on the system concerned there

may also be a risk to safety, either because it is safety critical or because the potential

for a repeat fault erodes existing levels of system redundancy. In short, there is a

performance or safety risk to the business output or effect required.

If the direct impact on business output is the visible effect of NFF, like the tip of an

iceberg, then below the waterline the main bulk of the iceberg comprises the major

impact on the supply chain, on maintenance performance and capacity and, potentially,

indirect impacts such as effects on customer perception of the airline. If the wrong LRU or component is replaced, whether through educated guesswork or simply hoping for

the best, then this adds major costs to the organisation. These costs include the time




incurred by maintenance and logistics staff in removing, processing and transporting the

suspect LRU, all of it wasted because the correct rectification activity has yet to take place to fix the

fault's actual root cause. Then there are the additional costs to bear for the wasted

transportation incurred in sending the suspect LRU to the appropriate Maintenance,

Repair & Overhaul (MRO) organisation or Original Equipment Manufacturer (OEM).

Then there is the wasted time spent on diagnosing and testing an LRU that has nothing

wrong with it. And then further logistics processing activity, storage costs, more

transportation costs and so on.

Supply chain information systems are not typically configured to recognise or correlate

the relationship between second line shop repair NFF activity, and the scaling and

resource consumption monitoring required as part of ongoing forecasting and

procurement activity. The impact of this is illustrated as follows. If fault X generally

leads to initial replacement of LRU Y in 80% of cases, irrespective of the reason why,

even though the actual root cause in 80% of cases is to be found with LRU Z,

then the supply chain information system will detect an increased consumption per

flying hour of LRU Y and will forecast ahead to ensure that there is sufficient stock to

meet forecast demand levels. This phenomenon is sometimes referred to as the

Phantom Supply Chain, and it can be exacerbated even further as LRU Y becomes

available in greater numbers; and so the initial speculative replacement activity becomes

easier to justify in the context of the stock levels held. When operators calculate the

cost of NFF do they just look at MRO NFF costs, or do they calculate the real cost by

calculating the full cost of the Phantom Supply Chain? Assuming you wanted to

calculate the full cost impact of NFF in this way, would you possess the data to

successfully undertake such analysis in the first place?
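
The arithmetic behind the fault X example can be sketched as follows. This is a rough illustration under stated assumptions, not a forecasting model: the function name phantom_demand is hypothetical, the figure of 100 occurrences is invented for the example, and it is assumed that every occurrence whose root cause is not in LRU Z could, at most, be a genuine LRU Y failure.

```python
def phantom_demand(occurrences: int, pct_first_swap_y: int, pct_root_cause_z: int):
    """Return (LRU Y removals, upper bound on genuine Y failures, lower bound on phantom demand).

    Percentages are whole numbers so that the arithmetic stays exact.
    """
    removals_y = occurrences * pct_first_swap_y // 100
    # Assumption: at best, every occurrence not caused by LRU Z is a genuine LRU Y failure.
    genuine_y_max = occurrences * (100 - pct_root_cause_z) // 100
    phantom_min = max(0, removals_y - genuine_y_max)
    return removals_y, genuine_y_max, phantom_min


# 100 occurrences of fault X (invented figure), using the 80% / 80% split from the text.
removals, genuine, phantom = phantom_demand(100, 80, 80)
print(removals, genuine, phantom)  # 80 20 60
```

On these figures the supply chain information system sees 80 demands for LRU Y, of which no more than 20 can be genuine, so at least 60 of them inflate the forecast without addressing the root cause.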

The Phantom Supply Chain also affects the maintenance policy for an

item. If LRU Y was assigned a maintenance policy as a consequence of Reliability &

Maintainability (R&M) analysis, equating to an On Condition or Run To Fail policy

(in other words, once fitted on aircraft there is no need to replace it until it fails), then

what would the effect be on the Mean Time Between Failure of the erroneous

replacements due to fault X? The R&M data would indicate an increase in arisings and

the maintenance policy might have to change. The changes required could range from

the introduction of scheduled inspection activity, to the assignment of a lifing limitation,

to the instigation of a modification. In turn, the increased maintenance activity (a

phantom maintenance policy, to continue the analogy) then generates a further associated impact on the supply chain, and so it continues. Yet another side effect that

impacts on the phantom supply chain/maintenance policy is the relationship between

repairs carried out at LRU/card level by OEMs or repair shops with the fault that caused

the item to be sent for repair in the first place. The higher tolerances of test equipment

used at these levels of repair may well uncover faults which have no relationship to the

reported fault. In these circumstances, the conscientious repair body will execute the

necessary repair, but is not guaranteed to isolate the fault which caused intermittency

and thus caused the original, reported fault. So the newly-discovered fault is rectified (a

fault was found, not the fault) and the item returned to the available stock. The cause of

the intermittency lies dormant until the component is back in operational use and it then manifests further intermittent fault symptoms at a later date: and so the loop continues.
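
The Mean Time Between Failure question posed above can be given illustrative numbers. The sketch below assumes that MTBF is calculated as operating hours divided by recorded arisings, and that speculative removals are recorded as arisings against LRU Y; the fleet hours and arising counts are invented for the example.

```python
def mtbf(operating_hours: float, arisings: int) -> float:
    """Mean Time Between Failure as operating hours per recorded arising."""
    if arisings <= 0:
        raise ValueError("MTBF is undefined without at least one arising")
    return operating_hours / arisings


fleet_hours = 10_000   # invented figure
genuine_failures = 5   # invented: LRU Y really failed
nff_removals = 15      # invented: LRU Y removed speculatively for fault X

true_mtbf = mtbf(fleet_hours, genuine_failures)
apparent_mtbf = mtbf(fleet_hours, genuine_failures + nff_removals)
print(true_mtbf, apparent_mtbf)  # 2000.0 500.0
```

With these invented figures the R&M data would show LRU Y as four times less reliable than it really is, which is the kind of distortion that could prompt an unnecessary change of maintenance policy.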




The effects on consumption and stock levels of

consuming LRUs for both genuine fault occurrences and NFF occurrences

combine to exacerbate the impact of component obsolescence. If LRU Y has become

categorised as obsolete, because the platform is a mature platform and that LRU's OEM

no longer supports it, then it is intuitive that the operator can ill afford the cost and non-

availability of that LRU that are caused by avoidable NFF occurrences.

Irrespective of the specific details of each instance of NFF, the impact of NFF is felt at

every level of flight operations, on pilots, customers, technicians and logisticians; and

NFF occurrences result in major process waste, avoidable costs and wasted time.

CAUSES OF DIAGNOSTIC FAILURE

Basic scrutiny of the circumstances of diagnostic failure occurrences reveals that there

are several factors that conspire against effective fault diagnosis and root cause analysis.

These are listed below and discussed in the following paragraphs:

- The inability to reproduce the symptom during maintenance/diagnostic activity.

- The inability of test equipment to detect the root causes of intermittent, randomly-occurring faults.

- The lack of availability of, or lack of access to, relevant corporate technical knowledge.

- Human factors, including maintenance culture/practice.

THE ABSENCE OF FAULT SYMPTOMS

The symptom that does not manifest itself when attempting to diagnose a reported fault

is an obvious and frustrating characteristic of an NFF occurrence. Assuming operator-

error has been ruled out, logic decrees that there was a root cause of the fault symptom

that was experienced and subsequently reported. The absence of the reported symptom

during diagnosis means that the circumstances of maintenance on the ground have not

resulted in the root cause precipitating the same effect. This is a well-documented concept, hence the extent of environmental stress screening testing carried out as part of

LRU shop repair activity or as part of reliability growth testing during component

design and development. The aim is to replicate the operating conditions that were in

place at the time of the fault symptom occurrence: conditions that might comprise

altitude, attitude, vibration, temperature and humidity. It may not be practicable to

replicate all of these conditions during diagnostic activity, but the most significant

aspects are sometimes attempted, such as vibration (with engines running) and by

physical manipulation of the airframe, connectors or cable looms whilst carrying out,

for example, continuity testing. If these approaches are unsuccessful, then the

airworthiness decision-maker is confronted by the various options and scenarios listed earlier in this paper.




If part of the problem is trying to replicate physical conditions that influenced the root

cause of the fault, the other part of the problem is the duration of that root cause

occurrence. The short duration deviation from the normal operating conditions of the

system is known as intermittency, a well documented phenomenon concerning electrical

and electronic circuitry. Intermittency [6] has been shown to be influenced by mechanical

stress (fretting corrosion, for example), which leads to transient variations or

intermittency in degraded contacts. These intermittent events can last for mere

nanoseconds, but this contact intermittency can be enough to result in system failure or

loss of information. Not only are these intermittent events of extremely short duration,

they are also, by definition, random. With the probability of detecting a random,

nanosecond-duration root cause event being marginal at best, the temptation of

speculatively replacing an LRU in the hope of removing the fault's root cause from the

system becomes a great one. By replacing the LRU, however, the electrical contact

characteristics of the system have been changed but the susceptible components such as

cables and connectors have been left unchanged. Connectors in particular

cannot be permanently sealed and so they are susceptible to corrosion and debris ingress, plus they experience wear in use and as a consequence of maintenance.

Contrast those usage and environmental effects on connectors with those same effects

on an LRU: the LRU is far less susceptible to these factors than connectors and cables.
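
How marginal that probability is can be estimated with a simple model. The sketch below assumes a single intermittent event of a given duration occurring somewhere within an observation window, and instantaneous test readings taken at independent, uniformly random times; the function names and the 100 ns and 1 s figures are illustrative assumptions rather than measured values.

```python
def p_single_reading(event_s: float, window_s: float) -> float:
    """Chance that one instantaneous reading coincides with the event."""
    return min(1.0, event_s / window_s)


def p_any_reading(event_s: float, window_s: float, readings: int) -> float:
    """Chance that at least one of several independent readings coincides with it."""
    miss = 1.0 - p_single_reading(event_s, window_s)
    return 1.0 - miss ** readings


event = 100e-9  # 100 ns intermittent event (illustrative)
window = 1.0    # 1 s observation window (illustrative)
print(p_single_reading(event, window))     # about 1e-07
print(p_any_reading(event, window, 1000))  # about 1e-04
```

Even a thousand point-in-time readings leave the chance of catching such an event at roughly one in ten thousand under these assumptions, which helps to explain why speculative LRU replacement is so tempting.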

Intermittent micro-changes in a circuit's ohmic characteristics and contact resistance

lead to performance deviations from the 'as designed' condition and can occur at any

level within a given system. Moreover, it has been shown from work carried out by

Universal Synaptics Corporation [7] over the past 15 years that intermittency is

predominantly found in what this paper will designate as the 3 Cs: cables, chassis (of

LRUs) and connectors. This is not intended to ignore the feasible presence of intermittency at circuit board level within a component or LRU; however, the higher

susceptibility of the 3 Cs to degradation mechanisms, compared with LRUs, means that

the benefit to be gained in tackling the problem versus the effort to be applied is

weighted heavily towards applying more resources towards intermittent faults found

within the 3 Cs.

Over time, left undetected, the physical mechanisms that are affecting the contact

intermittency and precipitating the fault's root cause will degrade as a consequence of

ageing, usage, environmental factors and maintenance factors. The intermittent events

will become greater in duration and amplitude, degrading to the point where either the root cause is diagnosed and detected, and/or the fault has become permanent: a hard

fault. Given the massive variation possible in the degrading factors mentioned, the

evolution of the fault's root cause from initial intermittency to hard fault could take

place over a lifecycle ranging from seconds to years. Therefore, the longer and more

gradual this fault degradation life-cycle, the harder it is to detect.

TEST EQUIPMENT CAPABILITY

If intermittency can be found and a resultant fix carried out successfully, this provides a

level of integrity to the item under test, ie the circuit under test, or unit under test

(UUT), since it shows no signs of intermittency and can perform as designed without




any deviations or minute changes in circuit characteristics. Whether specifically

looking for intermittency or confirming the integrity of a UUT, functional testing must

be carried out to ensure that the functionality of the system is as designed. Moreover,

modern functional testing technology covers a great deal of the failures; this is

particularly true in those cases where the Automatic Test Equipment (ATE) testing

regime matures alongside the system or unit under test, due to the application of

historical in-service experience.

However, despite every effort to ensure the detection of known faults using ATE and

traditional testing methods/equipment like continuity tests, Time Domain Reflectometers (TDRs) etc, these can only test

the UUT at a single point in time. Additionally, sample rates and digital averaging

techniques used to filter out noise have been developed in digital test equipment to

improve numerical accuracy in measuring circuit attributes, eg resistance. However, the

combination of measuring at a single point-in-time, sampling rates and digital averaging

results in any intermittent occurrences being missed completely or masked. Therefore,

successfully finding a randomly occurring fault or micro-change in a UUT requires a change from these methods, to an approach that significantly increases the probability

of detection. Digital accuracy is not the solution to detecting random, intermittent

events: the objective is to detect the event, not to measure it.
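
The masking effect described above can be illustrated with a synthetic trace. In the sketch below, a contact resistance trace of 10 000 samples contains one brief excursion; an averaged reading and a decimated (slower-sampled) reading both miss it, whereas a check of every sample against a threshold registers that an event occurred. All values, names and thresholds are assumptions chosen for illustration, and the per-sample threshold check merely stands in for a detection-oriented approach; it is not the test method of any particular equipment.

```python
def averaged_reading(trace):
    """Single averaged value, as a noise-filtering digital meter might report."""
    return sum(trace) / len(trace)


def decimated(trace, step):
    """Keep every step-th sample, mimicking a lower sampling rate."""
    return trace[::step]


def event_detected(trace, threshold_ohm):
    """True if any sample exceeds the threshold; the excursion itself is not measured."""
    return any(r > threshold_ohm for r in trace)


# Synthetic trace: 0.1 ohm contact with one sample-long excursion to 100 ohm.
trace = [0.1] * 10_000
trace[4321] = 100.0

PASS_LIMIT_OHM = 0.5   # illustrative acceptance limit for the averaged reading
EVENT_LIMIT_OHM = 1.0  # illustrative event threshold

print(averaged_reading(trace) < PASS_LIMIT_OHM)                # True: averaging masks the event
print(event_detected(decimated(trace, 100), EVENT_LIMIT_OHM))  # False: slow sampling misses it
print(event_detected(trace, EVENT_LIMIT_OHM))                  # True: the event is detected
```

The averaged reading moves only from 0.1 ohm to about 0.11 ohm, comfortably inside the illustrative pass limit, which mirrors the point that numerical accuracy does nothing to reveal that a random event took place.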

CORPORATE TECHNICAL KNOWLEDGE

With the lack of a symptom to influence diagnostic thought processes and decision-

making, a logical next step is to obtain additional, specialist information. There is a

huge range of potential sources of such information and advice, ranging from

maintenance colleagues on other shifts or locations, to contacting the OEM or Design

Authority for advice, to analysis of the Maintenance Manual, to analysis of maintenance

data for the specific aircraft or for the fleet type. The first shortfall in knowledge to be

encountered is with colleagues or maybe even the manufacturers, because their

knowledge is focussed on the characteristics of the system when it is working correctly,

not when it is deviating from normal operating conditions. If the type of fault event has

occurred before, there is the potential that a maintenance colleague or technical

specialist will have come across it before and will recall the actual corrective actions

that genuinely remedied the root cause of the fault, without repeat occurrences. If the

operating agency and the design authority have a proactive and learning relationship, there may even be a process in place to capture diagnostic knowledge such as this in

order to integrate it into maintenance documentation and procedures. But the fault

symptom may not have occurred before.

If specialist knowledge cannot be sourced from technical publications or staff, then the

remaining option is historic maintenance data. Analysis of the data may show whether

the same fault has occurred before on the aircraft or on another of the same type, and

what action successfully rectified the problem; or it may show a trend of related failures

on the same aircraft which would give an indication of where to focus fault diagnosis

activity next. The way that the maintenance data is captured, configured and cross-referred will all have a huge bearing on the ease and extent to which it can be

successfully interrogated to inform the fault diagnosis process.
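
As an illustration of how suitably captured maintenance data might be interrogated, the sketch below ranks the rectification actions recorded against a fault code by how often they were not followed by a repeat arising. The record layout, the field names and the sample history are hypothetical; real maintenance data systems will differ.

```python
from collections import Counter


def successful_actions(history, fault_code):
    """Rank actions for a fault code by how often the fault did not recur afterwards."""
    wins = Counter(
        rec["action"]
        for rec in history
        if rec["fault_code"] == fault_code and not rec["recurred"]
    )
    return wins.most_common()


# Hypothetical fleet history: one row per rectification attempt.
history = [
    {"aircraft": "A01", "fault_code": "NAV-DROP", "action": "replace LRU Y", "recurred": True},
    {"aircraft": "A02", "fault_code": "NAV-DROP", "action": "replace LRU Y", "recurred": True},
    {"aircraft": "A02", "fault_code": "NAV-DROP", "action": "re-terminate connector", "recurred": False},
    {"aircraft": "A03", "fault_code": "NAV-DROP", "action": "re-terminate connector", "recurred": False},
    {"aircraft": "A03", "fault_code": "NAV-DROP", "action": "replace LRU Y", "recurred": False},
]

print(successful_actions(history, "NAV-DROP"))
# [('re-terminate connector', 2), ('replace LRU Y', 1)]
```

A query of this kind is only possible if each rectification action is cross-referred to the fault it was meant to fix and to whether that fault recurred, which is the point made above about how the data is captured and configured.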




THE INFLUENCE OF HUMAN FACTORS

Human Factors refers to the study of human capabilities and limitations in the

workplace; it considers the interaction of personnel, the equipment they use, the written

and verbal procedures and rules that they follow, and the environmental conditions of

any system [8]. Having identified how issues concerning absent symptoms, shortfalls in

technical knowledge and maintenance culture conspire to prevent diagnostic success, it is necessary to fully explore the underlying Human Factors and behaviours which

contribute to instances of diagnostic failure. These contributory factors affect the

smooth passage of the diagnostic process and the manner in which maintenance data

is captured. These factors are discussed in the following paragraphs with respect to the

m-SHEL model [b][9].

m: Management Control of the System

The system of work within an Aviation operation is highly complex and has a number

of major influences bearing on it at any one time, and not necessarily in a complementary manner. These influences range from profit, to safety, to long-term

business objectives, to maintenance policy, to resource limitations. Each organisation

may implement and manage their system of work differently, but they will all have very

similar objectives that are fundamental to their success. In short, they all want to

achieve maximum output (ie profit, or military effect) to meet the customer's needs with

the minimum consumption of input resources (ie spares, direct/indirect costs).

In most cases these organisations will implement Performance Management systems,

incorporating the collation and trend analysis of Key Performance Indicators (KPIs).

However, there is a growing body of evidence that demonstrates that slavish adherence to these KPIs can actually undermine achievement of the business objectives, because

individuals within the system of work modify their behaviour to pursue success against

KPIs instead of against what matters: the needs of the customer [10]. If the aviation

organisation measures success with KPIs for number of flying hours achieved, or the

number of serviceable aircraft available at the start of the flying day, or the percentage

of flights which embarked and took off on time, then the organisation's staff will

pursue those targets. Refer this concept back to the dilemma of the airworthiness

decision-maker described earlier, and it is evident that such business pressures can

influence the action taken. This often leads to the short-term palliative approach,

the speculative LRU change for example, rather than the sustainment-driven approach, which is to identify the root cause of the fault.

Aside from Performance Management, access to the right data is crucial to business and

to enhancing maintenance effectiveness. Depending on the magnitude of the

organisation's operations, there may be a high turnover of airframes used, either within

or across operating sites, and across normal operations and maintenance activity.

Developing a knowledge-based system to underpin the sustained availability of these

assets necessitates information systems that allow the effective storage, sharing and

[b] Edwards' SHEL model (Software, Hardware, Environment, Liveware) for Human Factors was modified by Kawano by the addition of an 'm' for Management (Control of the System).




accessibility of the knowledge and data required to run the business effectively.

However, this is more complex to achieve if the operating organisation, the supply-

chain management organisation and the MRO organisation are separate business

entities.

The organisation's maintenance culture will also be a major influence on the

management system, and the effects of this will inexorably filter through to the tactical level to affect the manner in which the organisation deals with NFFs. Maintenance

cultures are shaped by an array of factors including the parent organisation's culture, the

organisational aims, training, nationality, the experiences and knowledge of its staff and

the intensity of the imperative for the business to perform. In the context of NFFs, that

culture influences maintenance practice which, in turn, influences how the maintenance

organisation responds to NFF occurrence at the working level, ie by shop floor or flight

line maintenance staff. A full psychological profile of aircraft maintenance staff is

beyond the scope of this paper; suffice it to say that they like to get things done and they

like to successfully meet targets. For many, crisis-management and fire-fighting are

more fulfilling approaches to their jobs, and more fun, than studiously analysing data

and forward planning [11]. This 'can do' attitude has its place, but if channelled

inappropriately, however well intentioned, it can lead to the scenario whereby

'something tangible has to be seen to be done'.

An airline operator or a squadron pilot planning for an operational mission may not

perceive extended fault diagnosis activity or analysis of maintenance data to find root

causes as being a pro-active approach to returning an aircraft to use. It could even be

interpreted as 'just the engineers fiddling with the aircraft'. Compare that scenario with

technicians running round changing lots of LRUs, and the illusion of activity is then

easily associated with a concerted effort to solve the aircraft's problems. This approach

becomes established as a norm, and the more established it becomes the harder it is for

individuals within the system of work to assert themselves to break the pattern.

In the absence of a clear symptom or symptoms, and with specialist technical advice

that may well be assumptions-based, rather than knowledge-based, the airworthiness

decision-maker is back to their list of possible courses of action. If time is pressing and

there are spare LRUs readily available, LRU replacement is a course of action that is often selected;

especially so if the airworthiness consequences of a recurrence were felt to be relatively

minor. The major emphasis on LRUs can be seen in many arenas within aviation and

they are like the NFF iceberg referred to earlier. The electrical contact components (ie

the wiring, connectors and circuit breakers) are the 'glue' that holds together systems of

LRUs and other components, but they are more time consuming to test and maintain

using conventional methods and are less accessible than LRUs. These 'glue'

components are classed as a system in their own right: the Electrical Wiring and

Interconnection System (EWIS) [12]. LRUs, on the other hand, are easier to see, to replace,

to manage in the supply chain and to apply BITE [c] to than the alternative, more mundane

EWIS components, and so they become the focus of attention. However, in

considering connectors and the like and comparing their vulnerability to LRUs, it is

[c] Built In Test Equipment. To carry out BITE on an item or system refers to the act of running its built-in test functions.


logical that EWIS components are the very aspect to focus on: but they are part of the

NFF iceberg's main bulk, submerged well out of sight below the waterline.

S: Software

In this Human Factors context, software refers to maintenance procedures, technical

documentation, checklist layout etc. Having identified the complexity inherent in sharing and integrating data end-to-end across the specific aviation enterprise, or system

of work, the next challenge concerns the corporate knowledge described previously. In

particular, do maintenance manuals include troubleshooting information that is

knowledge-based and that increases the likelihood that the maintenance technicians can

diagnose and fix the fault right first time, every time? If they do not include that kind

of information, why is this? Is this because the platform is brand new and that kind of

knowledge is still developing? Or is it because it has never been requested by the

customer? Or is it because the cost of including such data is prohibitive to the

operators? Or is it because it is just too complex a task to collate all the necessary data?

Collation of such data would be problematic depending on the quality and completeness

of data captured for maintenance activity. Individual technicians will perceive and

interpret symptoms and faults differently, or not at all, and they will record this

information in a mind-boggling variety of ways. In the case of a hydraulic pipe failure,

one of the symptoms is likely to be the presence of hydraulic fluid leaking in a specific

area on the aircraft. The symptom could be recorded as 'split pipe', 'burst pipe',

'hydraulic leak', 'seeping fluid', 'damaged' or 'unserviceable': the possible permutations are

numerous. But where does this variability come from? Human beings in the same

scenario would take in the same raw data as each other via their senses and

corresponding sensors (eyes, ears, nose etc); thereafter, the raw data is filtered by the application of experience, knowledge, values and culture before the end result is internally

re-presented to the individual [13]. The potential variability of the internal re-presentation

process is infinite given the variability of how every human being would filter and

modify that raw information. The effect is then exacerbated when more than one person

is involved in the scenario, especially if they are involved at different points in the fault

diagnosis chain of events.

The vital point to note is not how human variability influences the maintenance system

of work, but simply to note that it does; therefore, a method of mitigating its effects is

required to improve the establishment of the corporate knowledge needed to address NFF problems.

H: Hardware

In this context hardware refers to tools, test equipment, physical structure of the aircraft

etc. Modern culture is such that any new product has to be bigger (or smaller!), better,

lighter and faster than its predecessor. These are the attributes we associate with

progress, but often at the expense of overlooking what a product is for and its

fitness-for-purpose at providing that function in a sustained and reliable way. The adages of 'if

it isn't broken, don't fix it' and 'keep it simple, stupid' may be associated with more mature individuals in organisations, but the world of aviation has long known that it is

more important to focus on what a product is required to do, and for it to do that


effectively, repeatedly and at minimum cost. With the growth of IT solutions, test

equipment methodology is often perceived as needing to keep pace (necessarily so if

obsolescence is a high risk), but is it always necessary? For detecting intermittency it is

not necessary, as digital accuracy has already been discussed as being an impediment

rather than a benefit.

E: Environment

This ranges from the physical environment, such as conditions on aircraft operating

surfaces, to the work environment, such as working patterns. Identifying the root cause

of a fault can be hampered by working systems such as trade structure and shift

systems. Trade structure can become an issue if there are feasible solutions involving

more than one system and thus more than one trade: a fault investigation on a flying

controls system, for example, involving avionics and mechanical trades. The solution

selected may owe more to the strength of character of trade supervisors than to

analysis of objective facts and data! Similarly, colleagues on subsequent shifts or

reallocated from other tasks may question the diagnostic process undertaken so far and choose to take the activity in a different direction, rightly or wrongly. All of these

factors impede the smooth chain of events from symptom to first-time-fix, and

complicate unnecessarily the audit trail of corresponding maintenance data.

L: Liveware

This term refers to people. It includes the individual at the centre of an activity and the

other people associated with the activity, in whatever guise that may be. The major

influence of people has already been described in terms of how the variability of their

behaviour affects maintenance data capture and the implementation of the diagnostic

chain of events. A further human factor which impacts on the effectiveness of corporate

technical knowledge is: knowledge retention and recall. If there is a human factors-

caused incident, such as an aircraft towing incident involving ground equipment and an

airframe, the details and the conclusions of the resulting investigation are publicised

widely across the organisation. Several months later it happens again; 'how can that

possibly happen again?' the managers ask themselves.

How should the organisation genuinely learn from what had happened before?

The key to this scenario is whether the root cause of the original incident was established, and whether a mitigating countermeasure was embedded fully into working

practices (checklists, manuals, training syllabi etc). If the countermeasure is

insufficiently embedded (for example, maintenance staff are merely briefed about the

incident and asked to take more care in future), then the incident soon fades from recent

memory, the effects are not as visual as they once were, and the corporate knowledge

may be diluted further by the influx of any new staff. The same is equally true for

how the maintenance culture deals with NFFs. If the predominant approach is 'shotgun

maintenance', ie speculative LRU replacement, then there is limited opportunity to

retain and transfer knowledge between individuals of what rectification actions are

genuinely successful in eradicating the root cause of specific faults. Again, the

organisation has not genuinely learned from what has happened before.


A PROPOSED ROUTE TO DIAGNOSTIC SUCCESS

Insanity: doing the same thing over and over again

and expecting different results [d].

If diagnostic failure is caused by or exacerbated by the current test methodology used to

attempt detection of intermittent faults, by the non-availability of relevant technical data and by the effects of certain maintenance behaviours and practices, then it follows that

different things - rather than the same thing over and over again - must be done to

assure diagnostic success. There are 3 crucial elements that must all be dealt with using

new approaches to increase significantly the prospect of diagnostic success in order to

drive down NFF arising rates and, ultimately, increase aircraft availability for the

customer. These are:

1. The maintenance data approach.

2. The maintenance management approach.

3. The Intermittent Fault Detection approach.

1. THE MAINTENANCE DATA APPROACH

Variability in its many forms has been a significant theme in the human factors

discussions in this paper and how it can lead to inefficiency and waste in aircraft

operations. Given what problems exist in terms of maintenance and operational practice

and culture, what needs to be done differently to vastly improve aviation solutions to

NFF? Data capture and analysis comprise the first step in a journey to 'work smarter,

not harder' in order to significantly reduce NFF occurrences.

Data capture has become much easier with the introduction of a multitude of BITE for

on-board systems, data-logs of various data-buses and other health monitoring systems.

Couple this with the vast amount of data captured for each flight in terms of the flight

plan, debrief, work order cards etc, and it is evident that there is a substantial amount of

data to analyse. Consider this alongside the burgeoning capability to

manipulate and trend information using today's computer and software technology, and

the situation does beg the question: so why is there a problem?

In aviation organisations there is an array of disciplines, personalities, departments and

cultures. This creates 'information gaps' [e] in the flow and recording of relevant and

crucial data. As a result, the effective analysis of any captured data is adversely

affected by the number and extent of the information gaps that exist in a given

organisation's system.

[d] Attributed to Albert Einstein, physicist, 1879-1955.

[e] An Information Gap can be defined as a break in common communication between persons, departments etc, even though they may speak the same language. For example, a pilot will not necessarily think, speak or use the same terminology as the technician, and as a result an Information Gap is formed.


An example that simply illustrates this phenomenon is the person who is the owner,

driver and maintainer of a car. If a person performs all related operational and

maintenance tasks, including being the budget holder, they have complete ownership,

accountability and responsibility for all operational aspects of the car; hence no

information gaps are induced. Therefore, for any symptom that arises, this is logged and

translated into a possible fault and an associated fix, and its cost is remembered; if the fault

is intermittent and the fix was not effective, then a cost-benefit analysis would take

place to ensure that any subsequent course of action which may be taken balances

operational commitments and regulatory requirements. Therefore, if the same symptom

occurs again and again, this is quickly realised, so depending on the last action taken

other possible rectification activities might well take place until the symptom stops

arising. In reality, it is highly likely that different courses of action would be taken by

this lone operator and it is highly unlikely any of the applied fixes or courses of action

would be repeated. Therefore, if the symptom was 'the car engine keeps cutting out at

idle', the individual is unlikely to keep changing the engine management control unit at

600 per unit; in fact this might be the last item they would change given its high value, unless it is definitively identified as the source of the fault's root cause. This is a clearly

defined closed-loop system in which all the data is presented and correlated by the same

person for the same vehicle, and by all the disciplines ie owner, budget holder,

maintainer and operator.

Introducing a second person into the scenario to borrow the car complicates the

system. As they experience the symptom, they begin to process the raw information

presented to them and on their return they report their findings to the car's owner; this

may even be coupled with emotional anecdotes depending on the purpose behind

borrowing the car and the extent of the problem experienced. Depending on the second person's expertise and experience, the presented synopsis of the car's problem could

well range from 'the car is noisy', with no supporting information, to a full-blown

diagnostic of the possible fault(s). This is how information gaps begin. With multiple

cars, multiple drivers and multiple repair staff involved, the information flow is

impeded further whilst the number and extent of information gaps grow. In short, there

is no coherent correlation between the data pertaining to symptom - fault - fix.

The foundation of a successful data capture process is correct definition at the outset,

coupled with a robust and standardised format. It is vital that the symptom, as

experienced by the user, is captured using a standard and repeatable methodology. These symptoms can then be codified and entered into a searchable data field within

the operator/maintenance data management system. This eliminates two common

information gaps. Firstly, the fault debrief process with the operator/user is carried out

in their technical language, which prevents information from being missed. Secondly,

the codification of the symptom allows searching and trending on the database field and

therefore successfully eliminates the 'free-text syndrome'. Base-lining of the symptom

data using this standard codification approach not only bridges the information gaps but,

more significantly, enables the direct correlation of the symptom through to the fault,

and then to the actual fix.

To illustrate with another car example, consider the following symptom descriptions:

'flat tyre', 'puncture', 'flat', 'nail in tyre', 'tyre flat'. They are all the same, but in a free-text


database these would be seen as 5 different symptoms, none of which states which tyre

has the problem. In addition the text focuses on the fault and omits the symptom that

was experienced by the driver of the car, which was that the car was hard to steer. The

standardising of symptoms, which can be constructed using historical data, experience

and knowledge, narrows the possible outcome scenarios; once this has been achieved,

the symptom can be codified into a discrete code that uses predetermined letters to

represent certain states and conditions which are common throughout all systems within

the aircraft. Referring back to the car example, to assist the driver in remembering as

much of the information as practical, it is imperative to create the debrief in the logical

manner as experienced by the driver, and not to debrief the operator as an engineer or

technician might instinctively tend to. Therefore, the resulting debrief could look like

Table 1:

Field          Word Picture (options)                                  Code
-------------  ------------------------------------------------------  ----
System         Steering                                                ST
Symptom        Steering is: Hard / Loose / Impossible / X Not listed   H
Speed (mph)    42                                                      42
Weather        Dry / Wet / Icy                                         D
Road Surface   Tar / Loose Gravel / Boulder / Mud                      T
Road Debris    None / Rocks / Glass / Screws/nails                     S
First Visual   Tyre: OK / Low / Flat                                   F
Second Visual  Damage: OK / Scuffed / Cut                              C

Table 1: Example codification of a fault's symptom

The resultant code would be something like this: STH42DTSFC. In this scenario the

last 2 aspects of this symptom diagnostic would probably be carried out by the receiving

staff of the hire car company. Similarly, for aircraft scenarios, flight line mechanics

might be required to complete the symptom debrief coding process in some cases to

include, where appropriate, details of warning captions, BITE codes etc.
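The codification step can be sketched in a few lines of Python. This is a minimal illustration only: the function name and lookup tables are hypothetical, and only the letters that appear in the Table 1 code (ST, H, D, T, S, F, C) come from the paper; every other letter assignment is an assumption made so that the example runs.

```python
# Hypothetical lookup tables modelled on Table 1. Only ST, H, D, T, S, F and C
# are taken from the paper; all other letters are illustrative assumptions.
SYSTEM = {"Steering": "ST"}
SYMPTOM = {"Hard": "H", "Loose": "L", "Impossible": "I", "Not listed": "X"}
WEATHER = {"Dry": "D", "Wet": "W", "Icy": "I"}
SURFACE = {"Tar": "T", "Loose Gravel": "G", "Boulder": "B", "Mud": "M"}
DEBRIS = {"None": "N", "Rocks": "R", "Glass": "G", "Screws/nails": "S"}
TYRE = {"OK": "O", "Low": "L", "Flat": "F"}
DAMAGE = {"OK": "O", "Scuffed": "S", "Cut": "C"}


def encode_symptom(system, symptom, speed_mph, weather, surface, debris, tyre, damage):
    """Build a symptom code by concatenating one token per debrief field."""
    return "".join([
        SYSTEM[system],
        SYMPTOM[symptom],
        str(speed_mph),
        WEATHER[weather],
        SURFACE[surface],
        DEBRIS[debris],
        TYRE[tyre],
        DAMAGE[damage],
    ])


code = encode_symptom("Steering", "Hard", 42, "Dry", "Tar", "Screws/nails", "Flat", "Cut")
print(code)  # STH42DTSFC
```

Because each field only accepts a listed option, an answer outside the standard set raises a KeyError rather than being stored as free text, which is the behaviour the standardised debrief is intended to enforce.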

The output of this base-lined, standardised symptom codification activity is that the

process is now repeatable, and can thus be carried out by the operator directly and

without necessarily being dependent on an experienced technician to facilitate a debrief:

and noting that different technicians would each facilitate the debrief in a different way, asking different questions (some relevant, some less so) and in a different order. The

captured symptom code is meaningful and concise, and does away with the free-text

problem so that there is now a genuine ability to conduct trend-analysis on

the symptoms being experienced. Furthermore this has enabled the capability to trend

analyse parts of the symptom, for example search for all 'ST*' symptoms reported.

This approach of part trending allows the output to be defined according to data needs.

Thus the first part of the code is more useful to strategic management because they

might only need to know the total number of steering arisings that their customers are

having. The entire code would be useful to the car's in-service support/maintenance

organisation because it could be used to influence changes to manuals, to training or to

modify the systems concerned.
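Part trending of this kind can be illustrated with a short Python sketch. The logged codes below are invented for the example (including a hypothetical 'BR' system prefix), and the wildcard search simply uses shell-style pattern matching from the standard library; it is not a description of any particular maintenance data system.

```python
from collections import Counter
from fnmatch import fnmatchcase

# Invented sample of logged symptom codes, for illustration only.
logged_codes = ["STH42DTSFC", "STH30WTNLO", "STL55DTNOO", "BRS20WMNOO", "STH42DTSFC"]


def matching(codes, pattern):
    """Return the codes that match a shell-style wildcard such as 'ST*'."""
    return [c for c in codes if fnmatchcase(c, pattern)]


steering = matching(logged_codes, "ST*")

# Strategic view: only the number of steering arisings is needed.
print(len(steering))  # 4

# Support view: the full code shows which detailed symptom recurs most often.
print(Counter(steering).most_common(1))  # [('STH42DTSFC', 2)]
```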


Overall, while the chosen example is a very simple one, the principles are the same for

more complex scenarios. The key to its success is the definition of the symptom as

experienced by the user of the equipment and in the language of the user. We have

termed this concept 'Symptom Diagnostics'.

Applying Symptom Diagnostics to Intermittency and NFF

Now that the first part of data capture for a given fault has been constructed and

standardised, this codified symptom can be linked to the diagnosed fault and the

resulting, successful fix. Given the benefits in the ability to trend symptoms and their

related faults with the actual fix, this approach is therefore an invaluable tool to apply

when tackling intermittent faults and the NFF phenomenon.

To eliminate NFF, the fault has to be solved at its root cause (like any other fault, in

fact), and while a number of speculative solutions may be attempted to eliminate the

fault, the need to quickly identify subsequent symptoms is crucial in order to ascertain if

the applied fix has been effective, ie it has been a diagnostic success. This use of Symptom Diagnostics trending is pivotal to ensuring that repeat fixes are not carried out

and that other potential fixes are considered.

While Symptom Diagnostics trending can enhance NFF and maintenance solutions

activity at the single airframe level, it becomes far more powerful when applied over an

entire fleet. As the Symptom Diagnostics successful fix data builds up for the fleet, it

allows technical staff to see what proportion of different maintenance activities led to a

diagnostic success for a specific Symptom code, for example: for fault X the successful

fixes were 80% for LRU Z and 20% for LRU Y. This data could be used to inform the

technicians and airworthiness decision-makers, allowing them to use the analysed, historical data to direct resources for maximum and long-term effect. Considering the concepts already

developed in this paper, if LRU Y takes 20 minutes to replace and LRU Z takes 2 hours

to replace, then it is possible for this to influence the diagnostic process; however,

coupled with the aforementioned Symptom Diagnostic data there is now the opportunity

to make a far more informed decision, based on knowledge and not on assumptions and

not on stock levels or on time-to-replace.
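The fleet-level proportions described above could be derived along the following lines. This is a sketch under stated assumptions: the record layout (symptom code paired with the fix that proved successful) and the function name are hypothetical, and the data is invented to reproduce the 80%/20% example.

```python
from collections import Counter

# Invented records of (symptom code, component whose replacement fixed the fault).
successful_fixes = [("X", "LRU Z")] * 8 + [("X", "LRU Y")] * 2


def fix_shares(records, symptom_code):
    """Return each fix's share of the diagnostic successes for one symptom code."""
    counts = Counter(fix for code, fix in records if code == symptom_code)
    total = sum(counts.values())
    return {fix: n / total for fix, n in counts.items()}


shares = fix_shares(successful_fixes, "X")
print(shares)  # {'LRU Z': 0.8, 'LRU Y': 0.2}
```

With such a table to hand, the 20-minute replacement of LRU Y can be weighed against the fact that, historically, it resolved the symptom in only one case out of five.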

The symptom-fault-fix methodology provides the foundation to tackle both hard and,

more importantly, intermittent faults. For an aviation enterprise with established

corporate maintenance knowledge for a mature platform, Symptom Diagnostics may not be an essential methodology for finding hard faults, but it does enable technicians to

find fault root causes more quickly. The ultimate application of this approach would be for

Symptom Diagnostics codes to be compiled and transmitted by the flight crew while the

aircraft is still airborne. This would enable maintenance staff to prepare spares and

resources in advance of the aircraft landing, similar to a Formula One pit crew being in

position with their tools and tyres prior to the car entering the pits. In addition to this

operational advantage, Symptom Diagnostics provides the fundamental base-lining

methodology for tackling intermittency. In considering the issues of speculative

maintenance, set against a backdrop of time pressure and KPIs to meet, then without a

standardised means of accurately recording recurring symptoms the random nature of the intermittent fault becomes hidden in a plethora of unnecessary maintenance


activities and uncertainty, leading to increases in operating costs and inefficiency. This

is back in the realm of the NFF iceberg, deep below the waterline.

By using a Symptom Diagnostics method for trending for a fleet asset it is now possible

to ascertain whether a speculative LRU change has been successful on an aircraft, or not.

Therefore, without the need for expensive test set equipment, a 'Ship or Shelve' policy

could be used to provide the breathing space required to prevent an LRU from being returned unnecessarily for repair or overhaul. For example, if a symptom is reported and cannot be

reproduced and is believed to result from an intermittent fault, then a considered LRU

replacement could be carried out. The replaced LRU would then be quarantined on a

designated rack within the stores system with an appropriate identification label stating

the aircraft registration, reason for removal, time, date, Symptom Diagnostics code etc.

After a predetermined time or conditions have been met, ie if the symptom did not

reoccur within a specified number of flying hours/flights/usage cycles, then the LRU

would be categorised as unserviceable and inputted into the reverse supply chain for

repair. Alternatively, if the symptom did reoccur within the specified period, then the

LRU could be categorised as serviceable, subject to any prerequisite functional checks,

and returned to use. There are clear airworthiness implications to be considered with

the ship or shelve policy, subject to the safety/performance-criticality of the LRU

concerned, but the policy has been introduced with some success by certain operators [14].
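The decision logic of the ship or shelve policy, as described above, can be expressed as a small function. The function name, arguments and return strings are hypothetical, and the sketch deliberately ignores the airworthiness criticality checks that the paper notes must also be considered.

```python
def ship_or_shelve(symptom_recurred, usage_since_removal, quarantine_limit):
    """Decide what to do with a quarantined LRU after a considered replacement.

    symptom_recurred: True if the symptom has reappeared on the aircraft
        since the LRU was removed.
    usage_since_removal, quarantine_limit: same unit (flying hours, flights
        or usage cycles).
    """
    if symptom_recurred:
        # The fault stayed with the aircraft, so the removed LRU was probably
        # not the cause: it can be returned to use after functional checks.
        return "shelve: return to use after functional checks"
    if usage_since_removal >= quarantine_limit:
        # The symptom left with the LRU, so the LRU is treated as unserviceable.
        return "ship: send for repair"
    # Not enough usage yet to draw a conclusion either way.
    return "hold: keep in quarantine"


print(ship_or_shelve(False, 60, 50))  # ship: send for repair
print(ship_or_shelve(True, 10, 50))   # shelve: return to use after functional checks
```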

Implementation of the above outlined procedural solutions enables the maintenance

staff to identify faults' root causes much earlier in the diagnostic process than has

traditionally been the case in NFF occurrences. In doing so, the resultant data reduces

the instances of erroneous LRU replacements because trend analysis of the data

highlights rogue LRUs or rogue [f] aircraft. Having highlighted rogue assets and taken

the appropriate action to isolate or limit their use, the next step in achieving diagnostic

success is detection of the intermittent fault, fixing its root cause and returning the asset

to service with no restrictions and also with, just as importantly, vastly increased

confidence that the intermittency has been eradicated.
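One simple way to flag rogue assets from trended data is to compare each asset's symptom count with the average, following the paper's definition of 'rogue'. The serial numbers and log below are invented, and the sketch only averages over assets that appear in the log, which is a simplification.

```python
from collections import Counter
from statistics import mean

# Invented log: one entry per recorded symptom occurrence, keyed by a
# hypothetical LRU serial number (an aircraft registration would work equally).
occurrences = ["SN001", "SN002", "SN002", "SN003", "SN002", "SN001", "SN002"]


def rogue_assets(log):
    """Return assets with more than the average number of symptom occurrences."""
    counts = Counter(log)
    average = mean(counts.values())
    return sorted(asset for asset, n in counts.items() if n > average)


print(rogue_assets(occurrences))  # ['SN002']
```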

2. THE MAINTENANCE MANAGEMENT APPROACH

This paper has examined the maintenance management and maintenance practice issues

that contribute to NFF occurrences, or which prevent the arising rates from improving.

The major hurdles which stand out in terms of their relative effect on diagnostic failure are the disproportionate focus on LRUs, the influence of KPIs and business pressures on

fault diagnosis processes and the quantity of LRUs erroneously sent for repair.

The engineering and degradation factors discussed in this paper, coupled with excessive

NFF rates for LRUs at second line repair workshops, all suggest that the LRU is being

treated as a 'sticking plaster' approach to rectifying NFF occurrences: treating the

symptom, not the cause. Degradation mechanisms and operating environments mean

that the 3 Cs are the more significant source of NFF root causes and that this is where

diagnostic resources, including training, should increasingly be directed. However,

[f] Rogue is defined as a LRU or aircraft which, through data analysis, is proven to be responsible for more than the average number of symptom occurrences being recorded.


before committing to this approach, use of Symptom Diagnostics data would provide

the checks and balances to indicate that the approach was correct in practice and not just

in theory. In parallel with a revised emphasis away from LRU replacements,

implementation of a ship or shelve policy would complement all initiatives to reduce

unnecessary LRU replacements and would successfully minimise the throughput of

LRUs in the reverse supply chain.

KPIs can have a positive motivational effect or can sub-optimise the performance of a

system of work. If revising them is the right course for an organisation, then the

Performance Management system should be amended to favour those maintenance KPIs

which focus on long-term effectiveness, rather than short-term effect. The first-time-fix

rate for, say, the top ten critical faults could be the major KPI of a maintenance

organisation's diagnostic capability, particularly if the figures display a continual,

downward trend.
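As an illustration of how such a KPI might be computed, the sketch below assumes that each fault arising is recorded with the number of rectification attempts it took to clear; that record format and the function name are assumptions made for the example.

```python
def first_time_fix_rate(attempts_per_arising):
    """Fraction of fault arisings cleared at the first rectification attempt.

    attempts_per_arising: list of attempt counts, one per fault arising,
        where 1 means the fault was fixed first time.
    """
    if not attempts_per_arising:
        return None  # no arisings recorded, so the rate is undefined
    first_time = sum(1 for attempts in attempts_per_arising if attempts == 1)
    return first_time / len(attempts_per_arising)


rate = first_time_fix_rate([1, 1, 3, 1, 2])
print(rate)  # 0.6
```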

Finally, the senior level of maintenance management should support airworthiness

decision-makers in all efforts to prevent 'shotgun maintenance' and focus on diagnostic success to identify faults' root causes. Company policy and maintenance standard

operating procedures (SOPs) should reflect this and provide a documented process to

follow to vastly increase the likelihood of a first-time-fix in conjunction with the

additional measures outlined above.

3. THE INTERMITTENT FAULT DETECTION APPROACH

As previously alluded to with regard to human factors 'Hardware' issues, the inexorable

advances of technology result in an increased focus on and fascination with the accuracy of digital equipment. Despite this, the probability of detecting intermittency is

more relevant to isolating intermittency root causes than digital measuring capability.

Therefore, to increase the probability of detection compared with conventional digital

equipment, a technique is required that is continuous in its ability to detect nanosecond

intermittency over a specified test period, and not sampled or averaged, or limited to a

point-in-time testing method.

Extending the logic of this concept one stage further, if multiple lines of continuous

testing can be carried out simultaneously then, by default, the probability of detecting an

intermittent fault during a specified period of time in a UUT is substantially increased.

Combine this intermittency testing with an appropriate level of environmental

stimulation for the UUT and this testing methodology provides maintenance staff with

increased confidence that every part of the UUT has been subjected to representative

conditions while all test points have been continuously and simultaneously tested.

Analogue neural-networks provide the means to successfully achieve intermittent fault

detection. A very low analogue signal allows the UUT to be tested continuously for a

period of time, and without the aforementioned compromises introduced by digital

techniques. The use of detection-optimised analogue versus measurement-optimised digital equipment means that nanosecond intermittency can be detected successfully.

Widely-available Digital Multi-Meters are limited to point-in-time testing and their


sampling rates restrict their potential capability to detection of millisecond

intermittency, thus there is an increase in the probability of detection by analogue

equipment of over 1 x 10^6. The use of an analogue neural-network

improves matters dramatically, because it uses a method whereby all the circuits of the

UUT are inter-linked in a synaptic-like networked arrangement. The neural-network

allows multiple test points (ie the circuits of the UUT) to be tested simultaneously and

continuously without missing any intermittency events across all the points under test.
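One way to see where a factor of the order of 10^6 can come from is a simple sampling model. This is an illustrative assumption, not the paper's own derivation: a tester takes instantaneous samples at a fixed interval, a single intermittent event starts at a random time, and a continuous tester is taken to catch the event with certainty.

```python
def sampled_detection_probability(event_duration_s, sample_interval_s):
    """Chance that at least one periodic, instantaneous sample lands inside
    a single event of the given duration that starts at a random time."""
    return min(1.0, event_duration_s / sample_interval_s)


# Assumed figures: a 1 ns intermittent event and a tester sampling every 1 ms.
p_sampled = sampled_detection_probability(1e-9, 1e-3)
p_continuous = 1.0  # modelling assumption for a continuous tester

# Ratio of detection probabilities: about one million under these assumptions.
print(p_continuous / p_sampled)
```

Under this model an event longer than the sampling interval is always caught, which is why millisecond-class equipment copes with millisecond intermittency but not with nanosecond events.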

Diagram 1 compares the detection capability of the analogue neural-network with that

of the conventional ATE methodology.

Diagram 1 - ATE vs Analogue Neural-Network

Conducting the test simultaneously using a neural-network provides an increase in

probability of detection proportional to the square of the number of circuits under test.

This substantial increase in the probability of detection, combined with the reduction in

the time taken to complete the test (because the testing is performed for multiple points

simultaneously, rather than testing one line at a time), means that exploiting analogue

neural-network equipment to detect and eradicate intermittent faults in electrical and

electronic aerospace components is the most effective test methodology to use.

THE REQUIRED OUTCOMES OF DIAGNOSTIC SUCCESS

Aviation maintenance policy concerns itself with ensuring that aircraft and their systems

are safe and fit-for-purpose throughout their full life cycle. In addition, commercial

demands mean that this is delivered to customers in a sustainable and cost-effective

manner. In short, this means providing the required aircraft availability at minimum

whole-life cost. NFF occurrences and intermittent faults directly affect that simple

equation, hence the required outcome of the route to diagnostic success must be to underpin and enhance availability levels by enabling the correct diagnosis and

rectification of every fault (including intermittent faults), right first time, every time.


An availability-focused maintenance strategy is not applied using a one-flight-at-a-time

mentality, ie "is it airworthy and serviceable for the next flight?". If this were the case then

there would not be such considerable efforts invested in developments such as fatigue

life and component life extension programmes. The concept of aircraft structural

integrity is not new, and the approach to continuing airworthiness has spread to all

systems on aircraft, including the integrity of the EWIS [15].

In the context of NFFs, therefore, analogue neural-network Intermittent Fault

Detection equipment should be used to demonstrate the system integrity of a UUT. The

combination of proving system integrity using this testing capability, as well as

confirming serviceability or system functionality using traditional special-to-type

test equipment, will lead to enhanced levels of sustained availability of systems and

platforms.

Functional Test + Integrity Test = Increased Availability

The combination of Intermittent Fault Detection using analogue neural-network test equipment (to enable rectification of the root cause) and functional testing using

existing test equipment can therefore provide the highest level of assurance of system

availability where the system can perform its function without interruptions from

transient faults.

CONCLUSIONS

A No Fault Found occurrence describes the set of circumstances which starts with an

end user experiencing a fault's symptoms and ends in a diagnostic failure. It is a reported fault for which the root cause cannot be found. NFFs impact on aircraft

availability directly and indirectly, through causing aborted flights and maintenance

rework; and through wasted time/money/resources involved in erroneous, speculative

and avoidable component/LRU replacement, repair and procurement.

Diagnostic failures are caused by:

Maintenance Human Factors - including: the pressure to meet operational

deadlines; an excessive troubleshooting focus on LRUs over EWIS components,

overlooking the 3 Cs (cables/connectors/chassis) in particular; the variability of data captured for symptoms, faults and fixes; and the retention, accessibility and

integrity of fault-finding and troubleshooting data.

Nanosecond Intermittency - resulting in intermittent symptoms that may not

manifest themselves during fault investigation activity; and which cannot routinely

be detected by conventional ATE because of their sampling rates, single point-in-time

application and digital averaging techniques, all of which sub-optimise their

ability to detect randomly occurring nanosecond intermittency.

To mitigate these obstacles to diagnostic success there are 3 key strategies which must all be applied:


Symptom Diagnostics - the capture and correlation of standardised symptom-fault-fix

data to improve corporate technical knowledge of successful first-time-fix

methods for real symptoms; this knowledge informs fault diagnosis processes and

diverts attention away from LRU changes and back towards root cause solutions,

especially in the 3 Cs.

Management - NFF-related maintenance management policy must underpin these

new strategies by inculcating the improvements in the system of work through

integration within the performance management system, within technician training

and within NFF fault-finding SOPs.

Intermittent Fault Detection - the use of analogue neural-network test

technology to overcome the shortcomings of digital equipment in this specific

context, to the extent that the probability of detecting intermittency is increased by

multiples of 10^6.
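
The Symptom Diagnostics strategy above, correlating standardised symptom-fault-fix data to find fixes that work first time, can be sketched as a small data structure and query. This is a minimal illustration, not the authors' system. The record fields, the function name and every code value below are hypothetical and invented for the example.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MaintenanceRecord:
    symptom_code: str       # standardised symptom reported by the end user
    fault_code: str         # root cause recorded by the technician
    fix_code: str           # rectification action taken
    fixed_first_time: bool  # True if the symptom did not recur after the fix


def rank_fixes_by_symptom(records):
    """Return, per symptom, fixes ordered by how often they worked first time."""
    successes = defaultdict(Counter)
    for record in records:
        if record.fixed_first_time:
            successes[record.symptom_code][record.fix_code] += 1
    return {symptom: counts.most_common() for symptom, counts in successes.items()}


# Hypothetical history; the codes are invented for illustration only.
history = [
    MaintenanceRecord("RADIO_DROPOUT", "CONNECTOR_FRETTING", "RETERMINATE_CONNECTOR", True),
    MaintenanceRecord("RADIO_DROPOUT", "NOT_FOUND", "REPLACE_LRU", False),
    MaintenanceRecord("RADIO_DROPOUT", "CONNECTOR_FRETTING", "RETERMINATE_CONNECTOR", True),
    MaintenanceRecord("RADIO_DROPOUT", "CHAFED_CABLE", "REPAIR_CABLE", True),
]

ranking = rank_fixes_by_symptom(history)
print(ranking["RADIO_DROPOUT"])
```

In this invented data set the speculative LRU replacement never appears in the ranking because it did not fix the symptom first time, while the connector and cable fixes do. That is the kind of evidence that would steer fault-finding back towards the 3 Cs.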

These strategies provide the foundation for an availability-focused maintenance strategy. Furthermore, the functional testing of electrical and electronic equipment by

ATE can be enhanced by proving the integrity of circuitry, connectors and the EWIS.

The combination of testing for Function and for Integrity leads to Increased

Availability, a reduced supply chain and vastly increased confidence in the equipment

being fitted to aircraft.

The aerospace maintenance organisation that could harness the combined

approach of this Intermittent Fault Detection methodology in concert with the

Symptom-Fault-Fix approach would be genuinely world class in its ability to fix

faults right first time, every time and, therefore, in its sustained ability to deliver increased aircraft availability.


REFERENCES

1 ARINC Report 672, (2008), Guidelines for the Reduction of No Fault Found (NFF), Avionics Maintenance Conference, Aeronautical Radio Inc.

2 http://www.aviationweek.com/aw/generic/story_generic.jsp?channel=om&id=news/om207cvr.xml accessed at 1930 on 24 Aug 09.

3 Knotts R, (1996), Analysis of the Impact of Reliability, Maintainability and Supportability on the Business and Economics of Civil Air Transport Aircraft Maintenance and Support, M.Phil. thesis, University of Exeter, UK.

4 Reference 2.

5 Blischke W R & Murthy D N P, (2003), Case Studies in Reliability & Maintenance, John Wiley & Sons, New York.

6 Dunwoody S, Bock E, Sofia J, (1996), A Practical and Reliable Method for Detection of Nanosecond Intermittency, AMP Journal of Technology Vol 5.

7 Kelly G, Sajecki A, Sorenson B A, Sorenson P W, (2001), An Analyzer for Detecting Aging Faults in Electronic Devices, Updated from 1994.

8 CAP 715, (2002), An Introduction to Aircraft Maintenance Engineering Human Factors for JAR66, UK Civil Aviation Authority.

9 Kawano R, (1997), Steps Towards the Realization of Human-Centred Systems, IEEE 6th Conference on Human Factors and Power Plants, Conference Proceedings, Orlando, pp 13/27-13/32.

10 Seddon J, (2003), Freedom from Command and Control: a better way to make the work work, Vanguard Press.

11 Repenning N P, Sterman J D, (2001), No-One Ever Gets Credit For Fixing Problems Before They Happened: Creating And Sustaining Process Improvement, California Management Review Vol 43 No 4.

12 European Aviation Safety Agency, (2008), EASA Certification Specification for Large Aeroplanes CS-25 Subpart H, Amendment 5.

13 Reference 8.

14 Reference 1.

15 Reference 12.

Jim Cockram is a former RAF engineering officer, with a 25-year Service career that focussed heavily

on the maintenance and logistics support of fast-jet fleets and guided weapons systems, from the vantage

point of roles in Forward, Depth and Integrated Project Team environments. During this time he

participated in many exercise and operational deployments, and was an early advocate of applying Lean

Thinking to Defence organisations. His experiences in programme management and in running large

aircraft maintenance organisations led him to develop maintenance and data-exploitation strategies which

he has subsequently employed successfully in business improvement projects in the private and public

sectors. Jim is the Technical Director of Copernicus Technology Ltd and is an enthusiastic member of

the Royal Aeronautical Society Highland Branch committee.

Giles Huby was also an RAF engineering officer, whose 16-year Service career encompassed land-based

and carrier-based fast-jet operations, plus Forward, Depth and Integrated Project Team roles in support of

fast-jet fleets and guided weapons. Like Jim, Giles possesses considerable experience of Defence

programme management and running large aircraft maintenance organisations. He was heavily involved

in Lean process improvement activity in Defence, and he amassed significant experience of Human

Factors incident investigation and the associated development of enduring countermeasures to prevent

further recurrences. Giles is the Managing Director of Copernicus Technology Ltd and the Chairman of

the Royal Aeronautical Society Highland Branch committee.
