Post on 30-May-2018
8/14/2019 CEAS 2009Aug31 CopernicusTechnologyLimited NoFaultFound Paper(Final)
1/23
Copernicus Technology Ltd 2009 1
No Fault Found (NFF) occurrences and Intermittent Faults: improving
Availability of aerospace platforms/systems by refining Maintenance
Practices, Systems of Work and Testing Regimes to effectively identify
their root causes
J D Cockram BEng(Hons) CEng MRAeS, Copernicus Technology Limited
G M Huby BEng(Hons) CEng MRAeS, Copernicus Technology Limited
ABSTRACT
The adoption of preventive and corrective maintenance strategies that both provide
aircraft availability and assure safety, at minimum cost, is fundamental to aerospace
operations in all sectors. To provide aircraft availability with even greater success, a
change to traditional maintenance approaches is required: from assumptions-based
approaches and speculative component replacements, to knowledge-based strategies.
One key area where knowledge-based approaches remain unexploited is the No Fault
Found (NFF)[a] scenario, for which intermittency in electrical and electronic component
circuitry is a major cause.
Tackling the NFF issue head on is, perhaps, what many maintenance managers would
like to do, but it is more complex than simply trying to eliminate problems by
speculative component changes and/or manpower resources alone. If it were that simple,
the issue would have long been consigned to the history books, but it has been estimated
that intermittency and NFFs account for a major proportion of fault occurrences in
aerospace maintenance organisations.
The key themes of this paper are as follows:
• Treating an NFF occurrence as a Diagnostic Failure, and the impact and causes of
those Diagnostic Failures.
• Mitigation of human factors and culture issues in maintenance systems of work, as
they pertain to NFF.
• The capture of maintenance fault data and how it can contribute to diagnosing root
causes of intermittent faults.
• The contribution of test methodology to isolating intermittent fault root causes.
• How the outcomes of functional testing can be enhanced by proving the integrity of
components.
• How all of these strands are brought together to define strategies to drive down NFF
arisings, thus increasing aerospace platform/system availability.
a Also referred to as Cannot Reproduce Fault (CNR), Cannot Duplicate (CND), Unable to
Reproduce Fault, Re-Test OK (RTOK), No Trouble Found (NTF), No Fault Indicated (NFI).
THE NO FAULT FOUND PHENOMENON
Removals of equipment from service for reasons that cannot be verified by the
maintenance process (shop or elsewhere) are a significant burden for aircraft
operators. This phenomenon is commonly referred to as No Fault Found. [1]
People like to categorise or pigeon-hole problems with simple labels, from the Credit
Crunch to GM Food, irrespective of the complexity of tangible and intangible
interactions within the scenario concerned. NFF, and its corresponding effects on
system-level and aircraft-level availability, is one such scenario. It is easy to label, easy
to define and easy to see the effects of, but the ease with which this definition can be
applied to a given scenario is at odds with the depth and breadth of the problem.
In plain language, an NFF is a reported fault for which the root cause cannot be found.
Note that this definition applies irrespective of whether the associated diagnostic and
maintenance activity succeeded in reproducing the symptom(s) experienced by the
person reporting the fault: whether a symptom is present or not at the time of diagnostic
investigation is academic if the actual root cause of the fault cannot be isolated. It also
applies equally whether the root cause of the symptom, as experienced by the user,
resulted from a physical fault condition or from user error.
The primary elements of an NFF occurrence are defined below, along with a simple car
example.
• There is the fault itself. This is usually reported by an end-user, such as a pilot
(for faults occurring during a phase of flight) or by a maintenance technician (for
faults which manifest themselves during other maintenance activity, whether
related or unrelated to the fault concerned). The fault is the inability of a
component or system to fulfil its intended function.
• There is the symptom, or symptoms, of the fault. The symptoms are the set of
circumstances that brought the fault to the attention of the end-user; they are the
effect on the operation of the platform or system. Chronologically, although the
symptom is a direct consequence of the fault, it is the symptom which provides
the starting point of the corresponding maintenance/diagnostic activity.
• There is the root cause of the fault. This is the primary failure mechanism
which both caused the specific fault and led to the corresponding manifestation
of symptoms.
Car Example
Fault: the car engine will not start.
Symptom: when you turn the key in the ignition there is the sound of a
click and then nothing else happens.
Root Cause: there is a corroded connector on the starter-relay. This
created a high resistance hence the relay could not operate.
There is a chain of events from the fault occurrence, to the report of the fault and/or its
symptoms by the end-user, to the point at which the maintenance task is either
completed successfully or categorised as No Fault Found. At that point the
airworthiness decision-maker must then direct what is to happen next, perhaps basing
their decision on a combination of: personal experience, assumptions and knowledge;
the advice of technician colleagues; information in maintenance manuals; or the
aircraft's fault history, whether individually and/or at fleet level.
If this was the first time that the fault had been reported on this specific aircraft, and
you were the airworthiness decision-maker, what would you do?
The aforementioned NFF chain of events culminates in an inability to identify and fix
the root cause of the reported aircraft fault. In other words, to apply a different label, an
NFF occurrence is a diagnostic failure.
These definitions are pivotal to the concepts expounded in this paper and to how the
problem of reducing NFF occurrences could be addressed. From these definitions, a
simplistic conclusion would be to state that the way to stop an NFF occurrence or a
diagnostic failure is to achieve diagnostic success. Hence, one must achieve
diagnostic success in order to identify the root cause of the fault, and thus enable
implementation of the necessary corrective maintenance activity. To achieve this
effectively and efficiently necessitates a closed-loop system that can readily correlate
data pertaining to the symptom, the fault and the successful rectification solution: the
fix.
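As an illustration only, the closed-loop correlation of symptom, fault and fix described above might be sketched as follows. The record layout, the field names and the rule that a success needs both an identified root cause and no recurrence are assumptions made for this sketch, not a description of any particular maintenance information system. The sample data reuses the car example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FaultRecord:
    """One reported fault linking symptom, fault and fix (hypothetical schema)."""
    aircraft: str
    symptom: str               # what the end-user experienced
    fault: str                 # the function that was lost
    action: str                # corrective action taken
    root_cause: Optional[str]  # None means a diagnostic failure (NFF)
    recurred: bool             # did the same fault recur after the action?


def is_diagnostic_success(record: FaultRecord) -> bool:
    """Assumed rule: a root cause was identified and the fault did not recur."""
    return record.root_cause is not None and not record.recurred


def proven_fixes(history: list[FaultRecord], symptom: str) -> list[tuple[str, int]]:
    """Rank the actions that genuinely fixed this symptom in the past."""
    counts = Counter(
        r.action
        for r in history
        if r.symptom == symptom and is_diagnostic_success(r)
    )
    return counts.most_common()


# Illustrative data based on the car example; the identifiers are made up.
history = [
    FaultRecord("A1", "click, no crank", "engine will not start",
                "replace starter motor", None, True),
    FaultRecord("A1", "click, no crank", "engine will not start",
                "clean starter-relay connector", "corroded connector", False),
    FaultRecord("A2", "click, no crank", "engine will not start",
                "clean starter-relay connector", "corroded connector", False),
]

print(proven_fixes(history, "click, no crank"))
# [('clean starter-relay connector', 2)]
```

The speculative replacement drops out of the ranking because it produced no root cause and the fault recurred; only the fix that closed the loop is offered to the next decision-maker.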
THE IMPACT OF NFF: DIRECT AND INDIRECT
Aerospace statistics for NFF demonstrate that achieving diagnostic success is not
simple, so merely changing NFF terminology to that of diagnostic failure is not going
to solve the problem and improve aircraft availability.
Published statistics reveal a wide range of perception in terms of the extent of the
problem and the impact. Avionics constitute 75% of NFF occurrences in aerospace;
furthermore, avionics NFF rates are typically in the region of 30% or higher [2]. The
situation does not appear to have improved significantly in recent years when one
examines 1996 figures for Boeing, which showed a 40% rate of incorrect parts removal
from the airframe [3].
The real financial cost of the problem is unclear. In 1997 the US Air Transport
Association estimated the cost of impact as equating to $100 000 per aircraft per year [4].
More recently, British Airways has estimated the financial impact at £20M per year [5].
Calculating the financial impact of NFF is highly complex, depending on how far the
effects are extrapolated in cost terms. Should the calculation only include the cost of
unnecessary NFF repair investigations at second line workshops? Should it include the
man-hour costs of unnecessary removals from aircraft, or the cost of purchasing additional
spare LRUs in response to arising rates, and so on? These indirect costs are
discussed in more detail later in this paper.
The primary impact of NFF is on people. It prevents them from achieving their
operational business objectives, whether commercial or military, which puts additional
pressure and stress on the people that populate the operation. For example, there can be
few situations more frustrating than that in which military aircrew spend hours planning
and briefing for a complex training sortie as part of a formation, only to abandon it part
way through the sortie because of an intermittent fault which the
technicians are subsequently unable to diagnose successfully. The frustration of the aircrew is
reflected in equal measure by the frustration felt by the maintenance technicians who
apply their best endeavours to rectify the fault, only for it to result in a diagnostic
failure. So how do they react to these situations? They want something done that is
tangible in order to feel that the problem has been solved, putting a resultant
'Serviceable' tag back on the maintenance planner's whiteboard. Returning to the
earlier scenario of what the airworthiness decision-maker should do next, there are a
number of common options available to them and to their technicians:
• Firstly, they could rule out 'finger trouble', ie confirm that the end-user utilised
the correct procedures to use the equipment.
• They could insist that the technicians complete all the relevant functional tests of
the system(s) concerned and, if it was the first occurrence on the aircraft, they
might sign it off as NFF on the basis that it was a one-off.
• They could insist that the technicians complete all the relevant functional tests of
the system(s) concerned and then request that a limited flight test be carried out
to see if the fault recurs in the same environmental conditions as originally.
• They could ask a different team of technicians to review the symptom, fault and
investigation carried out so far to identify the potential for additional diagnostic
options.
• They could have the fleet maintenance history examined for information on that
fault type on other platforms of the same type, but only if this data is readily
available.
• Similarly, they might seek the advice of colleagues or the Design/Type
Certification Authority on whether they have experience and/or ideas of this
fault and how to rectify it.
• They could examine the maintenance manual and then use their judgement to
select the most likely component to replace, in the (calculated) hope that it
rectifies the problem.
• They could opt to replace a component in the system that is quick to change and
readily available in stock, in the (somewhat less calculated) hope that it
rectifies the problem.
A typical outcome would be that, having ruled out user error, the decision is made to
select what is deemed to be the most likely Line Replaceable Unit (LRU) and to
replace it. Having replaced the LRU, functional checks are carried out to confirm the
serviceability of the system; tools are returned to tool stores and the paperwork is
completed. The aircraft is signed off as serviceable once more. It then flies again on
another route or on another training sortie... and the fault does not immediately recur.
Success! Replacing the LRU fixed it! Or did it?
Did the LRU replacement fix the reported fault, by removing the actual source of the
fault's root cause (albeit undiagnosed) from the system? Or was the fault's actual root
cause successfully rectified when the LRU was replaced, because the root cause was in
fact the intermittent integrity and security of the wiring connections which happened to
be re-seated and made more secure as a consequence of the replacement activity? Or
was the fault an intermittent fault that has yet to manifest itself again, perhaps owing to
a slightly different flight profile experienced on the subsequent flight?
The above scenario has outlined a typical set of circumstances where fault symptoms
experienced by the end-user lead to a diagnostic failure which does not have a black or
white solution. The resulting decision on how to deal with the diagnostic failure would
be made first and foremost with safety and airworthiness in mind, but there will be other
influences on the decision concerning resources available, skills/experience available
and commercial or military priorities (slot times, time-on-target and the like). But in this
scenario there is often no clear diagnostic approach to opt for, hence
business/operational/resource/deadline pressures can have a disproportionate influence
on the diagnostic process. Ironically, if the fault recurs shortly afterwards on a
subsequent flight and the fault is reported to the same technician staff as before then, by
default, the options of what to do have been narrowed immensely and the diagnostic
process would (or should) be directed elsewhere. Alternatively, subject to anecdotal
evidence, assumptions or recent experience, the system might be perceived to be
unreliable, or the specific LRU might be perceived as a 'problem item', in which case
an assumptions-based decision could be made in which the replacement LRU is deemed
'unserviceable on fit' and might be replaced again. Or the maintenance organisation
may deal with the issue proactively, possibly by means of a quality occurrence
investigation.
What are the implications of these scenarios, which are played out day after day at
airfields all over the world?
Firstly, in this scenario, the maintenance staff cannot provide a high level of confidence
that the fault was fixed right first time and would therefore not recur during a flight in
the immediate future, so there is a risk to business, whether that business is package
holidays or precision bombing. Moreover, depending on the system concerned there
may also be a risk to safety, either because it is safety critical or because the potential
for a repeat fault erodes existing levels of system redundancy. In short, there is a
performance or safety risk to the business output or effect required.
If the direct impact on business output is the visible effect of NFF, like the tip of an
iceberg, then below the waterline the main bulk of the iceberg comprises the major
impact on the supply chain, on maintenance performance and capacity and, potentially,
indirect impacts such as effects on customer perception of the airline. If the wrong LRU
or component is replaced, whether through educated guesswork or simply hoping for
the best, then this adds major costs to the organisation. These costs include the time
incurred by maintenance and logistics staff in removing, processing and transporting the
suspect LRU, time that is wasted because the correct rectification activity has yet to take
place to fix the fault's actual root cause. Then there are the additional costs to bear for the wasted
transportation incurred in sending the suspect LRU to the appropriate Maintenance,
Repair & Overhaul (MRO) organisation or Original Equipment Manufacturer (OEM).
Then there is the wasted time spent on diagnosing and testing an LRU that has nothing
wrong with it. And then further logistics processing activity, storage costs, more
transportation costs and so on.
Supply chain information systems are not typically configured to recognise or correlate
the relationship between second line shop repair NFF activity, and the scaling and
resource consumption monitoring required as part of ongoing forecasting and
procurement activity. The impact of this is illustrated as follows. If fault X generally
leads to initial replacement of LRU Y in 80% of cases, irrespective of the reason why -
even though the actual root cause in 80% of cases is actually to be found with LRU Z -
then the supply chain information system will detect an increased consumption per
flying hour of LRU Y and will forecast ahead to ensure that there is sufficient stock to
meet forecast demand levels. This phenomenon is sometimes referred to as the
Phantom Supply Chain, and it can be exacerbated even further as LRU Y becomes
available in greater numbers; and so the initial speculative replacement activity becomes
easier to justify in the context of the stock levels held. When operators calculate the
cost of NFF, do they just look at MRO NFF costs, or do they calculate the real cost by
calculating the full cost of the Phantom Supply Chain? Assuming you wanted to
calculate the full cost impact of NFF in this way, would you possess the data to
successfully undertake such analysis in the first place?
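The fault X example can be put into numbers with a short sketch. It is a deliberately simplified model, not a forecasting method: the function name, the figure of 100 arisings and the assumption that every genuine LRU Y failure falls among the cases where LRU Y was replaced are all introduced here for illustration. The 80% replacement rate comes from the example above; the 20% share of root causes attributed to LRU Y is simply what remains once 80% are attributed to LRU Z.

```python
from dataclasses import dataclass


@dataclass
class PhantomDemand:
    removals: float  # LRU Y removals that the supply system records as demand
    genuine: float   # removals where LRU Y really held the root cause
    phantom: float   # removals that only feed the Phantom Supply Chain


def phantom_demand(arisings: float, p_replace_y: float,
                   p_root_cause_y: float) -> PhantomDemand:
    """Split recorded LRU Y demand into genuine and phantom removals.

    Best-case assumption: every genuine LRU Y failure is among the
    arisings where LRU Y was the unit replaced.
    """
    removals = arisings * p_replace_y
    genuine = arisings * min(p_replace_y, p_root_cause_y)
    return PhantomDemand(removals, genuine, removals - genuine)


# 100 arisings of fault X (illustrative figure), LRU Y replaced in 80% of
# cases, root cause actually in LRU Y in at most 20% of cases.
demand = phantom_demand(arisings=100, p_replace_y=0.8, p_root_cause_y=0.2)
print(f"recorded demand for LRU Y: {demand.removals:.0f}")
print(f"genuine LRU Y failures:    {demand.genuine:.0f}")
print(f"phantom removals:          {demand.phantom:.0f}")
```

Under these assumptions three quarters of the recorded demand for LRU Y is phantom, yet it is the recorded figure that drives forecasting and procurement.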
The Phantom Supply Chain also affects the maintenance policy for an
item. If LRU Y was assigned a maintenance policy as a consequence of Reliability &
Maintainability (R&M) analysis, equating to an 'On Condition' or 'Run To Fail' policy
(in other words, once fitted on aircraft there is no need to replace it until it fails), then
what would the effect be on its Mean Time Between Failure of the erroneous
replacements due to fault X? The R&M data would indicate an increase in arisings and
the maintenance policy might have to change. The changes required could range from
the introduction of scheduled inspection activity, to the assignment of a lifing limitation,
to the instigation of a modification. In turn, the increased maintenance activity (a
'phantom maintenance policy', to continue the analogy) then generates a further
associated impact on the supply chain, and so it continues. Yet another side effect that
impacts on the phantom supply chain/maintenance policy is the relationship between
repairs carried out at LRU/card level by OEMs or repair shops and the fault that caused
the item to be sent for repair in the first place. The higher tolerances of test equipment
used at these levels of repair may well uncover faults which have no relationship to the
reported fault. In these circumstances, the conscientious repair body will execute the
necessary repair, but is not guaranteed to isolate the fault which caused intermittency
and thus caused the original, reported fault. So the newly-discovered fault is rectified (a
fault was found, not the fault) and the item returned to the available stock. The cause of
the intermittency lies dormant until the component is back in operational use and it then
manifests further intermittent fault symptoms at a later date: and so the loop continues.
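The effect of erroneous replacements on reliability data can be shown with simple arithmetic, taking mean time between failures as operating hours divided by the number of removals counted as failures. The fleet hours and removal counts below are invented purely for illustration, and the function name is hypothetical.

```python
def mean_time_between_removals(fleet_hours: float, removals: int) -> float:
    """Operating hours divided by the number of removals counted as failures."""
    if removals <= 0:
        raise ValueError("at least one removal is needed to compute a mean")
    return fleet_hours / removals


# Illustrative figures only: LRU Y over 10,000 fleet flying hours.
fleet_hours = 10_000.0
genuine_failures = 5   # removals where LRU Y held the root cause
nff_removals = 15      # speculative removals prompted by fault X

true_mtbf = mean_time_between_removals(fleet_hours, genuine_failures)
apparent_mtbf = mean_time_between_removals(
    fleet_hours, genuine_failures + nff_removals
)
print(true_mtbf, apparent_mtbf)  # 2000.0 500.0
```

If the R&M data cannot distinguish the two kinds of removal, LRU Y appears four times less reliable than it is, which is the kind of signal that could prompt an unnecessary change of maintenance policy.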
The effects on consumption and stock levels of
consuming LRUs for both genuine fault occurrences and NFF occurrences
combine to exacerbate the impact of component obsolescence. If LRU Y has become
categorised as obsolete, because the platform is a mature platform and that LRU's OEM
no longer supports it, then it is intuitive that the operator can ill afford the cost and
non-availability of that LRU that are caused by avoidable NFF occurrences.
Irrespective of the specific details of each instance of NFF, the impact of NFF is felt at
every level of flight operations, on pilots, customers, technicians and logisticians; and
NFF occurrences result in major process waste, avoidable costs and wasted time.
CAUSES OF DIAGNOSTIC FAILURE
Basic scrutiny of the circumstances of diagnostic failure occurrences reveals that there
are several factors that conspire against effective fault diagnosis and root cause analysis.
These are listed below and discussed in the following paragraphs:
• The inability to reproduce the symptom during maintenance/diagnostic activity.
• The inability of test equipment to detect the root causes of intermittent,
randomly-occurring faults.
• The lack of availability of, or lack of access to, relevant corporate technical
knowledge.
• Human factors, including maintenance culture/practice.
THE ABSENCE OF FAULT SYMPTOMS
The symptom that does not manifest itself when attempting to diagnose a reported fault
is an obvious and frustrating characteristic of an NFF occurrence. Assuming
operator error has been ruled out, logic decrees that there was a root cause of the fault symptom
that was experienced and subsequently reported. The absence of the reported symptom
during diagnosis means that the circumstances of maintenance on the ground have not
resulted in the root cause precipitating the same effect. This is a well-documented
concept, hence the extent of environmental stress screening testing carried out as part of
LRU shop repair activity or as part of reliability growth testing during component
design and development. The aim is to replicate the operating conditions that were in
place at the time of the fault symptom occurrence, conditions that might comprise
altitude, attitude, vibration, temperature and humidity. It may not be practicable to
replicate all of these conditions during diagnostic activity, but the most significant
aspects are sometimes attempted, such as vibration (with engines running) and
physical manipulation of the airframe, connectors or cable looms whilst carrying out,
for example, continuity testing. If these approaches are unsuccessful, the
airworthiness decision-maker is then confronted by the various options and scenarios
listed earlier in this paper.
If part of the problem is trying to replicate physical conditions that influenced the root
cause of the fault, the other part of the problem is the duration of that root cause
occurrence. The short duration deviation from the normal operating conditions of the
system is known as intermittency, a well documented phenomenon concerning electrical
and electronic circuitry. Intermittency [6] has been shown to be influenced by mechanical
stress (fretting corrosion, for example), which leads to transient variations, or
intermittency, in degraded contacts. These intermittent events can last for mere
nanoseconds, but this contact intermittency can be enough to result in system failure or
loss of information. Not only are these intermittent events of extremely short duration,
they are also, by definition, random. With the probability of detecting a random,
nanosecond-duration root cause event being marginal at best, the temptation to
speculatively replace an LRU in the hope of removing the fault's root cause from the
system becomes a great one. By replacing the LRU, however, the electrical contact
characteristics of the system have been changed but the susceptible components such as
cables and connectors have been left unchanged. Connectors in particular
cannot be permanently sealed and so they are susceptible to corrosion and debris
ingress, plus they experience wear in use and as a consequence of maintenance.
Contrast those usage and environmental effects on connectors with those same effects
on an LRU: the LRU is far less susceptible to these factors than connectors and cables.
Intermittent micro-changes in a circuit's ohmic characteristics and contact resistance
lead to performance deviations from the 'as designed' condition and can occur at any
level within a given system. Moreover, it has been shown from work carried out by
Universal Synaptics Corporation [7] over the past 15 years that intermittency is
predominantly found in what this paper will designate as the 3 Cs: cables, chassis (of
LRUs) and connectors. This is not intended to ignore the feasible presence of
intermittency at circuit board level within a component or LRU; however, the higher
susceptibility of the 3 Cs to degradation mechanisms, compared with LRUs, means that
the benefit to be gained in tackling the problem versus the effort to be applied is
weighted heavily towards applying more resources towards intermittent faults found
within the 3 Cs.
Over time, left undetected, the physical mechanisms that are affecting the contact
intermittency and precipitating the fault's root cause will degrade as a consequence of
ageing, usage, environmental factors and maintenance factors. The intermittent events
will become greater in duration and amplitude, degrading to the point where either the
root cause is detected and diagnosed, and/or the fault has become permanent: a hard
fault. Given the massive variation possible in the degrading factors mentioned, the
evolution of the fault's root cause from initial intermittency to hard fault could take
place over a lifecycle ranging from seconds to years. Therefore, the longer and more
gradual this fault degradation lifecycle, the harder the fault is to detect.
TEST EQUIPMENT CAPABILITY
If intermittency can be found and a resultant fix carried out successfully, this provides a
level of integrity to the item under test, ie the circuit under test, or unit under test
(UUT), since it shows no signs of intermittency and can perform as designed without
any deviations or minute changes in circuit characteristics. Whether specifically
looking for intermittency or confirming the integrity of a UUT, functional testing must
be carried out to ensure that the functionality of the system is as designed. Moreover,
modern functional testing technology covers a large proportion of failures; this is
particularly true in those cases where the Automatic Test Equipment (ATE) testing
regime matures alongside the system or unit under test, due to the application of
historical in-service experience.
However, despite every effort to ensure the detection of known faults using ATE and
traditional testing methods/equipment such as continuity tests, time domain
reflectometers (TDRs) etc, these can only test
the UUT at a single point in time. Additionally, sample rates and digital averaging
techniques used to filter out noise have been developed in digital test equipment to
improve numerical accuracy in measuring circuit attributes, eg resistance. However, the
combination of measuring at a single point in time, sampling rates and digital averaging
results in any intermittent occurrences being missed completely or masked. Therefore,
successfully finding a randomly occurring fault or micro-change in a UUT requires a
change from these methods, to an approach that significantly increases the probability
of detection. Digital accuracy is not the solution to detecting random, intermittent
events: the objective is to detect the event, not to measure it.
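A rough calculation shows why point-in-time sampling and averaging work against detection. The sketch below assumes an idealised tester that takes instantaneous samples at a fixed interval, and an intermittent event that starts at a random time and holds a constant resistance deviation for its duration. The function names and all numerical values are illustrative assumptions, not measurements from any specific test equipment.

```python
def p_sample_hits_event(event_duration_s: float, sample_interval_s: float) -> float:
    """Chance that periodic point sampling lands inside one randomly timed event."""
    return min(1.0, event_duration_s / sample_interval_s)


def p_detect_any(event_duration_s: float, sample_interval_s: float,
                 n_events: int) -> float:
    """Chance of catching at least one of n independent, randomly timed events."""
    p_one = p_sample_hits_event(event_duration_s, sample_interval_s)
    return 1.0 - (1.0 - p_one) ** n_events


def averaged_reading_shift(spike_ohms: float, event_duration_s: float,
                           window_s: float) -> float:
    """Change in an averaged reading caused by one spike inside the window."""
    return spike_ohms * min(1.0, event_duration_s / window_s)


# Illustrative values: a 100 ns event, one sample per millisecond,
# and a 10 ohm deviation averaged over a 1 ms window.
event = 100e-9
interval = 1e-3

print(f"P(hit one event):        {p_sample_hits_event(event, interval):.6f}")
print(f"P(hit any of 1000):      {p_detect_any(event, interval, 1000):.4f}")
print(f"shift in averaged value: {averaged_reading_shift(10.0, event, interval):.6f} ohm")
```

With these figures a single event is caught about once in ten thousand attempts, and when it does fall inside an averaging window it moves the reading by around a milliohm, well within what noise filtering is designed to suppress. Continuous monitoring of every line, in effect a sample interval no longer than the event, is what lifts the first probability towards one.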
CORPORATE TECHNICAL KNOWLEDGE
With the lack of a symptom to influence diagnostic thought processes and
decision-making, a logical next step is to obtain additional, specialist information. There is a
huge range of potential sources of such information and advice, ranging from
maintenance colleagues on other shifts or locations, to contacting the OEM or Design
Authority for advice, to analysis of the Maintenance Manual, to analysis of maintenance
data for the specific aircraft or for the fleet type. The first shortfall in knowledge to be
encountered is with colleagues or maybe even the manufacturers, because their
knowledge is focussed on the characteristics of the system when it is working correctly,
not when it is deviating from normal operating conditions. If the type of fault event has
occurred before, there is the potential that a maintenance colleague or technical
specialist will have come across it and will recall the actual corrective actions
that genuinely remedied the root cause of the fault, without repeat occurrences. If the
operating agency and the design authority have a proactive and learning relationship,
there may even be a process in place to capture diagnostic knowledge such as this in
order to integrate it into maintenance documentation and procedures. But the fault
symptom may not have occurred before.
If specialist knowledge cannot be sourced from technical publications or staff, then the
remaining option is historic maintenance data. Analysis of the data may show whether
the same fault has occurred before on the aircraft or on another of the same type, and
what action successfully rectified the problem; or it may show a trend of related failures
on the same aircraft which would give an indication of where to focus fault diagnosis
activity next. The way that the maintenance data is captured, configured and
cross-referred will have a huge bearing on the ease and extent to which it can be
successfully interrogated to inform the fault diagnosis process.
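As a minimal sketch of the kind of interrogation described above, the following groups maintenance log entries by aircraft and system and flags those with repeat arisings inside a chosen window. The entry layout, the 30-day window and the sample data are hypothetical; a real maintenance information system would hold far richer records.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import NamedTuple


class LogEntry(NamedTuple):
    """Hypothetical maintenance log entry."""
    aircraft: str
    system: str
    reported_on: date
    action: str


def repeat_arisings(entries: list[LogEntry],
                    window_days: int = 30) -> dict[tuple[str, str], list[LogEntry]]:
    """Return the history of each aircraft/system pair that has two
    consecutive reports no more than window_days apart."""
    grouped: dict[tuple[str, str], list[LogEntry]] = defaultdict(list)
    for entry in entries:
        grouped[(entry.aircraft, entry.system)].append(entry)

    window = timedelta(days=window_days)
    repeats = {}
    for key, items in grouped.items():
        items.sort(key=lambda e: e.reported_on)
        # Compare each report with the next one in date order.
        if any(later.reported_on - earlier.reported_on <= window
               for earlier, later in zip(items, items[1:])):
            repeats[key] = items
    return repeats


# Made-up sample data for illustration.
log = [
    LogEntry("AC-01", "radar", date(2009, 3, 1), "replaced LRU Y"),
    LogEntry("AC-01", "radar", date(2009, 3, 9), "replaced LRU Y"),
    LogEntry("AC-02", "radar", date(2009, 3, 2), "re-seated connector"),
    LogEntry("AC-01", "hydraulics", date(2009, 1, 5), "replaced pump"),
]

for (aircraft, system), history in repeat_arisings(log).items():
    print(aircraft, system, [e.action for e in history])
```

A query like this only works if each entry records the aircraft, system and action in a consistent, cross-referable form, which is the point made above about how the data is captured and configured.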
THE INFLUENCE OF HUMAN FACTORS
Human Factors refers to the study of human capabilities and limitations in the
workplace; it considers the interaction of personnel, the equipment they use, the written
and verbal procedures and rules that they follow, and the environmental conditions of
any system [8]. Having identified how issues concerning absent symptoms, shortfalls in
technical knowledge and maintenance culture conspire to prevent diagnostic success, it
is necessary to fully explore the underlying Human Factors and behaviours which
contribute to instances of diagnostic failure. These contributory factors affect the
smooth passage of the diagnostic process and the manner in which maintenance data
is captured. These factors are discussed in the following paragraphs with respect to the
m-SHEL model [b] [9].
m: Management Control of the System
The system of work within an aviation operation is highly complex and has a number
of major influences bearing on it at any one time, and not necessarily in a
complementary manner. These influences range from profit, to safety, to long-term
business objectives, to maintenance policy, to resource limitations. Each organisation
may implement and manage its system of work differently, but they will all have very
similar objectives that are fundamental to their success. In short, they all want to
achieve maximum output (ie profit, or military effect) to meet the customer's needs with
the minimum consumption of input resources (ie spares, direct/indirect costs).
In most cases these organisations will implement Performance Management systems,
incorporating the collation and trend analysis of Key Performance Indicators (KPIs).
However, there is a growing body of evidence that demonstrates that slavish adherence
to these KPIs can actually undermine achievement of the business objectives, because
individuals within the system of work modify their behaviour to pursue success against
KPIs instead of against what matters: the needs of the customer [10]. If the aviation
organisation measures success with KPIs for number of flying hours achieved, or the
number of serviceable aircraft available at the start of the flying day, or the percentage
of flights which embarked and took off on time, then the organisation's staff will
pursue those targets. Refer this concept back to the dilemma of the airworthiness
decision-maker described earlier, and it is evident that such business pressures can
influence the action taken. This often leads to the short-term palliative approach,
the speculative LRU change for example, rather than the sustainment-driven approach
which is to identify the root cause of the fault.
Aside from Performance Management, access to the right data is crucial to business and
to enhancing maintenance effectiveness. Depending on the magnitude of the
organisation's operations, there may be a high turnover of airframes used, either within
or across operating sites, and across normal operations and maintenance activity.
Developing a knowledge-based system to underpin the sustained availability of these
assets necessitates information systems that allow the effective storage, sharing and
accessibility of the knowledge and data required to run the business effectively.
However, this is more complex to achieve if the operating organisation, the supply-chain
management organisation and the MRO organisation are separate business
entities.
[b] Edwards' SHEL model (Software, Hardware, Environment, Liveware) for Human Factors was modified by Kawano by the addition of an "m" for Management (Control of the System).
The organisation's maintenance culture will also be a major influence on the
management system, and the effects of this will inexorably filter through to the tactical level to affect the manner in which the organisation deals with NFFs. Maintenance
cultures are shaped by an array of factors including the parent organisation's culture, the
organisational aims, training, nationality, the experiences and knowledge of its staff and
the intensity of the imperative for the business to perform. In the context of NFFs, that
culture influences maintenance practice which, in turn, influences how the maintenance
organisation responds to NFF occurrence at the working level, ie by shop floor or flight
line maintenance staff. A full psychological profile of aircraft maintenance staff is
beyond the scope of this paper; suffice to say that they like to get things done and they
like to successfully meet targets. For many, crisis management and fire-fighting are
more fulfilling approaches to their jobs, and more fun, than studiously analysing data
and forward planning [11]. This "can do" attitude has its place, but if channelled
inappropriately, however well intentioned, it can lead to the scenario whereby
something tangible has to be seen to be done.
An airline operator or a squadron pilot planning for an operational mission may not
perceive extended fault diagnosis activity or analysis of maintenance data to find root
causes as being a pro-active approach to returning an aircraft to use. It could even be
interpreted as "just the engineers fiddling with the aircraft". Compare that scenario with
technicians running round changing lots of LRUs, and the illusion of activity is then
easily associated with a concerted effort to solve the aircraft's problems. This approach
becomes established as a norm, and the more established it becomes the harder it is for
individuals within the system of work to assert themselves to break the pattern.
In the absence of a clear symptom or symptoms, and with specialist technical advice
that may well be assumptions-based, rather than knowledge-based, the airworthiness
decision-maker is back to their list of possible courses of action. If time is pressing and
there are spare LRUs readily available, a speculative LRU change is a course of action that is often selected;
especially so if the airworthiness consequences of a recurrence were felt to be relatively
minor. The major emphasis on LRUs can be seen in many arenas within aviation and
they are like the visible tip of the NFF iceberg referred to earlier. The electrical contact components, ie
the wiring, connectors and circuit breakers, are the glue that holds together systems of
LRUs and other components, but they are more time consuming to test and maintain
using conventional methods and are less accessible than LRUs. These "glue"
components are classed as a system in their own right: the Electrical Wiring and
Interconnection System (EWIS) [12]. LRUs, on the other hand, are easier to see, to replace,
to supply-chain manage and to apply BITE [c] to than alternative and more mundane
EWIS components, and so they become the focus of attention. However, in
considering connectors and the like and comparing their vulnerability to that of LRUs, it is
logical that EWIS components are the very aspect to focus on: but they are part of the
NFF iceberg's main bulk, submerged well out of sight below the waterline.
[c] Built In Test Equipment. To carry out BITE on an item or system refers to the act of running its built-in test functions.
S: Software
In this Human Factors context, software refers to maintenance procedures, technical
documentation, checklist layout etc. Having identified the complexity inherent in sharing and integrating data end-to-end across the specific aviation enterprise, or system
of work, the next challenge concerns the corporate knowledge described previously. In
particular, do maintenance manuals include troubleshooting information that is
knowledge-based and that increases the likelihood that the maintenance technicians can
diagnose and fix the fault right first time, every time? If they do not include that kind
of information, why is this? Is this because the platform is brand new and that kind of
knowledge is still developing? Or is it because it has never been requested by the
customer? Or is it because the cost of including such data is prohibitive to the
operators? Or is it because it is just too complex a task to collate all the necessary data?
Collation of such data would be problematic depending on the quality and completeness
of data captured for maintenance activity. Individual technicians will perceive and
interpret symptoms and faults differently, or not at all, and they will record this
information in a mind-boggling variety of ways. In the case of a hydraulic pipe failure,
one of the symptoms is likely to be the presence of hydraulic fluid leaking in a specific
area on the aircraft. The symptom could be recorded as "split pipe", "burst pipe",
"hydraulic leak", "seeping fluid", "damaged" or "unserviceable": the possible permutations are
numerous. But where does this variability come from? Human beings in the same
scenario would take in the same raw data as each other via their senses and
corresponding sensors (eyes, ears, nose etc); thereafter, the raw data is filtered by the application of experience, knowledge, values and culture before the end result is internally
re-presented to the individual [13]. The potential variability of the internal re-presentation
process is infinite given the variability of how every human being would filter and
modify that raw information. The effect is then exacerbated when more than one person
is involved in the scenario, especially if they are involved at different points in the fault
diagnosis chain of events.
The vital point to note is not how human variability influences the maintenance system
of work, but simply to note that it does; therefore, a method of mitigating its effects is
required to improve the establishment of the corporate knowledge needed to address NFF problems.
H: Hardware
In this context hardware refers to tools, test equipment, physical structure of the aircraft
etc. Modern culture is such that any new product has to be bigger (or smaller!), better,
lighter and faster than its predecessor. These are the attributes we associate with
progress, but often at the expense of overlooking what a product is for and its fitness-for-purpose
at providing that function in a sustained and reliable way. The adages of "if
it isn't broken, don't fix it" and "keep it simple, stupid" may be associated with more mature individuals in organisations, but the world of aviation has long known that it is
more important to focus on what a product is required to do, and for it to do that
effectively, repeatedly and at minimum cost. With the growth of IT solutions, test
equipment methodology is often perceived as needing to keep pace (necessarily so if
obsolescence is a high risk), but is it always necessary? For detecting intermittency it is
not necessary, as digital accuracy has already been discussed as being an impediment
rather than a benefit.
E: Environment
This ranges from the physical environment, such as conditions on aircraft operating
surfaces, to the work environment, such as working patterns. Identifying the root cause
of a fault can be hampered by working systems such as trade structure and shift
systems. Trade structure can become an issue if there are feasible solutions involving
more than one system and thus more than one trade: a fault investigation on a flying
controls system, for example, involving avionics and mechanical trades. The solution
selected may owe more to the strength of character of trade supervisors than to
analysis of objective facts and data! Similarly, colleagues on subsequent shifts or
reallocated from other tasks may question the diagnostic process undertaken so far and choose to take the activity in a different direction, rightly or wrongly. All of these
factors impede the smooth chain of events from symptom to first-time-fix, and
complicate unnecessarily the audit trail of corresponding maintenance data.
L: Liveware
This term refers to people. It includes the individual at the centre of an activity and the
other people associated with the activity, in whatever guise that may be. The major
influence of people has already been described in terms of how the variability of their
behaviour affects maintenance data capture and the implementation of the diagnostic
chain of events. A further human factor which impacts on the effectiveness of corporate
technical knowledge is knowledge retention and recall. If there is a human factors-caused
incident, such as an aircraft towing incident involving ground equipment and an
airframe, the details and the conclusions of the resulting investigation are publicised
widely across the organisation. Several months later it happens again; "how can that
possibly happen again?" the managers ask themselves.
How should the organisation genuinely learn from what had happened before?
The key to this scenario is whether the root cause of the original incident was established, and whether a mitigating countermeasure was embedded fully into working
practices (checklists, manuals, training syllabi etc). If the countermeasure is
insufficiently embedded (for example, maintenance staff are merely briefed about the
incident and asked to take more care in future), then the incident soon fades from recent
memory, the effects are not as visual as they once were, and the corporate knowledge
may be diluted further by the influx of any new staff. The same is equally true for
how the maintenance culture deals with NFFs. If the predominant approach is "shotgun
maintenance", ie speculative LRU replacement, then there is limited opportunity to
retain and transfer knowledge between individuals of what rectification actions are
genuinely successful in eradicating the root cause of specific faults. Again, the
organisation has not genuinely learned from what has happened before.
A PROPOSED ROUTE TO DIAGNOSTIC SUCCESS
"Insanity: doing the same thing over and over again
and expecting different results" [d].
If diagnostic failure is caused by or exacerbated by the current test methodology used to
attempt detection of intermittent faults, by the non-availability of relevant technical data and by the effects of certain maintenance behaviours and practices, then it follows that
different things - rather than the same thing over and over again - must be done to
assure diagnostic success. There are 3 crucial elements that must all be dealt with using
new approaches to increase significantly the prospect of diagnostic success in order to
drive down NFF arising rates and, ultimately, increase aircraft availability for the
customer. These are:
1. The maintenance data approach.
2. The maintenance management approach.
3. The Intermittent Fault Detection approach.
1. THE MAINTENANCE DATA APPROACH
Variability in its many forms has been a significant theme in the human factors
discussions in this paper, as has the way it can lead to inefficiency and waste in aircraft
operations. Given the problems that exist in terms of maintenance and operational practice
and culture, what needs to be done differently to vastly improve aviation solutions to
NFF? Data capture and analysis comprise the first step in a journey to "work smarter,
not harder" in order to significantly reduce NFF occurrences.
Data capture has become much easier with the introduction of a multitude of BITE for
on-board systems, data-logs of various data-buses and other health monitoring systems.
Couple this with the vast amount of data captured for each flight in terms of the flight
plan, debrief, work order cards etc, and it is evident that there is a substantial amount of
data to analyse. Consider this alongside the burgeoning capability to
manipulate and trend information using today's computer and software technology, and
the situation does beg the question: so why is there a problem?
In aviation organisations there is an array of disciplines, personalities, departments and
cultures. This creates information gaps [e] in the flow and recording of relevant and
crucial data. As a result, the effective analysis of any captured data is adversely
affected by the number and extent of the information gaps that exist in a given
organisation's system.
[d] Attributed to Albert Einstein, physicist, 1879-1955.
[e] An Information Gap can be defined as a break in common communication between persons,
departments etc, even though they may speak the same language. For example, a pilot will not necessarily think or speak or use the same terminology as the technician, and as a result an Information Gap is formed.
An example that illustrates simply this phenomenon is the person who is the owner,
driver and maintainer of a car. If a person performs all related operational and
maintenance tasks, including being the budget holder, they have complete ownership,
accountability and responsibility for all operational aspects of the car; hence no
information gaps are induced. Therefore, any symptom that arises is logged and
translated into a possible fault and an associated fix, and its cost is remembered; if the fault
is intermittent and the fix was not effective, then a cost-benefit analysis would take
place to ensure that any subsequent course of action which may be taken balances
operational commitments and regulatory requirements. Therefore, if the same symptom
occurs again and again, this is quickly realised, so depending on the last action taken
other possible rectification activities might well take place until the symptom stops
arising. In reality, it is highly likely that different courses of action would be taken by
this lone operator and it is highly unlikely any of the applied fixes or courses of action
would be repeated. Therefore, if the symptom was "the car engine keeps cutting out at
idle", the individual is unlikely to keep changing the engine management control unit at
600 per unit; in fact this might be the last item they would change given its high value, unless it is definitively identified as the source of the fault's root cause. This is a clearly
defined closed-loop system in which all the data is presented and correlated by the same
person for the same vehicle, and by all the disciplines ie owner, budget holder,
maintainer and operator.
Introducing a second person into the scenario to borrow the car complicates the
system. As they experience the symptom, they begin to process the raw information
presented to them and on their return they report their findings to the car's owner; this
may even be coupled with emotional anecdotes depending on the purpose behind
borrowing the car and the extent of the problem experienced. Depending on the second person's expertise and experience, the presented synopsis of the car's problem could
well range from "the car is noisy" with no supporting information, to a full-blown
diagnosis of the possible fault(s). This is how information gaps begin. With multiple
cars, multiple drivers and multiple repair staff involved, the information flow is
impeded further whilst the number and extent of information gaps grow. In short, there
is no coherent correlation between the data pertaining to symptom - fault - fix.
The foundation of a successful data capture process is correct definition at the outset,
coupled with a robust and standardised format. It is vital that the symptom, as
experienced by the user, is captured using a standard and repeatable methodology. These symptoms can then be codified and entered into a searchable data field within
the operator/maintenance data management system. This eliminates two common
information gaps. Firstly, the fault debrief process with the operator/user is carried out
in their technical language, which prevents information from being missed. Secondly,
the codification of the symptom allows searching and trending on the database field and
therefore successfully eliminates the free-text syndrome. Base-lining of the symptom
data using this standard codification approach not only bridges the information gaps,
more significantly it enables the direct correlation of the symptom through to the fault,
and then to the actual fix.
To illustrate with another car example, consider the following symptom descriptions:
"flat tyre", "puncture", "flat", "nail in tyre", "tyre flat". They all describe the same fault, but in a free-text
database these would be seen as 5 different symptoms, none of which states which tyre
has the problem. In addition, the text focuses on the fault and omits the symptom that
was experienced by the driver of the car, which was that the car was hard to steer. The
standardising of symptoms, which can be constructed using historical data, experience
and knowledge, narrows the possible outcome scenarios; once this has been achieved,
the symptom can be codified into a discrete code that uses predetermined letters to
represent certain states and conditions which are common throughout all systems within
the aircraft. Referring back to the car example, to assist the driver in remembering as
much of the information as practical, it is imperative to structure the debrief in the logical
order in which events were experienced by the driver, and not to debrief the operator as an engineer or
technician might instinctively tend to. Therefore, the resulting debrief could look like
Table 1:
Field          Options                                              Code
System         Steering                                             ST
Symptom        Steering is: Hard, Loose, Impossible, X Not listed   H
Speed (mph)    42                                                   42
Weather        Dry, Wet, Icy                                        D
Road Surface   Tar, Loose, Gravel, Boulder, Mud                     T
Road Debris    None, Rocks, Glass, Screws/nails                     S
First Visual   Tyre: OK, Low, Flat                                  F
Second Visual  Damage: OK, Scuffed, Cut                             C
Word Picture
Table 1 - Example codification of a fault's symptom
The resultant code would be something like this: STH42DTSFC. In this scenario the
last 2 aspects of this symptom diagnostic would probably be carried out by the receiving
staff of the hire car company. Similarly, for aircraft scenarios, flight line mechanics
might be required to complete the symptom debrief coding process in some cases, to
include, where appropriate, details of warning captions, BITE codes etc.
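As a minimal sketch of how the codification in Table 1 might be implemented, the Python below builds and decodes a steering symptom code. The field order and the single-letter option codes are assumptions inferred from the example code STH42DTSFC (the first letter of each listed option, with X for "Not listed"); the names `encode_symptom` and `decode_symptom` are hypothetical and do not come from any real system.

```python
import re

# Assumed option letters: the first letter of each option listed in Table 1,
# with "X" for "Not listed". Field order follows the example STH42DTSFC.
SYSTEMS = {"ST": "Steering"}
FIELDS = [
    ("symptom", {"H": "Hard", "L": "Loose", "I": "Impossible", "X": "Not listed"}),
    ("speed_mph", None),  # numeric field, written as digits
    ("weather", {"D": "Dry", "W": "Wet", "I": "Icy"}),
    ("road_surface", {"T": "Tar", "L": "Loose", "G": "Gravel", "B": "Boulder", "M": "Mud"}),
    ("road_debris", {"N": "None", "R": "Rocks", "G": "Glass", "S": "Screws/nails"}),
    ("first_visual", {"O": "OK", "L": "Low", "F": "Flat"}),
    ("second_visual", {"O": "OK", "S": "Scuffed", "C": "Cut"}),
]
# Two-letter system, one-letter symptom, digits for speed, then five one-letter fields.
CODE_PATTERN = re.compile(r"([A-Z]{2})([A-Z])(\d+)([A-Z])([A-Z])([A-Z])([A-Z])([A-Z])")


def encode_symptom(system, **answers):
    """Build a symptom code from the debrief answers, validating each option."""
    if system not in SYSTEMS:
        raise ValueError(f"unknown system code: {system}")
    parts = [system]
    for name, options in FIELDS:
        value = answers[name]
        if options is None:
            parts.append(str(int(value)))
        elif value in options:
            parts.append(value)
        else:
            raise ValueError(f"invalid option {value!r} for {name}")
    return "".join(parts)


def decode_symptom(code):
    """Expand a symptom code back into readable debrief answers."""
    match = CODE_PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"malformed symptom code: {code}")
    system, *values = match.groups()
    decoded = {"system": SYSTEMS[system]}
    for (name, options), value in zip(FIELDS, values):
        decoded[name] = int(value) if options is None else options[value]
    return decoded
```

Because every field is validated against a fixed option list, two people debriefing the same event cannot record it in two different spellings, which is the point of replacing free text with a code.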
The output of this base-lined, standardised symptom codification activity is that the
process is now repeatable, and can thus be carried out by the operator directly and
without necessarily being dependent on an experienced technician to facilitate a debrief,
noting that different technicians would each facilitate the debrief in a different way, asking different questions (some relevant, some less so) and in a different order. The
captured symptom code is meaningful and concise, and does away with the free-text
problem so that there is now a genuine capability to conduct trend analysis on
the symptoms being experienced. Furthermore, this enables trend
analysis of parts of the symptom, for example a search for all ST* symptoms reported.
This approach of part trending allows the output to be defined according to data needs.
Thus the first part of the code is more useful to strategic management because they
might only need to know the total number of steering arisings that their customers are
having. The entire code would be useful to the car's in-service support/maintenance
organisation because it could be used to influence changes to manuals, to training or to
modify the systems concerned.
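The part-trending idea can be sketched with a simple wildcard search over recorded codes. The example below is an illustration only: the list of reported codes is made up, the "BR" system prefix is a hypothetical non-steering system, and `count_matching` is an assumed helper name.

```python
from collections import Counter
from fnmatch import fnmatchcase


def count_matching(codes, pattern):
    """Count reported symptom codes matching a wildcard pattern such as 'ST*'."""
    return Counter(code for code in codes if fnmatchcase(code, pattern))


# Hypothetical debrief records; only STH42DTSFC appears in the text.
reported = ["STH42DTSFC", "STH30WTNLO", "STL55DTNOO", "BRS20IGNOO", "STH42DTSFC"]

steering = count_matching(reported, "ST*")        # all steering arisings
hard_steering = count_matching(reported, "STH*")  # narrower: steering reported as hard

print(sum(steering.values()))       # total steering arisings, for strategic reporting
print(steering.most_common(1))      # most frequent full code, for the support organisation
```

The same data therefore serves both audiences described above: a short prefix gives the headline count, while the full code preserves the detail needed to change manuals, training or the system itself.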
Overall, while the chosen example is a very simple one, the principles are the same for
more complex scenarios. The key to its success is the definition of the symptom as
experienced by the user of the equipment and in the language of the user. We have
termed this concept Symptom Diagnostics.
Applying Symptom Diagnostics to Intermittency and NFF
Now that the first part of data capture for a given fault has been constructed and
standardised, this codified symptom can be linked to the diagnosed fault and the
resulting, successful fix. Given the benefits in the ability to trend symptoms and their
related faults with the actual fix, this approach is therefore an invaluable tool to apply
when tackling intermittent faults and the NFF phenomenon.
To eliminate NFF, the fault has to be solved at its root cause (like any other fault, in
fact), and while a number of speculative solutions may be attempted to eliminate the
fault, the need to quickly identify subsequent symptoms is crucial in order to ascertain if
the applied fix has been effective, ie it has been a diagnostic success. This use of Symptom Diagnostics trending is pivotal to ensuring that repeat fixes are not carried out
and that other potential fixes are considered.
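A minimal sketch of that check, assuming the maintenance record for one airframe is available as a time-ordered list of (symptom code, fix applied) pairs. The function names, fix descriptions and symptom code "X1" are hypothetical placeholders.

```python
def fixes_already_tried(history, symptom_code):
    """Return the fixes previously applied for this symptom, in order, without repeats."""
    tried = []
    for code, fix in history:
        if code == symptom_code and fix not in tried:
            tried.append(fix)
    return tried


def next_candidates(history, symptom_code, candidate_fixes):
    """Filter out candidate fixes that have already been attempted for the symptom."""
    tried = set(fixes_already_tried(history, symptom_code))
    return [fix for fix in candidate_fixes if fix not in tried]


# Hypothetical history for one airframe: the symptom recurred after each fix.
history = [("X1", "replace LRU Y"), ("X1", "replace LRU Y"), ("X1", "replace LRU Z")]
print(next_candidates(history, "X1", ["replace LRU Y", "replace LRU Z", "inspect connector"]))
```

A recurring symptom code is itself the evidence that the earlier fixes were not diagnostic successes, so anything already in the history is dropped from the candidate list.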
While Symptom Diagnostics trending can enhance NFF and maintenance solutions
activity at the single airframe level, it becomes far more powerful when applied over an
entire fleet. As the Symptom Diagnostics successful fix data builds up for the fleet, it
allows technical staff to see what proportion of different maintenance activities led to a
diagnostic success for a specific Symptom code, for example: for fault X the successful
fixes were 80% for LRU Z and 20% for LRU Y. This data could be used to inform the
technicians and airworthiness decision-makers, so that they use the analysed historical data to direct resources for maximum and long-term effect. Considering the concepts already
developed in this paper, if LRU Y takes 20 minutes to replace and LRU Z takes 2 hours
to replace, then it is possible for this to influence the diagnostic process; however,
coupled with the aforementioned Symptom Diagnostic data there is now the opportunity
to make a far more informed decision, based on knowledge and not on assumptions and
not on stock levels or on time-to-replace.
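To make the fleet-level calculation concrete, the sketch below derives the share of successful fixes per symptom code from symptom-fault-fix records. The record layout, the function name `fix_success_rates` and the sample counts are assumptions chosen to reproduce the 80%/20% split used in the example above.

```python
from collections import Counter, defaultdict


def fix_success_rates(records):
    """Return, per symptom code, the share of successful fixes attributed to each action.

    `records` is an iterable of (symptom_code, fix_action, was_successful) tuples.
    """
    successes = defaultdict(Counter)
    for symptom_code, fix_action, was_successful in records:
        if was_successful:
            successes[symptom_code][fix_action] += 1
    rates = {}
    for symptom_code, counts in successes.items():
        total = sum(counts.values())
        rates[symptom_code] = {fix: n / total for fix, n in counts.items()}
    return rates


# Hypothetical fleet records for one symptom code, "X".
records = (
    [("X", "LRU Z", True)] * 8      # LRU Z change cured the symptom 8 times
    + [("X", "LRU Y", True)] * 2    # LRU Y change cured it twice
    + [("X", "LRU Y", False)] * 5   # LRU Y changed 5 more times without success
)
rates = fix_success_rates(records)
print(rates["X"])
```

In this made-up data LRU Y is changed almost as often as LRU Z, perhaps because it is quicker to replace, yet it accounts for only a fifth of the successful fixes. That is the kind of evidence that lets the decision rest on knowledge rather than on stock levels or time-to-replace.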
The symptom-fault-fix methodology provides the foundation to tackle both hard and,
more importantly, intermittent faults. For an aviation enterprise with established
corporate maintenance knowledge for a mature platform, Symptom Diagnostics may not be an essential methodology for finding hard faults, but it does enable technicians to
find fault root causes more quickly. The ultimate application of this approach would be for
Symptom Diagnostics codes to be compiled and transmitted by the flight crew while the
aircraft is still airborne. This would enable maintenance staff to prepare spares and
resources in advance of the aircraft landing, similar to a Formula One pit crew being in
position with their tools and tyres prior to the car entering the pits. In addition to this
operational advantage, Symptom Diagnostics provides the fundamental base-lining
methodology for tackling intermittency. Where speculative
maintenance is carried out against a backdrop of time pressure and KPIs to meet, then without a
standardised means of accurately recording recurring symptoms the random nature of the intermittent fault becomes hidden in a plethora of unnecessary maintenance
activities and uncertainty, leading to increases in operating costs and inefficiency. This
is back in the realm of the NFF iceberg, deep below the waterline.
By using a Symptom Diagnostics method for trending for a fleet asset it is now possible
to ascertain whether a speculative LRU change has been successful on an aircraft, or not.
Therefore, without the need for expensive test set equipment, a "Ship or Shelve" policy
could be used to provide the breathing space required to prevent an LRU from being returned unnecessarily for repair or overhaul. For example, if a symptom is reported and cannot be
reproduced and is believed to result from an intermittent fault, then a considered LRU
replacement could be carried out. The replaced LRU would then be quarantined on a
designated rack within the stores system with an appropriate identification label stating
the aircraft registration, reason for removal, time, date, Symptom Diagnostics code etc.
After a predetermined time or conditions have been met, ie if the symptom did not
reoccur within a specified number of flying hours/flights/usage cycles, then the LRU
would be categorised as unserviceable and inputted into the reverse supply chain for
repair. Alternatively, if the symptom did reoccur within the specified period, then the
LRU could be categorised as serviceable, subject to any prerequisite functional checks,
and returned to use. There are clear airworthiness implications to be considered with
the ship or shelve policy, subject to the safety/performance-criticality of the LRU
concerned, but the policy has been introduced with some success by certain operators [14].
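The ship or shelve decision rule described above can be written down as a short sketch. The class and function names are hypothetical, and the observation period is expressed in flying hours only for simplicity; the text allows flights or usage cycles as alternatives. Airworthiness considerations and prerequisite functional checks sit outside this logic.

```python
from dataclasses import dataclass
from enum import Enum


class Disposition(Enum):
    QUARANTINED = "hold on quarantine rack"
    UNSERVICEABLE = "send for repair"
    SERVICEABLE = "return to use after functional checks"


@dataclass
class ShelvedLRU:
    part_number: str
    aircraft_registration: str
    symptom_code: str
    flying_hours_at_removal: float


def assess(lru, current_flying_hours, symptom_recurred, observation_hours):
    """Apply the ship-or-shelve rule to a quarantined LRU."""
    if symptom_recurred:
        # The symptom came back with a different LRU fitted, so the removed
        # unit was probably not the root cause.
        return Disposition.SERVICEABLE
    if current_flying_hours - lru.flying_hours_at_removal >= observation_hours:
        # The symptom stayed away for the whole observation period, which
        # points to the removed unit as the source of the fault.
        return Disposition.UNSERVICEABLE
    # Still inside the observation period: keep the unit on the rack.
    return Disposition.QUARANTINED
```

For example, a unit removed at 1000 flying hours with a 50-hour observation period would stay quarantined at 1010 hours, be shipped for repair at 1050 hours if the symptom had not recurred, and be released as serviceable as soon as the symptom recurred on the aircraft.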
Implementation of the procedural solutions outlined above enables the maintenance
staff to identify faults' root causes much earlier in the diagnostic process than has
traditionally been the case in NFF occurrences. In doing so, the resultant data reduces
the instances of erroneous LRU replacements because trend analysis of the data
highlights "rogue" LRUs or "rogue" [f] aircraft. Having highlighted rogue assets and taken
the appropriate action to isolate or limit their use, the next step in achieving diagnostic
success is detection of the intermittent fault, fixing its root cause and returning the asset
to service with no restrictions and also with, just as importantly, vastly increased
confidence that the intermittency has been eradicated.
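The paper defines a rogue asset as one responsible for more than the average number of recorded symptom occurrences. A minimal sketch of that test follows; the function name and serial numbers are hypothetical, and the average is taken only over assets that appear in the data at least once, which is an assumption.

```python
from collections import Counter
from statistics import mean


def find_rogues(occurrences):
    """Return assets with more symptom occurrences than the average per asset.

    `occurrences` is an iterable of asset identifiers, one entry per recorded symptom.
    """
    counts = Counter(occurrences)
    average = mean(counts.values())
    return sorted(asset for asset, n in counts.items() if n > average)


# Hypothetical LRU serial numbers, one entry per symptom arising.
arisings = ["SN001", "SN001", "SN001", "SN001", "SN002", "SN003"]
print(find_rogues(arisings))
```

In practice an operator would probably want a stricter threshold than "above average", since roughly half of any population can sit above its mean; the simple rule is used here only because it mirrors the stated definition.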
2. THE MAINTENANCE MANAGEMENT APPROACH
This paper has examined the maintenance management and maintenance practice issues
that contribute to NFF occurrences, or which prevent the arising rates from improving.
The major hurdles which stand out in terms of their relative effect on diagnostic failure are the disproportionate focus on LRUs, the influence of KPIs and business pressures on
fault diagnosis processes and the quantity of LRUs erroneously sent for repair.
The engineering and degradation factors discussed in this paper, coupled with excessive
NFF rates for LRUs at second line repair workshops, all suggest that LRU replacement is being
treated as a "sticking plaster" approach to rectifying NFF occurrences: treating the
symptom, not the cause. Degradation mechanisms and operating environments mean
that the 3 Cs are the more significant source of NFF root causes and that this is where
diagnostic resources, including training, should increasingly be directed. However,
before committing to this approach, use of Symptom Diagnostics data would provide
the checks and balances to indicate that the approach was correct in practice and not just
in theory. In parallel with a revised emphasis away from LRU replacements,
implementation of a ship or shelve policy would complement all initiatives to reduce
unnecessary LRU replacements and would successfully minimise the throughput of
LRUs in the reverse supply chain.
[f] Rogue is defined as an LRU or aircraft which, through data analysis, is proven to be responsible for more than the average number of symptom occurrences being recorded.
KPIs can have a positive motivational effect or can sub-optimise the performance of a
system of work. If revising them is the right thing for an organisation to do, then the
Performance Management system should be amended in favour of those maintenance KPIs
which focus on long-term effectiveness, rather than short-term effect. The first-time-fix
rate for, say, the top ten critical faults could be the major KPI of a maintenance
organisation's diagnostic capability, particularly if the figures display a continual,
downward trend.
Finally, the senior level of maintenance management should support airworthiness
decision-makers in all efforts to prevent shotgun maintenance and focus on diagnostic success to identify faults' root causes. Company policy and maintenance standard
operating procedures (SOPs) should reflect this and provide a documented process to
follow to vastly increase the likelihood of a first-time-fix in conjunction with the
additional measures outlined above.
3. THE INTERMITTENT FAULT DETECTION APPROACH
As previously alluded to with regard to human factors "Hardware" issues, the inexorable
advances of technology result in an increased focus on and fascination with the accuracy of digital equipment. Despite this, the probability of detecting intermittency is
more relevant to isolating intermittency root causes than digital measuring capability.
Therefore, to increase the probability of detection compared with conventional digital
equipment, a technique is required that is continuous in its ability to detect nanosecond
intermittency over a specified test period, and not sampled or averaged, or limited to a
point-in-time testing method.
Extending the logic of this concept one stage further, if multiple lines of continuous
testing can be carried out simultaneously then, by default, the probability of detecting an
intermittent fault during a specified period of time in a UUT is substantially increased.
Combine this intermittency testing with an appropriate level of environmental
stimulation for the UUT and this testing methodology provides maintenance staff with
increased confidence that every part of the UUT has been subjected to representative
conditions while all test points have been continuously and simultaneously tested.
Analogue neural-networks provide the means to successfully achieve intermittent fault
detection. A very low analogue signal allows the UUT to be tested continuously for a
period of time, and without the aforementioned compromises introduced by digital
techniques. The use of detection-optimised analogue versus measurement-optimised digital equipment means that nanosecond intermittency can be detected successfully.
Widely-available Digital Multi-Meters are limited to point-in-time testing and their
sampling rates restrict their potential capability to detection of millisecond
intermittency; thus there is an increase in the probability of detection by analogue
equipment of over 1 x 10^6. The use of an analogue neural-network
improves matters dramatically, because it uses a method whereby all the circuits of the
UUT are inter-linked in a synaptic-like networked arrangement. The neural-network
allows multiple test points (ie the circuits of the UUT) to be tested simultaneously and
continuously without missing any intermittency events across all the points under test.
Diagram 1 compares the detection capability of the analogue neural-network with that
of the conventional ATE methodology.
Diagram 1 - ATE vs Analogue Neural-Network
Conducting the test simultaneously using a neural-network provides an increase in
probability of detection proportional to the square of the number of circuits under test.
This substantial increase in the probability of detection, combined with the reduction in
the time taken to complete the test (because the testing is performed for multiple points
simultaneously, rather than testing one line at a time), means that exploiting analogue
neural-network equipment to detect and eradicate intermittent faults in electrical and
electronic aerospace components is the most effective test methodology to use.
THE REQUIRED OUTCOMES OF DIAGNOSTIC SUCCESS
Aviation maintenance policy concerns itself with ensuring that aircraft and their systems
are safe and fit-for-purpose throughout their full life cycle. In addition, commercial
demands mean that this is delivered to customers in a sustainable and cost-effective
manner. In short, this means providing the required aircraft availability at minimum
whole-life cost. NFF occurrences and intermittent faults directly affect that simple
equation, hence the required outcome of the route to diagnostic success must be to
underpin and enhance availability levels by enabling the correct diagnosis and
rectification of every fault (including intermittent faults), right first time, every time.
An availability-focused maintenance strategy is not applied using a one-flight-at-a-time
mentality, ie is it airworthy and serviceable for the next flight. If this was the case then
there would not be such considerable efforts invested in developments such as fatigue
life and component life extension programmes. The concept of aircraft structural
integrity is not new, and the approach to continuing airworthiness has spread to all
systems on aircraft, including the integrity of the EWIS [15].
In the context of NFFs, therefore, analogue neural-network Intermittent Fault
Detection equipment should be used to demonstrate the system integrity of a UUT. The
combination of proving system integrity using this testing capability, as well as
confirming serviceability or system functionality using traditional special-to-type
test equipment will lead to enhanced levels of sustained availability of systems and
platforms.
Functional Test + Integrity Test = Increased Availability
The combination of Intermittent Fault Detection using analogue neural-network test
equipment (to enable rectification of the root cause) and functional testing using
existing test equipment can therefore provide the highest level of assurance of system
availability where the system can perform its function without interruptions from
transient faults.
CONCLUSIONS
A No Fault Found occurrence describes the set of circumstances which starts with an
end user experiencing a fault's symptoms and ends in a diagnostic failure. It is a
reported fault for which the root cause cannot be found. NFFs impact on aircraft
availability directly and indirectly, through causing aborted flights and maintenance
rework; and through wasted time/money/resources involved in erroneous, speculative
and avoidable component/LRU replacement, repair and procurement.
Diagnostic failures are caused by:
Maintenance Human Factors - including: the pressure to meet operational
deadlines; an excessive troubleshooting focus on LRUs over EWIS components,
overlooking the 3 Cs (cables/connectors/chassis) in particular; the variability of
data captured for symptoms, faults and fixes; and the retention, accessibility and
integrity of fault-finding and troubleshooting data.
Nanosecond Intermittency - resulting in intermittent symptoms that may not
manifest themselves during fault investigation activity; and which cannot routinely
be detected by conventional ATE because of their sampling rates, single
point-in-time application and digital averaging techniques, all of which sub-optimise
their ability to detect randomly occurring nanosecond intermittency.
To mitigate these obstacles to diagnostic success there are 3 key strategies which must
all be applied:
Symptom Diagnostics - the capture and correlation of standardised
symptom-fault-fix data to improve corporate technical knowledge of successful first-time-fix
methods for real symptoms; this knowledge informs fault diagnosis processes and
diverts attention away from LRU changes and back towards root cause solutions,
especially in the 3 Cs.
Management - NFF-related maintenance management policy must underpin these
new strategies by inculcating the improvements in the system of work through
integration within the performance management system, within technician training
and within NFF fault-finding SOPs.
Intermittent Fault Detection - the use of analogue neural-network test
technology to overcome the shortcomings of digital equipment in this specific
context, to the extent that the probability of detecting intermittency is increased by
multiples of 10^6.
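As an illustration of the Symptom Diagnostics strategy, the following minimal sketch shows how standardised symptom-fault-fix records might be correlated so that, for a reported symptom, the fixes that most often succeeded first time are surfaced first. The record fields, function name and sample data are hypothetical and are not drawn from any real maintenance information system.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MaintenanceRecord:
    """One standardised symptom-fault-fix entry (field names are illustrative)."""
    symptom: str
    fault: str
    fix: str
    fixed_first_time: bool


def rank_fixes_by_symptom(records):
    """For each symptom, list (fault, fix) pairs ordered by first-time-fix count."""
    tally = defaultdict(Counter)
    for record in records:
        # Only fixes that cured the symptom first time count as evidence.
        if record.fixed_first_time:
            tally[record.symptom][(record.fault, record.fix)] += 1
    return {symptom: counts.most_common() for symptom, counts in tally.items()}


# Hypothetical history for a single reported symptom.
records = [
    MaintenanceRecord("radio dropout in flight", "connector pin fretting",
                      "re-terminate connector", True),
    MaintenanceRecord("radio dropout in flight", "suspected LRU fault",
                      "replace LRU", False),
    MaintenanceRecord("radio dropout in flight", "connector pin fretting",
                      "re-terminate connector", True),
    MaintenanceRecord("radio dropout in flight", "chafed cable",
                      "repair cable", True),
]

ranking = rank_fixes_by_symptom(records)
for (fault, fix), count in ranking["radio dropout in flight"]:
    print(f"{count} x {fault} -> {fix}")
```

In this made-up history the speculative LRU replacement never appears in the ranking, because it did not cure the symptom; the ranking instead points the technician towards the 3 Cs.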
These strategies provide the foundation for an availability-focused maintenance
strategy. Furthermore, the functional testing of electrical and electronic equipment by
ATE can be enhanced by proving the integrity of circuitry, connectors and the EWIS.
The combination of testing for Function and for Integrity leads to Increased
Availability, a reduced supply chain and vastly increased confidence in the equipment
being fitted to aircraft.
The aerospace maintenance organisation that could harness the combined
approach of this Intermittent Fault Detection methodology in concert with the
Symptom-Fault-Fix approach would be genuinely world class in its ability to fix
faults right first time, every time and, therefore, in its sustained ability to deliver
increased aircraft availability.
REFERENCES
1 ARINC Report 672, (2008), Guidelines for the Reduction of No Fault Found (NFF), Avionics Maintenance Conference, Aeronautical Radio Inc.
2 http://www.aviationweek.com/aw/generic/story_generic.jsp?channel=om&id=news/om207cvr.xml accessed at 1930 on 24 Aug 09.
3 Knotts R, (1996), Analysis of the Impact of Reliability, Maintainability and Supportability on the Business and Economics of Civil Air Transport Aircraft Maintenance and Support, M.Phil. thesis, University of Exeter, UK.
4 Reference 2.
5 Blischke W R & Murthy D N P, (2003), Case Studies in Reliability & Maintenance, John Wiley & Sons, New York.
6 Dunwoody S, Bock E, Sofia J, (1996), A Practical and Reliable Method for Detection of Nanosecond Intermittency, AMP Journal of Technology Vol 5.
7 Kelly G, Sajecki A, Sorenson B A, Sorenson P W, (2001), An Analyzer for Detecting Aging Faults in Electronic Devices, Updated from 1994.
8 CAP 715, (2002), An Introduction to Aircraft Maintenance Engineering Human Factors for JAR66, UK Civil Aviation Authority.
9 Kawano R, (1997), Steps Towards the Realization of Human-Centred Systems, IEEE 6th Conference on Human Factors and Power Plants, Conference Proceedings, Orlando, pp 13/27-13/32.
10 Seddon J, (2003), Freedom from Command and Control: a better way to make the work work, Vanguard Press.
11 Repenning N P, Sterman J D, (2001), No-One Ever Gets Credit For Fixing Problems Before They Happened: Creating And Sustaining Process Improvement, California Management Review Vol 43 No 4.
12 European Aviation Safety Agency, (2008), EASA Certification Specification for Larger Aeroplanes CS-25 Subpart H, Amendment 5.
13 Reference 8.
14 Reference 1.
15 Reference 12.
Jim Cockram is a former RAF engineering officer, with a 25-year Service career that focussed heavily
on the maintenance and logistics support of fast-jet fleets and guided weapons systems, from the vantage
point of roles in Forward, Depth and Integrated Project Team environments. During this time he
participated in many exercise and operational deployments, and was an early advocate of applying Lean
Thinking to Defence organisations. His experiences in programme management and in running large
aircraft maintenance organisations led him to develop maintenance and data-exploitation strategies which
he has subsequently employed successfully in business improvement projects in the private and public
sectors. Jim is the Technical Director of Copernicus Technology Ltd and is an enthusiastic member of
the Royal Aeronautical Society Highland Branch committee.
Giles Huby was also an RAF engineering officer, whose 16-year Service career encompassed land-based
and carrier-based fast-jet operations, plus Forward, Depth and Integrated Project Team roles in support of
fast-jet fleets and guided weapons. Like Jim, Giles also possesses considerable experience of Defence
programme management and running large aircraft maintenance organisations. He was heavily involved
in Lean process improvement activity in Defence; plus he amassed significant experience of Human
Factors incident investigation and the associated development of enduring countermeasures to prevent
further recurrences. Giles is the Managing Director of Copernicus Technology Ltd and the Chairman of
the Royal Aeronautical Society Highland Branch committee.