Self-Learning and Adaptive Functional Fault Diagnosis: A Look at What is Possible with “Data”
Krishnendu Chakrabarty
Department of Electrical & Computer Engineering, Duke University
Durham, NC
Acknowledgments
• Fangming Ye, Duke University
• Zhaobo Zhang, Huawei
• Xinli Gu, Huawei
• Sponsors: Cisco (until 2011), Huawei (since 2011)
ITC 2014 Tutorial
• Tutorial 9: Test, Diagnosis, and Root-Cause Identification of Failures for Boards and Systems
• Presenters: Krishnendu Chakrabarty, William Eklow, Zoe Conroy
• The gap between working silicon and a working board/system is becoming more significant as technology scales and complexity grows. The result of this increasing gap is failures at the system level that cannot be duplicated at the component level. These failures are most often referred to as “NTFs” (No Trouble Founds). The problem will only get worse as technology scales and will be compounded as new packaging techniques (SiP, SoC, 3D) extend and expand Moore’s law. This tutorial will provide detailed background on NTFs and will present DFT, test, and root-cause identification solutions at the board/system level.
Automated Diagnosis: Wishlist
• Automated and accurate diagnosis
• Reduced diagnosis and repair costs
• Identify the most-effective syndromes
• Accelerate product release
• Self-learning

[Diagram: a failed board runs the functional test; the automated diagnosis system reports ambiguity, develops new tests, and analyzes and updates the current test set; the output is a fixed board]
What Data Should We Collect?
• Pass/fail information, extent of mismatch with expected values, performance marginalities, etc.
• Counter values, BIST signatures, sensor data
  – Example: a segment of a traffic-test log
## Summary: Interfaces< r2d2 -- metro > counts - Fail(mismatch)
……464. (00000247) ERR EG R2D2_ARIC_CP_DBUS_CRC_ERR ……
Error: (0000010A) DIAGERR_ERRISO_INVALID_PKT_CNT: Packet count invalid
What Can We Do With This Data?
• Train machine-learning models for root-cause localization
• Identify redundant syndromes (i.e., test-outcome data) and prune the data
• Identify deficiencies of tests in terms of diagnostic ambiguity and provide guidance for test redesign
• “Fill in the gaps” when the data is insufficient for precise diagnosis
Key Challenges and Solutions
• How to improve diagnosis accuracy?
  – Support-vector machines and incremental learning [Ye et al., TCAD’14]
  – Multiple classifiers and weighted-majority voting [Ye et al., TCAD’13]
• How to speed up diagnosis?
  – Decision trees [Ye et al., ATS’12]
• How to do diagnosis using incomplete information?
  – Imputation methods [Ye et al., ATS’13]
• What can we learn from past diagnosis?
  – Root-cause and syndrome analysis [Ye et al., ETS’13]
  – Knowledge discovery and knowledge transfer [Ye et al., ITC’14]
Syndrome and Root-Cause
• A segment of traffic-test log
• Syndromes are test outcomes parsed from the log
• Root causes are replaced components, e.g., C37
## Summary: Interfaces< r2d2 -- metro > counts - Fail(mismatch)
……464. (00000247) ERR EG R2D2_ARIC_CP_DBUS_CRC_ERR ……
Error: (0000010A) DIAGERR_ERRISO_INVALID_PKT_CNT: Packet count invalid
          s1   s2   s3   Root cause
Case 1     1    1    1       A
Case 2     1    1    1       A
Case 3     1    1    1       B
Case 4     0    1    1       B
Success Ratio
• Success ratio = 3/4 = 75%

          s1   s2   s3   Actual root cause   Predicted root cause
Case 1     1    1    1          A                     A
Case 2     1    1    1          A                     A
Case 3     1    1    1          B                     A
Case 4     0    1    1          B                     B
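A minimal Python sketch of this calculation, using the four (actual, predicted) pairs from the table:

```python
# Success ratio = fraction of boards whose predicted root cause
# matches the actual (replaced-component) root cause.
cases = [("A", "A"), ("A", "A"), ("B", "A"), ("B", "B")]

correct = sum(1 for actual, predicted in cases if actual == predicted)
success_ratio = correct / len(cases)
print(f"Success ratio = {correct}/{len(cases)} = {success_ratio:.0%}")  # 3/4 = 75%
```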
Using Machine Learning (Support-Vector Machines)
• A binary classifier based on statistical learning
• Train the model with data
• Use the trained model for diagnosis

[Figure: (a) Line1, a feasible separation; (b) Line2, the optimal separation, which maximizes the margin]
Example
• Suppose we have 6 cases (successfully debugged boards) for training
• Let x1, x2, x3 be three syndromes. If a syndrome manifests itself, we record it as 1, and 0 otherwise.
• Suppose that the board has two candidate root causes A and B, encoded as y = 1 and y = −1
• Merge the syndromes and the known root causes into a matrix A = [B|C], where the left side (B) contains the syndromes and the right side (C) the corresponding fault classes
Example (Contd.)
• SVM calculation: w1 = 1.99, w2 = 0, w3 = 0, and b = −1.00
• Therefore, the classifier is f(x) = w·x + b = 1.99·x1 − 1.00

[Figure: the separating hyperplane in (x1, x2, x3) space, with the y = 1 and y = −1 regions; the margin is the distance between the classifier and the support vectors]
• Given a new failing system with syndrome [1 0 0], then f(x) > 0. The root cause for this failing system is A.
• Given another new failing system with syndrome [0 1 0], f(x) < 0. The root cause for this failing system is B.
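A minimal sketch of this decision function in Python, using the weights from the example:

```python
# Linear SVM decision function from the slide: f(x) = w . x + b,
# with w = (1.99, 0, 0) and b = -1.00.
w = (1.99, 0.0, 0.0)
b = -1.00

def f(x):
    """Positive -> root cause A (y = 1), negative -> root cause B (y = -1)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

print(f((1, 0, 0)))  # 0.99 > 0 -> root cause A
print(f((0, 1, 0)))  # -1.00 < 0 -> root cause B
```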
Results for Different Kernels Circa 2011 (Manufactured Boards)
• 811 boards for training, 212 for test
• Diagnostic accuracy under different kernel functions
[Bar chart: diagnostic accuracy under polynomial (degree = 1), Gaussian, and exponential kernels; accuracies range from 57.08% to 94.20%]
Incremental SVMs
[Diagram: support vectors extracted from the initial training set are combined with new training cases to form the combined training set used for incremental learning]
Flow Chart
• Preparation stage: extract fault syndromes and repair actions from failing boards as an additional training set S
• Learning stage: combine S with the existing support vectors S* of the existing SVM model, solve the optimization problem for the combined training set (S ∪ S*), and update the SVM model; repeat while more new training data arrives
• Diagnosis stage (for new systems): determine the root cause based on the output of the final SVM model
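The support-vector extraction step above can be sketched as follows. This is a minimal illustration assuming scikit-learn (the slides do not name a library); the toy syndrome vectors and labels are invented for the demo:

```python
# Incremental-SVM sketch: carry forward only the support vectors S* of the
# existing model, combine them with new training cases S, and retrain.
import numpy as np
from sklearn.svm import SVC

def incremental_update(model, X_old, y_old, X_new, y_new):
    sv_idx = model.support_                      # indices of support vectors S*
    X_comb = np.vstack([X_old[sv_idx], X_new])   # combined set S U S*
    y_comb = np.concatenate([y_old[sv_idx], y_new])
    return SVC(kernel="linear").fit(X_comb, y_comb)

# Toy data: syndrome vectors with labels 1 (root cause A) and -1 (root cause B).
X0 = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 1, 1]])
y0 = np.array([1, 1, -1, -1])
model = SVC(kernel="linear").fit(X0, y0)

# New training (incoming) data triggers an incremental update.
X1 = np.array([[1, 0, 1], [0, 0, 1]])
y1 = np.array([1, -1])
model = incremental_update(model, X0, y0, X1, y1)
print(model.predict([[1, 0, 0]]))  # expected label 1 (root cause A)
```

Only the support vectors, not the full old training set, enter the new optimization problem, which is what keeps the training time low as data accumulates.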
Comparison of Training Time
[Plot: SVM model training time (seconds) vs. number of training cases in SVMs (400 to 1400), comparing non-incremental and incremental learning]
Comparison of Success Rates
[Plot: SVM model success rate (percentage) vs. number of training cases in SVMs (400 to 1400), comparing non-incremental and incremental learning]
Typical Diagnosis Systems
• Number of syndromes (~1,000 per board)
• Diagnosis time (up to several hours per board)
• Often requires manual diagnosis

[Diagram: how to select the useful syndromes from the complete set of syndromes]
Comparison of Diagnosis Procedures
Traditional diagnosis: start diagnosis → observe ALL syndromes → predict root cause.
Decision tree-based diagnosis: start diagnosis → observe ONE syndrome; if confident about the root cause, predict it, otherwise observe another syndrome.
Decision Trees
• Internal nodes
  – Can branch to two child nodes
  – Represent syndromes
• Terminal nodes
  – Do not branch
  – Contain class information

[Figure: a decision tree with internal nodes S1-S4 (yes/no branches) and leaves A1-A4]
Decision Trees
• We may reach root cause A1 via two different test sequences.
  1) Start from the most discriminative syndrome, S1
  2) If S1 manifests itself, we then consider syndrome S2
  3) If S2 manifests itself, we can determine A1 to be the root cause

[Figure: the tree with the path S1 → yes → S2 → yes → A1 highlighted]
Decision Trees
• We may reach root cause A1 via two different test sequences.
  1) Start from the most discriminative syndrome, S1
  2) If S1 passes, we consider syndrome S3
  3) If S3 manifests itself, we consider syndrome S4
  4) If S4 manifests itself, we can determine A1 to be the root cause

[Figure: the tree with the path S1 → no → S3 → yes → S4 → yes → A1 highlighted]
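The two traversal sequences above can be sketched as a dictionary-based tree walk. The paths to A1 follow the slides; the A2-A4 leaf positions are assumed for illustration, since the slides only trace the A1 paths:

```python
# Adaptive diagnosis: observe one syndrome at a time, branching on yes/no
# until a terminal node (root cause) is reached.
tree = {
    "S1": {"yes": "S2", "no": "S3"},
    "S2": {"yes": "A1", "no": "A2"},   # A2 placement assumed
    "S3": {"yes": "S4", "no": "A3"},   # A3 placement assumed
    "S4": {"yes": "A1", "no": "A4"},   # A4 placement assumed
}

def diagnose(syndromes, root="S1"):
    node = root
    while node in tree:  # internal nodes branch; leaves do not
        node = tree[node]["yes" if syndromes[node] else "no"]
    return node

print(diagnose({"S1": 1, "S2": 1}))           # sequence 1 -> A1
print(diagnose({"S1": 0, "S3": 1, "S4": 1}))  # sequence 2 -> A1
```

Note that only the syndromes on the chosen path are ever observed, which is the source of the speedup over observing all syndromes.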
Training Of Decision Trees (Syndrome Identification)
• Goals:
  – Rank syndromes
  – Minimize ambiguity
  – Reduce tree depth
• Three popular criteria can be used for training decision trees:
  – Information gain
  – Gini index
  – Twoing

[Figure: an example split separating Class 1 from Class 2]
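As a sketch of one of the three criteria, the Gini index of a candidate syndrome split can be computed as the size-weighted impurity of its two branches. The labels here are illustrative, not data from the deck:

```python
# Gini impurity of a label set, and the Gini index of a yes/no split.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_of_split(yes_labels, no_labels):
    n = len(yes_labels) + len(no_labels)
    return (len(yes_labels) / n) * gini(yes_labels) \
         + (len(no_labels) / n) * gini(no_labels)

# A pure split (the syndrome cleanly separates A from B) scores 0 (best).
print(gini_of_split(["A", "A"], ["B", "B"]))  # 0.0
# A mixed split scores higher (worse), so this syndrome ranks lower.
print(gini_of_split(["A", "B"], ["A", "B"]))  # 0.5
```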
Diagnosis Using Decision Trees
• Training data preparation: extract all the fault syndromes and repair actions from historical data
• DT architecture design: design inputs, outputs, splitting criterion, and pruning
• DT training: generate a tree-based predictive model and assess its performance
• DT-based diagnosis: traverse from the root node of the DT and obtain the root cause at a leaf node
Diagnosis Using Decision Trees
• Start diagnosis: observe the syndrome at the root of the DT
• Adaptive diagnosis: select and observe the next syndrome based on the observation of the current syndrome; repeat until a leaf node is reached
• Predict root cause: generate the root cause for the failing board
Experiments
• Experiments performed on industrial boards currently in production
  – Tens of ASICs, hundreds of passive components
• All the boards under analysis failed the traffic test
  – A comprehensive functional test set for fault isolation, run through all components

                                  Board 1   Board 2   Board 3
Number of test items                 420       207       420
Number of root-cause components       10        14        10
Number of failed boards              130        40      1000
Comparison Of Different Decision-Tree Architectures
[Bar chart: total number of syndromes used for diagnosis on Boards 1-3 for DT (Gini index), DT (information gain), DT (twoing), ANNs, and SVMs; DTs use roughly 5-6x fewer syndromes]
Comparison Of Different Decision-Tree Architectures
[Bar chart: average number of syndromes used for diagnosis on Boards 1-3 for the same five methods; DTs use roughly 15-18x fewer syndromes on average]
Comparison Between DTs And SVMs
• Success rates (SR) obtained for Board 3
[Bar chart: success rates SR1, SR2, SR3 for DTs, SVM_1, SVM_2, and SVMs_3 (min, max, avg)]
• The success rates obtained by DTs are similar to those obtained by SVMs
Comparison Between DTs And ANNs
• Success rates (SR) obtained for Board 3
[Bar chart: success rates SR1, SR2, SR3 for DTs, ANNs_1, ANNs_2, and ANNs_3 (min, max, avg)]
• The success rates obtained by DTs are similar to those obtained by ANNs
Information-Theoretic Syndrome and Root-Cause Analysis for Guiding Diagnosis
• Analysis of diagnosis performance
• Feedback for guiding test improvement
Problem Statement
• Lack of diagnosis-performance evaluation for individual root causes or individual syndromes
• Redundant syndromes

[Diagram: the complete set of syndromes, drawn from the syndrome sets of teams 1, 2, and 3, is partitioned into a reduced syndrome set and a redundant syndrome set]
Problem Statement
• Lack of diagnosis-performance evaluation for individual root causes or individual syndromes
• Redundant syndromes
• Ambiguous root-cause pairs

[Diagram: the complete set of root causes; which root cause is hard to diagnose?]
Analysis Framework
[Diagram: syndromes from failing (and synthetic) boards feed the automated diagnosis system, which predicts root causes. Syndrome analysis via minimum-redundancy maximum-relevance yields a reduced set of syndromes and a set of redundant syndromes; root-cause analysis via class-relevance analysis (precision, recall) separates root causes with low ambiguity from those with high ambiguity. Both results feed back to the test-design team, which adds/drops tests and updates the test set.]
Syndrome Analysis
• Problem
  – Which syndromes are useful for diagnosis, and which are not?
• Method
  – Select useful syndromes
  – Minimum-redundancy maximum-relevance (mRMR) method
mRMR Method (demo)

          s1   s2   s3   s4   s5   RC
Case 1     0    1    1    0    1    A
Case 2     0    1    1    0    1    A
Case 3     0    1    1    0    1    A
Case 4     1    1    0    1    1    B
Case 5     1    0    0    1    1    B
Case 6     1    0    1    1    0    C
Case 7     1    1    0    1    0    C
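The relevance half of mRMR can be sketched as the mutual information between each syndrome and the root cause, computed from the demo table above (the redundancy term between syndromes is omitted here for brevity):

```python
# Score each syndrome by its mutual information with the root cause (RC).
# Syndromes with higher I(s; RC) are more relevant for diagnosis.
from collections import Counter
from math import log2

cases = [
    ((0, 1, 1, 0, 1), "A"), ((0, 1, 1, 0, 1), "A"), ((0, 1, 1, 0, 1), "A"),
    ((1, 1, 0, 1, 1), "B"), ((1, 0, 0, 1, 1), "B"),
    ((1, 0, 1, 1, 0), "C"), ((1, 1, 0, 1, 0), "C"),
]

def mutual_information(pairs):
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

for i in range(5):
    mi = mutual_information([(s[i], rc) for s, rc in cases])
    print(f"s{i+1}: I = {mi:.3f} bits")
```

In this table, s1 separates A from {B, C} perfectly and so carries the most information; mRMR would then penalize later picks that are redundant with s1.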
Experimental Setup
• Dataset
  – Industrial boards in high-volume production
  – Each board has tens of ASICs and hundreds of passive components
  – A comprehensive functional test run through all components

Number of syndromes         546
Number of root causes       153
Number of failing boards    1613
Experimental Setup
• Selected diagnosis systems
  – Support-vector machines
  – Artificial neural networks
  – Decision trees
  – Weighted-majority voting
Results (Syndrome Analysis)
[Plots: success ratio vs. number of syndromes (0 to 500) for support-vector machines, artificial neural networks, decision trees, and weighted-majority voting, comparing mRMR, MaxRel, and random selection]
Root-Cause Analysis
• Problem
  – Which root cause is hard to isolate from another root cause?
  – Which root cause is hard to isolate from all other root causes?
• Method
  – Screen root causes with high ambiguity
  – Statistical metrics (e.g., precision, recall)
Metrics for Root-Cause Analysis

                                          Predicted: root cause A   Predicted: other root causes (B, C, …)
Actual: root cause A                      TP (True Positive)        FN (False Negative)
Actual: other root causes (B, C, …)       FP (False Positive)       TN (True Negative)
Root-Cause Analysis (demo)

          s1   s2   s3   s4   s5   Actual root cause   Predicted root cause
Case 1     1    1    0    0    1          A                  A (TP)
Case 2     1    1    0    0    1          A                  A (TP)
Case 3     1    1    0    0    1          A                  B (FN)
Case 4     0    1    1    1    1          B                  B (TN)
Case 5     0    0    1    1    1          B                  B (TN)
Case 6     1    0    1    1    0          C                  C (TN)
Case 7     0    1    1    1    0          C                  C (TN)

(The TP/FN/TN annotations are with respect to root cause A.)
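A minimal sketch computing precision and recall for root cause A from the demo above:

```python
# Per-root-cause metrics for root cause A:
# precision = TP / (TP + FP), recall = TP / (TP + FN).
actual    = ["A", "A", "A", "B", "B", "C", "C"]
predicted = ["A", "A", "B", "B", "B", "C", "C"]

tp = sum(a == "A" and p == "A" for a, p in zip(actual, predicted))
fp = sum(a != "A" and p == "A" for a, p in zip(actual, predicted))
fn = sum(a == "A" and p != "A" for a, p in zip(actual, predicted))

precision = tp / (tp + fp)   # 2/2 = 1.0: no other root cause is mislabeled as A
recall    = tp / (tp + fn)   # 2/3: one A board escapes as B (Case 3)
print(precision, recall)
```

A root cause with low precision or low recall is flagged as ambiguous, which is what drives the test-redesign feedback later in the deck.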
Results I (Root-cause analysis)
[Histograms: number of root causes in each precision and recall bin (0-0.1 up to 0.9-1.0, and exactly 1.0) for support-vector machines, artificial neural networks, decision trees, and weighted-majority voting]
Results II (Root-cause analysis)
[Scatter plots: precision vs. recall per root cause for support-vector machines, artificial neural networks, decision trees, and weighted-majority voting]
Results II (Root-cause analysis)
[Annotated precision vs. recall scatter for support-vector machines. Annotations: “No change to the tests”; “Add tests to avoid high false negatives”; “Add tests to avoid high false positives”; “These tests are useless and mislead (incorrect root causes)”]
Results III (Root-cause analysis)
[Pie chart, support-vector machines: 97.32% of root-cause pairs can be differentiated; the remaining 2.68% are ambiguous pairs, occurring in groups of 2, 3, 4, and 8 (0.63%, 1.29%, 0.39%, and 0.37% of all pairs)]
Summary
• Obtained a reduced syndrome set with high discriminative ability
• Analyzed ambiguity among root causes
• Provided guidelines for test-design teams to redesign tests
Knowledge Discovery and Knowledge Transfer in Board-Level Diagnosis
• Acceleration of manufacturing ramp-up
• Quick improvement of diagnosis accuracy
Problem Statement
• Low diagnosis accuracy at the product ramp-up stage

[Plot: diagnosis accuracy vs. timeline through the preparation, learning, and mature phases; established Products 1 and 2 reach high accuracy, while a new product starts low]
Knowledge Source?
• Knowledge discovery: discover relationships between root causes and syndromes from the text of test logs through keywords
• Knowledge transfer: transfer diagnosis experience from other products through component mapping and test-program mapping

[Diagram: cases from an old product are mapped to cases for a new product via component mapping and test-program mapping, feeding a new diagnosis engine intended to be reliable, scalable, collaborative, and doable]
From Academia to Industry: Toolkit
• Tool categories:
  – Generic programming languages
  – Shell programming languages
  – Statistics programming languages
  – Machine learning-related tools
  – Integrated design environments
  – Electronic design-aid tools
  – Miscellaneous