Post on 03-Jul-2020
DesignCon 2014
How Design of Experiments
Saved my CEI VSR 28G Design
Richard Allred, SiSoft
[rallred@sisoft.com]
Barry Katz, SiSoft
Ishwar Hosagrahar, Inphi Inc.
Chao Xu, Inphi Inc.
Wiley Gillmor, SiSoft
Abstract
To determine the overall design margin, a CEI 28G VSR/100 Gigabit Ethernet design
required the analysis of 5 million combinations of channel variation, transceiver process
and equalization settings. Brute force simulation would have required 278 days, clearly
outside the available schedule.
Instead, we used Design of Experiments (DOE) and Response Surface Modeling (RSM)
to reduce the number of simulations by a factor of 19,000 and yet produce just as
meaningful results.
This paper will demonstrate the application of DOE and RSM to a CEI 28G VSR design.
We will show the process of creating a DOE, fitting the data to models, determining the
goodness and reliability of the fit and then using the model to perform “what if” analysis,
optimize design factors and quantify the impact of manufacturing variation.
Author Biographies
Richard Allred is a Senior Member of Technical Staff at SiSoft. Previously, Richard
worked at Inphi where he was responsible for Inphi’s 100G Ethernet PHY (28G per lane)
front plane interface. In the course of that work, Richard contributed to IEEE 802.3 and
OIF-28G-VSR standards discussions on next generation Ethernet. Before that he worked
at Intel, contributing to signal integrity methodology and tool development for
GDDR5/DDR3. He used Design of Experiments and Response Surface Modeling in the
course of this work to predict link performance across high volume manufacturing.
Richard received his MSEE from University of Utah, and has 4 publications.
Barry Katz, President and CTO for SiSoft, founded SiSoft in 1995. As CTO, Barry is
responsible for leading the definition and development of SiSoft’s products. He has
devoted much of his efforts at SiSoft to delivering a comprehensive design methodology,
software tools, and expert consulting to solve the problems faced by designers of leading
edge high-speed systems. He was the founding chairman of the IBIS Quality committee.
Barry received an MSEE degree from Carnegie Mellon and a BSEE degree from the
University of Florida.
Ishwar Hosagrahar is a Senior Staff Engineer at Inphi, with over 17 years of industry
experience working with Networking & Communication Circuits and Systems. Prior to
Inphi, he worked at Texas Instruments on products ranging from 10/100Mbps Ethernet
PHYs to 15+Gbps SerDes transceivers. Before that, he worked at ArcusTech (later
acquired by Cypress Semiconductor) designing Ethernet and Telecom switch ICs. He
holds a Master's degree in VLSI/Circuit design from the University of Texas. In his spare
time, he enjoys flying planes (and anything aviation-related), playing various musical
instruments (albeit somewhat poorly) and dabbling in hi-fi systems.
Chao Xu is a Sr. Director of Platform Engineering at Inphi. He specializes in
integrated circuit design and architecture related to Computing Architecture, Memory
Architecture, High Speed Digital Communication and Digital Signal Processing in Server
and Data Communication areas. He has extensive experience in system signal integrity in
high speed channel analysis and implementation, and in high speed mixed signal integrated
circuit designs such as SERDES, PLLs, RF transceivers, optical transceivers, etc. Chao
received his Ph.D. degree in Electrical Engineering from University of Pennsylvania. He
has more than 10 issued US patents.
Wiley Gillmor is a Principal Engineer engaged in software development at Signal
Integrity Software, Inc. His career in EDA has spanned over 35 years, focusing on tools
for physical design and engineering. Prior to that he had a brief academic career teaching
Mathematical Logic and Computer Science.
Introduction
With a unit interval of only 35.7 ps, the CEI VSR 28G/100 Gigabit Ethernet link presents
a challenging system-level design. In this paper we describe the system design process
which resulted in the industry’s first 100 Gigabit Ethernet CMOS PHY with Inphi’s
GearBox and CDR chips. The jitter budget is tight, and with all of the possible variation in
the system there are more than 5 million system conditions at which performance must be
checked and verified. Additionally, to ensure that the system is neither over-designed nor
under-performing, the impact of manufacturing variation on performance and manufacturing
yield needs to be estimated. Achieving and optimizing all of these objectives simultaneously
is indeed a serious challenge.
Next-generation core routers and data centers require more bandwidth, faster speeds
and lower power. This has driven the industry to implement line rates as fast as 28
Gbps. An example topology of such interfaces is given below in Figure 1; it shows the
chip-to-module application, where a pluggable CFP2 optical transceiver module is
connected to a line card host IC. Don’t be fooled by the seemingly clean channel: there
are plenty of impairments here, and with a bit error ratio (BER) allowance of only
one error in a quadrillion bits, this is a world-class challenge.
Figure 1: 28G VSR host to module channel diagram.
The CMOS design allowed Inphi to reduce power to one third, and area to one half, of
competing SiGe and FPGA solutions.
One of the many challenges in the design process was to ensure that the transmitter and
receiver were able to operate in a wide variety of channels, including some that exceeded
the 28G VSR 10 dB host-to-module channel. Here Design of Experiments (DOE) and
Response Surface Modeling (RSM) were used to determine the maximum allowable trace
lengths, the best layer for PCB routing, the optimum via anti-pad size and the
performance degradation due to the presence of manufacturing variance from IC process,
voltage and temperature (PVT); package impedance; printed circuit board (PCB)
impedance; and via stub length variations. Each of these design objectives will be
addressed and we will illustrate how the DOE/RSM approach provides an informed path
to answering them.
The trouble was that, although each simulation of a system configuration took only 4.8
seconds, checking 5 million conditions would require 278 days of compute time! A large
compute farm would help, but it could easily be overwhelmed by the computation and
result storage if additional factors were added to the exploration space or if the
simulations took longer to complete. The link simulation approach used was a fast
analysis that estimates performance at a BER of 1e-12 for each link configuration. If a
bit-by-bit time-domain analysis were used instead, each simulation could easily approach
five minutes; at that computational load, the 5 million cases would require 47.5 years of
compute time. Brute-force analysis was not going to be an effective tool in this situation.
We needed to find another way.
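The compute-time figures above come from simple arithmetic. The sketch below reproduces them from the per-simulation times quoted in the text. It also compares them against the 256 DOE runs used later in the study. The result is a reduction factor of about 19,500, which the abstract rounds to roughly 19,000.

```python
# Back-of-the-envelope compute budget for brute-force simulation,
# using the per-case simulation times quoted in the text.
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.25

n_conditions = 5_000_000   # combinations of channel, process and EQ settings
fast_sim_s = 4.8           # fast link analysis, seconds per case
bit_by_bit_s = 5 * 60      # bit-by-bit time-domain analysis, ~5 minutes per case
doe_runs = 256             # DOE runs actually simulated in the study

fast_days = n_conditions * fast_sim_s / SECONDS_PER_DAY
slow_years = n_conditions * bit_by_bit_s / SECONDS_PER_DAY / DAYS_PER_YEAR
reduction = n_conditions / doe_runs
doe_minutes = doe_runs * fast_sim_s / 60

print(f"fast analysis, brute force: {fast_days:.0f} days")
print(f"bit-by-bit, brute force:    {slow_years:.1f} years")
print(f"DOE reduction factor:       {reduction:,.0f}x")
print(f"DOE compute time:           {doe_minutes:.1f} minutes")
```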
Ideally, it would be wonderful if there were a magic equation which, given the input
factors such as trace length, impedance and process corner, could reveal exactly what the
resulting system performance would be. This magic equation would enable a multitude of
analyses like optimization, virtual “what if” analysis and the ability to position the design
to minimize the impact of manufacturing variance.
Well of course that isn’t exactly possible, but by using DOE with RSM we can approach
this ideal. DOE is used to sample the factor space and RSM is utilized to create an
equation (or model) which best fits the data. After verifying the correctness of the model
we can utilize it to do all the things mentioned above, to optimize, perform “what if”
analysis and to minimize the impact of manufacturing variation. The DOE/RSM flow is
shown in Figure 2 below, which was drawn to emphasize the iterative nature of the
analysis. Additionally, each phase of the analysis has certain assumptions which will
need to be revisited and revised before a satisfactory result can be obtained.
Figure 2: Design of Experiments (DOE) and Response Surface Modeling (RSM) methodology flow.
Analysis Approach
The most important step of any type of analysis is to determine the objectives of the
work. Albert Einstein allegedly stated that “If I had only one hour to save the world, I
would spend fifty-five minutes thinking about the problem and only five minutes thinking
about solutions”. Similarly, it is absolutely essential that the questions to be addressed
and the level of accuracy of the answers be clearly defined before starting the actual work
utilizing the DOE/RSM methodology. Clear analysis goals will put you in the best
position to achieve your aim.
The DOE and RSM flow diagram emphasizes the iterative nature of the analysis as it
allows for the refining of the assumptions at each step of the process. It can also be seen
that without clear analysis goals and exit criteria, such an approach can result in loss of
weekends and never ending reviews.
A statistical mindset toward signal integrity, one in which decisions are made in the
presence of uncertainty, is uncomfortable for some. Rather than relying on exhaustively
certain analysis, the DOE/RSM analysis provides answers bounded by confidence
intervals which indicate the accuracy of the results. The only difference between an
exhaustively certain mindset and a statistical one is that in the former, any uncertainty is
pretended away, and in the latter, the uncertainty is quantified, scrutinized and
communicated. There will always be uncertainty, whether we wish it or not, and the best
approach is to understand it, reduce it and embrace it. Get comfortable with uncertainty.
Please note that the JMP® statistical discovery software was utilized for the DOE creation
and model fit analysis. Many of the figures in this document are from or derived from
JMP® reports. The system link simulations were performed with SiSoft’s Quantum
Channel Designer ® (QCD).
Assumptions
In addition to a mindset change, the DOE/RSM techniques require an in-depth
knowledge of the related statistical concepts applied, which can be a challenge. The
approach we will use in this paper is to first discuss some of the key concepts and the
assumptions which go into them and then provide solid application examples of the
methodology to the VSR 28G interface design.
In a general sense, to model an object is to utilize a simplified description of some aspect
of interest which allows for an exploration of the object’s characteristics. A physical
example would be the use of a model airplane in a wind tunnel to study its aerodynamic
properties to aid in the design of a full size airplane. For us, the object to be modeled is a
bit more ambiguous but typically is a response of the system performance such as eye
height or width across the study factor ranges.
The modeling objective is to characterize the true response by taking samples of the
factor space (as provided by the DOE table) and then fitting a polynomial equation to the
data. In this sense, the model is actually the equation which best fits the data. A visual
example of such a model is shown below in Figure 3 where the eye height of a link is
given versus trace length and impedance.
Figure 3: Example eye height response surface versus trace length and impedance.
Note that the model is merely an approximation of the actual response and only
represents reality as far as it is accurate. Also, as can be seen from the curvature of the
surface, this model is the two-dimensional quadratic surface which best fits the data. The
model form and the coefficient estimates obtained from an ordinary least squares fit are
given below in Equation (1):
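$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2 \qquad (1)$$
$$\beta_0 = -0.247862,\quad \beta_1 = 0.24367,\quad \beta_2 = 0.0210492,\quad \beta_3 = 0.00016552,\quad \beta_4 = -0.0001275,\quad \beta_5 = -0.0059938$$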
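A fit of the form in Equation (1) reduces to building a six-column model matrix and solving a least-squares problem. The sketch below is a minimal version of that step. It assumes both factors have already been coded to [-1, 1], a common RSM convention that keeps the columns well conditioned. The data are hypothetical and noise-free, so the fit should recover the generating coefficients exactly. These illustrative coefficients are not the Equation (1) values.

```python
import numpy as np

def quad2_design(x1, x2):
    """Model matrix for y = b0 + b1*x1 + b2*x2 + b3*x1*x2 + b4*x1^2 + b5*x2^2."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])

# Hypothetical 3 x 3 sample of coded trace length (x1) and impedance (x2).
g1, g2 = np.meshgrid([-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0])
x1, x2 = g1.ravel(), g2.ravel()

# Hypothetical "true" coefficients, used only to generate illustrative data.
true_beta = np.array([0.55, -0.12, 0.01, -0.01, -0.02, -0.03])
y = quad2_design(x1, x2) @ true_beta

# Ordinary least squares: choose beta to minimize ||X beta - y||^2.
X = quad2_design(x1, x2)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 4))
```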
Although this surface represents the best fit of the model to the data, it does not address
the question: is this the best model to represent the ‘true response’? In this instance,
adding higher-order polynomial terms for the length factor would likely better capture the
resonant behavior of the trace. While finding the absolute best model is likely to be an
elusive goal, we can certainly approach it with incremental improvements to the model
form. A ‘too simple’ model will smooth out important response characteristics, and a
‘too complex’ model can overemphasize certain response features at the expense of
other, more important response characteristics.
One might think that to find the best model form one would simply apply a very large
range of polynomial terms to the model fit to see which terms are significant. The
difficulty with this approach is that the model form available for fitting is limited by the
sampling of the response. If there are only two distinct sample levels along a factor
dimension, then at most a straight line can be fit to the data. With three levels, at most a
quadratic can be fit, and with four levels, at most a 3rd order polynomial. Thus at some
point, given a fixed sampling set, there is a limit to the model forms available for the fit.
These limitations lead to the idea that the sampling of the factor space must also take into
account the model form so that the model fit can accurately estimate the true response.
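The limit described above is a rank statement. With k distinct levels along a factor, the polynomial columns 1, x, ..., x^d are linearly independent only when d + 1 <= k. The sketch below checks this numerically with a Vandermonde matrix. Repeated samples at an existing level do not raise the achievable order, because only distinct levels count.

```python
import numpy as np

def max_fittable_degree(levels, max_degree=6):
    """Highest polynomial degree d whose columns 1, x, ..., x^d remain linearly
    independent when x is sampled only at the given (distinct) levels."""
    x = np.unique(np.asarray(levels, dtype=float))
    best = 0
    for d in range(max_degree + 1):
        V = np.vander(x, d + 1, increasing=True)   # columns x^0 .. x^d
        if np.linalg.matrix_rank(V) == d + 1:
            best = d
    return best

for levels in ([-1, 1], [-1, 0, 1], [-1, -1/3, 1/3, 1]):
    print(len(levels), "levels -> max degree", max_fittable_degree(levels))
```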
Conceptually, with continuous factors one could iterate until the ideal sampling and
model form are obtained to achieve a high-quality model fit. In practice, though, segments
of the interconnect are often represented by blocks which may only have a discrete
number of levels, such as would be the case from a family of connectors which are
characterized by S-Parameter models. In these situations, the resolutions of the sampling
are fixed and higher order system response characteristics are aliased with lower order
system response characteristics, thus limiting the analysis.
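The example below shows aliasing with a three-level discrete factor. At the coded levels -1, 0 and +1, the cubic column x^3 is identical to the linear column x. A hypothetical cubic effect in the true response therefore cannot be separated from the linear effect. A quadratic fit silently absorbs it into the linear coefficient.

```python
import numpy as np

# Three coded levels, e.g. a connector family characterized at three discrete points.
x = np.array([-1.0, 0.0, 1.0])

# Hypothetical true response containing a cubic term.
y = 1.0 + 0.5 * x + 0.2 * x**3

# At these levels x**3 == x, so the cubic effect is aliased with the linear one:
# the quadratic fit reports a linear coefficient of 0.5 + 0.2 = 0.7.
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 6))
```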
The final aspect of the model assumption is the idea that a polynomial can adequately
describe the response surface. There are other, more mathematically sophisticated
models, such as Gaussian process models, which have some surprising characteristics
(such as zero residual at the sample points and spatially cognizant interpolation), but they
will not be discussed here. The authors have found that the large majority of the signal integrity applications
of the DOE/RSM methodology are adequately described by polynomial models.
The process of fitting the model to the data is typically performed by Ordinary Least
Squares (OLS) regression analysis. This estimation method assumes that the residual
errors are uncorrelated with one another, that the error variance does not vary across
factors or factor levels, and that the residual error is normally distributed. In addition,
strongly correlated input factors make the individual coefficient estimates unreliable,
which is one reason the DOE sampling aims to keep the factor effects nearly independent.
These assumptions can be relaxed if needed, but doing so requires more sophisticated
least-squares methods such as generalized linear models (which allow residual error
distributions other than the normal distribution) and generalized least squares (which
allows for correlated errors and non-uniform error variances). In the analysis here we will
make an extensive investigation of the residuals to ensure that the model is adequate and
that it meets the OLS assumptions.
Application of methodology to 28G VSR System Analysis
The whole objective of the study is to answer design questions and quantify
manufacturing variation for the 28G VSR link design. The design questions are: which
layer of the PCB is preferable for the high-speed routing, what is the best PCB via
anti-pad size, and what is the maximum PCB trace length that still passes with adequate
system performance? These questions should be answered in the presence of manufacturing
variation, which will be quantified as well.
The objectives will be achieved by carefully sampling the factor space with the Design of
Experiments approach, evaluating the system performance response at these DOE
conditions, and fitting a response surface model to the data to produce a multi-dimensional
equation which can then be used to study the total factor space. Even though this
description suggests a simple progression, in practice the analysis is quite iterative, as the
assumptions at each stage are refined to obtain a satisfactory model fit with sufficient
accuracy. The goal is for the uncertainty of any eye height and eye width prediction to be
less than +/- 30 mV and +/- 0.5 ps, respectively. Achieving the desired level of accuracy on
the first pass rarely happens (if it does, it should raise your suspicions); it usually requires
subtle modifications to the model assumptions, factor ranges and DOE creation.
Table 1 below lists the nine factors which define the space to be explored. The factors
which directly influence design decisions are called design factors and the factors which
in production are not controllable are called manufacturing factors.
Parameter Name                    Factor Type     Min       Typ       Max
Tx PVT Corner                     Manufacturing   SS        TT        FF
Tx PKG                            Manufacturing   90 Ohm    100 Ohm   110 Ohm
Line Card PCB Via Anti-Pad Size   Design          32 mil    36 mil    40 mil
Line Card PCB Via Stub Length     Manufacturing   2 mil     10 mil    18 mil
Line Card PCB Routing Layer       Design          3         --        9
Line Card Length                  Design          1 inch    3 inch    6 inch
Line Card TL Impedance            Manufacturing   90 Ohm    100 Ohm   110 Ohm
Rx PKG                            Manufacturing   90 Ohm    100 Ohm   110 Ohm
Rx PVT Corner                     Manufacturing   SS        TT        FF
Table 1: 28G VSR interface factor space definition.
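RSM tools usually work in coded units, with each factor scaled so that its minimum maps to -1 and its maximum to +1. The sketch below encodes Table 1 that way. Several choices here are assumptions and are not stated in the paper: the ordinal SS/TT/FF = -1/0/+1 coding for PVT corners, the coding scheme itself, and the factor labels TX_PKG_Z, ANTIPAD, STUB_LEN and RX_PKG_Z. PCB_LEN, PCB_Z, PCB_LAYER and RX_CORNER appear in the study. Note that the typical trace length of 3 inches is not at the center of its 1 to 6 inch range.

```python
# Factor space from Table 1, with levels in physical units.
# Labels other than PCB_LEN, PCB_Z, PCB_LAYER and RX_CORNER are hypothetical.
FACTORS = {
    # name:        (factor type,     levels)
    "TX_CORNER":  ("Manufacturing", ["SS", "TT", "FF"]),
    "TX_PKG_Z":   ("Manufacturing", [90, 100, 110]),   # Ohm
    "ANTIPAD":    ("Design",        [32, 36, 40]),     # mil
    "STUB_LEN":   ("Manufacturing", [2, 10, 18]),      # mil
    "PCB_LAYER":  ("Design",        [3, 9]),           # only two levels
    "PCB_LEN":    ("Design",        [1, 3, 6]),        # inch; typical is off-center
    "PCB_Z":      ("Manufacturing", [90, 100, 110]),   # Ohm
    "RX_PKG_Z":   ("Manufacturing", [90, 100, 110]),   # Ohm
    "RX_CORNER":  ("Manufacturing", ["SS", "TT", "FF"]),
}

CORNER_CODE = {"SS": -1.0, "TT": 0.0, "FF": 1.0}  # assumed ordinal coding

def to_coded(name, value):
    """Map a physical factor value onto [-1, 1] using the factor's min and max."""
    levels = FACTORS[name][1]
    if isinstance(levels[0], str):
        return CORNER_CODE[value]
    lo, hi = min(levels), max(levels)
    return (2.0 * value - (lo + hi)) / (hi - lo)

design_factors = [n for n, (kind, _) in FACTORS.items() if kind == "Design"]
print("design factors:", design_factors)
print("PCB_LEN = 3 inch ->", round(to_coded("PCB_LEN", 3), 3))
```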
Design the experiment
Traditionally there have been several approaches to sampling a large multi-dimensional
factor space. Some of these approaches include an exhaustive sampling, where every
single condition is evaluated; random sampling, where a number of conditions are chosen
by chance; and one factor at a time sampling, where from a nominal condition, each
factor is swept in isolation. The objective of any type of sampling is to obtain a
representation of the total factor space such that by using statistics of the sample,
inferences about the total factor space can be made. The exhaustive sampling approach is
thorough but, more than likely, unrealistic. The random sampling approach (also
called Monte Carlo sampling) will likely be unbiased but does not guarantee coverage of
the whole space and requires many sample points to ensure that all regions of the factor
space are considered. Lastly, one factor at a time (sometimes called OFAT) sampling
will miss out on many important factor interactions.
The design of experiments sampling approach attempts to sample the space to provide
good coverage of the whole factor space while minimizing the number of runs. This is
achieved by starting with a random sampling and then modifying each sample point until
the coverage of the DOE set is adequate. The coverage of the sampling is quantified by
the sample prediction variance which can be easily calculated with some straightforward
matrix manipulations. This quantification of the prediction variance allows the
uncertainty of the sampling to be quantified and thus optimized upon. This is why some
call the DOE approach optimal design.
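The sketch below is a simplified version of the "start random, then modify points" idea. It is a point-exchange search for a D-optimal design: each run is swapped for a candidate point whenever the swap increases det(X'X), the determinant of the information matrix. This is one of several standard construction methods, and we make no claim that it matches the algorithm inside any particular tool. The two-factor quadratic model and the 5 x 5 candidate grid are illustrative assumptions, far smaller than the nine-factor study.

```python
import itertools
import numpy as np

def model_matrix(pts):
    """Full quadratic model in two coded factors: 1, x1, x2, x1*x2, x1^2, x2^2."""
    x1, x2 = pts[:, 0], pts[:, 1]
    return np.column_stack([np.ones(len(pts)), x1, x2, x1 * x2, x1**2, x2**2])

def log_det_info(pts):
    """log det(X'X); -inf when the design cannot estimate every model term."""
    X = model_matrix(pts)
    sign, logdet = np.linalg.slogdet(X.T @ X)
    return logdet if sign > 0 else -np.inf

def point_exchange(candidates, n_runs, max_sweeps=20, seed=0):
    """Simplified point-exchange search for a D-optimal design."""
    rng = np.random.default_rng(seed)
    # Random starting design; redraw until it can estimate all model terms.
    while True:
        idx = rng.choice(len(candidates), size=n_runs, replace=False)
        best = log_det_info(candidates[idx])
        if np.isfinite(best):
            break
    for _ in range(max_sweeps):
        improved = False
        for i in range(n_runs):               # try replacing each run ...
            for c in range(len(candidates)):  # ... with each candidate point
                trial = idx.copy()
                trial[i] = c
                value = log_det_info(candidates[trial])
                if value > best + 1e-12:
                    idx, best, improved = trial, value, True
        if not improved:                      # no single swap helps: stop
            break
    return candidates[idx], best

# Candidate set: 5 x 5 grid of coded levels on [-1, 1].
grid = np.array(list(itertools.product(np.linspace(-1.0, 1.0, 5), repeat=2)))
design, logdet = point_exchange(grid, n_runs=9)
print(np.round(design, 2))
print("log det(X'X) =", round(float(logdet), 3))
```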
For our VSR example, the nine factors are sampled with 256 runs, and the design is chosen
with the D-optimal criterion. This criterion maximizes the determinant of X'X (X being the
model matrix), which minimizes the overall variance of the model coefficient estimates.
D-optimal designs sample the edges of the factor space more than the center and typically
give more precise model parameter estimates than criteria that target prediction variance.
Since one of the objectives is to find the worst case conditions of the factor space (these
typically occur at an edge of the space), accurate estimation of the edges and corners of
the factor space is important. The model assumed for the sampling was a 2nd order
polynomial with all two-factor interactions.
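The short count below shows how a full quadratic model with two-factor interactions leads to the 54 model terms cited for Figure 5. It assumes that the two-level routing layer is the factor without a quadratic term, since a two-level factor cannot support one. The labels are the hypothetical ones used for the Table 1 factors, and 54 terms fit comfortably within the 256 runs.

```python
from itertools import combinations

# Hypothetical labels for the nine factors of Table 1.
factors = ["TX_CORNER", "TX_PKG_Z", "ANTIPAD", "STUB_LEN", "PCB_LAYER",
           "PCB_LEN", "PCB_Z", "RX_PKG_Z", "RX_CORNER"]
two_level = {"PCB_LAYER"}  # layers 3 and 9 only: no quadratic term possible

terms = ["Intercept"]
terms += factors                                             # main effects
terms += [f"{f}^2" for f in factors if f not in two_level]   # quadratic terms
terms += [f"{a}*{b}" for a, b in combinations(factors, 2)]   # two-factor interactions
print(len(terms), "model terms for 256 runs")
```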
Once the simulations are run and the model fit processed, the accuracy of the model fit
predictions is quantified by a confidence interval. A smaller confidence interval will give
more assurance than a larger one. The confidence interval size depends on three
things: the desired confidence level (95%, 99%, 99.9%, etc.), the model fit error and the
coverage of the DOE sampling. At the DOE creation step, the coverage of the DOE
sampling, embodied by the prediction variance, gives insight into the relative confidence
interval size. While visualizing a nine-dimensional space is quite a feat, we can summarize
the prediction variance across the whole space with a fraction of design space plot, as
shown in Figure 4 below.
Figure 4: Fraction of design space plot which shows the DOE prediction variance over the fraction of the space.
This plot shows that the relative prediction variance for 50% of the factor space is less
than 0.19. While the relative nature of the metric does not lend itself to absolute
guidelines, it allows for the comparison of competing designs. Thus it is recommended
that a few designs be generated and compared to select the best one.
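The relative prediction variance at a point x is f(x)'(X'X)^-1 f(x). Here f(x) is the model-term vector at x and X is the design's model matrix, and the quantity is the prediction variance divided by the error variance. A fraction of design space curve sorts this quantity over many random points in the space. The sketch below assumes a two-factor quadratic model and a 3 x 3 factorial design for illustration. Duplicating every run halves the relative prediction variance everywhere, which is a quick way to see how run count enters the comparison of competing designs.

```python
import itertools
import numpy as np

def quad2(pts):
    """Two-factor quadratic model terms: 1, x1, x2, x1*x2, x1^2, x2^2."""
    x1, x2 = pts[:, 0], pts[:, 1]
    return np.column_stack([np.ones(len(pts)), x1, x2, x1 * x2, x1**2, x2**2])

def relative_prediction_variance(design, points):
    """f(x)' (X'X)^-1 f(x) for each row of `points`."""
    X = quad2(design)
    XtX_inv = np.linalg.inv(X.T @ X)
    F = quad2(points)
    return np.einsum("ij,jk,ik->i", F, XtX_inv, F)

def fraction_of_design_space(design, n_samples=20_000, seed=1):
    """Sorted relative prediction variance versus fraction of the coded space."""
    rng = np.random.default_rng(seed)
    pts = rng.uniform(-1.0, 1.0, size=(n_samples, 2))
    v = np.sort(relative_prediction_variance(design, pts))
    frac = np.arange(1, n_samples + 1) / n_samples
    return frac, v

factorial = np.array(list(itertools.product([-1.0, 0.0, 1.0], repeat=2)))
frac, v = fraction_of_design_space(factorial)
print("median relative prediction variance:", round(float(np.median(v)), 3))
print("maximum over the sampled space:     ", round(float(v.max()), 3))
```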
An ideal design, sometimes known as an orthogonal design, is one where the parameter
estimates can be calculated independently of one another. This is only achievable for
select designs, and in most situations (given the number of runs and the model form) an
orthogonal design is not possible. What is possible, though, is for the design to approach
orthogonal characteristics. When a design is not orthogonal, two or more parameter
effects are slightly correlated with each other and, to that degree, indistinguishable. The
degree of correlation for all 54 terms of our model is quantified in the color map on
correlation plot in Figure 5 below. The names of the model terms are given for each
column of the plot and are the same for each corresponding row. The color range in the
plot goes from blue (uncorrelated) to red (totally correlated). The red diagonal of the plot
shows that each model term is perfectly correlated with itself, as expected, but most
importantly there are no reddish off-diagonal terms, which would indicate a poor design.
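The numbers behind a color map on correlation plot can be sketched as the correlation matrix of the model-matrix columns. The intercept is excluded because a constant column has no correlation. The sketch assumes a two-factor quadratic model. For a 3 x 3 factorial, every pair of distinct terms is uncorrelated. Piling extra runs into one corner, a deliberately lopsided hypothetical design, produces the off-diagonal correlation that would show up as color away from blue.

```python
import itertools
import numpy as np

def quad2_terms(pts):
    """Non-intercept terms of a two-factor quadratic model, with names."""
    x1, x2 = pts[:, 0], pts[:, 1]
    names = ["x1", "x2", "x1*x2", "x1^2", "x2^2"]
    return names, np.column_stack([x1, x2, x1 * x2, x1**2, x2**2])

def max_offdiag_correlation(design):
    """Largest |correlation| between two different model terms, and which pair."""
    names, cols = quad2_terms(design)
    R = np.corrcoef(cols, rowvar=False)      # term-by-term correlation matrix
    off = np.abs(R - np.diag(np.diag(R)))    # zero out the diagonal
    i, j = np.unravel_index(np.argmax(off), off.shape)
    return float(off[i, j]), names[i], names[j]

factorial = np.array(list(itertools.product([-1.0, 0.0, 1.0], repeat=2)))
lopsided = np.vstack([factorial, [[1.0, 1.0]] * 4])  # 4 extra runs in one corner
print("3x3 factorial:", max_offdiag_correlation(factorial))
print("lopsided:     ", max_offdiag_correlation(lopsided))
```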
Figure 5: Color map on correlation plot for all of the terms in the response surface design. The correlation between
two different terms (an off-diagonal cell) is ideally zero and is indicated by blue.
In order to put the upcoming model fit in the best possible position, the design of
experiments sampling has found a sample set which adequately covers the factor space
and allows for near independent estimation of the parameter effects.
Evaluate System Response
The DOE sampling conditions are brought into the EDA link simulation environment for
the 28G VSR topology and simulated, and the link performance metrics are calculated. The
metrics considered here are eye height and eye width at a BER of 1e-12. It is essential to
check the simulation result waveforms for consistency and accuracy. It is recommended
to check the performance outliers to ensure that they represent realistic results. The model
fit will assume that each condition represents the actual response; thus any discrepancies
will propagate errors into the model fit and will result in poor or even wrong analysis
conclusions. Figure 6 below shows an example of how to visualize the results in SiSoft’s
Quantum Channel Designer ® (QCD).
Figure 6: The simulation results must be carefully evaluated to ensure that all results are reasonable before
proceeding with the model fit.
Response Surface Model Fit
All of the precautions taken up to this point have been done to obtain a good model fit.
Once the fit is complete and its quality measured, we can determine whether those
precautions were sufficient or whether more iterations and refinements are necessary. The
model form which we will utilize, called a Response Surface Model, is a multi-
dimensional polynomial with interaction terms as shown in Equation (2) below. Here 𝑦
is the measured response (such as eye height), 𝑥𝑖 is one of the n=9 factors and the 𝛽’s are
the unknown model coefficients which will be estimated by the least squares method.
$$y = \beta_0 + \sum_{i=1}^{n} \beta_i x_i + \sum_{i=1}^{n} \beta_{ii} x_i^2 + \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \beta_{ij} x_i x_j \qquad (2)$$
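Equation (2) translates directly into a model matrix with one column per term: an intercept, n linear columns, n squared columns and n(n-1)/2 products. The sketch below builds that matrix for factors already coded to [-1, 1] and fits it by ordinary least squares. For n = 9 the full form has 55 columns. The study's 54 terms are presumably this form without the quadratic term for the two-level routing layer, but that reading is our inference.

```python
from itertools import combinations
import numpy as np

def rsm_model_matrix(X):
    """Columns of Equation (2): intercept, x_i, x_i^2, and x_i*x_j for i < j.
    X has shape (runs, n) in coded units."""
    X = np.asarray(X, dtype=float)
    n = X.shape[1]
    cols = [np.ones(len(X))]
    cols += [X[:, i] for i in range(n)]
    cols += [X[:, i] ** 2 for i in range(n)]
    cols += [X[:, i] * X[:, j] for i, j in combinations(range(n), 2)]
    return np.column_stack(cols)

def fit_rsm(X, y):
    """Ordinary least squares estimates of the Equation (2) coefficients."""
    beta, *_ = np.linalg.lstsq(rsm_model_matrix(X), y, rcond=None)
    return beta

print("columns for n = 9 factors:", rsm_model_matrix(np.zeros((1, 9))).shape[1])
```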
The difference between the simulated eye height and the eye height as predicted by the
response equation is called the error residual. By examining the residual we can obtain
several measures of model quality and validate assumptions.
Goodness of fit
The simplest fit metric is called the coefficient of multiple determination, but everyone
just calls it “R-squared” for short; it is written as R². This metric ranges from 0 for a
poor fit to 1 for a good fit. Conceptually, 100·R² can be thought of as the percentage of
the variation in the data that can be explained by the model. One interesting fact is that
R² never decreases when additional model terms are added, whether or not these new
terms are actually significant. A modified metric, called adjusted R², takes into account
the number of terms used in the model and penalizes any extra unnecessary terms. Thus a
large difference between R² and adjusted R² is an indication that there are unnecessary
terms in the model.
The error standard deviation of the fit can be estimated by taking the residual for each
point in the data set, squaring it, finding the mean and then taking the square root.
(Statistical packages such as JMP divide the sum of squared residuals by the residual
degrees of freedom, the number of runs minus the number of model terms, rather than by
the number of runs.) This RMSE metric can also be used as a quick estimate of the
prediction confidence interval: for a rough 95% confidence interval, simply multiply RMSE
by 2. If +/- this value is larger than the needed accuracy, then it will be necessary to go
back and revisit earlier assumptions such as the model form used in the DOE creation and
the factor space definition.
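The goodness-of-fit quantities above can be computed directly from observed and fitted responses. The sketch below uses the standard definitions: R² = 1 - SSE/SST and adjusted R² = 1 - (1 - R²)(n - 1)/(n - p), where p counts every model coefficient including the intercept. It reports both the plain root-mean-square residual described in the text and the degrees-of-freedom version. The function name and return keys are illustrative.

```python
import numpy as np

def fit_summary(y, y_hat, n_params):
    """R^2, adjusted R^2 and RMSE from observed and fitted responses.
    n_params counts every model coefficient, including the intercept."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = len(y)
    resid = y - y_hat
    sse = float(resid @ resid)                      # sum of squared residuals
    sst = float(((y - y.mean()) ** 2).sum())        # total sum of squares
    r2 = 1.0 - sse / sst
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - n_params)
    rmse_mean = float(np.sqrt(sse / n))             # square, average, root
    rmse_df = float(np.sqrt(sse / (n - n_params)))  # degrees-of-freedom version
    return {"R2": r2, "R2_adj": r2_adj, "RMSE": rmse_mean,
            "RMSE_df": rmse_df, "approx_95_halfwidth": 2.0 * rmse_df}

print(fit_summary([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8], n_params=2))
```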
Figure 7: Goodness of fit summaries for eye height (left) and eye width (right).
Shown above in Figure 7 is the JMP statistical software fit summary for the eye height
and eye width fits. R² and adjusted R² are both in the high 90 percent range, and the RMSE
is 22 mV for eye height and 0.45 ps for eye width.
Lastly, the fit error residuals themselves need to be examined to ensure that they are
normally distributed and do not contain any “structure”. Figure 8 below shows the
residual distributions for eye height and eye width, which are roughly normally
distributed.
Figure 8: Fit error residuals for eye height (left) and eye width (right). These show that the residuals are roughly
normally distributed as required.
Plots of the residual versus the response or other important factors are the best way to
search for “structure” in the residual. Structure, i.e. some systematic relation between the
residual and some explanatory variable, is evidence of a model bias and can provide clues
as to what new terms should be included in the model fit. Often such additional model
terms cannot be used immediately because the DOE does not sample the space well
enough to estimate them; the DOE must then be reformulated so that the additional model
terms can be added without aliasing other important effects. Figure 9 below plots the
residual versus the predicted response. The residual should look normally distributed no
matter how it is viewed, but the figure shows that the cases with the lowest predicted eye
height all have positive residuals, as circled. Further investigation showed that an
additional model term, PCB_LEN*PCB_Z*RX_CORNER, improved the fit. Although
there may be more such terms that would improve the fit, the accuracy was sufficient for
the needs of the study and the model fit was deemed good enough.
Figure 9: Residual versus the predicted eye height (left) and predicted eye width (right). Systematic structure in the
residual is an indication of model bias. Note how the lowest eye height performance cases all have positive residuals;
this observation led to the inclusion of an additional model term which improved the fit.
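The sketch below mimics the situation described for Figure 9 with hypothetical data. The synthetic response contains a three-factor interaction that the Equation (2) form cannot represent. The factor labels only echo PCB_LEN, PCB_Z and RX_CORNER, and the coefficients are invented. Fitting Equation (2) alone leaves a systematic residual. Adding the single three-factor column removes it, which is the kind of improvement the extra PCB_LEN*PCB_Z*RX_CORNER term provided in the study.

```python
from itertools import combinations
import numpy as np

def rsm_columns(X):
    """Equation (2) columns: intercept, linear, quadratic, two-factor interactions."""
    n = X.shape[1]
    cols = [np.ones(len(X))]
    cols += [X[:, i] for i in range(n)]
    cols += [X[:, i] ** 2 for i in range(n)]
    cols += [X[:, i] * X[:, j] for i, j in combinations(range(n), 2)]
    return np.column_stack(cols)

def rms_residual(M, y):
    """Fit by OLS and return the root-mean-square residual."""
    beta, *_ = np.linalg.lstsq(M, y, rcond=None)
    r = y - M @ beta
    return float(np.sqrt(np.mean(r ** 2)))

rng = np.random.default_rng(3)
X = rng.uniform(-1.0, 1.0, size=(60, 3))  # hypothetical coded length, impedance, Rx corner
x_len, x_z, x_rx = X.T
# Hypothetical response that includes a three-factor interaction.
y = (0.5 - 0.15 * x_len + 0.02 * x_z - 0.03 * x_rx
     - 0.02 * x_len ** 2 + 0.04 * x_len * x_z * x_rx)

M2 = rsm_columns(X)                              # Equation (2) terms only
M3 = np.column_stack([M2, x_len * x_z * x_rx])   # plus the three-factor term
rmse_2 = rms_residual(M2, y)
rmse_3 = rms_residual(M3, y)
print(f"RMS residual, Eq. (2) terms only:     {rmse_2:.4f}")
print(f"RMS residual, plus three-factor term: {rmse_3:.2e}")
```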
While these are very good model fit results, it is interesting to think about the source of
the remaining uncertainty. The residual error can only come from two sources: random
error and lack of fit. Deterministic signal integrity simulations contain no random noise,
so the residual is due solely to lack of fit. In practice, a perfect model fit is not achieved
because the underlying phenomenon is not a perfect polynomial, and also because the true
underlying physical factors are only imperfectly represented by the study factors.
Explore and Optimize the 28G VSR System
Once everything has been done to ensure a proper fit, we can explore the factor space as
represented by the model with confidence that the uncertainty is roughly understood. In
this application, visualizing a 10-dimensional space is a daunting task; fortunately, there
are some tools which facilitate this type of analysis. A plot, called the prediction profiler,
shows what the response would be across each factor if all of the other factors are held
constant. This type of plot is most useful when used interactively, but much can be
gleaned from the static views used below as well.
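The core of a prediction profiler is straightforward. Each factor is swept across its range while the others are held at the current setting, and the fitted model is evaluated along each sweep. The sketch below assumes coded units and a hypothetical two-factor fitted surface whose coefficients are invented. It reports the local slope of each trace, the quantity used to judge which factor is most influential.

```python
import numpy as np

def profiler_traces(model, setting, n_points=21):
    """For each factor, sweep it over [-1, 1] (coded units) while holding the
    other factors at `setting`; returns {factor_index: (grid, response)}."""
    setting = np.asarray(setting, dtype=float)
    grid = np.linspace(-1.0, 1.0, n_points)
    traces = {}
    for i in range(len(setting)):
        pts = np.tile(setting, (n_points, 1))
        pts[:, i] = grid
        traces[i] = (grid, model(pts))
    return traces

def example_model(X):
    # Hypothetical fitted surface: x0 = coded trace length, x1 = coded impedance.
    x0, x1 = X[:, 0], X[:, 1]
    return 0.54 - 0.12 * x0 + 0.01 * x1 - 0.02 * x0**2 - 0.03 * x1**2 - 0.01 * x0 * x1

traces = profiler_traces(example_model, setting=[0.0, 0.0])
for i, (g, r) in traces.items():
    slope = np.gradient(r, g)[len(g) // 2]   # local slope at the current setting
    print(f"factor {i}: slope at setting = {slope:+.3f}")
```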
As an example, consider the prediction profiler plot for the two factor RSM fit as shown
in Figure 10.
Figure 10: Example prediction profiler plot of the predicted eye height response. It shows the response versus each of
the factors given all of the other factors are held constant.
This prediction profiler plot shows the eye height response across the explanatory factors,
trace length (PCB_LEN) and trace impedance (PCB_Z). Since the trace length curve has the
steepest slope, we can state that trace length is the most influential factor in this region
of the factor space. If there are any interactions between the factors, then the effective
slope of the line could change in other areas of the factor space. Additionally, this plot
indicates that when PCB_LEN=3.5 and PCB_Z = 100 then the predicted eye height is
0.538 V with a 95% confidence interval of [0.514, 0.562] V, which is equivalent to
stating that the predicted eye height is 0.538 +/- 0.0234 V for a 95% confidence interval.
The confidence intervals are represented in the prediction profiler plot by the blue dashed
lines surrounding the solid black predicted response.
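A confidence interval on the predicted mean response can be computed from the OLS fit. It is y0 +/- z * s * sqrt(f0'(X'X)^-1 f0), where s^2 is the residual variance estimate and f0 is the model-term vector at the point of interest. The sketch below uses a normal quantile in place of Student's t. That approximation is reasonable when the residual degrees of freedom are large, such as 256 runs against 54 terms, but it is our simplification. The usage data are hypothetical.

```python
from statistics import NormalDist
import numpy as np

def predict_with_ci(M, y, f0, level=0.95):
    """Predicted mean response at model-term vector f0 with a normal-approximation CI."""
    n, p = M.shape
    beta, *_ = np.linalg.lstsq(M, y, rcond=None)
    resid = y - M @ beta
    s2 = float(resid @ resid) / (n - p)            # residual variance estimate
    XtX_inv = np.linalg.inv(M.T @ M)
    se = float(np.sqrt(s2 * (f0 @ XtX_inv @ f0)))  # standard error of the mean prediction
    z = NormalDist().inv_cdf(0.5 + level / 2.0)    # about 1.96 for 95%
    y0 = float(f0 @ beta)
    return y0, (y0 - z * se, y0 + z * se)

# Usage with hypothetical data: straight-line fit of eye height vs coded length.
rng = np.random.default_rng(5)
x = np.linspace(-1.0, 1.0, 50)
y = 0.54 - 0.12 * x + rng.normal(0.0, 0.01, x.size)
M = np.column_stack([np.ones_like(x), x])
y0, (lo, hi) = predict_with_ci(M, y, np.array([1.0, 0.0]))
print(f"{y0:.3f} V, 95% CI [{lo:.3f}, {hi:.3f}] V")
```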
A confidence interval (sometimes abbreviated as CI) can be thought of as conveying the
following information:
o a CI provides a range of plausible values for the true response, with values outside
the range regarded as relatively implausible, or
o a CI gives the precision of the estimation where the upper and lower bounds
provide the likely maximum error of estimation, although there is a possibility of
larger errors.
A confidence level of 95% roughly covers two standard deviations from the predicted
value, and a confidence level of 99.7% roughly covers three standard deviations from the
predicted value.
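The correspondence between confidence level and standard deviations follows from the normal distribution. The short check below uses only the Python standard library. Plus or minus 2 sigma covers about 95.45%, plus or minus 3 sigma about 99.73%, and a 99.5% level needs about plus or minus 2.81 sigma.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def coverage(k):
    """Probability that a normal value lies within +/- k standard deviations."""
    return STD_NORMAL.cdf(k) - STD_NORMAL.cdf(-k)

for k in (1.96, 2.0, 3.0):
    print(f"+/- {k} sigma -> {100 * coverage(k):.2f}%")
print(f"99.5% requires +/- {STD_NORMAL.inv_cdf(0.9975):.2f} sigma")
```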
The prediction profiler plot for the middle of the example factor space is given in Figure
11.
Figure 11: Eye height (top row) and eye width (bottom row) Prediction Profiler for the middle of the factor space.
Important factors can be identified by the slopes of the curves.
As was noted before, the trace length is the most influential factor in this region of the
space. The two rows of plots are for eye height and eye width, respectively. To understand
the impact of any factor in this region of the space, this plot readily provides the
information.
Because the fitted response surface is represented by a well-defined function, the space
can be searched for the worst case conditions as shown in the prediction profiler plot in
Figure 12.
Figure 12: Prediction profile plot for eye height (top row) and eye width (bottom row) at the predicted worst case
condition.
Note that while the trace length is still the most influential factor in this region
of the space, the trace impedance and receiver PVT corner ($RX_CORNER) have also
become somewhat influential as compared to their influence at the middle of the factor
space.
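Evaluating the fitted polynomial takes microseconds rather than the 4.8 seconds of a link simulation, so the worst-case search can be a dense grid over the whole coded space. The sketch below assumes a hypothetical three-factor fitted surface whose coefficients are invented. With 41 levels per factor it evaluates about 69,000 points and locates the minimum predicted eye height, which for this surface falls on a corner of the space.

```python
import numpy as np

def eye_height_model(x_len, x_z, x_rx):
    # Hypothetical fitted response surface in coded units [-1, 1];
    # coefficients are illustrative, not taken from the study.
    return (0.50 - 0.15 * x_len + 0.01 * x_z + 0.03 * x_rx
            - 0.02 * x_len**2 - 0.04 * x_z**2 - 0.02 * x_len * x_z)

levels = np.linspace(-1.0, 1.0, 41)                   # 41 levels per factor
L, Z, R = np.meshgrid(levels, levels, levels, indexing="ij")
eh = eye_height_model(L, Z, R)                        # 41**3 model evaluations
i, j, k = np.unravel_index(np.argmin(eh), eh.shape)   # index of the worst case
print(f"worst case: x_len={L[i, j, k]:+.2f}, x_z={Z[i, j, k]:+.2f}, "
      f"x_rx={R[i, j, k]:+.2f}, eye height={eh[i, j, k]:.3f} V")
```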
Determining Routing Layer for High Speed Signals
The design optimization strategy used here is to first identify the worst-case
manufacturing condition and then set the design factors to their best-case values to
minimize the impact of that worst-case condition.
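The two-step strategy can be sketched directly on a fitted surface. In this minimal example the surface, the two manufacturing factors `mfg_a` and `mfg_b`, and the design factor `design` are all hypothetical; the grid search is a simple stand-in for whatever optimizer a real tool would use.

```python
from itertools import product

LEVELS = [i / 10.0 - 1.0 for i in range(21)]  # coded levels -1.0 ... 1.0


def eye_height(mfg_a, mfg_b, design):
    """Hypothetical eye-height surface (mV). mfg_a and mfg_b are manufacturing
    factors; `design` is a factor the designer controls. Coefficients are illustrative."""
    return (220.0 - 12.0 * mfg_a - 9.0 * mfg_b - 15.0 * design
            - 4.0 * mfg_a * design + 3.0 * design ** 2)


# Step 1: find the worst-case manufacturing condition with the design factor at nominal.
worst_mfg = min(product(LEVELS, repeat=2), key=lambda m: eye_height(*m, 0.0))

# Step 2: with manufacturing pinned at that corner, pick the best design setting.
best_design = max(LEVELS, key=lambda d: eye_height(*worst_mfg, d))

print(f"worst manufacturing corner: {worst_mfg}")
print(f"best design setting: {best_design}, "
      f"eye height {eye_height(*worst_mfg, best_design):.1f} mV")
```

Because interaction terms (here `mfg_a * design`) can move the worst-case corner once the design factor changes, it is prudent to repeat step 1 at the chosen design setting.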
To quantify the impact of the PCB routing layer at the worst-case corner, the predicted
response is calculated for PCB_LAYER=3 and PCB_LAYER=9 and compared. As
shown in Figure 13 below (note that the vertical axis has been rescaled for a better view),
the differences in eye height and eye width between the two PCB layers are only 0.012 mV
and 0.35 ps, respectively. Also note that the 95% confidence interval is +/- 24 mV for eye
height and +/- 0.5 ps for eye width. When the predicted difference is smaller than the
confidence interval, the model cannot confidently identify the effect as significant and
provides no actionable information. This could lead to the following
conclusions:
o the PCB routing layer is a weak predictor of the system performance and layers 3
and 9 are equivalent, or
o PCB routing layer 9 has a slight but statistically insignificant advantage over
layer 3.
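The decision rule used in this comparison can be written as a one-line check. This is a sketch of the rule of thumb as stated in the text, using the values quoted for layers 3 and 9; a more rigorous test would build a confidence interval for the difference itself from the model's coefficient covariance, because the two predictions share the same fitted coefficients and are therefore correlated.

```python
def is_actionable(predicted_diff: float, ci_half_width: float) -> bool:
    """Rule of thumb from the text: treat an effect as significant only when the
    predicted difference is larger than the +/- half-width of the 95% interval."""
    return abs(predicted_diff) > ci_half_width


# PCB layer 3 vs. layer 9 at the worst-case corner: (predicted difference, CI half-width).
pcb_layer_comparison = {
    "eye height (mV)": (0.012, 24.0),
    "eye width (ps)": (0.35, 0.5),
}
for metric, (diff, half_width) in pcb_layer_comparison.items():
    verdict = "significant" if is_actionable(diff, half_width) else "not significant"
    print(f"{metric}: difference {diff} vs. +/- {half_width} -> {verdict}")
```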
If these conclusions are insufficient for the decision at hand, then additional information
on the impact of the two layers is required to make a definitive decision. A more focused
DOE/RSM around this factor could be defined, with some of the other insignificant factors
removed from the study to improve the accuracy of the analysis. Additionally, other
influences on the routing layer decision, such as board area or monetary cost,
should be considered when making any decisions.
For the VSR study, we conclude that PCB routing layer 9 has a slight (although
statistically insignificant) advantage over layer 3.
Figure 13: Example of how to evaluate "what if" scenarios across the PCB layer factor. The plot on the left shows the
predicted responses when routing on PCB layer 3, and the plot on the right shows the predicted
responses when routing on PCB layer 9.
Determining the Best Via Anti-pad Size
A similar approach is taken to understand the impact of the PCB via anti-pad size.
The predicted link performance difference between anti-pad diameters of 32 and 40 mil
is 45 mV in eye height and 0.47 ps in eye width, where the 95% confidence intervals are
+/- 25 mV for eye height and +/- 0.5 ps for eye width. This can be seen visually in the
prediction profiler plots in Figure 14: the dashed red horizontal lines in the eye height
plots are not contained within the dashed blue confidence interval lines, whereas the
dashed red lines in the eye width plots are contained within them. Therefore we can
conclude that an anti-pad size of 32 mils is the better solution for eye height, and may
also be better for eye width, although that difference is statistically insignificant.
Additional constraints, such as increased difficulty in fabricating a given anti-pad
diameter, should be carefully weighed against the performance benefits before making a
final decision.
For this study, an anti-pad size of 32 mils will be used in future analysis.
Figure 14: "What If" analysis across the via anti-pad factor showing the performance difference between different via
anti-pad sizes.
One important point to clarify is that while the DOE/RSM methodology can quantify
the impact of a given factor, it cannot indicate why the factor matters.
The reasons why a factor is impactful in a given situation must come from subject matter
expertise and engineering judgment. If no satisfactory physical explanation is
forthcoming, then the statistical indication of importance may be the impetus for further
analysis, and it can be useful to form hypotheses about competing physical explanations
of the data.
Manufacturing Variation
One approach to quantifying manufacturing variation is to assign probability distribution
functions (PDFs) to each of the factors and then randomly generate millions of cases.
Using the response surface equation, the system performance can be calculated for
each of the random cases, giving an estimate of the probability of achieving a
certain system performance level. Great care must be taken to obtain accurate PDFs, as
the weight of the tails can make a large difference in such an analysis. Figure 15 shows the
prediction profiler plot and the PDFs assigned to each of the manufacturing factors.
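The sampling procedure can be sketched in a few lines of Python. Everything model-specific here is an assumption for illustration: the surface, the factor names, the two PDFs (a normal distribution for impedance and equally likely discrete receiver corners), the pass threshold, and the sample size, which is far smaller than the millions of cases used in the study so the example runs quickly.

```python
import random
import statistics


def eye_height(trace_length, trace_impedance, rx_corner):
    """Hypothetical fitted eye-height surface (mV); factors in coded units."""
    return (250.0 - 60.0 * trace_length - 10.0 * trace_impedance - 8.0 * rx_corner
            - 6.0 * trace_length * trace_impedance)


def sample_impedance(rng):
    # Assumed PDF: normal, with the coded +/-1 range equal to +/-3 sigma.
    return rng.gauss(0.0, 1.0 / 3.0)


def sample_rx_corner(rng):
    # Assumed PDF: slow, typical and fast corners equally likely.
    return rng.choice((-1.0, 0.0, 1.0))


def monte_carlo(trace_length, n=100_000, seed=1):
    """Evaluate the surface at n random draws of the manufacturing factors."""
    rng = random.Random(seed)
    return [eye_height(trace_length, sample_impedance(rng), sample_rx_corner(rng))
            for _ in range(n)]


samples = monte_carlo(trace_length=1.0)
spec_mv = 175.0  # hypothetical pass threshold
yield_fraction = sum(v >= spec_mv for v in samples) / len(samples)
print(f"mean {statistics.fmean(samples):.1f} mV, "
      f"stdev {statistics.stdev(samples):.2f} mV, yield {yield_fraction:.2%}")
```

Since the surface is cheap to evaluate, the cost of each case is negligible compared with a circuit simulation, which is what makes millions of samples practical.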
Figure 15: Prediction profiler plot with distributions assigned to each of the manufacturing factors. Randomly
sampling the factor space according to these distributions will give manufacturing Yield information.
We wanted to understand the manufacturing variation as a function of the line card
trace length. Therefore, for each line card length of 1, 2, 3, 4, 5 and 6 inches, one
million cases were randomly generated and the eye height and width calculated
using the response surface equation. The mean and standard deviation of the one
million cases for each length are given below in Figure 16. This analysis clearly
shows that for eye height the impact of manufacturing variation increases with longer line
card traces. A possible explanation for the increase is not that more manufacturing
variation occurs when the line card trace is 6 inches long, but that the system is
much more sensitive to the same variations. Note also that the eye width
variation changes very little with increased trace length.
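The explanation offered above, that identical manufacturing variation produces a larger spread at longer lengths because the system is more sensitive, can be illustrated with a hypothetical surface that includes a length-by-impedance interaction term. In this sketch the factor PDFs are the same at every length; only the sensitivity changes. The surface, the PDFs, and the sample size (50,000 rather than one million per length) are assumptions for illustration.

```python
import random
import statistics


def eye_height(length_in, trace_impedance, rx_corner):
    """Hypothetical eye-height surface (mV). The length * impedance interaction makes
    the impedance sensitivity grow with trace length; coefficients are illustrative."""
    return (320.0 - 25.0 * length_in - (2.0 + 2.5 * length_in) * trace_impedance
            - 6.0 * rx_corner)


def sweep(lengths_in, n=50_000, seed=7):
    """Mean and standard deviation of eye height for each trace length, drawing the
    manufacturing factors from the same (assumed) PDFs at every length."""
    rng = random.Random(seed)
    results = {}
    for length in lengths_in:
        values = [eye_height(length, rng.gauss(0.0, 1.0 / 3.0), rng.choice((-1.0, 0.0, 1.0)))
                  for _ in range(n)]
        results[length] = (statistics.fmean(values), statistics.stdev(values))
    return results


summary = sweep([1, 2, 3, 4, 5, 6])
for length, (mean, stdev) in summary.items():
    print(f"{length} in: mean {mean:6.1f} mV, stdev {stdev:5.2f} mV")
```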
Figure 16: Manufacturing Yield predictions versus maximum allowed trace length. The vertical bars indicate the
standard deviation of each yield analysis and show increased eye height variability with increased trace length.
Defects per Million Analysis
The final piece of analysis is an estimate of the defects per million. Assuming the
factor PDFs and response surface equations are accurate, the manufacturing yield of the
system can be calculated. The pass/fail criteria for this analysis depend on which link
uncertainties are included in the simulation and which are budgeted. For this example,
a passing system must have an eye height of at least 200 mV and an eye width of at least 19 ps.
For a system with 6-inch line card traces, the yield distribution plots for eye height
and width are given in Figure 17 below.
Figure 17: Defects Per Million (DPM) analysis showing a predicted 225 and 170 DPM according to the eye height and
eye width requirement, respectively.
Here the lower spec limit (LSL) is plotted on each distribution, and the percentage of the
distribution falling below this limit is shown along with the corresponding parts per
million (PPM) values: 224 defects per million for eye height and 170 for eye width.
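The defects-per-million figure can be computed either by counting Monte Carlo samples below the LSL or from a fitted distribution. A minimal sketch of both; the function names are hypothetical, and the mean and standard deviation in the example call are made-up numbers paired with the 200 mV eye-height limit from the text.

```python
from statistics import NormalDist


def dpm_empirical(samples, lsl):
    """Defects per million: share of Monte Carlo samples below the lower spec limit."""
    failures = sum(1 for value in samples if value < lsl)
    return 1e6 * failures / len(samples)


def dpm_normal(mean, stdev, lsl):
    """Defects per million if the response is assumed normally distributed."""
    return 1e6 * NormalDist(mean, stdev).cdf(lsl)


# Eye-height LSL from the text (200 mV) with a hypothetical fitted distribution.
print(f"{dpm_normal(mean=260.0, stdev=17.0, lsl=200.0):.0f} DPM")
```

The empirical count needs many samples in the tail: at around 200 DPM, one million samples contain only about 200 failures, so the estimate carries roughly a 7% relative standard error (about 1/sqrt(200)). The normal approximation avoids that noise but depends entirely on the assumed tail shape.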
Conclusion
We have shown the methodology that assisted in the first CMOS 28G VSR / 100G
Ethernet PHY design. The interface design questions of which PCB layer to route on,
what size of via anti-pad to use and what maximum trace length to allow were addressed.
The worst-case conditions were identified and the impact of manufacturing variation on
performance was quantified. All of this was accomplished with the DOE/RSM
methodology. Instead of simulating millions of conditions requiring months of compute
time, the DOE intelligently sampled the factor space with only 256 runs. A least-squares
fit found the response surface that best fit the data, and after the model was
validated it was used to predict system performance throughout the factor space. It must
be emphasized that several iterations of the methodology were required as the model
assumptions changed. Overall, the DOE/RSM methodology has been shown to be a
powerful approach to comprehending and optimizing a dizzyingly large factor space, and it
contributed to Inphi’s success with the world’s first production-ready 100G CMOS
PHY/SerDes Gearbox.
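The least-squares fitting step summarized above can be sketched as follows. This is not the tool flow used in the study: it fits a full quadratic surface in two coded factors by solving the normal equations in plain Python, using a hypothetical 3-level full factorial of 9 runs and noise-free responses generated from made-up coefficients so the fit can be checked. In practice a statistics package or `numpy.linalg.lstsq`, which avoids forming X^T X and is numerically more robust, would be the usual choice.

```python
from itertools import product


def quad_terms(x1, x2):
    """Model terms of a full quadratic response surface in two coded factors."""
    return [1.0, x1, x2, x1 * x2, x1 * x1, x2 * x2]


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_least_squares(runs, responses):
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y."""
    X = [quad_terms(*run) for run in runs]
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * y for row, y in zip(X, responses)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical 3-level full factorial in two coded factors: 9 runs for 6 coefficients.
runs = list(product((-1.0, 0.0, 1.0), repeat=2))
true_beta = [250.0, -60.0, -10.0, -6.0, -4.0, 2.0]
responses = [sum(b * t for b, t in zip(true_beta, quad_terms(*run))) for run in runs]
beta = fit_least_squares(runs, responses)
print([round(b, 3) for b in beta])
```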