
Copyright © 2021 The Society of Chemical Engineers, Japan

Journal of Chemical Engineering of Japan, Vol. 54, No. 8, pp. 449–455, 2021

Convolutional Neural Networks for Multi-Stage Semiconductor Processes

Xiaofei Wu1, Junghui Chen2, Lei Xie1, Yishan Lee2 and Chun-I Chen3

1 State Laboratory of Industrial Control Technology, Institute of Cyber-Systems and Control, Zhejiang University, Yuquan Campus, Hangzhou, 310027, P. R. China

2 Department of Chemical Engineering, Chung-Yuan Christian University, Chung-Li, Taoyuan 32023, Taiwan, R.O.C

3 Magnetic Head Wafer Manufacturing Fab, Western Digital Corporation, Fremont, California, USA

Keywords: Convolutional Neural Network, Features Extraction, Multi-Stage Process, Virtual Metrology

In semiconductor manufacturing processes, certain quality measurements cannot be easily obtained at a low cost. In such cases, virtual metrology (VM) is typically used to predict the relevant quality variables without increasing the number of physical measurements. Faced with large volumes of raw data, traditional data-driven VM methods adopt data pre-processing for feature extraction before modeling with a predefined model. However, if the constructed model and the extracted features are not suitable, the identified VM model is generally not reliable. Moreover, almost no VM model has been proposed for multi-stage raw data. To improve the prediction performance of VM models, it is imperative that only suitable features are chosen and used in the modeling, especially for multi-stage raw process data. In this paper, we develop a convolutional neural network (CNN)-based VM model for multi-stage raw semiconductor data. Owing to the intrinsic nature of CNN, the cascade-connected convolving filters and the regression part are trained together to provide appropriate features for the final prediction. The construction of CNN makes it possible to reasonably extract information at each stage separately when processing multi-stage data. The proposed method is validated using real semiconductor process data and found to be superior to conventional methods with significantly improved accuracy.

Introduction

Semiconductor manufacturing comprises a large number of process steps. In the production of electronic chips, for example, wire saws first slice the silicon ingots into sections; this process is followed by several flattening stages, including cleaning, polishing, and lapping. The wafers are then transferred to the front-end and back-end processes where the final chips are manufactured (O’Mara et al., 1990). However, owing to the high throughput involved in semiconductor manufacturing, it is difficult to measure the quality variables of all production wafers at each stage. To address this issue, wafer-to-wafer modeling to adjust tools and equipment in time has been increasingly used to predict the final product quality and to reduce process excursions. However, owing to the complexity of the physical phenomena involved, it is often difficult to interpret the nanoscale deviations of the dimensions involved in semiconductor manufacturing based on basic mathematical models (Yang et al., 2008; Ringwood et al., 2010). As terabytes worth of manufacturing data with inherent valuable information are obtained from semiconductor devices, electronic boards, and systems, the data-driven virtual metrology (VM) has been widely applied as a model construction method in the semiconductor industry. Based on only measurements on sampled wafers, VM allows the values of the relevant quality variables to be predicted during the processing of each wafer without increasing the number of physical measurements.

Relevant studies on VM in the field of semiconductor manufacturing have been conducted in the past. Hirai and Kano (2015) utilized locally weighted partial least squares (LW-PLS), which is a type of just-in-time modeling technique. Using the pre-processed metrology data, they derived a linear regression model. Furthermore, extensive studies on VM models using support vector machines have been carried out. Among them, Lenz and Barak (2013) predicted the layer thickness in high-density plasma chemical vapor deposition using support vector regression. In addition, several probabilistic methods, such as k-NN regression, were proposed to improve accuracy and determine uncertainty information to support appropriate decision-making (Lee et al., 2014).

Despite the burgeoning popularity of data-driven approaches, some potential limitations still exist. Several existing machine learning methods, such as support vector regression and nonparametric density estimation techniques, can provide good predictions, but they are valid only for properly pre-processed data. Typically, the parameters of the process tool, such as temperature, pressure, voltage, and current, are specified according to the product requirements. For wafer fabrication, several process tools corresponding

Received on July 30, 2020; accepted on February 9, 2021
DOI: 10.1252/jcej.20we139
Correspondence concerning this article should be addressed to J. Chen (E-mail address: [email protected]).

Research Paper


to the different stages of production exist, resulting in large-scale datasets for analysis. Conventional machine learning methods cannot be easily applied when one has even a moderately large number of inputs (Hirai and Kano, 2015). Thus, feature extraction is often used to reduce the dimensions of measurements in semiconductor manufacturing in a way that still retains most of the information necessary to discriminate among different observation conditions. This procedure is especially important when the machine learning model is unable to handle huge amounts of input variables. In the past, before the introduction of data-driven models, a typical approach that reduced the trace data to four summary statistics (including minimum, maximum, mean, and standard deviation) was widely applied in semiconductor processes. However, dimension reduction of data has its disadvantages. The model prediction capability could suffer due to the loss of information. It is well known that the selected features can be used to easily organize and categorize data so that the target process can be modeled to produce similar outputs. However, the prediction results may be biased as the prediction model cannot directly select the required features from the data. Although raw data contain large volumes of information, great effort is still required to directly process raw data. Thus, trial procedures are performed repeatedly until good features for the prediction model are obtained. As advanced machine learning techniques have been successfully applied to natural language processing, speech recognition, and many other technical fields with high data volumes in recent years (Zheng et al., 2014; Lee et al., 2016; Tan et al., 2016), deep architectures have proved to be capable of transforming the original correlated but subtly different information data into their corresponding features. More importantly, they are tuned to extract appropriate features for making suitable regression models.
However, the applications of deep structures in VM (Terzi et al., 2017; Lee and Kim, 2018; Maggipinto et al., 2018a, 2018b) focus on single-stage data. The integration of deep structures with VM applications for raw multi-stage process data is therefore worth investigating. In this paper, a VM modeling scheme for multi-stage process data, based on the convolutional neural network (CNN), is developed.
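The summary-statistics reduction described above can be illustrated with a minimal sketch. The function name `summarize_trace` and the sample readings are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which variant was used in practice.

```python
import statistics


def summarize_trace(trace):
    """Reduce one sensor trace (a list of readings) to four summary statistics."""
    return {
        "min": min(trace),
        "max": max(trace),
        "mean": statistics.fmean(trace),
        # Population standard deviation; the sample variant is equally plausible.
        "std": statistics.pstdev(trace),
    }


# Hypothetical trace: five readings from one sensor during one stage.
trace = [1.0, 2.0, 3.0, 4.0, 5.0]
features = summarize_trace(trace)

# A rising and a falling trace give identical summaries, which illustrates
# the loss of temporal information that the text warns about.
same_summary = summarize_trace(trace) == summarize_trace(trace[::-1])
```

The reversed trace is indistinguishable from the original after this reduction, which is one concrete way the information loss can arise.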

In general, CNN only looks at a small patch of the whole data; it uses a series of convolving filters to extract the local features (Lecun et al., 1998; Ciresan et al., 2011). The convolving part in CNN can capture the useful information hidden deep in the actual observations. Moreover, a CNN has the advantage of requiring less time and fewer parameters to train. Inspired by this idea, in this paper, multi-layer convolving filters are applied to one-dimensional process data to obtain distinct features from process data (particularly in semiconductor plant data) instead of computer-vision-related data. Feature extraction from CNN is now a practical approach when large-scale inputs become too complicated to handle. Moreover, the number of features to be adjusted on CNN depends on the required performance. This can avoid the phenomenon called the curse of dimensionality (Scott, 2008) when the number of features is high.

To overcome the aforementioned problems, the most important features that best represent the multi-stage process should be chosen for constructing the regression model. In this study, a novel data arrangement is proposed that can extract the characteristics of different stage variables for each wafer, so that the new CNN can directly use the high-dimensional raw input data to construct the nonlinear behavior of the multi-stage manufacturing process instead of selecting variables with methods such as the least absolute shrinkage and selection operator (LASSO). Without the pre-selection of features from data, the CNN model that represents the process can be identified. The CNN model can extract the process features simultaneously through the collected sample data. The remainder of the paper is organized as follows. In Section 1, the CNN is briefly reviewed. In Section 2, the framework of the novel CNN-based network proposed in this study and its working are described. In Section 3, the features of the proposed method and its application to an actual semiconductor process are presented to show the effectiveness of the proposed algorithm. We then compare the proposed model with various existing models. Finally, we present our conclusions.

1. Preliminaries: CNN

There are many variants of the CNN architecture (Terzi et al., 2017; Lee and Kim, 2018; Maggipinto et al., 2018a; Tsutsui and Matsuzawa, 2019) for chemical processes in the literature, but the general structure of a CNN mainly comprises two parts. The first part is used for feature extraction and is made up of convolution and sub-sampling layers arranged alternately, which are then followed by an activation function and a batch normalization layer, altogether forming multi-layer convolving filters (Figure 1(a)). The other part handles regression and is made up of fully connected layers and an output layer.

Figure 1(b) shows the same structure in more detail. The input layer first accepts data to be predicted. As the name implies, the convolutional layer is responsible for learning the feature representation of the inputs. The ability of a CNN to accurately match diverse patterns can be attributed to it using convolution operations for computing different feature maps. To obtain the local feature map, the input and the learned filter are convolved. This is followed by an element-wise nonlinear activation function on the results from the convolution for handling the nonlinear situation. The sub-sampling layer, which is always placed after the hierarchical convolution layer, reduces the feature map and is responsible for achieving shift-invariance (Lecun et al., 1998). In traditional deep networks, the distribution of each layer’s inputs would be changed during training as the parameters of the previous layers change. This requires a lower learning rate and careful parameter initialization to slow down the adjustment; however, it is still hard to train the model to avoid saturating nonlinearity. This issue can be overcome through batch normalization. By normalizing the layer inputs before the data propagates through the deep network, one can use higher learning rates without a careful selection of initialization (Ioffe and Szegedy, 2015).
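As a rough illustration of the normalization step, the sketch below applies the training-time batch normalization transform to a mini-batch of scalar activations. The names `batch_norm`, `gamma`, `beta`, and `eps`, and the sample values, are illustrative assumptions; in a real network `gamma` and `beta` are learned, and running statistics are kept for inference, which is omitted here.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch to zero mean and (near) unit variance, then scale and shift."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


# Hypothetical activations of one unit across a mini-batch of four samples.
activations = [10.0, 12.0, 14.0, 16.0]
normalized = batch_norm(activations)

norm_mean = sum(normalized) / len(normalized)
norm_var = sum((x - norm_mean) ** 2 for x in normalized) / len(normalized)
```

Whatever the scale of the incoming activations, the next layer sees inputs with a stable distribution, which is the property that permits higher learning rates.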

The aforementioned procedures consider a single layer of convolution only; learning complex patterns with a CNN may involve more than a single convolution layer. The layers shown in Figure 1(a) are interpreted from left to right where related layers are grouped. As the input progresses further into the network, its data dimensions are reduced while its depth is increased. Finally, the extracted features are sent to the fully connected layer of the network where every input data is associated with every corresponding correct output. The backpropagation learning algorithm is then used to adjust the parameters in the convolution layers and the fully connected layer to match the multilayer CNN network outputs to the desired targets.

2. Proposed Method

In a traditional CNN, information of the condensed feature maps from raw data is transferred to the last layer for classification or regression tasks. Most of the raw data come from a single stage. Owing to the proven capability of deep learning models in image classification, an effective CNN model for multi-stage data is proposed in this paper. Each channel, also known as every sensor variable in the input layer, of the model takes a single dimension of a multivariate time series as an input; then CNN is used to extract features from sensor variables. The time spent at each stage in a typical multi-stage wafer manufacturing process is as short as a few seconds. The measurements of the wafer at each stage are completely obtained almost instantaneously although the data in each stage are time-series data. Considering the internal dynamics of each stage is impractical in this work. In the proposed data structure, each channel represents the whole stage data for a single variable. Hence, CNNs are appropriate in this case to extract features from the measurements of sensor variables instead of the recurrent neural network (Mikolov et al., 2010). The model finally combines the learned features of each stage and feeds them into the fully connected layer to perform regression. The input size is reduced by the convolution layer, resulting in a gradual decrease in the dimension as the stack of layers of the feature extraction becomes deeper.

2.1 Novel data arrangement

The typical sequential operation with S stages in the industrial process is considered. The data are first separated based on different stages to preserve stage information of the data. The collected data, which are the time series data in each stage, are arranged into a single channel for each variable. Conventional image data contain spatial information; by contrast, multi-stage data do not contain any spatial information. Thus, it would not be appropriate to arrange the data on a two-dimensional grid of time against sensor variable and to apply the traditional two-dimensional convolution, because a different ordering of the variables would yield different features. As a result, each variable is represented by a single channel. Such an arrangement allows for extracting each variable’s local time-series information without being influenced by the order of the variables. As Figure 2 shows, the measurements during $T_s$ time intervals in the s-th stage are fed into the network with each sensor variable separated.
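One possible way to realize this arrangement is sketched below. The record layout (stage index, variable name, reading), the function name `arrange_by_stage`, and the sensor names are all hypothetical; the source does not describe how the raw trace data are stored.

```python
def arrange_by_stage(records, n_stages):
    """Group raw readings into one channel (time series) per variable per stage.

    records: iterable of (stage, variable, value), assumed to be in time order.
    Returns a list with one dict per stage, mapping variable -> time series.
    """
    stages = [dict() for _ in range(n_stages)]
    for stage, variable, value in records:
        stages[stage].setdefault(variable, []).append(value)
    return stages


# Hypothetical wafer with two stages; stage 0 has two sensors, stage 1 has one.
records = [
    (0, "temp", 300.0), (0, "pressure", 1.2),
    (0, "temp", 301.0), (0, "pressure", 1.3),
    (1, "voltage", 5.0), (1, "voltage", 5.1),
]
arranged = arrange_by_stage(records, n_stages=2)
```

Because each variable is stored as its own channel, reordering the variables within a stage changes nothing about the individual time series, which is the property the arrangement is meant to guarantee.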

At each stage s, the input for each produced wafer ($\mathbf{x}^{0}_{s,v_s^{0}} \in \mathbb{R}^{T_s}$, $s=1,\dots,S$, $v_s^{0}=1,\dots,V_s^{0}$) consists of $V_s^{0}$ variables, also called channels, and the measurements of each variable form a time series of $T_s$ data points. Unlike conventional CNN, which extracts features from all the sensor variables in the square receptive field, the novel CNN (Figure 2) is suitable for handling multivariate time-series data as it allows the receptive field to process along the time axis for each respective variable. Thus, instead of the features of the conventional CNN in the two-dimensional spatial directions, only the local features among inter-variable correlations along the one-dimensional processing time direction are extracted. In the novel CNN, the receptive field still attempts to find features of similar characteristics across the entire input area. Here, the receptive fields for one feature map in different variables have different kernels. As there are $V_s^{l}$ variables at each stage, $V_s^{l}$ one-dimensional kernels are needed to extract feature outputs.

Fig. 1 (a) Block diagram of CNN regression; (b) Structure of CNN

Fig. 2 Model architecture of CNN for a process with S stages

2.2 Convolution layer and activation layer

The configuration of the layer l−1 to layer l is illustrated in Figure 3. The inputs of each convolutional layer (l) are several univariate time series denoted as $\mathbf{x}^{l-1}_{s,v_s^{l-1}}$, $s=1,\dots,S$, $v_s^{l-1}=1,\dots,V_s^{l-1}$, where $V_s^{l-1}$ represents a selection of input vectors (channels) in the stage s of the layer l−1. For the stage s of the first layer, the number of channels is $V_s^{0}$. The feature value of the $v_s^{l}$-th feature map in the l-th layer, $\mathbf{q}^{l}_{s,v_s^{l}}$, is calculated by

$$\mathbf{q}^{l}_{s,v_s^{l}} = \sum_{v_s^{l-1}=1}^{V_s^{l-1}} \mathbf{x}^{l-1}_{s,v_s^{l-1}} * \mathbf{w}^{l}_{s,v_s^{l-1},v_s^{l}} + b^{l}_{s,v_s^{l}} \qquad (1)$$

where “*” is the convolution operator, $\mathbf{w}^{l}_{s,v_s^{l-1},v_s^{l}}$ is the kernel vector with a designed dimension, and $b^{l}_{s,v_s^{l}}$ is the bias term. To capture local temporal information, each trainable kernel $\mathbf{w}^{l}_{s,v_s^{l-1},v_s^{l}}$ should be restricted to a small size.

The activation function introduces nonlinearities to CNN. It is desirable to detect nonlinear features of multi-layer networks. The most widely used activation functions are sigmoid, tanh, and rectified linear unit (ReLU) functions. In the proposed work, sigmoid is used as the activation function and is expressed as

$$f(q) = \frac{1}{1+e^{-q}} \qquad (2)$$

where q denotes every element in $\mathbf{q}^{l}_{s,v_s^{l}}$. After activation, $\mathbf{q}^{l}_{s,v_s^{l}}$ is transferred to $\mathbf{x}^{l}_{s,v_s^{l}}$ with the same size. Although subsampling can reduce the size of the input, thus reducing the computational load, and normalization can help to identify high-frequency features, subsampling is not used in this work so that all data information is kept for the next layer.

After the convolution layer, the feature outputs pass through the activation function to better handle model nonlinearity. Thus, each convolving filter contains the convolution layer and the activation function. To extract more distinct features, the convolving filters will be repetitively applied to new input channels generated from the previous convolving filter to produce new feature outputs layer by layer separately.
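The per-stage convolving filter of Eqs. (1) and (2) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: no padding and a stride of one (the source gives neither), and the sliding product is written in the cross-correlation form that CNN code commonly uses, whereas a strict convolution would flip the kernel first. The function names and the sample values are hypothetical.

```python
import math


def conv1d_valid(x, w):
    """Slide kernel w along series x without padding (assumed; stride 1)."""
    k = len(w)
    return [sum(x[t + j] * w[j] for j in range(k)) for t in range(len(x) - k + 1)]


def sigmoid(q):
    """Eq. (2): element-wise sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-q))


def conv_layer(channels, kernels, biases):
    """Eq. (1) followed by Eq. (2) for one stage.

    channels: list of input series, one per input channel.
    kernels[v_out][v_in]: kernel linking input channel v_in to feature map v_out.
    biases[v_out]: scalar bias of feature map v_out.
    """
    outputs = []
    for kernel_set, b in zip(kernels, biases):
        per_channel = [conv1d_valid(x, w) for x, w in zip(channels, kernel_set)]
        q = [sum(values) + b for values in zip(*per_channel)]  # sum over input channels
        outputs.append([sigmoid(v) for v in q])
    return outputs


# Hypothetical stage with two input channels of four time points, one feature map.
channels = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
kernels = [[[1.0, 0.0], [0.0, 1.0]]]
biases = [0.0]
feature_maps = conv_layer(channels, kernels, biases)
```

Each feature map uses a different kernel for each input channel and sums the results, matching the summation over $v_s^{l-1}$ in Eq. (1). Stacking layers amounts to feeding `feature_maps` back in as the next layer's `channels`.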

2.3 Regression layer

Finally, all the individual feature outputs of all the stages of the last layer of ‘feature extraction’ in Figure 2, $\mathbf{x}^{l-1}_{s,v_s^{l-1}}$, $s=1,\dots,S$, $v_s^{l-1}=1,\dots,V_s^{l-1}$, are collected, and the fully connected layer eventually establishes the regression relationship and obtains the value of geometrical quality. Although the whole structure of the proposed novel CNN is similar to that of most deep neural networks, the input data arrangement is specifically for raw multi-stage process data. This means that the features hidden in the raw data are directly extracted by the convolving filter layer, and these features correspond to the final regression model.
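The collection step and the regression can be pictured with the small sketch below. It assumes a single linear output unit, which is a simplification; the source does not give the size of the fully connected layer. The names `flatten_features` and `fully_connected` and all numbers are illustrative, and in practice the weights are learned by backpropagation together with the kernels.

```python
def flatten_features(stage_features):
    """Concatenate every feature map of every stage into one feature vector."""
    return [v for stage in stage_features for fmap in stage for v in fmap]


def fully_connected(features, weights, bias):
    """Single linear output unit (an assumed, simplified regression head)."""
    return sum(f * w for f, w in zip(features, weights)) + bias


# Hypothetical last-layer outputs: stage 0 has one feature map, stage 1 has two.
stage_features = [[[0.5, 0.25]], [[1.0], [2.0]]]
flat = flatten_features(stage_features)
prediction = fully_connected(flat, weights=[1.0, 1.0, 1.0, 1.0], bias=0.0)
```

Stages stay separate throughout feature extraction and meet only in this final vector.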

3. Case Study

In this section, the proposed CNN is compared to five baseline approaches to illustrate the performance of the proposed method in terms of both efficiency and accuracy. A computational experiment is conducted using data from a real industrial process. Five approaches, namely, partial least squares (PLS), Gaussian process regression (GPR), principal component analysis and partial least squares (PCA-PLS), principal component analysis and Gaussian process regression (PCA-GPR), and standard CNN (which is an abbreviation denoting a CNN with all data concatenation in a single-stage) are considered as baseline methods for the evaluation purpose. Here, PLS, GPR, and standard CNN are considered for comparison with the novel CNN, which can show improvement for the new data arrangement mentioned in Section 2 (proposed work). Moreover, PCA-PLS and PCA-GPR are the methods that combine the conventional feature extraction part, PCA, with the regression part, PLS, and GPR.

3.1 Data description

The process considered for a case study is chemical vapor deposition (CVD); the details of the process are not disclosed because of a confidentiality agreement. However, the process is similar to the process in the semiconductor industry that is responsible for applying solid thin-film coatings to surfaces. The CVD process is complex as it involves various chemical reactions and multiple reactor systems. Figure 4 is a simplified representation of the actual unit. The reactors are independently controlled to allow the film to be deposited in the process chambers under various conditions. CVD equipment is equipped with a considerable number of sensors. Through VM development, the quality of a wafer can be predicted from historical process and production equipment data, without costly quality measurements.

Fig. 3 Configuration of layer l−1 to layer l (l≠L) in the CNN model


The data were provided by Western Digital from a four-step deposition process. Raw trace data were collected from 27, 27, 27, and 20 tool sensors at the four stages, respectively. At each stage of the process, data for 100 time points were observed. It took about two months to obtain the data for all the 170 wafers; 90 of them are used for training and the rest for testing. The raw data for the duration of each step are used instead of the common practice of using descriptive statistics including the mean, variance, minimum, and maximum. The output is the mean of a certain electric property of each wafer. As indicated earlier, the input and output variable names are not disclosed because of the confidentiality agreement.
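The figures quoted above imply the following problem size. The short calculation uses only numbers stated in the text; the variable names are illustrative.

```python
sensors_per_stage = [27, 27, 27, 20]  # tool sensors at the four stages
time_points = 100                     # observations per sensor per stage
n_wafers, n_train = 170, 90

total_sensors = sum(sensors_per_stage)           # 101
inputs_per_wafer = total_sensors * time_points   # 10100
n_test = n_wafers - n_train                      # 80
```

With 10,100 raw inputs per wafer and only 90 training wafers, a model that unfolds all stage data into one vector has far more inputs than samples, which gives context for the difficulty the unfolded baselines face.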

3.2 Results of CNN and model variations

In all the methods, each variable of the raw dataset is normalized. In this work, the structures of CNNs (the standard CNN and the proposed CNN) are selected based on the number of input variables and the performance of the predicted output. Starting with a larger size of kernel in CNNs and then reducing the size, the parameters can be tuned automatically using the gradient descent method. Conventionally, an early stopping scheme (Prechelt, 1998) with a cross-validation method (Kohavi, 1995) can be used to achieve a better generalization performance. This combined method is adopted in the current work to obtain the architectures. The proposed CNN is a three-layer structure, not counting the input, all of which contain trainable parameters, with a fully connected layer as the last layer. The kernel sizes in the convolution layers are set to 90 and 11, and the numbers of kernels are 6 and 10, respectively.
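The reported kernel sizes can be related to the 100 time points per stage with a short calculation. It assumes unpadded convolutions with a stride of one, and that the sizes 90 and 11 belong to the first and second convolution layers in that order; neither assumption is stated in the text, although the earlier remark that the input size is reduced by the convolution layer is consistent with it. The helper `valid_output_length` is hypothetical.

```python
def valid_output_length(input_length, kernel_size, stride=1):
    """Output length of an unpadded 1-D convolution (assumed setting)."""
    if kernel_size > input_length:
        raise ValueError("kernel is longer than the input")
    return (input_length - kernel_size) // stride + 1


length = 100          # time points per stage, from the data description
lengths = [length]
for kernel_size in (90, 11):
    length = valid_output_length(length, kernel_size)
    lengths.append(length)
# lengths == [100, 11, 1]
```

Under these assumptions each feature map shrinks from 100 points to 11 and then to a single value, so every channel of the last convolution layer would contribute one scalar feature to the fully connected layer.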

All the prediction values of GPR are zeros for all the wafers, which is not shown. It is found that for PLS (Figure 5(a)), GPR, and standard CNN (Figure 5(b)), the dimension of input data is too large to effectively tune parameters, leading to poor performance. In addition to these three methods, PCA-PLS (Figure 5(c)) and PCA-GPR (Figure 5(d)), which are used for feature extraction comparison purposes, are also adopted. PCA-PLS and PCA-GPR learn the data features by unfolding all the stage data. In PCA, four components are obtained based on the 99% contribution rate. PLS and GPR are then applied to the features obtained by PCA. The number of components of PLS (both in original PLS and PCA-PLS) is then set as four, the same as the number of principal components of PCA. Furthermore, the hyper-parameters in GPR are tuned using the maximum-likelihood gradient-based method. The predicted result of the novel CNN is illustrated in Figure 5(e), which performs the best where the predictive value matches real data. Clearly, the other five methods do not fit the dataset well.
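The contribution-rate rule for choosing the number of principal components can be sketched as below. The function name `n_components_for` is hypothetical, and the explained variances are made-up numbers chosen only so that the rule returns four; they are not the eigenvalues of the actual dataset.

```python
def n_components_for(explained_variances, threshold=0.99):
    """Smallest number of components whose cumulative contribution reaches the threshold."""
    total = sum(explained_variances)
    cumulative = 0.0
    ordered = sorted(explained_variances, reverse=True)
    for n, variance in enumerate(ordered, start=1):
        cumulative += variance
        if cumulative / total >= threshold:
            return n
    return len(ordered)


# Made-up explained variances (illustration only, not from the paper's data).
variances = [60.0, 25.0, 10.0, 4.5, 0.3, 0.2]
n_components = n_components_for(variances, threshold=0.99)
```

The first three components of this toy example cover 95% of the variance and the fourth brings the total to 99.5%, so four components are kept.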

Table 1 provides a comparison of the performance between CNN and other models for both training and testing data. This is expressed quantitatively using common indices, including the root-mean-square error (RMSE) and the mean absolute percentage error (MAPE), as well as the training time. The expressions of RMSE and MAPE are as follows:

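$$\mathrm{RMSE} = \sqrt{\frac{1}{N_{\mathrm{test}}}\sum_{i=1}^{N_{\mathrm{test}}}\left(y_i-\hat{y}_i\right)^2}, \qquad \mathrm{MAPE} = \frac{1}{N_{\mathrm{test}}}\sum_{i=1}^{N_{\mathrm{test}}}\left|\frac{y_i-\hat{y}_i}{y_i}\right| \qquad (3)$$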

where $y_i$ and $\hat{y}_i$ are, respectively, the actual metrology values and the prediction values, and $N_{\mathrm{test}}$ is the total number of testing wafers. Considering the prediction performance, the traditional modeling methods without a specific feature extraction part or data arrangement (PLS, standard CNN) achieve higher training speed than PCA-PLS, PCA-GPR, and the novel CNN owing to relatively simple matrix calculations; however, the novel CNN outperforms the other methods with much smaller RMSE and MAPE. Although PCA-GPR has a relatively small prediction error as shown in Table 1, it can be seen from Figure 5(d) that the output does not change wafer by wafer, which means that PCA-GPR does not obtain the characteristics inside the wafer. The novel CNN separated all variables and stages and also used weight sharing with several convolution operations for better prediction but required longer training time.
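For reference, the two error indices can be computed as in the following sketch, using the standard definitions of RMSE and MAPE. The function names and the sample values are illustrative and unrelated to the case-study data.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between actual and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction; actual values must be nonzero."""
    n = len(y_true)
    return sum(abs((y - p) / y) for y, p in zip(y_true, y_pred)) / n


# Illustrative values only.
y_true = [2.0, 4.0]
y_pred = [1.0, 5.0]
example_rmse = rmse(y_true, y_pred)   # 1.0
example_mape = mape(y_true, y_pred)   # 0.375
```

A model that predicts nearly the same value for every wafer can still score a small RMSE when the true values vary little, which is why the text also inspects the wafer-by-wafer plots in Figure 5.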

Conclusions

Deep learning has been implemented for feature extraction in multi-stage semiconductor applications. In real manufacturing, there are often several different process tools corresponding to different stages of production, resulting in large-scale datasets to be analyzed. For such a moderately large number of inputs, the traditional machine learning methods cannot easily extract valid information. To overcome this limitation of traditional machine learning algorithms, a novel stage-wise and variable-wise stacked convolving model is proposed for output-related feature extraction as applied to a multi-stage input semiconductor process. In contrast to traditional machine learning models, which usually use the pre-processed part in an unsupervised way or set manifold regularizations, the novel CNN proposed in this work can extract features automatically and handle multi-stage input.

Fig. 4 Multiple reactor system in cluster tool of CVD-like process

The proposed method is demonstrated to be effective through a case study of a real deposition process. For future work, the multi-stage features obtained by leveraging deep learning can also be extended to other issues, such as multi-objective optimization control and energy-saving economic control. Moreover, multiple process data forms such as graphs and monitoring videos can be used, and more intrinsic characteristics can be learned through deep learning methods.

Fig. 5 Regression results of testing data with (a) PLS, (b) standard CNN, (c) PCA-PLS, (d) PCA-GPR, and (e) Novel CNN

Table 1 Comparison of related models

               PLS       Standard CNN   PCA-PLS   PCA-GPR   Novel CNN
RMSE_train     0.1397    2.7001         0.1398    0.0273    0.0165
RMSE_test      0.1397    2.6997         0.1397    0.0251    0.0198
MAPE_train     0.0323    0.6335         0.0322    0.0047    0.0032
MAPE_test      0.0322    0.6335         0.0323    0.0042    0.0037
Training time  1.076 s   40.300 s       187.5 s   217.4 s   339.9 s

Acknowledgement

This research work was partially supported by Western Digital (WD) through the External University Collaboration Project. We would like to express our gratitude for having this opportunity to work with the industry-leading storage company. Special thanks go to the Advanced Process Control group at WD Wafer Head Manufacturing Fab for providing the data studied in this material and for the exchanges of ideas that allowed us to gain insights into the real-world processes. The authors wish to acknowledge the financial support from the National Key R&D Program of China (No. 2018YFB1701102) and the Ministry of Science and Technology, Taiwan, R.O.C. (MOST 109-2221-E-033-013-MY3).

Literature Cited

Ciresan, D. C., U. Meier, J. Masci, L. Maria Gambardella and J. Schmid-huber; “Flexible, High Performance Convolutional Neural Net-works for Image Classi�cation,” �e 22nd International Joint Con-ference on Arti�cial Intelligence, p. 1237, Barcelona, Spain (2011)

Hirai, T. and M. Kano; “Adaptive Virtual Metrology Design for Semi-conductor Dry Etching Process through Locally Weighted Partial Least Squares,” IEEE Trans. Semicond. Manuf., 28, 137–144 (2015)

Io�e, S. and C. Szegedy; “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shi�,” Interna-tional Conference on Machine Learning, pp. 448–456, Lille, France (2015)

Kohavi, R.; “A Study of Coss-Validation and Bootstrap for Accuracy Estimation and Model Selection,” �e 14th International Joint Conference on Arti�cial Intelligence, pp. 1137–1145, Montreal, Canada (1995)

Lecun, Y., L. Bottou, Y. Bengio and P. Ha�ner; “Gradient-Based Learn-ing Applied to Document Recognition,” Proc. IEEE, 86, 2278–2324 (1998)

Lee, D., V. Siu, R. Cruz and C. Yetman; “Convolutional Neural Net and Bearing Fault Analysis,” Proceedings of the International Confer-ence on Data Mining Series (ICDM), pp. 194–200, Barcelona, Spain (2016)

Lee, K. B. and C. O. Kim; “Recurrent Feature-Incorporated Convolu-tional Neural Network for Virtual Metrology of the Chemical Me-chanical Planarization Process,” J. Intell. Manuf., 31, 1–14 (2018)

Lee, S., P. Kang and S. Cho; “Probabilistic Local Reconstruction for k-NN Regression and Its Application to Virtual Metrology in Semiconductor Manufacturing,” Neurocomputing, 131, 427–439 (2014)

Lenz, B. and B. Barak; “Data Mining and Support Vector Regression Machine Learning in Semiconductor Manufacturing to Improve Virtual Metrology,” �e 46th Hawaii International Conference on System Sciences, pp. 3447–3456, Wailea, U.S.A (2013)

Maggipinto, M., C. Masiero, A. Beghi and G. A. Susto; “A Convolutional Autoencoder Approach for Feature Extraction in Virtual Metrol-ogy,” Procedia Manuf., 17, 126–133 (2018a)

Maggipinto, M., M. Terzi, C. Masiero, A. Beghi and G. A. Susto; “A Computer Vision-Inspired Deep Learning Architecture for Vir-tual Metrology Modeling With 2-Dimensional Data,” IEEE Trans. Semicond. Manuf., 31, 376–384 (2018b)

Mikolov, T., M. Kara�át, L. Burget, J. Černocký and S. Khudanpur; “Re-current Neural Network Based Language Model,” Eleventh Annual Conference of the International Speech Communication Associa-tion, Chiba, Japan (2010)

O’Mara, W., R. B. Herring and L. P. Hunt; Handbook of Semiconductor Silicon Technology, William Andrew, New York, U.S.A. (1990)

Prechelt, L.; Early Stopping-But When? Neural Networks: Tricks of the Trade, pp. 55–69, Springer, Berlin, Germany (1998)

Ringwood, J. V., S. Lynn, G. Bacelli, B. Ma, E. Ragnoli and S. McLoone; Estimation and Control in Semiconductor Etch: Practice and Pos-sibilities, IEEE Trans. Semicond. Manuf., 23, 87–98 (2010)

Scott, D. W.; �e Curse of Dimensionality and Dimension Reduction, Multivariate Density Estimation: �eory, Practice, Visualization, pp. 195–217, Wiley, New York, U.S.A. (2008)

Tan, L. K., Y. M. Liew, E. Lim and R. A. McLaughlin; “Cardiac Le� Ven-tricle Segmentation Using Convolutional Neural Network Regres-sion,” 2016 IEEE EMBS Conference on Biomedical (IECBES), pp. 490–493, Kuala Lumpur, Malaysia (2016)

Terzi, M., C. Masiero, A. Beghi, M. Maggipinto and G. A. Susto; “Deep Learning for Virtual Metrology: Modeling with Optical Emission Spectroscopy Data,” 2017 IEEE 3rd International Forum on Re-search and Technologies for Society and Industry (RTSI), pp. 1–6, Modena, Italy (2017)

Tsutsui, T. and T. Matsuzawa; “Virtual Metrology Model Robustness Against Chamber Condition Variation Using Deep Learning,” IEEE Trans. Semicond. Manuf., 32, 428–433 (2019)

Yang, Y., M. Wang and M. J. Kushner; “Progress, Opportunities and Challenges in Modeling of Plasma Etching,” 2008 IEEE Interna-tional Interconnect Technology Conference, pp. 90–92, Burlin-game, U.S.A. (2008)

Zheng, Y., Q. Liu, E. Chen, Y. Ge and J. L. Zhao; “Time Series Classi�ca-tion Using Multi-Channels Deep Convolutional Neural Networks,” 15th International Conference on Web-Age Information Manage-ment (WAIM), pp. 298–310, Macau, China (2014)

