Deep Neural Networks for Acoustic Modelling
Bajibabu Bollepalli
Hieu Nguyen
Rakshith Shetty
Pieter Smit (Mentor)
Introduction
• Automatic speech recognition
[Figure: ASR pipeline — Speech signal → Feature Extraction → Acoustic Modelling → Decoder → Recognized text, with Language Modelling feeding the Decoder]
Introduction
• Acoustic modelling using deep neural networks
[Figure: the same ASR pipeline, here with the Acoustic Modelling block as the focus]
Background
• HMM-GMMs have prevailed in ASR for the last four decades
  • Difficult for any new method to outperform them for acoustic modelling
• Can GMMs capture all the information in acoustic features?
  • No. They are inefficient at modelling data that lie on or near a nonlinear manifold in the data space
• Need for better models
  • Artificial neural networks (ANNs) are known to capture nonlinearities in the data
  • Natural to think of ANNs as an alternative to GMMs
Background
• ANNs are not new to speech recognition
  • Two decades ago, researchers employed ANNs for ASR
  • They were unable to outperform GMMs
  • Hardware and learning algorithms of the time restricted the capacity of ANNs
• Advancements in hardware as well as in machine learning algorithms allow us to train large multilayer (deep) ANNs, called Deep Neural Networks (DNNs)
  • DNNs outperform GMMs (finally ;) )
Deep Neural Networks (DNNs)
• Feed-forward ANNs with more than one hidden layer
Our task
• Frame-based phoneme recognition using simple DNNs
• Experiments with various input features
• Compare the results with GMMs
• Try complex DNNs (if time permits)
  • Deep belief networks (DBNs)
  • Recurrent neural networks (RNNs)
Database
• Training data: 151 Finnish speech sentences (~15 min)
• Development data: 135 sentences (~11 min)
• Evaluation data: 100 sentences (~8 min)
Simple DNN
• Similar to multi-layer perceptron (MLP)
• Hidden Layers: [300, 300]
• Activations: Sigmoid
• Optimization: Stochastic Gradient Descent (SGD)
• Error criteria: Categorical crossentropy
• Software tool: Keras
• Input: MFCC features with 39 dimensions
• Output: 24 Finnish phonemes
• Normalization: Mean-variance
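The project used Keras for this model; as a language-agnostic illustration, the forward pass of the network above (39-dim MFCC input, two 300-unit sigmoid hidden layers, 24-way softmax output) can be sketched in NumPy. The random weights are purely illustrative, standing in for the trained parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

def mlp_forward(x, weights, biases):
    """Forward pass: sigmoid hidden layers, softmax output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(h @ W + b)
    return softmax(h @ weights[-1] + biases[-1])

# 39-dim MFCC frame -> [300, 300] hidden -> 24 phoneme posteriors
rng = np.random.default_rng(0)
dims = [39, 300, 300, 24]
Ws = [rng.normal(0, 0.1, (a, b)) for a, b in zip(dims[:-1], dims[1:])]
bs = [np.zeros(b) for b in dims[1:]]
p = mlp_forward(rng.normal(size=(1, 39)), Ws, bs)  # (1, 24) posterior row
```

Training would then minimize categorical cross-entropy over these posteriors with SGD, as listed above.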
Performance of simple DNN (MLP)
Input feature Frame-wise accuracy (%)
Single frame [t] 63.81
Three frames [t-1, t, t+1] 67.59
Five frames [t-2, t-1, t, t+1, t+2] 67.22
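The multi-frame inputs in the table above are formed by splicing each frame with its neighbours. A small sketch, assuming a (T, D) feature matrix and edge padding by repeating the first/last frame (the padding scheme is an assumption, not stated in the slides):

```python
import numpy as np

def splice_frames(feats, context):
    """Stack each frame with `context` neighbours on each side.
    feats: (T, D) array of per-frame features. Edges are padded by
    repeating the first/last frame."""
    T, D = feats.shape
    padded = np.concatenate([np.repeat(feats[:1], context, axis=0),
                             feats,
                             np.repeat(feats[-1:], context, axis=0)])
    return np.stack([padded[t:t + 2 * context + 1].reshape(-1)
                     for t in range(T)])

x = np.arange(12, dtype=float).reshape(4, 3)  # 4 frames, 3 dims each
y = splice_frames(x, 1)                       # [t-1, t, t+1] -> 9 dims/frame
```

With 39-dim MFCCs, context 1 gives the 117-dim three-frame input and context 2 the 195-dim five-frame input.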
DBN
Deep Belief Network (DBN)
• This network is similar to an MLP, but the weights are pre-trained using a stack of Restricted Boltzmann Machines (RBMs) instead of random initialization alone.
• After the model is pre-trained, the weights are fine-tuned; this step is the same as training a plain MLP.
• Pre-training is unsupervised (it does not use the true target labels): we try to regenerate the input x from the hidden representation induced by x. The knowledge learned is encoded in the values of the weights.
• Fine-tuning is the supervised training step, where we maximize prediction accuracy on the labelled data points.
Restricted Boltzmann Machine (RBM)
• This is a type of generative neural network.
• The idea is to learn an 'energy surface' or 'heat map' over the data, expressed as a probability density.
• Energy: E(v, h) = - sum_i a_i v_i - sum_j b_j h_j - sum_{i,j} v_i w_ij h_j
• Probability density: p(v, h) = exp(-E(v, h)) / Z
• Optimize: d log p(v) / d w_ij = <v_i h_j>_data - <v_i h_j>_model
• Use Gibbs sampling for <.>_model
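The <.>_model term is commonly approximated with one step of Gibbs sampling, i.e. contrastive divergence (CD-1). A minimal NumPy sketch for a binary RBM; the batch handling and use of probabilities in the negative phase are standard choices, not details given in the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, a, b, rng, lr=1e-5):
    """One contrastive-divergence (CD-1) update for a binary RBM.
    v0: (N, n_vis) data batch; W: (n_vis, n_hid); a, b: visible/hidden biases.
    <v h>_model is approximated with a single Gibbs step v0 -> h0 -> v1 -> h1."""
    ph0 = sigmoid(v0 @ W + b)                       # P(h=1 | v0), positive phase
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample hidden states
    pv1 = sigmoid(h0 @ W.T + a)                     # reconstruction of v
    ph1 = sigmoid(pv1 @ W + b)                      # negative-phase hidden probs
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)  # <vh>_data - <vh>_model
    a += lr * (v0 - pv1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b
```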
DBN-pretraining
• Stack of RBMs:
  • Two consecutive layers are trained as an RBM, with the lower layer acting as the visible layer and the upper one as the hidden layer.
  • The process is done bottom-up
  • Iterate for multiple epochs
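The bottom-up stacking above can be sketched as a short loop. `train_rbm` here is a hypothetical helper (e.g. repeated CD-1 updates) assumed to return the learned weights, hidden biases, and the hidden activations that become the next layer's input:

```python
def pretrain_stack(data, layer_sizes, train_rbm):
    """Greedy bottom-up DBN pre-training: train one RBM per layer, then
    feed its hidden activations upward as the next RBM's visible data.
    train_rbm(v, n_hid) -> (W, b_hid, hidden_activations)."""
    weights = []
    v = data
    for n_hid in layer_sizes:
        W, b, v = train_rbm(v, n_hid)  # hidden acts become next layer's input
        weights.append((W, b))
    return weights  # used to initialise the MLP before fine-tuning
```

The returned (W, b) pairs initialise the MLP's hidden layers, after which supervised fine-tuning proceeds as usual.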
Setups
• Using Theano-based tutorial code from deeplearning.net
• Hidden layers use the sigmoid activation function.
• The prediction layer (top layer) is a softmax layer.
• The loss function is categorical cross-entropy.
• The output is either the predicted label (one of 24 phonemes) or the probabilities of the 24 phonemes (the predicted label is the argmax of the probabilities)
• Each input is MFCC features over a context of 3 frames
Experiments
• Pre-training is tricky; after some rough estimates, a pre-training learning rate of 1e-5 was chosen
• Train with and without pre-training to compare
• The number of hidden layers varies from 1 to 3
• The size of each hidden layer varies from 100 to 600 (some pre-trained models of size 500 and 600 were not trained)
• Experiments with some 3-hidden-layer 'hourglass' models did not show real improvement
DBN Results
• The best model is the non-pre-trained 500_500 network; its accuracy on the validation set is 66.82%
• The table shows the prediction accuracy of the trained models on the validation set
Model size    Pre-trained (%)  Iterations  Non-pre-trained (%)  Iterations
100           60.188           48934       60.344               39830
200           61.235           44382       62.792               48934
300           61.387           39830       62.721               39830
400           61.284           42106       63.561               37554
100_100       61.641           48934       62.638               44382
200_200       63.106           47796       64.266               39830
300_300       63.808           46658       64.716               37554
400_400       63.741           51210       64.634               33002
500_500       -                -           66.820               33002
600_600       -                -           65.327               30726
100_100_100   62.237           55762       62.926               46658
200_200_200   63.589           53486       64.19                40968
300_300_300   63.572           44382       63.73                33002
400_400_400   63.106           44382       64.941               35278
Recurrent Networks
Recurrent Neural Networks
• The output of a recurrent network at time t depends on the input at time t as well as the state of the network at time t-1.
• Thus they are ideal for modelling sequences, as time dependencies can be learnt in the recurrent weights
• In the case of phoneme classification it is now easy to include an arbitrary amount of context, i.e. previous frames within a window.
• Infinitely deep in a sense
Our Model
• We use a fixed context size: frames from t-context up to t are fed into the RNN.
• The hidden state of the RNN at time t is then used to predict the class of the frame at time t.
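A minimal NumPy sketch of this setup, using a simple (non-LSTM) recurrent unit; the sigmoid recurrence and the softmax read-out from the final hidden state are illustrative choices consistent with the rest of the slides:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step of a simple recurrent layer:
    h_t = sigmoid(x_t W_xh + h_{t-1} W_hh + b_h)."""
    return 1.0 / (1.0 + np.exp(-(x_t @ W_xh + h_prev @ W_hh + b_h)))

def classify_frame(frames, params, W_hy, b_y):
    """Run the context window (frames t-context ... t) through the RNN,
    then predict the phoneme of frame t from the final hidden state."""
    W_xh, W_hh, b_h = params
    h = np.zeros(W_hh.shape[0])
    for x_t in frames:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
    logits = h @ W_hy + b_y
    e = np.exp(logits - logits.max())
    return e / e.sum()  # 24-way phoneme posterior
```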
Learning in recurrent nets
• We can compute the error at time t (cross-entropy error) and backpropagate the gradients through time, similar to backpropagation in an MLP.
• The problem is that these gradients can die out or blow up if the sequence is very long
• One solution for exploding gradients is to truncate the depth in time through which you propagate
• Another solution is to use more complex recurrent units such as LSTMs
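Besides truncation, a common practical guard against exploding gradients (not explicitly named in the slides, but standard in RNN training) is to clip the gradients by their global norm before the update:

```python
import numpy as np

def clip_by_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays when their global L2 norm
    exceeds max_norm; direction is preserved, magnitude is capped."""
    total = np.sqrt(sum(np.sum(g * g) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads
```

The threshold (5.0 here) is a tunable hyperparameter.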
LSTM Cell
• Consists of a memory unit and 3 gates
• Each gate is affected by the current input and the previous output state of the cell.
• The 3 gates control data flow into the memory, retention of the memory, and activation of the output from the cell.
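The three gates can be written out as one fused matrix multiply per step; a NumPy sketch of a standard LSTM cell (the gate ordering and the absence of peephole connections are assumptions, one common formulation among several):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (n_in, 4n), U: (n, 4n), b: (4n,).
    Assumed gate order: input i, forget f, output o, candidate g."""
    n = h_prev.shape[0]
    z = x_t @ W + h_prev @ U + b
    i, f, o, g = z[:n], z[n:2*n], z[2*n:3*n], z[3*n:]
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)  # retain + write memory
    h = sigmoid(o) * np.tanh(c)                        # gated output activation
    return h, c
```

The forget gate f controls retention, the input gate i controls writes to the memory c, and the output gate o controls what the cell emits.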
Learning Details and Regularization
• We use the RMSprop learning algorithm: a form of gradient descent where the learning rate is automatically scaled by the RMS value of the most recent gradients
• Regularize using dropout: for each training sample some units are randomly switched off. This forces each unit to learn something useful and not co-depend too much
• Dropout is applied only in the embedding and output layers; it is a bad idea to apply it to the recurrent connections.
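Both techniques fit in a few lines each. A NumPy sketch; the hyperparameter values are illustrative defaults, and the inverted-dropout rescaling (so no change is needed at test time) is an assumed implementation detail:

```python
import numpy as np

def rmsprop_update(param, grad, cache, lr=1e-3, decay=0.9, eps=1e-8):
    """RMSprop: divide the step by the RMS of recent gradients,
    tracked per-parameter in `cache`."""
    cache = decay * cache + (1 - decay) * grad ** 2
    param = param - lr * grad / (np.sqrt(cache) + eps)
    return param, cache

def dropout(h, p, rng):
    """Inverted dropout: zero each unit with probability p and rescale
    the survivors by 1/(1-p) so expected activations are unchanged."""
    mask = (rng.random(h.shape) >= p) / (1.0 - p)
    return h * mask
```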
Results with RNNs - Accuracies

• Varying network size (LSTM, context 10, dropout 0.3):
  Size of network    50      100     200
  Accuracy on Eval   67.79   68.11   67.76

• Varying context window (LSTM, 200 units, dropout 0.3):
  Context window     5       10      20
  Accuracy on Eval   68.11   67.76   68.76

• Varying dropout (LSTM, context 10, 200 units):
  Dropout prob       0.0     0.3     0.5     0.7
  Accuracy on Eval   66.47   67.76   68.21   68.19

• Varying unit type (context 10, 200 units, dropout 0.3):
  Type of unit       simple  lstm
  Accuracy on Eval   66.43   67.76
Summary Results: All Models

Model             MLP    DBN    RNN
Accuracy on Eval  67.59  66.82  68.76
Source code is available on GitHub :
https://github.com/rakshithShetty/dnn-speech
References
• Abdel-rahman Mohamed, George E. Dahl, and Geoffrey Hinton. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, Volume 20, Issue 1, 2012.
• Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. Speech recognition with deep recurrent neural networks. arXiv, abs/1303.5778, 2013.
• Some figures are taken from Prof. Juha Karhunen's slides for the course Machine Learning and Neural Networks.
• The DBN implementation code is taken and modified from the tutorial on deeplearning.net.
Questions?