
Automatic Speech Recognition (ASR) intro

Sokolov Artem

Motivation

• Streaming speech

• Voice commands/search

• Key Word Spotting (KWS)

Neighboring technologies:

• TTS

• NLP/NLU

Families

• Hybrid (DNN+HMM)

• E2E

Conventional ASR pipeline

1. AM: MFCC features → observation sequence o3 o7 o7 o1 o9 o9 o9 o5

2. HMM: o3 o7 o7 o1 o9 o9 o9 o5 → k ae t

3. PM (Lexicon): k ae t → cat

4. LM: the cat → the cat

[Diagram: Feature Extraction → Acoustic Model → Pronunciation Model → Language Model → Text]

ASR basics

- A phoneme is the minimal unit of speech

- Biphones and triphones (senones) are context-dependent phonemes

- A frame window is ~10-50 ms, with overlap

- Commonly used features (a minimal extraction sketch follows this list):

  - MFCC
  - PLP
  - LPCC
  - FBANK
  - MELSPEC
  - ETSI-AFE
  - PNCC
  - …
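As an illustration, here is a minimal sketch of frame-based MFCC/FBANK extraction in Python with librosa; the 16 kHz sine wave standing in for real speech, and the 25 ms window / 10 ms shift values, are assumptions for the example.

# A minimal sketch of frame-based feature extraction with librosa.
# The sine wave stands in for real speech; framing follows the
# conventional ~25 ms window / 10 ms shift setup.
import numpy as np
import librosa

sr = 16000
y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)

n_fft = int(0.025 * sr)      # 25 ms analysis window
hop = int(0.010 * sr)        # 10 ms shift -> overlapping frames

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=n_fft, hop_length=hop)
fbank = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40,
                                       n_fft=n_fft, hop_length=hop)

print(mfcc.shape, fbank.shape)   # (13, n_frames), (40, n_frames)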

Conventional ASR pipeline

• All components are trained separately.

• The Language Model carries prior information about the language.

• The Pronunciation Model (lexicon) defines the phoneme set for the particular language and maps words to phoneme sequences.

[Diagram: Feature Extraction → Acoustic Model → Pronunciation Model → Language Model → Text]

Acoustic and pronunciation models

PM example (lexicon entries):

LAY    L EY
PLACE  P L EY S
SET    S EH T
RED    R EH D
GREEN  G R IY N
BLUE   B L UW
WHITE  W AY T

TDNN/BLSTM-based models are the SOTA for conventional (hybrid) systems.

Image credit: Park H. et al. A Fast-Converged Acoustic Modeling for Korean Speech Recognition: A Preliminary Study on Time Delay Neural Network // arXiv preprint arXiv:1807.05855. – 2018.
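A lexicon like the one above is essentially a word-to-phonemes map. A minimal sketch in Python, with entries taken from the table (the helper to_phonemes is an illustration, not part of any framework):

# A minimal sketch of a pronunciation model as a word -> phonemes map,
# using the lexicon entries from the slide above.
LEXICON = {
    "LAY": ["L", "EY"],
    "PLACE": ["P", "L", "EY", "S"],
    "SET": ["S", "EH", "T"],
    "RED": ["R", "EH", "D"],
    "GREEN": ["G", "R", "IY", "N"],
    "BLUE": ["B", "L", "UW"],
    "WHITE": ["W", "AY", "T"],
}

def to_phonemes(words):
    """Map a word sequence to its phoneme sequence via the lexicon."""
    return [p for w in words for p in LEXICON[w.upper()]]

print(to_phonemes(["place", "red"]))  # ['P', 'L', 'EY', 'S', 'R', 'EH', 'D']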

Language model

N-grams, Weighted Finite State Transducers (WFST) • [Mohri et al.] M. Mohri, F. Pereira, M. Riley, “Weighted Finite-State Transducers in Speech

Recognition”

• Most of ASR frameworks

RNN • [Mikolov et al., 2010] T. Mikolov, M. Karafiat, L. Burget, “Recurrent neural network based

language model”

• [Kannan et al., 2017] A. Kannan, Y. Wu, P. Nguyen, “An analysis of incorporating an

external language model into a sequence-to-sequence model”

• ESPnet, EESEN end-to-end ASR frameworks

Transformers • [Li et al., 2019] J. Li, V. Lavrukhin, B. Ginsburg, “Jasper: An End-to-End Convolutional

Neural Acoustic Model”, April 2019
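For intuition, a minimal sketch of an n-gram (bigram) LM with add-one smoothing; the toy corpus is an assumption for the example, and real LMs are trained on large text collections.

# A minimal sketch of a bigram language model with add-one smoothing.
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)                       # vocabulary size

def p(word, prev):
    """P(word | prev) with add-one (Laplace) smoothing."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(p("cat", "the"))                  # seen bigram -> higher probability
print(p("mat", "cat"))                  # unseen bigram -> small smoothed probability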

Weighted Finite State Transducers

[Diagrams over the phrase "HUAWEI is AI / HUAWEI is cool":

Finite State Acceptor: states accepting the word sequences "HUAWEI is AI" and "HUAWEI is cool"

Weighted Finite State Acceptor: the same, with arc weights AI/0.3 and cool/0.7

Finite State Transducer: letter-to-word arcs, e.g. H:HUAWEI, U:-, A:-, W:-, E:-, I:-, i:is, s:-, A:AI, c:cool

Weighted Finite State Transducer: the same, with weighted arcs A:AI/0.3 and c:cool/0.7]
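A minimal sketch of the last diagram as code: a deterministic WFST stored as a dict of arcs, transducing letters to words while accumulating arc weights. The state numbering and the helper transduce are assumptions for the example; real systems use toolkits such as OpenFst.

# A minimal sketch of a weighted finite state transducer as a dict of arcs:
# (state, input symbol) -> (output label, weight, next state).
# Arcs mirror the slide's letters-to-words example; "" is an epsilon output.
arcs = {
    (0, "H"): ("HUAWEI", 0.0, 1), (1, "U"): ("", 0.0, 2),
    (2, "A"): ("", 0.0, 3),       (3, "W"): ("", 0.0, 4),
    (4, "E"): ("", 0.0, 5),       (5, "I"): ("", 0.0, 6),
    (6, "i"): ("is", 0.0, 7),     (7, "s"): ("", 0.0, 8),
    (8, "c"): ("cool", 0.7, 9),   (8, "A"): ("AI", 0.3, 12),
    (9, "o"): ("", 0.0, 10),      (10, "o"): ("", 0.0, 11),
    (11, "l"): ("", 0.0, 13),     (12, "I"): ("", 0.0, 13),
}
FINAL = {13}

def transduce(inp, state=0):
    """Follow the (deterministic) arcs, collecting outputs and total weight."""
    out, weight = [], 0.0
    for sym in inp:
        if (state, sym) not in arcs:
            return None                       # no arc: input rejected
        label, w, state = arcs[(state, sym)]
        if label:
            out.append(label)
        weight += w
    return (out, weight) if state in FINAL else None

print(transduce("HUAWEIiscool"))  # (['HUAWEI', 'is', 'cool'], 0.7)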

Accuracy measure

- Commonly used accuracy measure for spontaneous speech is Word Error Rate (WER)

- Relative number of correct utterances (for commands)

S - substitutions (wrongly recognised commands)

N - total number decoded commands
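A minimal sketch of WER as word-level edit distance (the standard Levenshtein dynamic program); the example strings are hypothetical.

# A minimal sketch of WER via word-level edit distance (Levenshtein DP).
def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i reference and j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                          # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                          # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the cat sad"))     # 1 substitution / 3 words ~ 0.33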

Datasets*. Data augmentation

Corpus       Description                            Size
Librispeech  Audio books                            ~1000h
WSJ          Readings of Wall Street Journal texts  80h
Fisher       Telephone speech                       2000h
Switchboard  Telephone speech                       300h

* https://www.ldc.upenn.edu/
** https://ai.googleblog.com/2019/04/specaugment-new-data-augmentation.html

Amodei D. et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin // International Conference on Machine Learning. – 2016. – pp. 173-182.

Augmentation (a SpecAugment-style sketch follows below):

• Speed change

• Natural and synthetic noise

• SpecAugment**

• …
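A minimal SpecAugment-style sketch in numpy: one random frequency mask and one random time mask applied to a (mel bins x frames) spectrogram. The mask widths and spectrogram shape are hypothetical hyperparameters, not values from the paper.

# A minimal SpecAugment-style sketch: zero out a random band of mel bins
# (frequency mask) and a random span of frames (time mask).
import numpy as np

def spec_augment(spec, max_f=8, max_t=20, rng=None):
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    n_mels, n_frames = spec.shape
    f = int(rng.integers(0, max_f + 1))          # frequency-mask width
    f0 = int(rng.integers(0, n_mels - f + 1))
    spec[f0:f0 + f, :] = 0.0
    t = int(rng.integers(0, max_t + 1))          # time-mask width
    t0 = int(rng.integers(0, n_frames - t + 1))
    spec[:, t0:t0 + t] = 0.0
    return spec

augmented = spec_augment(np.random.rand(40, 300))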

Frameworks & Vendors

Frameworks:

• Kaldi

• TF

• Pytorch

Vendors:

• Google

• Amazon

• Facebook

• Microsoft

• Baidu

• Yandex

• Speech Technology Center

• Nuance

• …

E2E models

A system which directly maps a sequence of input acoustic features into a sequence of graphemes or words.

Image credits: http://ericbattenberg.com/

Amodei D. et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin // International Conference on Machine Learning. – 2016. – pp. 173-182.

End-To-End Models Overview

AM                          Submission  Architecture                               N of layers  LibriSpeech test-clean WER%
DeepSpeech (Baidu/Mozilla)  2014        3 FC + BiLSTM + FC → CTC                   5            5.66
DeepSpeech2                 2015        2 CNN + 7 BiRNN + 2 FC → CTC               11           5.33
Wav2Letter                  2016        12x CNN → ASG/CTC                          12           7.2
ESPnet                      2018        CNN (VGG) + N BiLSTM → CTC/CTC+Attention   ~16+8        4.0
Wav2Letter+                 2018        Residual CNN + 19x CNN → CTC               19           3.44
Jasper                      2019        Residual Nx(CNN+BN+ReLU+Dropout) → CTC     54           2.95
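Most models in the table are trained with the CTC objective. A minimal training-step sketch in PyTorch; the shapes, the 29-class grapheme inventory (28 symbols + blank), and all values are assumptions for the example.

# A minimal sketch of a CTC training step with torch.nn.CTCLoss.
import torch

T, N, C, S = 50, 2, 29, 10               # frames, batch, classes, target length
logits = torch.randn(T, N, C, requires_grad=True)
log_probs = logits.log_softmax(dim=2)    # AM output: per-frame class log-probs

targets = torch.randint(1, C, (N, S))    # grapheme ids; 0 is reserved for blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc = torch.nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                          # gradients flow back into the AM
print(loss.item())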

Research developments on end-to-end models towards productionisation

• Attention – pushing the limits of attention-based end-to-end models

• Online models – streaming models for real-world applications:

  • RNN-T

  • Neural Transducers

Questions?

GNA

Hardware acceleration. GNA

Intel® GNA – Gaussian Mixture Models and Neural Networks Accelerator.

GNA is an ASIC block in Intel® CPUs:

- Gemini Lake
- Cannon Lake
- Ice Lake
- Elkhart Lake

The GNA driver and library provide the API:

- 1.0 Gold
- 2.0 Pre-Alpha

The IE GNA plugin provides compatibility with popular frameworks (Kaldi, TF):

- Gold????

Stack: IE GNA plugin + lib + HW

GNA input: feature vectors
GNA output: senone likelihoods (in lattices)

Image credit: https://www.techrepublic.com/article/how-we-learned-to-talk-to-computers/

GNA library and driver

- Two versions of the GNA library exist; GNA 2.0 is in pre-alpha status

- Several acceleration modes are supported (see the loading sketch below):

  - HW
  - SW generic
  - SW specific (SSE4_2, AVX1, AVX2)
  - SW exact

- The model is loaded into pinned RAM

Image credit: GNA 2.0 API official documentation
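As an illustration of choosing among these modes, a sketch using the OpenVINO Inference Engine Python API; the model file names are hypothetical, and the config key/values follow the GNA plugin documentation but may differ between releases.

# A sketch of loading a network on GNA via the OpenVINO Inference Engine
# Python API. Model file names are hypothetical.
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="asr_model.xml", weights="asr_model.bin")

# GNA_DEVICE_MODE selects between HW and software-emulation modes
# (e.g. GNA_HW, GNA_SW_EXACT, GNA_AUTO).
exec_net = ie.load_network(net, device_name="GNA",
                           config={"GNA_DEVICE_MODE": "GNA_AUTO"})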

GNA plugin routine

• Convolutional layer

• Recurrent layer

• Diagonal layer

• Affine layer

• Copy layer

• PWL

Supported layers:

Activation-Clamp
Activation-Leaky ReLU
Activation-ReLU
Activation-Sigmoid/Logistic
Activation-TanH
Concat
Convolution-Ordinary
Eltwise-Mul
Eltwise-Sum
FullyConnected, Diagonal
Memory
Permute
Pooling (AVG, MAX)
Power
Split/Slice
Reshape
ROIPooling
ScaleShift

Supported precisions

Level                    GNA Plugin
Model format             FP32, I16
Computational precision  FP32*, I16, I8
Output                   FP32, I16

Quantisation. Scaling. Mixed precision for weights and biases

Quantisation is a hint to the GNA plugin regarding the preferred target weight resolution for all layers.

Quantisation modes:

• static

• dynamic (not implemented, but potentially supported)

• user-defined

[Diagram: fp32 model → I16 or I8 weights; the I8 path uses compound biases]

Compound biases (I8):

typedef struct
{
    int32_t bias;        /* bias value */
    uint8_t multiplier;  /* weight scale multiplier for the row */
    char reserved[3];    /* padding */
} intel_compound_bias_t;
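To make the scaling idea concrete, a small numpy illustration (not the actual GNA implementation): fp32 weight rows are quantised to int8 with one multiplier per output row, the role played by the multiplier field above.

# A numpy illustration of static quantisation with a per-row scale.
import numpy as np

def quantize_rows(w_fp32):
    scales = 127.0 / np.abs(w_fp32).max(axis=1)            # one scale per row
    w_i8 = np.round(w_fp32 * scales[:, None]).clip(-128, 127).astype(np.int8)
    return w_i8, scales

w = np.random.randn(4, 8).astype(np.float32)
w_i8, scales = quantize_rows(w)
w_restored = w_i8 / scales[:, None]                        # dequantised view
print(np.abs(w - w_restored).max())                        # quantisation error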


Tests

Engineering:

- Unit tests (functions, model parts; i8, i16, fp32, + fp16 input)

- Functional tests (models, compared with MKLDNN on i8, i16 precisions)

- Behaviour tests (cases where failures are expected)

QA:

- WER

- E2E

- Sample

Additionally: we have layer-dumping and weight-similarity measurement mechanisms for debugging purposes.

IE GNA plugin vs IE MKLDNN plugin I

[Chart: inference time in seconds (0-13.75 s) on rm_lstm, 10 utterances; GNA HW i8 (single), GNA SW i8 (single), MKLDNN fp32 (multi)]

Measured on an Intel® Pentium® Silver J5005 CPU @ 1.50GHz (GeminiLake).

The rm_lstm model:

1. LSTMProjectedStream with 43 inputs, 512 cells, and 200 outputs

2. LSTMProjectedStream with 200 inputs, 512 cells, and 200 outputs

3. AffineTransform with 200 inputs and 1494 outputs

4. Softmax

The network outputs are senone class likelihoods.

The model is provided with class counts so that the softmax layer may be removed (with no drop in accuracy).

IE GNA plugin vs IE MKLDNN plugin

[Chart: normalised per-frame inference time and correlation (0-1) for DNN, CNN and LSTM models; MKLDNN fp32 (multi) vs GNA i8 SW (multi*)]

Measured on an Intel® Core™ i7-6770K.

Single utterance with 8192 frames, batch_size=1 for CNN and RNN; 262144 frames, batch_size=8 for DNN.

* New changes supporting async requests in the GNA plugin. Not released yet.

Multithreading specifics: MKLDNN uses all physical cores (4); GNA runs 8 async requests in parallel.

Questions?