8/13/2019 Parallel Structure of Decoder in Automatic Speech Recognition
1/24
PARALLEL STRUCTURE OF
DECODER IN AUTOMATIC SPEECH
RECOGNITION SYSTEM
STUDENT: VO QUOC VIET
SUPERVISORS: DR. DANG TRONG TRINH
DR. HOANG TRANG
CONTENTS
1. Introduction
2. Literature review
3. Methodology and System description
4. Design specification
5. Implementation
6. Test plan
7. Discussion
8. Conclusion and future work
1. INTRODUCTION
Image sources: http://zagg-blog.s3.amazonaws.com/community/blog/wp-content/uploads/2012/11/Screen-Shot-2012-11-28-at-11.19.31-AM.png, http://www.bmwblog.com/wp-content/uploads/Siri_VoiceControl1.png, http://www.blogcdn.com/www.engadget.com/media/2012/04/lg-voice-control.jpg
Automation control
Smart interaction TV
Voice control in smartphones
Smartphones need to connect to a server via the Internet
Some applications do not need a very large vocabulary but require short processing time or real-time control
Some portable devices need small, low-power hardware
=> A parallel decoder structure is needed
2. LITERATURE REVIEW
Pipeline structure of decoder [3]
The controller is very complicated
2. LITERATURE REVIEW
Parallel structure of decoder [2]
Processing elements consume many resources
3. METHODOLOGY AND
SYSTEM DESCRIPTION
Diagram of decoder [1]
[Figure: the Data RAM controller (DRC, 12 x 26) takes model data from the FLASH RAM and input data from the RAM; calculation units CU1, CU2, ..., CU7, CU8 form the Gaussian Calculation Unit (GCU), which outputs Log(b_j(O_t)) to Log(b_j(O_t+12)); the Viterbi searching block, with registers REG, REGTMP 1, REGTMP 2 and Final Reg, produces the Result]
Gaussian Calculation Unit (GCU) => output probability calculation
Data RAM controller (DRC)
Pipeline Viterbi searching
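The decoder splits recognition into output probability calculation (GCU) and Viterbi searching, and the result is the index of the best-matching word model. The flow can be sketched in floating-point software as below. This is a minimal sketch, not the fixed-point hardware datapath; it assumes one diagonal Gaussian per state (the hardware supports up to 8 mixture components), a left-to-right HMM that starts in its first state and ends in its last, and hypothetical function names.

```python
import math

def log_gaussian(o, mean, var):
    """ln N(o; mean, diag(var)) for one state (single mixture component assumed)."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - mu) ** 2 / v)
               for x, mu, v in zip(o, mean, var))

def viterbi_log_score(frames, means, vars_, log_a):
    """Best-path log score of one word model, ending in its last state."""
    n = len(means)
    # First frame: the path must start in state 0 (left-to-right assumption).
    delta = [float("-inf")] * n
    delta[0] = log_gaussian(frames[0], means[0], vars_[0])
    for o in frames[1:]:
        # Partial probability: best predecessor plus output probability.
        delta = [max(delta[i] + log_a[i][j] for i in range(n))
                 + log_gaussian(o, means[j], vars_[j])
                 for j in range(n)]
    return delta[-1]

def recognize(frames, models):
    """Return the index of the model (means, vars, log_a) with the best score."""
    scores = [viterbi_log_score(frames, *m) for m in models]
    return max(range(len(models)), key=lambda k: scores[k])

# Hypothetical two-state, one-dimensional word models.
LOG_HALF = math.log(0.5)
LEFT_TO_RIGHT = [[LOG_HALF, LOG_HALF], [float("-inf"), 0.0]]
model_a = ([[0.0], [5.0]], [[1.0], [1.0]], LEFT_TO_RIGHT)
model_b = ([[10.0], [20.0]], [[1.0], [1.0]], LEFT_TO_RIGHT)
print(recognize([[0.0], [0.2], [5.1]], [model_a, model_b]))  # prints 0
```

Disallowed transitions are stored directly as `-inf` log probabilities, so the `max` simply ignores them without a special case.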
4. DESIGN SPECIFICATION
Factor | Specification
Technology | 90 nm
Vdd | 2.5 V / 3 V
Power consumption | 1 mW
Area | 100 nm²
Recognition accuracy | 85% for 50 words
Frequency | 100 MHz
Number of transistors | 50,000
Maximum number of states | 16
Maximum number of mixture components | 8
Maximum number of parallel calculation units | 8
Maximum decoding time for one word | About 0.221 s
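The frequency and decoding-time targets together fix a clock-cycle budget per word, which is what the parallel calculation units must fit into. The arithmetic is sketched below, assuming the 1 mW figure applies for the whole decoding time (the table does not say whether it is average or peak power).

```python
CLOCK_HZ = 100e6        # Frequency: 100 MHz
DECODE_TIME_S = 0.221   # Maximum decoding time for one word
POWER_W = 1e-3          # Power consumption: 1 mW (assumed constant while decoding)

# Clock cycles available to decode one word.
cycles_per_word = round(CLOCK_HZ * DECODE_TIME_S)
# Energy per word under the constant-power assumption.
energy_per_word_j = POWER_W * DECODE_TIME_S

print(f"{cycles_per_word:,} cycles, {energy_per_word_j * 1e3:.3f} mJ per word")
# prints "22,100,000 cycles, 0.221 mJ per word"
```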
5. IMPLEMENTATION
Software system
[Figure: software system flow with the blocks Speech, MFCC extraction, Training and creating model, Check model, Convert model and FLASH memory, and the Software ASR blocks TEST, GENERATE Test file.txt and Recognition]
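The "Convert model" step turns trained floating-point HMM parameters into words for the FLASH memory. The slides do not state the number format, so the sketch below assumes signed 16-bit two's-complement fixed point (the hardware RAMs are 16 bits wide) with a hypothetical number of fractional bits, saturating rather than wrapping on overflow. All function names are hypothetical.

```python
def to_fixed16(value, frac_bits):
    """Quantize a float to a raw 16-bit two's-complement word (assumed Q format)."""
    scaled = round(value * (1 << frac_bits))  # note: Python rounds halves to even
    scaled = max(-32768, min(32767, scaled))  # saturate to the signed 16-bit range
    return scaled & 0xFFFF                    # raw word as it would sit in memory

def from_fixed16(word, frac_bits):
    """Interpret a raw 16-bit word as a signed fixed-point value."""
    signed = word - 0x10000 if word & 0x8000 else word
    return signed / (1 << frac_bits)

def convert_model(matrix, frac_bits=8):
    """Flatten a states x coefficients matrix (e.g. means) row by row into words."""
    return [to_fixed16(x, frac_bits) for row in matrix for x in row]

def to_hex_lines(words):
    """Format words as 4-digit hex lines, one per memory location."""
    return [f"{w:04X}" for w in words]

print(to_hex_lines(convert_model([[1.5, -1.0]])))  # prints ['0180', 'FF00']
```

Saturation keeps an out-of-range parameter at the largest representable value instead of flipping its sign, which would otherwise corrupt the likelihood silently.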
5. IMPLEMENTATION
Software system simulation result with 400 voice samples and 20 models; each model corresponds to one word
[Figure: recognition results, word models 1 to 20 on both axes]
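A 20 x 20 result grid like this is typically read as a confusion matrix, with correct recognitions on the diagonal. The sketch below shows how overall and per-word accuracy would be computed from such a matrix; it assumes rows are the spoken word and columns the recognized model (the figure's axis assignment is not stated), and the small matrix in the example is made up for illustration, not the slide's data.

```python
def confusion_matrix(pairs, n_models):
    """Count (spoken word, recognized model) pairs, indices 0..n_models-1."""
    matrix = [[0] * n_models for _ in range(n_models)]
    for spoken, recognized in pairs:
        matrix[spoken][recognized] += 1
    return matrix

def overall_accuracy(matrix):
    """Fraction of samples on the diagonal (recognized as the spoken word)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def per_word_accuracy(matrix):
    """Accuracy for each spoken word; 0.0 for words with no samples."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]

# Made-up example: 2 words, 20 samples each.
pairs = [(0, 0)] * 18 + [(0, 1)] * 2 + [(1, 1)] * 19 + [(1, 0)]
m = confusion_matrix(pairs, 2)
print(m, overall_accuracy(m))  # prints [[18, 2], [1, 19]] 0.925
```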
5. IMPLEMENTATION
Hardware design
Data RAM controller
[Figure: Data RAM controller (DRC, 8 x 26) made of an MFCC RAM controller and a Model RAM controller. The MFCC RAM controller (16 x 26 x 16 b) reads MFCC vectors O_1 to O_16 from the RAM for MFCC vectors and exports them on output ports O_mfcc_data1 to O_mfcc_data8; the Model RAM controller reads the Mean RAM (8 x 26 x 16 b) and the Sigma RAM (8 x 26 x 16 b) from the FLASH RAM for the model. Signals: i_mfcc_data, O_mfcc_addr, mfcc_addr, mem_mfcc_finish, i_model_data, O_model_addr, model_addr, model_finish, control signal]
5. IMPLEMENTATION
Hardware design
Data RAM controller => Write operation
[Figure: write-operation waveform with test data words 5555, E38E, F0F0 and 0F0F (hexadecimal)]
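Data words such as 5555, F0F0 and 0F0F toggle neighbouring bits and nibbles in opposite directions, so writing them and reading them back exposes bits that are stuck or wired wrongly. The sketch below is a software model of that write/read-back check, not the RTL; the `RegisterBank` class, its fault-injection fields and the depth of 26 are hypothetical.

```python
PATTERNS = (0x5555, 0xE38E, 0xF0F0, 0x0F0F)  # data words from the write waveform

class RegisterBank:
    """Toy 16-bit register bank with optional stuck-at bits (hypothetical)."""
    def __init__(self, depth, stuck_mask=0, stuck_value=0):
        self.regs = [0] * depth
        self.stuck_mask = stuck_mask    # bits forced to a fixed value
        self.stuck_value = stuck_value  # value of those forced bits

    def write(self, addr, data):
        data &= 0xFFFF
        self.regs[addr] = (data & ~self.stuck_mask) | (self.stuck_value & self.stuck_mask)

    def read(self, addr):
        return self.regs[addr]

def pattern_test(bank):
    """Write each pattern to every address, read back, return mismatches."""
    errors = []
    for pattern in PATTERNS:
        for addr in range(len(bank.regs)):
            bank.write(addr, pattern)
        for addr in range(len(bank.regs)):
            got = bank.read(addr)
            if got != pattern:
                errors.append((addr, pattern, got))
    return errors

print(pattern_test(RegisterBank(26)))  # prints []
```

With bit 0 stuck at 0, only 5555 and 0F0F (which have bit 0 set) fail, which is why several complementary patterns are needed rather than one.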
5. IMPLEMENTATION
Hardware design
Data RAM controller => Read operation
5. IMPLEMENTATION
Calculation unit
[Figure: calculation unit CU1. Inputs: frame x_t, model parameters and a control signal. Datapath: 16-bit adder, 16-bit multiplier, 26-bit multiplier, 52-bit adder and register. Outputs: Log{b_j(o_t)} and an overflow flag]
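The calculation unit chains an adder, two multipliers and an accumulating adder, with an overflow output. One plausible reading of that datapath is sketched below as an integer model that checks each intermediate value against its register width. The mapping of operations to units, the widths of the intermediate values and the encoding of the per-coefficient weight (standing in for a fixed-point 1/(2σ²)) are all assumptions, not the author's design.

```python
DIFF_BITS, SQUARE_BITS, ACC_BITS = 16, 26, 52  # assumed from the slide labels

def fits_signed(value, bits):
    """True if value fits a signed two's-complement number of `bits` bits."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))

def cu_log_prob(frame, means, weights, const):
    """Integer model of one CU: const - sum_k weights[k] * (frame[k] - means[k])**2.

    Returns (log_b, overflow); overflow is raised whenever an intermediate
    value leaves its assumed register width."""
    overflow = False
    acc = 0
    for x, mu, w in zip(frame, means, weights):
        diff = x - mu                                # 16-bit adder (as subtractor)
        overflow |= not fits_signed(diff, DIFF_BITS)
        square = diff * diff                         # 16-bit multiplier
        overflow |= square >= (1 << SQUARE_BITS)     # unsigned 26-bit result assumed
        acc += square * w                            # 26-bit multiplier, 52-bit adder
        overflow |= not fits_signed(acc, ACC_BITS)
    return const - acc, overflow

print(cu_log_prob([3, 1], [1, 1], [2, 5], 100))  # prints (92, False)
```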
5. IMPLEMENTATION
Gaussian Calculation Unit and Viterbi searching modules
[Figure: the GCU, made of calculation units CU1, CU2, ..., CU7, CU8, receives frames O_t, O_t+1, ..., O_t+6, O_t+7 and a control signal and outputs Log(b_j(O_t)), Log(b_j(O_t+1)), ..., Log(b_j(O_t+7)) to the Viterbi searching module, whose registers REG, REGTMP 1, REGTMP 2 and Final Reg produce the Result]
6. TEST PLAN
Module | Input | Output | Method
Data RAM Controller (DRC) | Data from text file | Input data stored in internal register banks and exported to eight 16-bit output ports | Built-in self-test
Completed system with Log-add | Data from text file generated by Matlab | Index of the recognized model | Result compared with the value from Matlab
Completed system without Log-add | Data from text file generated by Matlab | Index of the recognized model | Result compared with the value from Matlab
FPGA test | Model parameters and data of all feature vectors from Matlab | Result displayed on a 7-segment LED | Result compared with the value from Matlab
7. DISCUSSION AND TIMELINE
Characteristics of English:
Suffixes such as -s, -ed, -t or -d
One word may have several syllables
=> Word boundaries are difficult to detect
8. CONCLUSION AND
FUTURE WORK
Software system => the parallel Viterbi algorithm was implemented successfully for offline recognition
The first two hardware sub-modules were verified successfully
Future work
Improve the word detection function
Implement and verify the GCU and Viterbi searching sub-modules
THANK YOU
REFERENCES
[1] Nakamura K., Shimazaki R., Yamamoto M., Takagi K. and Takagi N., "A VLSI architecture for output probability computations of HMM-based recognition systems," in VLSI, Rijeka, Croatia: InTech, 2010, pp. 274-284.
[2] Yoshizawa S., Wada N., Hayasaka N. and Miyanaga Y., "Scalable architecture for word HMM-based speech recognition and VLSI implementation in complete system," Circuits and Systems, vol. 53, no. 1, pp. 70-77, 2006.
[2] Wei H., Cheong F. C., Chiu S. C. and Kong P. P., "A speech recognizer with selectable model parameters," in ISCAS, 2005.
[3] Bok-Gue P., Koon-Shik C. and Jun-dong C., "Low power VLSI architecture of Viterbi scorer for HMM-based isolated word recognition," in Quality Electronic Design, 2002.
2. LITERATURE REVIEW
ASR built as an embedded system
3. SYSTEM DESCRIPTION
a. Computing flow in Viterbi algorithm [1]
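Partial probability (1):
$$\tilde{\delta}_t(j) = \max_{1 \le i \le N}\left[\tilde{\delta}_{t-1}(i) + \ln a_{ij}\right] + \ln b_j(o_t), \qquad 2 \le t \le T,\ 1 \le j \le N$$
Output probability (2), with M mixture components:
$$b_j(o_t) = \sum_{m=1}^{M} C_{jm}\, e^{X_{jm}}, \qquad \ln\left(e^{X_{\max}} + e^{X_{\min}}\right) = X_{\max} + \ln\left(1 + e^{X_{\min} - X_{\max}}\right)$$
BFPP: Block Frame Parallel Processing [4]
Conventional computing flow: 2-1-2-1-2-1-2-1...
BFPP computing flow: {2-2-2-2}-1-1-1-1-{2-2-2-2-2}-1-1-1-1-1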
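BFPP works because the output probabilities (2) of different frames do not depend on each other, so a block of them can be computed together (for example on parallel CUs) before the sequential partial-probability updates (1). The sketch below compares the two orderings in floating point and shows they give identical scores. It assumes one diagonal Gaussian per state, a left-to-right model starting in state 0, and hypothetical function names; the real design works in fixed point.

```python
import math

def output_log_probs(frame, means, variances):
    """Procedure (2): ln b_j(o_t) for every state j (one Gaussian per state)."""
    return [sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
                for x, m, v in zip(frame, mu, var))
            for mu, var in zip(means, variances)]

def update_partial(delta, log_b, log_a):
    """Procedure (1): delta_t(j) = max_i [delta_{t-1}(i) + ln a_ij] + ln b_j(o_t)."""
    n = len(delta)
    return [max(delta[i] + log_a[i][j] for i in range(n)) + log_b[j]
            for j in range(n)]

def initial_delta(log_b0):
    """Paths start in state 0 (left-to-right assumption)."""
    return [log_b0[0]] + [float("-inf")] * (len(log_b0) - 1)

def decode_serial(frames, means, variances, log_a):
    """Conventional flow 2-1-2-1-...: one output probability, then one update."""
    delta = initial_delta(output_log_probs(frames[0], means, variances))
    for o in frames[1:]:
        delta = update_partial(delta, output_log_probs(o, means, variances), log_a)
    return delta

def decode_bfpp(frames, means, variances, log_a, block=4):
    """BFPP flow {2-2-2-2}-1-1-1-1-...: a block of independent output
    probabilities first, then the sequential updates for that block."""
    delta = initial_delta(output_log_probs(frames[0], means, variances))
    rest = frames[1:]
    for start in range(0, len(rest), block):
        chunk = rest[start:start + block]
        log_bs = [output_log_probs(o, means, variances) for o in chunk]  # 2-2-2-2
        for log_b in log_bs:                                             # 1-1-1-1
            delta = update_partial(delta, log_b, log_a)
    return delta
```

Only the order of computation changes, so the result is bit-for-bit identical; the gain is that the model parameters loaded for one state can be reused across a whole block of frames.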
3. SYSTEM DESCRIPTION
Flow chart and diagram of entire decoder
[Figure: flow chart of the decoder with nested Loops A to D over models v, frames t (advanced by t = t + M), states j and parallel units p; blocks Procedure 1 and Procedure 2; outputs Log b_j(O_t), Log b_j(O_t+1), ..., Log b_j(O_t+M-1); initialization RegAjj = 0, Regtmp = 0, Regfinal = 0; final assignment Regfinal[j] = RegAjj[N]. Beside it, the decoder diagram with the DRC (12 x 26), FLASH RAM, RAM, the GCU (CU1 to CU8, output probability) and Viterbi searching (partial probability) producing the Result]
5. IMPLEMENTATION
Fail to recognize
Recognize successfully
5. IMPLEMENTATION
Hardware design
Data RAM controller
Test sequence: register banks 0 & 1, 2 & 3, 4 & 5 and 6 & 7 tested with W_en = R_en = 1, then register banks 0 & 1 tested with W_en = 0, R_en = 1
[Figure: RAM controller testbench. Test_case.txt and Model.out are loaded into a virtual RAM (Store, Address); the testbench drives ADDR, input data, control signals and W_en/R_en; a comparator checks the output data against the expected data and drives Error_signal]
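The testbench loads test cases, drives writes and reads through the RAM controller, and compares the output data with the expected data to produce Error_signal. A software analogue is sketched below. The "ADDR DATA" hexadecimal line format of the test file and the function names are hypothetical, the device under test is passed in as plain write/read callables, and each address is assumed to appear only once per test file.

```python
def parse_test_cases(lines):
    """Parse hypothetical 'ADDR DATA' hex lines; blank and '#' lines are skipped."""
    cases = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        addr, data = line.split()
        cases.append((int(addr, 16), int(data, 16)))
    return cases

def run_testbench(cases, dut_write, dut_read):
    """Write phase (W_en), then read phase (R_en); one error flag per case,
    playing the role of the comparator's Error_signal."""
    for addr, data in cases:
        dut_write(addr, data)
    return [dut_read(addr) != expected for addr, expected in cases]

# Usage with a dict standing in for the virtual RAM.
ram = {}
cases = parse_test_cases(["# addr data", "00 5555", "01 E38E", "", "02 F0F0"])
print(run_testbench(cases, ram.__setitem__, ram.__getitem__))  # prints [False, False, False]
```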