End-to-End Text Recognition with Convolutional Neural Networks Tao Wang, David J. Wu, Adam Coates,...

End-to-End Text Recognition with Convolutional Neural Networks

Tao Wang*, David J. Wu*, Adam Coates, Andrew Y. Ng

Computer Science Department

Stanford University

* Denotes equal contribution

Scene Text Recognition Overview

• Text “in the wild” is hard to recognize

• Wide range of variations in backgrounds, textures, fonts, and lighting conditions

Street View Text Dataset (K. Wang et al., 2011)

ICDAR 2003 Dataset (S. Lucas et al., 2003)

Two-Stage Framework

Detection/Classification → High-level Inference → “HOTEL”

Works                     Classification and detection    High-level inference
Neumann and Matas, 2012   MSER + SVM with RBF kernel      Exhaustive graph search
Mishra et al., 2012       HOG + SVM with RBF kernel       CRF + N-gram model
K. Wang et al., 2011      HOG + Random Ferns              Pictorial structure
Weinman et al., 2008      Appearance + Geometry           Semi-Markov CRF

                        Classification and detection                         High-level inference
Our approach            Learnt features + 2-layer CNN                        Simple off-the-shelf heuristics
Most other approaches   Hand-designed features + off-the-shelf classifier    Graph based inference models

Various Benchmarks

Detection/Classification:

• ICDAR 62-way cropped character classification (SOTA)

End-to-end system after high-level inference, with lexicon:

• ICDAR and SVT cropped word recognition (SOTA on ICDAR)

• ICDAR and SVT end-to-end text recognition (SOTA)

Unsupervised Feature Learning

Contrast Normalization + ZCA whitening

K-Means

Coates et al., 2011
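The pipeline named on this slide (contrast normalization, ZCA whitening, then K-means) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the patch size, the regularization constants, the random stand-in patches, and the use of plain K-means are all assumptions, and the K-means variant in the cited work may differ.

```python
import numpy as np

def normalize_patches(X, eps=10.0):
    # Per-patch brightness and contrast normalization.
    # X has shape (num_patches, patch_dim); eps is an assumed regularizer.
    X = X - X.mean(axis=1, keepdims=True)
    return X / np.sqrt(X.var(axis=1, keepdims=True) + eps)

def zca_whiten(X, eps=0.1):
    # ZCA whitening: rotate into the eigenbasis of the covariance,
    # rescale each direction, and rotate back.
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / Xc.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ W, mean, W

def kmeans_filters(X, k, iters=10, seed=0):
    # Plain K-means; the learned centroids play the role of first-layer filters.
    rng = np.random.default_rng(seed)
    D = X[rng.choice(X.shape[0], size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every patch to every centroid.
        d2 = (X ** 2).sum(axis=1, keepdims=True) - 2 * X @ D.T + (D ** 2).sum(axis=1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                D[j] = members.mean(axis=0)
    return D

# Random data standing in for 8x8 grayscale patches (hypothetical input).
rng = np.random.default_rng(0)
patches = rng.random((500, 64)) * 255.0
Xn = normalize_patches(patches)
Xw, mean, W = zca_whiten(Xn)
filters = kmeans_filters(Xw, k=16)
```

In practice the patches would be sampled from training images, and the same `mean` and `W` would be reused to whiten patches at test time.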

Convolution → Spatial Pooling → Convolution → Spatial Pooling → L2-SVM Classifier

Output: √ Text / × Non-Text

Backpropagation

Large representation but not enough data. Overfitting?

Layer sizes: 96 (1st layer), 256 (2nd layer)

~10K parameters for detection

~50K parameters for classification
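A forward pass through a two-layer network of this shape might look like the sketch below. Only the layer sizes 96 and 256 come from the slide. The 32x32 input, the filter sizes, the pooling sizes, the rectifier activation, and the random weights are assumptions chosen to keep the example small, and the helper names are hypothetical. The L2-SVM is represented by its squared hinge loss.

```python
import numpy as np

def conv_relu(x, filters):
    # Valid cross-correlation followed by a rectifier (assumed activation).
    # x: (C, H, W), filters: (K, C, fh, fw) -> output: (K, H-fh+1, W-fw+1)
    K, C, fh, fw = filters.shape
    _, H, W = x.shape
    out = np.zeros((K, H - fh + 1, W - fw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = x[:, i:i + fh, j:j + fw]
            out[:, i, j] = np.tensordot(filters, patch, axes=([1, 2, 3], [0, 1, 2]))
    return np.maximum(out, 0.0)

def avg_pool(x, size):
    # Non-overlapping average pooling over size x size blocks.
    K, H, W = x.shape
    Hp, Wp = H // size, W // size
    x = x[:, :Hp * size, :Wp * size]
    return x.reshape(K, Hp, size, Wp, size).mean(axis=(2, 4))

def l2_svm_loss(score, label):
    # Squared hinge loss for a label in {+1, -1}.
    return max(0.0, 1.0 - label * score) ** 2

rng = np.random.default_rng(0)
image = rng.random((1, 32, 32))                          # hypothetical grayscale window
filters1 = rng.standard_normal((96, 1, 8, 8)) * 0.1      # 96 first-layer filters
filters2 = rng.standard_normal((256, 96, 2, 2)) * 0.1    # 256 second-layer filters

h1 = avg_pool(conv_relu(image, filters1), 5)   # shape (96, 5, 5)
h2 = avg_pool(conv_relu(h1, filters2), 2)      # shape (256, 2, 2)
features = h2.ravel()

w = rng.standard_normal(features.size) * 0.01  # linear text / non-text classifier
score = float(features @ w)
loss = l2_svm_loss(score, +1)
```

Training would backpropagate the gradient of this loss through both layers, which the sketch leaves out.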

Synthetic Data

Color Statistics

Synthetic “hard negatives”

[Figure labels: Real, Synthetic; Unrealistic Synthetic Data, Real Data]

Java.Font + Natural backgrounds

Detector Performance

Text Line Bounding boxes

Candidate spaces

Classifier Performance

62-way classification accuracy on ICDAR cropped characters. Accuracy (%), higher is better.

Yokobayashi et al., 2006    81.4
Coates et al., 2011         81.7
K. Wang et al., 2011        64
Our approach                83.9
Human                       89

(on ICDAR-Sample characters)

[Figure: character classifier responses, with char class on the vertical axis and sliding window position on the horizontal axis]

Word Recognition

Lexicon: …, MAKE, SERIES, ESTATE, POKER, …

Word scores (max ∑ of character scores), in the order shown: -5.45, 7.82, -1.74, -9.02

Example alignment: S E R I E S
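One way to read the “max ∑” score is as the best left-to-right alignment of a lexicon word against the matrix of character scores per sliding window position. The sketch below implements that reading with dynamic programming. It is a simplified assumption rather than the authors' exact procedure: `word_score`, `best_word`, and the toy score matrix are hypothetical, and no spacing constraints or length normalization are applied.

```python
def word_score(scores, classes, word):
    # scores[c][t] is the classifier score for class c at window position t.
    # Returns the maximum, over strictly increasing window positions
    # (one per character), of the summed character scores.
    if not word:
        raise ValueError("word must be non-empty")
    row = {ch: scores[i] for i, ch in enumerate(classes)}
    n = len(scores[0])
    neg_inf = float("-inf")
    prev = None  # prev[t]: best total with the previous character at position t
    for ch in word:
        cur = [neg_inf] * n
        # best_before tracks the best of prev[0..t-1]; the first character is free.
        best_before = 0.0 if prev is None else neg_inf
        for t in range(n):
            if prev is not None and t > 0:
                best_before = max(best_before, prev[t - 1])
            if best_before > neg_inf:
                cur[t] = best_before + row[ch][t]
        prev = cur
    return max(prev)

def best_word(scores, classes, lexicon):
    # Pick the lexicon word with the highest alignment score.
    return max(lexicon, key=lambda w: word_score(scores, classes, w))

# Toy example: two classes, three window positions (made-up numbers).
classes = ["A", "B"]
scores = [
    [1.0, -1.0, 0.5],   # scores for "A" at positions 0, 1, 2
    [-0.5, 2.0, -1.0],  # scores for "B" at positions 0, 1, 2
]
lexicon = ["AB", "BA", "ABA"]
winner = best_word(scores, classes, lexicon)
```

Because the raw sum favors longer words whenever scores are positive, a real system would likely normalize or restrict candidate positions; that choice is left open here.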

Cropped Word Recognition Accuracy

Cropped words benchmarks. Accuracy (%), higher is better.

                        ICDAR-WD-50    ICDAR-WD-FULL    SVT-WD
K. Wang et al., 2011    76             62               57
Mishra et al., 2012     82             -                73
Our approach            90             84               70

…

Candidate spaces generated by detector

[Equation lost in extraction; surviving fragments: S, max( ) over j, M, Seg]

End-to-end text recognition results

End-to-end benchmarks. F-score, higher is better.

                        ICDAR-5    ICDAR-20    ICDAR-50    ICDAR-FULL    SVT
K. Wang et al., 2011    0.72       0.70        0.68        0.51          0.38
Our approach            0.76       0.74        0.72        0.67          0.46

Sample Output Images from SVT

Sample Output Images from ICDAR-FULL

“Confidence margin” [equation lost in extraction; surviving fragments: max( ), max({ \ }), n, c, m]

Hunspell → Suggested Words → LEXICON

Example words shown: PEOSTEL, PEOST, POST, POS, POSE, POST, PEOPLE, PISTOL, …

Our F-score: 0.38

Neumann and Matas, 2010: 0.40
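The idea of growing a lexicon from spelling suggestions can be illustrated with the sketch below. Note the substitution: Python's standard `difflib.get_close_matches` stands in for Hunspell here, so the suggestions will not match what Hunspell would return. The function name `build_lexicon`, the small dictionary, and the cutoff are assumptions for illustration only.

```python
import difflib

def build_lexicon(raw_strings, dictionary, per_word=3, cutoff=0.6):
    # For each raw recognized string, collect the closest dictionary words
    # and merge them into one lexicon, preserving first-seen order.
    lexicon = []
    for s in raw_strings:
        for w in difflib.get_close_matches(s, dictionary, n=per_word, cutoff=cutoff):
            if w not in lexicon:
                lexicon.append(w)
    return lexicon

# Hypothetical dictionary and raw recognizer output.
dictionary = ["POST", "POSE", "PEOPLE", "PISTOL", "HOTEL"]
lexicon = build_lexicon(["PEOST"], dictionary)
```

The resulting lexicon could then be passed to the same lexicon-driven word recognition step used elsewhere in the system.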

Conclusion

• Learnt features + 2-layer CNN for character detection and classification

• Simple heuristics to build an end-to-end scene text recognition system

• State-of-the-art performance on:

- ICDAR cropped character classification
- ICDAR cropped word recognition
- Lexicon based end-to-end recognition on ICDAR and SVT

• Extensible to more general lexicon with off-the-shelf spelling checker

Date post:	16-Jan-2016