Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition
Max Jaderberg, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman Visual Geometry Group, Department of Engineering Science, University of Oxford, UK
1. OVERVIEW

Text recognition in natural scene images. Contributions:
• A synthetic data engine to generate unlimited training data.
• Three deep convolutional neural network (CNN) architectures for holistic image classification.
• A resulting set of state-of-the-art reading systems in language-constrained and unconstrained scenarios.
2. SYNTHETIC DATA ENGINE
Existing scene text datasets are very small and cover a small number of words, so we use a synthetic data engine to generate training samples. Fonts are selected from 1400 Google Fonts; projective distortion, elastic distortion, and noise are applied; and random crops of natural images are alpha-blended with the image layers to generate realistic texture and lighting.
1. Font rendering
2. Border/shadow & colour
3. Composition
4. Projective distortion
5. Natural image blending
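The compositing stages above can be sketched in NumPy. This is a minimal illustration of steps 2, 3, and 5 plus the additive noise, under stated assumptions: font rendering and projective distortion are omitted, and `synth_sample`, its parameters, and all constants are illustrative rather than the engine's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def synth_sample(glyph_mask, natural_crop):
    """Compose a synthetic word image from a binary glyph mask (H x W)
    and a random natural-image crop (H x W), both floats in [0, 1]."""
    h, w = glyph_mask.shape
    # 2. Border/shadow & colour: shift the mask to fake a drop shadow,
    #    then give text, shadow, and background random grey levels.
    shadow = np.roll(glyph_mask, shift=(2, 2), axis=(0, 1))
    text_col, shadow_col, bg_col = rng.uniform(0, 1, size=3)
    # 3. Composition: background layer, then shadow layer, then text layer.
    img = np.full((h, w), bg_col)
    img = np.where(shadow > 0.5, shadow_col, img)
    img = np.where(glyph_mask > 0.5, text_col, img)
    # 5. Natural image blending: alpha-blend with a natural-image crop
    #    for texture and lighting, then add pixel noise.
    alpha = rng.uniform(0.6, 0.9)
    img = alpha * img + (1 - alpha) * natural_crop
    img = img + rng.normal(0, 0.02, size=img.shape)
    return np.clip(img, 0.0, 1.0)
```

In the real engine each stage's parameters (font, border width, colours, distortion, blend weights) are randomised per sample, which is what makes the training data effectively unlimited.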
Dataset available!
• 9 million word images
• Covering 90k English words
• Download at: www.robots.ox.ac.uk/~vgg/data/text/
3. MODELS

DICTIONARY ENCODING (DICT)
w* = argmax_{w ∈ W} P(w|x) P(w|L)

P(w|x) is the visual model (the CNN) and P(w|L) is the language model, estimated from a corpus of movie subtitles; w* is the predicted word.

Multi-class classification, one class for each word in the dictionary W (constrained language model).
The number of classes can be scaled to 90k, but this requires incremental training: initialize learning with 5k classes, then incrementally increase the number of classes as training progresses.
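The DICT decoding rule w* = argmax_w P(w|x) P(w|L) can be sketched in a few lines. The `dict_decode` helper and all scores below are hypothetical, standing in for the CNN softmax outputs and the subtitle-corpus language model.

```python
def dict_decode(visual_scores, lm_scores):
    """Pick w* = argmax over the dictionary of P(w|x) * P(w|L).

    visual_scores: {word: P(w|x)} from the visual model (CNN softmax).
    lm_scores:     {word: P(w|L)} from the language model.
    """
    return max(visual_scores,
               key=lambda w: visual_scores[w] * lm_scores.get(w, 0.0))

# Hypothetical scores for illustration only.
visual = {"spires": 0.40, "spines": 0.45, "shines": 0.15}
lm     = {"spires": 0.30, "spines": 0.05, "shines": 0.65}
print(dict_decode(visual, lm))  # → "spires" (0.40 * 0.30 = 0.12 beats the rest)
```

Note how the language model re-ranks the visual model's top guess: "spines" has the highest P(w|x) but a low corpus probability, so "spires" wins the product.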
CHARACTER SEQUENCE ENCODING (CHAR)
Single CNN with multiple independent classifiers, inspired by Goodfellow et al., ICLR '14. Each classifier predicts the character at one position of the word.
No language model, suitable for unconstrained recognition.
c*_i = argmax_{c_i ∈ C ∪ {∅}} P(c_i | φ(x))

w* = c*_1 c*_2 … c*_N, keeping only those c*_i ∉ {∅}

Here ∅ is a null (no-character) class, and φ(x) are the CNN features shared by all position classifiers.
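The CHAR decoding step (per-position argmax, then dropping null characters) can be sketched as follows. `char_decode` and the example distributions are illustrative assumptions, not the model's actual interface.

```python
def char_decode(position_probs, null="∅"):
    """Decode a word from per-position character distributions:
    take the argmax at each position, then drop null characters.

    position_probs: list of {char: P(c_i | phi(x))} dicts, one per
    position, up to the maximum word length the CNN supports.
    """
    chars = [max(p, key=p.get) for p in position_probs]
    return "".join(c for c in chars if c != null)

# Hypothetical 5-position output for a 4-letter word.
probs = [
    {"s": 0.9, "o": 0.1},
    {"p": 0.8, "b": 0.2},
    {"o": 0.7, "a": 0.3},
    {"t": 0.6, "l": 0.4},
    {"∅": 0.95, "s": 0.05},  # null class: no character at position 5
]
print(char_decode(probs))  # → "spot"
```

Because the positions are decoded independently with no language model, this encoding handles unconstrained recognition but cannot correct a wrong per-position argmax (compare the "limmy"/"organaation" failure cases in the qualitative results).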
BAG OF N-GRAMS ENCODING (NGRAM)
Represent a string as a bag-of-N-grams. E.g. G(spires) = {s, p, i, r, e, s, sp, pi, ir, re, es, spi, pir, ire, res, spire, pires}
The CNN visually models the 10k most common 1-, 2-, 3-, and 4-grams with 10k independent binary classifiers; the result is an N-gram detection vector.
Two ways to recover words:
• Find the nearest neighbour of the output among the ideal outputs of dictionary words.
• Train a linear SVM for each dictionary word, using training-data outputs.
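The encoding and the nearest-neighbour recovery option can be sketched as below. The bag is represented as a set here, matching the binary detection vector; `ngrams`, `nearest_word`, and the set-overlap (Jaccard) similarity are illustrative assumptions, not the poster's exact matching function.

```python
def ngrams(word, max_n=4):
    """N-grams of word: all contiguous substrings of length 1..max_n."""
    return {word[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(word) - n + 1)}

def nearest_word(detections, dictionary, max_n=4):
    """Recover a word by matching the detected N-gram set against the
    ideal N-gram set of each dictionary word (overlap / union as a
    simple similarity; the per-word-SVM alternative is not shown)."""
    def score(w):
        ideal = ngrams(w, max_n)
        return len(detections & ideal) / len(detections | ideal)
    return max(dictionary, key=score)

# Hypothetical detection output: only short N-grams fired.
detected = {"s", "p", "i", "r", "e", "sp", "pi", "ir", "re", "es"}
print(nearest_word(detected, ["spires", "shines", "sprite"]))  # → "spires"
```

Even a partial, noisy detection vector usually overlaps far more with the correct word's ideal N-gram set than with any other dictionary word's, which is why this redundant encoding is robust.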
5. EVALUATION

Recognition accuracy (%):

Model               | IC03-50 | IC03-Full | SVT-50 | SVT  | IC13 | IIIT5k-50 | IIIT5k-1k
--------------------|---------|-----------|--------|------|------|-----------|----------
Baseline (ABBYY)    | 56.0    | 55.0      | 35.0   | -    | -    | 24.3      | -
Wang, ICCV '11      | 76.0    | 62.0      | 57.0   | -    | -    | -         | -
Bissacco, ICCV '13  | -       | -         | 90.4   | 78.0 | 87.6 | -         | -
Yao, CVPR '14       | 88.5    | 80.3      | 75.9   | -    | -    | 80.2      | 69.3
Jaderberg, ECCV '14 | 96.2    | 91.5      | 86.1   | -    | -    | -         | -
Gordo, arXiv '14    | -       | -         | 90.7   | -    | -    | 93.3      | 86.6
DICT-IC03-Full      | 99.2    | 98.1      | -      | -    | -    | -         | -
DICT-SVT-Full       | -       | -         | 96.1   | 87.0 | -    | -         | -
DICT+2-90k          | 98.7    | 98.6      | 95.4   | 80.7 | 90.8 | 97.1      | 92.7
CHAR+2              | 96.7    | 94.0      | 92.6   | 68.0 | 79.5 | 95.5      | 85.4
NGRAM+2-SVM         | 96.5    | 94.0      | -      | -    | -    | -         | -
Evaluation is performed on the standard benchmarks IC03, IC13, SVT, and IIIT5k, as well as on the Synth dataset over all 90k words. Suffixes denote lexicon sizes (e.g. IC03-50 uses a 50-word lexicon per image).
[Figure: precision-recall curve for N-gram recognition.]

[Figure: synthetic data contributions. Recognition accuracy (%) of DICT-IC03-Full and DICT-SVT-Full as rendering stages are added: (a) B&W, (b) +fonts, (c) +colour, (d) +perspective, (e) +noise, (f) +image blending.]
4. EXPERIMENTAL SETUP

Model          | # training words | # parameters | Synth accuracy
---------------|------------------|--------------|---------------
DICT-IC03-Full | 563              | 108M         | 98.7%
DICT-SVT-Full  | 4282             | 123M         | 98.7%
DICT-90k       | 90k              | 450M         | 90.3%
DICT+2-90k     | 90k              | 480M         | 95.2%
CHAR           | 90k              | 109M         | 71.0%
CHAR+2         | 90k              | 127M         | 86.2%
NGRAM-NN       | 90k              | 145M         | 25.1%
NGRAM+2-NN     | 90k              | 163M         | 27.9%
[Figure: qualitative results. Detected N-grams with DICT and CHAR predictions:
• z, zz, izz → DICT: pizza, CHAR: pizz
• i, n, y, im, ji, mm, my, imm, imn, lim, mim, mmy, tim, immi → DICT: jimmy, CHAR: limmy
• a, n, o, t, at, io, on, ti, za, ati, ion, iza, tio, zat, tion, atio, izat, zati → DICT: organization, CHAR: organaation
• DICT: western, CHAR: western]
www.robots.ox.ac.uk/~vgg/data/text/