Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition
Max Jaderberg, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman Visual Geometry Group, Department of Engineering Science, University of Oxford, UK
1. OVERVIEW

Text recognition in natural scene images. Contributions:
• A synthetic data engine to generate unlimited training data.
• Three deep convolutional neural network (CNN) architectures for holistic image classification.
• A resulting set of state-of-the-art reading systems in language-constrained and unconstrained scenarios.
2. SYNTHETIC DATA ENGINE
Existing scene text datasets are very small and cover a small number of words, so we use a synthetic data engine to generate training samples. Fonts are selected from 1400 Google Fonts; projective distortion, elastic distortion, and noise are applied; and random crops of natural images are alpha-blended with the image layers to generate realistic texture and lighting.
1. Font rendering
2. Border/shadow & colour
3. Composition
4. Projective distortion
5. Natural image blending
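The compositing stages above can be sketched in NumPy. This is a minimal illustration of steps 2, 3, and 5 plus the additive noise, under stated assumptions: font rendering and projective distortion are omitted, and `synth_sample`, its parameters, and all constants are illustrative rather than the engine's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def synth_sample(glyph_mask, natural_crop):
    """Compose a synthetic word image from a binary glyph mask (H x W)
    and a random natural-image crop (H x W), both floats in [0, 1]."""
    h, w = glyph_mask.shape
    # 2. Border/shadow & colour: shift the mask to fake a drop shadow,
    #    then give text, shadow, and background random grey levels.
    shadow = np.roll(glyph_mask, shift=(2, 2), axis=(0, 1))
    text_col, shadow_col, bg_col = rng.uniform(0, 1, size=3)
    # 3. Composition: background layer, then shadow layer, then text layer.
    img = np.full((h, w), bg_col)
    img = np.where(shadow > 0.5, shadow_col, img)
    img = np.where(glyph_mask > 0.5, text_col, img)
    # 5. Natural image blending: alpha-blend with a natural-image crop
    #    for texture and lighting, then add pixel noise.
    alpha = rng.uniform(0.6, 0.9)
    img = alpha * img + (1 - alpha) * natural_crop
    img = img + rng.normal(0, 0.02, size=img.shape)
    return np.clip(img, 0.0, 1.0)
```

In the real engine each stage's parameters (font, border width, colours, distortion, blend weights) are randomised per sample, which is what makes the training data effectively unlimited.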
Dataset available!
• 9 million word images
• Covering 90k English words
• Download at: www.robots.ox.ac.uk/~vgg/data/text/
3. MODELS

DICTIONARY ENCODING (DICT)
w* = argmax_{w ∈ W} P(w|x) P(w|L)

P(w|x) is the visual model (the CNN) and P(w|L) is the language model, estimated from a corpus of movie subtitles; w* is the predicted word.

Multi-class classification, one class for each word in the dictionary W (constrained language model).
The number of classes can be scaled to 90k, but this requires incremental training: initialize learning with 5k classes, then incrementally increase the number of classes as training progresses.
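The DICT decoding rule w* = argmax_w P(w|x) P(w|L) can be sketched in a few lines. The `dict_decode` helper and all scores below are hypothetical, standing in for the CNN softmax outputs and the subtitle-corpus language model.

```python
def dict_decode(visual_scores, lm_scores):
    """Pick w* = argmax over the dictionary of P(w|x) * P(w|L).

    visual_scores: {word: P(w|x)} from the visual model (CNN softmax).
    lm_scores:     {word: P(w|L)} from the language model.
    """
    return max(visual_scores,
               key=lambda w: visual_scores[w] * lm_scores.get(w, 0.0))

# Hypothetical scores for illustration only.
visual = {"spires": 0.40, "spines": 0.45, "shines": 0.15}
lm     = {"spires": 0.30, "spines": 0.05, "shines": 0.65}
print(dict_decode(visual, lm))  # → "spires" (0.40 * 0.30 = 0.12 beats the rest)
```

Note how the language model re-ranks the visual model's top guess: "spines" has the highest P(w|x) but a low corpus probability, so "spires" wins the product.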
CHARACTER SEQUENCE ENCODING (CHAR)
Single CNN with multiple independent classifiers, inspired by Goodfellow et al., ICLR '14. Each classifier predicts the character at one position of the word.
No language model, suitable for unconstrained recognition.
c*_i = argmax_{c_i ∈ C ∪ {∅}} P(c_i | φ(x))

w* = c*_1 c*_2 … c*_N, keeping only those c*_i ∉ {∅}

Here ∅ is a null (no-character) class, and φ(x) are the CNN features shared by all position classifiers.
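The CHAR decoding step (per-position argmax, then dropping null characters) can be sketched as follows. `char_decode` and the example distributions are illustrative assumptions, not the model's actual interface.

```python
def char_decode(position_probs, null="∅"):
    """Decode a word from per-position character distributions:
    take the argmax at each position, then drop null characters.

    position_probs: list of {char: P(c_i | phi(x))} dicts, one per
    position, up to the maximum word length the CNN supports.
    """
    chars = [max(p, key=p.get) for p in position_probs]
    return "".join(c for c in chars if c != null)

# Hypothetical 5-position output for a 4-letter word.
probs = [
    {"s": 0.9, "o": 0.1},
    {"p": 0.8, "b": 0.2},
    {"o": 0.7, "a": 0.3},
    {"t": 0.6, "l": 0.4},
    {"∅": 0.95, "s": 0.05},  # null class: no character at position 5
]
print(char_decode(probs))  # → "spot"
```

Because the positions are decoded independently with no language model, this encoding handles unconstrained recognition but cannot correct a wrong per-position argmax (compare the "limmy"/"organaation" failure cases in the qualitative results).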
BAG OF N-GRAMS ENCODING (NGRAM)
Represent a string as a bag-of-N-grams. E.g. G(spires) = {s, p, i, r, e, s, sp, pi, ir, re, es, spi, pir, ire, res, spire, pires}
The CNN visually models the 10k most common 1-, 2-, 3-, and 4-grams with 10k independent binary classifiers; the result is an N-gram detection vector.
Two ways to recover words:
• Find the nearest neighbour of the output among the ideal outputs of dictionary words.
• Train a linear SVM for each dictionary word, using training-data outputs.
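The encoding and the nearest-neighbour recovery option can be sketched as below. The bag is represented as a set here, matching the binary detection vector; `ngrams`, `nearest_word`, and the set-overlap (Jaccard) similarity are illustrative assumptions, not the poster's exact matching function.

```python
def ngrams(word, max_n=4):
    """N-grams of word: all contiguous substrings of length 1..max_n."""
    return {word[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(word) - n + 1)}

def nearest_word(detections, dictionary, max_n=4):
    """Recover a word by matching the detected N-gram set against the
    ideal N-gram set of each dictionary word (overlap / union as a
    simple similarity; the per-word-SVM alternative is not shown)."""
    def score(w):
        ideal = ngrams(w, max_n)
        return len(detections & ideal) / len(detections | ideal)
    return max(dictionary, key=score)

# Hypothetical detection output: only short N-grams fired.
detected = {"s", "p", "i", "r", "e", "sp", "pi", "ir", "re", "es"}
print(nearest_word(detected, ["spires", "shines", "sprite"]))  # → "spires"
```

Even a partial, noisy detection vector usually overlaps far more with the correct word's ideal N-gram set than with any other dictionary word's, which is why this redundant encoding is robust.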
5. EVALUATION

Recognition accuracy (%):

Model               | IC03-50 | IC03-Full | SVT-50 | SVT  | IC13 | IIIT5k-50 | IIIT5k-1k
--------------------|---------|-----------|--------|------|------|-----------|----------
Baseline (ABBYY)    | 56.0    | 55.0      | 35.0   | -    | -    | 24.3      | -
Wang, ICCV '11      | 76.0    | 62.0      | 57.0   | -    | -    | -         | -
Bissacco, ICCV '13  | -       | -         | 90.4   | 78.0 | 87.6 | -         | -
Yao, CVPR '14       | 88.5    | 80.3      | 75.9   | -    | -    | 80.2      | 69.3
Jaderberg, ECCV '14 | 96.2    | 91.5      | 86.1   | -    | -    | -         | -
Gordo, arXiv '14    | -       | -         | 90.7   | -    | -    | 93.3      | 86.6
DICT-IC03-Full      | 99.2    | 98.1      | -      | -    | -    | -         | -
DICT-SVT-Full       | -       | -         | 96.1   | 87.0 | -    | -         | -
DICT+2-90k          | 98.7    | 98.6      | 95.4   | 80.7 | 90.8 | 97.1      | 92.7
CHAR+2              | 96.7    | 94.0      | 92.6   | 68.0 | 79.5 | 95.5      | 85.4
NGRAM+2-SVM         | 96.5    | 94.0      | -      | -    | -    | -         | -
Evaluation is performed on the standard benchmarks IC03, IC13, SVT, and IIIT5k, as well as on the Synth dataset over all 90k words. Suffixes denote lexicon sizes (e.g. IC03-50 uses a 50-word lexicon per image).
[Figure: precision-recall curve for N-gram recognition.]

[Figure: synthetic data contributions. Recognition accuracy (%) of DICT-IC03-Full and DICT-SVT-Full as rendering stages are added: (a) B&W, (b) +fonts, (c) +colour, (d) +perspective, (e) +noise, (f) +image blending.]
4. EXPERIMENTAL SETUP

Model          | # training words | # parameters | Synth accuracy
---------------|------------------|--------------|---------------
DICT-IC03-Full | 563              | 108M         | 98.7%
DICT-SVT-Full  | 4282             | 123M         | 98.7%
DICT-90k       | 90k              | 450M         | 90.3%
DICT+2-90k     | 90k              | 480M         | 95.2%
CHAR           | 90k              | 109M         | 71.0%
CHAR+2         | 90k              | 127M         | 86.2%
NGRAM-NN       | 90k              | 145M         | 25.1%
NGRAM+2-NN     | 90k              | 163M         | 27.9%
[Figure: qualitative results. Detected N-grams with DICT and CHAR predictions:
• z, zz, izz → DICT: pizza, CHAR: pizz
• i, n, y, im, ji, mm, my, imm, imn, lim, mim, mmy, tim, immi → DICT: jimmy, CHAR: limmy
• a, n, o, t, at, io, on, ti, za, ati, ion, iza, tio, zat, tion, atio, izat, zati → DICT: organization, CHAR: organaation
• DICT: western, CHAR: western]
www.robots.ox.ac.uk/~vgg/data/text/