

Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition

Max Jaderberg, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman Visual Geometry Group, Department of Engineering Science, University of Oxford, UK

1. OVERVIEW

Text recognition in natural scene images.

Contributions
• A synthetic data engine to generate unlimited training data.
• Three deep convolutional neural network (CNN) architectures for holistic image classification.
• A resulting set of state-of-the-art reading systems in language-constrained and unconstrained scenarios.

2. SYNTHETIC DATA ENGINE

[Word image examples: HERBERT, LOADING]

• Existing scene text datasets are very small and cover a small number of words.
• Use a synthetic data engine to generate training samples.
• Fonts selected from 1400 Google Fonts.
• Projective distortion, elastic distortion, and noise applied.
• Random crops of natural images alpha-blended with image layers to generate texture and lighting.

1. Font rendering
2. Border/shadow & colour
3. Composition
4. Projective distortion
5. Natural image blending
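The five steps above can be strung together with standard imaging tools. Below is a minimal, hypothetical sketch (assuming Pillow and NumPy; `font_path` and `bg_path` are placeholder inputs, and the border/shadow and composition steps are only approximated by added noise), not the authors' actual engine:

```python
import random
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def render_word(word, font_path, bg_path, size=(256, 64)):
    # 1. Font rendering: draw the word with a randomly sized font on a grey canvas.
    canvas = Image.new("L", size, color=random.randint(180, 255))
    font = ImageFont.truetype(font_path, size=random.randint(28, 48))
    ImageDraw.Draw(canvas).text((10, 8), word, font=font, fill=random.randint(0, 80))

    # 4. Projective distortion: warp with small random perspective coefficients.
    coeffs = [1, random.uniform(-0.05, 0.05), 0,
              random.uniform(-0.05, 0.05), 1, 0,
              random.uniform(-1e-4, 1e-4), random.uniform(-1e-4, 1e-4)]
    canvas = canvas.transform(size, Image.PERSPECTIVE, coeffs, Image.BILINEAR)

    # 5. Natural image blending: alpha-blend with a crop of a natural image.
    bg = Image.open(bg_path).convert("L").resize(size)
    out = Image.blend(bg, canvas, alpha=random.uniform(0.6, 0.9))

    # Stand-in for steps 2-3 (border/shadow, colour, composition): add noise.
    arr = np.asarray(out, dtype=np.float32) + np.random.normal(0, 5, size[::-1])
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
```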

Dataset Available!
• 9 million word images
• Covering 90k English words
• Download at: www.robots.ox.ac.uk/~vgg/data/text/

3. MODELS

DICTIONARY ENCODING (DICT)

w* = arg max_{w ∈ W} P(w | x) P(w | L)

where w* is the predicted word, P(w | x) is the visual model, and P(w | L) is the language model, estimated from a corpus of movie subtitles.

Multi-class classification, one class for each word in dictionary (constrained language model).
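As a rough illustration (not the authors' code), the decision rule above can be written in a few lines; `word_logits` (CNN scores over the dictionary) and `lexicon_logprior` (log P(w|L) from a text corpus) are assumed inputs:

```python
import numpy as np

def dict_decode(word_logits, lexicon_logprior, vocab):
    # log P(w|x): log-softmax over the dictionary classes predicted by the CNN.
    log_p_wx = word_logits - np.logaddexp.reduce(word_logits)
    # Combine with the language-model prior log P(w|L) and take the arg max over W.
    return vocab[int(np.argmax(log_p_wx + lexicon_logprior))]
```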


The number of classes can be scaled up to 90k. This requires incremental training: learning is initialised with 5k classes, and the number of classes is increased incrementally as training progresses.
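One purely illustrative way (assuming a NumPy-style softmax layer; names are not from the paper) to grow the output layer as training progresses is to keep the existing class weights and append newly initialised rows for the added classes:

```python
import numpy as np

def grow_output_layer(weights, bias, new_num_classes, scale=0.01):
    # weights: (num_classes, feature_dim) of the current softmax layer.
    old_num_classes, feature_dim = weights.shape
    extra = new_num_classes - old_num_classes
    # Keep learned rows, append small random rows for the new classes.
    new_w = np.vstack([weights, scale * np.random.randn(extra, feature_dim)])
    new_b = np.concatenate([bias, np.zeros(extra)])
    return new_w, new_b
```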

CHARACTER SEQUENCE ENCODING (CHAR)

Single CNN with multiple independent classifiers, inspired by Goodfellow et al ICLR’14. Each classifier predicts the character at each position of the word.

No language model, suitable for unconstrained recognition.

c_i* = arg max_{c_i ∈ C ∪ {φ}} P(c_i | Φ(x))

w* = c_1* c_2* ... c_N*, keeping only c_i* ∉ {φ}

where φ is the null character (no character at that position) and Φ(x) are the shared CNN features.
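A minimal sketch of this read-out, assuming `char_logits` holds one row of scores per character position over C ∪ {φ} (variable names are illustrative, not from the paper):

```python
import numpy as np

def char_decode(char_logits, charset):
    # One independent arg max per position; the last class index is the null character φ.
    null_idx = len(charset)
    picks = char_logits.argmax(axis=1)
    # Keep only non-null predictions, in order, to form the word.
    return "".join(charset[i] for i in picks if i != null_idx)
```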

BAG OF N-GRAMS ENCODING (NGRAM)

Represent a string as a bag-of-N-grams. E.g. G(spires) = {s, p, i, r, e, s, sp, pi, ir, re, es, spi, pir, ire, res, spire, pires}

Visually model the 10k most common 1-, 2-, 3-, and 4-grams with 10k independent binary classifiers; the result is an N-gram detection vector.

Two ways to recover words (see the sketch below):
• Find the nearest neighbour of the output among the ideal outputs of dictionary words.
• Train a linear SVM for each dictionary word, using training data outputs.
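The first option (nearest neighbour) could look roughly like the sketch below; the SVM option would instead score the detection vector with one trained classifier per dictionary word. Function and variable names are illustrative only:

```python
import numpy as np

def ngrams(word, max_n=4):
    # All 1- to 4-grams of the word (as a set, ignoring repeats).
    return {word[i:i + n] for n in range(1, max_n + 1)
            for i in range(len(word) - n + 1)}

def ngram_nn_decode(detections, dictionary, ngram_vocab):
    # Ideal output for a word: 1 for each modelled N-gram it contains, else 0.
    index = {g: j for j, g in enumerate(ngram_vocab)}
    best_word, best_dist = None, np.inf
    for word in dictionary:
        ideal = np.zeros(len(ngram_vocab))
        for g in ngrams(word):
            if g in index:
                ideal[index[g]] = 1.0
        dist = np.linalg.norm(detections - ideal)
        if dist < best_dist:
            best_word, best_dist = word, dist
    return best_word
```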

5. EVALUATION

Word recognition accuracy (%):

Model                IC03-50  IC03-Full  SVT-50  SVT   IC13  IIIT5k-50  IIIT5k-1k
Baseline (ABBYY)     56.0     55.0       35.0    -     -     24.3       -
Wang, ICCV '11       76.0     62.0       57.0    -     -     -          -
Bissacco, ICCV '13   -        -          90.4    78.0  87.6  -          -
Yao, CVPR '14        88.5     80.3       75.9    -     -     80.2       69.3
Jaderberg, ECCV '14  96.2     91.5       86.1    -     -     -          -
Gordo, arXiv '14     -        -          90.7    -     -     93.3       86.6
DICT-IC03-Full       99.2     98.1       -       -     -     -          -
DICT-SVT-Full        -        -          96.1    87.0  -     -          -
DICT+2-90k           98.7     98.6       95.4    80.7  90.8  97.1       92.7
CHAR+2               96.7     94.0       92.6    68.0  79.5  95.5       85.4
NGRAM+2-SVM          96.5     94.0       -       -     -     -          -

Evaluation is performed on the standard benchmarks IC03, IC13, SVT, and IIIT5k, as well as on the Synth dataset covering all 90k words. The suffix denotes the lexicon size (e.g. IC03-50 uses 50-word lexicons).

[Figure: precision-recall curve for N-gram (NGRAM+2) recognition.]

[Figure: Synthetic Data Contributions - recognition accuracy (%) of DICT-IC03-Full and DICT-SVT-Full as synthetic data components are added: (a) B&W, (b) +fonts, (c) +colour, (d) +perspective, (e) +noise, (f) +image blending.]

4. EXPERIMENTAL SETUP

Model            # training words  # parameters  Synth accuracy
DICT-IC03-Full   563               108M          98.7%
DICT-SVT-Full    4282              123M          98.7%
DICT-90k         90k               450M          90.3%
DICT+2-90k       90k               480M          95.2%
CHAR             90k               109M          71.0%
CHAR+2           90k               127M          86.2%
NGRAM-NN         90k               145M          25.1%
NGRAM+2-NN       90k               163M          27.9%


Qualitative examples (detected N-grams with DICT and CHAR transcriptions):

• z, zz, izz → DICT: pizza, CHAR: pizz
• i, n, y, im, ji, mm, my, imm, imn, lim, mim, mmy, tim, immi → DICT: jimmy, CHAR: limmy
• a, n, o, t, at, io, on, ti, za, ati, ion, iza, tio, zat, tion, atio, izat, zati → DICT: organization, CHAR: organaation
• DICT: western, CHAR: western

www.robots.ox.ac.uk/~vgg/data/text/  
