A New Approach for Video Text Detection and Localization
M. Cai, J. Song and M.R. Lyu
VIEW Technologies
The Chinese University of Hong Kong
Related work
Text Area Detection
– Uncompressed domain methods
  • Texture-based
  • Color-based
  • Edge-based
– Compressed domain methods
  • DCT coefficients
  • Number of intra-coded blocks on P- / B- frames
Text String Localization
– Bottom-up scheme
– Top-down scheme
Language-independent characteristics
Contrast
– An adaptive contrast threshold according to the background complexity
Color
– Color bleeding caused by compression
Orientation
– Well-defined size and orientation make the text easy to read
Stationary location
– Text stays at the same location for a certain length of time
Language-dependent characteristics
                              English            Chinese
Stroke density                roughly similar    varies dramatically
Min(Font size)                10-pixel high      20-pixel high
Min(Aspect ratio)             relatively large   relatively small
Stroke direction statistics   mainly vertical    vertical, horizontal, left diagonal, right diagonal
Workflow
Sampling & color space conversion
Multi-frame comparison
Video text detection and localization on every sampled frame
A sequential multi-resolution paradigm
[Figure: starting from the original image, levels 1, 2, ..., n-1, n are processed in sequence. At each level l the image is rescaled (Size / f(l)), edge detection produces an edge map, text area detection and text string localization produce text regions, and the regions are rescaled (Size × f(l)) to their original coordinates. The output after level n is the final text regions with original coordinates.]
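The level-by-level scheme can be sketched as follows. This is only an illustration: the slides do not define f(l), so `scale_factor` assumes a halving per level, and `detect` is a hypothetical callback standing in for text area detection plus text string localization. Boxes are (top, left, bottom, right) tuples with exclusive bottom/right.

```python
def scale_factor(level):
    # Assumption: f(l) = 2^(l-1). The slides do not give the actual f(l).
    return 2 ** (level - 1)


def downscale(image, factor):
    """Nearest-neighbour subsampling of a 2D list by an integer factor."""
    return [row[::factor] for row in image[::factor]]


def to_original_coords(box, factor):
    """Map a box found on a downscaled image back to original coordinates."""
    top, left, bottom, right = box
    return (top * factor, left * factor, bottom * factor, right * factor)


def multi_resolution_detect(image, n_levels, detect):
    """Run `detect` on every level and collect boxes in original coordinates.

    `detect` is a hypothetical function: it takes a 2D image and returns
    a list of (top, left, bottom, right) boxes on that image.
    """
    results = []
    for level in range(1, n_levels + 1):
        factor = scale_factor(level)
        small = downscale(image, factor)
        for box in detect(small):
            results.append(to_original_coords(box, factor))
    return results
```

Larger text that is missed at full resolution can be picked up at a coarser level, and mapping every box back to original coordinates lets the results of all levels be merged in one list.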
Text detection
Edge detection
– Sobel edge detector
Local thresholding
– Adaptive to background complexity
Text-like area recovery
– Enhance the density of text areas
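For the first step, a Sobel edge map can be computed as in the sketch below. The slides only name the Sobel detector; combining the two gradients as |gx| + |gy| and leaving border pixels at zero are assumptions made here for brevity.

```python
def sobel_edge_map(gray):
    """Edge strength of a 2D grayscale image (list of lists) using Sobel kernels.

    Assumption: strength is |gx| + |gy|; border pixels are left at 0.
    """
    h, w = len(gray), len(gray[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal gradient: right column minus left column.
            gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y][x - 1] - gray[y + 1][x - 1])
            # Vertical gradient: bottom row minus top row.
            gy = (gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y - 1][x] - gray[y - 1][x + 1])
            edges[y][x] = abs(gx) + abs(gy)
    return edges
```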
Local Thresholding
Use a small kernel (gray) to scan the whole edge map row by row.
In the bigger window surrounding the kernel, check the background type: “Clear” or “Noisy”.
For a Clear background, determine the local threshold from the low part of the edge strength histogram in the bigger window; for a Noisy background, determine it from the high part.
[Figure: (a) Concentric kernel and window (labelled h and 3h). (b) A window on the multi-line text area and the horizontal projection in it. (c) Local threshold selection: histogram of count against edge strength, from 0 to MAX, divided into a low part and a high part.]
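A rough sketch of the local thresholding scan follows. Several details are assumptions, not taken from the slides: the window is taken to be 3h by 3h around an h by h kernel, the background is called "Noisy" when the fraction of edge pixels in the window exceeds `noisy_fill`, and the "low part" and "high part" of the histogram are approximated by the percentiles `low_q` and `high_q`.

```python
def percentile(values, q):
    """Value at fraction q (0..1) of the sorted values (nearest rank)."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]


def local_threshold(edge_map, h=2, noisy_fill=0.5, low_q=0.25, high_q=0.75):
    """Keep edge pixels whose strength reaches a window-dependent threshold.

    All parameters are hypothetical tuning knobs for this sketch.
    """
    rows, cols = len(edge_map), len(edge_map[0])
    out = [[0] * cols for _ in range(rows)]
    for ky in range(0, rows, h):
        for kx in range(0, cols, h):
            # Bigger window around the kernel, clipped to the image.
            y0, y1 = max(0, ky - h), min(rows, ky + 2 * h)
            x0, x1 = max(0, kx - h), min(cols, kx + 2 * h)
            window = [edge_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            strengths = [v for v in window if v > 0]
            if not strengths:
                continue  # no edges near this kernel
            # Assumed background test: many edge pixels means "Noisy".
            noisy = len(strengths) / len(window) > noisy_fill
            threshold = percentile(strengths, high_q if noisy else low_q)
            for y in range(ky, min(rows, ky + h)):
                for x in range(kx, min(cols, kx + h)):
                    if edge_map[y][x] >= threshold:
                        out[y][x] = edge_map[y][x]
    return out
```

A higher threshold on noisy backgrounds removes weak clutter edges, while a lower one on clear backgrounds keeps faint text strokes that a single global threshold would discard.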
Thresholding result comparison
[Figure panels: video image, local thresholding results, global thresholding results]
Text-like area recovery
Labeling: Classify the current edge pixels as “TEXT” or “NON_TEXT” based on their local density.
Recovery/Suppression:
– Bring back neighboring lower-strength edge pixels of the TEXT edge pixels.
– The NON_TEXT edge pixels are suppressed.
[Figure panels: before recovery, after recovery]
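The labeling and recovery/suppression steps might look like the sketch below. The density measure (number of thresholded edge pixels in a square neighbourhood), `radius` and `min_density` are assumptions chosen for illustration; the slides do not specify them.

```python
def recover_text_areas(edge_map, thresholded, radius=1, min_density=3):
    """Label thresholded edge pixels, then recover or suppress.

    edge_map:    edge strengths before thresholding
    thresholded: edge strengths after thresholding (0 = removed)
    """
    rows, cols = len(edge_map), len(edge_map[0])

    def neighbours(y, x):
        # Square neighbourhood around (y, x), including the pixel itself.
        for ny in range(max(0, y - radius), min(rows, y + radius + 1)):
            for nx in range(max(0, x - radius), min(cols, x + radius + 1)):
                yield ny, nx

    # Labeling: a surviving edge pixel is TEXT if its neighbourhood is dense.
    is_text = [[False] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            if thresholded[y][x] > 0:
                density = sum(1 for ny, nx in neighbours(y, x)
                              if thresholded[ny][nx] > 0)
                is_text[y][x] = density >= min_density

    # Recovery: restore weaker edges next to TEXT pixels.
    # Suppression: NON_TEXT pixels are never copied to the output.
    out = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            if not is_text[y][x]:
                continue
            for ny, nx in neighbours(y, x):
                if is_text[ny][nx] or thresholded[ny][nx] == 0:
                    out[ny][nx] = edge_map[ny][nx]
    return out
```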
Coarse-to-fine Text localization
Projection-based top-down localization.
To handle complex text layout.
[Flowchart:]
Initialization: the whole edge map is the only region in the processing array.
– If the array is empty, terminate; otherwise pop the first region from the processing array.
– Horizontal projection of the region. Divisible? Y: add each sub-region to the processing array. N: continue with the vertical projection.
– Vertical projection of the region. Divisible? Y: add each sub-region to the processing array. N: the region is indivisible.
– Indivisible regions: check aspect ratio. Y: add to the resulting text regions. N: discard false regions.
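The projection-based top-down loop can be sketched as below. The assumptions are labelled in the comments: a region is split wherever the projection profile drops to exactly zero (a real system would more likely use a gap threshold), and the aspect-ratio check is a simple width/height lower bound `min_aspect`. Regions are (top, left, bottom, right) with exclusive bottom/right.

```python
from collections import deque


def runs(profile, offset):
    """Spans (start, end) where the projection profile is non-zero."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            spans.append((offset + start, offset + i))
            start = None
    if start is not None:
        spans.append((offset + start, offset + len(profile)))
    return spans


def split(edge_map, region):
    """Return sub-regions if the region is divisible, else None."""
    top, left, bottom, right = region
    # Horizontal projection: one sum per row.
    row_profile = [sum(edge_map[y][left:right]) for y in range(top, bottom)]
    row_spans = runs(row_profile, top)
    if row_spans != [(top, bottom)]:
        return [(t, left, b, right) for t, b in row_spans]
    # Vertical projection: one sum per column.
    col_profile = [sum(edge_map[y][x] for y in range(top, bottom))
                   for x in range(left, right)]
    col_spans = runs(col_profile, left)
    if col_spans != [(left, right)]:
        return [(top, l, bottom, r) for l, r in col_spans]
    return None  # indivisible


def localize(edge_map, min_aspect=1.0):
    """Top-down localization; min_aspect is an assumed width/height bound."""
    queue = deque([(0, 0, len(edge_map), len(edge_map[0]))])
    results = []
    while queue:
        region = queue.popleft()
        parts = split(edge_map, region)
        if parts is not None:
            queue.extend(parts)  # sub-regions go back to the processing array
            continue
        top, left, bottom, right = region
        if (right - left) / (bottom - top) >= min_aspect:
            results.append(region)  # otherwise discarded as a false region
    return results
```

Alternating horizontal and vertical splits until nothing more divides is what lets the scheme separate multi-line and multi-column layouts without knowing the layout in advance.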
Localization steps
[Figure: four panels, (1) to (4), showing successive localization steps]
Experimental results
Performance statistics
Statistics of 10 news videos:
Processing time per frame: 0.25 s (Pentium III 1 GHz CPU)
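Detection rate = $\frac{\mathrm{Num}(\text{correctly detected text regions})}{\mathrm{Num}(\text{ground truth text regions})}$ = 93.6%
Detection accuracy = $\frac{\mathrm{Num}(\text{correctly detected text regions})}{\mathrm{Num}(\text{all detected text regions})}$ = 87.2%
Localization accuracy = $\frac{\mathrm{Area}(\text{detected text regions}) \cap \mathrm{Area}(\text{ground truth text regions})}{\mathrm{Area}(\text{ground truth text regions})}$ > 90%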
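The three ratios can be computed as in the sketch below. It assumes axis-aligned boxes given as (top, left, bottom, right) with exclusive bottom/right, and it assumes the area term in the localization accuracy is the overlap (intersection) of the detected and ground truth regions. The numbers in the usage lines are made up for illustration and are not the reported results.

```python
def detection_rate(num_correct, num_ground_truth):
    """Correctly detected regions as a fraction of ground truth regions."""
    return num_correct / num_ground_truth


def detection_accuracy(num_correct, num_detected):
    """Correctly detected regions as a fraction of all detected regions."""
    return num_correct / num_detected


def box_area(box):
    top, left, bottom, right = box
    return max(0, bottom - top) * max(0, right - left)


def localization_accuracy(detected, truth):
    """Overlap area divided by ground truth area (overlap is an assumption)."""
    overlap = (max(detected[0], truth[0]), max(detected[1], truth[1]),
               min(detected[2], truth[2]), min(detected[3], truth[3]))
    return box_area(overlap) / box_area(truth)


# Illustrative values only.
print(detection_rate(9, 10))        # 0.9
print(detection_accuracy(9, 12))    # 0.75
print(localization_accuracy((0, 0, 10, 18), (0, 0, 10, 20)))  # 0.9
```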