
Michele Merler and John R. Kender Computer Science ...mmerler/posterICIP09.pdf · 8 presentation...

Date post: 17-Jul-2020

Semantic Keyword Extraction via Adaptive Text Binarization of Unstructured Unsourced Video

Michele Merler and John R. Kender

Computer Science Department, Columbia University

{mmerler,jrk}@cs.columbia.edu

Example of indexing function: the word Energy is recognized in slides across 4 different presentations

8 presentation videos, 1hr 45 mins, ~13 Slides each

• Studies have assessed the reliability of slides as a summarization tool [He et al. 2000]

[Figure: labels Text1 to Text10; sample slide text as recognized by OCR: "Completed Tasks", "Rcscarch", "Interview with Client", "{ i sited Project Space", "Qiant House Resident Association Meeting"]

Text detection stages: LoG edges, edge connected components, constraints.

Empirically validated thresholds are applied to prune non-text regions.

Geometric Constraints (F: frame, R: candidate text region)

[Formulas not recoverable: threshold inequalities on the height, width and area of R relative to F, using the constants 5, 6, 3, 2, 10 and 1000]

Edge Density Constraint

Alignment Regions Merge Constraints

Binarization - 54 Detected Regions, 2177154 annotated pixels

N. Ground Truth Characters   N. Recognized Characters   Precision   Recall
13804                        7376                       0.5343      0.7446

N. Ground Truth Words        N. Recognized Words        Precision   Recall
2276                         1126                       0.4947      0.6651

[Figure: bar chart of N. Recognized Characters (axis 0 to 8000) by Recognition Method: Tesseract vs. LAO + Tesseract]

$N_{cor(c)} = N_{gt(c)} - ED(s_{gt}, s_r)$

where $ED(s_{gt}, s_r)$ is the edit distance between the ground truth string $s_{gt}$ and the recognized string $s_r$.
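The count of correct characters above can be illustrated with a minimal sketch. It assumes the edit distance is the Levenshtein distance with unit costs, which the poster does not specify; the function names, the clamp at zero, and the ground-truth string "Research" paired with the OCR output "Rcscarch" are illustrative assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def correct_characters(ground_truth: str, recognized: str) -> int:
    """N_cor = N_gt - ED(s_gt, s_r); clamping at zero is an assumption."""
    return max(0, len(ground_truth) - edit_distance(ground_truth, recognized))


# Assumed ground truth "Research" against the OCR output "Rcscarch".
print(correct_characters("Research", "Rcscarch"))
```

Two substitutions separate the strings, so 6 of the 8 ground-truth characters count as correct.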

Videos of presentations are employed in a large variety of systems for different purposes:

• Distance or E-learning

• Automatic generation of conference proceedings

• Student presentations

Challenges

Result: Fully automatic method for summarizing and indexing unstructured presentation videos based on text extracted from the projected slides

Integration into summarization and presentation tools such as the VAST MultiMedia Browser [1] (see image above)

[1] www.aquaphoenix.com/research/vastmm

[2] http://code.google.com/p/tesseract-ocr

[3] http://en.wikipedia.org/wiki/Letter_frequencies

Introduction

Proposed Approach

Results

Local Adaptive Otsu (LAO) Binarization

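Optimize for the threshold T that maximizes the between-class variance in a sliding window:

$\sigma^2_{between}(T) = n_B(T)\, n_F(T)\, (\mu_B(T) - \mu_F(T))^2$

$\sigma^2_{within}(T) = n_B(T)\, \sigma^2_B(T) + n_F(T)\, \sigma^2_F(T)$

Sauvola's threshold [Sauvola et al. 00]:

$T(x, y) = \mu(x, y, W) \left[ 1 + k \left( \frac{\sigma(x, y, W)}{R} - 1 \right) \right]$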
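A minimal sketch of the threshold search described above: Otsu's criterion is evaluated on the pixels of a window, and the image is binarized pixel by pixel with the threshold of its surrounding window. The window size, the handling of image borders and of uniform windows, the use of pixel counts for n_B and n_F, and the convention that pixels above T map to 1 are all assumptions, not details taken from the poster. This naive version recomputes the histogram for every pixel.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int], levels: int = 256) -> int:
    """Return T maximizing n_B(T) * n_F(T) * (mu_B(T) - mu_F(T))**2."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    total_sum = sum(value * count for value, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    n_low, sum_low = 0, 0
    for t in range(levels - 1):
        n_low += hist[t]            # pixels with value <= t
        sum_low += t * hist[t]
        n_high = total - n_low      # pixels with value > t
        if n_low == 0 or n_high == 0:
            continue                # one class is empty, no valid split
        mu_low = sum_low / n_low
        mu_high = (total_sum - sum_low) / n_high
        var_between = n_low * n_high * (mu_low - mu_high) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t                   # 0 when the window is uniform


def local_adaptive_otsu(image: List[List[int]], half_window: int) -> List[List[int]]:
    """Binarize each pixel with the Otsu threshold of its surrounding window."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            window = [image[j][i]
                      for j in range(max(0, y - half_window), min(rows, y + half_window + 1))
                      for i in range(max(0, x - half_window), min(cols, x + half_window + 1))]
            out[y][x] = 1 if image[y][x] > otsu_threshold(window) else 0
    return out


print(local_adaptive_otsu([[10, 10, 200, 200], [10, 10, 200, 200]], half_window=3))
```

Counts are used instead of class probabilities because the two differ only by a constant factor, which leaves the maximizing T unchanged.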

Many videos are already archived

Low quality and lack of structure:

• Lack of additional sources of information (e.g. electronic copies of slides)

• Not recorded by professional cameramen

• Shots cannot be used as clue

• Not edited

• Unconstrained camera movements

• Slides Clipped

• Compression

1. Segment video into semantically distinct shots based on slides

2. Slides Text Detection

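[Formula not recoverable: merge(R_i, R_j), defined from a term in R_i and R_j compared with 0 and the offset |x(R_i) - x(R_j)| compared with 10]

[Formula not recoverable: edge density E_density compared with the threshold 0.2]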

3. Slides Text Recognition

• Double Text Regions Size with Bilinear Interpolation

• LAO Binarization

• Tesseract [2] OCR Engine

• Training with 15 character sets

• Height 30pt
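The first recognition step, doubling the size of text regions with bilinear interpolation, can be sketched as follows. The pixel-center alignment, the clamping at the borders and the function name are assumptions; the poster only states that regions are doubled in size.

```python
from typing import List


def upscale_bilinear(image: List[List[float]], factor: int = 2) -> List[List[float]]:
    """Upscale a grayscale image by an integer factor with bilinear interpolation."""
    rows, cols = len(image), len(image[0])
    out = []
    for oy in range(rows * factor):
        # Map the output pixel center back to input coordinates, clamped to the image.
        sy = min(max((oy + 0.5) / factor - 0.5, 0.0), rows - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, rows - 1)
        fy = sy - y0
        row = []
        for ox in range(cols * factor):
            sx = min(max((ox + 0.5) / factor - 0.5, 0.0), cols - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, cols - 1)
            fx = sx - x0
            # Interpolate along x on the two neighboring rows, then along y.
            top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
            bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


print(upscale_bilinear([[0, 100]]))
```

A one-row image with values 0 and 100 becomes two rows of four values that ramp from 0 to 100, which shows how interpolation smooths character edges before binarization.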

Algorithm           Precision   Recall   F1*      t(sec)
Otsu                0.8611      0.8555   0.8583   0.539
Sauvola (k = 0.5)   0.9003      0.8759   0.8879   0.626
LAO                 0.8831      0.9278   0.9049   2.126
LAO + Int. Hist.    0.8831      0.9278   0.9049   1.29

* $F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$
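As a quick arithmetic check, the F1 column can be recomputed from the Precision and Recall columns of the binarization table. The sketch below uses only the values printed in the table; the function and variable names are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the binarization results table.
table = {
    "Otsu": (0.8611, 0.8555),
    "Sauvola (k = 0.5)": (0.9003, 0.8759),
    "LAO": (0.8831, 0.9278),
}

for name, (precision, recall) in table.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.4f}")
```

The recomputed values agree with the F1 column to four decimal places (0.8583, 0.8879 and 0.9049).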

• Most popular fonts used in presentation slides

• Text reflecting English letter frequencies [3]

Slides Text Recognition - 2276 words, 13804 characters

Method    Precision   Recall
SIMPLE    0.71213     0.85914
REFINED   0.88584     0.68046

[Figure: bar chart of Precision and Recall (axis 0 to 1) for SIMPLE and REFINED]

• SIMPLE: use all candidate text regions

• REFINED: prune candidate text regions based on OCR output

Slides Text Detection - 500 Frames

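$Recall = \frac{TA_E \cap TA_{GT}}{TA_{GT}}$

$Precision = \frac{TA_E \cap TA_{GT}}{TA_E}$

where $TA_{GT}$ is the ground truth text regions area and $TA_E$ is the extracted text regions area.

Test set: 400 frames with text, 100 frames with no text.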
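The area-based detection measures can be sketched on binary masks, where each mask marks the pixels covered by ground truth or by extracted text regions. Measuring area in pixels, representing regions as masks, and returning 0.0 when an area is empty (as for a frame with no text) are assumptions made for this illustration.

```python
from typing import List, Tuple


def area_precision_recall(gt_mask: List[List[int]],
                          extracted_mask: List[List[int]]) -> Tuple[float, float]:
    """Precision and recall from the overlap of two binary text-area masks."""
    overlap = gt_area = extracted_area = 0
    for gt_row, ex_row in zip(gt_mask, extracted_mask):
        for g, e in zip(gt_row, ex_row):
            gt_area += 1 if g else 0
            extracted_area += 1 if e else 0
            overlap += 1 if (g and e) else 0
    precision = overlap / extracted_area if extracted_area else 0.0
    recall = overlap / gt_area if gt_area else 0.0
    return precision, recall


ground_truth = [[1, 1, 0, 0],
                [1, 1, 0, 0]]
extracted = [[0, 1, 1, 0],
             [0, 1, 1, 0]]
print(area_precision_recall(ground_truth, extracted))
```

Here half of the extracted area is true text and half of the true text is found, so both measures are 0.5.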

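Text vs. Non-Text; Foreground vs. Background

[Figure: histogram h(p) over pixel value p, with threshold T separating the foreground (F) and background (B) classes, whose variances are $\sigma^2_F$ and $\sigma^2_B$]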

Character Recognition (c) and Word Recognition (w):

$Recall = \frac{N_{cor(c/w)}}{N_{r(c/w)}}$

$Precision = \frac{N_{cor(c/w)}}{N_{gt(c/w)}}$

• No electronic copies of the slides

• Changes in text used to assess slide changes

Recognized slide text containing the word "energy" (OCR output, errors included):

• The clients desire to generate wind energy

• Grey Water system/Hydroelectric Energy

• @;gm;y,g[g_,bjns and energy efficient light bulbs

• electrical/thermal energy

• Reallstlc energy

• PV and energy storage system

• hall energy needs

Optimal version of Sauvola’s algorithm

Localized version of Otsu’s algorithm [Otsu 79]

• Assumption: bimodal distribution (foreground/background)

• Fast implementation with Integral Histogram [Porikli et al. 05]
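The idea behind an integral histogram can be sketched in a few lines: one pass over the image stores, for every position, the histogram of all pixels above and to the left, after which the histogram of any rectangular window follows from four lookups. This is a generic illustration and not the implementation from [Porikli et al. 05]; pixel values are assumed to be already quantized to bin indices, and the function names are hypothetical.

```python
from typing import List


def integral_histogram(image: List[List[int]], bins: int) -> List[List[List[int]]]:
    """ih[y][x][b] = number of pixels with bin b in rows < y and columns < x."""
    rows, cols = len(image), len(image[0])
    ih = [[[0] * bins for _ in range(cols + 1)] for _ in range(rows + 1)]
    for y in range(rows):
        for x in range(cols):
            cell = ih[y + 1][x + 1]
            up, left, diag = ih[y][x + 1], ih[y + 1][x], ih[y][x]
            for b in range(bins):
                cell[b] = up[b] + left[b] - diag[b]
            cell[image[y][x]] += 1  # add the current pixel to its bin
    return ih


def window_histogram(ih: List[List[List[int]]],
                     top: int, left: int, bottom: int, right: int) -> List[int]:
    """Histogram of rows top..bottom-1 and columns left..right-1, via four lookups."""
    a, b, c, d = ih[bottom][right], ih[top][right], ih[bottom][left], ih[top][left]
    return [a[k] - b[k] - c[k] + d[k] for k in range(len(a))]


image = [[0, 1, 1],
         [2, 0, 1],
         [2, 2, 0]]
ih = integral_histogram(image, bins=3)
print(window_histogram(ih, 1, 1, 3, 3))
```

With the table built once, each sliding-window histogram costs a fixed number of operations per bin regardless of window size, which is consistent with the shorter running time reported for LAO + Int. Hist.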

[Sauvola et al. 00] [Otsu 79]

Dependency on k is removed!
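To see the dependency that LAO avoids, here is a minimal sketch of Sauvola's threshold, T = mean * (1 + k * (std / R - 1)), computed over the pixels of one window. The default k = 0.5 matches the setting listed in the binarization results; R = 128 is a common choice for 8-bit images and is an assumption here, as is the function name.

```python
import math
from typing import Sequence


def sauvola_threshold(window: Sequence[float], k: float = 0.5, r: float = 128.0) -> float:
    """Sauvola threshold from the mean and standard deviation of a window."""
    n = len(window)
    mean = sum(window) / n
    std = math.sqrt(sum((p - mean) ** 2 for p in window) / n)
    return mean * (1 + k * (std / r - 1))


flat_window = [100] * 9
for k in (0.2, 0.5):
    print(k, sauvola_threshold(flat_window, k=k))
```

On the same uniform window the threshold moves from 80 to 50 as k goes from 0.2 to 0.5, so the result depends on a parameter that has to be tuned, whereas the Otsu criterion applied per window has no such parameter.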
