© 2006 IBM Corporation
The IBM TRECVID Concept Detection System
Milind Naphade
Intelligent Information Analytics Group
IBM Thomas J. Watson Research Center
Team: Milind Naphade, Dhiraj Joshi, Dipankar Datta, Paul Natsev, Lexing Xie, Shahram Ebadollahi, John Smith, Alexander Haubold, Jelena Tesic, Joachim Seidl
The IBM TRECVID 2006 Concept Detection System

[System architecture diagram: annotation and data preparation feed two feature-extraction paths — localized extraction (color (3), texture (3)) and global extraction (color (3), texture (3), motion (4), ASR text (3)). Features flow into low-level feature-based models via SVM multiview, text search, IVP, and LSCOM modeling (multi-kernel). Models are fused across techniques with non-weighted and validity-based weight selection. Legend distinguishes paths used in training only from those used in both training and detection.]
Feature Extraction

Visual: Color Correlogram (166), Co-occurrence Texture (96), Color Moments (9), Wavelet Texture (12), Motion Magnitude & Direction (260)
Granularity: Grid, Global, Compressed Domain (Macro-block)
ASR: Text Search System
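As one concrete instance of the visual descriptors above, the 9-dimensional Color Moments feature can be sketched as the mean, standard deviation, and skewness of each color channel. This is a minimal sketch assuming an H×W×3 array as input; the color space and grid layout the IBM system actually used are not specified on the slide.

```python
import numpy as np

def color_moments(image):
    """First three moments (mean, std, skewness) per channel -> 9-dim vector.

    `image` is assumed to be an H x W x 3 array; the real system's
    color space and grid granularity are not specified on the slide.
    """
    feats = []
    for c in range(image.shape[2]):
        channel = image[:, :, c].astype(np.float64).ravel()
        mu = channel.mean()
        sigma = channel.std()
        # Skewness as the signed cube root of the third central moment,
        # which stays well-defined for flat (constant) channels.
        third = ((channel - mu) ** 3).mean()
        skew = np.sign(third) * np.abs(third) ** (1.0 / 3.0)
        feats.extend([mu, sigma, skew])
    return np.asarray(feats)
```

Computed per grid cell rather than globally, the same function yields the localized variant of the descriptor.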
MARVEL MODELER: A tool for building models optimized over features, parameters, and learning algorithms

Manually indexed development corpus, split internally
Features: Color, Texture, ASR text, Captions, Audio
Learning and Optimization
Algorithms and Parameters: SVM, GMM, HMM
Models Repository
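The optimization loop at the core of such a modeler can be sketched as an exhaustive sweep over features and SVM parameters, keeping the combination with the best validation score. The `validation_ap` callback is a hypothetical stand-in for training on the training split and scoring average precision on the validation split; Marvel Modeler's actual search strategy is not detailed on the slide.

```python
import itertools

def select_best_model(features, gammas, costs, validation_ap):
    """Pick the (feature, gamma, C) combination with the highest validation
    average precision. `validation_ap` is a hypothetical callback that
    trains a model and returns its validation AP.
    """
    best = None
    for feat, gamma, c in itertools.product(features, gammas, costs):
        ap = validation_ap(feat, gamma, c)
        # Keep the best-scoring configuration seen so far.
        if best is None or ap > best[0]:
            best = (ap, feat, gamma, c)
    return best
```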
Approach 1: Multiple Instantiations

Consider multiple instantiations of the learning problem:
– Different development corpus partitions
– Different ground truth interpretations
– Different learning algorithms
– Different optimization schemes
Fuse across the multiple instantiations using multiple normalization and simple fusion strategies
Reusing what we have – 2005 Models

2005 models for 39 LSCOM-lite concepts using 5 visual features
Run against 2006 data and combined using late fusion
Development corpus partitioned into 4 sets
Uses the SVM-light package and a range of gamma and C values for parameter optimization
Uses a training set of 28,055 images for training and a validation set of 4,400 images for validation and parameter optimization
Uses a liberal interpretation of ground truth (a shot is assumed positive when any annotator tags it positive) when multiple annotators' inputs were available
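The liberal interpretation above (and the stricter alternatives used for the 2006 models) can be sketched as simple rules for merging per-annotator binary labels. The function name and interface are illustrative, not taken from the IBM codebase.

```python
def merge_annotations(labels, interpretation="liberal"):
    """Combine multiple annotators' binary labels for one shot/concept.

    liberal: positive if ANY annotator marked it positive.
    strict:  positive only if ALL annotators marked it positive.
    """
    if interpretation == "liberal":
        return int(any(labels))
    if interpretation == "strict":
        return int(all(labels))
    raise ValueError(f"unknown interpretation: {interpretation}")
```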
Using Marvel Modeler: 2006 Models

2006 models for 39 LSCOM-lite concepts using 5 visual features
Run against 2006 data and combined using late fusion
Development corpus partitioned into 3 sets
Uses the IBM implementation of SVM SMO and a range of gamma and C values for parameter optimization
Uses a training set of 42,000 images for training and validation
Uses multiple interpretations of ground truth, ranging from the most liberal to the most strict, when multiple annotators' inputs were available
All new models built with Marvel Modeler using 7 parameter configurations for 5 features per concept; the number of parameter configurations and features was constrained by the time available for the effort: 1 week
Multi-view Approach: Fusion

Normalization: 1. Gaussian  2. Sigmoid  3. Range  4. Rank
Aggregation: 1. Average  2. Weighted Average
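The four normalization schemes and the two aggregation rules can be sketched as follows. The exact formulas (e.g., the sigmoid's scale parameter) are assumptions; the slide names only the families of methods.

```python
import numpy as np

def normalize(scores, method):
    """Map raw detector scores to a comparable scale before fusion."""
    s = np.asarray(scores, dtype=np.float64)
    if method == "gaussian":      # zero mean, unit variance (z-score)
        return (s - s.mean()) / (s.std() + 1e-12)
    if method == "sigmoid":       # squash z-scores into (0, 1)
        return 1.0 / (1.0 + np.exp(-(s - s.mean()) / (s.std() + 1e-12)))
    if method == "range":         # min-max to [0, 1]
        return (s - s.min()) / (s.max() - s.min() + 1e-12)
    if method == "rank":          # rank positions scaled to [0, 1]
        order = s.argsort().argsort()
        return order / max(len(s) - 1, 1)
    raise ValueError(method)

def fuse(runs, weights=None):
    """Average (or weighted-average) normalized score lists across runs."""
    runs = np.asarray(runs, dtype=np.float64)
    if weights is None:
        return runs.mean(axis=0)
    w = np.asarray(weights, dtype=np.float64)
    return (w[:, None] * runs).sum(axis=0) / w.sum()
```

Rank normalization discards score magnitudes entirely, which makes it robust when the views produce scores on incompatible scales.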
Comparison between 2005 and 2006 SVM Models

Older models built for TREC 2005; newer 2006 models built using Marvel Modeler
Performance evaluated on the 2005 test set
Number of concepts: 10
Ground truth: provided by NIST
MAP for 2005 models: 0.31
MAP for 2006 models: 0.31
MAP for fused 2005 and 2006 models: 0.37
20% performance improvement from fusing the 2 views
Approach: Multi-kernel Learning

Problem: fusing multiple inputs (color moments, correlogram, texture, …)
Late fusion:
1. Train an SVM on each feature
2. Perform weighted fusion on the prediction values
Equivalent to having separate kernel weights for each support vector
Alternative:
– Train one decision function for both the support vector weights and the kernel weights
– … and make the support vector weights shared among kernels?
Advantages:
– Decision + fusion learned in one pass
– Fewer weights to learn and keep
– Faster to evaluate on test data
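The equivalence the slide points at can be demonstrated numerically: when the dual weights are shared across kernels, the decision function of the combined kernel equals the µ-weighted sum of the per-kernel decisions. The sketch below uses fixed dual weights as placeholders for what the MKL solver would learn; it is an illustration of the structure, not the IBM implementation.

```python
import numpy as np

def rbf(X, Z, gamma):
    """RBF kernel matrix between row-sets X and Z."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mkl_decision(x_test, X_sv, alpha_y, mus, gammas, b=0.0):
    """f(x) = sum_i alpha_i y_i sum_m mu_m k_m(x_i, x) + b.

    alpha_y, mus, and b are placeholders for quantities the MKL
    optimization would produce.
    """
    K = sum(mu * rbf(X_sv, x_test, g) for mu, g in zip(mus, gammas))
    return alpha_y @ K + b

# With SHARED dual weights, the combined-kernel decision is exactly the
# mu-weighted sum of single-kernel decisions.
rng = np.random.default_rng(0)
X_sv, x_test = rng.normal(size=(5, 3)), rng.normal(size=(4, 3))
alpha_y, mus, gammas = rng.normal(size=5), [0.7, 0.3], [0.5, 2.0]
combined = mkl_decision(x_test, X_sv, alpha_y, mus, gammas)
late = sum(mu * (alpha_y @ rbf(X_sv, x_test, g)) for mu, g in zip(mus, gammas))
assert np.allclose(combined, late)
```

Late fusion, by contrast, trains a separate SVM per kernel and so carries one set of dual weights per feature, which is what makes the one-pass formulation cheaper to store and evaluate.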
Multiple Kernel Learning: Solution

[Diagram: an SVM decision boundary over positive and negative samples, with MKL combining three kernels via learned weights µ1, µ2, µ3 over the shared support vectors]
Solved via second-order cone programming [Bach, Lanckriet, Jordan 2003], using SeDuMi [2001]
Approach: Text Baseline

IBM Text Search Engine for shot-level ranking
– JURU search engine used
– No story-level processing
– Normalization of the text-based run differs from the other runs
– Fused with visual models to generate multimodal runs
Manual expansion from concepts to keywords
– Potential use of LSCOM, Cyc, and WordNet to be explored
Held-out set performance lower than the visual models
– The strength of the approach is in the combination hypothesis
Fusion Across Multiple Approaches

Normalization: 1. Gaussian  2. Sigmoid  3. Range  4. Rank
Aggregation: 1. Average  2. Weighted Average
Weight Selection: 1. Validity-based
LSCOM Models

Time limitations restricted us to building 70 LSCOM models
Focused on frequent concepts that were also relevant
Marvel Modeler leveraged for building the models
The same IBM colleague performed all model building
Context enforcement performed using a manual mapping
A few LSCOM-lite concepts targeted for context enforcement:
– Military Personnel
– Waterscape
– Airplane
Resulted in 1 Type B run mistakenly tagged Type A
IBM Runs

Run Name | Type | Description
MBWN | B | Aggregating across all subsystems including Text Baseline, Visual Baseline, Multi-kernel Linear machines, Image Upsampling, and LSCOM context, using the held-out set for optimal selection
MRF | A | Aggregating across all subsystems including Text Baseline, Visual Baseline, Multi-kernel Linear machines, and Image Upsampling
MAAR | A | Sigmoid Normalization and Decision Fusion of Multi-view SVM Visual and Text Baselines
MBW | A | Fusion of Multi-view SVM Visual and Text Baselines
UB | A | Unimodal Baseline: best of Visual Baseline and Text Baseline, selected based on held-out set performance
VB | A | Visual Baseline: using up to 5 visual features and Multi-view SVM models with naïve fusion
NIST Evaluation: Performance Summary

All IBM runs except the Visual Baseline were buggy for 3 concepts
– Submitted with incorrect feature numbers (fnum)
– Did not contribute to the pooling
Mean Inferred Average Precision
– Ranges from 0.145 (visual only) to 0.1773 (multimodal)
NIST-returned Precision@100
– Ranges from 22 (visual only) to 26 (multimodal)
Top performance for 7 of the 20 concepts
Second highest MAP among all sites
Top MIAP and IP@100 accounting for the bug
– Excluding the 3 concepts that did not make it to the pool
IBM Runs in Context of the Overall Benchmark

[Bar chart: Mean Inferred P@100 for all submitted runs, axis 0–30]
• IBM runs returned near-top performance with the bug, and top performance discounting the bug
• NIST-returned P@100: multimodal runs improve over the visual baseline by 10%
• InfAP: multimodal runs improve over the visual baseline by 22%
• IBM runs have top performance for 7/20 concepts
IBM Runs in Context of the Overall Benchmark: Mean Inf. AP

[Bar chart: Mean Inferred AP for all submitted runs, axis 0–0.25; summary bullets as on the previous slide]
IBM Runs in Context of the Overall Benchmark: Corrected Mean P@100

[Bar chart: Corrected Mean P@100 for all submitted runs, axis 0–35; summary bullets as on the previous slides]
IBM Runs in Context of the Overall Benchmark: Mean Inf. AP

[Bar chart: Mean Inferred AP for all submitted runs, axis 0–0.25; summary bullets as on the previous slides]
IBM Runs in Context of the Overall Benchmark: Mean Inf. P@100

[Bar chart: Mean Inferred P@100 for all submitted runs, axis 0–30; summary bullets as on the previous slides]
But Was this Analysis Conclusive?

Random sampling of the pool raises questions about conclusiveness
Actual P@100 range: 44 to 52
NIST-returned P@100 range: 22 to 26
Absolute numbers matter, so relative ordering may not be enough
Performance discrepancy significant for 15 of the 20 concepts
[Bar chart: P@100, actual vs. inferred, axis 0–100]
Observations

Visual Baseline created by leveraging the Marvel Modeler asset
Text+Visual improves performance by 10% over visual-only
Context helps when the underlying contributors are robust
Need more work on event and object detection
Normalization & multimodal fusion lead to re-ranking
Significant improvement in concepts such as Airplane (3x better)
LSCOM provides large untapped potential
– Quality is key
– Once acceptable quality is guaranteed, quantity is the game changer
From LSCOM-lite to LSCOM: Broadcast News

[Concept taxonomy diagram:]
Program Category: Finance & Business, Commercial, Weather, Entertainment, Science & Tech, Sports, Politics
Location: Meeting, Studio, Outdoor (Road, Sky, Snow, Urban, Waterscape, Mountain, Desert), Court, Indoors, Office
People: Face, Person, Roles (Govt Leader, Corp Leader, Police/Security, Prisoner, Military), Crowd
Objects: Flag-US, Animal, Computer, Vehicle (Airplane, Car, Boat/Ship, Bus, Truck), Building, Vegetation
Activities & Events: People Related (Walk/Run, March), Events (Explosion/Fire, Natural Disaster)
Graphics: Maps, Charts
Workflow

1. Start with existing terms
2. Create a list of useful concepts with users
3. Filter concepts that are not feasible
4. Filter concepts that are not observable
5. Annotate a partial corpus with the concepts
6. Filter concepts that are very rare or have very high inter-annotator disagreement
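The final filtering step can be sketched as a frequency and agreement threshold over per-concept annotation statistics. The thresholds and the `stats` layout below are illustrative assumptions; the slide does not state the cutoffs LSCOM actually used.

```python
def filter_concepts(stats, min_frequency=0.005, max_disagreement=0.4):
    """Keep concepts that are frequent enough and consistently annotated.

    `stats` maps concept -> (positive_rate, inter_annotator_disagreement);
    the thresholds here are hypothetical, not the ones LSCOM used.
    """
    return sorted(c for c, (freq, dis) in stats.items()
                  if freq >= min_frequency and dis <= max_disagreement)
```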
LSCOM: Goal and Vision

[Diagram: the Domain Vocabulary & Ontology connecting three communities — the User Community, the Annotation & Knowledge Representation Community, and the Modeling Community — along the axes of usability, observability, and feasibility]
Deliverables:
• 1000+ concept lexicon
• Annotated corpus
• 39 use cases and 250+ queries
• Ontology
• Experimental evaluation
Impact:
• Largest annotated video corpus
• Leveraged at TRECVID and other fora
• LSCOM mapped into OpenCyc and ResearchCyc
• Dissemination at various fora to optimize utilization, leading to collaboration opportunities
What is LSCOM?
• 1000+ concepts that describe broadcast news from the intelligence analyst perspective
• An annotated corpus of 61,901 shots (80 hours) of broadcast news video (3 languages, 6 channels) for 449 concepts
• A compilation of 39 use cases and 250+ TRECVID-style queries that represent analyst requirements
• A mapping of LSCOM concepts and subsequent expansion using Cyc (packaged in OpenCyc and ResearchCyc releases)
• Initial results on modeling 300 of the annotated concepts
Evaluation Results

[Bar chart, axis 0–0.4, comparing: Baseline LSCOM-lite (39 concepts); MediaMill (75 concepts); LSCOM (317 concepts); Oracle combination + oracle detection; Oracle combination + noise detection at 20% break-even precision-recall]