Chinese Academy of Sciences, Beijing, China
Speech and Language Processing Techniques
Report Document
Overview of MPEG-7
Dr Zhang Sen
Speech Group, INRIA-LORIA, Villers-lès-Nancy, France
Chinese Academy of Sciences, Beijing, China
04/18/23
Outline of contents
• Introduction
• Basic Components
• Content Description
• Audiovisual (AV) Descriptions
• Multimedia Description Schemes
• XM and Applications
• More Information
Ozone WP2 architecture
[Figure: layered architecture diagram. Labels: Ozone application; Software Environment layer; Ozone Services; Situation Sensitivity; User Context; Ozone Context; Multi-modal widgets; Dialog management; smart agent; User Interface management; Perception; QoS; Security; Authentication; User-interaction module; speech recognition; gesture recognition; video browser; animated agent; ...]
From MPEG-1 to MPEG-7
[Figure: timeline with year marks 90, 92, 94, 98, 99, 01 and "?"; standards shown: MPEG-1, MPEG-2, MPEG-4 (v1, v2), MPEG-7, MPEG-21]
• MPEG-3 was once defined, but abandoned
• MPEG-5 and -6 were never defined
MPEG Family
MPEG-1 – Coding of moving pictures and audio for digital storage media (CD-ROM, MP3), 11/92
MPEG-2 – Generic coding of moving pictures and audio information (DVD, Digital TV), 11/94
MPEG-4 – Coding of audiovisual objects for multimedia applications, Ver1 09/98, Ver2 11/99
MPEG-7 – Multimedia content description for AV material, 08/01
MPEG-21 – Digital AV framework: integration of multimedia technologies, 11/01
Why is MPEG-7 needed?
• Digital audiovisual information is increasing
  – more and more content is available
  – it comes from all kinds of sources
• Use of digital audiovisual information requires
  – description of the contents
  – fast search of the contents
Objective of MPEG-7
• Standardize content-based description for various types of audiovisual information
  – Enable fast and efficient content searching, filtering and identification
  – Describe several aspects of the content (low-level features, structure, semantics, models, collections, creation, etc.)
  – Address a large range of applications
• Types of audiovisual information:
  – Audio, speech
  – Moving video, still pictures, graphics, 3D models
  – Information on how objects are combined in scenes
Scope of MPEG-7
• Description generation (feature extraction, indexing process, annotation & authoring tools, ...) and description consumption (search engine, filtering tool, retrieval process, browsing device, ...) are non-normative parts of MPEG-7.
• The goal is to define the minimum that enables interoperability.
[Figure: three blocks labeled Description generation, Description (scope of MPEG-7) and Description consumption; generation and consumption are each labeled "Research and future competition"]
Scope of MPEG-7
[Figure: three blocks labeled Feature Extraction, MPEG-7 Description (standardization) and Search Engine]
• Feature Extraction: content analysis (D, DS), feature extraction (D, DS), annotation tools (DS), authoring (DS)
• MPEG-7 Scope: Description Schemes (DSs), Descriptors (Ds), Language (DDL). Ref: MPEG-7 Concepts
• Search Engine: searching & filtering, classification, manipulation, summarization, indexing
Audio in MPEG-7
• Audio content description (yes)
• Sound retrieval and classifier (yes)
• Speech synthesis (no)
• Speech recognition (no)
• Probability Models (yes)
Parts of the MPEG-7 Standard
• ISO/IEC 15938-1: Systems
• ISO/IEC 15938-2: Description Definition Language
• ISO/IEC 15938-3: Visual
• ISO/IEC 15938-4: Audio
• ISO/IEC 15938-5: Multimedia Description Schemes
• ISO/IEC 15938-6: Reference Software
Main elements of MPEG-7
• Descriptors (D): representations of features that define the syntax and the semantics of each feature representation (low-level).
• Description Schemes (DS): specify the structure and semantics of the relationships between their components, which may be both Ds and DSs (high-level).
• Description Definition Language (DDL): based on XML Schema; allows the creation of new DSs and Ds, and the extension and modification of existing DSs.
• System tools: support multiplexing of descriptions, synchronization, transmission mechanisms, coded representations, and management and protection of intellectual property.
Relations of main elements
[Figure: diagram relating the DDL, Description Schemes (DS) and Descriptors (D); DSs are shown containing nested DSs and Ds]
Description Definition Language
• The Description Definition Language (DDL) is a language that defines which descriptions are valid and allows the creation of new Description Schemes and Descriptors. It also allows the extension and modification of existing Description Schemes.
• The DDL is used to define a set of formal rules
  • ordering of the elements
  • occurrences of elements
  • ...
• XML + MPEG-7 extensions
XML: Base for DDL
• Why choose XML as the base for the DDL?
  • The popularity of XML
  • Interoperability with other standards in the future
• Why should XML be extended for MPEG-7?
  • SGML > XML
  • Structural extensions
  • Datatype extensions
DDL parser
A DDL parser is software that checks whether a description is valid.
[Figure: a Description and a Schema are fed to the Parser, which answers Yes or No]
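The yes/no behaviour of such a parser can be sketched in a few lines. This is only an illustration, not a real DDL parser: the `RULES` table is a hypothetical stand-in for a schema, the toy `letter` vocabulary is an assumption, and only two kinds of rule (element ordering and occurrence counts) are checked, using Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table standing in for a schema: for each parent tag,
# the allowed children in order, each with (min_occurs, max_occurs).
RULES = {
    "letter": [("header", 1, 1), ("text", 1, 1)],
    "header": [("name", 1, 1), ("address", 0, 1)],
    "address": [("street", 1, 1), ("city", 1, 1)],
}


def is_valid(element):
    """Return True if the element and all its descendants obey RULES."""
    children = list(element)
    rules = RULES.get(element.tag)
    if rules is None:
        # Tags without a rule entry are treated as leaves: no child elements.
        return not children
    pos = 0
    for tag, min_occurs, max_occurs in rules:
        count = 0
        # Consume consecutive children with the expected tag (ordering rule).
        while pos < len(children) and children[pos].tag == tag and count < max_occurs:
            count += 1
            pos += 1
        if count < min_occurs:  # occurrence rule violated
            return False
    if pos != len(children):  # leftover children: wrong order or unknown tag
        return False
    return all(is_valid(child) for child in children)


def parse_and_check(xml_text):
    """Mimic the parser box: description in, yes/no out."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False  # not even well-formed XML
    return root.tag in RULES and is_valid(root)


good = "<letter><header><name>Mr Sen</name></header><text>Dear Mr White</text></letter>"
bad = "<letter><text>Dear Mr White</text><header><name>Mr Sen</name></header></letter>"
print(parse_and_check(good), parse_and_check(bad))
```

The second description is rejected because `text` appears before `header`. A real DDL parser would read the rules from an XML Schema document rather than from a hard-coded table.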
Types of descriptions
• Low-level description (features, etc.)
  • Generic and flexible
  • Intelligent / efficient search engine
• High-level description (structures, concepts, etc.)
  • Efficient and powerful
  • Lack of flexibility
Low-level Description
• Information on the creation and production processes
  • director, title, short feature movie
• Information related to the usage of the content
  • copyright pointers, usage history, broadcast schedule
• Information on the storage features of the content
  • storage format, encoding
• Information about low-level features in the content
  • colors, textures, sound timbres, melody
High-level Description
• Structural description
  – video segments, frames, still and moving regions, audio segments
  – Segment DS (representing the spatial, temporal or spatio-temporal structure)
• Conceptual (semantic) description
  – objects, events, and notions
  – links between the two descriptions
Illustration of descriptions
Basic description
• Elements
  – Information containers
  – contain data and other elements
  – <city> …… </city>
• Attributes
  – Attribute-value pairs used to characterize elements
  – <city population="10000"> …… </city>
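As a small illustration of the element/attribute distinction, the `city` example can be read with Python's standard `xml.etree.ElementTree`. The element content `Nancy` is an assumption made here for the sake of the example; the slide leaves the content elided.

```python
import xml.etree.ElementTree as ET

# The text "Nancy" is a placeholder; the slide does not show the content.
city = ET.fromstring('<city population="10000">Nancy</city>')

print(city.tag)     # element name
print(city.text)    # data carried by the element
print(city.attrib)  # attribute-value pairs characterizing the element

# Attribute values are always strings; convert explicitly when needed.
population = int(city.get("population"))
```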
Structured descriptions
• Structured descriptions are trees
• Trees are suitable for retrieval and search
[Figure: a tree with a DS at the root, DSs and a D as children, and Ds as leaves]
Description trees
<letter>
  <header>
    <name> Mr Sen </name>
    <address>
      <street> 16 rue Laplace </street>
      <city> Nancy </city>
    </address>
  </header>
  <text> Dear Mr White, …</text>
</letter>
[Figure: the corresponding tree. letter has children header and text; header has children name and address; address has children street and city]
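To see why a tree-shaped description is convenient for search, the letter can be parsed and walked with Python's standard `xml.etree.ElementTree`. This is a sketch for illustration; the helper name `outline` is hypothetical.

```python
import xml.etree.ElementTree as ET

LETTER = """
<letter>
  <header>
    <name> Mr Sen </name>
    <address>
      <street> 16 rue Laplace </street>
      <city> Nancy </city>
    </address>
  </header>
  <text> Dear Mr White, …</text>
</letter>
"""


def outline(element, depth=0):
    """Return the tree as a list of (depth, tag) pairs in document order."""
    rows = [(depth, element.tag)]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows


root = ET.fromstring(LETTER)
for depth, tag in outline(root):
    print("  " * depth + tag)

# A path through the tree addresses one node directly.
city = root.find("./header/address/city").text.strip()
print(city)
```

The path expression `./header/address/city` follows the branches of the tree, which is the kind of structured lookup that a flat text description would not support.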
Example: Audio description
<Mpeg7Main>
  <DescriptionMetadata>
    <Version>1.0</Version>
  </DescriptionMetadata>
  <ContentDescription>
    <AudioContent xsi:type="AudioType">
      <Audio>
        <CreationInformation>
          <Creation>
            <Title> The daily news </Title>
          </Creation>
        </CreationInformation>
      </Audio>
    </AudioContent>
  </ContentDescription>
</Mpeg7Main>
Audio description
• Low-level Description
  – spectrum, parametric, and temporal features
• High-level Description
  – Audio signature Description Scheme
  – Instrument timbre Description Schemes
  – The melody Description Tools
  – Sound recognition and indexing Description Tools
  – Spoken Content Description Tools
Audio low-level descriptors
• Waveform
• Loudness
• Spectral basis
• Spectral envelope
• Spectral centroid
• Spectral spread
• Fundamental frequency
• Harmonicity
• Attack time
Audio descriptor: Basic
• Two basic audio Descriptors
  – AudioWaveform Descriptor
    • describes the audio waveform envelope (minimum and maximum)
  – AudioPower Descriptor
    • describes the temporally-smoothed instantaneous power
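The two basic descriptors can be approximated with very little code. The sketch below is a simplification under stated assumptions: non-overlapping frames of a fixed, arbitrarily chosen size, a plain per-frame mean of squared samples as the "smoothed" power, and hypothetical function names. It does not reproduce the normative extraction procedure.

```python
def frames(samples, size):
    """Split a sample list into consecutive non-overlapping frames."""
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def waveform_envelope(samples, size):
    """Per-frame (minimum, maximum) pairs, in the spirit of AudioWaveform."""
    return [(min(f), max(f)) for f in frames(samples, size)]


def frame_power(samples, size):
    """Per-frame mean of squared samples, in the spirit of AudioPower."""
    return [sum(x * x for x in f) / len(f) for f in frames(samples, size)]


# Toy signal (made-up values) with a frame size of 4 samples.
signal = [0.0, 0.5, -0.5, 1.0, -1.0, 0.0, 0.25, -0.25]
print(waveform_envelope(signal, 4))
print(frame_power(signal, 4))
```

Both descriptors reduce each frame to one or two numbers, which is what makes them cheap to store and display, for example as a waveform overview.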
Audio descriptor: Basic Spectral
• AudioSpectrumEnvelope Descriptor
  – describes the short-term power spectrum
• AudioSpectrumCentroid Descriptor
  – describes the center of gravity of the log-frequency power spectrum
• AudioSpectrumSpread Descriptor
  – describes the second moment of the log-frequency power spectrum
• AudioSpectrumFlatness Descriptor
  – describes the flatness properties of the spectrum
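A rough numerical reading of the centroid, spread and flatness descriptors is sketched below. The assumptions are labeled in the code: frequencies are mapped to a log scale as log2(f / 1000 Hz), the spread is taken as the square root of the power-weighted second central moment, and flatness is computed over the whole spectrum as geometric mean over arithmetic mean. Details of the normative definitions (band edges, handling of very low frequencies, per-band flatness) are deliberately left out, and the function names are hypothetical.

```python
import math


def log_frequency_centroid(freqs_hz, power, ref_hz=1000.0):
    """Power-weighted mean of log2(f / ref_hz). The 1 kHz reference is an assumption."""
    total = sum(power)
    return sum(math.log2(f / ref_hz) * p for f, p in zip(freqs_hz, power)) / total


def log_frequency_spread(freqs_hz, power, ref_hz=1000.0):
    """Square root of the power-weighted second central moment on the log scale."""
    c = log_frequency_centroid(freqs_hz, power, ref_hz)
    total = sum(power)
    var = sum((math.log2(f / ref_hz) - c) ** 2 * p
              for f, p in zip(freqs_hz, power)) / total
    return math.sqrt(var)


def flatness(power):
    """Geometric mean divided by arithmetic mean; requires strictly positive values."""
    geo = math.exp(sum(math.log(p) for p in power) / len(power))
    return geo / (sum(power) / len(power))


# Toy spectrum: three bins one octave apart with equal power.
freqs = [500.0, 1000.0, 2000.0]
flat = [1.0, 1.0, 1.0]
print(log_frequency_centroid(freqs, flat))
print(log_frequency_spread(freqs, flat))
print(flatness(flat))
```

With equal power in bins placed symmetrically around the reference, the centroid sits at the reference frequency and the flatness reaches its maximum, which matches the intuition of a noise-like spectrum.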
Audio Signature Description
• The AudioSignature Description Scheme provides a unique content identifier for the purpose of robust automatic identification of audio signals
• Applications include
  – audio fingerprinting
  – identification of audio
  – locating metadata for legacy audio content
Instrument Timbre Description
• Timbre is defined as the perceptual features that make two sounds with the same pitch and loudness sound different.
• Timbre Description describes the perceptual features with a reduced set of Descriptors
  – HarmonicInstrumentTimbre Descriptor
  – LogAttackTime Descriptor
  – PercussiveInstrumentTimbre Descriptor
  – Combination with Basic Spectral Descriptors
Melody Description Tools
The melody Description Tools facilitate efficient, robust, and expressive melodic similarity matching
• MelodyContour Description Scheme
  – 5-step contour representation
  – basic rhythmic information representation
• MelodySequence Description Scheme
  – supports an expanded descriptor set and high precision of interval encoding
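A 5-step contour can be illustrated by quantizing the interval between successive notes into five levels. The thresholds used below (a change of three or more semitones maps to ±2, one or two semitones to ±1, no change to 0) and the use of MIDI note numbers as input are assumptions made for this sketch, not values taken from the slides.

```python
def contour_step(semitones):
    """Map a pitch interval in semitones to one of five contour levels.

    The thresholds are assumed for illustration.
    """
    if semitones >= 3:
        return 2
    if semitones >= 1:
        return 1
    if semitones <= -3:
        return -2
    if semitones <= -1:
        return -1
    return 0


def melody_contour(midi_notes):
    """Contour of a melody given as a list of MIDI note numbers."""
    return [contour_step(b - a) for a, b in zip(midi_notes, midi_notes[1:])]


# Toy melody: C4, D4, D4, G4, F4, C4.
notes = [60, 62, 62, 67, 65, 60]
print(melody_contour(notes))
```

Because the contour only keeps coarse interval classes, two performances of the same tune in different keys, or a slightly out-of-tune hummed query, tend to produce the same sequence.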
General Sound Recognition and Indexing Description Tools
• SoundModel (SM) DS
  – statistical model, such as HMM or GMM
  – SoundModelStatePath Descriptor
    • consists of a state sequence generated by an SM
  – SoundModelStateHistogram Descriptor
    • consists of a normalized histogram of the state sequence generated by an SM given an audio segment
• SoundClassificationModel DS
  – a trainable multi-way classifier based on SMs
    • speech vs music, male vs female, trumpet vs violin
    • genre classification, voice recognition
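The step from a state path to a state histogram is a simple normalization, sketched below. The state path is made up for illustration; in practice it would come from decoding an audio segment with a trained model.

```python
from collections import Counter


def state_histogram(state_path, num_states):
    """Relative frequency of each state index in the path."""
    counts = Counter(state_path)
    total = len(state_path)
    return [counts.get(s, 0) / total for s in range(num_states)]


# Hypothetical state path of 8 frames over a 4-state model.
path = [0, 0, 1, 2, 2, 2, 1, 0]
hist = state_histogram(path, 4)
print(hist)
```

The histogram discards the order of the states, so it gives a compact, fixed-length summary that is easy to compare across segments of different durations.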
Spoken content retrieval
• Output of ASR
  – phone lattice or word lattice
  – the spoken content DS stores these lattices instead of plain text
  – lattices are good for retrieval
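Why lattices help retrieval can be shown with a toy example. The lattice below is hypothetical: each node maps to a list of (word, next node) arcs, so several competing recognition hypotheses are kept side by side. A query succeeds if its words label consecutive arcs on any path.

```python
# Hypothetical word lattice: node -> list of (word, next_node) arcs.
LATTICE = {
    0: [("taj", 1), ("touch", 1)],
    1: [("mahal", 2), ("my", 2)],
    2: [("drawing", 3), ("draw", 3)],
    3: [],
}


def contains_phrase(lattice, phrase):
    """True if the words of `phrase` label consecutive arcs on some path."""

    def match_from(node, index):
        if index == len(phrase):
            return True
        return any(
            word == phrase[index] and match_from(nxt, index + 1)
            for word, nxt in lattice[node]
        )

    # The phrase may start at any node of the lattice.
    return any(match_from(node, 0) for node in lattice)


print(contains_phrase(LATTICE, ["mahal", "drawing"]))
print(contains_phrase(LATTICE, ["taj", "drawing"]))
```

If only the single best transcript were stored and the recognizer had preferred "touch my draw", a text search for "mahal" would fail, while the lattice still contains that hypothesis.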
Spoken Content Description Tools
• SpokenContentLattice
  – represents the actual decoding produced by an ASR engine
• SpokenContentHeader
  – contains information about the speakers being recognized and the recognizer itself
  – WordLexicon Descriptor
  – PhoneLexicon Descriptor
  – SpeakerInfo Descriptor
  – ConfusionInfo Descriptor
Gaussian DS
<Gaussian>
<Mean>
4087.18 7173.73 1.36364 94.2727 1834.36 2359.55 2645.27 2577.09
………………………………
</Mean>
<Variance>
1.6982e+007 5.21621e+007 14.3636 9749.09 3.65743e+006
………………………………
</Variance>
</Gaussian>
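One way to use such a description is to score a feature vector against it. The sketch below assumes a Gaussian with diagonal covariance (one variance per component), which is an interpretation of the vector-valued `Variance` element rather than something the slide states. Only the first five mean and variance components visible above are used; the rest are elided on the slide and are not reconstructed here.

```python
import math

# First five components shown above; the remaining ones are elided on the slide.
MEAN = [4087.18, 7173.73, 1.36364, 94.2727, 1834.36]
VARIANCE = [1.6982e7, 5.21621e7, 14.3636, 9749.09, 3.65743e6]


def log_likelihood(x, mean, variance):
    """Log density of x under a Gaussian with diagonal covariance (assumed)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, variance)
    )


at_mean = log_likelihood(MEAN, MEAN, VARIANCE)
shifted = log_likelihood([m + 100.0 for m in MEAN], MEAN, VARIANCE)
print(at_mean, shifted)
```

A vector equal to the mean gets the highest score; moving away from the mean lowers it, most strongly along components with a small variance.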
State-transition model DS
<StateTransitionModel>
<Transitions size1="20" size2="20">
0 0 0.210526 0.0526316 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
……………………………………
</Transitions>
<Initial size="20">
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
</Initial>
<State label="0 players" confidence="1">
……………………………………
<State label="19 players" confidence="0.223607">
</StateTransitionModel>
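The flattened matrix text in such a description has to be reshaped before it can be used. The sketch below assumes row-major order with `size1` rows and `size2` columns, and that row i holds the probabilities of moving from state i; both are assumptions. A toy 3-state model with made-up numbers is used instead of the 20-state model, whose values are elided on the slide.

```python
def parse_matrix(text, rows, cols):
    """Reshape whitespace-separated numbers into a rows x cols nested list."""
    values = [float(tok) for tok in text.split()]
    if len(values) != rows * cols:
        raise ValueError("expected %d values, got %d" % (rows * cols, len(values)))
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]


def path_probability(initial, transitions, path):
    """Probability of a state path under the initial and transition probabilities."""
    prob = initial[path[0]]
    for prev, cur in zip(path, path[1:]):
        prob *= transitions[prev][cur]
    return prob


# Toy 3-state model (hypothetical numbers, not taken from the slide).
transitions = parse_matrix("0.5 0.5 0  0 0.5 0.5  0 0 1", 3, 3)
initial = [1.0, 0.0, 0.0]
p = path_probability(initial, transitions, [0, 0, 1, 2])
print(p)
```

The length check guards against a description whose declared sizes do not match the number of values actually present.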
ProbabilityModelClassifier DS
<ProbabilityModelClassifier confidence="0.9" length="2">
<ProbabilityModelClass SemanticLabel="fish" Confidence="0.5"
DescriptorName="ColorHistogram">
<Gaussian>
<Mean>
4087.18 7173.73 1.36364 94.2727 1834.36 2359.55
………………………….
</Mean>
<Variance>
1.6982e+007 5.21621e+007 14.3636 9749.09
………………………….
</Variance>
</Gaussian>
</ProbabilityModelClass>
SpokenContentLattice DS
A lattice structure for a hypothetical (combined phone and word) decoding of the expression “Taj Mahal drawing …”.
[Figure: block diagram. Labels: audio query; spectrum projection (AudioSpectrumBasis); HMM and bases; select; HMM 1, HMM 2, ..., HMM N-1, HMM N (SoundRecognitionModel); SoundRecognitionClassifier; SoundModelStatePath; model ref + state path; segmented audio description; MPEG-7 sound database]
Extraction of sound indexes using a sound-recognition classifier. The model reference and state path are stored.
[Figure: block diagram. Labels: audio query; spectrum projection (AudioSpectrumBasis); HMM and basis; select; HMM 1, HMM 2, ..., HMM N-1, HMM N (SoundRecognitionModel, ContinuousMarkovModel); SoundRecognitionClassifier; SoundModelStatePath; model ref + state path; matching; MPEG-7 sound database (indexed audio); result list]
Query-by-example application with a query in media source form. Features must be extracted and projected into the classification space for each model in order to match against the database.
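The matching step itself is not detailed on the slide. One plausible sketch, assuming each database entry and the query have already been reduced to a normalized state histogram for the same model, is to rank entries by a simple distance. The L1 distance, the entry names and all numbers below are assumptions made for illustration.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two equal-length histograms."""
    return sum(abs(x - y) for x, y in zip(a, b))


def rank(query_hist, database):
    """Return database keys sorted from closest to farthest from the query."""
    return sorted(database, key=lambda name: l1_distance(query_hist, database[name]))


# Hypothetical indexed database: name -> state histogram.
DATABASE = {
    "dog_bark": [0.7, 0.2, 0.1],
    "violin": [0.1, 0.3, 0.6],
    "speech": [0.3, 0.4, 0.3],
}
query = [0.6, 0.3, 0.1]
print(rank(query, DATABASE))
```

The sorted list plays the role of the result list in the diagram; a real system would also restrict the comparison to entries indexed with the same model as the query.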
[Figure: block diagram. Labels: DDL query; model ref + state path; matching; MPEG-7 sound database; result list]
An example search application utilizing a query in DDL format
Extraction of hidden Markov model and basis functions and storage in a DDL representation
[Figure: block diagram. Labels: audio WAV files; basis extract; feature extract; HMM; HMM and basis; SoundRecognitionModel; AudioSpectrumBasis; SoundRecognitionFeatures; ContinuousMarkovModel]
Scenarios for the spoken content Description Tools
• Recall of AV data by memorable spoken events
  – A film or video recording where a character or person spoke a particular word or sequence of words. The source media would be known, and the query would return a position in the media.
• Spoken Document Retrieval
  – There is a database consisting of separate spoken documents. The result of the query is the relevant documents, and optionally the position in those documents of the matched speech.
• Annotated Media Retrieval
  – Similar to spoken document retrieval. The result of the query is the media which is annotated with speech, and not the speech itself. An example is a photograph retrieved using a spoken annotation.
Multimedia DSs
Multimedia Description Schemes are metadata structures for describing and annotating audio-visual (AV) content
• Basic Elements
• Content Management
• Content Description
• Content Organization
• Navigation and Access
• User Interaction
Organization of Multimedia DSs
Content Management
• Creation and production information
  – Creation information
    • title, textual annotation, creators, and dates
  – Classification information
    • genre, subject, purpose, language
• Media coding, storage and file formats
  – format, compression, and coding
• Content usage
  – usage rights, usage record
Navigation and Access
• Summaries
  – hierarchical summaries
  – sequential summaries
• Partitions and Decompositions
  – decompositions in space, time and frequency
  – used in multi-resolution access and progressive retrieval
• Variations
  – selection of the most suitable variation of an AV program
  – adapt to the different capabilities of terminal devices, network conditions or user preferences
Hierarchical summary
Illustration of variations
Content Organization
• Collections
  – group the contents into clusters
  – describe statistics and models of the attribute values
  – describe relationships among collection clusters
• Models
  – model the attributes and features of AV content
  – Probability Model
    • specify statistical functions and structures
  – Analytic Model
    • specify semantic labels
    • specify the confidence
    • build classifiers
Collection Structure
User Interaction
• User Preference
  – context dependency in terms of time and place
  – relative importance of different preferences
  – privacy characteristics of the preferences
  – preferences updated by agent or user
• Usage History
  – history of actions
  – used to determine the user's preferences
eXperimentation Model (XM)
• Simulation platform for:
  • Ds, DSs, CSs, DDL (CS: Coding Schemes)
• XM applications:
  • the server (extraction) applications
  • the client (search, filtering and/or transcoding) applications
The XM applications
• Extraction from Media
  • all low-level Ds or DSs should have an application class of this type
• Search & Retrieval Application
  • either client application
• Media Transcoding Application
  • either client application
• Description Filtering Application
  • either client application
Extraction from Media
Search and retrieval application
Media transcoding application
Description Filtering Application
Interface model for XM app
Real world application
MDB = media database, DDB = description database. First, two features are extracted from a media database. Then, based on the first feature, relevant media files are selected from the media database. The relevant media files are transcoded based on the second extracted feature.
MPEG-7 application areas
• Storage and retrieval of audiovisual databases (image, film, radio archives)
• Broadcast media selection (radio, TV programs)
• Surveillance (traffic control, surface transportation, production chains)
• E-commerce and Tele-shopping (searching for clothes / patterns)
• Remote sensing (cartography, ecology, natural resources management)
• Entertainment (searching for a game, for a karaoke)
• Cultural services (museums, art galleries)
• Journalism (searching for events, persons)
• Personalized news service on Internet (push media filtering)
• Intelligent multimedia presentations
• Educational applications
• Bio-medical applications
Illustration of applications
Users
Information Flow
[Figure: flow diagram. Labels: feature extraction (manual/automatic); AV description; encoding; storage; transmission; decoding; search/query; browse; filter; pull; push; users]
Push and Pull applications
• Pull applications
  – Example: Search engines for internet and DBs
  – Advantage: Many search engines work on standardized descriptions
• Push applications
  – Example: Broadcast of video, Interactive TV
  – Advantage: Intelligent agents filter standardized descriptions
Example: Pull application
[Figure label: MPEG-7 Database]
Example: Push application
Example: queries
• Text (keywords):
  – Find AV material with subject corresponding to some keywords
• Semantic description:
  – Find AV material corresponding to specified semantics
• Image as an example:
  – Find an image with similar characteristics (global or local)
• A few notes of music:
  – Find corresponding musical pieces or movies
• Low level features (example: motion):
  – Find video with specific object motion trajectories
Integration of MPEG-7 into XML
<seq begin="20s" dur="10s">
  <img id="Image1" dur="5s">
    <MP7:annotation>
      <Who>Fernando Morientes</Who>
      <WhatAction>Spain vs. Sweden soccer match</WhatAction>
    </MP7:annotation>
  </img>
  <img id="Image2" dur="2s" />
</seq>
MPEG-7 and other Standards
• MPEG-1, -2, and -4 are designed to represent the information itself, while MPEG-7 is meant to represent information about the information.
• MPEG-1, -2, and -4 make content available, while MPEG-7 allows you to find the content you need.
Ultimate ambition of MPEG-7
• To make the web as searchable for multimedia content as it is searchable for text today
• To make the use of computer systems as easy as possible
MPEG-7 beyond
• To mould computers around human requirements and not humans around computer requirements
• To enable content disclosure based on facts, rather than on human annotations
• To find information by rich spoken queries and hand-drawn images, addressing what most people expect computers to be able to do
More Information on WWW
• Major MPEG-7 documents
http://www.cselt.it/mpeg/, semi-official website
http://www.mpeg-7.com, official website
• Others
http://www.elsevier.com/locate/image
Conclusion
[Figure: summary diagram. Labels: AV contents; Structures; Features; Ds; DSs; DDL; Ds, DSs; User]
Thanks
Low level AV descriptors
• Video segments: Color, Camera motion, Motion activity, Mosaic
• Moving regions: Color, Motion trajectory, Parametric motion, Spatio-temporal shape
• Still regions: Color, Shape, Position, Texture
• Audio segments: Spoken content, Spectral feature, Timbre
Face Recognition Descriptor
• Projection of a face vector onto a set of basis vectors (face patterns)
• Feature set is extracted from a normalized face image
• Normalized face image
  – 56 lines with 46 intensity values in each line
  – The centers of the two eyes are located on the 24th row and the 16th and 31st column for the right and left eye respectively
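The projection step can be sketched with plain dot products. The image size comes from the slide; everything else is an assumption made for illustration: the flat test image, the two toy basis vectors, and the omission of any preprocessing such as mean subtraction. The real basis (the "face patterns") is defined by the standard and is not reproduced here.

```python
ROWS, COLS = 56, 46  # normalized face image size given on the slide


def flatten(image):
    """Turn a ROWS x COLS image (list of rows) into one face vector."""
    return [pixel for row in image for pixel in row]


def project(face_vector, basis_vectors):
    """Dot product of the face vector with each basis vector."""
    return [sum(f * b for f, b in zip(face_vector, basis)) for basis in basis_vectors]


# Hypothetical input: a uniformly bright image.
image = [[1.0] * COLS for _ in range(ROWS)]

# Two toy basis vectors (assumed, for illustration only).
n = ROWS * COLS
basis = [
    [1.0 / n] * n,                                            # average intensity
    [1.0 / n if i < n // 2 else -1.0 / n for i in range(n)],  # top half minus bottom half
]

features = project(flatten(image), basis)
print(features)
```

Each basis vector yields one number, so a face image of 56 x 46 = 2576 intensity values is reduced to a feature set whose length equals the number of basis vectors.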
Segment Decomposition
MPEG-7 Normative Interfaces
Example: Content description
[Figure: block diagram. Labels: MPEG-7 database; indexing, feature extraction; search, retrieval; high-level process; low-level process]
Segment DS
The Segment DS describes the result of a spatial, temporal, or spatio-temporal partitioning of the AV content. It has nine major subclasses:
• Multimedia Segment DS
• AudioVisual Region DS
• AudioVisual Segment DS
• Audio Segment DS
• Still Region DS
• Still Region 3D DS
• Moving Region DS
• Video Segment DS
• Ink Segment DS
Examples: T/S segments
Example: Segment trees
Illustration of conceptual description
[Figure: diagram. Labels: AV content; Semantic DS; Semantic container DS; Semantic base DS; Object DS; Event DS; Concept DS; Semantic state DS; Semantic place DS; Semantic time DS]
Visual description
• Basic structures
  – Grid layout, Time series, Multiple view, Spatial 2D coordinates, Temporal interpolation
• Descriptors
  – Color, Texture, Shape, Motion, Localization
Example: Color Descriptors
• Color space
• Color Quantization
• Dominant Colors
• Scalable Color
• Color Layout
• Color-Structure
• GoF/GoP Color
Example: Color space
• R,G,B
• Y,Cr,Cb
• H,S,V
• HMMD
• Linear transformation matrix with reference to R, G, B
• Monochrome
Audio Framework
Descriptor
• Definition: A Descriptor (D) is a representation of a Feature. A Descriptor defines the syntax and the semantics of the Feature representation.
• Notes: A descriptor allows an evaluation of the corresponding feature via the descriptor value. It is possible to have several descriptors representing a single feature.
• Examples: For the color feature, a possible descriptor is the color histogram. Other possible descriptors are the average of the frequency components, the motion field, the text of the title, etc.
Descriptor Value
• Definition A Descriptor Value is an instantiation of a Descriptor for a given data set (or subset thereof).
• Notes Descriptor Values are combined via the mechanism of a Description Scheme to form a Description.
Description Scheme
• Definition: A Description Scheme (DS) specifies the structure and semantics of the relationships between its components, which may be both Descriptors and Description Schemes.
• Examples: A movie, structured as scenes and shots, including some textual descriptors at the scene level, and color, motion and some audio descriptors at the shot level.
• Note: A D contains only basic data types and does not refer to other Ds or DSs.
DS: XML Schema & Extensions
• XML Schema
  • Data types
  • Simple and Complex types
  • Elements
  • Inheritance, Abstract types
• MPEG-7 extensions
  • Array and Matrix datatypes
  • Enumerated datatypes for MimeType, CountryCode, RegionCode, CurrencyCode and CharacterSetCode
  • Typed references
Basic elements of DS
• Constructs for linking media files
• Localizing pieces of content
• Describing
  – time, places, persons, individuals, groups, organizations, and textual annotation, etc.
  – Who? What object? What action? Where? When? Why? and How?
Content recognition tools
• No speech, face or gesture recognition engines are included in MPEG-7
• Building content recognition tools is a task for industry, not for a standard
  – coding tools in MPEG-1, -2, -4 were for research purposes, not part of the standard
  – no tools were part of the MPEG standard