Page 1: 5/15/2005 Video Processing Wen-Hung Liao 6/2/2005.

5/15/2005

Video Processing

Wen-Hung Liao
6/2/2005

Page 2:

Outline

Basics of Video
Video Processing:
  Video coding/compression/conversion
  Digital video production
  Video special effects
  Video content analysis
Summary

Page 3:

Basics of Video

Component video
Composite video
Digital video

Page 4:

Component Video

Higher-end video systems use three separate video signals for the red, green, and blue image planes; each color channel is sent as a separate video signal. Most computer systems use component video, with separate signals for R, G, and B.

For any color-separation scheme, component video gives the best color reproduction, since there is no crosstalk between the three channels. This is not the case for S-Video or composite video, discussed next.

Component video, however, requires more bandwidth and good synchronization of the three components.

Page 5:

Composite Video

Color (chrominance) and intensity (luminance) signals are mixed into a single carrier wave. Chrominance is a composition of two color components (I and Q, or U and V). In NTSC TV, e.g., I and Q are combined into a chroma signal, and a color subcarrier is then employed to place the chroma signal at the high-frequency end of the band shared with the luminance signal.

The chrominance and luminance components can be separated at the receiver, and the two color components can then be recovered.

When connecting to TVs or VCRs, composite video uses only one wire: the color signals are mixed, not sent separately. The audio and sync signals are added to this one signal.

Since color and intensity are wrapped into the same signal, some interference between the luminance and chrominance signals is inevitable.

Page 6:

S-Video

As a compromise, S-Video (separated video, or super-video, as in S-VHS) uses two wires: one for luminance and one for a composite chrominance signal. As a result, there is less crosstalk between the color information and the crucial gray-scale information.

The reason for placing luminance in its own part of the signal is that black-and-white information is most crucial for visual perception. In fact, humans can differentiate spatial resolution in gray-scale images with much higher acuity than in the color part of color images.

As a result, we can send less accurate color information than intensity information: we can only see fairly large blobs of color, so it makes sense to send less color detail.

Page 7:

Digital Video

The advantages of digital representation for video are many. For example:

Video can be stored on digital devices or in memory, ready to be processed (noise removal, cut and paste, etc.) and integrated into various multimedia applications.
Direct access is possible, which makes nonlinear video editing a simple rather than a complex task.
Repeated recording does not degrade image quality.
Ease of encryption and better tolerance to channel noise.

Page 8:

Chroma Subsampling

Since humans see color with much less spatial resolution than they see black and white, it makes sense to decimate the chrominance signal.

Interesting (but not necessarily informative!) names have arisen to label the different schemes used.

To begin with, numbers are given stating how many pixel values, per four original pixels, are actually sent: the chroma subsampling scheme 4:4:4 indicates that no chroma subsampling is used; each pixel's Y, Cb, and Cr values are transmitted, 4 for each of Y, Cb, Cr.

Page 9:

Chroma Subsampling (2)

The scheme 4:2:2 indicates horizontal subsampling of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labeled as 0 to 3, all four Ys are sent, and every two Cb's and two Cr's are sent, as (Cb0, Y0)(Cr0,Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averaging is used).

The scheme 4:1:1 subsamples horizontally by a factor of 4.

The scheme 4:2:0 subsamples in both the horizontal and vertical dimensions by a factor of 2.
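As a concrete illustration (not from the slides), the 4:2:0 scheme can be sketched by averaging each 2x2 block of a chroma plane. The sketch below assumes the plane is a list of rows with even width and height:

```python
def subsample_420(chroma):
    """4:2:0 chroma subsampling: replace each 2x2 block of the
    chroma plane with the average of its four samples.
    Assumes even width and height (a simplifying assumption)."""
    h, w = len(chroma), len(chroma[0])
    return [
        [(chroma[y][x] + chroma[y][x + 1] +
          chroma[y + 1][x] + chroma[y + 1][x + 1]) / 4
         for x in range(0, w, 2)]
        for y in range(0, h, 2)
    ]

# A 2x2 block averages down to a single chroma sample.
print(subsample_420([[10, 20], [30, 40]]))  # [[25.0]]
```

Applied to both Cb and Cr, this keeps 4 Y + 1 Cb + 1 Cr samples per four pixels, i.e. half the data of 4:4:4.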

Page 10:

Chroma Subsampling (3)

Page 11:

RGB/YUV Conversion

http://www.fourcc.org/index.php?http%3A//www.fourcc.org/intro.php

RGB to YUV conversion:
Y = (0.257 * R) + (0.504 * G) + (0.098 * B) + 16
Cr = V = (0.439 * R) - (0.368 * G) - (0.071 * B) + 128
Cb = U = -(0.148 * R) - (0.291 * G) + (0.439 * B) + 128

YUV to RGB conversion:
B = 1.164(Y - 16) + 2.018(U - 128)
G = 1.164(Y - 16) - 0.813(V - 128) - 0.391(U - 128)
R = 1.164(Y - 16) + 1.596(V - 128)
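These formulas translate directly into code. The sketch below implements them for 8-bit values; the rounding and clamping conventions are my own assumption, not part of the slides:

```python
def rgb_to_yuv(r, g, b):
    """Convert 8-bit R, G, B to Y, Cb (U), Cr (V) using the
    coefficients given above."""
    y = 0.257 * r + 0.504 * g + 0.098 * b + 16
    v = 0.439 * r - 0.368 * g - 0.071 * b + 128   # Cr
    u = -0.148 * r - 0.291 * g + 0.439 * b + 128  # Cb
    return round(y), round(u), round(v)

def yuv_to_rgb(y, u, v):
    """Inverse conversion; results clamped to the 8-bit range."""
    def clamp(x):
        return max(0, min(255, round(x)))
    r = 1.164 * (y - 16) + 1.596 * (v - 128)
    g = 1.164 * (y - 16) - 0.813 * (v - 128) - 0.391 * (u - 128)
    b = 1.164 * (y - 16) + 2.018 * (u - 128)
    return clamp(r), clamp(g), clamp(b)

print(rgb_to_yuv(255, 255, 255))  # (235, 128, 128): white maps to peak Y
print(yuv_to_rgb(16, 128, 128))   # (0, 0, 0): minimum Y maps to black
```

Note that Y spans only 16-235 in this convention, while U and V are centered on 128.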

Page 12:

Video Coding Standards

MPEG standards (MPEG-1, -2, -4, -7, -21):
  MPEG-1: VCD
  MPEG-2: DVD
  MPEG-4: video objects
  MPEG-7: multimedia database
  MPEG-21: framework
H.26x series (H.261, H.263, H.264): video conferencing

Page 13:

Digital Video Production

Tools: Adobe Premiere, After Effects, …
Resources: http://www.cc.gatech.edu/dvfx/resources.htm
Examples: http://www.cc.gatech.edu/dvfx/videos/dvfx2005.html

Page 14:

Video Special Effects

Examples:
EffectTV: http://effectv.sourceforge.net/
FreeFrame: http://freeframe.sourceforge.net/gallery.html

Page 15:

Types of Special Effects

Applying to the whole image frame
Applying to part of the image (edges, moving pixels, …)
Applying to a collection of frames
Applying to detected areas
Overlaying virtual objects:
  at pre-determined locations
  in response to the user's position

Page 16:

Video Content Analysis

Event detection
For indexing/searching
To obtain a high-level semantic description of the content

Page 17:

Image Databases

Problem: accessing and searching large databases of images, videos, and music
Traditional solutions: file IDs, keywords, associated text
Problems:
  can't query based on visual or musical properties
  depends on the particular vocabulary used
  doesn't provide query by example
  time consuming
Solution: content-based retrieval using automatic analysis tools (see http://wwwqbic.almaden.ibm.com)

Page 18:

Retrieval of Images by Similarity

Components:
  Extraction of features or image signatures, and their efficient representation and storage
  A set of similarity measures
  A user interface for efficient and ordered presentation of retrieved images, and to support relevance feedback

Considerations:
  Many definitions of similarity are possible
  The user interface plays a crucial role
  Visual content-based retrieval is best utilized when combined with traditional search

Page 19:

Image features for similarity definition

Color similarity:
  Similarity: e.g., "distance" between color histograms
  Should use perceptually meaningful color spaces (HSV, Lab, …)
  Should be relatively independent of illumination (color constancy)
  Locality: "find a red object such as this one"

Texture similarity:
  Texture feature extraction (statistical models)
  Texture qualities: directionality, roughness, granularity, …
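As one concrete example of a histogram "distance" for color similarity, histogram intersection (a common choice, though the slides do not name a specific metric) sums the bin-wise minima of two normalized histograms:

```python
def histogram_intersection(h1, h2):
    """Similarity of two normalized color histograms:
    1.0 for identical histograms, 0.0 for disjoint ones.
    Assumes each histogram sums to 1."""
    return sum(min(a, b) for a, b in zip(h1, h2))

# Identical histograms score 1.0; non-overlapping ones score 0.0.
print(histogram_intersection([0.5, 0.5, 0.0], [0.5, 0.5, 0.0]))  # 1.0
print(histogram_intersection([1.0, 0.0], [0.0, 1.0]))            # 0.0
```

Computing the histograms in a perceptually meaningful space such as HSV or Lab, rather than RGB, makes the scores better match human judgments of color similarity.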

Page 20:

Shape Similarity

Must distinguish between the similarity of actual geometric 2-D shapes in the image and that of the underlying 3-D shape
Shape features: circularity, eccentricity, principal-axis orientation, …

Spatial similarity:
  Assumes images have been (automatically or manually) segmented into meaningful objects (symbolic image)
  Considers the spatial layout of the objects in the scene

Object presence analysis:
  Is this particular object in the image?

Page 21:

Main components of retrieval system

Database population: images and videos are processed to extract features (color, texture, shape, camera and object motion).

Database query: the user composes a query via a graphical user interface; features are generated from the graphical query and fed to the matching engine.

Relevance feedback: automatically adjusts the existing query using information fed back by the user about the relevance of previously retrieved objects.
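One classical way to implement the relevance-feedback step is a Rocchio-style update (my choice of technique; the slides do not specify one): move the query's feature vector toward the mean of the results the user marked relevant and away from the mean of those marked non-relevant.

```python
def rocchio_update(query, relevant, nonrelevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """One Rocchio-style relevance-feedback step on a feature vector.
    `relevant` and `nonrelevant` are lists of feature vectors fed back
    by the user; the weights alpha/beta/gamma are illustrative defaults."""
    def mean(vectors):
        if not vectors:
            return [0.0] * len(query)
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    rel, non = mean(relevant), mean(nonrelevant)
    return [alpha * q + beta * r - gamma * n
            for q, r, n in zip(query, rel, non)]

# The query drifts toward the relevant example's features.
print(rocchio_update([1.0, 0.0], [[0.0, 1.0]], []))  # [1.0, 0.75]
```

The adjusted vector is then re-submitted to the matching engine, so each feedback round refines the ranking.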

Page 22:

Video parsing and representation

Interaction with video using conventional VCR-like manipulation is difficult; we need to introduce structural video analysis.

Video parsing:
  Temporal segmentation into elemental units
  Compact representation of each elemental unit

Page 23:

Temporal segmentation

Fundamental unit of video manipulation: the video shot

Types of transition between shots:
  Abrupt shot change
  Fade: slow change in brightness
  Dissolve
  Wipe: pixels from the second shot replace those of the previous shot in regular patterns

Other sources of image change:
  Motion, including camera motion and object motion
  Luminosity changes and noise
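Abrupt shot changes are commonly detected by thresholding the difference between consecutive frame histograms. The sketch below uses an L1 distance on normalized histograms; the threshold value is illustrative and would need tuning to tolerate motion and luminosity changes:

```python
def detect_cuts(frame_hists, threshold=0.5):
    """Flag abrupt shot changes: frames whose histogram differs from
    the previous frame's by more than `threshold` (L1 distance).
    `frame_hists` is one normalized histogram per frame."""
    cuts = []
    for i in range(1, len(frame_hists)):
        d = sum(abs(a - b)
                for a, b in zip(frame_hists[i - 1], frame_hists[i]))
        if d > threshold:
            cuts.append(i)  # cut occurs between frames i-1 and i
    return cuts

# Two dark frames followed by two bright ones: one cut, at frame 2.
print(detect_cuts([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]))  # [2]
```

Gradual transitions (fades, dissolves, wipes) spread the change over many frames, so each pairwise difference stays below the threshold; detecting them requires comparing over longer windows.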

Page 24:

Representation of Video

Video database population has three major components:
  Shot detection
  Representative-frame creation for each shot
  Derivation of a layered representation of coherently moving structures/objects

A representative frame (R-frame) is used for:
  population: the R-frame is treated as a still image for representation
  query: R-frames are the basic units initially returned in a video query

Choice of R-frame: the first, middle, or last frame in the video shot, or a sprite built by seamlessly mosaicing all frames in the shot.

Page 25:

Video Soundtrack Analysis

Image/sound relationships are critical to the perception and understanding of video content. Possibilities:
  Speech, music, and Foley sound detection and representation
  Locutor (speaker) identification and retrieval
  Word spotting and labeling (speech recognition)
A possible query could be: "find the next time this locutor is again present in this soundtrack"

Video Scene Analysis

Typical movies contain 500-1000 shots per hour
One level above the shot: the sequence or scene (a series of consecutive shots constituting a unit from the narrative point of view)

