
A Realtime Hardware System for Stereoscopic Videoconferencing with Viewpoint Adaptation

Jens-Rainer Ohm1, Karsten Grüneberg1, Emile Hendriks2, Ebroul Izquierdo M.1,

Dimitris Kalivas3, Michael Karl1, Dionysis Papadimatos4, André Redert2

ABSTRACT

This paper describes a hardware system and the underlying algorithms that were developed for

realtime stereoscopic videoconferencing with viewpoint adaptation within the European

PANORAMA project. The goal was to achieve a true telepresence illusion for the remote partners.

For this purpose, intermediate views at arbitrary positions must be synthesized from the views of a

stereoscopic camera system with rather large baseline. The actual viewpoint is adapted according to

the head position of the viewer, such that the impression of motion parallax is produced. The whole

system consists of a disparity estimator, stereoscopic MPEG-2 encoder, disparity encoder and

multiplexer at the transmitter side, and a demultiplexer, disparity decoder, MPEG-2 decoder and interpolator with viewpoint adaptation at the receiver side. For transmission of the encoded signals, an ATM network is provided. In the final system, autostereoscopic displays will be used. The algorithms for disparity estimation, disparity encoding and disparity-driven intermediate viewpoint synthesis were specifically developed under the constraint of hardware feasibility.

1 Heinrich-Hertz-Institut Berlin, Germany. 2 Delft University of Technology, Netherlands. 3 INTRACOM, Greece. 4 University of Patras, Greece.

The work described herein was performed within the ACTS PANORAMA project, funded by the European Commission under grant AC092.

Corresponding author: Dr.-Ing. Jens-Rainer Ohm, Heinrich-Hertz-Institut, Image Processing Department, Einsteinufer 37, D-10587 Berlin, Germany. Phone: +49-30-31002-617, Fax: +49-30-392-7200, E-mail: [email protected]

I. INTRODUCTION

A telepresence videoconferencing system should give the users an illusion of true contact, bringing

participants together in a virtual space. This cannot be achieved by an ordinary stereoscopic image

acquisition/presentation chain. First, the users should not wear glasses, which implies that autostereoscopic displays have to be used [1]. Second, the viewer would expect the view angle to alter with movements of the head; this effect, called motion parallax, is very important for a true

illusion of being inside a 3D scene. To achieve this, viewpoint adaptation must be performed, which

means that the view angle on the display is altered automatically according to the viewer's head

movements [2]. Third, the stereoscopic cameras cannot be positioned in front of the display, which

implies that the baseline between the cameras must be at least 50 cm with a relatively small display,

and 80 cm with a larger display. Such a baseline is by far too large for rendering stereoscopic

images directly. The extreme differences between left- and right-view images do not correspond to

the small distance of human eyes, and the resulting stereo presentation would perturb the viewer.

Hence, it is necessary to synthesize intermediate-view stereo image pairs with smaller baseline

(fig.1). At the same time, a headtracking system can be used to adapt the actual viewpoint on the

interocular axis between the cameras, which gives the impression of natural motion parallax. A

prototype hardware system, which will perform these tasks, is presently being developed within the

framework of the European PANORAMA project.

The synthesis of natural-looking intermediate views can be done by interpolation from the left-

and right-view images, if the positions of corresponding points are known. This requires the

knowledge of depth information, which can be obtained by disparity estimation between the left-

and right-view images. The disparity vectors can then be used to project pixels onto an intermediate

image plane. However, a critical case is the presence of occlusion areas, where some parts of the

scene may only be found in the left- or in the right-view image. In these cases, instead of

interpolation, a unidirectional projection has to be performed. Furthermore, disparity estimation is

not a trivial task due to the rather large baseline. We have found that horizontal disparity ranges of

up to 120 pixels are necessary with a 50 cm baseline and a distance of 1.5 m between the camera and

the user. The vertical disparity shifts are much smaller. In the special case where a coplanar camera

geometry is used (which means that cameras shoot in parallel directions and are adjusted at the same


height), the vertical disparity is approximately zero all over the images. We decided to use this

geometry in order to simplify the algorithms with respect to hardware realization.

The paper is organized as follows. Section II introduces the concept of the whole system chain. In

section III, the disparity estimation algorithm is described. Section IV specifies the

scheme developed for encoding of the disparity map. The algorithm for interpolation synthesis is

introduced in section V. Section VI gives results of computer simulations and shows examples.

Section VII describes the hardware concepts for the different parts of the chain. In section VIII,

conclusions are drawn.

II. SYSTEM CHAIN

Fig.2 shows the complete system chain in schematic form. The transmitter side consists of data

acquisition (stereoscopic camera, microphone), disparity estimation processing and encoding. The

stereoscopic camera setup uses parallel camera geometry, which allows the disparity estimation search to be reduced to the horizontal shift component. The disparity estimation processing also

includes delays which are necessary to synchronize video, audio and disparity data prior to

encoding, and a special disparity command conversion which is described in more detail in section

IV. The left and right view image signals and the audio signal are encoded by separate,

commercially-available MPEG-2 encoders. However, it is necessary to provide a separate encoder

for the subsampled disparity fields that are output from the estimator. The system multiplexer,

which is compatible with a standard MPEG-2 multiplex, integrates the encoded disparities as

additional stream data, independent from video data. Furthermore, it is necessary to synchronize the

independent left- and right-image video encoders and the disparity data.

For transmission, a standard ATM network is provided. We are using constant rate transmission

in AAL 5. At the receiver side, demultiplexing of video, audio and disparity data is performed first.

Then, the separate elementary streams are fed into the appropriate decoders, of which only the

disparity decoder is a non-standard device again. Video and disparity data are then fed into the

intermediate viewpoint interpolator, which gets further information from a headtracking system

about the required viewpoint. The information of the headtracker is used at the same time to drive

the autostereoscopic display, which is a system based on projection onto a lenticular screen and

must be adapted according to the viewing angle [3].


The system is designed such that it can work in three modes, and hence can also be configured in

a flexible way for other purposes. In the direct mode, only data acquisition, estimator processing,

interpolator processing and data presentation are performed, which is the basic configuration for a

stereoscopic system with viewpoint adaptation. In coding mode, the chain is extended by the

encoders, multiplexer, demultiplexer and decoder, which enables compression of the required data.

In ATM mode, the data is additionally transmitted over the ATM network.

III. DISPARITY ESTIMATION

Disparity estimation is the most demanding task for the system hardware. To match the

corresponding points between left- and right-view images, disparity ranges of up to 120 pixels are necessary when the baseline is 50 cm and the distance between user and camera is 1.5 m. With an 80 cm baseline, this range would even increase to approximately 230 pixels. Due to the use of a parallel

camera geometry, the disparity estimator needs only to take into account the horizontal disparity

shift. On the other hand, the parallel setup has the disadvantage that the absolute disparity shift

between left and right images is much larger than the ranges given above; specifically, zero disparity occurs only for a point at infinite distance. As a consequence, a large portion at the left

side of the left image is not present in the right image, and a large portion at the right side of the

right image is not present in the left image (see fig.3). This circumstance is taken into account

during estimation by defining an additional disparity offset $d_{\mathrm{off}}$, and must also be treated during

interpolation (see section V.2).

In recent years, many different schemes for disparity estimation have been proposed.

Though feature-based [4,5,6] and dynamic-programming [7,8,9] approaches seem to perform very

well, we found them to be too complex for a hardware system with the requirement of large

disparity ranges even in the case of pure horizontal disparities. Matching approaches can be

classified as area-based schemes [10,11]. We have compared several algorithms (feature-based,

dynamic programming and matching) with respect to subjective quality results and hardware

feasibility, and decided to implement the hierarchical block matching scheme, which is described in

this section. This scheme easily copes with arbitrary disparity ranges, and performs robustly even in

the case of low correspondence between left- and right-view images, e.g. in partially occluded areas.

A criterion based on an absolute-difference feature is used to determine optimum positions of the


matching windows. At the same time, in a preprocessing stage, a simple foreground/background

segmentation is performed, which is used to refine the results of estimation.

The disparity estimation algorithm can be divided into four modules:

1. Preprocessing and segmentation. The goal of this stage is to find points with highest relevance

for matching, and to perform a raw subdivision into foreground and background areas.

2. Block matching with large block size for global bidirectional disparity estimation, followed by a

cross-consistency check.

3. Block matching with small block size for local bidirectional disparity estimation, followed by a

cross-consistency check.

4. Interpolation of dense L→R and R→L disparity fields, application of vertical median filters and

ordering-constraint check.

A flowchart describing the interrelation of the disparity estimator module blocks is given in fig.4.

Preprocessing and segmentation is performed on both input signals. Bidirectional (L→R and R→L)

sparse disparity fields are estimated in the global and refined in the local estimation stage. In order

to guarantee temporal continuity of the estimated disparities and to avoid temporally annoying

artefacts, the disparities estimated for the previous field pair are fed back to the estimator stages. For

this purpose, the dense field, generated at the final stage by bilinear interpolation, is used.

III.1. Preprocessing and segmentation

The preprocessing-and-segmentation stage uses a simple criterion based on pixel differences. To assess how well image points can be distinguished from their neighbours, we use a simple, difference-based interest operator (the so-called Moravec operator [12], see fig.5). This is applied

to both left- and right-view image fields. The directional difference in four directions (horizontal,

vertical and the two diagonals) is measured at each pixel position over a small square window of

size 5x5. In each of the four directions, we have five pixels, and four differences between adjacent

pixel pairs. In a first step, the sums-of-absolute-differences along all directions are calculated. The

output of the operator is then defined as the maximum of these four sum values (a code sketch follows the list below). The goal of this operation is twofold:

− The Moravec operator's output is used to detect the point of highest interest within each matching

block for the subsequent global block matching stage.


− A threshold analysis detects large, uniform areas, at which valid disparity vectors cannot be

estimated by a block matching strategy, because no true correspondences can be found. If we

process head-and-shoulder scenes with relatively uniform background, it is easy to interpret this

as a raw foreground/background segmentation. The classification is performed on a block basis,

where isolated areas with "wrong" classification are erased by comparison with their neighbors.

In this case, we end up with a unique foreground/background mask containing only two

segments.
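As an aside, the interest operator described above can be sketched in a few lines. The following Python version is a minimal illustration, with the function name, array layout and test data being our own assumptions rather than part of the original system.

```python
import numpy as np

def interest_operator(img, y, x):
    """Difference-based (Moravec-type) interest measure at pixel (y, x):
    for each of the 4 directions of a 5x5 window, sum the absolute
    differences of the 4 adjacent pixel pairs; output the maximum sum."""
    directions = [(0, 1), (1, 0), (1, 1), (1, -1)]   # (dy, dx) unit steps
    sums = []
    for dy, dx in directions:
        s = 0
        for k in range(-2, 2):                       # 4 pairs among 5 pixels
            p0 = int(img[y + k * dy, x + k * dx])
            p1 = int(img[y + (k + 1) * dy, x + (k + 1) * dx])
            s += abs(p1 - p0)
        sums.append(s)
    return max(sums)

img = np.random.randint(0, 256, (32, 32), dtype=np.uint8)
print(interest_operator(img, 16, 16))
```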

Actually, this system was only optimized for head-and-shoulder scenes with uniform background. In

the future, it would be possible to replace this part by a more sophisticated algorithm, which enables

more precise knowledge of the borders between foreground and background areas, where abrupt

changes in the disparities will occur due to occlusion effects. In the case where no reasonable

foreground/background segmentation mask is found (the "foreground region" of the segmentation

mask then covers the whole image), those features of the estimator, which are related to the

segmentation mask, are switched off, while the overall system still keeps working with fair quality.

Figures 6 and 7 show the extracted foreground regions from the first (left-view) frame of the sequences ANNE and CLAUDE, and the highest-variance points which are used as matching correspondences by the following global block matching stage.

III.2 Global disparity estimation

In order to reduce noise sensitivity and simultaneously reach higher efficiency, both the left and

right image fields are subsampled by a factor of two in the horizontal direction. Only those

subsampled fields, which are now divided into blocks of size 16x16 pixels, are used during the global estimation step.

We use the point of highest interest within each block, which has been determined by the

preprocessing module, and match a reference window of size MxN pixel around this point (fig.8).

This means that the sampling position with highest omnidirectional difference inside the block is

chosen as representative point for the entire block. Furthermore, matching is performed only for

those blocks which are part of the foreground region from the segmentation mask (if present), which

means that blocks within uniform background areas are not considered at all during the matching

process. If the foreground region covers the whole image, matching is performed on all blocks.


Let $z=(x,y)$ be the sampling position of the particular highest-interest point (center of the reference window) in the left field that has been chosen to be matched. A full-search block matching in the horizontal direction is performed in order to find the corresponding block centered around the point $\tilde{z} = (x - d_z^{(t)} - d_{\mathrm{off}},\, y)$ in the right field. Herein, $d_z^{(t)}$ denotes the absolute-valued disparity vector from left to right at time $t$, and $d_{\mathrm{off}}$ the predefined disparity offset (note that the left-to-right shift is always negative with a parallel camera setup). The reference window is compared with all corresponding windows (of size $M \times N$ as well) along the given horizontal search interval, defined by the disparity search range and the disparity offset. The disparity range is from 0 to 63, such that a maximum disparity shift of 126 pixels plus offset can be estimated with respect to the non-subsampled image field.

In order to select the best match among the allowed displacements, a matching criterion based on temporal smoothing and the mean absolute difference (MAD) is used. The MAD is given as

$$\mathrm{MAD}\left(d_z^{(t)}\right) = \frac{1}{M \cdot N} \sum_{i=-M/2}^{M/2} \sum_{j=-N/2}^{N/2} \left| I_l(x+i,\, y+j) - I_r\!\left(x+i-d_z^{(t)}-d_{\mathrm{off}},\, y+j\right) \right| . \tag{1}$$

Using the MAD, the cost function is defined as

$$F\left(d_z^{(t)}, d_z^{(t-1)}\right) = \mathrm{MAD}\left(d_z^{(t)}\right) + \alpha \left| d_z^{(t)} - d_z^{(t-1)} \right| , \tag{2}$$

with $d_z^{(t)}$ as the current displacement vector, $d_z^{(t-1)}$ the temporal prediction vector, and the weight coefficient $\alpha$, which should be set to a value of approximately 0.2. The block sizes were set to $M=13$ horizontally and $N=9$ vertically for the global matching stage (note that $M \cdot N = 117$). To realize a simple hardware structure, the quotient $1/117$ was omitted from (1), and $\alpha$ was set to 16 in (2). The temporal prediction vector $d_z^{(t-1)}$ is taken from the same position $z$ in the previously-estimated disparity field at time $t-1$. The previous-field dense disparity maps, which are stored internally, are available from the dense-field interpolation module.
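To make (1)-(2) concrete, the following Python/NumPy sketch performs the full search for a single reference point. The array layout, parameter defaults and synthetic test are our own assumptions; as noted above, the actual hardware drops the 1/117 normalization and uses α=16 instead.

```python
import numpy as np

def match_point(left, right, x, y, d_off, d_prev,
                alpha=0.2, M=13, N=9, d_range=64):
    """Full-search L->R matching at reference point (x, y): returns the
    disparity d in [0, d_range) minimizing MAD(d) + alpha*|d - d_prev|,
    i.e. the cost function (2) built on the MAD of eq. (1)."""
    hM, hN = M // 2, N // 2
    ref = left[y - hN:y + hN + 1, x - hM:x + hM + 1].astype(float)
    best_d, best_cost = 0, float('inf')
    for d in range(d_range):
        xs = x - d - d_off               # candidate center in the right field
        if xs - hM < 0 or xs + hM >= right.shape[1]:
            continue
        cand = right[y - hN:y + hN + 1, xs - hM:xs + hM + 1].astype(float)
        cost = np.mean(np.abs(ref - cand)) + alpha * abs(d - d_prev)
        if cost < best_cost:
            best_cost, best_d = cost, d
    return best_d

L = np.random.randint(0, 256, (64, 256))
R = np.roll(L, -10, axis=1)              # synthetic horizontal shift of 10 px
print(match_point(L, R, x=128, y=32, d_off=0, d_prev=0))   # -> 10
```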

For each position $\tilde{z}$ within the search interval, the function value $F(d_z^{(t)}, d_z^{(t-1)})$ is calculated. The particular sampling position $\tilde{z}$ which minimizes this cost function is the corresponding point of $z$. Once $\tilde{z}$ has been estimated, the same procedure is repeated from right to left, using $\tilde{z}$ as the reference sampling position in the right image, which means that the reference window of size $M \times N$ is now centered at this position. The search window of the same size is placed on the left image and shifted within a search interval of 64-pixel width again. Now, the temporal prediction is taken from the R→L dense disparity memory. The correspondence search is then carried out without further consideration of the previously-found L→R disparity $d_z^{(t)}$. The MAD is now given as

$$\mathrm{MAD}\left(d_{\tilde z}^{(t)}\right) = \frac{1}{M \cdot N} \sum_{i=-M/2}^{M/2} \sum_{j=-N/2}^{N/2} \left| I_r(\tilde x+i,\, y+j) - I_l\!\left(\tilde x+i+d_{\tilde z}^{(t)}+d_{\mathrm{off}},\, y+j\right) \right| . \tag{3}$$

Using the MAD, the cost function is defined as

$$F\left(d_{\tilde z}^{(t)}, d_{\tilde z}^{(t-1)}\right) = \mathrm{MAD}\left(d_{\tilde z}^{(t)}\right) + \alpha \left| d_{\tilde z}^{(t)} - d_{\tilde z}^{(t-1)} \right| . \tag{4}$$

Let us denote the estimated L→R disparity with reference sampling position $z$ as $d_z^{(t)}$, and the estimated R→L disparity with reference sampling position $\tilde z = (x - d_z^{(t)} - d_{\mathrm{off}},\, y)$ on the right image as $d_{\tilde z}^{(t)}$. Then, a bidirectional consistency check [13] is performed in order to reject outliers. If the vector difference condition

$$\left| d_z^{(t)} - d_{\tilde z}^{(t)} \right| \le 1 \ \mathrm{pel} \tag{5}$$

is violated, the two vectors $d_z^{(t)}$ and $d_{\tilde z}^{(t)}$ are eliminated from both sparse disparity fields. This verification provides a reliability criterion for the disparity estimation, such that the remaining disparity estimates can be considered correct.
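A minimal sketch of the check (5), assuming for simplicity that the sparse fields of one scanline are held as dictionaries mapping a reference position to its estimated disparity (names and data layout are ours):

```python
def cross_check(d_lr, d_rl, d_off=0, tol=1):
    """Bidirectional consistency check, eq. (5): d_lr maps a left-field
    position x to its L->R disparity, d_rl a right-field position to its
    R->L disparity; inconsistent pairs are removed from both fields."""
    for x in list(d_lr):
        x_r = x - d_lr[x] - d_off          # corresponding right position
        if x_r not in d_rl or abs(d_lr[x] - d_rl[x_r]) > tol:
            d_rl.pop(x_r, None)
            del d_lr[x]

d_lr = {100: 20, 140: 22}
d_rl = {80: 20, 118: 30}                   # 140 -> 118, |22 - 30| > 1 pel
cross_check(d_lr, d_rl)
print(d_lr, d_rl)                          # {100: 20} {80: 20}
```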

III.3 Local disparity estimation

Local disparity estimation is also a block matching procedure, but is applied to the full-resolution

(not-subsampled) image fields. The block center positions z=(x,y) are now 4 pixel apart in the

horizontal and vertical directions. The reference windows have a size of M=9 pixel horizontally and

N=5 pixel vertically, but the position z is always at the block center, such that adjacent windows

overlap by a regular value. Instead of using full search (as in global estimation), only the range


defined by candidate vectors is tested, with an additional search range of ±2 pixels horizontally beyond the minimum and maximum candidates. We are using 10 candidates:

− 6 from the output of the global estimation, unless they are part of the background segment;

− 3 candidates from neighboring blocks, which were already calculated during local estimation;

− 1 from the temporally-preceding displacement field at the same spatial position.

The positions of candidates and the procedure of search range determination are illustrated in fig.9.

Herein, the matching windows used during global matching are marked as hatched regions, with the

center anywhere within the global block areas of size 32x16 (this is the not-subsampled equivalent

of the 16x16 block size used during global estimation). Global candidates are the one belonging to the active area (of which the actual local block is part), plus its left and right neighbors and the three neighbors below. It may happen that neither a global nor a local candidate exists. This will

be the case, e.g. when all candidates are within uniform background areas, or if they did not pass the

bidirectional consistency check. In addition, the search range is cut by all disparities which would

point into the background part of the opposite field's segmentation mask. In the case where less than

4 positions would have to be checked, no matching is performed, and the positions in the sparse

disparity field are a priori marked as INVALID.

Those candidates, which originated from global matching, have to be multiplied by two, because

they were calculated on the basis of subsampled images. The search range of the local matching

procedure is determined on the basis of all candidates. For this purpose, the minimum (MIN) and

maximum (MAX) disparity values among the candidates are determined, and the search range

reaches from MIN-2 to MAX+2, but is limited between 0 and 127.
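A sketch of this search-range determination under the rules just described; the candidate bookkeeping is simplified and all names are our own:

```python
def local_search_range(global_cands, local_cands, temporal_cand,
                       margin=2, d_min=0, d_max=127):
    """Derive the local-matching search interval from candidate vectors.
    Global candidates are doubled (they stem from 2:1 subsampled fields);
    the interval [MIN - margin, MAX + margin] is limited to [0, 127], and
    a block with fewer than 4 testable positions is marked INVALID."""
    cands = [2 * d for d in global_cands] + list(local_cands)
    if temporal_cand is not None:
        cands.append(temporal_cand)
    if not cands:
        return None                        # INVALID: no candidates at all
    lo = max(d_min, min(cands) - margin)
    hi = min(d_max, max(cands) + margin)
    if hi - lo + 1 < 4:
        return None                        # INVALID: too few positions
    return lo, hi

print(local_search_range([20, 22], [41, 44], 43))   # -> (38, 46)
```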

The rest of the procedure is very similar to global estimation (1)-(4), with the exception of the search ranges and block sizes. Again, the search criterion is a combination of MAD and temporal smoothness with approximately the same α-parameter (≈0.2) in (2),(4). In order to omit the division in (1),(3), where the quotient would now be 1/45, α was set to a value of 8.

Local displacement estimation is also performed bidirectionally, in order to apply the cross-

consistency check (5) on the estimation result. As the first step, L→R disparity estimation is

performed. Hence, the positions z in the L image field are equidistant, while the best-matching

positions in the R image field are not necessarily equidistant, but anywhere on the same line. Sparse

disparity fields are generated with valid values at each fourth row. For R→L disparity estimation,

the same search range is used as for L→R estimation.


Regarding the output of the estimation, the number of disparity values calculated per image field

is not fixed. This is caused by possible INVALID values in the sparse disparity field. The maximum

possible number of estimated disparity values is 1/16 of the image field size.

Another object-based postprocessing step is performed if a foreground/background segmentation

mask exists. Among the first four valid disparity values at the left and right sides of the foreground part of the mask, it is checked whether the absolute disparity decreases smoothly towards the outermost one. If this is not the case, the values which violate this condition are eliminated (set to INVALID; indicated by broken lines in fig.10).

III.4 Generation of rowwise-dense disparity fields

After estimation of disparity values at sparse positions, as finally performed by the local estimation procedure, the dense disparity fields are generated by bilinear interpolation. Herein, we

simply ignore the INVALID positions within the sparse disparity fields and generate the dense fields

only from valid ones. The bilinear interpolation is at this stage performed only horizontally, within

the rows where the local disparity estimation derives disparity values. This is called the rowwise-

dense disparity field, which is defined for rows 2,6,10,... of the image fields. This rowwise-dense

field is used exclusively for the feedback (temporal prediction) of disparities during estimation.

Furthermore, an extract from both rowwise-dense (L→R and R→L) fields is transformed into a

unique command map representation for encoding, which will be described in the next section.

After the interpolation, a vertical 7-tap median filter is applied to both disparity fields at these

rows, i.e. the median mask contains the values at the same x-positions from the three rows below and above the actual point in the rowwise-dense disparity field (see fig.11). This median filter is necessary to reject outliers, and to introduce vertical dependencies between the estimated disparity values, which were so far calculated more or less independently of each other. If there are any

remaining rows in the rowwise-dense field, which do not contain any valid disparities at all (this

happens, e.g. in the case of a present segmentation mask, usually at the top of the image), these are

filled by a vertical linear interpolation, again starting with a zero value at the top border.
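The two steps can be sketched as follows, assuming one scanline is held as a list of (x, disparity) samples with zero-disparity borders (a simplification of the real data layout):

```python
import numpy as np

INVALID = None

def rowwise_dense(sparse_row, width):
    """Linear interpolation of one sparse disparity row; INVALID samples
    are ignored and the borders are anchored at zero disparity."""
    valid = [(x, d) for x, d in sparse_row if d is not INVALID]
    xs = [0] + [x for x, _ in valid] + [width - 1]
    ds = [0.0] + [float(d) for _, d in valid] + [0.0]
    return np.interp(np.arange(width), xs, ds)

def vertical_median7(rows):
    """Vertical 7-tap median over the rowwise-dense field: at each x, the
    median of the value itself and the three rows below and above it."""
    rows = np.asarray(rows, dtype=float)
    out = rows.copy()
    for i in range(3, rows.shape[0] - 3):
        out[i] = np.median(rows[i - 3:i + 4], axis=0)
    return out

row = [(10, 4), (14, INVALID), (18, 8)]   # (x, disparity) samples, 4 px apart
print(rowwise_dense(row, 24).round(1))
```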

In fig.12, images representing the horizontal component of the dense disparity field pointing from left to right are displayed for the tenth frame pair of the sequences ANNE and CLAUDE. Low gray levels

represent large negative horizontal vector components, whereas high gray levels represent large

positive horizontal vector components. A vector with horizontal component 0 is represented by the

gray value of 128.


IV. DISPARITY MAP ENCODING

In our system, the information about the disparity map must be transmitted along with the encoded

data stream for the left and right camera signals. This is reasonable, because in multipoint

videoconferencing it would be superfluous to determine the disparity parameters for all participants

at each site. For the purpose of disparity map encoding, a new type of command map representation

has been developed, which is easily compressible and bears information about both (L→R and

R→L) disparity maps [14]. To transform disparities of the rowwise-dense field into the command

map representation we are using, several constraints are imposed on the disparity fields:

− Only horizontal shifts can be treated (this was already a constraint from estimation);

− The disparity map must obey the ordering constraint, i.e. disparities within one line must not cross each other;

− The disparity map must be absolutely dense, i.e. no special treatment of occlusions is performed (this demand is already fulfilled by the interpolation described in the last section).

IV.1 The Disparity Command map

If $D_l^{(t)}(x)$ is the L→R absolute-valued disparity at pixel position $x$ and time $t$, and $D_l^{(t)}(x-1)$ that of the previous pixel position (likewise $D_r^{(t)}$ for the R→L field), then the ordering constraint is violated if

$$D_l^{(t)}(x) - D_l^{(t)}(x-1) < -1 \quad ; \quad D_r^{(t)}(x) - D_r^{(t)}(x-1) > 1 . \tag{6}$$

If the constraint is violated once at position $x_1$, the check must be iterated setting $x \leftarrow x+1$, until a value is reached which no longer violates the condition

$$D_l^{(t)}(x) - D_l^{(t)}(x_1) < x_1 - x \quad ; \quad D_r^{(t)}(x) - D_r^{(t)}(x_1) > x - x_1 . \tag{7}$$

Fig.13 shows an example of a disparity map violating the ordering constraint (a) and the correction

(b). The command map, which is a transformed representation of the ordering-constraint checked

disparity map, indicates where the correspondences can be found between the left and right image

field. Two commands "match left" (ML) and "match right" (MR) are used, that indicate the


propagation of corresponding points along the scanlines of the left and right image fields. If the

correspondence "halts", e.g. a number N of points of the left image are referenced to the same point

in the right image, N subsequent ML commands are produced. This case we call a "left contraction".

The other way round, in the case of a "right contraction", several subsequent MR commands are

released. In the "normal" case, where disparity remains constant along the scanline, the sequence

alternates ML-MR-ML-MR-... . An example is illustrated in fig.14. Starting at the left border of the

images, which is position 0 in the right image, there is one correspondence to the left image at the

same position (this will always be present, and must not be explicitly encoded), and 4 more

correspondences to subsequent positions in the left image. Hence, the command map starts with

four ML commands, which characterize the four additional correspondences of four left positions to

the right position zero. Then, the correspondence proceeds one position in the left image (ML), and

in the right image, too (MR). This happens once more, but then two additional MRs fit to the same

left position. Finally, there is again one ML and three MRs. The complete command sequence for

the example shown here is ML-ML-ML-ML-ML-MR-ML-MR-MR-MR-ML-MR-MR-MR. The total

number of commands per scanline is 2⋅xsize-2, where xsize is the number of pixels in the scanline.
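The staircase walk that produces such a sequence can be sketched as follows; the list of correspondence pairs reproduces the example of fig.14 (the data layout and names are ours):

```python
def commands_from_path(pairs):
    """Emit ML/MR commands for a monotone list of (left, right)
    correspondences starting at (0, 0); the initial match is implicit
    and not encoded. ML advances the left position, MR the right."""
    cmds = []
    l, r = 0, 0
    for nl, nr in pairs[1:]:
        cmds += ['ML'] * (nl - l) + ['MR'] * (nr - r)
        l, r = nl, nr
    return cmds

# correspondence path of the example: left pixels 0-4 all match right
# pixel 0, then the path proceeds as described in the text
path = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 1),
        (6, 2), (6, 3), (6, 4), (7, 5), (7, 6), (7, 7)]
print('-'.join(commands_from_path(path)))
# ML-ML-ML-ML-ML-MR-ML-MR-MR-MR-ML-MR-MR-MR
```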

The transformation from the disparity map to the command map is very simple. Suppose we are

using the L→R disparity field. Since the estimator does not allow any disparities to point outside the

image, and the dense field interpolation always starts with zero disparity at the borders, the disparity

value of the first position of the scanline (x=0) will be zero. Now, whenever the disparity value at position x+1 is larger than the value at x, we have the case of left contraction; if it is smaller, we have the case of right contraction; if it is equal, we have the normal case. Specifically, the number of ML commands to be released before the next MR command is equal to $D_l^{(t)}(x+1) - D_l^{(t)}(x) + 1$ in the case of a zero or positive difference, and the number of MR commands to be released before the next ML is equal to $D_l^{(t)}(x) - D_l^{(t)}(x+1) + 1$ in the case of a negative difference. It is easy to see that this automatically corrects disparity fields violating the ordering constraint (6),(7). For the R→L field, everything inverts; a smaller disparity value at the next position indicates left contraction, a higher value right contraction. It is left to the interested reader to determine the setting of ML and MR commands.

Basically, both disparity fields contain approximately the same information, and are highly

redundant due to the application of the bidirectional consistency check. Major differences may be

present in the areas of heavy contractions (being in most cases originally occlusions), which are


indicated by many L→R (R→L) vectors pointing to only one or a very small number of pixels in the

right-view (left-view) image. Indeed, we found that R→L disparities produced by the estimator

algorithm are more reliable at left occlusions, while the L→R disparities produce better interpolated

image quality at the right occlusions. Since we deal with videoconferencing sequences, we can

employ a very simple model for head-and-shoulder scenes, which is based on the convex surface of

the human head and body [11]. Then, it is clear that left occlusions can occur only left from the

center of the foreground shape, while right occlusions will be present only at the right side from this

point. Hence, we can divide each scanline of disparity values into two parts, which are marked by

the mid position of the active area under the foreground object (fig.15). For the left part, the R→L

disparity field is used, while for the right part, the L→R field is better suited. The split position is

determined only from one of the masks (preferably R, since we start with this field), because the mid positions need not necessarily coincide. In addition, it is required to perform an ordering-

constraint check at the crossover point, taking into account both fields.

If no segmentation mask is present, the system switches to a fallback mode which uses only one disparity field. Since the estimation process starts with the L→R field, we chose to use this one.

IV.2 Encoding of the Command map

The command map is a compact representation of a disparity field checked for ordering constraint,

and has an extremely small amount of inter-symbol redundancy [14]. With two commands, we need

2⋅xsize-2 bits per scanline to represent the command map. With CCIR-601 video (xsize = 720), this results in a transmission rate of 5.177 Mb/s if we transmit disparity values only at each fourth scanline.
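The rate follows from simple arithmetic: each 288-line field carries disparity data on every fourth line (72 lines), and there are 50 fields per second:

$$(2 \cdot 720 - 2)\,\tfrac{\text{bits}}{\text{line}} \times 72\,\tfrac{\text{lines}}{\text{field}} \times 50\,\tfrac{\text{fields}}{\text{s}} = 1438 \times 3600\ \tfrac{\text{bits}}{\text{s}} \approx 5.177\ \text{Mb/s}.$$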

Since we have a maximum of 4 Mb/s allocated for lossless transmission of disparity values in our overall system, a further reduction of the data rate is necessary. We have found a Lempel-Ziv based algorithm [15] capable of a further rate reduction by a factor of approximately two; this algorithm can easily be implemented in realtime hardware, using a conventional microcontroller.

The reason why further redundancy reduction is possible is firstly the limited disparity range, which does not allow arbitrary command map sequences; even more important, however, are the interline, intraline and temporal redundancies of the disparities, which are caused by the smoothness of object surfaces and allow the code to adapt to specific preferable ML/MR sequences.

The encoding algorithm takes into account these characteristics. The update heuristic uses the

frame being coded and the previous frame to update the Lempel-Ziv algorithm sequence memory.

Sequences of ML/MR commands from lines adjacent to the line being coded and corresponding


lines in the previous frame have been assigned a gradually increased priority over sequences gained

from the rest of the sequence memory. Thus, during the coding process, these sequences are used

first to compress the incoming command sequences. Tests have shown that even when the memory is limited to these "near" sequences, the compression is not greatly affected: the compression degradation incurred through the reduction of possible sequence memory positions is more than offset by the reduction of the absolute number of positions which must be represented by specific codes. Limiting the sequences also reduces the hardware complexity.
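As a rough plausibility check (not the project's actual coder), a generic Lempel-Ziv based compressor such as zlib already exploits this kind of interline redundancy; the synthetic command map and the bit packing below are our own assumptions:

```python
import zlib

# synthetic command map: 72 nearly identical scanlines of 1438 commands,
# packed one command per bit (ML = 0, MR = 1)
line = ([0, 1] * 719)[:1438]
field = []
for i in range(72):
    row = line.copy()
    row[10 + i % 5] ^= 1              # small line-to-line variation
    field.extend(row)

raw = bytearray()
for i in range(0, len(field), 8):
    byte = 0
    for b in field[i:i + 8]:
        byte = (byte << 1) | b
    raw.append(byte)

print(len(raw), '->', len(zlib.compress(bytes(raw), 9)))
```

On real command maps the redundancy is of course much weaker than in this toy example; the factor of approximately two quoted above refers to the dedicated scheme.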

V. INTERMEDIATE VIEWPOINT INTERPOLATION

The task of the interpolator is to generate two images from virtual cameras, based on two images

from real cameras and their disparity map. The position of the virtual cameras should be related to

the position of the viewer’s eyes. We have developed an interpolation concept which can decide

dynamically, based on the degree of contraction, which areas of the intermediate image are truly

interpolated, i.e. taken from both images, and which areas are possibly subject to occlusions and

hence must only be projected from the corresponding area of one of the left- or right-view images.

The unique representation of disparities by the command map has three advantages that can be

exploited by the interpolator:

− the vertical interpolation of the missing rows of disparities (the rowwise-dense map being

defined so far only at each fourth row) is very simple using the command map;

− the determination of pixel addresses for the corresponding points, necessary during interpolation, can be performed only by increment operations;

− it is very easy to check the degree of contraction (multiple correspondences to one point), to

decide whether it is more appropriate to perform extrapolation from one image instead of

interpolation from both.

Fig.16 shows the block diagram of the interpolator. We can recognize four main parts:

− Command expander

− Parameter controller

− Left eye image generator

− Right eye image generator


The command expander transforms the four times subsampled command map to a dense (not

subsampled) map by linear interpolation. The parameter controller generates control parameters for

the image generators, e.g. the relative position of the virtual cameras. The right- and left-eye image

generators are the actual image interpolators.

V.1 Command expander

The task of the command expander is to transform a four times vertically subsampled command

map into a dense command map. To do this we need both interpolation and extrapolation of the

command map. In the following example we will explain how to use these two options. Scanlines

are numbered from 0 to 287 for each field.

Since the original command map is sampled, for example, at scanlines 2, 6, 10, …, and 286, lines 0 and 1 have to be extrapolated from line 2, line 287 has to be extrapolated from line 286, and all other lines have to be interpolated from the two lines that enclose them. For the interpolation we use linear interpolation. This operation is very

easily done with the two-command map. Let us consider two consecutive scanlines in the

subsampled command map at positions y and y+4. We are going to create an intermediate command

scanline at relative position α. For α=0, the scanline at position y is found, for α=1 the scanline at

y+4. So, for our application, the values α = ¼, α = ½ and α = ¾ are used.

The command map scanlines describe disparity paths. Fig.17 shows the two original disparity paths

A and B in black and an interpolated one I in dark grey with α = ½. The axes are the horizontal

positions of the left and right image scanlines. The disparity paths go from (0,0) to (719,719) in this

diagram.

The method of interpolation is very simple. We start at (0,0) for all three paths at time instant

zero. At every time instant, we execute one command of each original command scanline. We call these commands ACOM and BCOM. They can be either ML or MR. An ML means a step in the

left image scanline, in fig.17 this is a horizontal step from left to right. An MR means a step in the

right image scanline, in the figure this is a vertical step from bottom to top. For the I command

scanline we construct a command by ICOM = (1-α) * ACOM + α * BCOM. This gives us the

analytical interpolated I path.

For values of α between 0 and 1, intermediate types of commands other than ML or MR arise for the I path. As a consequence, the analytical I path does not, in general, fit onto the grid. It can no longer be described by the normal commands ML and MR. We solve this by


rounding off the path to the nearest grid point. The disparity path in light grey shows a rounded

version of the original I path in dark grey.

Although the rounded light grey path is used for output, the analytical dark grey path has to be calculated as well; otherwise, round-off errors in the path accumulate very quickly. As the analytical

path is constructed step by step at each time instant, the rounded path can be constructed by

choosing either ML or MR. The goal is to minimize the distance between the endpoints of the

current analytical and rounded path.
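A sketch of this interpolation-with-rounding procedure, with command scanlines modeled as Python lists; the tie-breaking in the rounding step is our own choice:

```python
def interpolate_commands(A, B, alpha):
    """Interpolate an intermediate command scanline I between command
    scanlines A and B (lists of 'ML'/'MR') at relative position alpha.
    The analytical path advances by the alpha-weighted mix of both
    commands; ML or MR is emitted so that the rounded path stays
    closest to the analytical one (no accumulation of round-off)."""
    ax = ay = 0.0                  # analytical endpoint (left, right axes)
    rx = ry = 0                    # rounded (output) endpoint
    out = []
    for ca, cb in zip(A, B):
        ax += (1 - alpha) * (ca == 'ML') + alpha * (cb == 'ML')
        ay += (1 - alpha) * (ca == 'MR') + alpha * (cb == 'MR')
        if abs(rx + 1 - ax) + abs(ry - ay) <= abs(rx - ax) + abs(ry + 1 - ay):
            out.append('ML'); rx += 1
        else:
            out.append('MR'); ry += 1
    return out

A = ['ML', 'ML', 'MR', 'MR']
B = ['ML', 'MR', 'ML', 'MR']
print(interpolate_commands(A, B, 0.5))
```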

V.2 Parameter controller

Fig.18a/b shows as an example the left and right camera image of the MAN sequence, first frame.

Fig.18c shows the associated command map disparity field, where the ML command is indicated as

black, MR as white, and the "normal" alternation of ML-MR as grey.

Fig.19 illustrates, for the example of one scanline at the tip of the nose of these images, some

important preliminaries that have to be observed by the interpolator and must be regulated by the

parameter controller. The scanline showing grey values from the left camera is located on top, that

from the right camera at the bottom. All corresponding points are connected as they were found by

the disparity estimator. The representation of fig.19 is very closely related to image interpolation :

Every horizontal cross section gives the luminance values along the scanline of a virtual camera at a

specific position between the real cameras. This relative position is indicated by the parameter S,

which varies from -½ (position of left camera) to ½ (position of right camera).

For the stereoscopic presentation, we take two virtual cameras with positions S1 and S2, S1 < S2,

which are the positions of the virtual left and right cameras, respectively. If a stereoscopic scene

is to appear behind the display, the nearest point must have a disparity shift of zero. However, in

our parallel camera configuration, the nearest point has the largest disparity shift dmax, such that it is

necessary to introduce a shift correction. Moreover, we must guarantee that the position of that point

on the screen remains the same when the view angle is changed. Hence, the shift relative to the right

camera at a specific position S, as it is shown in fig.19, must be

SHIFT d S= ⋅ −max ( )12 (8)

pixels to the right. In this example, the nearest point is the lightest value, which is the tip of the

nose.


Now, the algorithm has two input parameters: the relative virtual camera position S and the shift

parameter SHIFT. The S parameter is a real number between -½ and ½, the shift parameter an

integer that can be both positive and negative, depending on whether the left or right camera position is interpolated. Due to the horizontal shift, it is possible that some pixels on the virtual scanline are not

in the visible region. In the example of fig.19, this is the case for the rightmost pixels. Using

information from the visible region to define the luminance and chrominance of these pixels is not

feasible. Therefore, we chose to set these pixels to black. Specifically, it is necessary to set dmax pixels to black at the left side of the left-view image (if S=-½ is selected), and to do the same with dmax pixels at the right side of the right-view image (if S=½ is selected). At intermediate positions, the number of black pixels BLACKleft at the left side is equal to SHIFT from (8), and the number of black pixels BLACKright at the right side is equal to dmax-SHIFT.

The parameter controller has to generate two sets of four parameters each, one for the left eye image generator and one for the right eye image generator. The headtracker information is vital for this. The headtracker supplies the real numbers X, Y and Z. At this moment, we are

using only the X component (left/right head position), which has the same scale as S.

Furthermore, we need to define the real number DIST that is related to the eye distance and the

camera baseline. This gives the relative distance between the viewpoints S of the left and right

virtual cameras. A value of DIST=0.05 is presently set. We are still investigating a technique to adapt this

parameter to the Z information of the headtracker, which is the distance between display and viewer.

We now redefine the SHIFT parameter given in (8), such that zero shift is now obtained at

position S=0. All generated images with S<0 are then shifted towards left, and those with S>0 are

shifted towards right. At the same time, we assume that the blackening of pixels is performed after

the shift, such that always dmax/2 pixels are set to black at both sides of the images. The following

control parameters are then used:

Left control parameters:

$$S_1 = x - \tfrac{1}{2} \cdot \mathrm{DIST} \; ; \qquad \mathrm{SHIFT}_1 = d_{\max} \cdot S_1 \tag{9}$$

Right control parameters:

$$S_2 = x + \tfrac{1}{2} \cdot \mathrm{DIST} \; ; \qquad \mathrm{SHIFT}_2 = d_{\max} \cdot S_2 \tag{10}$$


Note that the two S parameters should always be in the range of [-½,½]. For extreme DIST values, S

has to be clipped. Fig.20 shows the effect of the linear relation between the S and SHIFT parameter.

The black parameters are chosen just large enough to ensure that always the same set of pixels on

the virtual scanline is visible, independent of position S.
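A minimal sketch of this control logic, combining (9)-(10) with the clipping of S mentioned above; the DIST default follows the text, while d_max is a scene-dependent input (here set arbitrarily):

```python
def control_parameters(x, dist=0.05, d_max=120):
    """Compute the (S, SHIFT) parameter sets of eqs. (9)-(10) from the
    headtracker X position; S is clipped to [-1/2, 1/2]."""
    clip = lambda s: max(-0.5, min(0.5, s))
    s1 = clip(x - dist / 2)        # left virtual camera
    s2 = clip(x + dist / 2)        # right virtual camera
    return (s1, d_max * s1), (s2, d_max * s2)

print(control_parameters(0.3))     # ((0.275, 33.0), (0.325, 39.0))
```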

V.3 Image interpolators

Fig.21 shows one scanline of left and right images, their associated disparity field and an

intermediate virtual image. For each 2 pixels in all images, we have two Y values, one U and one V

value.

For each virtual Y-pixel, we determine which disparity vectors cross its area. Of those, we select

the ones most to the left. In the figure these are indicated by grey lines. After selecting a disparity

vector, we determine which Y-pixels of the left and right image are referenced by this vector, and

use a weighted average to create the value of the virtual Y-pixel:

$$Y_{\mathrm{virtual}} = W_{\mathrm{left}} \cdot Y_{\mathrm{left}} + W_{\mathrm{right}} \cdot Y_{\mathrm{right}} \tag{11}$$

The same procedure is applied to the U- and V-pixels, the only difference being that these components have only half the horizontal size, so that twice the number of disparity values is defined for each pixel area. The generation of the virtual Y-, U- and V-pixels consists of separate processes that can be done in parallel.

In formula (11), two weights Wleft and Wright were introduced. Since the condition Wleft+Wright=1 must apply, we need to specify only one. For example, setting Wleft=½ would make the interpolator simplest; however, to obtain better image quality, it is necessary to take into account the position parameter S and the contractivity of the disparity field described by the command map. The first can be done by setting

$$W_{\mathrm{left}} = \tfrac{1}{2} - S \,, \tag{12}$$

which for extreme virtual camera positions |S| ≈ ½ produces much better results. An extension to

this scheme is to make the weights dependent on the disparity field. If done correctly, it is possible

to adjust the weights in such a way that in occlusion areas data is only taken from one image. In a

two-command disparity map, we cannot see the difference between an occlusion and a strong


contraction. However, in both cases it would be wise to take data only from one of the left and right

images.

To accomplish this, we use again the notion of the disparity path. Fig.22 shows a disparity path

representation of the disparity map scanline in fig.19. The horizontal and vertical axes are the

horizontal positions of the left and right image scanlines, respectively. Each point in this

representation corresponds to a possible match. The grey values indicate the probability of each

possible match. The disparity path goes from (0,0) to (719,719) for CCIR601 images. Each disparity

vector is one white point. The disparity range (allowed minimum and maximum disparities) is also

shown in white.

Next, we introduce the real number δ that indicates, for each disparity vector, the average direction of the disparity path in a window around that vector. The length of the window is the even integer N, and count(ML) is the number of ML commands in that window:

$$\delta = 2 \cdot \frac{\mathrm{count(ML)}}{N} - 1 \tag{13}$$

In left occlusion or strong left contraction areas, δ becomes +1, indicating a horizontal path in fig.22

around the point of interest. In right occlusion or strong right contraction areas, δ becomes -1,

indicating a vertical path in fig.22 around the point of interest. Now, we choose the weight Wleft to

be:

$$W_{\mathrm{left}} = W + \delta \cdot \Delta W \tag{14}$$

with

$$W = \tfrac{1}{2} - S \; ; \qquad \Delta W = \tfrac{1}{2} + |S| \,, \tag{15}$$

the result being limited to the range [0,1].

It is easy to see that Wleft = 1 in a left occlusion, Wleft = 0 in a right occlusion and Wleft = W for a

normal object. This kind of adaptive weight setting is very easy to implement in hardware. Fig.23

shows the effect of the disparity-driven weighting procedure. Fig.23a is the weighting according to

(12), with white indicating Wleft = 1 and black indicating Wleft = 0. Fig.23b shows the disparity-


adaptive weighting according to (14)-(15) with a window length N=8, and fig.23c with a window

length of N=64. It can clearly be seen that Wleft approaches 1 even for S→½ in the areas of left

occlusions (left side of the head), and Wleft approaches 0 even for S→-½ in the areas of right

occlusions with adaptive weighting. In these areas, the disparity-weighted interpolation algorithm

produces sharper images than the simple interpolator using (12).
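A sketch of the adaptive weight computation for one disparity vector, based on the reconstructed formulas (13)-(15); the window placement at the borders and the clipping are our own simplifications:

```python
def w_left(S, commands, i, N=8):
    """Disparity-adaptive left weight, eqs. (13)-(15), for the disparity
    vector at command index i; the result is limited to [0, 1]."""
    lo = max(0, i - N // 2)
    window = commands[lo:lo + N]                        # window around i
    delta = 2.0 * window.count('ML') / len(window) - 1  # eq. (13)
    W = 0.5 - S
    dW = 0.5 + abs(S)                                   # eq. (15)
    return min(1.0, max(0.0, W + delta * dW))           # eq. (14), clipped

cmds = ['ML'] * 8 + ['ML', 'MR'] * 8   # left occlusion, then normal area
print(w_left(0.4, cmds, 4), w_left(0.4, cmds, 16))   # -> 1.0 and ~0.1
```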

VI. EVALUATION OF ALGORITHMS AND RESULTS OF COMPUTER SIMULATIONS

In the start phase of the PANORAMA project, we have compared different algorithms for disparity

estimation and viewpoint interpolation with regard to subjective quality and hardware feasibility.

Among the disparity estimators were a feature-based approach [16], two dynamic-programming

approaches and the hierarchical block matching, which is described in this paper. The latter one was

finally selected, because it showed superior performance and did not require larger hardware

complexity than any of the other proposals. Two interpolation concepts were investigated, one of

these an object-based approach [17], the other the concept presented in this paper. Though the former performed slightly better in the areas of occlusions in uniform-background scenes, the latter was selected because it is less complex with respect to hardware realization and more universally applicable, also to non-uniform background scenes. The subjective assessments of six

expert viewers were evaluated ; a "good" picture quality was attested to the selected scheme.

The performance of the methods presented in this paper has been tested with a set of natural

stereoscopic sequences in extensive computer simulation experiments. These sequences were

recorded within the framework of the European projects RACE-DISTIMA and ACTS-

PANORAMA. The image resolution is 720x576 pixels. The stereoscopic sequences MAN, ANNE

and CLAUDE, representing typical videoconferencing situations, are given here as a reference. It is

interesting to note that these three sequences were taken with different camera setups. While MAN is

truly in the configuration we are planning to realize (parallel cameras, but only the overlapping area

is shown), the ANNE images were artificially shifted in order to avoid the large non-overlapping area

between left and right image view. In both of these images, the baseline was 50 cm, with a distance

of 2-2.5 m between cameras and person. The CLAUDE sequence was even captured with a

convergent (non-parallel) camera setup, however with a much smaller baseline (15 cm), but also

with smaller distance between person and camera (1.2 m). All the sequences presented here fulfil

the uniform background assumption, such that the foreground/background segmentation can be


performed as planned in the hardware system. However, the disparity estimation algorithm succeeded not only with these sequences, but has also been applied successfully to other (non-uniform background) sequences.

All underlying disparity estimation experiments have been performed using the parameters given

in section III. Some results illustrating the performance of the image interpolation method are given

in figures 24-26, which show left-view images, synthesized central viewpoints and right-view

images using the tenth frame pairs of the sequences MAN, ANNE and CLAUDE. The computed central

viewpoint is displayed between the two original stereo images.

We obtain good stereoscopic virtual-viewpoint image quality when the difference in the S-position between the synthesized left- and right-view images is between 0.05 and 0.1, which corresponds approximately to the "natural" disparity due to the distance of human eyes. It is

remarkable that some occasional distortions, which become visible as a kind of temporal flicker near the foreground/background borders in a monoscopic presentation, become unnoticeable in the stereoscopic view. The telepresence illusion is very natural, rendering high

image quality.

Recently, we have also tested the performance of the system with sequences that were taken with a convergent camera setup. Here, the precondition that no vertical disparities are present is violated, which implies that, strictly speaking, the disparity estimator and the viewpoint interpolator should not be limited to a single scanline. However, we have found that the quality of images interpolated with our system remains high if the convergence angle between the cameras is not too large (up to 15 degrees). With a convergent camera setup, the SHIFT

parameters in (8), (9) and (10) can be drastically lowered, such that it is no longer necessary to set a

large number of pixels at the left and right sides of the image to black values. This implies that we

can utilize a larger area of the images, and can increase the size of the person by using cameras with

larger focal length.

VII. HARDWARE STRUCTURE

VII.1 Disparity estimator

We have started the hardware implementation of the disparity estimation algorithm with the overall structure shown in fig.4. The goal was to build the target hardware without dedicated chip design, such that only standard chips, namely digital signal processors (DSPs) and field programmable gate arrays (FPGAs), were to be used. While the preprocessing and postprocessing (interpolation) stages can easily be realized within FPGAs, the matching stages pose the most demanding task with respect to processing power. Nevertheless, we found it feasible to implement the matching kernel of the global stage using one FPGA and one DSP; an additional FPGA is needed for pixel access control. Its basic structure is similar to that of the local block matching module, which is more complicated and is described in more detail in the rest of this section.

Figure 27 shows the hardware architecture of the local block matching module in more detail. The input data of this module are the luminance of the left and right video images and estimates of the global disparity vectors for a number of feature points. All input data form a multiplexed CCIR 656 data stream, where left and right video replace the luminance and chrominance data, and feature point coordinates as well as estimated global disparity vectors are transmitted in the horizontal blanking interval; a sketch of this demultiplexing is given below.
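To make this stream organization concrete, the following C fragment sketches how such a multiplexed 4:2:2-style stream could be split into left and right luminance samples. It is only an illustration : the slot assignment (which view occupies the luminance slots and which the chrominance slots) is an assumption, since the text only states that left and right video replace the luminance and chrominance data.

    #include <stdint.h>

    /* Demultiplex a Rec.656-style 4:2:2 stream in which the luminance
     * slots carry the left view and the chrominance slots carry the
     * right view (slot assignment assumed for illustration). */
    static void demux_656_pair(const uint8_t *in, int n_pairs,
                               uint8_t *left, uint8_t *right)
    {
        for (int i = 0; i < n_pairs; i++) {
            right[i] = in[2 * i];       /* chrominance slot -> right view */
            left[i]  = in[2 * i + 1];   /* luminance slot   -> left view  */
        }
    }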

Signal processing on the local block matching board is divided into several modules: two block matching modules, one 7-tap median filter module, and three DSP modules. The mechanical and electrical specifications of all six modules and their connectors comply with the TIM-40 standard [18], which makes interfacing and testing easy. While the DSP TIM-40 modules are commercially available, the other three modules were developed especially for the needs of this project.

The complexity of present high-end FPGAs allows the design of an FPGA-based block matching processor containing 20 cell elements for parallel MAD calculations as well as additional circuitry for pixel addressing and MAD postprocessing. The principle applied for MAD calculation is the parallel accumulation of absolute pixel differences by shifting the measure pixels. As depicted in figure 28, the cell array contains 20 block matching cells, each calculating the MAD for a single search position and passing the measure pixel on to the next cell. Thus, search pixels (St) and measure pixels (Mt) have to be fetched only once for the complete estimation of a single block; a behavioural sketch of this principle follows below.
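The accumulation principle can be summarized by the following behavioural C sketch. It models the cell array of figure 28 in software : the search pixel is broadcast to all cells while the measure pixel enters a shift chain, so that cell i accumulates the absolute differences for search position i. Data widths, the warm-up of the shift chain and block boundaries are simplified assumptions.

    #include <stdlib.h>

    #define NUM_CELLS 20                 /* parallel search positions (fig.28) */

    typedef struct {
        unsigned acc[NUM_CELLS];         /* one MAD accumulator per position */
        unsigned char m_pipe[NUM_CELLS]; /* measure-pixel shift chain        */
    } cell_array_t;

    /* Clock one (measure, search) pixel pair into the array. */
    static void cell_array_clock(cell_array_t *ca, unsigned char m, unsigned char s)
    {
        for (int i = NUM_CELLS - 1; i > 0; i--)  /* shift measure pixels */
            ca->m_pipe[i] = ca->m_pipe[i - 1];
        ca->m_pipe[0] = m;
        for (int i = 0; i < NUM_CELLS; i++)      /* broadcast search pixel */
            ca->acc[i] += (unsigned)abs((int)ca->m_pipe[i] - (int)s);
    }

After a complete block has been clocked through (and the shift chain has filled), acc[i] holds the accumulated absolute differences for search position i.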

The block matching processor is depicted in figure 29. It interacts with a DSP, which provides the parameters used for address generation and MAD postprocessing, and with a dual-port memory, which sequentially stores the left and right image slices and outputs left and right pixels blockwise for matching. The cost function is computed by adding the MAD to a counter register that is preloaded with the temporal prediction vector; the penalty term α|dz(t) - dz(t-1)| is thus built by decrementing or incrementing the counter register. The final step of finding the minimum of the cost function requires a vector mask, which defines the valid positions to be taken into account for the search. The mask is provided by the DSP and results from ten start vectors for the local search and the start vector transmitted to the processor; the sketch below mirrors this masked minimum search.
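The following C sketch illustrates the masked minimum search. The cost of each valid candidate is formed as MAD plus the penalty α|dz(t) - dz(t-1)|, written out arithmetically here instead of with the counter register used in hardware; the value of α and the candidate count are illustrative assumptions.

    #include <limits.h>

    #define NUM_POS 20
    #define ALPHA   2                       /* illustrative penalty weight */

    /* Return the candidate disparity with minimum cost among the
     * positions enabled in the vector mask (bit i = position i valid). */
    static int best_disparity(const unsigned mad[NUM_POS],
                              const int cand[NUM_POS],
                              unsigned mask, int d_temporal)
    {
        unsigned best_cost = UINT_MAX;
        int best_d = d_temporal;            /* fallback if the mask is empty */
        for (int i = 0; i < NUM_POS; i++) {
            if (!((mask >> i) & 1u))
                continue;                   /* position outside the mask */
            int diff = cand[i] - d_temporal;
            unsigned cost = mad[i] + (unsigned)(ALPHA * (diff < 0 ? -diff : diff));
            if (cost < best_cost) { best_cost = cost; best_d = cand[i]; }
        }
        return best_d;
    }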

As depicted in figure 30, the estimation algorithm allows a parallel connection of block matching processors, each processing a separate image slice. With the parameters used for local disparity estimation, four processors working in parallel are required to meet the real-time constraints (a sketch of the stripe partitioning follows below). The processors are supervised by DSPs, which build the block matching commands and start vectors for each processor.
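The partitioning itself is straightforward, as the short C sketch below indicates; splitting the 576 active lines into four equal stripes is an assumption consistent with the 720x576 input format.

    #define IMAGE_LINES 576
    #define NUM_PROC    4

    /* First and last line of the stripe handled by processor 'proc'. */
    static void stripe_bounds(int proc, int *first_line, int *last_line)
    {
        int h = IMAGE_LINES / NUM_PROC;   /* 144 lines per stripe */
        *first_line = proc * h;
        *last_line  = *first_line + h - 1;
    }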

This hardware solution offers an area-efficient design, such that both the local stage and the dense field generation can be placed on a single board. The whole disparity estimator will be realized on two boards (actually, everything would fit on one board, but the two-board solution is more practical in the prototype version, because the two matching stages are built by different partners).

VII.2 Synchronization unit

To increase the testability of the system, the signals at the interfaces between camera, disparity estimator, encoder, decoder, interpolator and display conform to the parallel CCIR 656 standard. To compensate for the (variable) delay of the disparity estimator, a synchronization unit is inserted, which ensures synchronization of the disparity fields and the recorded image sequences at the input of the encoders.

A block scheme of the synchronizer is shown in figure 31. Based on the implementation of the disparity estimator, a maximum delay of one frame (= 40 msec) is assumed. At start-up of the system, the start code for a new frame is searched for in the CCIR 656 data streams of the left and right image sequences. When it is found, the synchronizer starts to buffer the image data and at the same time searches for the code of a new frame in the CCIR 656 stream of the command map (disparity field information). As soon as the latter is detected, the synchronizer starts to output the image data and command map data to the encoders. If for some reason the data streams become asynchronous again during operation, the detection procedure is repeated; the state machine sketched below summarizes this control flow.
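This control flow can be summarized as the small state machine sketched below in C. The byte-level interface and the buffering (the memories of figure 31) are omitted; only the frame-acquisition logic described above is modelled, and the signal names are assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    /* A Rec.656 EAV/SAV preamble starts with the byte sequence FF 00 00. */
    static bool is_preamble(const uint8_t *p)
    {
        return p[0] == 0xFF && p[1] == 0x00 && p[2] == 0x00;
    }

    typedef enum { WAIT_VIDEO_FRAME, WAIT_MAP_FRAME, PASS_THROUGH } sync_state_t;

    /* One control step: wait for a new-frame code in both video streams,
     * buffer video while waiting for the command map frame code, then
     * pass data to the encoders; re-acquire if alignment is lost. */
    static sync_state_t sync_step(sync_state_t s,
                                  bool video_frame_start,
                                  bool map_frame_start,
                                  bool streams_aligned)
    {
        switch (s) {
        case WAIT_VIDEO_FRAME:
            return video_frame_start ? WAIT_MAP_FRAME : WAIT_VIDEO_FRAME;
        case WAIT_MAP_FRAME:    /* video data is being buffered meanwhile */
            return map_frame_start ? PASS_THROUGH : WAIT_MAP_FRAME;
        case PASS_THROUGH:
            return streams_aligned ? PASS_THROUGH : WAIT_VIDEO_FRAME;
        }
        return WAIT_VIDEO_FRAME;
    }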

Synchronization at the output of the decoders is obtained by using the timestamps (PresentationTimeStamp, DecodingTimeStamp, SystemTimeClock) inserted during (MPEG) encoding and multiplexing.


VII.3 Interpolator

The hardware architecture of the interpolator is shown in figure 32. Since the goal of the project is

to realize a prototype system, we have chosen to implement the interpolator with off-the-shelf

memories and FPGAs (Altera 9k/10k series).

The decoded left and right images are demultiplexed, and the luminance and chrominance data are stored in separate 8-bit FIFO buffers. After demultiplexing, the subsampled command map is first expanded to a dense map and then stored in a 1-bit FIFO buffer. When enough data is available in the FIFO buffers, the weights Wleft and Wright are calculated as described in section V, controlled by the Algo block. The Algo block is the most critical part of the interpolator. It controls the pixel positions in the intermediate view, taking into account the external parameters SHIFT and S, the numbers BLACKleft and BLACKright of black pixels at the left and right sides of the intermediate image, and the head position. Moreover, it takes the different sampling of luminance and chrominance data into consideration. The controller takes care of the timing and synchronization of input and output data. The multipliers (X) are implemented as serial pipelined multipliers, since a parallel version in the chosen technology was not fast enough to run at the intended 27 MHz; this introduces a small extra delay of eight clock periods. The overall delay of the interpolator is about 1 µs, which is negligible compared to the total delay of the chain. The interpolator is implemented twice, once for the virtual left camera and once for the virtual right camera. A sketch of the per-pixel weighting performed in this datapath is given below.
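The core of the datapath is the per-pixel weighting. The C sketch below illustrates it under simplifying assumptions : the weights are taken as given fixed-point values with Wleft + Wright = 256, and the disparity-compensated fetching of the left and right samples (which in the real design is governed by the Algo block) is abstracted away, so already-aligned samples are blended.

    #include <stdint.h>

    /* Blend one pixel pair: w_left in 0..256, w_right = 256 - w_left. */
    static uint8_t blend_pixel(uint8_t left, uint8_t right, unsigned w_left)
    {
        unsigned w_right = 256u - w_left;
        return (uint8_t)((w_left * left + w_right * right + 128u) >> 8);
    }

    /* Fill one output scanline, blanking the invisible border regions
     * given by BLACKleft/BLACKright. */
    static void blend_line(const uint8_t *left, const uint8_t *right,
                           const unsigned *w_left, uint8_t *out,
                           int width, int black_left, int black_right)
    {
        for (int x = 0; x < width; x++)
            out[x] = (x < black_left || x >= width - black_right)
                     ? 0 /* black border pixel */
                     : blend_pixel(left[x], right[x], w_left[x]);
    }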

VIII. SUMMARY AND CONCLUSIONS

A method for disparity estimation and image synthesis applied to 3D-videoconferencing with viewpoint adaptation is introduced. The novelty of the disparity estimator is twofold : on the one hand, it has been optimized to achieve very low hardware complexity; on the other hand, it shows robustness and accuracy with regard to the addressed application. The goal of estimating reliable displacement maps at extremely low computational cost is reached by an improved hierarchical block matching method. The idea at the heart of the presented approach is to combine previously estimated vectors to predict and correct each newly calculated disparity vector, applying a suitable cost function and taking the assumptions about the scene into account. The image synthesis performs a weighted interpolation, wherein the specific weights for the left and right camera images are adapted to the degree of contraction within the disparity field. The methods reported in this paper were designed under the constraints of keeping implementation costs low and supplying intermediate views with good image quality. Their performance was tested in computer experiments using natural stereoscopic sequences representing typical videoconferencing situations. The system is presently being realized in a hardware testbed by the project partners. The disparity estimator and image synthesis method introduced in this paper are expected to be capable of offering a realistic 3D impression with continuous motion parallax in videoconferencing situations.

ACKNOWLEDGEMENTS

This work was supported by the European Commission within the ACTS PANORAMA project

under grant AC092. The sequences used for the experiments were recorded at HHI, Germany, and

CCETT, France.

REFERENCES

[1] N. Tetsutani, K. Omura and F. Kishino : "Wide-screen autostereoscopic displays employing head-position tracking," Opt. Eng., vol. 33, no. 11, pp. 3690-3697, Nov. 1994.

[2] K. Hopf, D. Runde and M. Böcker : "Advanced videocommunications with stereoscopy and individual perspective," in Towards a Pan-European Telecommunication Service Infrastructure - IS&N '94, Kugler et al. (eds.), Berlin, Heidelberg, New York : Springer, 1994.

[3] R. Börner : "2-channel lenticular system for 3D-imaging with tracked projectors," HHI Annual Report 1996, Berlin : HHI, 1997.

[4] W. Hoff and N. Ahuja : "Surfaces from stereo : Integrating feature matching, disparity estimation and contour detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-11, no. 2, 1989.

[5] J. Weng, N. Ahuja and T.S. Huang : "Matching two perspective views," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-14, no. 8, 1992.

[6] H.H. Baker and T.O. Binford : "Depth from edges and intensity based stereo," Proc. 7th Int. Joint Conf. Artif. Intell., pp. 631-636, Vancouver, Canada, Aug. 1981.

[7] Y. Ohta and T. Kanade : "Stereo by intra- and inter-scanline search using dynamic programming," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-7, no. 2, pp. 139-154, Mar. 1985.

[8] P. Anandan : "Measuring visual motion from image sequences," PhD thesis, University of Massachusetts, 1987.

[9] I.J. Cox, S.L. Hingorani and S.B. Rao : "A maximum likelihood stereo algorithm," Computer Vision and Image Understanding, vol. 63, no. 3, pp. 542-567, 1996.

[10] B. Chupeau : "A multiscale approach to the joint computation of motion and disparity : Application to the synthesis of intermediate views," Proc. 4th Europ. Worksh. on Three-Dimension. Televis., pp. 223-230, Rome, Italy, Oct. 1993.

[11] E. Izquierdo and M. Ernst : "Motion/disparity analysis for 3D video conference applications," Proc. Int. Workshop on Stereoscopic and Three Dimensional Imaging, pp. 180-186, Santorini, Greece, Sept. 1995.

[12] H.P. Moravec : "Towards automatic visual obstacle avoidance," Proc. 5th Int. Joint Conf. Artif. Intell., p. 584, Cambridge, MA, Aug. 1977.

[13] M.J. Hannah : "A system for digital stereo image matching," Photogrammetric Engineering and Remote Sensing, vol. 55, no. 12, pp. 1765-1770, 1989.

[14] P.A. Redert and E.A. Hendriks : "Disparity map coding for 3D teleconferencing applications," to appear in Proc. SPIE VCIP, San José, USA, 1997.

[15] J. Ziv and A. Lempel : "Compression of individual sequences via variable-rate coding," IEEE Trans. Inf. Theory, vol. IT-24, no. 5, pp. 530-536, Sept. 1978.

[16] J. Liu, I.P. Beldie and M. Wöpking : "A computational approach to establish eye-contact in videocommunication," Proc. Int. Workshop on Stereoscopic and Three Dimensional Imaging, pp. 229-234, Santorini, Greece, Sept. 1995.

[17] J.-R. Ohm and E. Izquierdo : "An object-based system for stereoscopic videoconferencing with viewpoint adaptation," in Digital Compression Technologies and Systems for Video Communications, N. Ohta (ed.), Proc. SPIE 2952, pp. 29-41, Berlin, Oct. 1996.

[18] Texas Instruments : "TIM-40, TMS320C4x Module Specification," Version 1.01, 1993.


List of Figures

Fig.1. Setup of stereoscopic cameras and screen, and variable position of virtual camera pair.

Fig.2. The complete system chain.

Fig.3. Invisible areas in parallel camera setup

Fig.4. Flowchart of disparity estimator algorithm.

Fig.5. Gradient-based operator applied for homogeneity decision.

Fig.6. a foreground region b/c highest-variance matching points (left/right image) of sequence

ANNE.

Fig.7. a foreground region b/c highest-variance matching points (left/right image) of sequence

CLAUDE.

Fig.8. Relation of block position, point of highest Moravec output and matching window.

Fig.9. Positions of 9 spatial candidate vectors (one candidate from the temporally preceding displacement field not shown).

Fig.10. Postprocessing of disparities at foreground/background segmentation mask borders (illegal

disparities indicated as dotted lines).

Fig.11. Vertical median filter after horizontal interpolation.

Fig.12. Dense disparity fields a of sequence ANNE b of sequence CLAUDE.

Fig.13. a Violation of ordering constraint b Interpolated fill.

(violating vectors indicated as bold lines)

Fig.14. Disparity example for generation of disparity command map.

Fig.15. Usage of L→R and R→L disparities exploiting position of the foreground masks.

Fig.16. Block diagram of the interpolator.

Fig.17. Disparity paths of two scanlines and the interpolated analytical and rounded paths.

Fig.18. Left (a) and right (b) camera images, and the associated disparity map (c).

Fig.19. Scanlines of left, right and virtual cameras.

Fig.20. The effect of the control parameters with respect to image position.

Fig.21. The definition of pixel values in the visible region of intermediate images, for the cases of

Y, U and V components.


Fig.22. The disparity path representation relating to fig.14.

Fig.23. Weights Wleft for different interpolation positions (top : left position, bottom : right

position) a with weighting according to (12) b,c with disparity-adapted weighting

according to (14)-(15).

Fig.24. Left-view image, synthesized central viewpoint and right-view image, sequence MAN.

Fig.25. Left-view images, synthesized central viewpoints and right-view images, sequence ANNE.

Fig.26. Left-view images, synthesized central viewpoints and right-view images, sequence

CLAUDE.

Fig.27. Hardware architecture of the local block matching module.

Fig.28. Cell array.

Fig.29. FPGA-based block matching processor.

Fig.30. Structure for block matching.

Fig.31. Structure of the synchronizer.

Fig.32. Hardware structure of the interpolator.


Fig.1.

Fig.2.


Fig.3.

Fig.4. (flowchart residue: Preprocessing and Segmentation; Global disparity estimation producing feature point coordinates and global disparities; Local disparity estimation producing local disparities; Dense field interpolation and vertical median filter; inputs L/R image fields, outputs L/R disparity fields and segmentation mask to the interpolator)

Fig.5.


Fig.6.

Fig.7.

Fig.8. (legend: block of size 16x16; point with highest Moravec output; matching window of size 13x9)


Fig.9.

Fig.10.

Fig.11. (legend: actual row; rows processed during horizontal interpolation; rows processed during vertical interpolation; median filter inputs)

Fig.12.

Fig.13.

Fig.14.


Fig.15.

Fig.16.


Fig.17.


Fig.18.

Fig.19.


Fig.20.

Fig.21.

Fig.22.



Fig.23.

Fig.24.

Fig.25.

Fig.26.


Fig.27. (block diagram residue: multiplexed left/right video and global disparity input; dual DSP 320C44 TIM modules with 4x 128k x 32 SRAM for start vector generation and postprocessing; two FPGA-based local block matching modules and one FPGA-based 7-tap vertical median module in TIM size and pinning, linked via COM ports; outputs disparity vectors for every 4x4 block and dense disparity every 4th line after horizontal interpolation)

Fig.28. (cell array residue: 20 cells accumulating ∑|Mt-St|, ∑|Mt-1-St|, ..., ∑|Mt-19-St| for search positions 1 to 20; measure pixels Mt shifted along the chain, search pixels St broadcast; MAD outputs)


Fig.29. (block matching processor residue: dual port memory fed by TV left/right; block address generation from block vectors; cell array delivering MPix/SPix and MAD; cost function combining MAD and temporal prediction vector; minimum finder with vector mask yielding the minimum position; DSP interface)

Fig.30. (structure residue: four block matching processors with dual port memories, each supervised by a DSP and processing one of stripes 0-3; inputs TV left/right and global disparity field, output local disparity field)


Fig.31. (synchronizer residue: Rec.656 interpretators recognizing the EAV preamble in the video left, video right and disparity streams; write/read counters with read counter control addressing two 640K x 16 memories and a 4 x 1 byte FIFO; registers and ECL output buffers delivering the synchronized outputs)

Fig.32.

