A Realtime Hardware System for Stereoscopic Videoconferencing with Viewpoint Adaptation
Jens-Rainer Ohm1, Karsten Grüneberg1, Emile Hendriks2, Ebroul Izquierdo M.1,
Dimitris Kalivas3, Michael Karl1, Dionysis Papadimatos4, André Redert2
ABSTRACT
This paper describes a hardware system and the underlying algorithms that were developed for
realtime stereoscopic videoconferencing with viewpoint adaptation within the European
PANORAMA project. The goal was to achieve a true telepresence illusion for the remote partners.
For this purpose, intermediate views at arbitrary positions must be synthesized from the views of a
stereoscopic camera system with rather large baseline. The actual viewpoint is adapted according to
the head position of the viewer, such that the impression of motion parallax is produced. The whole
system consists of a disparity estimator, stereoscopic MPEG-2 encoder, disparity encoder and
multiplexer at the transmitter side, and a demultiplexer, disparity decoder, MPEG-2 decoder and
1 Heinrich-Hertz-Institut Berlin, Germany. 2 Delft University of Technology, Netherlands. 3 INTRACOM, Greece. 4 University of Patras, Greece.
The work described herein was performed within the ACTS PANORAMA project, funded by the European Commission
under grant AC092.
Corresponding author :
Dr.-Ing. Jens-Rainer Ohm
Heinrich-Hertz-Institut
Image Processing Department
Einsteinufer 37, D-10587 Berlin, Germany
Phone : +49-30-31002-617 Fax : +49-30-392-7200 E-mail : [email protected]
interpolator with viewpoint adaptation at the receiver side. For transmission of the encoded signals,
an ATM network is provided. In the final system, autostereoscopic displays will be used. The
algorithms for disparity estimation, disparity encoding and disparity-driven intermediate viewpoint
synthesis were specifically developed under the constraint of hardware feasibility.
I. INTRODUCTION
A telepresence videoconferencing system should give the users an illusion of true contact, bringing
participants together in a virtual space. This cannot be achieved by an ordinary stereoscopic image
acquisition/presentation chain. First, the users should not wear glasses, which implies that autostereoscopic displays have to be used [1]. Second, the viewer expects the view angle to alter with movements of the head; this effect, called motion parallax, is very important for a true illusion of being inside a 3D scene. To achieve it, viewpoint adaptation must be performed, which means that the view angle on the display is altered automatically according to the viewer's head movements [2]. Third, the stereoscopic cameras cannot be positioned in front of the display, which
implies that the baseline between the cameras must be at least 50 cm with a relatively small display,
and 80 cm with a larger display. Such a baseline is by far too large for rendering stereoscopic
images directly. The extreme differences between left- and right-view images do not correspond to
the small distance of human eyes, and the resulting stereo presentation would perturb the viewer.
Hence, it is necessary to synthesize intermediate-view stereo image pairs with smaller baseline
(fig.1). At the same time, a headtracking system can be used to adapt the actual viewpoint on the
interocular axis between the cameras, which gives the impression of natural motion parallax. A
prototype hardware system performing these tasks is presently being developed within the framework of the European PANORAMA project.
The synthesis of natural-looking intermediate views can be done by interpolation from the left-
and right-view images, if the positions of corresponding points are known. This requires the
knowledge of depth information, which can be obtained by disparity estimation between the left-
and right-view images. The disparity vectors can then be used to project pixels onto an intermediate
image plane. However, a critical case is the presence of occlusion areas, where some parts of the
scene may only be found in the left- or in the right-view image. In these cases, instead of
interpolation, a unidirectional projection has to be performed. Furthermore, disparity estimation is
not a trivial task due to the rather large baseline. We have found that horizontal disparity ranges of up to 120 pixels are necessary with a 50 cm baseline and a distance of 1.5 m between the camera and
the user. The vertical disparity shifts are much smaller. In the special case where a coplanar camera
geometry is used (which means that cameras shoot in parallel directions and are adjusted at the same
height), the vertical disparity is approximately zero all over the images. We decided to use this
geometry in order to simplify the algorithms with respect to hardware realization.
The paper is organized as follows. Section II introduces the concept of the whole system chain. In
section III, the disparity estimation algorithm is described. Section IV specifies the
scheme developed for encoding of the disparity map. The algorithm for interpolation synthesis is
introduced in section V. Section VI gives results of computer simulations and shows examples.
Section VII describes the hardware concepts for the different parts of the chain. In section VIII,
conclusions are drawn.
II. SYSTEM CHAIN
Fig.2 shows the complete system chain in schematic form. The transmitter side consists of data
acquisition (stereoscopic camera, microphone), disparity estimation processing and encoding. The
stereoscopic camera setup uses parallel camera geometry, which allows the disparity estimation search to be reduced to the horizontal shift component. The disparity estimation processing also
includes delays which are necessary to synchronize video, audio and disparity data prior to
encoding, and a special disparity command conversion which is described in more detail in section
IV. The left and right view image signals and the audio signal are encoded by separate,
commercially-available MPEG-2 encoders. However, it is necessary to provide a separate encoder
for the subsampled disparity fields that are output from the estimator. The system multiplexer,
which is compatible with a standard MPEG-2 multiplex, integrates the encoded disparities as additional stream data, independent of the video data. Furthermore, it is necessary to synchronize the
independent left- and right-image video encoders and the disparity data.
For transmission, a standard ATM network is provided. We are using constant rate transmission
in AAL 5. At the receiver side, demultiplexing of video, audio and disparity data is performed first.
Then, the separate elementary streams are fed into the appropriate decoders, of which only the
disparity decoder is a non-standard device again. Video and disparity data are then fed into the
intermediate viewpoint interpolator, which gets further information from a headtracking system
about the required viewpoint. The information of the headtracker is used at the same time to drive
the autostereoscopic display, which is a system based on projection onto a lenticular screen and
must be adapted according to the viewing angle [3].
The system is designed such that it can work in three modes, and hence can also be configured in
a flexible way for other purposes. In the direct mode, only data acquisition, estimator processing,
interpolator processing and data presentation are performed, which is the basic configuration for a
stereoscopic system with viewpoint adaptation. In coding mode, the chain is extended by the
encoders, multiplexer, demultiplexer and decoder, which enables compression of the required data.
In ATM mode, transmission of the data over the ATM network is additionally performed.
III. DISPARITY ESTIMATION
Disparity estimation is the most demanding task for the system hardware. To match the
corresponding points between left- and right-view images, disparity ranges of up to 120 pixels are necessary when the baseline is 50 cm and the distance between user and camera is 1.5 m. With an 80 cm baseline, this range would even increase to approximately 230 pixels. Due to the use of a parallel
camera geometry, the disparity estimator needs only to take into account the horizontal disparity
shift. On the other hand, the parallel setup has the disadvantage that the absolute disparity shift
between left and right images is much larger than the ranges given above; specifically, zero disparity only occurs for a point at infinite distance. As a consequence, a large portion at the left
side of the left image is not present in the right image, and a large portion at the right side of the
right image is not present in the left image (see fig.3). This circumstance is taken into account
during estimation by defining an additional disparity offset doff, and must also be treated during
interpolation (see section V.2).
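The quoted ranges follow from the pinhole relation d = f·B/Z for a parallel camera pair, where f is the focal length in pixel units, B the baseline and Z the depth of the scene point. A minimal sketch, assuming a hypothetical focal length of 360 pixels (the paper does not state the camera parameters):

```python
# Absolute disparity of a point at depth Z for a parallel stereo pair:
# d = f * B / Z, with f in pixel units. The focal length below is a
# hypothetical value chosen for illustration only.
def disparity_pixels(f_pix: float, baseline_m: float, depth_m: float) -> float:
    """Horizontal disparity (in pixels) of a point at distance depth_m."""
    return f_pix * baseline_m / depth_m

F_PIX = 360.0  # assumed focal length in pixels, not taken from the paper
d = disparity_pixels(F_PIX, 0.5, 1.5)  # 120.0 px for a 50 cm baseline at 1.5 m
```

Under this assumed focal length, the 50 cm baseline at 1.5 m indeed yields a disparity of about 120 pixels; the 230-pixel figure for the 80 cm baseline would correspond to a closer scene point or a longer focal length.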
In recent years, many different schemes for disparity estimation have been proposed.
Though feature-based [4,5,6] and dynamic-programming [7,8,9] approaches seem to perform very
well, we found them to be too complex for a hardware system with the requirement of large
disparity ranges even in the case of pure horizontal disparities. Matching approaches can be
classified as area-based schemes [10,11]. We have compared several algorithms (feature-based,
dynamic programming and matching) with respect to subjective quality results and hardware
feasibility, and decided to implement the hierarchical block matching scheme, which is described in
this section. This scheme easily copes with arbitrary disparity ranges, and performs robustly even in the case of low correspondence between left- and right-view images, e.g. in partially occluded areas.
A criterion based on an absolute-difference feature is used to determine optimum positions of the
matching windows. At the same time, in a preprocessing stage, a simple foreground-/background
segmentation is performed, which is used to refine the results of estimation.
The disparity estimation algorithm can be divided into four modules:
1. Preprocessing and segmentation. The goal of this stage is to find points with highest relevance
for matching, and to perform a raw subdivision into foreground and background areas.
2. Block matching with large block size for global bidirectional disparity estimation, followed by a
cross-consistency check.
3. Block matching with small block size for local bidirectional disparity estimation, followed by a
cross-consistency check.
4. Interpolation of dense L→R and R→L disparity fields, application of vertical median filters and
ordering-constraint check.
A flowchart describing the interrelation of the disparity estimator module blocks is given in fig.4.
Preprocessing and segmentation is performed on both input signals. Bidirectional (L→R and R→L)
sparse disparity fields are estimated in the global and refined in the local estimation stage. In order
to guarantee temporal continuity of the estimated disparities and to avoid temporally annoying
artefacts, the disparities estimated for the previous field pair are fed back to the estimator stages. For
this purpose, the dense field, generated at the final stage by bilinear interpolation, is used.
III.1. Preprocessing and segmentation
The preprocessing-and-segmentation stage uses a simple criterion based on pixel differences. To select those image points which can be clearly distinguished from their neighbours, we use a simple, difference-based interest operator (the so-called Moravec operator [12], see fig.5). This is applied
to both left- and right-view image fields. The directional difference in four directions (horizontal,
vertical and the two diagonals) is measured at each pixel position over a small square window of
size 5x5. In each of the four directions, we have five pixels, and four differences between adjacent
pixel pairs. In a first step, the sums-of-absolute-differences along all directions are calculated. The
output of the operator is then defined as the maximum of these four sum values. The goal of this
operation is two-fold :
− The Moravec operator's output is used to detect the point of highest interest within each matching
block for the subsequent global block matching stage.
− A threshold analysis detects large, uniform areas, at which valid disparity vectors cannot be
estimated by a block matching strategy, because no true correspondences can be found. If we
process head-and-shoulder scenes with relatively uniform background, it is easy to interpret this
as a raw foreground/background segmentation. The classification is performed on a block basis,
where isolated areas with "wrong" classification are erased by comparison with their neighbors.
In this case, we end up with a unique foreground/background mask containing only two
segments.
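The interest operator described above can be sketched as follows; a minimal implementation assuming grayscale fields stored as NumPy arrays (the function name and window handling are our own):

```python
import numpy as np

def moravec(img: np.ndarray, x: int, y: int) -> float:
    """Difference-based interest operator over a 5x5 window centred at (y, x).
    Along each of the four directions (horizontal, vertical, two diagonals)
    the absolute differences of the four adjacent pixel pairs are summed;
    the output is the maximum of the four directional sums."""
    best = 0.0
    for dy, dx in ((0, 1), (1, 0), (1, 1), (1, -1)):   # direction steps
        vals = [float(img[y + k * dy, x + k * dx]) for k in range(-2, 3)]
        sad = sum(abs(vals[k + 1] - vals[k]) for k in range(4))
        best = max(best, sad)
    return best

flat = np.zeros((7, 7))            # uniform area: operator output is 0
edge = np.zeros((7, 7))
edge[:, 4:] = 10.0                 # vertical intensity edge through the window
```

Within each block, the pixel maximising this output is taken as the representative matching point; a threshold on the same output drives the foreground/background classification.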
Actually, this system was only optimized for head-and-shoulder scenes with uniform background. In
the future, it would be possible to replace this part by a more sophisticated algorithm providing more precise knowledge of the borders between foreground and background areas, where abrupt changes in the disparities will occur due to occlusion effects. In the case where no reasonable
foreground/background segmentation mask is found (the "foreground region" of the segmentation
mask then covers the whole image), those features of the estimator, which are related to the
segmentation mask, are switched off, while the overall system still keeps working with fair quality.
Figures 6 and 7 show the extracted foreground regions from the first left frame of the sequences ANNE and CLAUDE, and the highest-variance points which are used as matching correspondences by the following global block matching stage.
III.2 Global disparity estimation
In order to reduce noise sensitivity and simultaneously reach higher efficiency, both the left and
right image fields are subsampled by a factor of two in the horizontal direction. Only these subsampled fields, which are now divided into blocks of size 16x16 pixels, are used during the global estimation step.
We use the point of highest interest within each block, which has been determined by the
preprocessing module, and match a reference window of size MxN pixel around this point (fig.8).
This means that the sampling position with highest omnidirectional difference inside the block is
chosen as representative point for the entire block. Furthermore, matching is performed only for
those blocks which are part of the foreground region from the segmentation mask (if present), which
means that blocks within uniform background areas are not considered at all during the matching
process. If the foreground region covers the whole image, matching is performed on all blocks.
Let z=(x,y) be the sampling position of the particular highest-interest point (center of the
reference window) in the left field, that has been chosen to be matched. A full-search block
matching in horizontal direction is performed in order to find the corresponding block centered
around point $\tilde z = (x - d_z^{(t)} - d_{\mathrm{off}},\, y)$ in the right field. Herein, $d_z^{(t)}$ denotes the absolute-valued disparity vector from left to right at time t, and $d_{\mathrm{off}}$ the predefined disparity offset (note that the left-to-right shift is always negative with a parallel camera setup!). The reference window is compared with all corresponding windows (of size MxN as well) along the given horizontal search interval, defined by the disparity search range and the disparity offset. The disparity range is from 0 to 63, such that a maximum disparity shift of 126 pixels plus offset can be estimated with respect to the non-subsampled image field.
In order to select the best match among the allowed displacements, a matching criterion based on
temporal smoothing and mean absolute difference (MAD) is used. The MAD is given as
$$\mathrm{MAD}\!\left(d_z^{(t)}\right) = \frac{1}{M \cdot N} \sum_{i=-M/2}^{M/2} \sum_{j=-N/2}^{N/2} \left| I_l(x+i,\; y+j) - I_r\!\left(x+i-d_z^{(t)}-d_{\mathrm{off}},\; y+j\right) \right| . \qquad (1)$$
Using the MAD, the cost function is defined as:
$$F\!\left(d_z^{(t)}, d_z^{(t-1)}\right) = \mathrm{MAD}\!\left(d_z^{(t)}\right) + \alpha \cdot \left| d_z^{(t)} - d_z^{(t-1)} \right| , \qquad (2)$$
with $d_z^{(t)}$ as the current displacement vector, $d_z^{(t-1)}$ the temporal prediction vector, and the weight coefficient α, which should be set to an approximate value of 0.2. The block sizes were set to M=13 horizontally and N=9 vertically for the global matching stage. To realize a simple hardware structure, the quotient 1/117 was omitted from (1), and α was set to 16 in (2). The temporal prediction vector $d_z^{(t-1)}$ is taken from the same position z in the previously-estimated disparity field at time t−1. The previous-field dense disparity maps, which are stored internally, are available from the dense-field interpolation module.
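In the integer form used for hardware (normalisation omitted, α=16), the full search with cost function (2) can be sketched as follows; the NumPy formulation and any parameter defaults beyond those stated above are our own:

```python
import numpy as np

def best_disparity(left, right, x, y, d_prev, d_off,
                   M=13, N=9, d_range=64, alpha=16):
    """Full-search L->R matching of the MxN window centred at (y, x) in `left`.
    Integer hardware form of eqs (1),(2): sum of absolute differences
    (the 1/(M*N) quotient is omitted) plus alpha * |d - d_prev|.
    Returns the disparity d in [0, d_range) minimising the cost."""
    hm, hn = M // 2, N // 2
    ref = left[y - hn:y + hn + 1, x - hm:x + hm + 1].astype(np.int64)
    best_d, best_cost = 0, None
    for d in range(d_range):
        xs = x - d - d_off                  # left-to-right shift is negative
        cand = right[y - hn:y + hn + 1, xs - hm:xs + hm + 1].astype(np.int64)
        cost = int(np.abs(ref - cand).sum()) + alpha * abs(d - d_prev)
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d

# Demo: the right field is the left field shifted by d_true + d_off columns.
rng = np.random.default_rng(0)
left = rng.integers(0, 256, size=(64, 200)).astype(np.int64)
d_true, d_off = 10, 30
right = np.roll(left, -(d_true + d_off), axis=1)
d_hat = best_disparity(left, right, x=120, y=20, d_prev=d_true, d_off=d_off)
```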
For each position $\tilde z$ within the search interval, the function value $F(d_z^{(t)}, d_z^{(t-1)})$ is calculated. The particular sampling position $\tilde z$ which minimizes this cost function is the corresponding point of z. Once $\tilde z$ has been estimated, the same procedure is repeated from right to left, using $\tilde z$ as
reference sampling position on the right image, which means that the reference window of size MxN
is now centered at this position. The search window of the same size is placed on the left image and
shifted within a search interval of 64-pixel width again. Now, the temporal prediction is taken from
the R→L dense disparity memory. The correspondence search is then carried out without further consideration of the previously-found L→R disparity $d_z^{(t)}$. The MAD is now given as
$$\mathrm{MAD}\!\left(d_{\tilde z}^{(t)}\right) = \frac{1}{M \cdot N} \sum_{i=-M/2}^{M/2} \sum_{j=-N/2}^{N/2} \left| I_r(\tilde x+i,\; y+j) - I_l\!\left(\tilde x+i+d_{\tilde z}^{(t)}+d_{\mathrm{off}},\; y+j\right) \right| . \qquad (3)$$
Using the MAD, the cost function is defined as:

$$F\!\left(d_{\tilde z}^{(t)}, d_{\tilde z}^{(t-1)}\right) = \mathrm{MAD}\!\left(d_{\tilde z}^{(t)}\right) + \alpha \cdot \left| d_{\tilde z}^{(t)} - d_{\tilde z}^{(t-1)} \right| . \qquad (4)$$
Let us denote the estimated L→R disparity with z as reference sampling position as $d_z^{(t)}$, and the estimated R→L disparity with reference sampling position $\tilde z = (x - d_z^{(t)} - d_{\mathrm{off}},\, y)$ on the right image as $d_{\tilde z}^{(t)}$. Then, a bidirectional consistency check [13] is performed in order to reject outliers. If the vector difference condition

$$\left| d_z^{(t)} - d_{\tilde z}^{(t)} \right| \le 1 \ \mathrm{pel} \qquad (5)$$

is violated, the two vectors $d_z^{(t)}$ and $d_{\tilde z}^{(t)}$ are eliminated from both sparse disparity fields. This verification provides a reliability criterion for the disparity estimation, such that the remaining estimates can be considered correct disparity values.
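A one-scanline sketch of the cross-consistency check (5); the dict-based sparse-field representation and the INVALID marker value are our own simplification:

```python
INVALID = -1   # marker for rejected positions (value chosen here, not specified)

def cross_check(d_lr, d_rl, d_off, max_diff=1):
    """Bidirectional consistency check, eq. (5), for one scanline.
    d_lr maps a left position x to its L->R disparity; d_rl maps a right
    position to its R->L disparity. For each L->R vector, the R->L vector at
    the matched right position x - d - d_off is looked up; if the absolute
    disparities differ by more than max_diff pel, both are eliminated."""
    for x, d in list(d_lr.items()):
        x_r = x - d - d_off              # matched position in the right field
        d_back = d_rl.get(x_r)
        if d_back is None or abs(d - d_back) > max_diff:
            d_lr[x] = INVALID
            if x_r in d_rl:
                d_rl[x_r] = INVALID

d_lr = {100: 20, 120: 30}
d_rl = {40: 20, 50: 35}                  # 40 = 100-20-40, 50 = 120-30-40
cross_check(d_lr, d_rl, d_off=40)        # keeps (100, 40); rejects (120, 50)
```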
III.3 Local disparity estimation
Local disparity estimation is also a block matching procedure, but is applied to the full-resolution
(not-subsampled) image fields. The block center positions z=(x,y) are now 4 pixels apart in the horizontal and vertical directions. The reference windows have a size of M=9 pixels horizontally and N=5 pixels vertically, with the position z always at the block center, such that adjacent windows overlap by a fixed amount. Instead of using full search (as in global estimation), only the range
defined by candidate vectors is tested with an additional search range of ±2 pixel horizontally
beyond the minimum and maximum candidates. We are using 10 candidates:
− 6 from the output of the global estimation, unless they are part of the background segment ;
− 3 candidates from neighboring blocks, which were already calculated during local estimation;
− 1 from the temporally-preceding displacement field at the same spatial position.
The positions of candidates and the procedure of search range determination are illustrated in fig.9.
Herein, the matching windows used during global matching are marked as hatched regions, with the
center anywhere within the global block areas of size 32x16 (this is the not-subsampled equivalent
of the 16x16 block size used during global estimation). The global candidates are the one belonging to the active area (of which the actual local block is part), its left and right neighbors, and the three neighbors below. It may happen that neither a global nor a local candidate exists. This will
be the case, e.g. when all candidates are within uniform background areas, or if they did not pass the
bidirectional consistency check. In addition, the search range is cut by all disparities which would
point into the background part of the opposite field's segmentation mask. In the case where fewer than 4 positions would have to be checked, no matching is performed, and the positions in the sparse disparity field are marked a priori as INVALID.
Those candidates, which originated from global matching, have to be multiplied by two, because
they were calculated on the basis of subsampled images. The search range of the local matching
procedure is determined on the basis of all candidates. For this purpose, the minimum (MIN) and
maximum (MAX) disparity values among the candidates are determined, and the search range
reaches from MIN-2 to MAX+2, but is limited between 0 and 127.
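The candidate-based search-range determination can be sketched as follows (the None-based INVALID handling and the function interface are our own; the cut by the segmentation mask is omitted for brevity):

```python
def local_search_range(candidates, extra=2, d_min=0, d_max=127, min_positions=4):
    """Search range for local matching: [MIN - extra, MAX + extra] over the
    valid candidates (global candidates already doubled to full resolution),
    clamped to [d_min, d_max]. Returns None when fewer than min_positions
    disparities remain to be checked, i.e. the position is marked INVALID."""
    valid = [c for c in candidates if c is not None]
    if not valid:
        return None
    lo = max(d_min, min(valid) - extra)
    hi = min(d_max, max(valid) + extra)
    if hi - lo + 1 < min_positions:
        return None
    return lo, hi

r1 = local_search_range([10, None, 14, 12])  # search range (8, 16)
r2 = local_search_range([0])                 # clamped to (0, 2): 3 positions only
```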
The rest of the procedure is very similar to global estimation (1)-(4), with the exception of the search range(s) and block sizes. Again, the search criterion is a combination of MAD and temporal smoothness with approximately the same α-parameter (≈0.2) in (2),(4). In order to omit the division in (1),(3), where the quotient would now be 1/45, α was set to a value of 8.
Local displacement estimation is also performed bidirectionally, in order to apply the cross-
consistency check (5) on the estimation result. As the first step, L→R disparity estimation is
performed. Hence, the positions z in the L image field are equidistant, while the best-matching
positions in the R image field are not necessarily equidistant, but anywhere on the same line. Sparse
disparity fields are generated with valid values at each fourth row. For R→L disparity estimation,
the same search range is used as for L→R estimation.
Regarding the output of the estimation, the number of disparity values calculated per image field
is not fixed. This is caused by possible INVALID values in the sparse disparity field. The maximum
possible number of estimated disparity values is 1/16 of image field size.
Another object-based postprocessing step is performed if a foreground/background segmentation
mask exists. Among the first four valid disparity values at the left and right sides of the foreground part of the mask, it is checked whether the absolute disparity decreases smoothly towards the outermost one. If this is not the case, the values which violate this condition are eliminated, i.e. set to INVALID (indicated by broken lines in fig.10).
III.4 Generation of rowwise-dense disparity fields
After estimation of disparity values at sparse positions, as it is finally performed by the local
estimation procedure, the dense disparity fields are generated by bilinear interpolation. Herein, we
simply ignore the INVALID positions within the sparse disparity fields and generate the dense fields
only from valid ones. The bilinear interpolation is at this stage performed only horizontally, within
the rows where the local disparity estimation derives disparity values. This is called the rowwise-
dense disparity field, which is defined for rows 2,6,10,... of the image fields. This rowwise-dense
field is used exclusively for the feedback (temporal prediction) of disparities during estimation.
Furthermore, an extract from both rowwise-dense (L→R and R→L) fields is transformed into a
unique command map representation for encoding, which will be described in the next section.
After the interpolation, a vertical 7-tap median filter is applied to both disparity fields at these rows, i.e. the median mask contains the values at the same x-position from the 3 rows below and the 3 rows above the actual point in the rowwise-dense disparity field (see fig.11). This median filter is
necessary to reject outliers, and to introduce vertical dependencies between the estimated disparity
values, which were so far calculated more or less independently of each other. If any rows remain in the rowwise-dense field which do not contain valid disparities at all (this happens, e.g., when a segmentation mask is present, usually at the top of the image), these are filled by vertical linear interpolation, again starting with a zero value at the top border.
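A sketch of the dense-field generation for one group of rows; the boundary handling and data layout (NumPy arrays, None for INVALID) are our own assumptions:

```python
import numpy as np

def rowwise_dense(sparse_row, width):
    """Horizontal linear interpolation of one sparse disparity row (valid
    samples 4 pixels apart); INVALID samples (None) are simply ignored and
    the interpolation uses only the valid ones. Outside the outermost valid
    samples the nearest valid value is held (an assumption of this sketch)."""
    xs = [x for x, d in sparse_row if d is not None]
    ds = [d for _, d in sparse_row if d is not None]
    return np.interp(np.arange(width), xs, ds)

def vertical_median7(rows):
    """Vertical 7-tap median over the rowwise-dense field: at each x the mask
    holds the values of the current row and the 3 rows below and above."""
    stack = np.stack(rows)
    out = stack.copy()
    for r in range(3, len(rows) - 3):
        out[r] = np.median(stack[r - 3:r + 4], axis=0)
    return out

row = rowwise_dense([(0, 0.0), (4, None), (8, 8.0)], width=9)  # ramp 0..8
field = vertical_median7([np.zeros(4)] * 3 + [np.full(4, 100.0)] + [np.zeros(4)] * 3)
```

The median call rejects the outlier row in the demo field, illustrating how isolated wrong estimates are removed while vertical dependencies are introduced.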
In fig.12, images representing the horizontal component of the dense L→R disparity field are displayed for the tenth frame pair of the sequences ANNE and CLAUDE. Low gray levels
represent large negative horizontal vector components, whereas high gray levels represent large
positive horizontal vector components. A vector with horizontal component 0 is represented by the
gray value of 128.
IV. DISPARITY MAP ENCODING
In our system, the information about the disparity map must be transmitted along with the encoded
data stream for the left and right camera signals. This is reasonable, because in multipoint
videoconferencing it would be superfluous to determine the disparity parameters for all participants
at each site. For the purpose of disparity map encoding, a new type of command map representation
has been developed, which is easily compressible and bears information about both (L→R and
R→L) disparity maps [14]. To transform disparities of the rowwise-dense field into the command
map representation we are using, several constraints are imposed onto the disparity fields :
− Only horizontal shifts can be treated (this was already a constraint from estimation) ;
− The disparity map must obey the ordering constraint, i.e. disparities within one line must not
cross each other ;
− The disparity map must be absolutely dense, i.e. no special treatment of occlusions is performed
(this demand is already fulfilled by the interpolation described in the last section).
IV.1 The Disparity Command map
If $D_l^{(t)}(x)$ is the L→R absolute-valued disparity at pixel position x and time t, and $D_l^{(t)}(x-1)$ that of the previous pixel position (and likewise $D_r^{(t)}$ for the R→L field), then the ordering constraint is violated if

$$D_l^{(t)}(x) - D_l^{(t)}(x-1) > 1 \quad ; \quad D_r^{(t)}(x) - D_r^{(t)}(x-1) < -1 . \qquad (6)$$
If the constraint is violated once at position $x_1$, the check must be iterated setting x←x+1, until a value is reached which no longer violates the condition

$$D_l^{(t)}(x) - D_l^{(t)}(x_1) \le x - x_1 \quad ; \quad D_r^{(t)}(x) - D_r^{(t)}(x_1) \ge x_1 - x . \qquad (7)$$
Fig.13 shows an example of a disparity map violating the ordering constraint (a) and the correction
(b). The command map, which is a transformed representation of the ordering-constraint checked
disparity map, indicates where the correspondences can be found between the left and right image
field. Two commands "match left" (ML) and "match right" (MR) are used, that indicate the
propagation of corresponding points along the scanlines of the left and right image fields. If the
correspondence "halts", e.g. when a number N of points of the left image are referenced to the same point in the right image, N subsequent ML commands are produced. We call this case a "left contraction".
The other way round, in the case of a "right contraction", several subsequent MR commands are
released. In the "normal" case, where disparity remains constant along the scanline, the sequence
alternates ML-MR-ML-MR-... . An example is illustrated in fig.14. Starting at the left border of the
images, which is position 0 in the right image, there is one correspondence to the left image at the
same position (this will always be present, and need not be explicitly encoded), and 4 more
correspondences to subsequent positions in the left image. Hence, the command map starts with
four ML commands, which characterize the four additional correspondences of four left positions to
the right position zero. Then, the correspondence proceeds one position in the left image (ML), and
in the right image, too (MR). This happens once more, but then two additional MRs fit to the same
left position. Finally, there is again one ML and three MRs. The complete command sequence for
the example shown here is ML-ML-ML-ML-ML-MR-ML-MR-MR-MR-ML-MR-MR-MR. The total
number of commands per scanline is 2⋅xsize-2, where xsize is the number of pixels in the scanline.
The transformation from the disparity map to the command map is very simple. Suppose we are
using the L→R disparity field. Since the estimator does not allow any disparities to point outside the
image, and the dense field interpolation always starts with zero disparity at the borders, the disparity
value of the first position of the scanline (x=0) will be zero. Now, whenever the disparity value at
position x+1 is larger than the value at x, we have the case of left contraction ; if it is smaller, we
have the case of right contraction; if it is equal, we have the normal case. Specifically, the number of ML commands to be released before the next MR command is equal to $D_l^{(t)}(x+1) - D_l^{(t)}(x) + 1$ in the case of zero or positive difference, and the number of MR commands to be released before the next ML is equal to $D_l^{(t)}(x) - D_l^{(t)}(x+1) + 1$ in the case of negative difference. It is easy to see that this
automatically corrects disparity fields violating the ordering constraint (6),(7). For the R→L field,
everything inverts; a smaller disparity value at the next position indicates a left contraction, and a higher value a right contraction. It is left to the interested reader to determine the setting of ML and MR commands.
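The transformation can be written compactly with two pointers, one per image; the max() clamp realises the automatic ordering-constraint correction. This formulation, and the disparity row below (reconstructed so as to reproduce the command sequence of the fig.14 example; it is not given explicitly in the text), are our own:

```python
def command_map(D):
    """Transform one scanline of the L->R rowwise-dense disparity field into
    the ML/MR command map. D[x] is the absolute disparity at left position x,
    with D[0] = D[-1] = 0 by construction; the matched right position of left
    position x is x - D[x]. ML advances the left pointer, MR the right one;
    the max() keeps the right pointer from moving backwards, which silently
    corrects ordering-constraint violations."""
    xsize = len(D)
    cmds, r = [], 0
    for l in range(1, xsize):
        cmds.append("ML")                       # advance the left pointer
        target = min(max(r, l - D[l]), xsize - 1)
        while r < target:
            cmds.append("MR")                   # advance the right pointer
            r += 1
    cmds += ["MR"] * (xsize - 1 - r)            # flush to the right border
    return cmds

cmds = command_map([0, 1, 2, 3, 4, 4, 2, 0])    # 14 commands for xsize = 8
```

For this row the output is ML-ML-ML-ML-ML-MR-ML-MR-MR-MR-ML-MR-MR-MR, matching the example sequence, and the length is 2·xsize−2 as stated.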
Basically, both disparity fields contain approximately the same information, and are highly
redundant due to the application of the bidirectional consistency check. Major differences may be
present in the areas of heavy contractions (being in most cases originally occlusions), which are
indicated by many L→R (R→L) vectors pointing to only one or a very small number of pixels in the
right-view (left-view) image. Indeed, we found that R→L disparities produced by the estimator
algorithm are more reliable at left occlusions, while the L→R disparities produce better interpolated
image quality at the right occlusions. Since we deal with videoconferencing sequences, we can
employ a very simple model for head-and-shoulder scenes, which is based on the convex surface of
the human head and body [11]. Then, it is clear that left occlusions can occur only to the left of the
center of the foreground shape, while right occlusions will be present only to the right of this
point. Hence, we can divide each scanline of disparity values into two parts, which are separated by
the mid position of the active area under the foreground object (fig.15). For the left part, the R→L
disparity field is used, while for the right part, the L→R field is better suited. The split position is
determined only from one of the masks (preferably R, since we start with this field), because the
mid positions do not necessarily coincide. In addition, an ordering-constraint check must be
performed at the crossover point, taking into account both fields.
If no segmentation mask is present, a fallback mode is used, which relies on only one disparity
field. As the estimation process starts with the L→R field, we chose to use this one.
IV.2 Encoding of the Command map
The command map is a compact representation of a disparity field checked for ordering constraint,
and has an extremely small amount of inter-symbol redundancy [14]. With two commands, we need
2⋅xsize-2 bits per scanline to represent the command map. With CCIR-601 video (xsize=720), this
results in a transmission rate of 5.177 Mb/s, if we transmit disparity values only at each fourth
scanline. Since we have a maximum of 4 Mb/s allocated for lossless transmission of disparity values
in our overall system, further reduction of the data rate is necessary. We have found a Lempel-Ziv based
algorithm [15] capable of reducing the rate by a further factor of approximately two ; this
algorithm can easily be implemented in realtime hardware, using a conventional microcontroller.
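The rate figure quoted above can be checked with a few lines of arithmetic, assuming CCIR-601 resolution, 25 frames per second and transmission of every fourth scanline :

```python
xsize, ysize, fps = 720, 576, 25        # CCIR-601 luminance resolution, frame rate
bits_per_line = 2 * xsize - 2           # one bit per ML/MR command
lines_per_frame = ysize // 4            # disparities only at every fourth scanline
rate = bits_per_line * lines_per_frame * fps
print(rate)                             # 5176800 bit/s, i.e. about 5.177 Mb/s
```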
The remaining redundancy stems firstly from the limited disparity range, which does not allow
arbitrary command-map sequences ; even more important, however, are the interline, intraline and
temporal redundancies of the disparities, which are caused by the smoothness of object surfaces and
allow the code to adapt to specific, more probable ML/MR sequences.
The encoding algorithm takes into account these characteristics. The update heuristic uses the
frame being coded and the previous frame to update the Lempel-Ziv algorithm sequence memory.
Sequences of ML/MR commands from lines adjacent to the line being coded and corresponding
lines in the previous frame have been assigned a gradually increased priority over sequences gained
from the rest of the sequence memory. Thus, during the coding process, these sequences are used
first to compress the incoming command sequences. Tests have shown that even by limitation of the
memory to these "near" sequences, the compression is not greatly affected : the degradation
incurred by the reduced number of possible sequence memory positions is more than offset
by the reduction of the absolute number of positions that must be represented by specific codes.
Limitation of sequences also reduces the hardware complexity.
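The project's coder itself is hardware-specific, but the basic Lempel-Ziv mechanism it builds on can be sketched as follows. This toy LZ78 variant treats the command stream as a string over {0,1} (e.g. ML=0, MR=1) and deliberately does not model the priority heuristic for adjacent lines and frames described above :

```python
def lz78_encode(bits):
    """Toy LZ78 coder: emit (dictionary index of longest known prefix, next bit)
    pairs while growing the phrase dictionary."""
    dictionary = {'': 0}
    out, phrase = [], ''
    for b in bits:
        if phrase + b in dictionary:
            phrase += b                      # extend the current phrase
        else:
            out.append((dictionary[phrase], b))
            dictionary[phrase + b] = len(dictionary)
            phrase = ''
    if phrase:                               # flush an unfinished phrase
        out.append((dictionary[phrase[:-1]], phrase[-1]))
    return out
```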
V. INTERMEDIATE VIEWPOINT INTERPOLATION
The task of the interpolator is to generate two images from virtual cameras, based on two images
from real cameras and their disparity map. The position of the virtual cameras should be related to
the position of the viewer’s eyes. We have developed an interpolation concept which can decide
dynamically, based on the degree of contraction, which areas of the intermediate image are truly
interpolated, i.e. taken from both images, and which areas are possibly subject to occlusions and
hence must only be projected from the corresponding area of one of the left- or right-view images.
The unique representation of disparities by the command map has three advantages that can be
exploited by the interpolator :
− the vertical interpolation of the missing rows of disparities (the rowwise-dense map being
defined so far only at each fourth row) is very simple using the command map ;
− the determination of pixel addresses for the corresponding points, necessary during interpolation,
can be performed only by increment operations ;
− it is very easy to check the grade of contraction (multiple correspondences to one point), to
decide whether it is more appropriate to perform extrapolation from one image instead of
interpolation from both.
Fig.16 shows the block diagram of the interpolator. We can recognize four main parts:
− Command expander
− Parameter controller
− Left eye image generator
− Right eye image generator
The command expander transforms the four times subsampled command map to a dense (not
subsampled) map by linear interpolation. The parameter controller generates control parameters for
the image generators, e.g. the relative position of the virtual cameras. The right- and left-eye image
generators are the actual image interpolators.
V.1 Command expander
The task of the command expander is to transform a four times vertically subsampled command
map into a dense command map. To do this we need both interpolation and extrapolation of the
command map. In the following example we will explain how to use these two options. Scanlines
are numbered from 0 to 287 for each field.
Since the original command map is sampled for example at scanlines 2, 6, 10, …, and 286, the
lines 0 and 1 have to be extrapolated from line 2, line 287 has to be extrapolated from line 286, and
all other lines have to be interpolated from the two lines that enclose them.
For the interpolated lines, we use linear interpolation. This operation is very
easily done with the two-command map. Let us consider two consecutive scanlines in the
subsampled command map at positions y and y+4. We are going to create an intermediate command
scanline at relative position α. For α=0, the scanline at position y is found, for α=1 the scanline at
y+4. So, for our application, the values α = ¼, α = ½ and α = ¾ are used.
The command map scanlines describe disparity paths. Fig.17 shows the two original disparity paths
A and B in black and an interpolated one I in dark grey with α = ½. The axes are the horizontal
positions of the left and right image scanlines. The disparity paths go from (0,0) to (719,719) in this
diagram.
The method of interpolation is very simple. We start at (0,0) for all three paths at time instant
zero. At every time instant, we execute one command of each original command scanline. We call
these commands ACOM and BCOM ; each can be either ML or MR. An ML means a step in the
left image scanline, in fig.17 this is a horizontal step from left to right. An MR means a step in the
right image scanline, in the figure this is a vertical step from bottom to top. For the I command
scanline we construct a command by ICOM = (1-α) * ACOM + α * BCOM. This gives us the
analytical interpolated I path.
For values of α strictly between 0 and 1, intermediate types of commands other than ML or MR
arise for the I path. As a consequence, the analytical I path does not in general fit onto the grid
and can no longer be described by the normal commands ML and MR. We solve this by
rounding off the path to the nearest grid point. The disparity path in light grey shows a rounded
version of the original I path in dark grey.
Although the rounded light grey path is used for output, the analytical dark grey path has to be
calculated as well. Otherwise, round-off errors in the path will accumulate very fast. As the analytical
path is constructed step by step at each time instant, the rounded path can be constructed by
choosing either ML or MR. The goal is to minimize the distance between the endpoints of the
current analytical and rounded path.
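A minimal software sketch of this interpolation, with hypothetical names, could look as follows ; each step executes one command of each original scanline, accumulates the analytical I path, and emits the ML or MR command that keeps the rounded path closest to it :

```python
def interpolate_commands(a, b, alpha):
    """Interpolate two command scanlines (lists of 'ML'/'MR') at relative
    position alpha, rounding the analytical path to the nearest grid point."""
    out = []
    ax = ay = 0.0    # analytical path position (left, right image coordinate)
    rx = ry = 0      # rounded path position
    for acom, bcom in zip(a, b):
        # execute one command of each original scanline:
        # ML = step in the left coordinate, MR = step in the right coordinate
        ax += (1 - alpha) * (acom == 'ML') + alpha * (bcom == 'ML')
        ay += (1 - alpha) * (acom == 'MR') + alpha * (bcom == 'MR')
        # choose ML or MR so the rounded path stays closest to (ax, ay)
        if abs(rx + 1 - ax) + abs(ry - ay) <= abs(rx - ax) + abs(ry + 1 - ay):
            out.append('ML'); rx += 1
        else:
            out.append('MR'); ry += 1
    return out
```

Because the analytical position is carried along exactly, the rounding decision at each step cannot accumulate error, in line with the argument above.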
V.2 Parameter controller
Fig.18a/b shows as an example the left and right camera image of the MAN sequence, first frame.
Fig.18c shows the associated command map disparity field, where the ML command is indicated as
black, MR as white, and the "normal" alternation of ML-MR as grey.
Fig.19 illustrates, for the example of one scanline at the tip of the nose of these images, some
important preliminaries that have to be observed by the interpolator and must be regulated by the
parameter controller. The scanline showing grey values from the left camera is located on top, that
from the right camera at the bottom. All corresponding points are connected as they were found by
the disparity estimator. The representation of fig.19 is very closely related to image interpolation :
Every horizontal cross section gives the luminance values along the scanline of a virtual camera at a
specific position between the real cameras. This relative position is indicated by the parameter S,
which varies from -½ (position of left camera) to ½ (position of right camera).
For the stereoscopic presentation, we take two virtual cameras with positions S1 and S2, S1 < S2,
which are the positions of the virtual left and right cameras, respectively. If a stereoscopic scene
is to appear behind the display, the nearest point must have a disparity shift of zero. However, in
our parallel camera configuration, the nearest point has the largest disparity shift dmax, such that it is
necessary to introduce a shift correction. Moreover, we must guarantee that the position of that point
on the screen remains the same when the view angle is changed. Hence, the shift relative to the right
camera at a specific position S, as it is shown in fig.19, must be
SHIFT = dmax ⋅ (½ − S) (8)
pixels to the right. In this example, the nearest point is the lightest value, which is the tip of the
nose.
Now, the algorithm has two input parameters: the relative virtual camera position S and the shift
parameter SHIFT. The S parameter is a real number between -½ and ½, the shift parameter an
integer that can be positive or negative, depending on whether the left or right camera position is
interpolated. Due to the horizontal shift, it is possible that some pixels on the virtual scanline are not
in the visible region. In the example of fig.19, this is the case for the rightmost pixels. Using
information from the visible region to define the luminance and chrominance of these pixels is not
feasible. Therefore, we chose to set these pixels to black values. Specifically, it is necessary to set
dmax pixels to black value at the left side of the left-view image (if S=-½ is selected) and do the same
with dmax pixels at the right side of the right-view image (if S=½ is selected). At intermediate
positions, the number of black pixels BLACKleft at the left side is equal to SHIFT from (8), and the
number of black pixels BLACKright at the right side is equal to dmax-SHIFT.
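For this fixed-S case, the arithmetic of eq. (8) together with BLACKleft = SHIFT and BLACKright = dmax − SHIFT amounts to no more than the following ; the helper name and the rounding to whole pixels are assumptions :

```python
def shift_and_black(d_max, s):
    """Shift relative to the right camera (eq. 8) and black border widths for a
    virtual camera at position s in [-1/2, +1/2]."""
    shift = round(d_max * (0.5 - s))     # eq. (8), rounded to whole pixels
    return shift, shift, d_max - shift   # SHIFT, BLACK_left, BLACK_right
```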
The parameter controller has to generate two sets of four parameters each, one for the left eye
image generator and one for the right eye image generator. The headtracker information is vital for
this. The headtracker delivers the real numbers X, Y and Z. At this moment, we are
using only the X component (left/right head position), which has the same scale as S.
Furthermore, we need to define the real number DIST that is related to the eye distance and the
camera baseline. This gives the relative distance between the viewpoints S of the left and right
virtual cameras. A value of DIST=0.05 is presently set. We are still investigating a technique to adapt this
parameter to the Z information of the headtracker, which is the distance between display and viewer.
We now redefine the SHIFT parameter given in (8), such that zero shift is now obtained at
position S=0. All generated images with S<0 are then shifted towards left, and those with S>0 are
shifted towards right. At the same time, we assume that the blackening of pixels is performed after
the shift, such that always dmax/2 pixels are set to black at both sides of the images. The following
control parameters are then used:
Left control parameters
S1 = x − ½ ⋅ DIST ; SHIFT = dmax ⋅ S1 (9)
Right control parameters
S2 = x + ½ ⋅ DIST ; SHIFT = dmax ⋅ S2 (10)
Note that the two S parameters should always be in the range of [-½,½]. For extreme DIST values, S
has to be clipped. Fig.20 shows the effect of the linear relation between the S and SHIFT parameter.
The black parameters are chosen just large enough to ensure that always the same set of pixels on
the virtual scanline is visible, independent of position S.
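Assuming the redefined SHIFT of eqs. (9)-(10) and the clipping of S for extreme DIST values, the controller's computation can be sketched as follows (names hypothetical) :

```python
def control_parameters(x, dist, d_max):
    """Virtual camera positions and shifts per eqs. (9)-(10); x is the
    headtracker's horizontal component, dist the virtual eye distance."""
    clip = lambda s: max(-0.5, min(0.5, s))   # keep S in [-1/2, +1/2]
    s1 = clip(x - 0.5 * dist)                 # left virtual camera
    s2 = clip(x + 0.5 * dist)                 # right virtual camera
    return (s1, d_max * s1), (s2, d_max * s2)
```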
V.3 Image interpolators
Fig.21 shows one scanline of left and right images, their associated disparity field and an
intermediate virtual image. For each 2 pixels in all images, we have two Y values, one U and one V
value.
For each virtual Y-pixel, we determine which disparity vectors cross its area. Of those, we select
the ones most to the left. In the figure these are indicated by grey lines. After selecting a disparity
vector, we determine which Y-pixels of the left and right image are referenced by this vector, and
use a weighted average to create the value of the virtual Y-pixel:
Yvirtual = Wleft ⋅ Yleft + Wright ⋅ Yright (11)
The same procedure is applied to the U- and V-pixels, the only difference being that these components
have only half the horizontal size, so that twice the number of disparity values is defined for each pixel
area. Virtual Y-, U- and V-pixels are generated by separate processes, which can run in parallel.
In formula (11), two weights Wleft and Wright were introduced. Since the condition Wleft+Wright=1
must apply, we need to specify only one. For example, if we set Wleft=½, the interpolator would be
simplest ; however, to obtain better image quality, it is necessary to take into account the position
parameter S, and the contractivity of the disparity field described by the command map. The first
can be done by setting
Wleft = ½ − S , (12)
which for extreme virtual camera positions |S| ≈ ½ produces much better results. An extension to
this scheme is to make the weights dependent on the disparity field. If done correctly, it is possible
to adjust the weights in such a way that in occlusion areas data is only taken from one image. In a
two-command disparity map, we cannot see the difference between an occlusion and a strong
contraction. However, in both cases it would be wise to take data only from one of the left and right
images.
To accomplish this, we use again the notion of the disparity path. Fig.22 shows a disparity path
representation of the disparity map scanline in fig.19. The horizontal and vertical axes are the
horizontal positions of the left and right image scanlines, respectively. Each point in this
representation corresponds to a possible match. The grey values indicate the probability of each
possible match. The disparity path goes from (0,0) to (719,719) for CCIR601 images. Each disparity
vector is one white point. The disparity range (allowed minimum and maximum disparities) is also
shown in white.
Next, we introduce the real number δ that indicates, for each disparity vector, the average
direction of the disparity path in a window around that vector. The length of the window is the even
integer N, and count(ML) is the number of ML commands in that window :
δ = 2 ⋅ count(ML) / N − 1 (13)
In left occlusion or strong left contraction areas, δ becomes +1, indicating a horizontal path in fig.22
around the point of interest. In right occlusion or strong right contraction areas, δ becomes -1,
indicating a vertical path in fig.22 around the point of interest. Now, we choose the weight Wleft to
be:
Wleft = W + δ ⋅ ∆W (14)
with
W = ½ − S ; ∆W = ½ + |S| (15)
It is easy to see that Wleft = 1 in a left occlusion, Wleft = 0 in a right occlusion and Wleft = W for a
normal object. This kind of adaptive weight setting is very easy to implement in hardware. Fig.23
shows the effect of the disparity-driven weighting procedure. Fig.23a is the weighting according to
(12), with white indicating Wleft = 1 and black indicating Wleft = 0. Fig.23b shows the disparity-
adaptive weighting according to (14)-(15) with a window length N=8, and fig.23c with a window
length of N=64. It can clearly be seen that Wleft approaches 1 even for S→½ in the areas of left
occlusions (left side of the head), and Wleft approaches 0 even for S→-½ in the areas of right
occlusions with adaptive weighting. In these areas, the disparity-weighted interpolation algorithm
produces sharper images than the simple interpolator using (12).
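A software sketch of this adaptive weighting follows. The exact form of the weights is assumed here as W = ½ − S and ∆W = ½ + |S| with a final clipping of Wleft to [0,1] ; these details, like the function name and window handling, are reconstructions for illustration, not taken verbatim from the hardware.

```python
def left_weight(commands, i, n, s):
    """Disparity-adaptive W_left around command index i, window length n."""
    window = commands[max(0, i - n // 2): i + n // 2]
    delta = 2 * window.count('ML') / len(window) - 1   # eq. (13)
    w = (0.5 - s) + delta * (0.5 + abs(s))             # eqs. (14)-(15), assumed form
    return min(1.0, max(0.0, w))                       # clip to [0, 1]
```

With this form, an all-ML window (left occlusion) yields Wleft = 1 and an all-MR window (right occlusion) yields Wleft = 0 for any position S, while a normal alternating window reduces to Wleft = W.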
VI. EVALUATION OF ALGORITHMS AND RESULTS OF COMPUTER SIMULATIONS
In the start phase of the PANORAMA project, we have compared different algorithms for disparity
estimation and viewpoint interpolation with regard to subjective quality and hardware feasibility.
Among the disparity estimators were a feature-based approach [16], two dynamic-programming
approaches and the hierarchical block matching, which is described in this paper. The latter one was
finally selected, because it showed superior performance and did not require higher hardware
complexity than any of the other proposals. Two interpolation concepts were investigated, one of
these an object-based approach [17], the other one the concept presented in this paper. Though the
former one performed slightly better in the areas of occlusions with uniform background scenes,
the latter one was selected, because it is less complex with respect to hardware realization and more
universally applicable, also to non-uniform background scenes. The subjective assessments of six
expert viewers were evaluated ; the selected scheme was attested a "good" picture quality.
The performance of the methods presented in this paper has been tested with a set of natural
stereoscopic sequences in extensive computer simulation experiments. These sequences were
recorded within the framework of the European projects RACE-DISTIMA and ACTS-
PANORAMA. The image resolution is 720x576 pixels. The stereoscopic sequences MAN, ANNE
and CLAUDE, representing typical videoconferencing situations, are given here as a reference. It is
interesting to note that these three sequences were taken with different camera setups. While MAN is
truly in the configuration we are planning to realize (parallel cameras, but only the overlapping area
is shown), the ANNE images were artificially shifted in order to avoid the large non-overlapping area
between left and right image view. In both of these sequences, the baseline was 50 cm, with a distance
of 2-2.5 m between cameras and person. The CLAUDE sequence was even captured with a
convergent (non-parallel) camera setup, however with a much smaller baseline (15 cm), but also
with smaller distance between person and camera (1.2 m). All the sequences presented here fulfil
the uniform background assumption, such that the foreground/background segmentation can be
performed as planned in the hardware system. However, the disparity estimation algorithm succeeded
not only with these sequences, but has also been applied successfully to other (non-uniform
background) sequences.
All underlying disparity estimation experiments have been performed using the parameters given
in section III. Some results illustrating the performance of the image interpolation method are given
in figures 24-26, which show left-view images, synthesized central viewpoints and right-view
images using the tenth frame pairs of the sequences MAN, ANNE and CLAUDE. The computed central
viewpoint is displayed between the two original stereo images.
We obtain a good stereoscopic virtual-viewpoint image quality when the difference in the S-
position between a synthesized left- and right-view image is between 0.05 and 0.1, which
corresponds approximately to the "natural" disparity due to the distance of human eyes. It is
remarkable that some occasional distortions, which become visible as a kind of temporal
flicker near the foreground/background borders in a monoscopic presentation, become
unnoticeable in the stereoscopic view. The telepresence illusion is very natural, rendering high
image quality.
Recently, we have also tested the performance of the system with sequences that were taken
with a convergent camera setup. These violate the precondition that no vertical disparities be
present, which implies that the operations of the disparity estimator and the viewpoint interpolator
should, strictly speaking, not be limited to a single scanline. However, we have found that the
quality of images interpolated with our system remains high if the convergence angle between
the cameras is not too large (up to 15 degrees). With a convergent camera setup, the SHIFT
parameters in (8) (9) and (10) can drastically be lowered, such that it is no longer necessary to set a
large number of pixels at the left and right sides of the image to black values. This implies that we
can utilize a larger area of the images, and can increase the size of the person by using cameras with
larger focal length.
VII. HARDWARE STRUCTURE
VII.1 Disparity estimator
We have started hardware implementation of the disparity estimation algorithm with the overall
structure that was shown in fig.3. The goal was to build a target hardware without dedicated chip
design, such that only standard chips, digital signal processors (DSPs) and field programmable gate
arrays (FPGAs) were to be used. While the preprocessing and postprocessing (interpolation) stages
can easily be realized within FPGAs, the matching stages pose the most demanding task with respect
to processing power. Nevertheless, we found it feasible to implement the matching kernel of the
global stage by using one FPGA and one DSP. An additional FPGA is needed for pixel access
control. The basic structure is similar to the local block matching module, which is indeed more
complicated and will be described with more detail in the rest of this section.
Figure 27 shows the hardware architecture of the local block matching module in more detail.
Input data of this module are the luminance of the left and right video images and estimates of the global
disparity vectors for a number of feature points. All input data form a multiplexed CCIR 656 data
stream, where left and right video replace the luminance and chrominance data, and feature point
coordinates as well as estimated global disparity vectors are transmitted in the horizontal blanking
interval.
Signal processing on the local block matching board is divided into several modules: two block
matching modules, one 7-tap median filter module, and three DSP modules. Mechanical and
electrical specifications of all six modules and their connectors comply with the TIM-40 standard
[18] to make interfacing and testing easy. While the DSP TIM-40 modules are commercially
available, the other three modules were developed especially for the needs of this project.
The complexity of present high-end FPGAs allows the design of an FPGA-based block matching
processor containing 20 cell elements for parallel MAD calculations as well as additional circuitry
for pixel addressing and MAD postprocessing. The applied principle of MAD calculation is the
parallel accumulation of absolute pixel differences by shifting measure pixels. As depicted in figure
28, the cell array contains 20 block matching cells, each calculating the MAD for a single search
position and passing on the measure pixel to the next cell. Thus, search pixels (St) and measure
pixels (Mt) have to be fetched only once for the complete estimation of a single block.
The block matching processor is depicted in figure 29. It interacts with a DSP for provision of
parameters used for address generation and MAD postprocessing, and with a dual-port memory
which sequentially stores both left and right image slices and outputs left and right pixels for
matching blockwise. The cost function is computed by adding the MAD to a counter
register which is preloaded with the temporal prediction vector. Thus, the expression
α ⋅ |dz(t) − dz(t−1)| is built by decrementing or incrementing the counter register. The last
step of finding the minimum of
the cost function requires the presence of a vector mask, which defines the valid positions to be
taken into account for the search. The mask is provided by the DSP and results from ten start
vectors for local search and the start vector transmitted to the processor.
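The cost function described here, a MAD plus a penalty preloaded from the temporal prediction, can be illustrated in plain software ; block size, search candidates and the left/right indexing convention below are assumptions for the sketch :

```python
def best_disparity(left, right, x, bsize, candidates, d_prev, alpha):
    """Return the candidate disparity minimizing MAD + alpha*|d - d_prev|."""
    best_d, best_cost = None, float('inf')
    for d in candidates:
        # mean absolute difference between measure block and shifted search block
        mad = sum(abs(left[i] - right[i + d]) for i in range(x, x + bsize)) / bsize
        cost = mad + alpha * abs(d - d_prev)   # temporal smoothness penalty
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d
```

In the hardware, the 20 cells of the block matching processor evaluate such candidate positions in parallel rather than in a loop.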
As depicted in figure 30, the estimation algorithm allows a parallel connection of Block-
Matching Processors, each calculating a separate image slice. The parameters used for local
disparity estimation require four processors working in parallel to meet the real-time requirements.
The processors are supervised by DSPs, which build the block-matching commands and start
vectors for each processor.
This hardware solution offers an area-efficient design, such that both the local stage and the dense field
generation can be placed on a single board. The whole disparity estimator will be realized on two
boards (actually, everything would fit on one board, but the two-board solution is more practical in
the prototype version, because the two matching stages are built by different partners).
VII.2 Synchronization unit
To increase the testability of the system, the signals at the interfaces between camera, disparity
estimator, encoder, decoder, interpolator and display conform to the parallel CCIR 656
standard. To compensate for the (variable) delay of the disparity estimator, a synchronization unit is
inserted to ensure synchronization of the disparity fields and the recorded image sequences at the
input of the encoders.
A block scheme of the synchronizer is shown in figure 31. Based on the implementation of the
disparity estimator, a maximum delay of one frame is assumed (= 40 msec). At start-up of the
system the start code for a new frame is searched for in the CCIR 656 data streams of the left and
right image sequences. When found, the synchronizer starts to buffer the image data and at the same
time starts to search for the code of a new frame in the CCIR 656 stream of the command map
(disparity field information). As soon as the latter is detected, the synchronizer starts to output the
image data and command map data to the encoders. If for some reason the data streams become
asynchronous again during operation, the detection procedure is repeated.
Synchronization at the output of the decoders is obtained by using the inserted timestamps
(PresentationTimeStamp, DecodingTimeStamp, SystemTimeClock) during (MPEG) encoding and
multiplexing.
VII.3 Interpolator
The hardware architecture of the interpolator is shown in figure 32. Since the goal of the project is
to realize a prototype system, we have chosen to implement the interpolator with off-the-shelf
memories and FPGAs (Altera 9k/10k series).
The decoded left and right images are demultiplexed and the luminance and chrominance data
are stored in separate 8-bit FIFO buffers. After demultiplexing the subsampled command map is
first expanded to a dense map and then stored in a 1-bit FIFO buffer. When enough data is available
in the FIFO buffers, the weights Wleft and Wright are calculated as described in section V, controlled
by the Algo block. The Algo block is the most critical part of the interpolator. It controls the pixel
positions in the intermediate view, taking into account the external parameters SHIFT, S, the
numbers BLACKleft and BLACKright of black pixels at the left and right side of the intermediate
image and the head position. Moreover it takes into consideration the different sampling of
luminance and chrominance data. The controller takes care of the timing and synchronization of
input and output data. The multipliers (X) are implemented as serial pipelined multipliers, since the
parallel version in the chosen technology was not fast enough to run at the intended 27 MHz. This
introduces a small extra delay of eight clock periods. The overall delay of the interpolator is about 1
µs, which is negligible compared to the total delay of the chain. The interpolator is implemented
twice, once for the virtual left camera and once for the virtual right camera.
VIII. SUMMARY AND CONCLUSIONS
A method for disparity estimation and image synthesis applied to 3D-videoconferencing with
viewpoint adaptation is introduced. The novelty of the disparity estimator is twofold : On one hand,
it has been optimized in order to achieve a very low hardware complexity, and on the other hand, it
shows robustness and accuracy with regard to the addressed application. The goal, to estimate
reliable displacement maps with extremely low computational costs, is reached by an improved
hierarchical Block-Matching method. The idea at the heart of the approach presented is to combine
previously estimated vectors to predict and correct each newly-calculated disparity vector, applying
a suitable cost function and taking into account the assumptions about the scene. The image
synthesis performs a weighted interpolation, wherein the specific weights for the left and right
camera images are adapted to the degree of contraction within the disparity field. The methods
reported in this paper were designed under the constraints to keep implementation costs low and to
supply intermediate views with good image quality. The performance of the presented methods was
tested by computer experiments using natural stereoscopic sequences representing typical
videoconferencing situations. The system is presently being realized in a hardware testbed by the project
partners. The disparity estimator and image synthesis method introduced in this paper are expected
to be capable of offering a realistic 3D impression with continuous motion parallax in videoconferencing
situations.
ACKNOWLEDGEMENTS
This work was supported by the European Commission within the ACTS PANORAMA project
under grant AC092. The sequences used for the experiments were recorded at HHI, Germany, and
CCETT, France.
REFERENCES
[1] N. Tetsutani, K. Omura and F. Kishino: "Wide-screen autostereoscopic displays employing head-position tracking," Opt. Eng., vol. 33, no. 11, pp. 3690-3697, Nov. 1994.
[2] K. Hopf, D. Runde and M. Böcker: "Advanced videocommunications with stereoscopy and individual perspective," in Towards a Pan-European Telecommunication Service Infrastructure - IS&N '94, Kugler et al. (eds.), Berlin, Heidelberg, New York: Springer, 1994.
[3] R. Börner: "2-channel lenticular system for 3D-imaging with tracked projectors," HHI Annual Report 1996, Berlin: HHI, 1997.
[4] W. Hoff and N. Ahuja: "Surfaces from stereo: Integrating feature matching, disparity estimation and contour detection," IEEE Trans. Patt. Anal. Mach. Intell., vol. PAMI-11, no. 2, 1989.
[5] J. Weng, N. Ahuja and T.S. Huang: "Matching two perspective views," IEEE Trans. Patt. Anal. Mach. Intell., vol. PAMI-14, no. 8, 1992.
[6] H.H. Baker and T.O. Binford: "Depth from edges and intensity based stereo," Proc. 7th Int. Joint Conf. Artif. Intell., pp. 631-636, Vancouver, Canada, Aug. 1981.
[7] Y. Ohta and T. Kanade: "Stereo by intra- and inter-scanline search," IEEE Trans. Patt. Anal. Mach. Intell., vol. PAMI-7, no. 2, pp. 139-154, Mar. 1985.
[8] P. Anandan: "Measuring visual motion from image sequences," PhD thesis, University of Massachusetts, 1987.
[9] I.J. Cox, S.L. Hingorani and S.B. Rao: "A maximum likelihood stereo algorithm," Computer Vision and Image Understanding, vol. 63, no. 3, pp. 542-567, 1996.
[10] B. Chupeau: "A multiscale approach to the joint computation of motion and disparity: Application to the synthesis of intermediate views," Proc. 4th Europ. Worksh. on Three-Dimension. Televis., pp. 223-230, Rome, Italy, Oct. 1993.
[11] E. Izquierdo and M. Ernst: "Motion/disparity analysis for 3D video conference applications," Proc. Int. Workshop on Stereoscopic and Three Dimensional Imaging, pp. 180-186, Santorini, Greece, Sept. 1995.
[12] H.P. Moravec: "Towards automatic visual obstacle avoidance," Proc. 5th Int. Joint Conf. Artif. Intell., p. 584, Cambridge, MA, Aug. 1977.
[13] M.J. Hannah: "A system for digital stereo image matching," Photogrammetric Engineering and Remote Sensing, vol. 55, no. 12, pp. 1765-1770, 1989.
[14] P.A. Redert and E.A. Hendriks: "Disparity map coding for 3D teleconferencing applications," to appear in Proc. SPIE Visual Communications and Image Processing (VCIP), San José, USA, 1997.
[15] J. Ziv and A. Lempel: "Compression of individual sequences via variable-rate coding," IEEE Trans. Inf. Theory, vol. IT-24, no. 5, pp. 530-536, Sept. 1978.
[16] J. Liu, I.P. Beldie and M. Wöpking: "A computational approach to establish eye-contact in videocommunication," Proc. Int. Workshop on Stereoscopic and Three Dimensional Imaging, pp. 229-234, Santorini, Greece, Sept. 1995.
[17] J.-R. Ohm and E. Izquierdo: "An object-based system for stereoscopic videoconferencing with viewpoint adaptation," in Digital Compression Technologies and Systems for Video Communications, N. Ohta, Editor, Proc. SPIE 2952, pp. 29-41, Berlin, Oct. 1996.
[18] Texas Instruments: "TIM-40, TMS320C4x Module Specification," Version 1.01, 1993.
List of Figures
Fig.1. Setup of stereoscopic cameras and screen, and variable position of virtual camera pair.
Fig.2. The complete system chain.
Fig.3. Invisible areas in parallel camera setup
Fig.4. Flowchart of disparity estimator algorithm.
Fig.5. Gradient-based operator applied for homogeneity decision.
Fig.6. a) Foreground region, b/c) highest-variance matching points (left/right image) of sequence ANNE.
Fig.7. a) Foreground region, b/c) highest-variance matching points (left/right image) of sequence CLAUDE.
Fig.8. Relation of block position, point of highest Moravec output and matching window.
Fig.9. Positions of 9 spatial candidate vectors (one candidate from temporal preceding
displacement field not shown).
Fig.10. Postprocessing of disparities at foreground/background segmentation mask borders (illegal
disparities indicated as dotted lines).
Fig.11. Vertical median filter after horizontal interpolation.
Fig.12. Dense disparity fields: a) of sequence ANNE, b) of sequence CLAUDE.
Fig.13. a) Violation of the ordering constraint, b) interpolated fill (violating vectors indicated as bold lines).
Fig.14. Disparity example for generation of disparity command map.
Fig.15. Usage of L→R and R→L disparities exploiting position of the foreground masks.
Fig.16. Block diagram of the interpolator.
Fig.17. Disparity paths of two scanlines, and the interpolated analytical and rounded paths.
Fig.18. Left (a) and right (b) camera images, and the associated disparity map (c).
Fig.19. Scanlines of left, right and virtual cameras.
Fig.20. The effect of the control parameters with respect to image position.
Fig.21. The definition of pixel values in the visible region of intermediate images, for the cases of
Y, U and V components.
Fig.22. The disparity path representation relating to Fig.14.
Fig.23. Weights Wleft for different interpolation positions (top: left position, bottom: right position): a) with weighting according to (12), b/c) with disparity-adapted weighting according to (14)-(15).
Fig.24. Left-view image, synthesized central viewpoint and right-view image, sequence MAN.
Fig.25. Left-view images, synthesized central viewpoints and right-view images, sequence ANNE.
Fig.26. Left-view images, synthesized central viewpoints and right-view images, sequence
CLAUDE.
Fig.27. Hardware architecture of the local block matching module.
Fig.28. Cell array.
Fig.29. FPGA-based block matching processor.
Fig.30. Structure for block matching.
Fig.31. Structure of the synchronizer.
Fig.32. Hardware structure of the interpolator.
Fig.1.
Fig.2.
Fig.3.
Fig.4.
Fig.5.
Fig.6.
Fig.7.
Fig.8.
Fig.9.
Fig.10.
Fig.11.
Fig.12.
Fig.13.
Fig.14.
Fig.15.
Fig.16.
Fig.17.
Fig.18.
Fig.19.
Fig.20.
Fig.21.
Fig.22.
Fig.23.
Fig.24.
Fig.25.
Fig.26.
Fig.27.
Fig.28.
Fig.29.
Fig.30.
Fig.31.
Fig.32.