
Model Driven Compression of 3-D Tele-Immersion Data

Jyh-Ming Lien and Ruzena Bajcsy

Electrical Engineering and Computer Sciences
University of California at Berkeley

Technical Report No. UCB/EECS-2006-170

http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-170.html

December 12, 2006


Copyright © 2006, by the author(s). All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.


Model Driven Compression of 3-D Tele-Immersion Data
Jyh-Ming Lien    Ruzena Bajcsy

{neilien,bajcsy}@eecs.berkeley.edu
University of California, Berkeley

Berkeley, CA USA

Abstract

Due to technological advances, it has become possible to reconstruct three-dimensional representations of physical environments in real time using inexpensive, consumer-grade cameras and PCs. These advances will make 3-D tele-immersive environments available to everyone in the near future. However, the data captured by a Tele-Immersion (TI) system can be very large, and compression is needed to ensure real-time transmission of the data. This real-time requirement makes compressing TI data a challenging task, and no current techniques address the issue effectively. In this paper, we propose a model driven compression. The main idea of this approach is to take advantage of prior knowledge of objects in the physical environment, e.g., human figures, and to represent their motions using just a few parameters. The proposed compression method provides tunable, high compression ratios (from 50:1 to 5000:1) with reasonable reconstruction quality. Moreover, the proposed method can estimate motions from the noisy data captured by our TI system in real time.

1 Introduction

Three-dimensional Tele-Immersion (TI) systems aim to provide a rich communication medium that captures three-dimensional data from physically separated environments and projects the captured data into a shared virtual space in real time [6, 25]. Due to recent technological advances, a 3-D TI system can be constructed cheaply using off-the-shelf products. Figure 1 shows a 180-degree full-body reconstruction from our TI system.


Figure 1: Point data captured from our TI system with 10 calibrated camera clusters. All clusters are registered to a global coordinate system. The full 3-D reconstruction is simply a union of the individual reconstructions without any post-processing, e.g., hole filling or noisy-point removal, due to the real-time constraint.

One of the major bottlenecks of current 3-D TI systems is transmitting the huge amount of data in real time. For example, our system, which generates 640×480-pixel depth and color maps from 10 camera clusters at a rate of 15 frames per second (fps), requires 220 MB/s of bandwidth. Several methods [15, 24] have been proposed to address this problem using image and video compression techniques, but the data volume remains prohibitively large. Currently, our system, using the compression method by Yang et al. [24], can only reach 4∼5 fps when connected with a single remote TI.
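As a sanity check on the quoted bandwidth, the figure can be reproduced by simple arithmetic. The per-pixel byte sizes below (3-byte color, 2-byte depth) are our assumption, since the text does not state them.

```python
# Back-of-the-envelope check of the 220 MB/s figure. The 3-byte RGB color
# and 2-byte depth per pixel are assumed; the paper does not state them.
width, height = 640, 480           # depth/color map resolution
bytes_per_pixel = 3 + 2            # color (RGB) + depth (16-bit), assumed
clusters = 10                      # camera clusters
fps = 15                           # frames per second

bytes_per_second = width * height * bytes_per_pixel * clusters * fps
mb_per_second = bytes_per_second / 2**20   # mebibytes per second
print(round(mb_per_second, 1))     # ~219.7, consistent with the quoted 220 MB/s
```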

Challenges of compressing TI data. Our TI data is composed of a stream of points, as shown in Figure 1. Methods have been proposed to compress static or dynamic point data. However, several issues have not yet been addressed in the literature, making our compression problem more challenging. The main challenges include:


[Figure 2 diagram — Step 1: Motion Estimation (input: point cloud Q and model P; output: T(P)). Step 2: Prediction Residuals, stored as a residual atlas, R = |Q − T(P)|.]

Figure 2: Model driven compression of the point data Q, where P is our model and T is the motion parameter (e.g., the skeleton configuration) that transforms P to Q. The prediction residual R is the difference between T(P) and Q.

1. real-time constraint,
2. little or no data accumulation, and
3. multiple or no data correspondences.

The real-time constraint prevents us from directly using most existing compression methods for point data, e.g., the methods [12, 19] designed for off-line compression (which usually take tens of seconds to several minutes to compress a few thousand points, while our system creates about 8∼10 thousand points in each frame). The real-time constraint also requires us to accumulate little or no data, i.e., the captured data should be transmitted as soon as it becomes available. However, data accumulation is usually necessary for methods that exploit temporal coherence [17, 3, 14, 22].

Finding correspondences between two point sets is another challenging problem. In many existing motion compression methods [13, 16, 26], the correspondence information is given. This is not the case in our TI data. Moreover, a point in one point set may correspond to zero (e.g., due to occlusion) or multiple (reconstructed in multiple views) points in the next point set.

Our approach. This paper aims to address the problems in TI data compression with a model driven approach. The main idea of this work is to take advantage of prior knowledge of objects, e.g., human figures, in the TI environments and to represent their motions using just a few parameters, e.g., joint positions and angles.

More specifically, our proposed method compresses points that represent human motion using motion estimation. Instead of transmitting the point clouds, we can simply transmit the motion parameters. This approach is based on the assumption that most points move under a rigid-body transform along with the skeleton. In reality, point movements may deviate from this assumption, e.g., due to muscle movements and hair or cloth deformations. Therefore, we further compress the deviations from the rigid movements. As we will see later (in Section 4), these deviations (which we call prediction residuals) are small in most cases. An overview of this model driven approach is shown in Figure 2.

Key results. Our compressor provides tunable, high compression ratios (from 50:1 to 5000:1) with reasonable reconstruction quality. Our method can estimate motions from the data captured by our TI system in real time (10+ fps).

2 Related Work

TI data compression. Only a few methods have been proposed to compress TI data. Kum et al. [15] proposed an algorithm to compress TI data in real time. Their method aimed to provide scalability to camera size, network bandwidth, and rendering power, and achieves a 5:1 compression ratio. Recently, Yang et al. [24] proposed a real-time method that compresses depth and color maps using lossless and lossy compression, respectively; it achieves a 15:1 compression ratio. Unfortunately, the volume produced by these real-time compression methods is still too large to be transmitted in real time.

Motion estimation. Extensive work has been done totrack human motion in images (see surveys [2, 9]). In thispaper, we focus on motion estimation from 3-D points.

A popular method called "Iterative Closest Points" (ICP) [4, 20] has been shown to be an efficient method for estimating rigid-body motion [23] in real time. For two given point sets P and Q, ICP first computes corresponding pairs {(pi ∈ P, qi ∈ Q)}. Using these corresponding pairs, ICP computes a rigid-body transform T that minimizes

    error = min_T Σ_i |T(pi) − qi|² .    (1)

The main step in solving Eq. 1 (see details in [4]) is to compute the cross-covariance matrix Σ_PQ of the corresponding pairs {pi, qi},

    Σ_PQ = (1/n) Σ_{i=1}^{n} (pi − µp)(qi − µq)^t ,    (2)

where µp and µq are the centers of {pi} and {qi}, resp., and n is the number of pairs {pi, qi}. As outlined in Algorithm 2.1, ICP iterates these steps until the error is small enough. An important property of ICP is that the error always decreases monotonically to a local minimum when the Euclidean distance is used for finding corresponding points.

Algorithm 2.1 ICP(P, Q), Besl and McKay [4]
Input. Two point clouds, P and Q

1: repeat
2:   find corresponding points {(pi ∈ P, qi ∈ Q)}
3:   compute error and T in Eq. 1
4:   P = T(P)
5: until error < τ
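The loop of Algorithm 2.1 can be sketched as follows. This is an illustrative implementation, not the authors' code: correspondences are found by brute force (where the paper uses a clustered kd-tree), and the rigid-body solve of Eqs. 1–2 uses the SVD-based (Kabsch) closed form rather than the quaternion method of Besl and McKay; both minimize the same error.

```python
import numpy as np

def best_rigid_transform(P, Q):
    """Rigid (R, t) minimizing sum |R p_i + t - q_i|^2 over corresponding
    rows of P and Q (SVD solution of Eqs. 1-2)."""
    mu_p, mu_q = P.mean(axis=0), Q.mean(axis=0)
    # Cross-covariance of the centered corresponding pairs (Eq. 2).
    Sigma = (P - mu_p).T @ (Q - mu_q) / len(P)
    U, _, Vt = np.linalg.svd(Sigma)
    # Correct a possible reflection so R is a proper rotation.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = mu_q - R @ mu_p
    return R, t

def icp(P, Q, tau=1e-6, max_iter=50):
    """Algorithm 2.1: alternate correspondence and rigid fit until error < tau."""
    P = P.copy()
    for _ in range(max_iter):
        # Nearest point in Q for each p (brute force here; the paper
        # searches a (n, c)-clustered kd-tree instead).
        d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(axis=2)
        matched = Q[d2.argmin(axis=1)]
        R, t = best_rigid_transform(P, matched)
        P = P @ R.T + t
        error = np.sqrt(((P - matched) ** 2).sum(axis=1)).mean()
        if error < tau:
            break
    return P, error
```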

ICP has also been applied to estimate motions of articulated objects, e.g., hand [8], head [18], upper [7] and full bodies [21]. In these methods, ICP is applied to minimize the distance between each body part and the point cloud Q, i.e., error(P, Q) = Σ_l error(Pl, Q), where Pl represents the points of the body part l. However, this approach lacks a global mechanism to oversee the overall fitting quality. For example, it is common that two body parts overlap or that a large portion of Q is not 'covered' by P. Therefore, considering the global structure as a whole is needed. To the best of our knowledge, the joint constraint [7, 21] is the only global measure studied.

In contrast, our ICP-based motion tracking method, called ARTICP, considers both joint and global fitting constraints. More specifically, we propose to estimate the global fitting error as:

    global error(P, Q) = |Fe(P) − Fe(Q)| ,    (3)

where the function Fe transforms a point set to a space where the difference is easier to compute. A more detailed discussion of Fe can be found in the next two sections. Note that Eq. 3 also computes the prediction residuals (shown in Figure 2).

3 Motion Estimation

We extend ICP to estimate the motion of dynamic point data in real time. Our approach uses a computationally more expensive (possibly not real-time) initialization to generate an accurate skeleton of the subject; a real-time tracking method then fits the skeleton to the point clouds captured from the rest of the movements. Several methods, e.g., [5], exist that can provide an initial skeleton to start the process. In this paper, we focus on the tracking aspect.

3.1 Model and Segmentation

Unlike most ICP-based work [18, 8, 7, 21], our method uses the segmentation of the initial frame to estimate the motion, taking advantage of the fact that the appearance of the initial frame, denoted by P, and that of the remaining frames are usually similar. We compute a segmentation of P using a given skeleton S. A skeleton S is organized as a tree structure with a root link, l_root, representing the torso of the body. Each link has the shape of a cylinder. After segmentation, each point of P is associated with a link: a point p is associated with a link l if l is closest to p. The closeness is computed as follows:

    closeness(p, l) =  0                      if np · l→p ≤ 0
                       1                      if p ∈ l             (4)
                       e^(−dist(p, ∂l)/r)     otherwise,

where np is the normal of p, l→p is the vector from (p's closest point on) link l to p, ∂l is the surface of the link, and r is the radius of the link. We denote the points associated with a link l and a skeleton S by Pl and PS, respectively.
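The segmentation rule of Eq. 4 can be sketched as follows. The cylinder-geometry helpers (axis_point, contains, surface_distance) are our own assumptions; the paper only states that each link is cylinder-shaped.

```python
import numpy as np

class CylinderLink:
    """A skeleton link modeled as a cylinder from `a` to `b` with radius `r`
    (illustrative geometry; these helpers are not spelled out in the paper)."""
    def __init__(self, a, b, r):
        self.a, self.b, self.r = np.asarray(a, float), np.asarray(b, float), r
        self.axis = self.b - self.a
        self.len2 = self.axis @ self.axis

    def axis_point(self, p):
        # Closest point to p on the axis segment a-b.
        s = np.clip((np.asarray(p, float) - self.a) @ self.axis / self.len2,
                    0.0, 1.0)
        return self.a + s * self.axis

    def contains(self, p):
        return np.linalg.norm(np.asarray(p, float) - self.axis_point(p)) <= self.r

    def surface_distance(self, p):
        # Distance from p to the cylinder surface (0 if inside).
        return max(0.0,
                   np.linalg.norm(np.asarray(p, float) - self.axis_point(p))
                   - self.r)

def closeness(p, n_p, link):
    """Eq. 4: 0 if the normal points away from the link, 1 if p lies
    on/in the link, exp(-dist(p, surface)/r) otherwise."""
    p = np.asarray(p, float)
    lp = p - link.axis_point(p)          # vector from the link toward p
    if np.asarray(n_p, float) @ lp <= 0:
        return 0.0
    if link.contains(p):
        return 1.0
    return float(np.exp(-link.surface_distance(p) / link.r))

def segment(points, normals, links):
    """Associate each point with its highest-affinity (closest) link."""
    return [max(range(len(links)), key=lambda i: closeness(p, n, links[i]))
            for p, n in zip(points, normals)]
```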

3.2 ICP of Articulated Body

Now we can start to fit the skeleton S to the next point cloud Q. Our goal is to find a rigid-body transform for each link so that the (global) distance between PS and Q is minimized. The main features of our ARTICP include:

• hierarchical fitting for faster convergence,
• an articulation constraint,
• guaranteed monotonic convergence to a local minimum, and
• global error minimization.

ARTICP is outlined in Algorithm 3.1. We will discuss each of these important features in detail.

Algorithm 3.1 ARTICP(S, Q)
Input. A skeleton S and the current point cloud Q.

1: cluster Q                ▷ see Section 3.3
2: q.push(l_root)           ▷ q is a queue
3: while q ≠ ∅ do
4:   l ← q.pop()            ▷ l is a link
5:   T ← ICP(Pl, Q)         ▷ T is a rigid-body transform
6:   for each child c of l do
7:     apply T to c
8:     q.push(c)
9: global fitting            ▷ see Section 3.4
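The traversal in Algorithm 3.1 can be sketched as follows. The names root, children, fit_link, and apply_T are illustrative placeholders for the skeleton structure, the per-link ICP call, and the transform application; the clustering and global-fitting steps are omitted here.

```python
from collections import deque

def articp(root, children, fit_link, apply_T):
    """Algorithm 3.1 sketch (names illustrative, not the authors' code):
    fit the root link first, push its transform down to its children,
    then fit each child in breadth-first order so limbs follow the torso.
    `children` maps link -> list of child links, `fit_link(l)` stands in
    for the per-link ICP(P_l, Q) call, and `apply_T(T, c)` applies the
    parent's transform to child c before it is fitted itself."""
    order = []                      # record the fitting order for inspection
    q = deque([root])
    while q:
        link = q.popleft()
        T = fit_link(link)          # e.g. ICP(P_link, Q) in the paper
        for child in children.get(link, []):
            apply_T(T, child)       # child inherits the parent's motion
            q.append(child)
        order.append(link)
    # A global fitting step (Section 3.4) would follow here.
    return order
```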

Clustering. ARTICP first clusters the current point cloud Q. Each cluster is a set of points with similar outward normals and similar colors. These clusters will be used in ICP for finding better correspondences. (See details in Section 3.3.)

Hierarchical ICP. Next, we invoke ICP to compute a rigid-body transform for each link. Note that we do this in such an order that the torso (root) of the skeleton is fitted first and the limbs are fitted last. By considering links in this order, ARTICP can increase the convergence rate by applying the transform not only to the link under consideration but also to the children of that link. The rationale behind this is that a child link, e.g., a limb, generally moves with its parent, e.g., the torso. If the child link does not follow the parent's movement, the movement of the child link is generally constrained and easy to track.

Articulation constraint. We enforce the articulation constraint, i.e., that the transformed links remain jointed, by replacing both of the centers µp and µq in Eq. 2 with the joint position.
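A minimal sketch of this constraint, with an SVD-based solve of Eq. 2 standing in for the quaternion method: taking the cross-covariance about the joint instead of the centroids yields a rotation that pivots about the joint, so the link stays attached there.

```python
import numpy as np

def jointed_rotation(P, Q, joint):
    """Eq. 2 with both centers replaced by the joint position: the
    cross-covariance is taken about the joint, so the recovered rotation
    pivots about the joint and maps the joint to itself (SVD used here
    in place of quaternions; an illustrative sketch)."""
    joint = np.asarray(joint, float)
    Sigma = (P - joint).T @ (Q - joint) / len(P)
    U, _, Vt = np.linalg.svd(Sigma)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    # Rotate about the joint; by construction the joint is a fixed point.
    return lambda p: (np.asarray(p, float) - joint) @ R.T + joint
```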

Global fitting. After we apply ICP to the individual links, the skeleton S is roughly aligned with the current point cloud Q but may still be fitted incorrectly if the skeleton is not considered as a whole. We describe a global fitting step in Section 3.4.

3.3 Robustness, Efficiency and Convergence

Real-time performance is one of the most important requirements of our system, and computing correspondences is the major bottleneck of ICP. Considering additional point attributes, e.g., color and normal information, can increase both ICP's efficiency and robustness. Spatial partitioning data structures, e.g., k-d trees, can also alleviate this problem, but their performance degrades quickly when the dimensionality of the space becomes high. Moreover, considering additional attributes usually does not guarantee monotonic convergence in ICP.

To gain efficiency, robustness, and guaranteed convergence using point colors and normals, we construct a data structure called the (n, c)-clustered kd-tree (shown in Figure 3). In this data structure, we cluster normals using a geographic coordinate system and cluster colors using hues and saturations. Then, we construct a k-d tree from the 3-D point positions in each cluster.
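One plausible sketch of this idea follows; the bin counts are our own choice, and brute-force search within a bucket stands in for the per-cluster k-d tree to keep the sketch short.

```python
import numpy as np
from collections import defaultdict

class NCClusteredIndex:
    """Sketch of the (n, c)-clustered kd-tree idea: bucket points by
    quantized normal (latitude/longitude) and hue, then search only the
    query's bucket. A real implementation would build a k-d tree per
    bucket; linear scan within the bucket is used here for brevity."""
    def __init__(self, points, normals, hues, n_bins=8, c_bins=8):
        self.n_bins, self.c_bins = n_bins, c_bins
        self.buckets = defaultdict(list)
        for p, n, h in zip(points, normals, hues):
            self.buckets[self._key(n, h)].append(np.asarray(p, float))

    def _key(self, n, h):
        lat = np.arcsin(np.clip(n[2], -1.0, 1.0))     # [-pi/2, pi/2]
        lon = np.arctan2(n[1], n[0])                  # [-pi, pi]
        return (int((lat + np.pi / 2) / np.pi * self.n_bins * 0.999),
                int((lon + np.pi) / (2 * np.pi) * self.n_bins * 0.999),
                int(h * self.c_bins * 0.999))         # hue in [0, 1)

    def nearest(self, p, n, h):
        """Closest stored point whose normal/hue falls in the same bucket
        (None if the bucket is empty). Always searching the same fixed
        bucket set is what preserves ICP's monotonic convergence."""
        cands = self.buckets.get(self._key(n, h))
        if not cands:
            return None
        p = np.asarray(p, float)
        return min(cands, key=lambda q: np.linalg.norm(q - p))
```

For example, a query point whose normal points up is matched only against stored points with similar normals, even if a point with an opposite normal is spatially closer.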

When a point m looks for its corresponding point, ICP only examines the points in m's neighborhood in the (n, c)-clustered kd-tree. By doing so, we can find better correspondences efficiently while guaranteeing that ICP always converges to a local minimum. Theorem 3.1 proves this property.



Figure 3: A (n, c)-clustered kd-tree. Points n and m only look for their corresponding points in the gray regions around n and m.

Theorem 3.1. Using a (n, c)-clustered kd-tree, ICP always converges monotonically to a local minimum.

Proof. Besl and McKay [4] have shown that ICP always converges monotonically. An important fact that makes this true is that the corresponding points of a point p are always extracted from the same point set Q. The reason why considering point normals may break monotonic convergence is that, in each iteration, p's corresponding point can be extracted from a different subset of Q. In contrast, ICP using a (n, c)-clustered kd-tree always searches for p's corresponding point in the same set of clusters, and is therefore guaranteed to converge monotonically.

3.4 Global Fitting

So far, we have only considered fitting each link to the point cloud independently. However, as mentioned earlier, considering links independently sometimes causes overlapping even when the fitting error (Eq. 1) is small. Therefore, to obtain more robust results, it is necessary to consider the entire skeleton as a whole.

Global error. The global fitting error (Eq. 3) is measured as the difference between the model PS and the current point cloud Q. Due to the size difference and ambiguous correspondences between PS and Q, it is not trivial to compute this difference directly. We need the function Fe to transform PS and Q so that the difference is easier to estimate. Our selection of Fe has in fact been mentioned before (in Section 3.1): segmentation. It is important to note that the segmentation process is a global method, in which each link competes with the other links to be associated with a point (using Eq. 4). Because Fe(PS) is already known at the beginning, we only need to compute Fe(Q). After segmenting Q, each link l has two associated point sets, Pl and Ql. To measure the difference between Pl and Ql, we use a compact shape descriptor: the third moment, µ3, of Pl and Ql. To summarize, our global error function is approximated as:

    global error(PS, Q) = Σ_{l∈S} |µ3(Pl) − µ3(Ql)| .    (5)
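Eq. 5 can be sketched as follows; the paper does not spell out its exact definition of the third moment µ3, so the per-axis third central moment used here is one plausible reading.

```python
import numpy as np

def mu3(points):
    """Per-axis third central moment of a point set — one plausible
    reading of the paper's compact shape descriptor mu_3."""
    P = np.asarray(points, float)
    return ((P - P.mean(axis=0)) ** 3).mean(axis=0)

def global_error(P_by_link, Q_by_link):
    """Eq. 5: sum over links l of |mu3(P_l) - mu3(Q_l)|, where the two
    dicts map each link to its associated point sets P_l and Q_l."""
    return sum(float(np.linalg.norm(mu3(P_by_link[l]) - mu3(Q_by_link[l])))
               for l in P_by_link)
```

The descriptor is cheap to evaluate per link, which is what allows the minimization to target only the links with intolerably large errors.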

When the error is above a user-defined threshold, we start to minimize the global error.

Minimizing global error. To minimize the global error, we iterate between the following two steps: matching Pl to Ql using ICP, and segmenting Q (i.e., computing a new Ql). Another benefit of Eq. 5 is that we can distinguish the fitting quality of each link. Therefore, we can apply this iteration only to the links with intolerably large errors.

4 Prediction Residuals

Motion estimation brings the skeleton S close to the current point cloud Q. However, for several reasons, e.g., non-rigid movements and estimation errors, our model PS may not match Q exactly. We call the difference between PS and Q the "prediction residuals" (or simply residuals). Because PS and Q are close to each other, we expect the residuals to be small. In this section, we present a method to compute the prediction residuals.

Figure 4: A torso link.

Similar to Eq. 5, we compute the prediction residuals for each link by finding the difference between Pl and Ql. However, this time the difference is measured by re-sampling the points. More precisely, we project both Pl and Ql onto a regular 2-D grid embedded in a cylindrical coordinate system defined by the link l. Because Pl and Ql are now encoded in regular grids, we can easily compute their difference, which can be compressed using image compression techniques. Figure 5 illustrates the prediction residual computed from the torso link in Figure 4. Because this projection is invariant under rigid-body transforms, we only need to re-sample Ql at each time step.


Figure 5: (a) Color and depth maps at time t−1 of the torso link in Figure 4. The Y-axis of the maps is parallel to the link. (b) Color and depth maps at time t of the same torso link. (c) The differences between the maps at times t−1 and t.

We determine the size of the grid from the shape of a link l and the size of l's associated point set |Pl|. We make sure that the grid size is at least 2|Pl| using the following formulation: the width and the height of the grid are 2πRlS and LlS, resp., where Rl and Ll are the radius and the length of the link l and S = √(|Pl| / (πRlLl)). We say that a grid encodes 100% prediction residuals if the grid has size 2|Pl|. As we will see later, we can tune the grid size to produce various compression ratios and qualities.
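The re-sampling grid can be sketched as follows, for points already expressed in the link's cylindrical frame (z along the axis); the binning details are our own assumptions. With S chosen as above, the grid has 2πRlS · LlS = 2|Pl| cells, i.e., 100% residuals.

```python
import numpy as np

def cylindrical_grid(points, R_l, L_l):
    """Project points (given in the link's cylindrical frame, z along the
    axis) onto a 2-D depth grid sized by the paper's formula
    S = sqrt(|P_l| / (pi * R_l * L_l)), giving about 2*|P_l| cells.
    Binning and collision handling are illustrative assumptions."""
    P = np.asarray(points, float)
    S = np.sqrt(len(P) / (np.pi * R_l * L_l))
    W = max(1, int(round(2 * np.pi * R_l * S)))   # grid width  (angular axis)
    H = max(1, int(round(L_l * S)))               # grid height (link axis)
    depth = np.full((H, W), np.nan)               # radial distance per cell
    theta = np.arctan2(P[:, 1], P[:, 0])          # angle around the axis
    u = ((theta + np.pi) / (2 * np.pi) * W).astype(int) % W
    v = np.clip((P[:, 2] / L_l * H).astype(int), 0, H - 1)
    r = np.hypot(P[:, 0], P[:, 1])                # distance from the axis
    depth[v, u] = r                               # last point wins (sketch)
    return depth
```

Shrinking the grid (fewer cells than 2|Pl|) is what the paper's "x% residuals" knob corresponds to.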

5 Experimental Results

We use both synthetic and TI data to evaluate our method. Synthetic data is generated from a polygonal mesh animated using mocap data from Carnegie Mellon University. We study synthetic data because it provides a 'ground truth' against which to evaluate our method. We study the quality of our model-driven compression on the TI data with various levels of prediction residuals considered. We also compare our model-driven compression to H.264 video compression [1] and Yang et al.'s method [24]. All the experimental results in this section are obtained using a Pentium 4 3.2 GHz CPU with 512 MB of RAM. The best way to visualize our results is via the submitted videos.

5.1 Motion estimation of synthetic data

We study the robustness of our motion estimation using three synthetic data sets: dancing, crouch&flip, and turning. Each frame in the synthetic data contains about 10,000 points, and 75% of the points are removed randomly to eliminate easy correspondences. The table below summarizes the average motion estimation frame rate.

motion estimated | dance (Fig. 6) | crouch&flip (Fig. 7) | turn (Fig. 8) | tai-chi (in video)
avg. fps         | 39.1 fps       | 29.6 fps             | 48.4 fps      | 61.3 fps

The quality of the estimated motion is measured as the normalized distance offset e_t between joints, i.e.,

    e_t = (1 / (n·R)) Σ_{i=1}^{n} |ji^est − ji^mocap| ,

where ji^est and ji^mocap are the estimated and mocap joint positions, resp., and R is the radius of the minimum bounding ball of the point cloud, i.e., we scale the skeleton so that the entire skeleton is inside a unit sphere.
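The metric e_t can be computed as follows; approximating the minimum-bounding-ball radius R by the maximum distance from the centroid is our simplification, not the paper's procedure.

```python
import numpy as np

def joint_error(est, mocap):
    """Normalized joint offset e_t: mean distance between estimated and
    mocap joint positions, divided by the radius R of the point cloud's
    minimum bounding ball (approximated here by the maximum distance
    from the centroid, a common stand-in)."""
    est, mocap = np.asarray(est, float), np.asarray(mocap, float)
    centroid = mocap.mean(axis=0)
    R = np.linalg.norm(mocap - centroid, axis=1).max()
    return np.linalg.norm(est - mocap, axis=1).mean() / R
```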

With and without global constraints. We compare the results of the tracking algorithm with and without the articulation and global fitting constraints. Figure 6 shows that the global constraints have a significant influence on motion estimation quality.

Downsampling factor. In this experiment, we study how the point count (% of downsampling) affects the motion estimation quality. Figure 7 shows that the downsampling rate does not have a significant influence on the quality (at least down to 1% for this crouch & flip motion).

Noise level. In this experiment, we study how noise affects the quality. From Figure 8, it is clear that as we increase the noisiness of the points, the difference between the estimated motion and the "ground truth" increases. However, the overall difference remains small even for very noisy data.

5.2 Compressing TI data

Figure 6: Top: the postures from (a) motion capture, (b) estimation with global constraints, and (c) estimation without global constraints. Bottom: tracking errors over time with and without global constraints for the dance motion.

We evaluate the results of our compression method using four motion sequences captured by our TI system. These motions are performed by two dancers, one student, and one tai-chi master. The data captured by our TI system has about 8 to 10 thousand points in each frame. Because the TI data is much noisier than the synthetic data and can have significant missing data, estimating motion from the TI data is more difficult; we therefore use 50% of the initial point set as our model. The table below shows that we can still maintain at least a 10 fps interactive rate in all studied cases.

motion estimated | dancer 1 (Fig. 2) | dancer 2 (Fig. 4) | student (in video) | tai-chi master (Fig. 10)
avg. fps         | 11.9 fps          | 11.5 fps          | 12.5 fps           | 12.6 fps

Quality. Unlike the synthetic data, the TI data has no ground truth against which to evaluate quality. Instead, we directly measure the quality of our method as the difference between the point data before and after our model-driven compression. More specifically, we compute the "peak signal-to-noise ratio" (PSNR) of the images rendered from the uncompressed and compressed point data. In our experiments, two sets of images (one for each point set) are rendered from six camera views (60 degrees apart) in each frame. The results of our study are summarized in Table 1.

In Table 1, we compare the reconstruction quality by computing PSNRs w.r.t. the uncompressed data. Typical PSNR values in image compression are between 20 and 40 dB. We considered three compression levels: compression without residuals, and with 25% and 100% residuals. We see that encoding residuals indeed generates reconstructions that are better, by about 4 dB, than those without residuals. Figure 9 provides even stronger evidence that considering residuals produces better reconstructions in all frames. Another important observation is that the compression quality remains stable (around 30 dB) over the entire motion. Figure 10 shows that the difference is more visible when the prediction residuals are not considered.

Figure 7: Tracking errors with different downsampling factors (100%, 40%, 10%, and 1%) for the crouch & flip motion.

Figure 8: Tracking errors with different noise intensities (no noise and noise levels 1–4) for the turn motion.


Table 1: Compression quality (average PSNR over each sequence). The "peak signal-to-noise ratio" (PSNR) is computed between rendered images of the compressed and uncompressed point data. PSNR is defined as 20 log10(255 / rmse), where rmse = √((1/N) Σ_i |Ii − Ji|²) is the root mean squared error of two N-pixel images I and J. We also measure the quality by varying the level of residuals considered: a compression with x% residuals means its residual maps are x% of the size of the full residual maps.

motion          | dancer 1 | dancer 2 | student  | tai-chi master
no residuals    | 23.87 dB | 24.10 dB | 25.82 dB | 26.27 dB
25% residuals   | 25.77 dB | 26.18 dB | 27.96 dB | 28.75 dB
100% residuals  | 27.95 dB | 28.27 dB | 29.91 dB | 30.83 dB
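The PSNR of Table 1 can be computed as:

```python
import numpy as np

def psnr(I, J):
    """PSNR between two 8-bit images, as defined in Table 1:
    20 * log10(255 / rmse), rmse being the root-mean-squared pixel error."""
    I, J = np.asarray(I, float), np.asarray(J, float)
    rmse = np.sqrt(((I - J) ** 2).mean())
    return float(20 * np.log10(255.0 / rmse))
```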

Figure 10: Reconstructions from the compressed data and their differences from the uncompressed data. (a) Uncompressed data. (b) Compressed without residuals. (c) Difference between (a) and (b). (d) Compressed with residuals. (e) Difference between (a) and (d).

Figure 9: PSNR values from the tai-chi motion, with 100% residuals and with no residuals. Each point in the plot indicates a PSNR value.

Compression Ratio. One of the motivations of this work is that no existing TI compression method can provide high compression ratios. In this experiment, we show that our model-driven compression method can achieve 50:1 to 5000:1 compression ratios. As shown earlier, our compression method can provide different compression ratios by varying the level of residuals considered during encoding. We summarize our experimental results in Table 2.

We would like to note that, because our method is fundamentally different from the other two methods, we can achieve very high compression ratios while maintaining reasonable reconstruction quality (as shown earlier). Both Yang et al.'s method and H.264 are image (or video)-based compressors, which take color and depth images as their input and output. In contrast, our model-driven compression converts the color and depth images to motion parameters and prediction residuals. Moreover, even though H.264 provides high-quality, high-ratio compression, H.264 is not a real-time compressor for the amount of data considered in this work.

Table 2: Compression ratio. Both Yang et al.'s [24] and H.264 (we use an implementation from [1]) compression methods take the color and depth images as their input and output. The compression ratio of H.264 reported in this table is obtained using 75% of its best quality. We use JPEG and PNG libraries to compress color and depth residuals, respectively.

motion                  | dancer 1  | dancer 2  | student  | tai-chi master
size before compression | 142.82 MB | 476.07 MB | 1.14 GB  | 988.77 MB
Yang et al. [24]        | 11.36     | 11.73     | 10.23    | 14.41
H.264 [1]               | 64.04     | 53.78     | 32.04    | 49.58
no residuals            | 1490.31   | 3581.52   | 5839.59  | 5664.55
25% residuals           | 195.62    | 173.52    | 183.80   | 183.82
100% residuals          | 66.54     | 55.33     | 60.29    | 61.43

6 Discussion and Future Work

We proposed a model driven compression method to convert point set data into a few motion parameters and a set of residual maps, whose size can be tuned adaptively according to the available space and time. Our motion estimation method, ARTICP, can efficiently estimate motions from synthetic and TI data at interactive rates (10∼60 frames per second).

Despite our promising results, our method has limitations. First, we observed that our motion estimation fails when many occlusions occur. This problem becomes more serious when multiple subjects appear in the scene. Second, we assume that points move under rigid-body transforms and that non-rigid motions are comparatively small. However, this assumption becomes invalid when the subjects wear skirts or have long hair. Third, although the quality of our compression method is reasonable, its PSNR values are still low compared to current video compression standards. It is not clear at this point how our method can be improved to produce higher-quality results.

Currently, we are exploring the possibility of estimating the motions of multiple targets. Our compression method can also be extended in several ways. For example, we would like to investigate efficient ways to detect temporal coherence in our residual maps. One possible approach is to accumulate residuals over a few frames and compress them using video compression techniques.

Note that even though we focus only on the data compression application in this paper, the results of this work can also be used broadly in applications including human activity recognition and analysis [11, 10], markerless motion capture, and ergonomics.

References

[1] Adobe. QuickTime 7.0 H.264 implementation, 2006.

[2] J. K. Aggarwal and Q. Cai. Human motion analysis: a review. Comput. Vis. Image Underst., 73(3):428–440, 1999.

[3] O. Arikan. Compression of motion capture databases. ACM Trans. Graph., 25(3):890–897, 2006.

[4] P. J. Besl and N. D. McKay. A method for registration of 3-D shapes. IEEE Trans. Pattern Anal. Mach. Intell., 14(2):239–256, 1992.

[5] K. M. Cheung, S. Baker, and T. Kanade. Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2003.

[6] K. Daniilidis, J. Mulligan, R. McKendall, G. Kamberova, D. Schmid, and R. Bajcsy. Real-time 3d tele-immersion, 2000.

[7] D. Demirdjian and T. Darrell. 3-d articulated pose tracking for untethered diectic reference. In ICMI '02: Proceedings of the 4th IEEE International Conference on Multimodal Interfaces, page 267, Washington, DC, USA, 2002. IEEE Computer Society.

[8] G. Dewaele, F. Devernay, and R. Horaud. Hand motion from 3d point trajectories and a smooth surface model. In ECCV (1), pages 495–507, 2004.



[9] D. M. Gavrila. The visual analysis of human movement: a survey. Comput. Vis. Image Underst., 73(1):82–98, 1999.

[10] G. Guerra-Filho and Y. Aloimonos. Human activity language: Grounding concepts with a linguistic framework. In Proc. of the 1st International Conference on Semantics and Digital Media Technology (SAMT), 2006.

[11] G. Guerra-Filho, C. Fermuller, and Y. Aloimonos. Discovering a language for human activity. In Proc. of the AAAI 2005 Fall Symposium on Anticipatory Cognitive Embodied Systems, 2005.

[12] S. Gumhold, Z. Karni, M. Isenburg, and H.-P. Seidel. Predictive point-cloud compression. In Siggraph 2005 Sketches, 2005.

[13] L. Ibarria and J. Rossignac. Dynapack: space-time compression of the 3d animations of triangle meshes with fixed connectivity. In SCA '03: Proceedings of the 2003 ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 126–135, Aire-la-Ville, Switzerland, 2003. Eurographics Association.

[14] Z. Karni and C. Gotsman. Compression of soft-body animation sequences. Computers & Graphics,28(1):25–34, 2004.

[15] S.-U. Kum and K. Mayer-Patel. Real-time multidepth stream compression. ACM Trans. Multimedia Comput. Commun. Appl., 1(2):128–150, 2005.

[16] J. E. Lengyel. Compression of time-dependent geometry. In SI3D '99: Proceedings of the 1999 symposium on Interactive 3D graphics, pages 89–95, New York, NY, USA, 1999. ACM Press.

[17] M. Alexa and W. Müller. Representing animations by principal components. Computer Graphics Forum, 19(3):411–418, 2000.

[18] L.-P. Morency and T. Darrell. Stereo tracking using icp and normal flow constraint. In Proceedings of International Conference on Pattern Recognition, 2002.

[19] T. Ochotta and D. Saupe. Compression of point-based 3d models by shape-adaptive wavelet coding of multi-height fields. In Symposium on Point-Based Graphics, pages 103–112, 2004.

[20] S. Rusinkiewicz and M. Levoy. Efficient variants of the icp algorithm. In Proceedings of the Third International Conference on 3-D Digital Imaging and Modeling (3DIM), pages 145–152, 2001.

[21] S. Knoop, S. Vacek, and R. Dillmann. Sensor fusion for 3d human body tracking with an articulated 3d body model. In Proceedings of the IEEE International Conference on Robotics and Automation, Walt Disney Resort, Orlando, Florida, May 15 2006.

[22] M. Sattler, R. Sarlette, and R. Klein. Simple and efficient compression of animation sequences. In SCA '05: Proceedings of the 2005 ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 209–217, New York, NY, USA, 2005. ACM Press.

[23] D. Simon, M. Hebert, and T. Kanade. Real-time 3-d pose estimation using a high-speed range sensor. In Proceedings of IEEE International Conference on Robotics and Automation (ICRA '94), volume 3, pages 2235–2241, May 1994.

[24] Z. Yang, Y. Cui, Z. Anwar, R. Bocchino, N. Kiyanclar, K. Nahrstedt, R. H. Campbell, and W. Yurcik. Real-time 3d video compression for tele-immersive environments. In Proc. of SPIE/ACM Multimedia Computing and Networking (MMCN'06), San Jose, CA, 2006.

[25] Z. Yang, K. Nahrstedt, Y. Cui, B. Yu, J. Liang, S.-H. Jung, and R. Bajcsy. Teeve: The next generation architecture for tele-immersive environment. In ISM '05: Proceedings of the Seventh IEEE International Symposium on Multimedia, pages 112–119, Washington, DC, USA, 2005. IEEE Computer Society.

[26] J. Zhang and C. B. Owen. Octree-based animated geometry compression. In DCC '04: Proceedings of the Conference on Data Compression, page 508, Washington, DC, USA, 2004. IEEE Computer Society.


