2012 3DTV-Conference: The True Vision - Capture, Transmission and Display of 3D Video (3DTV-CON 2012), Zurich, Switzerland, October 15-17, 2012

LIVE CAPTURE, RECTIFICATION, AND STREAMING OF STEREOSCOPIC INTERNET VIDEO FOR CASUAL USERS

M. Scott Bishop*, Hyojin Kim†, Viswanathan Swaminathan‡

Adobe, San Jose, CA, USA

University of California, Davis, Davis, CA, USA

Adobe, San Jose, CA, USA

ABSTRACT

We investigate stereoscopic 3D (S3D) video using H.264/AVC supplemental enhancement information messages and real-time rectification, and show the objective quality metrics PSNR and MSSIM for depth adjusted and rectified common test sequences converted to side-by-side frame compatible video. We use common computer vision techniques to estimate scene disparity and apply texture warping to vertically align images. The S3D system can capture live video from either two commodity webcams or an inexpensive stereo webcam. Two video streams are captured, rectified, encoded as H.264 video, and transmitted to a media server. Clients connect to the media server to receive the H.264 encoded S3D video. A client internet browser plugin decodes the video as either 2D or, if the client has GPU rendering hardware, S3D.

Index Terms - Stereo video, H.264/AVC SEI messages, 3D video coding, streaming media.

1. INTRODUCTION

Stereoscopic 3D (S3D) video is popular in commercial movies, home 3D TV systems, and PC gaming, but real-time S3D streaming for video chat applications and real-time S3D internet video production is not as common. NVIDIA's® S3D JavaScript library and Microsoft's® Silverlight web browser plug-in [1] provide online S3D video viewing, but application support for end-to-end S3D construction remains challenging.

Real-time S3D video encoding and streaming using professional S3D video camera systems are available for broadcast television, but these systems are generally too expensive for most casual home users. This paper introduces an end-to-end live streaming S3D video system that uses inexpensive web cameras to produce visually appealing real-time live S3D internet video that can be viewed on 3D-ready LCD displays and DLP projectors.

To produce the S3D effect for users, S3D systems can take advantage of several S3D hardware rendering devices: active shutter glasses, passive stereo glasses, or autostereoscopic displays. Images are offset to a viewer's left and right eyes to mimic the visual disparity between the two eyes. The differences between the left and right images are fused by the human visual system (HVS) to produce the desired S3D effect. Challenges to producing real-time S3D include coding the offset (disparity) between the left and right images prior to viewing and providing quality video at real-time frame rates. Proper image disparity and video quality enable the HVS to fuse the two images into a pleasing stereo effect.

*UC Davis computer science Ph.D. student; Adobe intern.

†UC Davis computer science Ph.D. student; work performed at Adobe.

‡Principal Scientist, Adobe.

We provide a system to produce S3D video using two single-head webcams or a single dual-head webcam, in anaglyph or active stereo. To produce the S3D video, we use H.264/AVC encoding, the Real-Time Messaging Protocol (RTMP) [2] for streaming, and the Adobe Flash Media Server® (FMS) to serve video to clients.

The main contributions of our work include:

• a system to capture two live 2D video streams, then rectify, depth adjust, encode, and publish S3D internet video using inexpensive hardware; and

• video quality results using rectified and unrectified common multi-view sequences [3] streamed as S3D video.

2. RELATED WORK

We review S3D and multi-view end-to-end streaming video systems, epipolar geometry estimation methods used for rectification, and common video quality metrics.

2.1. End-to-End S3D

Kiraly et al. [4] present a two-camera system that uses a single stream to encode S3D video. The video is transmitted as a multiplexed NTSC (interleaved L/R) signal and digitized on the client. To encode the stereo frame pair, left and right images are sent on alternating frames, at the cost of temporal resolution. Aksay et al. [5] use a dual-head camera to encode video offline in an S3D system that streams video from a server. They use a dedicated two-head camera for video capture and transmit a monoscopic and an enhanced view with a content-adaptive stereo codec based on H.264.

Multi-view systems include Zhou et al. [6], who present a transport strategy to stream 8 views over two IP channels with an error concealment scheme to resolve dropped packets on the client. Lou et al. [7] describe a 32-camera system to stream multi-view video at approximately 2 Mbps with a latency of approximately 0.2 to 0.8 seconds. A 16-camera system created by Matusik and Pfister [8] captures, encodes, and simulates broadcast transport using a local area gigabit Ethernet network to simultaneously transmit individual views as MPEG-2 streams. The system transmits 14.4 Gb/sec of data and would be challenging to implement as a web-based system with real-time frame rates where bandwidth is limited. Jarusirisawad and Saito [9] create a free-viewpoint S3D content generation system from four uncalibrated cameras using epipolar geometry. They segment the background from moving objects and reconstruct moving targets by silhouette volume intersection.

Page 2: [IEEE 2012 3DTV-Conference: The True Vision - Capture, Transmission and Display of 3D Video (3DTV-CON 2012) - Zurich, Switzerland (2012.10.15-2012.10.17)] 2012 3DTV-Conference: The

2.2. Epipolar Geometry Estimation for Rectification

Algorithms to compute the fundamental matrix F used in rectifying images can be found in Hartley and Zisserman [10], including the 7-point, normalized 8-point, and gold standard (MLE) methods. To improve the quality of F, fitness functions can be used to remove outliers, including random sample consensus (RANSAC) [11], least median of squares (LMedS) [12], and least trimmed squares (LTS) [13]. We use the OpenCV [14] implementation of RANSAC to remove point correspondence outliers when computing F. Given F, there are rectification algorithms such as [15], but for real-time S3D video we use a simplified OpenGL-based warping rectification scheme.

2.3. Objective Quality Metrics

The quality of stereoscopic video can be measured by subjective or objective means. Hewage et al. [16] report that depth in S3D video is highly correlated with video quality as measured with peak signal-to-noise ratio (PSNR) and structural similarity (SSIM). We use PSNR and the mean SSIM (MSSIM) to test our system's S3D video quality for rectified and unrectified video.

3. S3D VIDEO CONSTRUCTION

We encode the S3D video using the two-view frame packing arrangement supplemental enhancement information (SEI) message [17] and use the SEI message to signal when an S3D video is present on the client. This frame packing is also referred to as frame compatible S3D video. The SEI message is inserted into the stream prior to encoding and is parsed by compatible H.264 decoders. We use a side-by-side frame arrangement with a frame_packing_arrangement_type value of 3. Frames can also be arranged as top-bottom, checkerboard, etc., as described in Table D-8 of the H.264 specification [17].
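As an illustration of what this signaling carries, the key syntax elements of the frame packing arrangement SEI message can be sketched as a C++ struct. This is a simplified subset of the fields defined in the H.264 specification [17], written here for exposition; it is not a serializable bitstream representation.

```cpp
// Simplified subset of the H.264 frame packing arrangement SEI
// syntax elements [17]; field names follow the specification, but
// this struct is illustrative only, not bitstream-accurate.
struct FramePackingArrangementSEI {
    unsigned frame_packing_arrangement_id;
    bool     frame_packing_arrangement_cancel_flag; // true cancels a prior message
    unsigned frame_packing_arrangement_type;        // Table D-8: 3 = side-by-side, 4 = top-bottom
    bool     quincunx_sampling_flag;                // checkerboard-style subsampling
    unsigned content_interpretation_type;           // 1 = constituent frame 0 is the left view
};

// Side-by-side packing as used in this system (type 3).
const FramePackingArrangementSEI kSideBySideSEI = {0, false, 3, false, 1};
```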

Figure 1: Single SBS packed video frame sent to the encoder (each view at half or full width).

For backward compatibility with existing video systems, frame compatible S3D video resolution is reduced by half of the width for side-by-side packing. An illustration of a single side-by-side S3D frame is shown in Figure 1.
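As a concrete sketch of the packing step (our illustration, assuming OpenCV for image handling, not the authors' code), each view can be squeezed to half width and concatenated so the packed frame keeps the original dimensions:

```cpp
#include <opencv2/opencv.hpp>

// Pack a stereo pair into one side-by-side (SBS) frame compatible
// image: each view is resized to half width so the packed frame
// retains the original frame dimensions.
cv::Mat packSideBySide(const cv::Mat& left, const cv::Mat& right) {
    CV_Assert(left.size() == right.size() && left.type() == right.type());
    cv::Mat l2, r2, sbs;
    cv::resize(left,  l2, cv::Size(left.cols / 2,  left.rows));
    cv::resize(right, r2, cv::Size(right.cols / 2, right.rows));
    cv::hconcat(l2, r2, sbs);   // left view occupies the left half
    return sbs;
}
```

(The experiments in Section 5 instead keep both views at full width, yielding a 1280 x 480 packed frame from two 640 x 480 views.)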

Our system can generate S3D video from dual-head webcams (Sony Bloggie 3D®, Minoru® webcam, etc.) or two conventional single-head webcams. We first capture left and right videos, then perform image rectification, depth adjustment, SEI message insertion, encoding, and finally transmission of the S3D video stream (Figure 2). Rendering defaults to 2D video if client support for S3D video is unavailable. Without pre-processing, specifically rectification and depth adjustment, two single-head cameras and even some low-cost dual-head mount cameras produce images that are not immediately suitable for S3D due to vertical misalignment that can interfere with the stereo effect.

3.1. S3D Video Rendering in Web Browser

Figure 2: S3D video encoder pipeline, from real-time video capture to publishing.

For clients without active stereo support, we allow anaglyph S3D video to be delivered either as a rectified, blended left-and-right frame assembled on the server (single-frame video), or as left and right frame compatible video that is assembled on the client. An advantage of frame compatible anaglyph over blended is that the left frames can be rendered as 2D video. Active S3D video is supported on the client using a browser plugin, graphics hardware, and NVIDIA's® active shutter 3D Vision system [18].
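For the blended path, an anaglyph frame can be composed by mixing color channels from the two rectified views. The red-cyan channel assignment below is a common convention and our own assumption; the paper does not specify its blending scheme.

```cpp
#include <opencv2/opencv.hpp>

// Compose a red-cyan anaglyph from rectified left/right BGR frames:
// red from the left view, blue and green from the right view, so
// colored glasses route each view to the corresponding eye.
cv::Mat makeAnaglyph(const cv::Mat& left, const cv::Mat& right) {
    std::vector<cv::Mat> l, r;
    cv::split(left, l);    // l[0] = B, l[1] = G, l[2] = R (OpenCV stores BGR)
    cv::split(right, r);
    std::vector<cv::Mat> mixed = { r[0], r[1], l[2] };
    cv::Mat anaglyph;
    cv::merge(mixed, anaglyph);
    return anaglyph;
}
```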

4. METHODS

Our S3D system processes video prior to encoding. The real-time pre-processing includes feature detection, correspondence identification, optical flow tracking, epipolar geometry construction, and image rectification using OpenGL texture warping.

4.1. S3D Correspondence and Rectification

We use the Harris corner detector [19] to find features in an image I(x, y) around a patch of pixels (x_i, y_j) with a weighting function w_{i,j} [20]. The Harris detector finds corners by thresholding the eigenvalues [21, 22] of the matrix G built from the components of the image gradient (I_x, I_y), where $I_x = \partial I / \partial x$ and $I_y = \partial I / \partial y$:

$$G = \sum_{i,j} w_{i,j} \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}$$
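Since the system builds on OpenCV [14], the detection stage might look like the sketch below; the parameter values are illustrative choices of ours, not the authors' settings.

```cpp
#include <opencv2/opencv.hpp>

// Detect Harris corners in the left frame as trackable features.
std::vector<cv::Point2f> findCorners(const cv::Mat& leftGray) {
    std::vector<cv::Point2f> corners;
    cv::goodFeaturesToTrack(leftGray, corners,
                            500,      // max corners (illustrative)
                            0.01,     // eigenvalue quality threshold
                            8.0,      // min distance between corners (px)
                            cv::noArray(),
                            3,        // gradient window (block) size
                            true,     // use the Harris response
                            0.04);    // Harris free parameter k
    return corners;
}
```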

Given a set of matching features, we compute the image disparity and track features across the left and right video frames using the Lucas-Kanade (LK) based optical flow algorithm [23]. LK assumes brightness constancy, temporal consistency, and spatial consistency with respect to pixels in the images [20]. For each corner feature in the left image l, we compute the corresponding feature in the right image r. The LK method tracks features as motion vectors (u, v) computed using the partial derivatives of image I w.r.t. pixel (x, y) and time t (represented in our system by the stereo pair frames (l, r)), subject to the brightness constancy constraint:

$$I_x u + I_y v = -I_t$$
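The matching stage maps onto OpenCV's pyramidal LK tracker; again a sketch under the same illustrative assumptions:

```cpp
#include <opencv2/opencv.hpp>

// Track left-image corners into the right image with pyramidal
// Lucas-Kanade optical flow; 'status' marks successful matches.
void matchFeatures(const cv::Mat& leftGray, const cv::Mat& rightGray,
                   const std::vector<cv::Point2f>& leftPts,
                   std::vector<cv::Point2f>& rightPts,
                   std::vector<uchar>& status) {
    std::vector<float> err;
    cv::calcOpticalFlowPyrLK(leftGray, rightGray, leftPts, rightPts,
                             status, err);
}
```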

The maximum displacement over all feature points matched using optical flow is used as the global disparity shift dx. Applying dx to each image produces image disparity between the pixels in the left and right frames, which allows the HVS to perceive depth. We found that using the maximum offset (max dx) between all corresponding features, similar to Pritch et al. [24], resulted in an S3D effect that was visually more appealing than other values such as the mean offset (mu dx).
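Deriving the shift from the tracked pairs is then a one-pass reduction; the helper below is our sketch of that step.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>
#include <opencv2/opencv.hpp>

// Global disparity shift dx: the maximum horizontal displacement
// over all successfully tracked feature pairs (Section 4.1).
float globalDisparityShift(const std::vector<cv::Point2f>& leftPts,
                           const std::vector<cv::Point2f>& rightPts,
                           const std::vector<uchar>& status) {
    float dx = 0.0f;
    for (size_t i = 0; i < leftPts.size(); ++i)
        if (status[i])
            dx = std::max(dx, std::abs(rightPts[i].x - leftPts[i].x));
    return dx;
}
```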

We compute the 3 x 3 fundamental matrix F between the left and right video frames using RANSAC. F is the mathematical representation of the epipolar geometry (Figure 3) between point correspondences in two images [10]. For point correspondences p_l and p_r, F satisfies the equation $p_r^T F p_l = 0$. We compute F and rectify images only if the difference between the y pixel values of the left and right point correspondences is greater than a threshold epsilon (epsilon = 3 in our system).

Figure 3: Epipolar geometry for images I_{l,r} with camera centers C_{l,r}, epipoles e_{l,r}, and scene point p.

The epipoles e_l, e_r are the points defined by the intersection of the line connecting the left and right camera centers C_{l,r} with the image planes I_{l,r}. To test the stability of the fundamental matrix, we use e_l and e_r to verify the null-space conditions $e_r^T F = 0$ and $F e_l = 0$ [10] (see line 12 of Algorithm 1).
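This stage maps naturally onto cv::findFundamentalMat, with the epipoles recovered as the null vectors of F and F^T via SVD. The sketch below is our reading of the pipeline, with illustrative thresholds, not the authors' code.

```cpp
#include <opencv2/opencv.hpp>

// Estimate F with RANSAC and check the null-space (stability)
// conditions F*el ~ 0 and er^T*F ~ 0 within a small tolerance.
bool stableFundamentalMatrix(const std::vector<cv::Point2f>& leftPts,
                             const std::vector<cv::Point2f>& rightPts,
                             cv::Mat& F) {
    F = cv::findFundamentalMat(leftPts, rightPts, cv::FM_RANSAC,
                               3.0,    // RANSAC reprojection threshold (px)
                               0.99);  // confidence
    if (F.empty() || F.rows != 3) return false;

    // Epipoles: el is the right null vector of F, er that of F^T.
    cv::SVD svd(F, cv::SVD::FULL_UV);
    cv::Mat el = svd.vt.row(2).t();   // null space of F
    cv::Mat er = svd.u.col(2);        // null space of F^T
    double residual = cv::norm(F * el) + cv::norm(F.t() * er);
    return residual < 1e-6;           // illustrative tolerance
}
```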

The process of rectification removes vertical disparity and constrains feature searches in subsequent video frames to horizontal rows in each frame rather than the entire image space. We rectify each video frame using an OpenGL texture warping function and create image disparity between the left and right frames by shifting the right video frames by dx. Our S3D end-to-end system uses the steps shown in Algorithm 1.
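The authors warp with a simplified OpenGL texture warping scheme; as a CPU-side analogue (our substitution, not the paper's method), OpenCV's uncalibrated rectification derives per-view homographies directly from F and the correspondences, and the horizontal shift by dx can be folded into the right view's warp:

```cpp
#include <opencv2/opencv.hpp>

// CPU-side analogue of the warping step: compute rectifying
// homographies from F (Hartley-style uncalibrated rectification),
// warp both views, then shift the right view horizontally by dx.
void rectifyPair(const cv::Mat& left, const cv::Mat& right,
                 const std::vector<cv::Point2f>& leftPts,
                 const std::vector<cv::Point2f>& rightPts,
                 const cv::Mat& F, float dx,
                 cv::Mat& leftRect, cv::Mat& rightRect) {
    cv::Mat H1, H2;
    cv::stereoRectifyUncalibrated(leftPts, rightPts, F, left.size(), H1, H2);
    cv::warpPerspective(left, leftRect, H1, left.size());

    // Fold the disparity shift into the right view's homography:
    // a pure horizontal translation by dx applied after H2.
    cv::Mat T = (cv::Mat_<double>(3, 3) << 1, 0, dx, 0, 1, 0, 0, 0, 1);
    cv::warpPerspective(right, rightRect, T * H2, right.size());
}
```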

4.2. S3D Streaming

Stereo video can be transported using several protocols, as discussed by Gürler et al. [25]. Our active stereo application publishes a single RTMP [2] S3D video stream to FMS from two videos. Our system makes no changes to encoding standards and works with existing video streaming protocols. A client browser application connects to FMS, receives the bitstream, decodes, and then renders the video as either 2D or S3D video.

4.3. Objective Video Quality Metrics

PSNR is calculated as $PSNR = 10 \log_{10}\left(\frac{max^2}{MSE}\right)$, where MSE is the mean square error between two images and max is the largest pixel value. A high PSNR value indicates higher quality video. We compute the SSIM as described by Wang et al. [26]. For grey scale images I_1 and I_2, SSIM is computed using the mean intensities $\mu_{I_1}, \mu_{I_2}$, constants $C_1, C_2$ for near-zero terms, the standard deviations $\sigma_{I_1}, \sigma_{I_2}$, variances $\sigma_{I_1}^2, \sigma_{I_2}^2$, and zero-mean cross correlation $\sigma_{I_1 I_2}$ as:

$$SSIM(I_1, I_2) = \frac{(2\mu_{I_1}\mu_{I_2} + C_1)(2\sigma_{I_1 I_2} + C_2)}{(\mu_{I_1}^2 + \mu_{I_2}^2 + C_1)(\sigma_{I_1}^2 + \sigma_{I_2}^2 + C_2)}$$

For images $(I_1, I_2)$, $MSSIM = \frac{1}{M}\sum_{j=1}^{M} SSIM(I_{1j}, I_{2j})$, where M is the number of local windows and $I_{1j}, I_{2j}$ are the pixel values at the j-th local window in each image [26]. The MSSIM values we report are between 0 and 1, with higher being better.
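For reference, the PSNR computation reduces to a few lines. This sketch, ours rather than the authors' implementation, assumes 8-bit single-channel frames (max = 255):

```cpp
#include <cmath>
#include <limits>
#include <opencv2/opencv.hpp>

// PSNR between a reference frame and its encoded/decoded version,
// assuming 8-bit single-channel pixels: 10 * log10(max^2 / MSE).
double psnr(const cv::Mat& ref, const cv::Mat& enc) {
    cv::Mat diff;
    cv::absdiff(ref, enc, diff);
    diff.convertTo(diff, CV_64F);
    diff = diff.mul(diff);                  // squared error per pixel
    double mse = cv::mean(diff).val[0];     // mean over the channel
    if (mse == 0.0) return std::numeric_limits<double>::infinity();
    return 10.0 * std::log10(255.0 * 255.0 / mse);
}
```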

5. EXPERIMENTAL RESULTS

The system was developed in C++ using a 2.8 GHz Intel Core 2 Duo i7 processor with 4 GB of RAM. Results were obtained using the International Telecommunication Union (ITU) standard multi-view video sequences [3]: Akko&Kayo views 26 and 27, Crowd views 0 and 1, Race1 views 0 and 1, and Objects2 views 1 and 2 (shown in Table 1). The video test sequences have a resolution of 640 x 480, with a combined SBS resolution of 1280 x 480. We used a commercial encoder to create the H.264 bitstream. We input the video test sequences from disk when running the experiments reported in Figures 4(a) and 4(b).

Algorithm 1: Pseudocode for S3D construction.

 1: S3D_fr <- S3D frame to be encoded
 2: S_l, S_r <- {}                    // features_left and features_right
 3: e_{l,r} <- {}                     // left and right epipoles
 4: F <- zeros(3, 3)                  // fundamental matrix
 5: while not EOF or end_video_capture() do
 6:     I_l <- left_frame; I_r <- right_frame
 7:     repeat
 8:         S_l = f_0 ... f_n <- find_corners(I_l)       // Harris
 9:         S_r <- match_features(S_l, I_r)              // LK optical flow
10:         F <- compute_fundamental_matrix(S_l, S_r)    // RANSAC
11:         e_{l,r} <- get_epipoles()
12:     until F e_l = 0 for epsilon ~ 0
13:     I'_l, I'_r <- rectify(F, I_l, I_r)
14:     S3D_fr <- assemble_frame(I'_l, I'_r)
15:     insert_SEI_message(S3D_fr)
16:     stream_h264 <- encode(S3D_fr)
17:     publish_RTMP(stream_h264)
18: end while

To compute PSNR and MSSIM, each sequence was loaded into our system and assembled as a 1280 x 480 SBS video. Prior to H.264 encoding, a reference video (refvid.yuv) in YUV format was stored to disk for use in computing the objective quality metrics. After H.264 encoding, a second video (encvid.H264) was stored to disk for quality comparison. For each video test sequence pair created, the frames were either unrectified and not depth adjusted (Table 1, row 1) or rectified and depth adjusted (Table 1, row 2). Using FFMPEG, the refvid and encvid were converted to yuv420 for computing PSNR using our implementation, and to AVI MPEG-4 format for computing MSSIM, implemented from [14].

The experiments demonstrate that rectification and depth adjustment decrease FPS on average, across all videos, by 18.5% for SBS video. While frame rates for reading video from file are low using the common video test sequences, we note that FPS during previous live-capture experiments, obtained while running a non-local RTMP server, averaged approximately 18 fps and 14 fps for unrectified and rectified video, respectively. PSNR values on average for all videos are approximately 3.2 dB, or 9.4%, higher for the rectified and depth adjusted videos. There is only a negligible difference in the average MSSIM values for all sequences. The higher PSNR values are perhaps due in part to smoothing, or to the loss of edge pixels caused by warping the frames: the black edge pixels return a zero error, skewing the results in favor of the rectified versions. We conclude that our S3D video construction is encoder friendly and does not significantly degrade S3D video quality based on the PSNR and MSSIM values.

6. SUMMARY

We introduced an end-to-end real-time capture S3D video system and reported performance results using four common test video sequences that were either unrectified, or rectified and depth adjusted, prior to streaming. We use frame compatible and composite formats for creating active and anaglyph stereo video with server-side controls.


Figure 4: Comparison of PSNR and MSSIM for four common test sequences, unrectified versus rectified S3D video. (a) Average yuvPSNR results, in dB. (b) 3-channel average MSSIM results. The x-axis in both charts lists the ITU common test condition video sequences: SEQ1-Akko&Kayo, SEQ2-Crowd, SEQ3-Race, and SEQ4-Objects.

Table 1: SBS unrectified (row 1) and rectified (row 2) test video sequences overlaid with epipolar lines. Rectified right frame shifted horizontally by dx.

Since we are most interested in the casual user's S3D viewing experience, a further analysis of the accuracy of our uncalibrated S3D system may prove useful in producing a more natural stereo effect. We can quantify, and perhaps improve, the quality of the estimated epipolar geometry and rectification by using the reprojection errors obtained from the Sampson distance [10], or by using the pixel distance between a known feature point and its reprojected point to assess errors introduced in rectification. Future work will also focus on improving the performance of the system by experimenting with MVC, disparity compression, or HEVC; improving the visual quality of the S3D video using disparity warping techniques [27]; and performing subjective stereo video quality assessments [28].

7. REFERENCES

[1] NVIDIA, "NVIDIA 3D Vision streaming support for HTML5 and Silverlight," whitepaper, May 2011.

[2] Adobe, "Real-Time Messaging Protocol," online technical memo.

[3] ITU-T, "JVT of ISO/IEC MPEG & ITU-T VCEG: common test conditions for multiview video coding," Jul 2006.

[4] Z. Kiraly, G. S. Springer, and J. Van Dam, "Stereoscopic vision system," SPIE Optical Engineering, vol. 45, 2006.

[5] A. Aksay, S. Pehlivan, E. Kurutepe, C. Bilen, T. Ozcelebi, G. B. Akar, M. R. Civanlar, and A. M. Tekalp, "End-to-end stereoscopic video streaming with content-adaptive rate and format control," Image Communication, vol. 22, no. 2, pp. 157-168, Feb 2007.

[6] Y. Zhou, C. Hou, Z. Jin, L. Yang, and J. Guo, "Real-time transmission of high-resolution multi-view stereo video over IP networks," in 3DTV Conference, May 2009, pp. 1-4.

[7] J. Lou, H. Cai, and J. Li, "A real-time interactive multi-view video system," in Proc. of the 13th Annual ACM Int. Conf. on Multimedia, New York, NY, USA, 2005, pp. 161-170, ACM.

[8] W. Matusik and H. Pfister, "3D TV: a scalable system for real-time acquisition, transmission, and autostereoscopic display of dynamic scenes," ACM TOG, vol. 23, no. 3, pp. 814-824, Aug 2004.

[9] S. Jarusirisawad and H. Saito, "3DTV view generation using uncalibrated cameras," in 3DTV Conference, May 2008, pp. 57-60.

[10] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2004.

[11] M. A. Fischler and R. C. Bolles, "RANSAC: a paradigm for model fitting with applications to image analysis and automated cartography," Commun. ACM, vol. 24, pp. 381-395, Jun 1981.

[12] P. J. Rousseeuw, "Least median of squares regression," Journal of the American Statistical Association, vol. 79, no. 388, 1984.

[13] P. J. Rousseeuw and K. Driessen, "Computing LTS regression for large data sets," Data Min. Knowl. Discov., vol. 12, no. 1, pp. 29-45, Jan 2006.

[14] G. Bradski, "The OpenCV library," Dr. Dobb's Journal of Software Tools, 2000.

[15] M. Pollefeys, R. Koch, and L. J. Van Gool, "A simple and efficient rectification method for general motion," in ICCV, 1999, pp. 496-501.

[16] C. T. E. R. Hewage, S. T. Worrall, S. Dogan, and A. M. Kondoz, "Prediction of stereoscopic video quality using objective quality models of 2-D video," Electronics Letters, vol. 44, no. 16, Jul 2008.

[17] ITU-T, "H.264 Series H: Advanced video coding for generic audiovisual services," Mar 2010.

[18] NVIDIA, "NVIDIA 3D Vision system," nvidia.com.

[19] C. Harris and M. Stephens, "A combined corner and edge detector," in Proc. of the Fourth Alvey Vision Conf., 1988, pp. 147-151.

[20] G. Bradski and A. Kaehler, Learning OpenCV: Computer Vision with the OpenCV Library, O'Reilly, 2008.

[21] Y. Ma, S. Soatto, J. Kosecka, and S. Sastry, An Invitation to 3-D Vision, Springer-Verlag, 2004.

[22] J. Shi and C. Tomasi, "Good features to track," in Proc. CVPR '94, Jun 1994.

[23] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proc. of the 7th Int. Joint Conf. on AI, 1981, vol. 2, pp. 674-679, Morgan Kaufmann.

[24] Y. Pritch, M. Ben-Ezra, and S. Peleg, "Automatic disparity control in stereo panoramas (OmniStereo)," in Proc. of the IEEE Workshop on Omnidirectional Vision, Washington, DC, USA, 2000, IEEE Computer Society.

[25] C. G. Gürler, B. Görkemli, G. Saygili, and A. M. Tekalp, "Flexible transport of 3D video over networks," Proc. of the IEEE, vol. 99, no. 4, pp. 694-707, Apr 2011.

[26] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Trans. on Image Processing, vol. 13, no. 4, pp. 600-612, Apr 2004.

[27] M. Lang, A. Hornung, O. Wang, S. Poulakos, A. Smolic, and M. Gross, "Nonlinear disparity mapping for stereoscopic 3D," ACM TOG, vol. 29, pp. 75:1-75:10, Jul 2010.

[28] ITU-R BT.500-11, "Methodology for the subjective assessment of the quality of television pictures," 2002.

