
Multimedia Tools and Applications manuscript No. (will be inserted by the editor)

Interactive multi-frame reconstruction for mobile devices

Miguel Bordallo López · Jari Hannuksela · Olli Silvén · Markku Vehviläinen

Received: date / Accepted: date

Abstract The small size of handheld devices, their video capabilities and multiple cameras are under-exploited assets. Properly combined, these features can be used for creating novel applications that are ideal for pocket-sized devices, but may not be useful in laptop computers, such as interactively capturing and analyzing images on the fly. In this paper we consider building mosaic images of printed documents and natural scenes from low resolution video frames. High interactivity is provided by giving real-time feedback on the video quality, while simultaneously guiding the user's actions. In our contribution, we analyze and compare means to reach interactivity and performance with sensor signal processing and GPU assistance. The viability of the concept is demonstrated on a mobile phone. The achieved usability benefits suggest that combining interactive imaging and energy efficient high performance computing could enable new mobile applications and user interactions.

Keywords multi-frame reconstruction · mobile device · mobile interactivity

1 Introduction

Mobile communication devices are becoming attractive platforms for multimedia applications as their display and imaging capabilities are improving together with the computational resources. Many of the devices have increasingly been equipped with built-in cameras that allow the users to capture high resolution still images as well as lower resolution video frames.

The capabilities of mobile phones in portable imaging applications are on par with or exceed those of laptop computers, despite the order of magnitude disparity between the computing power budgets. Table 1 points out the versatility of the hardware in handhelds in comparison to laptops. The size and semi-dedicated interfaces of handheld devices are significant benefits over platforms based on general purpose personal computer technology,

M. Bordallo · J. Hannuksela · O. Silvén
Center for Machine Vision Research, University of Oulu, 90570 Oulu, Finland
Tel.: +358-458-679699
E-mail: [email protected]

M. Vehviläinen
Nokia Research Center, Tampere, Finland


despite their apparent versatility. On the other hand, even the most recent mobile communication devices have not used their multimedia and computing resources in a novel manner, but are merely replicating the functionalities already provided by other portable devices, such as digital still and video cameras. Also, the popularity of laptop PCs and portable video players as a means to access multimedia content via WiFi or 3G networks has clearly influenced the handheld application designs.

Consequently, most handhelds rely on keypad and pointer user interfaces, while their applications use content provided via the Internet to supplement locally stored music, movies and maps. The users can also create images and video content and stream it to the network for redistribution.

As more and more applications are being crammed into handheld devices, their limited keypads and small displays are becoming overloaded, potentially confusing the user who needs to learn how to use each individual application. Based on the personal experiences of most people, increasing the number of buttons, as with remote control units, is not the best solution from the usability point of view. As a result, dedicated applications have become popular, replacing even web browsers in accessing specific services.

The full keyboard, touchpad or mouse, and higher resolution displays of laptop PCs appear to give them clear benefits as platforms for multiple simultaneous applications. However, the small size of handheld devices and their multiple cameras are under-exploited assets. Properly combined, these characteristics can be used for novel user interfaces and applications that are ideal for handhelds, but may be considered less suitable for laptop computers.

Fig. 1 shows two mobile devices with cameras and sensors.

Fig. 1 Current mobile devices integrate several cameras, sensors, keys and displays in a very small size.

Touch-sensitive screen interaction is often viewed as a solution for interaction with mobile devices. However, it usually requires both hands and can cause additional attention overhead [7]. On the other hand, camera-based interfaces can provide single-handed operation in which the users' actions are recognized without having to interact with the screen or keypad.


Table 1 Characteristics of typical laptop computers and handheld mobile devices.

                             Laptop computer      Handheld device    Typical ratio
Still image resolution       up to 2 Mpixel       up to 12 Mpixel    0.20x
Number of displays           1                    1-2                0.5x
Number of cameras            0-1                  1-3                0.5x
Video resolution (display)   1920x1080/30Hz       1920x1080/30Hz     1x
Display size (inches)        12-15                2-4                5x (area 20x)
Processor clock (GHz)        1-3.5                0.3-1.2            3-10x
Display resolution (pixels)  1024x768-2408x1536   176x208-960x640    12x
Processor DRAM (MB)          1024-8192            64-1024            16x

In our contribution, we show how image sequences captured by the cameras of mobile phones can be used for novel, intuitive applications and user interface concepts. We also analyze the observed platform-dependent limitations and the features of future interfaces that could help in implementing vision-based solutions. The key ideas rest on the utilization of the handheld nature of the equipment and the analysis of video frames captured by the device's camera. In this context, we highlight the application development challenges and trade-offs that need to be dealt with on battery powered devices, presenting how the graphical processing units of those devices can be utilized to accelerate computer vision algorithms and improve their energy efficiency. A multi-frame reconstructor is described as an example application which can benefit from the enriched user experience. We analyze building mosaic images of printed documents and natural scenes using the mobile phone camera in a highly interactive manner. We describe an intuitive user interaction framework which utilizes quality assessment and feedback in addition to motion estimation.

The paper is organized as follows. In Section 2 we introduce related work on vision-based user interfaces on mobile devices and mobile implementations of multi-frame applications. Section 3 discusses the challenges of camera-based interactivity and its implications for the design of algorithms and applications. As a case study, an interactive real-time multi-frame reconstructor is described in Section 4. Section 5 highlights the application development challenges and trade-offs that need to be dealt with on battery powered devices and discusses desirable future platform developments for interactive applications. The use of a GPU is considered to reduce the computational load of camera-based applications. A performance evaluation of the system on a mobile device is described in Section 6. Finally, Section 7 summarizes the paper and discusses possible future directions.

2 Related work

Cellular phone cameras have traditionally been designed to replicate the functionalities of digital compact cameras. They have been included as almost stand-alone subsystems rather than as an integrated part of the device interfaces. However, some work has been carried out to incorporate the camera systems as a crucial part of vision-based mobile interactive applications.

In 2003, Siemens introduced an augmented reality game called Mozzies developed for their SX1 cell phone. This was probably the first mobile phone application utilizing the camera as a sensor. The goal of the game was to shoot down synthetic flying mosquitoes projected onto a real-time background image by moving the phone around and clicking at the right moment. While the user was executing an action, the motion of the phone was


recorded using a simple optical flow technique. Figure 2 depicts a Nokia N95 phone with a Mozzies-type application.

Fig. 2 A camera-based mosquito-killing game similar to Mozzies, the game included in the Siemens SX1 device.

Since Mozzies, the multimedia capabilities of mobile phones have advanced significantly. Mobile phones with high-resolution digital cameras are now inexpensive, widely available, and very popular. The rapid evolution of image sensors and computing hardware on mobile phones has made it attractive to apply computer vision techniques to create new user interaction methods, and a number of solutions have been proposed [7].

2.1 Vision-based mobile interactivity

Much of the previous work on vision-based user interfaces with mobile phones has utilized measured motion information directly for controlling purposes. Figure 3 depicts three previously implemented camera-based interaction methods. For instance, Mohring et al. [21] presented a tracking system for augmented reality on a mobile phone to estimate 3-D camera pose using special color-coded markers. Other marker-based methods used a hand-held target [13] or a set of squares [37] to facilitate the tracking task. A solution presented by Pears et al. [23] uses a camera on the mobile device to track markers on the computer display. This technique can compute which part of the display is viewed and determine the 6-DOF position of the camera with respect to the display. An alternative to markers is to estimate motion between successive image frames with methods similar to those commonly used in video coding. For example, Rohs [28] divided incoming frames into a fixed number of blocks and then determined the relative x, y, and rotational motion using a simple block-matching technique. Another possibility is to extract distinctive image features, such as edges and corners, which exist naturally in the scene. Haro et al. [17] have proposed a feature-based method to estimate movement direction and magnitude. Instead of using local features, some approaches extract global features such as integral projections from the image [1].
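As an illustration of this family of techniques, the following NumPy sketch estimates a global translation between two grayscale frames by exhaustive block matching with a sum-of-absolute-differences criterion; the block size and search radius are assumed values, and the rotational component estimated by Rohs [28] is omitted.

import numpy as np

def block_motion(prev, curr, block=16, radius=4):
    """Global (dx, dy) between two grayscale frames by exhaustive
    block matching, averaging the per-block best SAD shifts."""
    h, w = prev.shape
    shifts = []
    for y in range(radius, h - block - radius, block):
        for x in range(radius, w - block - radius, block):
            ref = prev[y:y + block, x:x + block].astype(np.int32)
            best, best_dxy = None, (0, 0)
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    cand = curr[y + dy:y + dy + block,
                                x + dx:x + dx + block].astype(np.int32)
                    sad = np.abs(ref - cand).sum()  # sum of absolute differences
                    if best is None or sad < best:
                        best, best_dxy = sad, (dx, dy)
            shifts.append(best_dxy)
    return tuple(np.mean(shifts, axis=0))  # average motion over all blocks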


Fig. 3 Mobile implementation of three examples of camera-based interaction: a) Color-coded markers. b) Handheld markers. c) Ego-motion image browsing.

A recent and generally interesting direction for mobile interaction is to combine information from several different sensors. In their feasibility study, Hwang et al. [18] combined forward and backward movement and rotation around the Y axis from camera-based motion tracking with tilts about the X and Z axes from a 3-axis accelerometer. In addition, a technique to couple wide-area, absolute, low resolution global data from a GPS receiver with local tracking using feature-based motion estimation was presented by DiVerdi and Hollerer [9].

2.2 Multi-frame reconstruction

The approaches described above have utilized camera motion estimation to improve user interaction on mobile devices. Mobile phones equipped with a camera can also be used for interactive computational photography or multi-frame reconstruction.

Adams et al. [1] presented an online system for building 2D panoramas. They used viewfinder images for triggering the camera whenever it is pointed at a previously uncaptured part of the scene. Ha et al. [12] also introduced an auto-shot interface to guide mosaic creation using device motion estimation. Other panorama creation applications include the work of Xiong and Pulli [38] and Wagner et al. [35]. Kim and Su [20] used a recursive method for constructing super resolution images that is applicable to mobile devices. Another super resolution technique, based on soft learning priors, can be seen in the work of Tian et al. [33].

On mobile devices, Bilcu et al. [3] proposed a technique for creating high resolution, high dynamic range images, while Gelfand et al. [11] proposed the fusion of multi-exposure images to increase the quality of the resulting images. A good survey of work on mobile multi-frame techniques can be found in the work of Pulli et al. [25].

2.3 GPU-based computer vision

Using mobile GPUs for multimedia applications and computer vision is an attractive option. The work of Kalva et al. [19] presents a good tutorial on the advantages and shortcomings of GPU platforms when developing multimedia applications, while Fung et al. [10] explain how to use the GPU to perform computer vision tasks. Pulli et al. [24] analyze the use of GPU-based computer vision for real-time applications by studying the performance in an OpenCV environment.

The use of GPUs as general purpose capable processors has not been extensively considered yet on mobile phones. However, some work can be found in the literature. The work


of Seo et al. [29] describes a 3D tracking application that uses the mobile GPU to accelerate certain parts of the algorithm, such as Canny edge detection. Singhal et al. [31] analyze the performance of several computer vision algorithms on a handheld GPU. The recent work from Wang et al. [36] uses a mobile GPU-CPU platform to construct an energy-efficient face recognition system. Our previous work evaluates the use of a handheld GPU to assist image recognition applications [5] and document stitching [4].

3 Camera-based interactivity

The usability of camera-based applications critically rests on latency. This becomes apparent with computer games, in which action-to-display delays exceeding about 100-150 ms are considered disturbing [8]. This applies even to key-press-to-sound or key-press-to-display delays.

If we employ a camera as an integral real-time application component, its integrationtime will add to the latency, as well as the image analysis computing. If we sample the sceneat 30 frames/second rate, our base latency is 33 ms Assuming that the integration time is33 ms, the information in the pixels read from the camera is onan average 17 ms old fora typical rolling shutter scheme. As the computing and display/audio latencies need to beadded, achieving the 100-150 ms range is challenging.
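The budget can be tallied explicitly, as in the short sketch below; the camera figures come from the text, while the processing and display terms are assumptions added for illustration.

# Illustrative end-to-end latency budget at 30 fps with a rolling shutter.
frame_period = 1000 / 30        # ms between frames: the 33 ms base latency
sensor_age = frame_period / 2   # ~17 ms average pixel age (rolling shutter)
processing = 40                 # ms of image analysis (assumed)
display_wait = 1000 / 60 / 2    # ~8 ms average wait for a 60 Hz display (assumed)
total = frame_period + sensor_age + processing + display_wait
print(f"end-to-end latency = {total:.0f} ms")  # ~98 ms, near the 100-150 ms limit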

Vision-based interactivity requires always-on cameras, which may compromise the battery life. Consequently, it is advisable to turn the cameras on only when needed. This interactivity issue can be improved by predicting the user's intentions. In the most typical case, this involves recognizing that the device is raised in front of the face into a horizontal pose to capture an image.

3.1 Automatic application launching

The key ideas for the automatic launching of camera applications rest on the utilization of the hand-held nature of the equipment and the user being in the field of view of a camera [16]. We use the camera to detect whether the user is watching the device, which is often a good indication of interaction needs. Figure 4 illustrates the user handling the device to launch a camera application. Clearly, the recognition of this context benefits from the coupled use of motion and face sensing using the frontal camera, provided that it is on all the time. The key lock is released, the back light is turned on, and the back camera is activated automatically. From the user's point of view it would be most convenient if the device automatically recognized the type of target that the user is expecting to capture, without demanding manual activation of any application. Several targets could be differentiated by, for example, showing a dialog box on the capture screen with the suggested options.

Fig. 4 Automatic launching of a camera application. When the device is raised in front of the user and a face is in the field of view, the main camera starts the capture.
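A hedged sketch of this launch logic is shown below: wake the back camera when the accelerometer reports a raise-to-horizontal gesture and the front camera sees a face. OpenCV's stock Haar cascade stands in for the face sensing; the accelerometer input and activate_back_camera() hook are hypothetical platform glue, stubbed here, not a real phone API.

import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def activate_back_camera():
    # Hypothetical platform hook: release key lock, back light on, start capture.
    print("back camera activated")

def user_is_watching(front_frame_gray):
    faces = face_cascade.detectMultiScale(front_frame_gray,
                                          scaleFactor=1.2, minNeighbors=5)
    return len(faces) > 0

def maybe_launch(front_frame_gray, accel_xyz):
    ax, ay, az = accel_xyz
    # Raised roughly horizontal: gravity mostly on the z axis (values in m/s^2).
    raised = abs(az) > 8.0 and abs(ax) < 2.0
    if raised and user_is_watching(front_frame_gray):
        activate_back_camera()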

3.2 Interactive multi-frame reconstruction

To demonstrate camera-based interactivity, we have built examples of multiframe reconstruction applications. Multiframe reconstruction techniques can be included in several applications such as scene panorama imaging, handheld document scanning, context recognition, high dynamic range composition, super resolution imaging or digital zooming. Multiframe reconstruction is a process that merges the information obtained from several input frames into a single result. The result can be an image that presents an increased field of view



or enhanced quality, but also a feature cloud with combined information obtained from the inputs, which can be used, for example, in object recognition.

While not a replacement for wide angle lenses or flatbed scanners, a multiframe image reconstruction application running on a cellular phone platform is essentially an interactive camera-based scanner that can be used in less constrained situations. The usage concept of the handheld solution offers a good alternative, as the users cannot realistically be expected to capture and analyze single-shot high-quality images of certain types of targets, such as broad scenes, three-dimensional objects, big documents, white board drawings or posters. Instead, our approach relies on real-time user interaction and the capturing of HD-720p resolution images that are registered and stitched together.

In addition to the interactivity benefits, the use of low resolution video imaging can be defended on purely technical grounds. In low resolution mode the sensitivity of the camera can be better as the effective size of the pixels is larger, reducing the illumination requirements and improving the tolerance against motion blur. On the other hand, a single-shot high resolution image, if captured properly, could be analyzed and used directly after acquisition, while the low resolution video capture approach requires significant post-processing effort. Figure 5 shows the four steps of a multiframe reconstruction application. In the proposed interactive solution, the capture interface sends images to the frame evaluation subsystem and gives feedback to the user. The best frames are selected. The images are corrected, unwarped and interpolated. The final stage constructs the resulting image. The next section describes the implementation details of the interactive multi-frame reconstructor applications.
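The data flow can be summarized in a few lines. The Python skeleton below mirrors the four stages of Fig. 5 for a list of NumPy frames; the stage bodies are toy stand-ins (assumptions for illustration), with the real algorithms described in Section 4.

import numpy as np

# Toy stand-ins so the skeleton runs; real stages are described in Sec. 4.
def register(f):   return f                    # align features (Sec. 4.1)
def quality_ok(f): return f.std() > 10         # crude sharpness proxy (Sec. 4.3)
def unwarp(f):     return f                    # geometric correction (Sec. 4.2)
def blend(fs):     return np.mean(fs, axis=0)  # naive averaging blend (Sec. 4.2)

def reconstruct(frames):
    registered = [register(f) for f in frames]           # 1. image registration
    selected   = [f for f in registered if quality_ok(f)] # 2. frame selection
    corrected  = [unwarp(f) for f in selected]            # 3. correction
    return blend(corrected)                               # 4. blending into mosaic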


Fig. 5 The four steps of a multiframe reconstruction application. Image registration aligns the features of each frame. An image selection subsystem, based on quality assessment, identifies the most suitable input images. A correction stage unwarps and enhances the selected frames. A blending algorithm composes the result image by reconstructing the final pixels.

4 Implementation

4.1 Interactive capture and pre-registration

To reconstruct a single image from multiple frames, the user faces the task of capturing several good quality partial images. The input frames are often far from perfect, when not outright unsuitable for the reconstruction. The most relevant problem when capturing several frames that are going to be merged is maintaining the camera orientation and perpendicularity to the target across the set of captured frames. The user might involuntarily tilt or shake the camera, causing the frames to lack focus or to exhibit motion blur, which will result in a low quality reconstructed image. Because a handheld camera is used, it is difficult for the user to maintain a constant viewing angle and distance, so the user interaction scheme simply targets capturing the target using a free scanning path.

The key usability challenge of a handheld camera-based multiframe reconstructor is enabling and exploiting interactivity. For this purpose, our solution is to let the device interactively guide the user to move the device during the capture [15]. The user starts the scanning by taking an initial image of some part of the target, for example a newspaper page or a white board drawing. Then, the application instructs the user to move the device to the


next location. The scanning direction is not restricted in any manner, and a zig-zag style path can be used. Rotating the camera may be necessary to avoid and eliminate shadows or reflections from the target, and it is a practically useful degree of freedom.

The allowed free scanning path is a very useful feature from the document imaging point of view; however, it sets significant computational and memory demands for the implementation and prevents building final mosaics in real time. We have also developed a scene panorama application that limits the scanning path to a unidirectional one. With this approach, a mobile phone can be used to stitch images on the fly, with the resulting image growing in real time with the frame acquisition [6]. The memory requirements are smaller as not all selected frames need to be stored until the end of the panorama blending process.

Figure 6 shows the typical problems present during the capture stage and the proposed solutions based on interactivity and quality assessment.

Fig. 6 The problems appearing during image acquisition and the proposed solutions. Involuntary tilting and shadows can be solved with the help of the user if proper guiding is offered. Quality assessment can select the best frames and avoid possible moving objects present in the scene.


Each image is individually processed to estimate motion. The estimation of the motion is based on modified Harris corners and a best linear unbiased estimator. A detailed description of the subsystem can be found in the paper by Hannuksela et al. [14]. The blurriness of each picture is measured and any moving objects are detected. In multi-frame reconstructions the regions with moving objects, typically shadows, are simply discarded. Based on shutter time and illumination-dependent motion blur, the user can be informed to slow down; when a suitable overlap between images has been achieved [15], a new image for stitching is selected from among the image frames based on quality assessment. The user can also be asked to back up, or can return to lower quality regions later in the scanning process. As a result, good partial images of the target can be captured for the final stitching stage.
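A rough functional analogue of this coarse motion estimator can be assembled from stock OpenCV pieces. The sketch below substitutes Harris-scored corner selection and pyramidal Lucas-Kanade tracking for the modified Harris + best linear unbiased estimator of [14], so it approximates the behavior rather than reproducing the method; all parameter values are assumptions.

import cv2
import numpy as np

def coarse_motion(prev_gray, curr_gray):
    """Approximate inter-frame translation from tracked Harris corners."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100,
                                  qualityLevel=0.01, minDistance=8,
                                  useHarrisDetector=True)
    if pts is None:
        return 0.0, 0.0
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    good = status.ravel() == 1
    flow = (nxt[good] - pts[good]).reshape(-1, 2)
    if flow.size == 0:
        return 0.0, 0.0
    dx, dy = np.median(flow, axis=0)  # robust to outliers such as moving objects
    return float(dx), float(dy)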

The practical result of the interactive capture stage is a set of high quality images that are pre-registered and aligned. The coarse frame registration information, based on motion estimates computed during interactive scanning, is employed as the starting point in constructing the mosaic image. The strategy in scanning is to keep sufficient overlaps between stored images to provision for frame re-registration using a highly accurate feature-based method during the final processing step.

4.2 Re-registration and blending

After on-line image capturing, the registration errors between the regions to be stitched can be on the order of pixels, which would be seen as unacceptable artifacts. In principle, it would be possible to perform accurate registration during image capture, but building final document images will in any case require post-processing to adjust the alignments and scales.

The fine registration employed for automatic mosaicking of document images is based on a RANSAC estimator with a SIFT feature point detector. In addition, graph-based global alignment and bundle adjustment steps are performed in order to minimize the registration errors and to further improve quality. Finally, the warped images are blended into the mosaic using simple Gaussian weighting. A more detailed description of the implementation can be found in the work of Hannuksela et al. [15].
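For the pairwise part of this stage, a minimal sketch with OpenCV's SIFT and RANSAC facilities looks as follows; the global alignment, bundle adjustment and Gaussian-weighted blending of [15] are omitted, and the ratio-test threshold is an assumed value.

import cv2
import numpy as np

def fine_register(img_a, img_b):
    """Pairwise fine registration with SIFT + RANSAC (requires
    opencv-python >= 4.4 for SIFT). Returns a 3x3 homography
    mapping img_a coordinates into img_b."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(des_a, des_b, k=2)
    good = [m for m, n in matches if m.distance < 0.7 * n.distance]  # ratio test
    src = np.float32([kp_a[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H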

Memory needs are a usual implementation bottleneck of fine registration on current mobile devices, limiting the size of the final mosaics and the number of input frames. It should be noticed that with the lower resolution frames the registration and blending errors are easy to see, and they reveal any shortcomings of the methodology.

4.3 Quality determination and frame selection

Taking pictures of documents with a handheld camera is often hampered by the self-shadow of the device, appearing as a moving region in the sequence of frames. In practice, the regions with moving objects, whether they are shadows or something else, are not desirable when stitching the final image. Instead of developing advanced methods for coping with these phenomena, we mostly count on the user interaction to keep the problems from harming the reconstruction result.

The treatment of shadows, reflections or moving objects depends on the type of scene that is processed. For natural scenes, if a moving object is present in a selected frame and fits within the sub-image, the image is blended by drawing a seam that is outside the boundaries


of the object. If only a partial object is present, the part of the frame without the object is the one that is blended.

The individual frames are selected based on moving object detection and blur measures [6]. A blur detection algorithm estimates the image's sharpness by summing together the derivatives of each row and each column. Motion detection is done in a very simple fashion to make the process fast. First, the difference between the current frame and the previous frame is computed. The result is a two-dimensional matrix that covers the overlapping area of the two frames. Then, this matrix is low-pass filtered to remove noise and thresholded against a fixed value to produce a binary motion map. If the binary image contains a sufficient number of pixels that are classified as motion, the dimensions of the assumed moving object are determined statistically. These operations are computed in real time to enable feedback to the user.
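Both measures fit in a few lines of NumPy. In the sketch below the frames are assumed to be already-aligned grayscale arrays over their overlap, and the filter size, threshold and pixel-count limit are assumed values, not those of [6].

import numpy as np
from scipy.ndimage import uniform_filter

def sharpness(gray):
    """Blur measure from the text: sum of row and column derivatives."""
    g = gray.astype(np.float32)
    return float(np.abs(np.diff(g, axis=0)).sum() +
                 np.abs(np.diff(g, axis=1)).sum())

def motion_map(prev, curr, thresh=15.0, min_pixels=500):
    """Frame differencing: difference, low-pass filter, fixed threshold,
    binary map, then a statistical extent of the moving object."""
    diff = np.abs(curr.astype(np.float32) - prev.astype(np.float32))
    smoothed = uniform_filter(diff, size=5)   # low-pass to suppress noise
    binary = smoothed > thresh                # binary motion map
    if binary.sum() < min_pixels:
        return None                           # no significant motion
    ys, xs = np.nonzero(binary)               # moving-object bounding box
    return xs.min(), ys.min(), xs.max(), ys.max()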

However, as differences in the image content may distort the results, the accuracy of the motion estimates used for preliminary registration needs to be reasonable. In practice, this is a trade-off between computational cost, interactivity and quality. Increased accuracy implies that less overlap is needed between frames, decreasing the computing requirements.

5 Interactivity and energy efficiency

5.1 Architectural maximization of stand-by and active-state battery life

The stand-by and active-state battery lives of a mobile device are interconnected. High stand-by power consumption means that active use regularly starts with a partially charged battery. As this is a recognized usability issue, designers optimize for low stand-by currents, primarily by turning off sub-systems such as motion sensors and cameras whenever possible. However, this exposes another usability issue, as the responsiveness of the device for interaction can be compromised. For instance, the device may be unable to detect its handling in the stand-by state.

The roots of the problem are in the involvement of the application processor of the platform, and we see this as an argument for dedicated camera and sensor processors. The availability of a fast application processor enables the straightforward implementation of novel camera-based applications and even vision-based user interfaces. On the other hand, the versatility and easy programmability of the single processor solution have led to design decisions that compromise the battery life if high interactivity is needed. In the active state a fast application processor may consume more than 900 mW with memories, while the whole device can go up to 3 W, a limit above which it becomes too hot to handle. This can push the battery life below one hour.

The background is in the typical top level hardware organization of a current mobile communications device with multimedia capability, such as the example in Figure 7(a). Most of the application processing functionality, including the camera and display interfaces, has been integrated into a single system chip. The baseband and mixed signal processing, such as the power supply and analog sensor control, have their own subsystems. For instance, with the design of Figure 7(a) the accelerometer measurements cannot be kept on all the time due to the power hungry processor. In comparison, sport watches that include accelerometers operate at sub-mW power levels, thanks to their very small footprint processors.


In practice, the bulk of the sensor processing needs to be moved to dedicated low-power subsystems that can be on all the time. This reduces the number of tasks the application processor needs to execute, improving its reactiveness, e.g., for highly interactive vision-based user interfaces. In Figure 7(b) we propose a possible future design with low power sensor processors. The inclusion of small-footprint processors that operate very close to the sensor units improves the energy efficiency of the subsystems, which can remain always on, enabling new interaction methods.

(a) Current device

(b) Future device

Fig. 7 Possible organization of a current and a future multimedia device. In current devices, the processing of the motion sensors and the frontal and back cameras is mainly done at the application level using a power-hungry main processor. We propose that future multimedia devices include several dedicated small-footprint processors that minimize the data transfers, improving the energy efficiency and allowing the subsystems to be always activated.


The practical challenge of an interactive camera-based application scenario is the assumption of having an active front camera. If it is operated at a lower frame rate, the latency savings may not materialize, while a higher frame rate reduces power efficiency. If the image processing is coupled with the employment of other sensors, we can formulate an approach that is both reliable and energy efficient.

Much of the application start latencies and delays can be hidden by predicting the user's intention to capture an image [16]. In the most typical case, this involves the use of the motion sensors to recognize the handling and raising of the device to the horizontal pose. When this happens, the camera can be switched on or its frame rate can be increased to improve interactivity, hiding the latencies perceived by the user.

The needs of interactivity are pushing the manufacturers to add more sensors to their devices, as well as to adopt architectural solutions that provide for long battery life. The designers need to take into account the power needs of the required signal processing and of the sensors themselves.

For instance, a triaxial accelerometer dissipates in the sub-mW range, a QVGA camera requires about 1 mW/frame/second, while capacitive touch screens demand around 3 mW. The analysis of 150 samples/second signals from triaxial accelerometers and magnetometers requires less than 30 000 instructions per second. Face tracking from QVGA (320-by-240) video requires around 10-15 MIPS per frame using Local Binary Pattern technology [2]. If implemented on an ARM7, the energy per instruction (EPI) is around 100 pJ, while with optimized sensor processing architectures the EPI can be pushed well below 5 pJ [22]. Consequently, if implemented at a low frame rate of 1 frame/second, the corresponding power needs range from microwatts up to 1 mW.
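These figures can be cross-checked with a one-line energy model, power = instruction rate x EPI. The sketch below uses the numbers quoted above; the 12.5 MIPS midpoint of the 10-15 MIPS range is an assumption.

# Back-of-the-envelope check: power = workload x energy per instruction.
accel_ips = 30_000         # instructions/s for 150 Hz accelerometer analysis
face_ips = 12.5e6 * 1      # ~10-15 MIPS per QVGA frame, at 1 frame/s
epi_arm7 = 100e-12         # J/instruction on ARM7
epi_opt = 5e-12            # J/instruction, optimized sensor processor [22]

for name, epi in (("ARM7", epi_arm7), ("optimized", epi_opt)):
    power_w = (accel_ips + face_ips) * epi
    print(f"{name}: {power_w * 1e3:.3f} mW")
# ARM7: ~1.25 mW; optimized: ~0.063 mW, i.e. microwatts up to ~1 mW.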

Employing the application processor with its interfaces for the same purposes would demand tens of mW, significantly reducing the battery life in active-state use. Figure 8 shows our measurements of the battery discharge times of a Nokia N9 phone under constant load. Since the battery life is a nonlinear function of the load current, small improvements in the energy efficiency of the applications can achieve large improvements in the operation times. Similar observations of the battery life and its knee region have been made by Silven and Rintaluoma [30] and can be found in the earlier work of Rakhmatov and Vrudhula [26].
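The nonlinearity can be illustrated with Peukert's empirical law, t = C / I^k; this is an assumption for illustration only, since Fig. 8 shows measured data and [26] develops a more detailed battery model.

# Toy Peukert-style illustration of the knee in battery discharge curves.
capacity_mah = 1450
k = 1.2                          # assumed Peukert exponent

def life_hours(load_ma):
    return capacity_mah / load_ma ** k   # not dimensionally strict; illustrative

for load in (200, 400, 800):
    print(f"{load} mA -> {life_hours(load):.1f} h")
# Halving an 800 mA load more than doubles the runtime: the knee effect.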

Table 2 compares the stand-by and active-state power needs of the designs in Figures 7(a) and 7(b) with "conventional" and "advanced" user interfaces. With the latter, a camera and accelerometers are used for detecting user interaction needs.

Table 2 Comparison of stand-by and active-state power needs of user interfaces.

                                               Stand-by power  Active-state power
                                               [mW]            [mW]
Advanced UI on current platform
  VGA camera (15 fps) + sensors                55              55
  Image + sensor processing (app. proc.)       70              700
  Others (memory, screen, subsystems)          5               500
  Total                                        130             1255
Advanced UI on future platform
  QVGA camera (1 fps) + sensors                3               20
  Image + sensor processing (dedicated proc.)  2               200
  Others (memory, screen, subsystems)          5               500
  Total                                        10              720
Conventional UI (phone total)                  5               650


Fig. 8 Discharge time of a 1450 mAh Li-ion battery in an N9 phone. The shape of the discharge curve implies that small improvements in the applications' energy efficiency can achieve large improvements in the operation times.

Current mobile platforms do not include dedicated energy efficient sensor processors that can be employed to develop advanced user interfaces. However, an increasing number of devices already include a Graphics Processing Unit that can be accessed with standard APIs such as OpenGL ES or OpenCL. A good solution that can be integrated into the current platforms is the employment of the GPU for general purpose computing, which improves the active-state battery life due to its lower energy consumption.

Using mobile GPUs for camera-based applications is an attractive option. We consider speeding up interactive applications by using the graphics processing resources. Also, their smaller EPI makes them a suitable candidate for reducing the power consumption of computationally intensive tasks.

5.2 GPU implementation of multiframe techniques

Several algorithms that can be used in a multiframe reconstruction application have been implemented using a PowerVR SGX 530 mobile GPU. The Nokia N9 graphics processor is accessible via the OpenGL ES application programming interface (API). However, the use of GPUs as general purpose capable processors has not been extensively considered yet on mobile phones. As a consequence, developers of image processing algorithms that use the camera as the main data source lack fast ways of transferring data between the processing units and the capture or storage devices.

In current platforms, this must be done by copying the images obtained by the camera from the CPU memory to the GPU memory in a matching format [5]. Furthermore,


the overheads of copying images as textures to graphics memory result in significant slowdowns. The lack of shared video memory causes multiple accesses to the GPU memory to retrieve the data for the processing engine.

Although the OpenGL ES programmable pipeline enables the implementation of many general processing functions, the APIs still have several limitations. The most important one is that the GPU is forced to work in single buffer mode to allow the read-back of the rendered textures. Other shortcomings include the need to use power-of-two textures and the restricted types of pixel data.

While traditional mosaic building algorithms tend to follow a sequential path with multiple accesses to memory from the processing unit, in our example application each step of the mosaicking algorithm has to be evaluated separately in order to find the best ways of organizing the data and to reduce the overheads.

For the time being, the most obvious operations to be accelerated using OpenGL ES are pixel-wise operations and geometrical transformations such as warps and interpolations [4]. Part of the computations required in a feature matching based registration process can be moved to the GPU through the use of programmable shaders. Previous work shows that desktop-GPU SIFT feature extraction used along with a RANSAC estimator in parallel can provide a 50% CPU load reduction [32], and that feature extraction times on VGA frames can be reduced about ten times [27]. A mobile-GPU Harris corner detector can be used to accelerate the registration process [31].

As a case study in feature extraction, LBP extraction has been implemented on an OMAP3630 platform using the OpenGL ES 2.0 shading language [5]. Our experiments show that the equivalent CPU algorithm outperforms the GPU algorithm on all three image sizes, although the GPU times are comparable for bigger image sizes, where the parallelization is easier. However, the processing does not tie up the CPU: it can be utilized to compute part of the frames concurrently or to perform some other tasks while the GPU performs the extraction.

Area-based image registration methods are also suitable for heavy parallelization. For example, the method by Vandewalle et al. [34] uses Tukey window filtering and FFT-based phase correlation computations to register two images. Experiments run on an OMAP 3630 platform show that the window filtering and complex division routines increase their execution speed up to three times when performed on the built-in GPU.
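A compact NumPy/SciPy sketch of this registration scheme is given below; it recovers translation only, whereas the method of [34] also estimates rotation and operates on aliased frames.

import numpy as np
from scipy.signal.windows import tukey

def phase_correlate(img_a, img_b):
    """Translation estimate via Tukey windowing + FFT phase correlation."""
    h, w = img_a.shape
    win = np.outer(tukey(h, 0.5), tukey(w, 0.5))   # taper the borders
    Fa = np.fft.fft2(img_a * win)
    Fb = np.fft.fft2(img_b * win)
    cross = Fa * np.conj(Fb)
    cross /= np.abs(cross) + 1e-9                  # keep phase only
    corr = np.abs(np.fft.ifft2(cross))
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Map peaks in the upper/right halves back to negative shifts.
    if dy > h // 2: dy -= h
    if dx > w // 2: dx -= w
    return dx, dy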

The stitching process requires the correction of each selected frame with a warping function that must interpolate the pixel data to the coordinates of the new frame. This costly process can be done in a straightforward manner in several steps using any fixed or programmable graphics pipeline.
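What the graphics pipeline computes here is inverse warping with bilinear filtering, which the following NumPy sketch spells out for a grayscale frame and a given homography; it is a CPU reference for the operation, not the GPU code itself.

import numpy as np

def warp_homography(src, H, out_shape):
    """For each output pixel, sample the source at H^-1 * (x, y, 1)
    with bilinear interpolation: the per-pixel work the GPU does."""
    h, w = out_shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    pts = np.stack([xs, ys, np.ones_like(xs)]).reshape(3, -1)
    sx, sy, sw = np.linalg.inv(H) @ pts
    sx, sy = sx / sw, sy / sw                       # perspective divide
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    fx, fy = sx - x0, sy - y0
    valid = ((x0 >= 0) & (x0 < src.shape[1] - 1) &
             (y0 >= 0) & (y0 < src.shape[0] - 1))
    x0c = np.clip(x0, 0, src.shape[1] - 2)
    y0c = np.clip(y0, 0, src.shape[0] - 2)
    s = src.astype(np.float32)
    top = s[y0c, x0c] * (1 - fx) + s[y0c, x0c + 1] * fx      # bilinear mix
    bot = s[y0c + 1, x0c] * (1 - fx) + s[y0c + 1, x0c + 1] * fx
    out = (top * (1 - fy) + bot * fy) * valid                # zero outside
    return out.reshape(h, w)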

The programmable pipeline of OpenGL ES 2.0 enables implementing blur detection with shader programming, in a similar way to the feature extraction method. The first stage of the blur detection is a simple derivation algorithm, which can be implemented efficiently with an OpenGL ES 2.0 shader. Our tests show that on an OMAP 3630 platform the derivation algorithm on HD-720p images can be computed about three times faster on the GPU, while reducing the CPU load by 80%.
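The derivative pass maps naturally onto a fragment shader. The GLSL ES source below (held as a Python string) is a hedged sketch of such a pass, since the paper does not give its shader code: each fragment writes the absolute horizontal plus vertical derivative, and a later reduction pass or CPU read-back sums the values into the blur measure.

# OpenGL ES 2.0 (GLSL ES 1.00) fragment shader for the derivative stage.
DERIVATIVE_SHADER = """
precision mediump float;
uniform sampler2D u_image;
uniform vec2 u_texel;              // 1.0 / texture size
varying vec2 v_texcoord;
void main() {
    float c  = texture2D(u_image, v_texcoord).r;
    float dx = texture2D(u_image, v_texcoord + vec2(u_texel.x, 0.0)).r - c;
    float dy = texture2D(u_image, v_texcoord + vec2(0.0, u_texel.y)).r - c;
    gl_FragColor = vec4(abs(dx) + abs(dy));
}
"""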

The pixel blending operation can be done in a straightforward manner with the hardware-implemented blending function. When the blending function is enabled, overlapping textures are blended together. The transparency can be determined by choosing a blending factor for every channel of both images and then a blending function. The channel values are multiplied with their respective factors and then the blending function is applied to each channel pair. Since OpenGL ES 2.0 has a programmable pipeline, blending can also be done with a shader algorithm. In this way, all the needed calculations can be combined in only one rendering stage.
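In NumPy terms, the fixed-function blend corresponds to a per-pixel weighted sum, as in this small sketch; using a per-pixel alpha mask to stand in for the Gaussian weights of Sec. 4.2 is an assumption for illustration.

import numpy as np

def feather_blend(base, overlay, alpha):
    """Per-pixel weighted blend, mirroring the GL blend equation
    dst = src * factor_s + dst * factor_d with per-pixel factors."""
    a = alpha[..., None] if overlay.ndim == 3 else alpha  # broadcast channels
    return overlay * a + base * (1.0 - a)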


Table 3 shows the computation times and energy consumption of several algorithms utilized in multiframe reconstruction.

Table 3 Computational and energy costs per HD-720p frame of several operations implemented on a mobile platform (OMAP 3630).

                         CPU time  CPU energy  GPU time  GPU energy
                         [ms]      [mJ]        [ms]      [mJ]
Grayscale conversion     18        3.6         8         1.0
Scaling                  24        5.3         12        1.5
LBP extraction           48        9.1         90        13.1
Harris corner detector   60        13.5        170       22.5
Blur detection           80        28.2        60        8.0
Tukey windowing          35        5.1         15        2.1
Image warping            140       27.3        40        5.4
Image blending           270       50.7        120       15.8

6 Application performance

The computing requirements of a multiframe reconstructor are quite significant for a battery powered mobile device, although the application can be broken down into the interactive, real-time frame capture part and the non-interactive final mosaic stitching post-processor.

The application has been developed on a Nokia N9 device. This device is based on an OMAP 3630 System on Chip composed of a 1 GHz ARM Cortex-A8 and a PowerVR SGX 530 GPU supporting OpenGL ES 2.0. The Nokia N9 has a 3.9 inch capacitive touchscreen with a maximum resolution of 854x480 pixels and an 8 Mpixel camera with a maximum video resolution of 1280x720 pixels. The device includes a 1450 mAh battery. Table 4 shows the computational and energy costs of the most expensive parts of the HD-720p frame-based document scanning application when implemented entirely on the application processor.

Table 4 Algorithm’s computational and energy costs per frame on a N9(ARM Cortex-A8).

                            Computation time [ms]  Energy consumption [mJ]
Online loop
  Camera motion estimation  100                    200
  Quality assessment        50                     10
Offline computations
  Image registration        5000                   800
  Image correction          200                    40
  Image blending            100-300                40

The application has been implemented using only fixed point arithmetic to achieve good performance on most devices. The implementation of the interactive capture stage allows processing about 7.5 frames/second on the Nokia N9 (ARM Cortex-A8 processor) in HD-720p resolution mode. The off-line stage operates at about 0.2 frames/second, and depends both on the available memory resources and the processor speed.
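Fixed-point arithmetic here means representing fractional values as scaled integers. A minimal Q16.16 example is shown below; the paper does not state the actual Q format used, so the parameters are assumptions.

# Q16.16 fixed point: 16 integer bits, 16 fractional bits.
FRAC_BITS = 16
ONE = 1 << FRAC_BITS

def to_fix(x):
    return int(round(x * ONE))

def fix_mul(a, b):
    return (a * b) >> FRAC_BITS      # rescale after integer multiply

x = fix_mul(to_fix(1.5), to_fix(2.25))
print(x / ONE)   # 3.375, computed without floating point in the hot path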


High resolution video frames require long processing times that, although still suitable for interactivity purposes, might not result in the best user experience. Reducing the resolution of the input frames proportionally decreases the needed processing times, allowing a better user experience. However, the fast evolution of mobile processors and the recent inclusion of multi-core chipsets suggest that future mobile platforms will be able to handle a multiframe reconstruction application in real time using input images of even higher resolution. Table 5 shows a comparison of the processing times at different resolutions on a Nokia N9. The experiments show that processing time increases almost linearly with the number of pixels and operations.

Table 5 Application processing times [ms] at different resolutions on an N9 (ARM Cortex-A8).

                       320x240  640x480  1280x720  1920x1080
Online loop            15       52       150       330
Offline computations   500      1900     5400      11900

7 Summary

Applications reliant on interactive camera-based user interfaces are among the most demanding uses of mobile devices. All the available processing resources are needed to ensure smooth operation, but that may compromise battery life and usability. Nevertheless, based on our experiences, camera-based interactivity is an extremely attractive scheme.

Camera sub-systems on mobile device platforms are a rather recent add-on, designed just for capturing still images and video frames. At the same time, the energy efficiency features of the platform architectures, computing resources, and displays have been optimized for video playback. From the point of view of interactive camera-based applications, compatible data formats for the camera and graphics systems would be a major improvement.

We have presented a system for building document mosaic images from selected video frames on mobile phones. High interactivity is achieved by providing real-time feedback on motion and quality, while simultaneously guiding the user. The captured images are automatically stitched together with good quality and high resolution. The graphics processing unit of the mobile device is used to speed up the computations. Based on our experiments, the use of the GPU improves performance, although only moderately, and increases battery life, facilitating interactivity. We believe that the cameras in future mobile devices may, for most of the time, be used for sensory purposes rather than for capturing images for human viewing.

References

1. A. Adams, N. Gelfand, and K. Pulli. Viewfinder alignment. In Eurographics 2008, pages 597–606, 2008.

2. T. Ahonen, A. Hadid, and M. Pietikainen. Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):2037–2041, 2006.

3. R. Bilcu, A. Burian, A. Knuutila, and M. Vehvilainen. High dynamic range imaging on mobile devices. In Electronics, Circuits and Systems, 2008. ICECS 2008. 15th IEEE International Conference on, pages 1312–1315, 2008.


4. M. Bordallo, J. Hannuksela, O. Silven, and M. Vehvilainen. Graphics hardware accelerated panorama builder for mobile phones. In Proceedings of SPIE Electronic Imaging 2009, 7256, 2009.

5. M. Bordallo, H. Nykanen, J. Hannuksela, O. Silven, and M. Vehvilainen. Accelerating image recognition on mobile devices using GPGPU. In Proceedings of SPIE Electronic Imaging 2011, 7872, 2011.

6. J. Boutellier, M. Bordallo, O. Silven, M. Tico, and M. Vehvilainen. Creating panoramas on mobile phones. In Proceedings of SPIE Electronic Imaging 2007, 6498, 2007.

7. T. Capin, K. Pulli, and T. Akenine-Moller. The state of the art in mobile graphics research. IEEE Computer Graphics and Applications, 1:74–84, 2008.

8. J. Dabrowski and E. Munson. Is 100 milliseconds too fast? In Conference on Human Factors in Computing Systems, pages 317–318, 2001.

9. S. DiVerdi and T. Hollerer. GroundCam: A tracking modality for mobile mixed reality. In IEEE Virtual Reality, pages 75–82, 2007.

10. J. Fung and S. Mann. Using graphics devices in reverse: GPU-based image processing and computer vision. In Multimedia and Expo, 2008 IEEE International Conference on, pages 9–12, 2008.

11. N. Gelfand, A. Adams, S. H. Park, and K. Pulli. Multi-exposure imaging on mobile devices. In Proceedings of the International Conference on Multimedia, MM '10, pages 823–826, New York, NY, USA, 2010. ACM.

12. S. J. Ha, S. H. Lee, N. I. Cho, S. K. Kim, and B. Son. Embedded panoramic mosaic system using auto-shot interface. IEEE Transactions on Consumer Electronics, 54(1):16–24, 2008.

13. M. Hachet, J. Pouderoux, and P. Guitton. A camera-based interface for interaction with mobile handheld computers. In I3D'05 - ACM SIGGRAPH 2005 Symposium on Interactive 3D Graphics and Games, pages 65–71. ACM Press, 2005.

14. J. Hannuksela, P. Sangi, and J. Heikkila. Vision-based motion estimation for interaction with mobile devices. Computer Vision and Image Understanding: Special Issue on Vision for Human-Computer Interaction, 108(1–2):188–195, 2007.

15. J. Hannuksela, P. Sangi, J. Heikkila, X. Liu, and D. Doermann. Document image mosaicing with mobile phones. In 14th International Conference on Image Analysis and Processing, pages 575–580, 2007.

16. J. Hannuksela, O. Silven, S. Ronkainen, S. Alenius, and M. Vehvilainen. Camera assisted multimodal user interaction. In Proceedings of SPIE Electronic Imaging 2010, 754203, 2010.

17. A. Haro, K. Mori, T. Capin, and S. Wilkinson. Mobile camera-based user interaction. In IEEE International Conference on Computer Vision, Workshop on Human-Computer Interaction, pages 79–89, Beijing, China, 2005.

18. J. Hwang, J. Jung, and G. J. Kim. Hand-held virtual reality: A feasibility study. In ACM Virtual Reality Software and Technology, pages 356–363, 2006.

19. H. Kalva, A. Colic, A. Garcia, and B. Furht. Parallel programming for multimedia applications. Multimedia Tools Appl., 51(2):801–818, Jan. 2011.

20. S. Kim and W.-Y. Su. Recursive high-resolution reconstruction of blurred multiframe images. IEEE Transactions on Image Processing, 2(4):534–539, Oct. 1993.

21. M. Mohring, C. Lessig, and O. Bimber. Optical tracking and video see-through AR on consumer cell phones. In Workshop on Virtual and Augmented Reality of the GI-Fachgruppe AR/VR, pages 193–204, 2004.

22. L. Nazhandali, B. Zhai, J. Olson, A. Reeves, M. Minuth, R. Helfand, S. Pant, T. Austin, and D. Blaauw. Energy optimization of subthreshold-voltage sensor network processors. In Proceedings of the 32nd Annual International Symposium on Computer Architecture, ISCA '05, pages 197–207, Washington, DC, USA, 2005. IEEE Computer Society.

23. N. Pears, P. Olivier, and D. Jackson. Display registration for device interaction. In 3rd International Conference on Computer Vision Theory and Applications, pages 446–451, 2008.

24. K. Pulli, A. Baksheev, K. Kornyakov, and V. Eruhimov. Real-time computer vision with OpenCV. Commun. ACM, 55(6):61–69, June 2012.

25. K. Pulli, W.-C. Chen, N. Gelfand, R. Grzeszczuk, M. Tico, R. Vedantham, X. Wang, and Y. Xiong. Mobile visual computing. In Ubiquitous Virtual Reality, 2009. ISUVR '09. International Symposium on, pages 3–6, July 2009.

26. D. Rakhmatov and S. Vrudhula. Energy management for battery-powered embedded systems. ACM Trans. Embed. Comput. Syst., 2(3):277–324, Aug. 2003.

27. J. M. Ready and C. N. Taylor. GPU acceleration of real-time feature based algorithms. In Proceedings of the IEEE Workshop on Motion and Video Computing, page 8, Washington, DC, USA, 2007. IEEE Computer Society.

28. M. Rohs. Real-world interaction with camera-phones. In 2nd International Symposium on Ubiquitous Computing Systems, pages 39–48, 2004.

29. B.-K. Seo, J. Park, and J.-I. Park. 3-D visual tracking for mobile augmented reality applications. In Multimedia and Expo (ICME), 2011 IEEE International Conference on, pages 1–4, July 2011.

30. O. Silven and T. Rintaluoma. Energy efficiency of video decoder implementations. In F. Fitzek and F. Reichert (eds.), Mobile Phone Programming and its Applications to Wireless Networking, pages 421–439. Springer, 2007.


31. N. Singhal, I. K. Park, and S. Cho. Implementation and optimization of image processing algorithms on handheld GPU. In Image Processing (ICIP), 2010 17th IEEE International Conference on, pages 4481–4484, Sept. 2010.

32. S. N. Sinha, J. M. Frahm, M. Pollefeys, and Y. Genc. GPU-based video feature tracking and matching. In Workshop on Edge Computing Using New Commodity Architectures, 2006.

33. Y. Tian, K.-H. Yap, and Y. He. Vehicle license plate super-resolution using soft learning prior. Multimedia Tools and Applications, pages 1–17, 2011. DOI 10.1007/s11042-011-0821-2.

34. P. Vandewalle, S. Susstrunk, and M. Vetterli. A frequency domain approach to registration of aliased images with application to super-resolution. EURASIP Journal on Applied Signal Processing (special issue on Super-resolution), 24:1–14, 2006.

35. D. Wagner, A. Mulloni, T. Langlotz, and D. Schmalstieg. Real-time panoramic mapping and tracking on mobile phones. In Virtual Reality Conference (VR), IEEE, 2010.

36. Y.-C. Wang, B. Donyanavard, and K.-T. Cheng. Energy-aware real-time face recognition system on mobile CPU-GPU platform. In International Workshop on Computer Vision on GPU, 2010.

37. S. Winkler, K. Rangaswamy, and Z. Zhou. Intuitive map navigation on mobile devices. In C. Stephanidis, editor, 4th International Conference on Universal Access in Human-Computer Interaction, Part II, HCI International 2007, LNCS 4555, pages 605–614. Springer, Beijing, China, 2007.

38. Y. Xiong and K. Pulli. Fast panorama stitching for high-quality panoramic images on mobile phones. IEEE Transactions on Consumer Electronics, 56(2):298–306, 2010.

