
2005 7th International Conference on Information Fusion (FUSION)

Multimodal Image Registration with Applications to Image Fusion

Jamie P. Heather
Waterfall Solutions Ltd
32 London Road, Guildford, UK
jamie.heather@waterfallsolutions.co.uk

Dr Moira I. Smith
Waterfall Solutions Ltd
32 London Road, Guildford, UK
moira.smith@waterfallsolutions.co.uk

Abstract - This paper presents an algorithm for accurately aligning two images of the same scene captured simultaneously by sensors operating in different wavebands (e.g. TV and IR). Such a setup is common in image fusion systems where the sensors are physically aligned as closely as possible and yet significant image mis-alignment remains due to differences in field of view, lens distortion and other camera characteristics. Our proposed registration method involves numerically minimising a global objective function defined in terms of local normalised correlation measures. The algorithm is demonstrated on real multimodal imagery and applications to image fusion are considered. In particular we illustrate that fused image quality is closely related to the degree of registration accuracy achieved. To maintain this accuracy in real systems it is often necessary to continuously update the transform over time. Thus we extend our registration approach to execute in real time on live imagery, providing optimal fused imagery in the presence of relative sensor motion and parallax effects.

Keywords: Image registration, image fusion.

1 Introduction

Image registration is the process of spatially aligning two or more images of the same scene, possibly recorded at different moments in time. This broad definition encompasses a multitude of image alignment problems in the fields of medicine, defence, remote sensing, computer vision and pattern recognition. In each case the fundamental problem is the same: to find a mapping

(x, y) \to (u, v), \qquad u = u(x, y), \quad v = v(x, y) \qquad (1)

between the pixels (x, y) in one image and the pixels (u, v) in another. The complexity of the solution will depend on the application under consideration. In the simplest case, a straightforward geometric translation or rotation may be sufficient to accurately align the two images. More advanced global approaches include affine, polynomial and projective transformations. Where a global transformation is not appropriate, a piecewise, elastic membrane or optical flow technique may be applied instead. Brown [1] presents an overview of these various transformations and the established methods for obtaining a registration solution. She also emphasises that before trying to solve an image registration problem it is vital to first appreciate the cause of the mis-alignment and then select a transformation which appropriately models it. Sources of mis-alignment can be grouped into two categories:

* Spatial mis-alignments commonly arise due to differences in sensor position/viewing angle and also due to differences between the sensors themselves (e.g. field of view, resolution, lens distortion).

* Temporal mis-alignments occur when there is some relative motion between the sensor and the objects in the scene and the images are captured at different points in time.

Figure 1 below illustrates these sources of spatial and temporal mis-alignment for a two-camera setup.

Figure 1. Example sources of image mis-alignment (yaw, pitch and roll offsets between the two cameras).

The possibilities above give rise to a large number of very different image registration problems. In this paper we consider the specific problem of aligning two images captured simultaneously from (approximately) the same viewpoint by rigidly mounted sensors operating in different regions of the electromagnetic spectrum (Figure 2). This setup is common in image fusion systems which aim to combine the complementary features from two or more wavebands into a single image with extended information content. The fused image offers significant benefit over working with the raw sensor outputs including increased situational awareness and improved target detection/identification accuracy. A variety of simple and advanced image fusion algorithms have been


developed over the last two decades [2] but they all operate at the pixel level and make the fundamental assumption that the source imagery is properly spatially aligned. The accuracy of the registration process is therefore critical to overall system performance.

Figure 2. The raw outputs from a TV sensor (top) and an IR sensor (bottom). A significant difference in scale is immediately obvious and closer inspection of the IR image reveals a barrel distortion effect not present in the TV image.

Image fusion systems usually employ one of two sensor configurations (Figure 3). In the first illustration the sensors are closely mounted side-by-side and boresighted for a particular distance (often infinity). In terms of the registration problem this simple setup has an immediate disadvantage: parallax effects. The separation between the optical paths of the sensors means that it is not possible to find a fixed transformation which will always map one source image onto the other. Instead the transformation has a complex dependency on distance to objects in the scene. In practice, for closely mounted sensors boresighted at infinity, the parallax effect becomes negligible (i.e. sub-pixel) when the distance exceeds a certain value (usually a few hundred metres).

Figure 3. Simple examples of boresighted (left) and common aperture (right) sensor configurations.

Parallax effects can be avoided (to a large extent) through the use of a common aperture. In this configuration (also illustrated) one or more beam-splitters are employed, thereby allowing the two sensors to share a common optical path. In practice some design compromises are usually necessary due to the physical constraints of sensor/beam-splitter size (particularly in systems with more than two cameras) and consequently some small degree of parallax may remain. Common aperture systems can also be expensive and, as such, may not be the preferred 'off-the-shelf' low-cost solution.

Initially we will assume that parallax effects are negligible in our image fusion system (either because the objects in the scene are sufficiently distant or through the use of a common aperture). The additional assumption that the two sensors are synchronised allows us to seek a fixed transformation mapping the pixels in one image (referred to as the 'reference' image) onto those in the other image (referred to as the 'input' image). Any image mis-alignments are now due to differences in sensor characteristics (e.g. field of view, pixel resolution, optical distortion) and can generally be represented using an N-th degree polynomial transformation

u(x, y) = \sum_{i=0}^{N} \sum_{j=0}^{N-i} a_{ij} x^i y^j, \qquad v(x, y) = \sum_{i=0}^{N} \sum_{j=0}^{N-i} b_{ij} x^i y^j \qquad (2)

where the coefficients a_{ij} and b_{ij} are constants to be determined. When N = 1 equation (2) reduces to the popular affine transformation which is capable of representing translations, rotations and shears. Adding higher order terms to the equation allows compensation of complex lens distortions such as pincushion or barrel distortion effects. A second or third degree polynomial is sufficient for most practical applications.
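For illustration, the sketch below (our own, not the paper's implementation) evaluates the polynomial mapping of equation (2); the function name and the dict-of-coefficients representation are assumptions for this example.

```python
def polynomial_transform(x, y, p):
    """Evaluate equation (2). p = (a, b), two dicts {(i, j): coefficient}
    with i + j <= N, so N = 1 gives the six-coefficient affine case.
    Works equally on scalars and on numpy coordinate arrays."""
    a, b = p
    u = sum(c * x**i * y**j for (i, j), c in a.items())
    v = sum(c * x**i * y**j for (i, j), c in b.items())
    return u, v

# A plausible affine example (N = 1): a 2x scale difference plus a shift.
p = ({(0, 0): 10.0, (1, 0): 2.0, (0, 1): 0.0},   # a_ij (u coefficients)
     {(0, 0): -5.0, (1, 0): 0.0, (0, 1): 2.0})   # b_ij (v coefficients)
print(polynomial_transform(100.0, 50.0, p))      # -> (210.0, 95.0)
```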

At this point we note that the projective transform, as defined by the equations

u(x, y) = \frac{c_1 x + c_2 y + c_3}{c_4 x + c_5 y + 1}, \qquad v(x, y) = \frac{d_1 x + d_2 y + d_3}{d_4 x + d_5 y + 1} \qquad (3)


(where c_1, ..., c_5, d_1, ..., d_5 are constants), is often argued to be a superior representation of image mis-alignment in image fusion systems. This is because the underlying model assumes the scene is being viewed from different viewpoints and hence is able to compensate for parallax effects. However, the model also assumes the scene is planar (i.e. completely flat) which is generally only a valid approximation in airborne applications, where parallax effects are less noticeable anyway. Moreover, even when the scene is effectively planar the transform coefficients must be re-calculated whenever the cameras are moved to a new viewing angle (Figure 4). We therefore advocate that for most practical image fusion applications the projective transform offers no benefit over the more flexible polynomial transform (although in section 5 we consider the real-time update of transform coefficients).

Figure 4. For 'flat' scenes the projective transform is able to compensate for parallax effects but the transform coefficients must be re-calculated if the viewpoint changes. (Panels: projectively warping image A and overlaying it on image B gives good alignment; applying the same transformation from a different viewpoint gives poor alignment.)

Determining optimal global transform coefficients for a multimodal registration problem is no trivial task due to the complex relationship between the wavebands. The intensity of a particular image pixel is determined not only by the camera response but also by a number of physical properties (e.g. materials in the scene, atmospheric conditions and background radiation) and hence, even after accurate spatial alignment, multimodal imagery often remains uncorrelated. Consequently, many established registration techniques (e.g. from the mature field of computer vision [3]) cannot be applied directly to multimodal imagery.

In this paper we propose an algorithm for automatically determining optimal transform coefficients for aligning multimodal imagery (under the assumptions described above). Our approach is based on previous work by Irani and Anandan [4] which split the registration problem into two stages:

i. Identification of a suitable image representation based on a multi-scale analysis of spatial structure. This representation is (relatively) invariant to raw image intensity and hence is ideal for assessing alignment of multimodal imagery.

ii. Formulation of a new automatic alignment technique utilising normalised correlation as a local similarity measure.

This method has been demonstrated on real multimodal imagery and appears to work well. Our approach also utilises an invariant image representation and local correlation measures but they are formulated into a global objective function to be minimised. We then have a rigorously-defined optimisation problem which is solved to provide highly accurate registered images.

2 Approach

We present an overview of our automatic registration algorithm as an optimisation problem where an objective function is to be minimised. The objective function provides an assessment of 'how good' a particular choice of transform coefficients is. It is constructed by locally assessing the similarity between the reference image and the transformed image and then summing over all local regions. Standard correlation techniques can be used as similarity measures provided the multimodal imagery is first decomposed into an intensity-invariant representation. However, the weighting of alignment achievement across the field of view (particularly for large fields of view) should also be considered. This region could, for example, be driven as a foveal patch in piloting applications.

2.1 Image correlation techniques

Image correlation techniques have been successfully applied in a wide range of applications for many years, and particularly in the field of computer vision. Perhaps the most common example is in stereo-vision, where two snapshots taken from different viewpoints are used to recover depth information. Here the sensors are usually identical and hence a straightforward mean-square-difference calculation

\sum_{x,y} \left( A(x, y) - B(x, y) \right)^2 \qquad (4)

is suitable as a similarity measure which is minimised when the two images A and B are perfectly aligned. A related measure is the normalised correlation


\frac{\sum_{x,y} A(x, y) B(x, y)}{\sqrt{\sum_{x,y} A(x, y)^2 \, \sum_{x,y} B(x, y)^2}} \qquad (5)

which lies in the range [-1, 1] and is invariant to scalar multiplication. This function attains its maximum/minimum value when B = \alpha A for some positive/negative constant \alpha respectively, and is zero when the two images are uncorrelated. The final measure we introduce here is the statistical correlation

\frac{\sum_{x,y} \left( A(x, y) - \mu_A \right)\left( B(x, y) - \mu_B \right)}{\sqrt{\sum_{x,y} \left( A(x, y) - \mu_A \right)^2 \, \sum_{x,y} \left( B(x, y) - \mu_B \right)^2}} \qquad (6)

(where \mu_A and \mu_B denote the mean intensities of A and B)

which also lies in the range [-1, 1] and is invariant to both scalar multiplication and addition. It is therefore the most robust of the three measures.
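As a minimal numpy sketch of measures (4)-(6) (illustrative helper names, not from the paper):

```python
import numpy as np

def mean_square_difference(A, B):
    # Equation (4): minimised when A and B are perfectly aligned.
    return np.sum((A - B) ** 2)

def normalised_correlation(A, B):
    # Equation (5): in [-1, 1], invariant to scalar multiplication.
    return np.sum(A * B) / np.sqrt(np.sum(A ** 2) * np.sum(B ** 2))

def statistical_correlation(A, B):
    # Equation (6): additionally invariant to intensity offsets.
    A0, B0 = A - A.mean(), B - B.mean()
    return np.sum(A0 * B0) / np.sqrt(np.sum(A0 ** 2) * np.sum(B0 ** 2))

rng = np.random.default_rng(0)
A = rng.random((64, 64))
B = 2.5 * A + 1.0                        # scaled and offset copy of A
print(statistical_correlation(A, B))     # ~1.0: offset and scale ignored
print(normalised_correlation(A, B))      # high, but degraded by the offset
```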

2.2 Invariant image representation

The correlation techniques above can be made more robust by applying them to local image regions rather than to whole images. We can construct a global similarity measure by summing the correlation over all local regions and the registration task is then defined as an optimisation problem. However, this global similarity measure is still not sufficiently robust to be applied directly to multimodal imagery, where the wavebands are often uncorrelated (both globally and locally) and may exhibit disjoint features. Correlation techniques can be used though if we choose an image representation that emphasises common spatial structures (lines, corners, contours, etc) and suppresses low spatial frequency features which tend to be more modality-dependent.

Laplacian filtering is an obvious approach for extracting high frequency spatial detail. In practice though, this filter tends to extract 'too much' information for registration due to its rotational invariance. Irani and Anandan suggest the more selective approach of applying directional derivative filters to the raw imagery and then optimising the transformation coefficients for vertical, horizontal and diagonal alignment. To improve performance the filtered images are also squared to make the representation more invariant to contrast reversals. These energy images tend to be well-correlated (at least locally) when the original images are well-aligned and hence are a good invariant image representation. In terms of implementation we propose a variant of the method described above; instead of transforming the energy images and then testing alignment it is better to transform the raw images and then calculate their energy and alignment. This ensures we are comparing like-with-like.

Figure 5 demonstrates the benefit of this approach when testing for large image distortions.

Figure 5. Trivial example illustrating why image warping should be applied before the energy calculation when testing alignment. In this case a rotation of 90° accurately aligns the source images but only one approach reveals this. (Panels: calculating the (horizontal) energy of A and then warping; calculating the energy of reference image B; warping image A and then calculating its energy.)

Irani and Anandan also propose using image pyramids to aid the registration process. Pyramids are a familiar tool in image fusion for performing multi-scale image analysis [2] and are easily created by successively filtering and down-sampling the source images. A common example is the Gaussian pyramid which takes the name of its filter (Figure 6).

Incorporating a multi-resolution image representation into the registration algorithm greatly simplifies the search for the optimal transform coefficients; crude estimates are obtained by aligning the small images at the base of the pyramid and these are then steadily refined by repeating the alignment process on the higher levels. This approach quickly locates the global minimum of the objective function while avoiding local minima and is thus able to overcome fairly large image distortions.

In practice, the need for coarse-to-fine registration can be greatly reduced by providing a good initial guess for the transform coefficients, e.g. by manually specifying a few tie points and then applying least squares [1]. This level of human guidance is usually acceptable for image fusion applications where the registration process only needs to be performed once for a particular sensor configuration.

Figure 6. A Gaussian image pyramid with 4 levels of resolution.


However, image pyramids are still useful because often the information they contain in their lower levels is just as relevant for assessing alignment as the full size imagery. This is particularly true when aligning noisy imagery or warping a low resolution sensor onto a higher resolution sensor (Figure 7). Thus we propose that instead of performing coarse-to-fine registration a better strategy is to align all the levels in the image pyramid simultaneously.

Figure 7. Coarse-to-fine processing can sometimes lead the registration process away from an optimal solution. In the example above the IR image contains a fixed noise pattern which manifests itself in the first energy image (corresponding to high spatial frequencies) but is suppressed in the lower levels of the pyramid. Consequently a better end result is achieved by considering all resolution levels simultaneously when aligning the images. (Panels: TV sensor image; IR sensor image; vertical energy images, levels 1 and 2.)

3 Algorithm

We formulate our objective function as a sum of squares and then, given an initial guess for the transform coefficients, rapidly search for the local minimum. Provided the initial guess is sufficiently accurate the algorithm will terminate with the optimal set of transform coefficients. An advantage of working with global parametric transforms (as opposed to more advanced models of image distortion, such as optical flow [5][6]) is the relatively low dimensionality of the search space. Typically we must solve for a dozen or so unknown coefficients (Table 1) and when combined with our fast optimisation technique this makes real-time alignment of multimodal imagery a feasible proposition.

Transformation   Number of coefficients
Affine           6
Projective       12
Quadratic        12
Cubic            20

Table 1. Some common parametric transforms and the number of coefficients that the registration algorithm must determine.

The following notation is now introduced: R(x, y) and U(u, v) denote the reference image and the unregistered image respectively. We will map the pixels (x, y) in R onto the pixels (u, v) in U using a global transformation parametrised by the vector p. Our transformed image T is then a function of both position (x, y) and the chosen parameters p,

T(x, y; p) = U[u(x, y; p), v(x, y; p)] \qquad (7)

When expressed in this way it is clear that in practical terms the backward mapping from the reference image onto the unregistered image is more useful for constructing the transformed image than the forward mapping, even if this is a little counter-intuitive.
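An illustrative backward-warp sketch (ours, not the paper's code): generate the reference pixel grid, map every (x, y) through the chosen transform, and interpolate the unregistered image at the resulting positions. We assume u indexes columns and v indexes rows; `transform` is any callable such as the polynomial sketch in section 1.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def backward_warp(U, transform, shape, p):
    """Construct T(x, y; p) = U[u(x, y; p), v(x, y; p)] as in equation (7).

    Sampling U at the mapped positions (instead of scattering U's pixels
    forward) guarantees every output pixel is defined exactly once."""
    y, x = np.mgrid[0:shape[0], 0:shape[1]].astype(float)  # reference grid
    u, v = transform(x, y, p)
    # map_coordinates expects (row, col) coordinates: rows = v, cols = u.
    return map_coordinates(U, [v, u], order=1, mode='nearest')
```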

Our aim is to find the parameter vector p that gives the best possible spatial alignment between the images T and R. In section 2.2 it was proposed that alignment should be tested not just on the full size images but also on their multi-scale decompositions. This can be achieved by computing the Gaussian pyramids of T and R and then summing the similarity measures across the different scales. Unfortunately, though, this leads to an objective function with a complex dependency on p which makes optimisation difficult. A much simpler strategy is to calculate the Gaussian pyramids of the original images U and R and then appropriately transform the unregistered images at each level. Thus our method begins with the construction of the pyramid {U_1, U_2, ..., U_L} defined iteratively by

U_l = \begin{cases} U & \text{for } l = 1 \\ \mathrm{REDUCE}(U_{l-1}) & \text{for } l = 2, 3, \dots, L \end{cases} \qquad (8)

where REDUCE denotes the filter/decimate operation [2].
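A sketch of REDUCE and of the pyramid construction (8). The paper does not specify the exact filter; the separable 5-tap [1 4 6 4 1]/16 kernel assumed below is a common choice for Gaussian pyramids.

```python
import numpy as np
from scipy.ndimage import convolve1d

KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0  # assumed 5-tap filter

def reduce_level(image):
    # Low-pass filter separably, then decimate by two in each direction.
    smoothed = convolve1d(convolve1d(image, KERNEL, axis=0), KERNEL, axis=1)
    return smoothed[::2, ::2]

def gaussian_pyramid(U, levels):
    # Equation (8): U_1 = U, U_l = REDUCE(U_{l-1}) for l = 2, ..., L.
    pyramid = [U]
    for _ in range(levels - 1):
        pyramid.append(reduce_level(pyramid[-1]))
    return pyramid
```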

Each image in this pyramid is then warped by applying a rescaled version of the original transformation

T_l(x, y; p) = U_l\left( u[x, y; g_l(p)], \; v[x, y; g_l(p)] \right) \qquad (9)

where the function g_l describes the effect of the image down-sampling that takes place between the first and the l-th levels of the pyramid on the original transformation parameters. In most cases g_l is easily determined and takes a fairly simple form.
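As a concrete illustration (our own, not worked through in the paper): if REDUCE halves the resolution at every level then level-l coordinates relate to full-size coordinates by x_l = x / 2^{l-1}, y_l = y / 2^{l-1}. For an affine transform u(x, y) = a_{00} + a_{10} x + a_{01} y, substituting x = 2^{l-1} x_l, y = 2^{l-1} y_l and dividing through by 2^{l-1} gives

u_l(x_l, y_l) = \frac{a_{00}}{2^{l-1}} + a_{10} x_l + a_{01} y_l

so here g_l simply divides the translation coefficients a_{00} and b_{00} by 2^{l-1} and leaves the linear coefficients unchanged.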


Following the construction of the transformed pyramid {T_1, T_2, ..., T_L} the next step is to convert to an 'invariant' image representation. For this we introduce the directional derivative operator \nabla_k defined by

\nabla_k = \alpha_k \frac{\partial}{\partial x} + \beta_k \frac{\partial}{\partial y} \qquad (10)

where \alpha_k and \beta_k are constants satisfying \alpha_k^2 + \beta_k^2 = 1. We use the four directional derivatives defined in Table 2 to assess the horizontal, vertical and diagonal alignment of each transformed image against the corresponding reference image. In practice a numerical approximation to the \nabla_k operator must be employed due to the discrete nature of the images. This is implemented as a convolution operation with a small discrete kernel (e.g. a Sobel filter). To make the representation invariant to contrast reversals we will also square the filtered images (as discussed in section 2.2).

k   Description                  \alpha_k    \beta_k
1   Horizontal derivative        1           0
2   Vertical derivative          0           1
3   First diagonal derivative    1/√2        1/√2
4   Second diagonal derivative   1/√2        -1/√2

Table 2. Four directional derivatives and their corresponding values of \alpha_k and \beta_k in definition (10).
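An illustrative sketch of the squared-derivative 'energy' images: Sobel kernels approximate the axis-aligned derivatives, as the text suggests, and the diagonals follow definition (10) with the Table 2 values of \alpha_k and \beta_k. The helper name is our own.

```python
import numpy as np
from scipy.ndimage import sobel

def energy_images(image):
    dx = sobel(image, axis=1)      # k = 1: horizontal derivative
    dy = sobel(image, axis=0)      # k = 2: vertical derivative
    d1 = (dx + dy) / np.sqrt(2)    # k = 3: first diagonal derivative
    d2 = (dx - dy) / np.sqrt(2)    # k = 4: second diagonal derivative
    # Squaring makes the representation invariant to contrast reversals.
    return [d ** 2 for d in (dx, dy, d1, d2)]
```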

At each level of resolution we assess local similarity between the corresponding squared derivative images to determine the degree of alignment. Clearly there are many possible ways of dividing the images into local regions; partitioning into disjoint blocks or sliding neighbourhood processing are two common approaches. Whatever method is chosen let us denote the family of local regions for the l-th resolution level as

\Omega_l = \{ \omega_1, \omega_2, \dots, \omega_{M_l} \} \qquad (11)

where \omega_m is the set of all pixels (x, y) that make up the m-th local region and M_l is the number of local regions at this resolution. Defining an image similarity measure then allows the formulation of our global objective function which is minimised over all possible choices of p. Solving using our fast optimisation technique tends to bring rapid convergence and usually only a few iterations are necessary to achieve very good image alignment.
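Pulling the sketches above together, a global objective might look as follows. This is an illustrative stand-in only: it reuses the hypothetical helpers backward_warp, energy_images and statistical_correlation from earlier, `rescale_params` stands in for g_l (e.g. the affine rule derived above), and a generic scipy minimiser replaces the authors' fast sum-of-squares optimiser.

```python
from scipy.optimize import minimize

def objective(p, R_pyr, U_pyr, transform, regions, rescale_params):
    """Sum over pyramid levels, derivative directions and local regions of
    (1 - statistical correlation)^2: zero for perfect local correlation."""
    total = 0.0
    for l, (R_l, U_l) in enumerate(zip(R_pyr, U_pyr), start=1):
        # Warp the raw level first, then take energies (section 2.2 order).
        T_l = backward_warp(U_l, transform, R_l.shape, rescale_params(p, l))
        for E_T, E_R in zip(energy_images(T_l), energy_images(R_l)):
            for region in regions[l]:  # e.g. slices defining disjoint blocks
                r = statistical_correlation(E_T[region], E_R[region])
                total += (1.0 - r) ** 2
    return total

# Given a crude initial guess p0 (e.g. from hand-picked tie points):
# p_opt = minimize(objective, p0,
#                  args=(R_pyr, U_pyr, transform, regions, rescale_params)).x
```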

4 Results

The algorithm outlined above has been successfully applied to a variety of multimodal imagery with excellent results. In particular, the algorithm has been used to align the TV and IR sensors in a real-time system developed by Waterfall Solutions and QinetiQ which utilises a proprietary high-performance multi-resolution fusion technique. A cubic transform was used to transform the IR image onto the TV image (both measuring 768x576 pixels), compensating for the significant difference in scale and the barrel distortion effect between the two. Iterations were initialised by specifying a few tie points by hand and calculating a crude affine transformation. The results are shown in Figure 8. Accurate image alignment can be observed by forming the composite of the transformed IR and the reference TV image (Figure 9).

Figure 8. Registration of TV and IR images captured by an image fusion system. A grid has been overlaid on the IR image to help visualise the transform. (Panels: TV image; IR image with overlaid grid; registered IR image.)

In terms of image fusion the excellent performance of the registration algorithm can be seen in the high quality image generated by our proprietary fusion algorithm (Figure 10). The relationship between registration accuracy and fused image quality is discussed further in the following section.


Figure 9. Image composite formed by selecting alternate square blocks from the TV image and the registered IR image. Accurate alignment has been achieved across all image regions.

Figure 10. Accurate image registration results in a crisp, sharp fused image.

5 Real-Time Fusion Systems

In many image fusion systems the sensors are synchronised and rigidly mounted in place and hence a fixed transform is usually considered sufficient to align the source imagery. After boresighting the sensors as closely as possible in the laboratory an appropriate set of imagery must be captured to enable accurate registration to be performed. This should consist of two (or more) good quality images containing a large number of common features, distributed throughout the image, that can aid the registration process. The algorithm described in the preceding sections is ideal for determining an accurate set of transform coefficients given such imagery. These coefficients can then be hard-coded into the fusion system.

In practice there are a number of factors that can cause the source images in a fusion system to become mis-aligned. If the system is mounted on a moving platform (e.g. an aircraft) then operational phenomena such as accelerations, vibrations and even temperature changes can cause the sensors to gradually 'drift' from their original alignment over time. On a much shorter time scale sensor vibration can cause individual image frames to become mis-aligned. In both cases the extent of the mis-alignment is often fairly small but nevertheless can lead to a noticeable reduction in fused image quality (Figure 11). A more significant issue though in many fusion systems is parallax. It is important to appreciate that the initial registration coefficients will only remain valid if the distance between the sensors and the objects in the scene does not change from the initial setup. This requirement is clearly unrealistic and often violated when a system is put into service, leading to large mis-alignments and poor quality fused imagery.

Figure 11. The images above show the fused image before (left) and after (right) introducing a mis-alignment of 3 pixels in the x and y directions.

In view of the limitations described above it is highly desirable in many real-time fusion systems to continuously update the transform coefficients over time rather than use a fixed transformation. This helps ensure that fused image quality is maintained at all times (although the extent to which the system can compensate for parallax will depend on the underlying transform: the polynomial transform is a relatively poor model of this effect; the projective transform is a better choice when the scene is planar - see section 1). As it stands the algorithm described in section 3 is too complex to execute as part of a real-time system (at least with currently affordable computing technology) but such an implementation does become feasible after a straightforward modification of the algorithm; we propose to perform one iteration of the optimisation method for each time-step in the real-time system.

The modified method begins as before: after acquiring the latest source images they are decomposed into Gaussian pyramids and a new set of candidate transform coefficients is calculated and accepted/rejected according to the observed convergence as before. At this point though, instead of looping around for another attempt at refining the coefficients, we use the current best guess to register the images and then the fusion algorithm is applied. The process is repeated at the following time-step to further refine the transform coefficients for the next set of acquired images. The rapid convergence of the optimisation method ensures that given an accurate initial guess at the transform coefficients our latest estimate is never far from the optimal solution. The extended registration algorithm has been applied to short image sequences with promising results. In terms of implementation, two points are observed (a sketch of the per-frame loop is given after the list):

* It is possible for one or more of the source sensors to degrade over time (e.g. due to noise or saturation) or for the different wavebands to exhibit a lack of common features. Thus for a robust implementation some restraint must be placed on the registration process to prevent it from going too far 'off-course'. The magnitude of the objective function can be used as the basis for this controlling logic.

* The registration algorithm described above naturally combines with a pyramidal fusion scheme; the image pyramids generated at each time-step may be re-used at the fusion stage, reducing the number of calculations that must be performed there.
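For concreteness, a rough sketch of the per-frame loop described above; `grab_frames`, `step_optimiser` and `fuse` are hypothetical stand-ins for the system's capture, single-iteration coefficient update and fusion routines, and the pyramid/warp helpers are the sketches from section 3.

```python
def realtime_fusion(grab_frames, step_optimiser, fuse, p, levels=4):
    """One optimisation iteration per time-step, as proposed in section 5."""
    while True:
        tv, ir = grab_frames()                    # latest synchronised frames
        tv_pyr = gaussian_pyramid(tv, levels)
        ir_pyr = gaussian_pyramid(ir, levels)
        p_new, converging = step_optimiser(p, tv_pyr, ir_pyr)
        if converging:                            # accept/reject the update
            p = p_new
        ir_reg = backward_warp(ir, polynomial_transform, tv.shape, p)
        # The pyramids built here may be re-used by a pyramidal fusion scheme.
        yield fuse(tv, ir_reg)
```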


6 Discussion

This paper has reported on some findings from recent research conducted by Waterfall Solutions into improved methods for registering multi-modal imagery. A number of issues associated with registering images of different modality have been highlighted, and particular emphasis has been placed on exploitation within image fusion systems for visible and infrared imagery. A new approach for achieving off-line, highly accurate image registration has been discussed and pictorial results provided. Consideration has also been given to the suitability of the algorithm to real-time applications and the very encouraging progress made in this area has been reported. Further research into a number of important aspects of this work continues, including: registration requirements definition; faster methods for improved real-time performance; metrics for establishing registration accuracy achieved; and human factors impact of image registration on fused imagery.

Acknowledgements

The authors would like to thank QinetiQ for the use of their source imagery in this paper.

References

[1] L.G. Brown, A Survey of Image Registration Techniques, ACM Computing Surveys, Vol. 24, No. 4, pp. 325-376, 1992.

[2] M.I. Smith and J.P. Heather, A Review of Image Fusion Technology in 2005, SPIE Defense & Security Symposium, Orlando, Florida, 28 Mar - 1 Apr 2005, Vol. 5782.

[3] U. Dhond and J. Aggarwal, Structure from Stereo: A Review, IEEE Trans. on Systems, Man, and Cybernetics, Vol. 19, pp. 1489-1510, 1989.

[4] M. Irani and P. Anandan, Robust Multi-Sensor Image Alignment, Sixth Int. Conf. on Computer Vision, Bombay, India, 4-7 Jan 1998, pp. 959-966.

[5] J.L. Barron, D.J. Fleet and S.S. Beauchemin, Performance of Optical Flow Techniques, Int. Journal of Comp. Vision, Vol. 12, pp. 43-77, 1994.

[6] R. Szeliski and J. Coughlan, Spline-Based Image Registration, International Journal of Computer Vision, Vol. 22, pp. 199-218, 1997.
