Copyright 1994 IEEE. Published in the Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, April 19-22 1994, Adelaide, Australia.

Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 908-562-3966.

Spanning the Gap between
Motion Estimation and Morphing

Michele Covell Margaret Withgott
Interval Research Corporation
1801 Page Mill Rd, Bldg C
Palo Alto, CA 94304

Abstract

Motion estimation is an important and well-studied method for determining correspondences between images in video processing. In contrast with motion estimation, the use of image warping on image sequences is a new and developing field. To provide image warping effects, morphing algorithms rely on mesh points to align two images. Presently, the selection of these mesh points is a labor-intensive, manual process. The research reported in [Covell93,Covell94] and within this paper automates the selection of mesh points for morphing by recognizing this problem as one of determining image correspondences. Applications include video compression, animation sequence generation and high-quality time dilation of video.

1 Introduction

Determining correspondences in images is a well-studied problem in video processing. Both stereo mapping and motion estimation have received a vast amount of attention from the vision and signal processing communities in the last ten years and now constitute fairly mature technology. Much of this work has centered on motion estimation for video compression. For example, MPEG specifies motion compensation on subblocks of the video frame, in order to increase the temporal correlation between frames and, hence, the compression ratios. At low data rates, this block compensation introduces blocking artifacts and "unnatural" image sequences.

In contrast with motion estimation, the use of image warping in video processing is a new and developing field. The technique of image warping in video, or "morphing", has captured widespread attention through its compelling effects in movies such as Terminator II and videos such as Jackson's Black or White [Beier92]. To produce these effects, morphing algorithms rely on mesh points to define the correspondence points between two images. The morphing process then warps and fades from one image to the other, using the mesh of correspondence points to define the warping function. With well-placed mesh points, this process generates a natural sequence of intermediate images. The selection of these mesh points is now done by hand.
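The warp-and-fade step just described can be sketched as follows. This is a minimal, hypothetical illustration (not the commercial package used in this work): it assumes a dense per-pixel displacement field rather than a sparse mesh, and uses nearest-neighbour sampling for the warp.

```python
import numpy as np

def morph_frame(img_a, img_b, flow_ab, t):
    """One intermediate morph frame at fraction t in [0, 1]: warp both
    images toward the in-between geometry, then cross-fade.
    flow_ab[y, x] holds the (dy, dx) displacement carrying a pixel of
    img_a onto img_b.  Hypothetical helper, nearest-neighbour warping."""
    h, w = img_a.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Pull img_a back by t of the flow, and img_b forward by (1 - t).
    ay = np.clip(np.round(ys - t * flow_ab[..., 0]).astype(int), 0, h - 1)
    ax = np.clip(np.round(xs - t * flow_ab[..., 1]).astype(int), 0, w - 1)
    by = np.clip(np.round(ys + (1 - t) * flow_ab[..., 0]).astype(int), 0, h - 1)
    bx = np.clip(np.round(xs + (1 - t) * flow_ab[..., 1]).astype(int), 0, w - 1)
    warped_a = img_a[ay, ax]
    warped_b = img_b[by, bx]
    # Cross-fade between the two warped images.
    return (1 - t) * warped_a + t * warped_b
```

With well-placed correspondences, the fade happens between geometrically aligned features, which is what makes the intermediate frames look natural rather than ghosted.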

The research reported in [Covell93, Covell94] and within this paper automates the selection of mesh points for morphing. The problem is viewed as one of determining image correspondences, which enables us to apply many of the relatively mature techniques currently used for motion estimation and stereo matching. Automatically generated morphing can, in turn, be used in video compression to avoid blocking artifacts. Other applications include generation of "in-between" frames for animation sequences and for high-quality time-dilated video sequences.

2 Image Matching and Warping

Figure 1 shows a high-level flow graph of the proposed combination of image matching and image warping. The first step in the process is to match the input images, using methods originally developed for motion estimation. This step results in a dense and unreliable field of correspondence estimates. A parallel process takes each image separately and returns for each a "cartoon" of the image. These cartoons provide a measure of the local image contrast, emphasizing those portions that are likely to be visually important. We then combine the image cartoons with the dense field of image matches. This provides a composite cartoon of those locations which were coupled by the image matching and which were tagged as salient in both images separately. This last step provides a large set of correspondence points. To reduce the number of mesh points which must be used by the morphing algorithm and to increase their reliability, line segments are computed in each image based on the evidence provided by the composite cartoon. These line segments and their endpoints act as constraint lines and as mesh points in the final image morphing step. The remainder of this section discusses each of these steps in detail.

Figure 1: Overview of image matching / image warping. A dense field of match vectors is estimated for the input image pair. These vectors and the "cartoon" descriptions of the images are combined. The resulting composite cartoon (including the match vectors) is used to determine line segments in the original images which are constrained to match in the final morphed sequence.

2.1 Dense-field correspondence estimation

The primary difficulty in combining motion estimation techniques with morphing lies in the nature of the image data. Motion estimation typically uses image sequences taken with a high measure of temporal and spatial continuity. In contrast, morphing typically operates on two images or image sequences of distinct objects at disjoint times. To support the expected range of data, the motion estimation algorithm must allow for large, non-rigid deformations and for variations in absolute brightness and color.

To allow for such large, non-rigid image deformations, the matching process should use an image description which separates large-scale translations from the local image deformations. We have chosen to use an octave-scale description in conjunction with a coarse-to-fine correlation-based matching process. Correlation-based comparison is a simple, well-understood method of signal matching, and highly efficient computational methods are known for computing local correlation functions under a sliding window [Covell91].
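A toy sketch of coarse-to-fine correlation matching is given below. It is not the paper's algorithm: it recovers only a single global integer translation, matches raw grayscale rather than bandpass density images, and assumes wraparound borders; the function names are hypothetical.

```python
import numpy as np

def best_shift(a, b, max_shift):
    """Exhaustively score integer shifts of image b against image a with
    normalised correlation; return the (dy, dx) maximising the score."""
    best, best_score = (0, 0), -np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(np.roll(b, dy, axis=0), dx, axis=1)
            num = np.sum(a * shifted)
            den = np.sqrt(np.sum(a * a) * np.sum(shifted * shifted)) + 1e-12
            if num / den > best_score:
                best_score, best = num / den, (dy, dx)
    return best

def coarse_to_fine_shift(a, b, levels=2, max_shift=2):
    """Coarse-to-fine matching: estimate the shift on 2x-decimated
    copies first, double that estimate, then refine with a small
    search at the next finer scale."""
    if levels == 0:
        return best_shift(a, b, max_shift)
    dy, dx = coarse_to_fine_shift(a[::2, ::2], b[::2, ::2],
                                  levels - 1, max_shift)
    dy, dx = 2 * dy, 2 * dx
    # Centre b on the doubled coarse estimate, then refine locally.
    b_centred = np.roll(np.roll(b, dy, axis=0), dx, axis=1)
    rdy, rdx = best_shift(a, b_centred, max_shift)
    return dy + rdy, dx + rdx
```

The point of the recursion is that each level only ever searches a small window, yet the composed estimate can span a translation far larger than any single search range.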

To reduce the sensitivity of the correlation to illumination levels, we use the ratio of the grayscale curvature at each scale to the large-area, local average at that scale. Similar contrast-based image descriptions are suggested by the perceptual theories of [Land71], [Pearson85], and others [Marr82,Grimson84]. This similarity leads to the hope that the image measures will preserve features which are visually salient at each scale and compress other features. Obviously, there is no computational requirement that the image descriptions accurately mirror the human perceptual system and, in the case of an adequately sampled, natural time sequence, there is no reason to think that such a perceptually based description will be the best way to determine correspondences. However, the fact that our images are drawn from distinct scenes or times, along with our criterion of visually "natural" transitions, forces us to seek as much perceptual support as possible. This scale-dependent image description will be referred to as a bandpass density image in the next section.
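One plausible reading of this bandpass density measure, sketched with hypothetical helpers: a difference-of-box bandpass stands in for the grayscale curvature at a given octave, and is divided by a large-area local average so that scaling the illumination leaves the description (nearly) unchanged.

```python
import numpy as np

def box_blur(img, radius):
    """Separable box average of side (2*radius + 1), wraparound borders."""
    out = np.zeros_like(img, dtype=float)
    for d in range(-radius, radius + 1):
        out += np.roll(img, d, axis=0)
    out /= 2 * radius + 1
    tmp = np.zeros_like(out)
    for d in range(-radius, radius + 1):
        tmp += np.roll(out, d, axis=1)
    return tmp / (2 * radius + 1)

def bandpass_density(img, scale):
    """Contrast measure at one octave scale: a difference-of-box
    bandpass (our stand-in for grayscale curvature) divided by a
    large-area local average, which discounts the illumination level."""
    fine = box_blur(img, scale)
    coarse = box_blur(img, 2 * scale)
    bandpass = fine - coarse                       # curvature-like term
    local_mean = box_blur(img, 4 * scale) + 1e-6   # large-area brightness
    return bandpass / local_mean
```

Because both numerator and denominator scale linearly with image brightness, doubling the illumination leaves the density essentially unchanged, which is the property the correlation matcher needs.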

In summary, we compute an image match recursively starting with matching the coarsest-scale density image and proceeding to finer-scale descriptions. Although this certainly is not the only way to reliably estimate motion, we believe it is one of the few ways which will exhibit robust behavior on the different classes of image pairs we need to consider.

2.2 Salient-point location and coincidence marking

The correspondence estimate described in the previous section results in a dense set of matching vectors. Various methods for reducing the set to a sparse mapping of points are suggested by classical model-based vision techniques. For example, edge or corner detection followed by matching generates a sparse field of measurements and "high-confidence" matches. Unfortunately, the same features must appear in both images in a consistent and unambiguous way. These two requirements are unlikely to be met in the image pairs that we must consider.

We advocate generating a dense-field estimate of the match vectors separately from the selection and description of the visually salient points in the images. This set of visually salient points can be selected using a combination of the bandpass density images previously used in the match estimation. First, as mentioned above, these bandpass density images weight those areas which are visually distinctive more heavily than those that are uniform. An important difference between the density descriptions used in matching and those used as a "cartoon" is that the descriptions used in matching are not scalar functions: they have different values at each scale. In contrast, the cartooning process must combine the visual information across scales into a single description. While the best way of recombining information from different scales is most likely highly data dependent, we have chosen to recombine the information using a high-frequency emphasis function proportional to the square root of the frequency.
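The cross-scale recombination might look as follows, under two assumptions of ours: the per-octave density images have already been resampled to a common resolution, and octave k has a centre frequency proportional to 2^-k (k = 0 finest), giving a square-root-of-frequency weight of sqrt(2^-k).

```python
import numpy as np

def cartoon(density_images):
    """Collapse per-octave density images into one scalar 'cartoon'.
    density_images[k] is the contrast at octave k (k = 0 the finest
    scale); its centre frequency is taken as 2**-k, so the
    high-frequency-emphasis weight is sqrt(2**-k)."""
    out = np.zeros_like(density_images[0], dtype=float)
    for k, d in enumerate(density_images):
        out += np.sqrt(2.0 ** -k) * np.abs(d)
    return out
```

The square-root weighting is a compromise: it favours fine detail over coarse structure, but less aggressively than weighting by frequency itself would.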

To summarize the computation to this point, each of the images is transformed by an octave-band bank of filters which compute the ratio between the local grayscale curvature and the large-area average. The outputs from this bank of filters are used to estimate a dense field of matching vectors. This estimation process starts at the coarsest scale and recursively re-estimates the matching vectors at each finer scale, using the previous scale's estimate of the offset vector as a starting point. The outputs from the same bank of contrast filters are weighted by the square root of the frequency and recombined linearly. The matching vectors are used to locate the coincident salient points in the two images. If the saliency of the points in either image is below a minimum, then the match vector associated with those points is dropped from the field of match vectors. Only those match vectors which correspond to salient points in both images are kept.
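The final pruning step of this summary, keeping only match vectors whose endpoints are salient in both images, can be sketched as below (hypothetical names; the saliency threshold is a free parameter not specified here):

```python
import numpy as np

def prune_matches(flow, cartoon_a, cartoon_b, threshold):
    """Keep a match vector only where both of its endpoints are salient:
    cartoon_a at the source pixel, and cartoon_b at the pixel the
    vector maps it to, must both exceed threshold.  Returns a boolean
    keep-mask over the dense flow field."""
    h, w = cartoon_a.shape
    ys, xs = np.mgrid[0:h, 0:w]
    ty = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, h - 1)
    tx = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, w - 1)
    return (cartoon_a > threshold) & (cartoon_b[ty, tx] > threshold)
```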

2.3 Constraint-line location

Even after using the coincidence of salient points to prune the field of matching-vector estimates, the number of remaining correspondences is large and their locations are generally error-filled. To reduce their number, and to increase the reliability of the remaining set, we fit pairs of line segments to the coincident salient points using an extension of the Hough transform. These pairs of line segments, or constraint lines, are then used in morphing as a warping boundary: all the points of one line segment of the pair are warped onto the points of the other segment.

Conceptually, we find the constraint lines by recursively finding the "current best fit" within the image cartoons, marking it, and removing the salient points which support it. Since each constraint line is a pairing between a line segment in the first image and a line segment in the second image, a full search for the best fit would explore an eight-dimensional space. We instead use a suboptimal search of four two-dimensional subspaces: first selecting the containing line in the starting image; then selecting the endpoints of the constraint line on that containing line; and then repeating the process in the second image. While this approach to finding the constraint lines is not optimal, it appears to be reasonable, given our experimental results.
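A minimal stand-in for this "find, mark, remove" loop over a single image's salient points is shown below. It uses a standard Hough transform rather than our extension, and omits the endpoint-selection and cross-image pairing steps; all names are hypothetical.

```python
import numpy as np

def hough_best_line(points, n_theta=64, rho_res=1.0):
    """One Hough vote over (theta, rho): each point (x, y) votes for all
    lines x*cos(theta) + y*sin(theta) = rho through it; the fullest
    accumulator bin is the current best-fit line."""
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    best, best_votes = None, -1
    for th in thetas:
        rho = points[:, 0] * np.cos(th) + points[:, 1] * np.sin(th)
        bins = np.round(rho / rho_res).astype(int)
        vals, counts = np.unique(bins, return_counts=True)
        i = np.argmax(counts)
        if counts[i] > best_votes:
            best_votes, best = counts[i], (th, vals[i] * rho_res)
    return best

def greedy_lines(points, n_lines, tol=1.0):
    """Recursively take the current best-fit line and remove the salient
    points supporting it (those within tol of the line), mirroring the
    find-mark-remove loop described in the text."""
    lines, pts = [], points.astype(float)
    for _ in range(n_lines):
        if len(pts) == 0:
            break
        th, rho = hough_best_line(pts)
        lines.append((th, rho))
        d = np.abs(pts[:, 0] * np.cos(th) + pts[:, 1] * np.sin(th) - rho)
        pts = pts[d > tol]
    return lines
```

Removing each line's supporters before re-voting is what keeps one dominant edge from being rediscovered repeatedly instead of yielding the next-strongest structure.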

2.4 Image morphing

The work described here focused on a problem known in graphics applications as geometry control, a subproblem of morphing. Adjusting transparency and velocity contributes to the "fading" effect that viewers are now accustomed to [Wolberg90]. We used a commercial package [Gryphon92] to achieve this effect.

3 Results

To span the range in data between classical motion estimation and morphing, we consider three classes of image pairs. For each of these three classes, we show:

1. the automatically located salient points ("cartoons") and the constraint lines and mesh points derived from them;
2. the morphed sequence generated using those mesh points;
3. a direct warping of the reference image onto the target, using the original dense field of matching vectors.

The third item in this set (namely, the direct warping using the original dense field of matching vectors) is included to illustrate the "unnaturalness" and variability of the match vectors.

The first class of image pairs we consider consists of frames taken in rapid succession, such as within a single shot. This class of image pairs is used to study both the automatic generation of mesh points on a well-characterized correspondence field and the perceptual quality of the regenerated frames. Thus, the focus here is on image warping as a reconstruction method within a video compression application: the "measure of success" is the perceptual quality of the reconstructed image as compared to a reconstruction based solely on more classical block motion methods, using similar bit rates. Figure 2 shows the results of motion analysis across six frames (at 10 frames/second) and of image warping based on the motion estimate given by the concatenation of the five frame-to-frame estimates. Figure 3 shows the error images: that is, the absolute difference (multiplied by ten) between the original frames which were not transmitted and the frames which were resynthesized using automatic image warping.
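The error images of Figure 3 follow a simple formula; a sketch of how such a display might be computed (the clip to the displayable 8-bit range is our assumption):

```python
import numpy as np

def error_image(original, resynthesized, gain=10):
    """Amplified error display: gain times the absolute difference
    between the untransmitted frame and its warped reconstruction,
    clipped to the 8-bit displayable range."""
    diff = gain * np.abs(original.astype(float) - resynthesized.astype(float))
    return np.clip(diff, 0, 255).astype(np.uint8)
```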


Original images: reference and target.

Salient-point ("cartoon") images: reference and target

Constraint lines and mesh points: reference and target.

Direct warping: from reference to target

Results of morphing: 20%, 40%, 60%, and 80%
Figure 2: Image matching/warping within a video stream. This figure shows the analysis and re-synthesis of video frames within a shot. (Larger versions of all these images are linked to the small images shown here.)

Figure 3: Error images. This figure shows ten times the absolute difference between the original, untransmitted frames and the frames which were resynthesized using automatic image warping. (Larger versions of all these images are linked to the small images shown here.)

Related work has shown that the prediction error of a coding scheme using image warping is usually less than that of classical block motion methods [Nieweglowski93]. In that work, a simple edge detector was used for choosing the mesh points, following previous work on control grid interpolation [Sullivan91]. Even though mesh-point placement was error-prone, it was observed that the elimination of blocking artifacts provided by the warping technique resulted in improved perceptual quality in the test sequences.

The second class of image pairs which we consider consists of two images of the same object, shot with significant non-rigid translation and deformation between them (Figure 4). This type of image pair is less widely used in motion estimation, due to the difficulty of using local measures to determine correspondence. This second class of image pairs is used to examine morphing as a component of video compression between frames that are widely separated in time. In addition, this type of image pairing is of interest in automatically generating intermediate frames in animation sequences.


Original images: reference and target.

Salient-point ("cartoon") images: reference and target

Constraint lines and mesh points: reference and target.

Direct warping: from reference to target

Results of morphing: 25%, 43%, 57%, and 75%
Figure 4: Image matching/warping at non-contiguous times. This figure summarizes the results of automatic image matching of a single face at distinct times. (Larger versions of all these images are linked to the small images shown here.)

The final class of image pairs considered consists of two images of distinct but "similar" objects, such as portrait views of different people (Figure 5). Such image pairs are the most difficult to match. This class of image pairs is used to study the automation of mesh-point placement in morphing sequences between distinct objects.


Original images: reference and target.

Salient-point ("cartoon") images: reference and target

Constraint lines and mesh points: reference and target.

Direct warping: from reference to target

Results of morphing: 25%, 43%, 57%, and 75%
Figure 5: Image matching/warping on distinct objects. This figure shows the results of automatic image matching between two different faces. (Larger versions of all these images are linked to the small images shown here.)

4 Conclusions

The importance of this research lies in bridging the gap between motion estimation and image warping. Both techniques depend on the correspondence between points in image pairs: motion estimation in automating the matching process, and image warping in recreating the intermediate images which lie between the two images of the pair. Both fields of study can benefit from this coupling of techniques. We have shown that the labor-intensive process of choosing mesh points for morphing can be eased by using motion estimation, and that a natural transformation between images can be created by morphing with these points.

References

[Beier92] T. Beier, S. Neely. "Feature-Based Image Metamorphosis," Proc. SIGGRAPH'92, pp. 35-42, 1992.

[Covell91] M. Covell. "A New, Efficient Structure for the Short-Time Fourier Transform," Proc. ICASSP'91, pp. 2041-2044, 1991.

[Covell93] M. Covell. "Combining morphing and motion estimation," presentation at Interval Research Corporation, 8/18/93.

[Covell94] M. Covell, M. Withgott. "Spanning the gap between motion estimation and morphing," Proc. ICASSP'94, 1994.

[Grimson84] E. Grimson. Computational Experiments with a Feature Based Stereo Algorithm, MIT AI Memo 762, 1984.

[Gryphon92] Gryphon Software, Morph 1.1, San Diego, CA. 1992.

[Land71] E. Land, J. McCann. "Lightness and retinex theory," J. Opt. Soc. Am., vol. 61, pp. 1-11, 1971.

[Marr82] D. Marr. Vision, W.H. Freeman and Company, 1982.

[Nieweglowski93] J. Nieweglowski, T. Campbell, P. Haavisto. "A Novel Video Coding Scheme Based on Temporal Prediction Using Digital Image Warping," IEEE Trans. Consumer Electronics, vol. 39, pp. 141-150, 1993.

[Pearson85] D. Pearson, J. Robinson. "Visual Communication at Very Low Data Rates," Proc. IEEE, vol. 73, pp. 795-812, 1985.

[Sullivan91] G. Sullivan, R. Baker. "Motion compensation for video compression using control grid interpolation," Proc. ICASSP'91, pp. 2713-2716, 1991.

[Wolberg90] G. Wolberg. Digital Image Warping, IEEE Computer Society Press, 1990.

