Senior Staff Research Scientist
Bio I received my BS in electrical engineering from the University of Michigan and my MS and PhD from MIT in signal processing. I joined SRI International, in the area of active acoustic-noise control, and then Interval Research Corporation, where my research covered a wide range of topics in audio, image, and video processing. In 2000, I joined YesVideo and worked on faster-than-real-time video analysis. I moved to the Mobile Streaming Media group in HP Labs, as a key contributor in streaming-video services in 3G telephony networks. This work is listed as one of the top-40 accomplishments from HP Labs' 40-year history. I moved to Google, in the research group, in 2005, where I have focused on large-scale audio and video fingerprinting, identification, and retrieval. In addition to the publications described below, I have 23 granted US patents and 37 published US-patent applications, along with associated PCT filings. |
| ||||||||||||||||||||||||||||||||||||||||||||||
Publications
Locality Sensitive Hashing (LSH) is widely used for efficient retrieval of candidate matches in very large audio, video, and image systems. However, extremely large reference databases necessitate a guaranteed limit on the memory used by the table lookup itself, no matter how the entries crowd different parts of the signature space, a guarantee that LSH does not give. In this paper, we provide such guaranteed limits, primarily through the design of the LSH bands. When combined with data-adaptive bin splitting (needed on only 0.04% of the occupied bins), this approach provides the required guarantee on memory usage. At the same time, it avoids the reduced recall that more extensive use of bin splitting would give.
Locality-sensitive hashing (LSH) is used in many applications that require retrieval and evaluation of approximate nearest neighbors. The uniformity of the signature distributions has a strong effect on the number of candidates that must be considered. The effects of this uneven distribution are squared, since retrieval probes are taken from the same uneven distribution that resulted in uneven occupancy in the table itself. In this study, we introduce the idea of dimension grouping to intelligently design the hash functions that are used to index the LSH tables. This reduces the inefficiencies introduced by hashing real-world data that is noisy, structured, and most importantly is not independently or identically distributed. Through large-scale tests, we find that permutation-grouping dramatically increases the efficiency of the overall retrieval system by lowering the number of low-probability candidates that must be examined by 30-50%.
We want to be able to efficiently find similar items in large, high-dimensional datasets, for things like music, image, and video retrieval. Scaling lookups for any large corpus tends to lead to tree or hash-based approaches. Correctly structuring such trees or hash functions is made more difficult than in other large-corpus domains by an imprecise definition of similarity. In this work, we found a method to learn a similarity function from only weakly labeled positive examples. We used this similarity function as the basis of a hash function to severely constrain the number of points considered for each lookup. In testing on a large real-world audio dataset, only a tiny fraction of the points (~0.27%) were ever considered for each lookup. To increase efficiency, we did no comparisons in the original high-dimensional space of points, yet still achieve nearly 99% accuracy on 5-second-long probes.
This paper extends the previous work on Waveprint (see below) to handle open-set detection and identification. To do this, we re-examine the best-ranked matches from Waveprint using temporal-ordering-based processing. The resulting system has excellent detection capabilities for small snippets of audio that have been degraded in various ways, including competing noise, poor recording quality, and cell-phone playback. The system is more accurate than the previous state-of-the-art system while being more efficient and flexible in memory usage and computation.
In this paper, we introduce Waveprint, a novel method for audio identification. Waveprint uses a combination of computer-vision techniques and large-scale-data-stream processing algorithms to create compact fingerprints of audio data that can be efficiently matched. The resulting system has excellent identification capabilities for small snippets of audio that have been degraded in a variety of manners, including competing noise, poor recording quality, and cell-phone playback. We explicitly measure the tradeoffs between performance, memory usage, and computation through extensive experimentation.
We propose a method for detecting and precisely segmenting repeated sections of broadcast streams. The detection stage starts from acoustic matches and validates the hypothesized matches using the visual channel. The precise segmentation uses fine-grain acoustic match profiles to determine start and end-points. The approach is both efficient and robust to broadcast noise and differences in broadcaster signals. Our final results is nearly perfect, with better than 99% precision, at a recall rate of 95% for repeated advertisements.
We describe mass personalization, a framework for combining mass media with a highly personalized Web-based experience. We introduce four applications for mass personalization: personalized content layers, ad hoc social communities, real-time popularity ratings and virtual media library services. Using the ambient audio originating from the television, the four applications are available with no more effort than simple television channel surfing. Our audio identification system does not use dedicated interactive TV hardware and does not compromise the user's privacy. Feasibility tests of the proposed applications are provided both with controlled conversational interference and with "living-room" evaluations.
Streaming video to mobile devices like cellular phones is an important emerging application. However, mobile thin clients and telephony-infrastructure constraints limit the capabilities and interfaces available to the end user. This paper describes our design of an end-to-end interactive video telephony service. The challenges of providing interactive services is to remain compliant with the telco signaling and data protocols, without requiring software downloads onto the handset, while providing responsive, interactive controls. Our service provides VCR-like control of telephony video, maintains audio-video synchronization, and respects the video frame and bit-rate contracts of the telephony channel. This research was completed at HP Labs.
This paper describes a mathematical approach to deriving predictive models of streaming-server saturation (saturating/non-saturating) through hidden-variable estimation. We use a 240-dimensional measurement vector to estimate a low-dimensional vector of continuous valued hidden variables, with each hidden-variable dimension corresponding to a "discovered" resource on which the streaming server depends. The discovery process uses labelled calibration data, along with a POCS-like alternation of constraints, to create the best (in a total-least squares sense) predictors for server saturation, both under the current (unknown) set of client sessions and under some (proposed) increase in the client sessions. These models allow us to make admission-control decisions as well as to indicate which types of additional clients the streaming server can currently handle. (A description of the data-collection regime is available here.. A longer version of these two papers is available here.) This research was completed at HP Labs.
Methods for automatically managing the performance of computing services must estimate a performance model of that service. This paper explores properties that are necessary for performance model estimation of black-box computer systems when used together with adaptive feedback loops. It shows that the standard method of least-squares estimation often gives rise to models that make the control loop perform the opposite action of what is desired. This produces large oscillations and bad tracking performance. The paper evaluates what combination of input and output data provides models with the best properties for the control loop. Plus, it proposes three extensions to the controller that makes it perform well, even when the model estimated would have degraded performance.
Our proposed techniques are evaluated with an adaptive controller that provides latency targets for workloads on black-box computer services under a variety of conditions. The techniques are evaluated on two systems: a three-tier ecommerce site and a web server. Experimental results show that our best estimation approach improves the ability of the controller to meet the latency goals significantly. Previously oscillating workload latencies are with our techniques smooth around the latency targets. This research was completed at HP Labs.
This paper presents Media Services Architecture (MSA). MSA is a flexible, general architecture for requesting, configuring, and running services that operate on streaming audio and video as it flows through the network. MSA decomposes requested media services into modular processing components that may be distributed to servers throughout the network and which intercommunicate via standard streaming protocols. Use of standard protocols also affords seamless inter-operability between MSA and media content delivery networks. MSA manages media services by monitoring the networked servers and assigning service components to them in a manner that uses available computational and network resources efficiently. We describe some implemented example services to illustrate the concepts and benefits of the architecture. This research was completed at HP Labs.
Current mobile devices and wireless access allows delivery of video streams to a wide range of cellular devices. The wide range and variability of the network, processor, and display conditions on these devices will effectively require streaming video to be dynamically tailored to each client's changing constraints. Real-time compressed-domain video transcoding allows each live streaming session to be tailored to these changing environment in a practical and affordable manner. However, even with new advances in the video-transcoding efficiency, the computational, bandwidth, and scheduling requirements of live video processing results in the need for managed placement of these tasks, for best used of the distributed resources available with the network. In this paper, we discuss our approach to service-location management (SLM). We compare the performance alternative implementations of SLM resource monitoring. Finally, we present our conclusions on which of these alternate implementations is both most reliable and most extensible to the service of large numbers of mobile client requests. This research was completed at HP Labs.
This paper presents a novel, real-time, minimal-latency technique for dissolve detection which handles the widely varying camera techniques, expertise, and overall video quality seen in amateur, semi-professional, and professional video footage. We achieve 88% recall and 93% precision for dissolve detection. In contrast, on the same data set, at a similar recall rate (87%), DCD has more than 3 times the number of false positives, giving a precision of only 81% for dissolve detection. This research was completed at YesVideo, Inc.
This paper describes techniques to change the playback speed of MPEG-compressed audio, without first decompressing the audio file. There are two primary contributions in this paper. 1) We describe three techniques to perform time-scale modification in the maximally decimated domain. 2) We show how to infer the output of the auditory masking model on the new audio stream, using the information in the original file. This new FastMPEG algorithm is more than an order of magnitude more efficient then decompressing the audio, performing time-scale modification in the conventional time-domain, and then recompressing. Samples of our results can be found here . This research was completed at Interval Research Corporation.
This paper defines a local image transform based on cumulative similarity measures and shows it to enable efficient correspondence and tracking near occluding boundaries. Unlike traditional methods, this transform allows correspondences to be found when the only contrast present is the occluding boundary itself and when the sign of contrast along the boundary is possibly reversed. The transform is based on the idea of a cumulative similarity measure which characterizes the shape of local image homogeneity; both the value of an image at a particular point and the shape of the region with locally similar and connected values is captured. This representation is insensitive to structure beyond an occluding boundary but is sensitive to the shape of the boundary itself, which is often an important cue. We show results comparing this method to traditional least-squares and robust correspondence matching. This research was completed at Interval Research Corporation.
This paper develops an optimal linear algorithm that finds the degree of synchronization between the audio and image recordings of a human speaker. Using canonical correlation, it finds the best direction to combine all the audio and image data, projecting them onto a single axis. FaceSync uses Pearson's correlation to measure the degree of synchronization between the audio and image data. We derive the optimal linear transform to combine the audio and visual information and describe an implementation that avoids the numerical problems caused by computing the correlation matrices. This research was completed at Interval Research Corporation.
This paper explores several approaches for articulated-pose estimation, assuming that video-rate depth information is available, from either stereo cameras or other sensors. We use these depth measurements in the traditional linear brightness constraint equation, as well as in a depth constraint equation. To capture the joint constraints, we combine the brightness and depth constraints with twist mathematics. We address several important issues in the formation of the constraint equations, including updating the body rotation matrix without using a first-order matrix approximation and removing the coupling between the rotation and translation updates. The resulting constraint equations are linear on a modified parameter set. After solving these linear constraints, there is a single closed-form non-linear transformation to return the updates to the original pose parameters. We show results for tracking body pose in oblique views of synthetic walking sequences and in moving-camera views of synthetic jumping-jack sequences. We also show results for tracking body pose in side views of a real walking sequence. This research was completed at Interval Research Corporation.
Dynamic countours, or snakes, provide an effective method for tracking complex moving objects for segmentation and recognition tasks, but have difficulty tracking occluding boundaries on cluttered backgrounds. To compensate for this shortcoming, dynamic contours often rely on detailed object-shape or -motion models to distinguish between the boundary of the tracked object and other boundaries in the background. In this paper, we present a complementary approach to detailed object models: We improve the discriminative power of the local image measurements that drive the tracking process. We describe a new robust external-energy term for dynamic contours that can track occluding boundaries without detailed object models. We show how our image model improves tracking in cluttered scenes, and describe how a fine-grained image-segmentation mask is created directly from the local image measurements used for tracking. This research was completed at Interval Research Corporation.
Speech is one of the most common and richest methods that people use to communicate with one another. Our facility with this communication form makes speech a good interface for communicating with or via computers. At the same time, our familarity with speech makes it difficult to generate synthetic but natural-sounding speech and synthetic but natural-looking lip-synced faces. One way to reduce the apparent unnaturalness of synthetic audible and visual speech is to modify natural (human-produced) speech. This approach relies on examples of natural speech and on simple models of how to take those examples apart and to put them back together to create new utterances.
We discuss two such techniques in depth. The first technique, Mach1, changes the overall timing of an utterance, with little loss in comprehensibility and with no change in the wording of or emphasis within what was said or in the identity of the voice. This ability to speed up (or slow down) speech will make speech a more malleable channel of communication. It gives the listener control over the amount of time that she spends listening to a given oration, even if the presentation of that material is prerecorded. The second technique, Video Rewrite, synthesizes images of faces, lip synced to a given utterance. This tool could be useful for reducing the data rate for video conferencing [Rao98], as well as for providing photorealistic avatars. This research was completed at Interval Research Corporation.
This paper describes Mach1 , a new approach to nonuniform time compression for speech. Mach1 was designed to mimic the natural timing of fast speech. At identical overall compression rates, listener comprehension for Mach1-compressed speech increased between 5 and 31 percentage points over that for linearly compressed speech, and response times dropped by 15%. For rates between 2.5 and 4.2 times real time, there was no significant comprehension loss with increasing Mach1 compression rates. In A-B preference tests, Mach1-compressed speech was chosen 95% of the time. This paper describes the Mach1 technique and our listener-test results. Audio examples are provided on the web page referenced above. This research was completed at Interval Research Corporation.
Video Rewrite uses existing footage to create automatically new video of a person mouthing words that she did not speak in the original footage. This technique is useful in movie dubbing, for example, where the movie sequence can be modified to sync the actors' lip motions to the new soundtrack.
Video Rewrite uses computer-vision techniques to track points on the speaker's mouth in the training footage, and morphing techniques to combine these mouth gestures into the final video sequence. The new video combines the dynamics of the original actor's articulations with the mannerisms and setting dictated by the background footage. Video Rewrite is the first facial-animation system to automate all the labeling and assembly tasks required to resync existing footage to a new soundtrack. This research was completed at Interval Research Corporation.
Eigen-points estimates the image-plane locations of fiduciary points on an objects. By estimating multiple locations simultaneously, eigen-points exploits the inter-dependence between these locations. This is done by associating neighboring, inter-dependent control-points with a model of the local appearance. The model of local appearance is used to find the feature in new unlabeled images. Control-point locations are then estimated from the appearance of this feature in the unlabeled image. The estimation is done using an affine manifold model of the coupling between the local appearance and the local shape.
Eigen-points uses models aimed specifically at recovering shape from image appearance. The estimation equations are solved non-iteratively, in a way that accounts for noise in the training data and the unlabeled images and that accounts for uncertainty in the distribution and dependencies within these noise sources. This research was completed at Interval Research Corporation.
This paper describes techniques to automatically morph from one sound to another. Audio morphing is accomplished by representing the sound in a multi-dimensional space that is warped or modified to produce a desired result. The multi-dimensional space encodes the spectral shape and pitch on orthogonal axes. After matching components of the sound, a morph smoothly interpolates the amplitudes to describe a new sound in the same perceptual space. Finally, the representation is inverted to produce a sound. This paper describes representations for morphing, techniques for matching, and algorithms for interpolating and morphing each sound component. Spectrographic images of a complete morph are shown at the end of the paper. This research was completed at Interval Research Corporation.
These papers address the problem of matching distinct but related objects across images, for applications such as view-based model capture, virtual presence, and morphing special effects. A new approach to match estimation is presented for estimating correspondances between distinct objects. This approach minimizes the effects of lighting variations and non-rigid deformations. A sparse set of features are used for alignment, followed by image-plane warping, or morphing, based on these constraints. This research was completed at Interval Research Corporation.
Motion estimation is an important and well-studied method for determining correspondences between images in video processing. In contrast with motion estimation, the use of image warping on image sequences is a new and developing field. To provide image warping effects, morphing algorithms rely on mesh points to align two images. The selection of these mesh points is a labor-intensive, manual process. This paper presents one approach to automating the selection of mesh points for morphing by recognizing this problem as one of determining image correspondences. Applications include video compression, animation sequence generation and high-quality time dilation of video. This research was completed at Interval Research Corporation.
Traditional signal processing by computer has relied on numerical methods. This chapter explores how symbolic representation and manipulation methods may be incorporated in the numerical processing of signals and how these methods may be used in the design and analysis of signal processing systems. Using symbolic manipulation, a high-level compiler was developed to exploit the underlying mathematics of the DSP systems. This compiler uncovered a new, unconventional approach to implementing a modulated filter bank. The details of this newly discovered algorithm were published independently (see the next paper). This chapter also describes one approach to recording and propagating computational cost within a DSP system. This research was completed at MIT, with funding from the National Science Foundation Fellowship Program, from the Advanced Research Projects Agency, and from Sanders Associates, Inc.
An implementation of a modulated filter bank is presented which, under some operating conditions, reduces the computational complexity from O(N log N) to O(N). The new structure is efficient when the application requires dense temporal (or spatial) sampling of the STFT output. One example of such an application is code-division, multiple-beam sonar. The computational complexity of this new approach is slightly lower than of the extended Goertzel algorithm. Unlike the extended Goertzel algorithm, which is only marginally stable, the new structure is unconditionally stable. This research was completed at MIT, with funding from the National Science Foundation Fellowship Program, from the Advanced Research Projects Agency, and from Sanders Associates, Inc.
A system is presented which enforces very low data rates, insures a low maximum system delay, and allows for inexpensive decoders. To achieve this, the gray-scale video conferencing sequences were converted to binary (black/white) images. Spatial and temporal differencing and statistical coding were then used for compression as were "focus-of-attention" methods of partial update. In this manner, faces and scenes were successfully encoded and decoded using simple processors at data rates below 19.2 kbaud (hard maximum) with no more than two frames of delay. This research was completed at MIT, with funding from the National Science Foundation Fellowship Program.
An imaging and analysis regime is presented for automatically estimating MRI quality measures, such as slice thickness, separation, orientation, and focus. Results of the automated analysis for MRI system examples are in good agreement with expectations from theory and with more manual tests. This research was completed at U. Michigan Department of Radiological Physics and Engineering.