Michele Covell

Senior Staff Research Scientist

Bio

I received my BS in electrical engineering from the University of Michigan and my MS and PhD from MIT in signal processing. I joined SRI International, in the area of active acoustic-noise control, and then Interval Research Corporation, where my research covered a wide range of topics in audio, image, and video processing. In 2000, I joined YesVideo and worked on faster-than-real-time video analysis. I moved to the Mobile Streaming Media group in HP Labs, as a key contributor in streaming-video services in 3G telephony networks. This work is listed as one of the top-40 accomplishments from HP Labs' 40-year history.

	»	Google Research
	»	HP Labs Streaming Media Systems
	»	YesVideo
	»	Interval Research Corporation
	»	SRI International
	»	MIT DSPG
	»	University of Michigan

Personal interests

I moved to Google, in the research group, in 2005, where I focused for several years on large-scale audio and video fingerprinting, identification, and retrieval. For this work, I received two Google EMG awards: one for innovation and one for financial impact. The Video Content Id system that we built from this research led to YouTube's 2013 Technical Emmy. More recently, I have been working in image and video compression. In addition to the publications described below, I have more than 90 granted US patents, along with associated PCT filings.

Publications

Johnston, Vincent, Minnen, Covell, et al. "Improved Lossy Image Compression with Priming and Spatially Adaptive Bit Rates for Recurrent Networks," IEEE International Conference on Computer Vision and Pattern Recognition, Salt Lake City UT, June 2018.
We propose a method for lossy image compression based on recurrent, convolutional neural networks that outperforms BPG (4:2:0), WebP, JPEG2000, and JPEG as measured by MS-SSIM. We introduce three improvements over previous research that lead to this state-of-the-art result. First, we show that training with a pixel-wise loss weighted by SSIM increases reconstruction quality according to several metrics. Second, we modify the recurrent architecture to improve spatial diffusion, which allows the network to more effectively capture and propagate image information through the network's hidden state. Finally, in addition to lossless entropy coding, we use a spatially adaptive bit allocation algorithm to more efficiently use the limited number of bits to encode visually complex image regions. We evaluate our method on the Kodak and Tecnick image sets and compare against standard codecs as well as recently published methods based on deep neural networks.
Minnen, Toderici, Covell, et al. "Spatially adaptive image compression using a tiled deep network." International Conference on Image Processing, Beijing China, Sep 2017.
Covell, et al. "Target-Quality Image Compression with Recurrent, Convolutional Neural Networks." May 2017
Deep neural networks represent a powerful class of function approximators that can learn to compress and reconstruct images. Existing image compression algorithms based on neural networks learn quantized representations with a constant spatial bit rate across each image. While entropy coding introduces some spatial variation, traditional codecs have benefited significantly by explicitly adapting the bit rate based on local image complexity and visual saliency. This paper introduces an algorithm that combines deep neural networks with quality-sensitive bit rate adaptation using a tiled network. We demonstrate the importance of spatial context prediction and show improved quantitative (PSNR) and qualitative (subjective rater assessment) results compared to a non-adaptive baseline and a recently published image compression model based on fully-convolutional neural networks.
Toderici, Vincent, Johnston, Hwang, Minnen, Shor, Covell, "Full Resolution Image Compression with Recurrent Neural Networks," IEEE International Conference on Computer Vision and Pattern Recognition, Honolulu HI, July 2017.
This paper presents a set of full-resolution lossy image compression methods based on neural networks. Each of the architectures we describe can provide variable compression rates during deployment without requiring retraining of the network: each network need only be trained once. All of our architectures consist of a recurrent neural network (RNN)-based encoder and decoder, a binarizer, and a neural network for entropy coding. We compare RNN types (LSTM, associative LSTM) and introduce a new hybrid of GRU and ResNet. We also study "one-shot" versus additive reconstruction architectures and introduce a new scaled-additive framework. We compare to previous work, showing improvements of 4.3%-8.8% AUC (area under the rate-distortion curve), depending on the perceptual metric used. As far as we know, this is the first neural network architecture that is able to outperform JPEG at image compression across most bitrates on the rate-distortion curve on the Kodak dataset images, with and without the aid of entropy coding.
Toderici, O'Malley, Hwang, Vincent, Minnen, Baluja, Covell, Sukthankar, "Variable-Rate Image Compression with Recurrent Neural Networks," International Conference on Learning Representations, San Juan Puerto Rico, May 2016.
A large fraction of Internet traffic is now driven by requests from mobile devices with relatively small screens and often stringent bandwidth requirements. Due to these factors, it has become the norm for modern graphics-heavy websites to transmit low-resolution, low-bytecount image previews (thumbnails) as part of the initial page load process to improve apparent page responsiveness. Increasing thumbnail compression beyond the capabilities of existing codecs is therefore a current research focus, as any byte savings will significantly enhance the experience of mobile device users. Toward this end, we propose a general framework for variable-rate image compression and a novel architecture based on convolutional and deconvolutional LSTM recurrent networks. Our models address the main issues that have prevented autoencoder neural networks from competing with existing image compression algorithms: (1) our networks only need to be trained once (not per-image), regardless of input image dimensions and the desired compression rate; (2) our networks are progressive, meaning that the more bits are sent, the more accurate the image reconstruction; and (3) the proposed architecture is at least as efficient as a standard purpose-trained autoencoder for a given number of bits. On a large-scale benchmark of 32 x 32 thumbnails, our LSTM-based approaches provide better visual quality than (headerless) JPEG, JPEG2000 and WebP, with a storage size that is reduced by 10% or more.
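The progressive property in point (2) can be illustrated with a toy, non-neural analogue. This is a minimal sketch, assuming a single scalar in [-1, 1] and two hypothetical helpers, `progressive_encode` and `progressive_decode`; it is not the LSTM architecture from the paper. Each step sends one sign bit for the current residual, so any prefix of the bit stream can be decoded, and the worst-case error bound halves with every extra bit.

```python
def progressive_encode(x, num_bits):
    """Encode a scalar x in [-1, 1] as sign bits of successive residuals."""
    bits, recon, step = [], 0.0, 0.5
    for _ in range(num_bits):
        bit = 1 if x - recon >= 0 else 0   # sign of the current residual
        bits.append(bit)
        recon += step if bit else -step    # mirror the decoder's update
        step /= 2                          # each later bit refines by half as much
    return bits


def progressive_decode(bits):
    """Reconstruct from any prefix of the bit stream."""
    recon, step = 0.0, 0.5
    for bit in bits:
        recon += step if bit else -step
        step /= 2
    return recon


bits = progressive_encode(0.3, 8)
# Decode every prefix; the error after n bits is bounded by 2**-n.
errors = [abs(0.3 - progressive_decode(bits[:n])) for n in range(1, 9)]
```

The error of an individual prefix is not guaranteed to shrink at every step; only the bound does. The point of the sketch is that the encoder runs once and the receiver chooses how many bits to consume.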
Covell, et al. "Optimizing Transcoder Quality Targets Using a Neural Network with an Embedded Bitrate Model," IS&T International Symposium on Electronic Imaging, San Francisco CA, Feb 2016.
Like all modern internet-based video services, YouTube employs adaptive bitrate (ABR) streaming. Due to the computational expense of transcoding, the goal is to achieve a target bitrate for each ABR segment, without requiring multi-pass encoding. We extend the content-dependent model equation between bitrate and frame rate [Ma2012] to include CRF and frame size. We then attempt to estimate the content-dependent parameters used in the model equation, using simple summary features taken from the video segment and a novel neural-network layout. We show that we can estimate the correct quality-control parameter on 65% of our test cases without using a previous transcode of the video segment. If there is a previous transcode of the same segment available (using an inexpensive configuration), we increase our accuracy to 80%.
Covell, Baluja, Sukthankar, "Micro-Auction-Based Traffic-Light Control: Responsive, Local Decision Making," IEEE International Conference on Intelligent Transportation Systems, Gran Canaria Spain, Sept 2015.
Baluja, Covell, Sukthankar, "Approximating the Effects of Installed Traffic Lights: A Behaviorist Approach Based on Travel Tracks," IEEE International Conference on Intelligent Transportation Systems, Gran Canaria Spain, Sept 2015.
Baluja, Covell, Sukthankar, "Physical and Virtual Cell Phone Sensors for Traffic Control: Algorithms and Deployment Impact," IEEE Sensors Applications Symposium, Catania Italy, April 2016.
Baluja, Covell, Sukthankar, "Traffic Lights with Auction-Based Controllers: Algorithms and Real-World Data," arXiv:1702.01205, Feb 2017.
Real-time, responsive optimization of traffic flow serves to address important practical problems: reducing drivers' wasted time and improving city-wide efficiency, as well as reducing gas emissions and improving air quality. Much of the current research in traffic-light optimization relies on extending the capabilities of basic traffic lights to either communicate with each other or communicate with vehicles. However, before such capabilities become ubiquitous, opportunities exist to improve traffic lights by being more responsive to current traffic situations within the existing, deployed infrastructure. In these papers, we introduce a new approach to using local induction-loop data and show its improvement over the currently observed delays in traffic flow. For better light control, we introduce micro-auctions based on local induction-loop information; no other outside sources of information are assumed. To allow us to compare with current traffic, we create traffic-light logic for each light in our simulation which results in the best match to the observed travel times. We test our different light-control approaches on real-world data collected over a period of several weeks around the Mountain View, California area. In our simulations, the micro-auction mechanisms (based only on local sensor data) surpass longer-term planning approaches that rely on widely placed sensors and communications.
Baluja, Covell, Sukthankar, "The Virtues of Peer Pressure: A Simple Method for Discovering High-Value Mistakes," International Conference on Computer Analysis of Images and Patterns, Valletta Malta, Sept 2015.
Much of the recent success of neural networks can be attributed to the deeper architectures that have become prevalent. However, the deeper architectures often yield unintelligible solutions, require enormous amounts of labeled data, and still remain brittle and easily broken. In this paper, we present a method to efficiently and intuitively discover input instances that are misclassified by well-trained neural networks. As in previous studies, we can identify instances that are so similar to previously seen examples that the transformation is visually imperceptible. Additionally, unlike in previous studies, we can also generate mistakes that are significantly different from any training sample, while, importantly, still remaining in the space of samples that the network should be able to classify correctly. This is achieved by training a basket of N "peer networks" rather than a single network. These are similarly trained networks that serve to provide consistency pressure on each other. When an example is found for which a single network, S, disagrees with all of the other N - 1 networks, which are consistent in their prediction, that example is a potential mistake for S. We present a simple method to find such examples and demonstrate it on two visual tasks. The examples discovered yield realistic images that clearly illuminate the weaknesses of the trained models, as well as provide a source of numerous, diverse, labeled-training samples.
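The disagreement test described above (one network, S, against N - 1 unanimous peers) can be sketched as a simple filter. This is a minimal illustration, assuming the predictions are already available as lists of labels; the function name `find_peer_disagreements` is hypothetical, and the sketch covers only the test itself, not the search procedure the paper uses to generate candidate examples.

```python
from collections import Counter


def find_peer_disagreements(predictions):
    """predictions[m][i] is the label that model m assigns to example i.

    Returns (example_index, dissenting_model, peer_label) triples for the
    examples where exactly one model disagrees with N - 1 unanimous peers.
    """
    num_models = len(predictions)
    if num_models < 3:
        raise ValueError("need at least 3 models for a meaningful majority")
    flagged = []
    for i, labels in enumerate(zip(*predictions)):
        counts = Counter(labels)
        if len(counts) != 2:
            continue  # unanimous, or the peers disagree among themselves
        (majority, majority_n), (_, minority_n) = counts.most_common(2)
        if majority_n == num_models - 1 and minority_n == 1:
            dissenter = next(m for m, lab in enumerate(labels) if lab != majority)
            flagged.append((i, dissenter, majority))
    return flagged


predictions = [
    ["cat", "dog", "cat", "bird"],   # model 0
    ["cat", "dog", "dog", "fish"],   # model 1
    ["cat", "cat", "dog", "fish"],   # model 2
    ["cat", "dog", "dog", "bird"],   # model 3
]
mistakes = find_peer_disagreements(predictions)
```

Example 3 is split two against two, so it is not flagged: the peers must be consistent with each other for the example to count as a potential mistake for the lone dissenter. The peer label returned for each flagged example is what would serve as the label for a new training sample.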
Covell, Baluja, "Efficient and Accurate Label Propagation on Dynamic Graphs and Label Sets," IARIA International Journal on Advances in Networks and Services, 6(4):246-259, December 2013. (invited paper)
Covell, Baluja, "Efficient and Accurate Label Propagation on Large Graphs and Label Sets," Proc. International Conference on Advances in Multimedia, Venice, April 2013. (best-paper award)
Many web-based application areas must infer label distributions starting from a small set of sparse, noisy labels. Previous work has shown that graph-based propagation can be very effective at finding the best label distribution across nodes, starting from partial information and a weighted-connection graph. In their work on video recommendations, Baluja et al. showed high-quality results using Adsorption, a normalized propagation process. An important step in the original formulation of Adsorption was re-normalization of the label vectors associated with each node, between every propagation step. That interleaved normalization forced computation of all label distributions, in synchrony, in order to allow the normalization to be correctly determined. Interleaved normalization also prevented use of standard linear-algebra methods, like stabilized bi-conjugate gradient descent (BiCGStab) and Gaussian elimination. We show how to replace the interleaved normalization with a single pre-normalization, done once before the main propagation process starts, allowing use of selective label computation (label slicing) as well as large-matrix-solution methods. As a result, much larger graphs and label sets can be handled than in the original formulation and more accurate solutions can be found in fewer propagation steps. We further extend that work to handle graphs that change and expand over time. We report results from using pre-normalized Adsorption in topic labeling for web domains, using label slicing and BiCGStab. We also report results from using incremental updates on changing co-author network data. Finally, we discuss two options for handling mixed-sign (positive and negative) graphs and labels.
Portions of this work were also presented at University of Tokyo, November 2012.
Portions of this work were also presented at the IEEE DSP Workshop (Napa) plenary, August 2013.
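Why a single pre-normalization permits label slicing can be seen in a small example. This is a minimal sketch, assuming a generic damped linear propagation rather than the exact Adsorption update; `row_normalize`, `propagate`, and the `damping` parameter are hypothetical names. Because the graph weights are normalized once, up front, the update is linear in the label values, so each label can be propagated on its own without computing any other label.

```python
def row_normalize(weights):
    """Normalize each node's edge weights once, before propagation starts."""
    normalized = []
    for row in weights:
        total = sum(row)
        normalized.append([w / total if total else 0.0 for w in row])
    return normalized


def propagate(norm_weights, seeds, damping=0.5, steps=50):
    """Iterate y <- (1 - damping) * seeds + damping * W y for ONE label."""
    y = list(seeds)
    for _ in range(steps):
        y = [
            (1 - damping) * seeds[i]
            + damping * sum(w * y[j] for j, w in enumerate(norm_weights[i]))
            for i in range(len(seeds))
        ]
    return y


# Path graph 0 - 1 - 2, with label A seeded on node 0 and label B on node 2.
W = row_normalize([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
label_a = propagate(W, [1.0, 0.0, 0.0])   # computed without touching label B
label_b = propagate(W, [0.0, 0.0, 1.0])
```

The fixed point of this iteration is the solution of the linear system (I - damping * W) y = (1 - damping) * seeds, which is the kind of problem that standard large-matrix solvers address. With normalization interleaved between steps, each node's update would depend on all of its labels at once, and neither shortcut would apply.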
Baluja, Covell, "Neighborhood Preserving Codes for Assigning Point Labels: Applications to Stochastic Search," International Conference on Computational Science, Barcelona Spain, June 2013.
Baluja, Covell, "Point Representation for Local Optimization: Towards Multi-Dimensional Gray Codes," IEEE International Congress on Evolutionary Computation, Cancun Mexico, June 2013.
Selecting a good representation of a solution space is vital to solving any search and optimization problem. In particular, once regions of high performance are found, having the property that small changes in the candidate solution correspond to searching nearby neighborhoods provides the ability to perform effective local optimization. To achieve this, it is common for stochastic search algorithms, such as stochastic hillclimbing, evolutionary algorithms (including genetic algorithms), and simulated annealing, to employ Gray Codes for encoding ordinal points or discretized real numbers. In these papers, we present novel methods to label similar and/or close points within arbitrary graphs with small Hamming distances. The resultant point labels can be seen as an approximate high-dimensional variant of Gray Codes with standard Gray Codes as a subset of the labels found. The labeling procedure is applicable to any task in which the solution requires the search algorithm to select a small subset of items out of many. Such tasks include vertex selection in graphs, knapsack-constrained item selection, bin packing, prototype selection for machine learning, and numerous scheduling problems, to name a few.
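For readers unfamiliar with the one-dimensional baseline, the standard binary-reflected Gray code is easy to state. The sketch below shows only that standard code, not the graph-labeling methods from these papers; the helper names are chosen for illustration. Consecutive integers receive codes that differ in exactly one bit, so a single bit flip in the encoding moves the search to a neighboring value.

```python
def gray_code(i):
    """Standard binary-reflected Gray code of a non-negative integer."""
    return i ^ (i >> 1)


def hamming_distance(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


codes = [gray_code(i) for i in range(8)]
# Neighboring integers are always one bit flip apart in Gray code ...
gray_steps = [hamming_distance(codes[i], codes[i + 1]) for i in range(7)]
# ... whereas plain binary can need several flips (3 -> 4 is 011 -> 100).
binary_steps = [hamming_distance(i, i + 1) for i in range(7)]
```

In plain binary, moving from 3 to 4 requires flipping all three bits at once, which is why a local search that mutates one bit at a time can stall at such boundaries.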
Seth, Covell, et al. "A Tale of Two (Similar) Cities: Inferring City Similarity Through Geo-Spatial Query Log Analysis," Proc. International Conference on Knowledge, Discovery, and Information Retrieval, Paris, Oct 2011.
Understanding the backgrounds and interests of the people who are consuming a piece of content, such as a news story, video, or music, is vital for the content producer as well as the advertisers who rely on the content to provide a channel on which to advertise. We extend traditional search-engine query log analysis, which has primarily concentrated on analyzing either single or small groups of queries or users, to examining the complete query stream of very large groups of users -- the inhabitants of 13,377 cities across the United States. Query logs can be a good representation of the interests of the city's inhabitants and a useful characterization of the city itself. Further, we demonstrate how query logs can be effectively used to gather city-level statistics sufficient for providing insights into the similarities and differences between cities. Cities that are found to be similar through the use of query analysis correspond well to the similar cities as determined through other large-scale and time-consuming direct measurement studies, such as those undertaken by the Census Bureau. Extensive experiments are provided.
Cui, Mathur, Covell, et al. "Example-based Image Compression," Proc. International Conference on Image Processing, Hong Kong, Sept 2010.
Current standard image-compression approaches rely on fairly simple predictions, using either block- or wavelet-based methods. While many more sophisticated texture-modeling approaches have been proposed, most do not provide a significant improvement in compression rate over the current standards at a workable encoding complexity level. We re-examine this area, using example-based texture prediction. We find that we can provide consistent and significant improvements over JPEG, reducing the bit rate by more than 20% for many PSNR levels. These improvements require consideration of differences between residual energy and prediction/residual compressibility when selecting a texture prediction, as well as careful control of the computational complexity in encoding.
Baluja, Covell, "Beyond 'Near Duplicates': Learning Hash Codes for Efficient Similar-Image Retrieval," Proc. International Conference on Pattern Recognition, Istanbul, Aug 2010.
Finding similar images in a large database is an important, but often computationally expensive, task. In this paper, we present a two-tier similar-image retrieval system with the efficiency characteristics found in simpler systems designed to recognize near duplicates. We compare the efficiency of lookups based on random projections and learned hashes to 100-times-more-frequent exemplar sampling. Both approaches significantly improve on the results from exemplar sampling, despite having significantly lower computational costs. Learned-hash keys provide the best result, in terms of both recall and efficiency.
Jing, Covell, et al. "Learning Query-Specific Distance Functions for Large-Scale Web Image Search," IEEE Trans. on Multimedia 15(8):2022-2034, Dec 2013.
Several search engines now allow a hybrid search approach: starting from a text-based query, the search can be refined by picking an example image and re-ranking based on image similarity. We have found that learning distinct distance functions for different search queries can improve this re-ranking. We propose scalable solutions to learning query-specific distance functions, using the query-log data for training. This paper evaluates this approach through comprehensive human evaluation, and compares the results to Google image search.
Jing, Covell, Rowley, "Comparison of Clustering Approaches for Summarizing Large Populations of Images," Proc. ICME Workshop on Visual Content Identification and Search, Singapore, July 2010
We compare different clustering approaches for selecting a set of search-result exemplar images, similar to what is used for Image Swirl (see below). Our evaluation covers both the "correctness" and efficiency of the clustering results. We evaluate these approaches on 900 diverse queries, each associated with 1000 web images, by comparing the exemplars chosen by clustering to the top 20 images for that search term. Our results suggest that Affinity Propagation is effective in selecting exemplars that match the top search images but at high computational cost. We improve on these early results using a simple distribution-based selection filter on incomplete clustering results. This improvement allows us to use more computationally efficient approaches to clustering, such as Hierarchical Agglomerative Clustering (HAC) and Partitioning Around Medoids (PAM), while still reaching the same (or better) quality of results as were given by Affinity Propagation in the original study. The computational savings are significant since these alternatives are 7 to 27 times faster than Affinity Propagation.
Jing, Rowley, Wang, Tsai, Rosenberg, Covell, "Image Swirl: A Large-Scale Content-Based Image Visualization System," WWW 2012, Lyon France, April 2012
Jing, Rowley, Rosenberg, Wang, Covell, "Visualizing Web Images via Google Image Swirl," NIPS Workshop on Statistical Machine Learning for Visual Analytics, Vancouver, December 2009
Google Image Swirl, released in November 2009, organizes image search results based on their visual and semantic similarities and presents them in an intuitive exploratory interface. It clusters the top image search results for more than 200,000 queries and lets you explore the clusters and the relation between images. Once you find the group of images you're interested in, you can click on the thumbnail and a cluster of images will "swirl" into view. You can then further explore additional sub-groups within any cluster.
Also demonstrated at IEEE International Conference on Computer Vision and Pattern Recognition 2010.
Baluja, Covell, "Finding Images and Line Drawings in Document-Scanning Systems," Proc. IAPR International Conference on Document Analysis and Retrieval, Barcelona, July 2009.
This work addresses the problem of finding images and line-drawings in scanned pages. It is a crucial processing step in the creation of a large-scale system to detect and index images found in books and historic documents. Within the scanned pages that contain both text and images, the images are found through the use of local-feature extraction, applied across the full scanned page. This is followed by a novel learning system to categorize the local features into either text or image. The discrimination is based on using multiple classifiers trained via stochastic sampling of weak classifiers for each AdaBoost stage. The approach taken in sampling includes stochastic hill climbing across weak detectors, allowing us to reduce our classification error by as much as 25% relative to more naive stochastic sampling. Stochastic hill climbing in the weak classifier space is possible due to the manner in which we parameterize the weak classifier space. Through the use of this system, we improve image detection by finding more line-drawings, graphics, and photographs, as well as reducing the number of spurious detections due to misclassified text, discoloration, and scanning artifacts.
Covell, Baluja, "LSH Banding for Large-Scale Retrieval with Memory and Recall Constraints," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Taipei, April 2009.
Locality Sensitive Hashing (LSH) is widely used for efficient retrieval of candidate matches in very large audio, video, and image systems. However, extremely large reference databases necessitate a guaranteed limit on the memory used by the table lookup itself, no matter how the entries crowd different parts of the signature space, a guarantee that LSH does not give. In this paper, we provide such guaranteed limits, primarily through the design of the LSH bands. When combined with data-adaptive bin splitting (needed on only 0.04% of the occupied bins), this approach provides the required guarantee on memory usage. At the same time, it avoids the reduced recall that more extensive use of bin splitting would give.
Portions of this work were also presented at the IEEE DSP Workshop (Napa) plenary, August 2013.
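The banded table lookup that this paper builds on can be sketched in a few lines. This is a minimal illustration of generic LSH banding, assuming signatures are fixed-length integer lists; the class name `BandedLSHIndex` and its methods are hypothetical. It does not implement the paper's memory guarantee or its data-adaptive bin splitting.

```python
from collections import defaultdict


def split_into_bands(signature, num_bands):
    """Cut a signature into num_bands equal-length tuples (one per band)."""
    if len(signature) % num_bands:
        raise ValueError("signature length must be a multiple of num_bands")
    rows = len(signature) // num_bands
    return [tuple(signature[b * rows:(b + 1) * rows]) for b in range(num_bands)]


class BandedLSHIndex:
    """One hash table per band; items sharing any band become candidates."""

    def __init__(self, num_bands):
        self.num_bands = num_bands
        self.tables = [defaultdict(set) for _ in range(num_bands)]

    def add(self, item_id, signature):
        bands = split_into_bands(signature, self.num_bands)
        for table, band in zip(self.tables, bands):
            table[band].add(item_id)

    def candidates(self, signature):
        found = set()
        bands = split_into_bands(signature, self.num_bands)
        for table, band in zip(self.tables, bands):
            found |= table.get(band, set())   # .get avoids creating empty bins
        return found


index = BandedLSHIndex(num_bands=2)
index.add("a", [1, 2, 3, 4])
index.add("b", [9, 9, 3, 4])
index.add("c", [7, 7, 7, 7])
matches = index.candidates([0, 0, 3, 4])   # shares the second band with a and b
```

Each bin here is an unbounded set, so a crowded region of signature space can make one bin arbitrarily large. That unbounded growth is the memory problem the abstract above refers to.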
Baluja, Covell, Ioffe, "Permutation Grouping: Intelligent Hash Function Design for Audio and Image Retrieval," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Las Vegas, April 2008.
Locality-sensitive hashing (LSH) is used in many applications that require retrieval and evaluation of approximate nearest neighbors. The uniformity of the signature distributions has a strong effect on the number of candidates that must be considered. The effects of this uneven distribution are squared, since retrieval probes are taken from the same uneven distribution that resulted in uneven occupancy in the table itself. In this study, we introduce the idea of dimension grouping to intelligently design the hash functions that are used to index the LSH tables. This reduces the inefficiencies introduced by hashing real-world data that is noisy, structured, and most importantly is not independently or identically distributed. Through large-scale tests, we find that permutation-grouping dramatically increases the efficiency of the overall retrieval system by lowering the number of low-probability candidates that must be examined by 30-50%.
Portions of this work were also presented at the IEEE DSP Workshop (Napa) plenary, August 2013.
Baluja, Covell, "Learning to Hash: Forgiving Hash Functions and Applications," Data Mining and Knowledge Discovery, (Springer Netherlands) 2008.
Baluja, Covell, "Learning 'Forgiving' Hash Functions: Algorithms and Large-Scale Tests," International Joint Conference on AI, Hyderabad India, January 2007.
We want to be able to efficiently find similar items in large, high-dimensional datasets, for things like music, image, and video retrieval. Scaling lookups for any large corpus tends to lead to tree or hash-based approaches. Correctly structuring such trees or hash functions is made more difficult than in other large-corpus domains by an imprecise definition of similarity. In this work, we found a method to learn a similarity function from only weakly labeled positive examples. We used this similarity function as the basis of a hash function to severely constrain the number of points considered for each lookup. In testing on a large real-world audio dataset, only a tiny fraction of the points (~0.27%) were ever considered for each lookup. To increase efficiency, we did no comparisons in the original high-dimensional space of points, yet still achieved nearly 99% accuracy on 5-second-long probes.
Covell, Baluja, "Known-Audio Detection using Waveprint: Spectrogram Fingerprinting by Wavelet Hashing," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Hawaii, April 2007.
This paper extends the previous work on Waveprint (see below) to handle open-set detection and identification. To do this, we re-examine the best-ranked matches from Waveprint using temporal-ordering-based processing. The resulting system has excellent detection capabilities for small snippets of audio that have been degraded in various ways, including competing noise, poor recording quality, and cell-phone playback. The system is more accurate than the previous state-of-the-art system while being more efficient and flexible in memory usage and computation.
Also presented, by invitation, at International Workshop on Computer Vision, 2010.
Baluja, Covell, "Waveprint: Efficient Wavelet-Based Audio Fingerprinting," Pattern Recognition 41(11): 3467-3480, November 2008.
Baluja, Covell, "Content Fingerprinting Using Wavelets," Proc. IET Conference on Multimedia, London England, November 2006. (invited paper)
Baluja, Covell, "Audio Fingerprinting: Combining Computer Vision and Data-Stream Processing," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Hawaii, April 2007.
In this paper, we introduce Waveprint, a novel method for audio identification. Waveprint uses a combination of computer-vision techniques and large-scale-data-stream processing algorithms to create compact fingerprints of audio data that can be efficiently matched. The resulting system has excellent identification capabilities for small snippets of audio that have been degraded in a variety of manners, including competing noise, poor recording quality, and cell-phone playback. We explicitly measure the tradeoffs between performance, memory usage, and computation through extensive experimentation.
Portions of this work were also presented at the IEEE DSP Workshop (Napa) plenary, August 2013.
Covell, Baluja, Fink, "Advertisement Detection and Replacement using Acoustic and Visual Repetition," IEEE Workshop on Multimedia Signal Processing, Victoria BC, October 2006.
Covell, Baluja, Fink, "Detecting Ads in Video Streams Using Acoustic and Visual Cues," IEEE Computer Magazine, December 2006.
We propose a method for detecting and precisely segmenting repeated sections of broadcast streams. The detection stage starts from acoustic matches and validates the hypothesized matches using the visual channel. The precise segmentation uses fine-grain acoustic match profiles to determine start and end-points. The approach is both efficient and robust to broadcast noise and differences in broadcaster signals. Our final result is nearly perfect, with better than 99% precision, at a recall rate of 95% for repeated advertisements.
Fink, Covell, Baluja, "Social- and Interactive-Television Applications Based on Real-Time Ambient-Audio Identification," EuroITV Athens Greece, May 2006. (best-paper award)
Fink, Covell, Baluja, "Mass Personalization: Social and Interactive Applications using Sound-Track Identification," Multimedia Tools and Applications 2008. (invited paper)
Fink, Covell, Baluja, "Coordinated Multi-Device Presentations: Ambient-Audio Identification," CRC Encyclopedia of Wireless and Mobile Communications (ed. Furht) 2008. (invited)
We describe mass personalization, a framework for combining mass media with a highly personalized Web-based experience. We introduce four applications for mass personalization: personalized content layers, ad hoc social communities, real-time popularity ratings and virtual media library services. Using the ambient audio originating from the television, the four applications are available with no more effort than simple television channel surfing. Our audio identification system does not use dedicated interactive TV hardware and does not compromise the user's privacy. Feasibility tests of the proposed applications are provided both with controlled conversational interference and with "living-room" evaluations.
Covell, Roy, Shen, Huve, "Time-Scale Modification for 3G-Telephony Video," IEEE Workshop on Multimedia Signal Processing Victoria BC, October 2006. (with demo video)
Covell, Roy, Shen, Huve, "3G Telephony Control Stack: Interactive Playback Control of Video," CRC Encyclopedia of Wireless and Mobile Communications (ed. Furht) 2008. (invited)
Streaming video to mobile devices like cellular phones is an important emerging application. However, mobile thin clients and telephony-infrastructure constraints limit the capabilities and interfaces available to the end user. This paper describes our design of an end-to-end interactive video telephony service. The challenge of providing interactive services is to remain compliant with the telco signaling and data protocols, without requiring software downloads onto the handset, while providing responsive, interactive controls. Our service provides VCR-like control of telephony video, maintains audio-video synchronization, and respects the video frame and bit-rate contracts of the telephony channel. This research was completed at HP Labs.
Covell, Roy, Seo, "Predictive Modeling of Streaming Servers," SIGMETRICS Workshop on Mathematical Performance Modeling and Analysis, Banff Canada, June 2005.
This paper describes a mathematical approach to deriving predictive models of streaming-server saturation (saturating/non-saturating) through hidden-variable estimation. We use a 240-dimensional measurement vector to estimate a low-dimensional vector of continuous-valued hidden variables, with each hidden-variable dimension corresponding to a "discovered" resource on which the streaming server depends. The discovery process uses labelled calibration data, along with a POCS-like alternation of constraints, to create the best (in a total-least-squares sense) predictors for server saturation, both under the current (unknown) set of client sessions and under some (proposed) increase in the client sessions. These models allow us to make admission-control decisions as well as to indicate which types of additional clients the streaming server can currently handle. (A description of the data-collection regime is available here. A longer version of these two papers is available here.) This research was completed at HP Labs.
Karlsson, Covell, "Dynamic Black-Box Performance-Model Estimation for Self-Tuning Regulators," International Conference on Autonomic Computing, Seattle WA, June 2005.
Methods for automatically managing the performance of computing services must estimate a performance model of that service. This paper explores properties that are necessary for performance-model estimation of black-box computer systems when used together with adaptive feedback loops. It shows that the standard method of least-squares estimation often gives rise to models that make the control loop perform the opposite action of what is desired. This produces large oscillations and bad tracking performance. The paper evaluates what combination of input and output data provides models with the best properties for the control loop. In addition, it proposes three extensions to the controller that make it perform well, even when the estimated model would otherwise have degraded performance.
Our proposed techniques are evaluated with an adaptive controller that provides latency targets for workloads on black-box computer services under a variety of conditions. The techniques are evaluated on two systems: a three-tier ecommerce site and a web server. Experimental results show that our best estimation approach significantly improves the ability of the controller to meet the latency goals. With our techniques, previously oscillating workload latencies become smooth around the latency targets. This research was completed at HP Labs.
Harville, Covell, Wee, "An Architecture for Componentized, Network-Based Media Services," Proc. IEEE International Conference on Multimedia and Expo, Baltimore, MD, July 2003.
This paper presents Media Services Architecture (MSA). MSA is a flexible, general architecture for requesting, configuring, and running services that operate on streaming audio and video as it flows through the network. MSA decomposes requested media services into modular processing components that may be distributed to servers throughout the network and which intercommunicate via standard streaming protocols. Use of standard protocols also affords seamless inter-operability between MSA and media content delivery networks. MSA manages media services by monitoring the networked servers and assigning service components to them in a manner that uses available computational and network resources efficiently. We describe some implemented example services to illustrate the concepts and benefits of the architecture. This research was completed at HP Labs.
Roy, Covell, et al. "A System Architecture for Managing Mobile Streaming Media Services," Proc. IEEE Mobile Distributed Computing Workshop, Providence, RI, May 2003.
Current mobile devices and wireless access allow delivery of video streams to a wide range of cellular devices. The wide range and variability of the network, processor, and display conditions on these devices will effectively require streaming video to be dynamically tailored to each client's changing constraints. Real-time compressed-domain video transcoding allows each live streaming session to be tailored to these changing environments in a practical and affordable manner. However, even with new advances in video-transcoding efficiency, the computational, bandwidth, and scheduling requirements of live video processing result in the need for managed placement of these tasks, for best use of the distributed resources available within the network. In this paper, we discuss our approach to service-location management (SLM). We compare the performance of alternative implementations of SLM resource monitoring. Finally, we present our conclusions on which of these alternative implementations is both most reliable and most extensible to the service of large numbers of mobile client requests. This research was completed at HP Labs.
Covell, Ahmad, "Analysis-by-Synthesis Dissolve Detection," Proc. IEEE International Conference on Image Processing, Rochester, NY, Sept 23-25 2002.
This paper presents a novel, real-time, minimal-latency technique for dissolve detection which handles the widely varying camera techniques, expertise, and overall video quality seen in amateur, semi-professional, and professional video footage. We achieve 88% recall and 93% precision for dissolve detection. In contrast, on the same data set, at a similar recall rate (87%), DCD has more than 3 times the number of false positives, giving a precision of only 81% for dissolve detection. This research was completed at YesVideo, Inc.
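The "more than 3 times" comparison above can be checked from the quoted percentages alone. The sketch below is only a worked arithmetic check, not part of the paper: the function names are made up for illustration, and the inputs are the rounded recall and precision figures from the abstract, so the result is approximate.

```python
def false_positives_per_true_positive(precision: float) -> float:
    """Return FP/TP implied by a precision value, since precision = TP / (TP + FP)."""
    return (1.0 - precision) / precision


def false_positive_ratio(recall_a, precision_a, recall_b, precision_b):
    """Ratio of false-positive counts (system B over system A) on the same data set.

    With P true dissolves in the data, TP = recall * P, so P cancels in the ratio.
    """
    fp_a = recall_a * false_positives_per_true_positive(precision_a)
    fp_b = recall_b * false_positives_per_true_positive(precision_b)
    return fp_b / fp_a


# Rounded figures quoted in the abstract: proposed method (88% recall,
# 93% precision) versus DCD (87% recall, 81% precision).
ratio = false_positive_ratio(0.88, 0.93, 0.87, 0.81)
print(f"DCD false positives per proposed-method false positive: {ratio:.2f}")
```

Because both systems are scored on the same data set, the number of true dissolves cancels, so the ratio depends only on the four percentages.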
Covell, et al. "FastMPEG: Time-scale modification of Bit-Compressed Audio Information," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, 2001.
This paper describes techniques to change the playback speed of MPEG-compressed audio, without first decompressing the audio file. There are two primary contributions in this paper. 1) We describe three techniques to perform time-scale modification in the maximally decimated domain. 2) We show how to infer the output of the auditory masking model on the new audio stream, using the information in the original file. This new FastMPEG algorithm is more than an order of magnitude more efficient than decompressing the audio, performing time-scale modification in the conventional time domain, and then recompressing. Samples of our results can be found here. This research was completed at Interval Research Corporation.
Darrell, Covell, "Correspondence with Cumulative Similarity Transforms," IEEE Trans. on Pattern Analysis and Machine Intelligence 23(2):222-227, Feb 2001.
This paper defines a local image transform based on cumulative similarity measures and shows it to enable efficient correspondence and tracking near occluding boundaries. Unlike traditional methods, this transform allows correspondences to be found when the only contrast present is the occluding boundary itself and when the sign of contrast along the boundary is possibly reversed. The transform is based on the idea of a cumulative similarity measure which characterizes the shape of local image homogeneity; both the value of an image at a particular point and the shape of the region with locally similar and connected values is captured. This representation is insensitive to structure beyond an occluding boundary but is sensitive to the shape of the boundary itself, which is often an important cue. We show results comparing this method to traditional least-squares and robust correspondence matching. This research was completed at Interval Research Corporation.
Slaney, Covell, "FaceSync: a Linear Operator for Measuring Synchronization of Video Facial Images and Audio Tracks," Proc. Neural Information Processing Society 13, MIT Press, 2001. (presented at NIPS 2000)
This paper develops an optimal linear algorithm that finds the degree of synchronization between the audio and image recordings of a human speaker. Using canonical correlation, it finds the best direction to combine all the audio and image data, projecting them onto a single axis. FaceSync uses Pearson's correlation to measure the degree of synchronization between the audio and image data. We derive the optimal linear transform to combine the audio and visual information and describe an implementation that avoids the numerical problems caused by computing the correlation matrices. This research was completed at Interval Research Corporation.
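As a rough illustration of the final measurement step described above, the sketch below computes Pearson's correlation between two one-dimensional sequences and, as an added illustrative use, scans a small range of frame offsets for the strongest correlation. It assumes the audio and image data have already been projected onto a single axis; the canonical-correlation step is omitted, and the names `pearson` and `best_offset`, the synthetic signal, and the offset scan are assumptions for this example rather than the paper's implementation.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def best_offset(audio_proj, video_proj, max_lag):
    """Return (lag, correlation) with the highest Pearson correlation.

    Both inputs must have the same length. A positive lag means the
    video sequence trails the audio sequence by `lag` frames.
    """
    scores = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a = audio_proj[: len(audio_proj) - lag]
            v = video_proj[lag:]
        else:
            a = audio_proj[-lag:]
            v = video_proj[: len(video_proj) + lag]
        scores[lag] = pearson(a, v)
    lag = max(scores, key=scores.get)
    return lag, scores[lag]


def _signal(t):
    """Synthetic stand-in for a projected feature track (hypothetical data)."""
    return math.sin(0.3 * t) + 0.5 * math.sin(1.1 * t)


audio = [_signal(t) for t in range(200)]
video = [_signal(t - 3) for t in range(200)]  # same track, delayed by 3 frames
lag, corr = best_offset(audio, video, max_lag=10)
print(lag, round(corr, 3))
```

On this synthetic pair the scan should recover the three-frame delay, since the two sequences are identical once realigned.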
Covell, et al. "Articulated-Pose Estimation using Brightness- and Depth-Constancy Constraints," Proc. IEEE Computer Vision and Pattern Recognition, Hilton Head Island SC, June 2000.
This paper explores several approaches for articulated-pose estimation, assuming that video-rate depth information is available, from either stereo cameras or other sensors. We use these depth measurements in the traditional linear brightness constraint equation, as well as in a depth constraint equation. To capture the joint constraints, we combine the brightness and depth constraints with twist mathematics. We address several important issues in the formation of the constraint equations, including updating the body rotation matrix without using a first-order matrix approximation and removing the coupling between the rotation and translation updates. The resulting constraint equations are linear on a modified parameter set. After solving these linear constraints, there is a single closed-form non-linear transformation to return the updates to the original pose parameters. We show results for tracking body pose in oblique views of synthetic walking sequences and in moving-camera views of synthetic jumping-jack sequences. We also show results for tracking body pose in side views of a real walking sequence. This research was completed at Interval Research Corporation.
Covell, Darrell, "Dynamic Occluding Contours: A New External-Energy Term for Snakes," Proc. IEEE Computer Vision and Pattern Recognition, Fort Collins CO, June 1999, vol 2, p 232-238.
Dynamic contours, or snakes, provide an effective method for tracking complex moving objects for segmentation and recognition tasks, but have difficulty tracking occluding boundaries on cluttered backgrounds. To compensate for this shortcoming, dynamic contours often rely on detailed object-shape or -motion models to distinguish between the boundary of the tracked object and other boundaries in the background. In this paper, we present a complementary approach to detailed object models: We improve the discriminative power of the local image measurements that drive the tracking process. We describe a new robust external-energy term for dynamic contours that can track occluding boundaries without detailed object models. We show how our image model improves tracking in cluttered scenes, and describe how a fine-grained image-segmentation mask is created directly from the local image measurements used for tracking. This research was completed at Interval Research Corporation.
Covell, et al. "Modification of Audible and Visual Speech," Signal Processing for Multimedia. 1998 (invited)
Speech is one of the most common and richest methods that people use to communicate with one another. Our facility with this communication form makes speech a good interface for communicating with or via computers. At the same time, our familiarity with speech makes it difficult to generate synthetic but natural-sounding speech and synthetic but natural-looking lip-synced faces. One way to reduce the apparent unnaturalness of synthetic audible and visual speech is to modify natural (human-produced) speech. This approach relies on examples of natural speech and on simple models of how to take those examples apart and to put them back together to create new utterances.
We discuss two such techniques in depth. The first technique, Mach1, changes the overall timing of an utterance, with little loss in comprehensibility and with no change in the wording of or emphasis within what was said or in the identity of the voice. This ability to speed up (or slow down) speech will make speech a more malleable channel of communication. It gives the listener control over the amount of time that she spends listening to a given oration, even if the presentation of that material is prerecorded. The second technique, Video Rewrite, synthesizes images of faces, lip synced to a given utterance. This tool could be useful for reducing the data rate for video conferencing [Rao98], as well as for providing photorealistic avatars. This research was completed at Interval Research Corporation.
Covell, et al. "Mach1 for Nonuniform Time-Scale Modification of Speech: Theory, Technique, and Comparisons," IRC TR 1997-061, Interval Research Corporation, 1997.
Covell, et al. "Mach1: Nonuniform Time-Scale Modification of Speech," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Seattle WA, May 12-15 1998, vol 1 p 349-352.
This paper describes Mach1, a new approach to nonuniform time compression for speech. Mach1 was designed to mimic the natural timing of fast speech. At identical overall compression rates, listener comprehension for Mach1-compressed speech increased between 5 and 31 percentage points over that for linearly compressed speech, and response times dropped by 15%. For rates between 2.5 and 4.2 times real time, there was no significant comprehension loss with increasing Mach1 compression rates. In A-B preference tests, Mach1-compressed speech was chosen 95% of the time. This paper describes the Mach1 technique and our listener-test results. Audio examples are provided on the web page referenced above. This research was completed at Interval Research Corporation.
Bregler, Covell, Slaney, "Video Rewrite: Driving Visual Speech with Audio," ACM Computer Graphics Proc. SIGGRAPH 97, Los Angeles, CA, Aug 3-8 1997, p 353-360.
Bregler, Covell, Slaney, "Video Rewrite: Visual Speech Synthesis from Video," Proc. ESCA Workshop on Audio-Visual Speech Processing, Rhodes, Greece, Sept 26-27 1997, p 153-156.
Bregler, Covell, Slaney, "Video Rewrite: Photorealistic Synthetic Lip Sync," Proc. INA Imagina, Monte Carlo, Monaco, March 4-6 1998, p 193-203. (invited)
Video Rewrite uses existing footage to automatically create new video of a person mouthing words that she did not speak in the original footage. This technique is useful in movie dubbing, for example, where the movie sequence can be modified to sync the actors' lip motions to the new soundtrack.
Video Rewrite uses computer-vision techniques to track points on the speaker's mouth in the training footage, and morphing techniques to combine these mouth gestures into the final video sequence. The new video combines the dynamics of the original actor's articulations with the mannerisms and setting dictated by the background footage. Video Rewrite is the first facial-animation system to automate all the labeling and assembly tasks required to resync existing footage to a new soundtrack. This research was completed at Interval Research Corporation.
Covell, "Eigen-points: control-point location using principal component analyses," Proc. IEEE International Conference on Automatic Face and Gesture Recognition, Killington VT, Oct 14-16 1996, p 122-127.
Covell, Bregler, "Eigen-points," Proc. IEEE International Conference on Image Processing, Lausanne, Switzerland, Sept 16-19 1996, vol 3 p 471-474.
Eigen-points estimates the image-plane locations of fiduciary points on an object. By estimating multiple locations simultaneously, eigen-points exploits the inter-dependence between these locations. This is done by associating neighboring, inter-dependent control-points with a model of the local appearance. The model of local appearance is used to find the feature in new unlabeled images. Control-point locations are then estimated from the appearance of this feature in the unlabeled image. The estimation is done using an affine manifold model of the coupling between the local appearance and the local shape.
Eigen-points uses models aimed specifically at recovering shape from image appearance. The estimation equations are solved non-iteratively, in a way that accounts for noise in the training data and the unlabeled images and that accounts for uncertainty in the distribution and dependencies within these noise sources. This research was completed at Interval Research Corporation.
Slaney, Covell, Lassiter, "Automatic Audio Morphing," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Atlanta GA, May 7-10 1996, vol 2 p 1001-1004.
This paper describes techniques to automatically morph from one sound to another. Audio morphing is accomplished by representing the sound in a multi-dimensional space that is warped or modified to produce a desired result. The multi-dimensional space encodes the spectral shape and pitch on orthogonal axes. After matching components of the sound, a morph smoothly interpolates the amplitudes to describe a new sound in the same perceptual space. Finally, the representation is inverted to produce a sound. This paper describes representations for morphing, techniques for matching, and algorithms for interpolating and morphing each sound component. Spectrographic images of a complete morph are shown at the end of the paper. This research was completed at Interval Research Corporation.
Covell, "Autocorrespondence: Feature-based Match Estimation for Image Metamorphosis," Proc. IEEE Workshop on Image and Multi-dimensional Signal Processing, Belize City, Belize, March 3-6 1996, p 120-121.
Covell, "Autocorrespondence: Feature-based Match Estimation and Image Metamorphosis," Proc. IEEE International Conference on Systems, Man and Cybernetics, Vancouver BC, Oct 22-25 1995, vol 3 p 2736-2741.
These papers address the problem of matching distinct but related objects across images, for applications such as view-based model capture, virtual presence, and morphing special effects. A new approach to match estimation is presented for estimating correspondences between distinct objects. This approach minimizes the effects of lighting variations and non-rigid deformations. A sparse set of features is used for alignment, followed by image-plane warping, or morphing, based on these constraints. This research was completed at Interval Research Corporation.
Covell, Withgott, "Spanning the Gap between Motion Estimation and Morphing," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Adelaide, Australia, April 19-22 1994, vol 5 p 213-216.
Motion estimation is an important and well-studied method for determining correspondences between images in video processing. In contrast with motion estimation, the use of image warping on image sequences is a new and developing field. To provide image warping effects, morphing algorithms rely on mesh points to align two images. The selection of these mesh points is a labor-intensive, manual process. This paper presents one approach to automating the selection of mesh points for morphing by recognizing this problem as one of determining image correspondences. Applications include video compression, animation sequence generation and high-quality time dilation of video. This research was completed at Interval Research Corporation.
Covell, Myers, Oppenheim, "Computer-Aided Algorithm Design and Rearrangement," Chapter 2 (p 30-87) in Symbolic and Knowledge-Based Signal Processing, Oppenheim & Nawab (eds.), Prentice-Hall, 1992.
Covell, "An Algorithm Design Environment for Signal Processing," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Albuquerque NM, April 3-6 1990, vol 3 p 1779-1782.
Covell, "An Algorithm Design Environment for Signal Processing," MIT RLE TR 549, MIT, 1989.
Traditional signal processing by computer has relied on numerical methods. This chapter explores how symbolic representation and manipulation methods may be incorporated in the numerical processing of signals and how these methods may be used in the design and analysis of signal processing systems. Using symbolic manipulation, a high-level compiler was developed to exploit the underlying mathematics of the DSP systems. This compiler uncovered a new, unconventional approach to implementing a modulated filter bank. The details of this newly discovered algorithm were published independently (see the next paper). This chapter also describes one approach to recording and propagating computational cost within a DSP system. This research was completed at MIT, with funding from the National Science Foundation Fellowship Program, from the Advanced Research Projects Agency, and from Sanders Associates, Inc.
Covell, Richardson, "A New, Efficient Structure for the Short-Time Fourier Transform, with an Application in Code-Division Sonar Imaging," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Toronto, Canada, May 14-17 1991, vol 3 p 2041-2044.
An implementation of a modulated filter bank is presented which, under some operating conditions, reduces the computational complexity from O(N log N) to O(N). The new structure is efficient when the application requires dense temporal (or spatial) sampling of the STFT output. One example of such an application is code-division, multiple-beam sonar. The computational complexity of this new approach is slightly lower than that of the extended Goertzel algorithm. Unlike the extended Goertzel algorithm, which is only marginally stable, the new structure is unconditionally stable. This research was completed at MIT, with funding from the National Science Foundation Fellowship Program, from the Advanced Research Projects Agency, and from Sanders Associates, Inc.
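To show why dense temporal sampling makes an O(N) update per output possible, here is a minimal sketch of the textbook sliding DFT with a rectangular window. It is a swapped-in illustration and not the structure presented in the paper: this recursion has its poles on the unit circle, so it is itself only marginally stable and accumulates rounding error over long runs. The function names are made up for this example.

```python
import cmath
import math


def direct_dft(window):
    """Reference O(N^2) DFT of one window, used here only to check the recursion."""
    n = len(window)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * m / n) for m, x in enumerate(window))
        for k in range(n)
    ]


def sliding_dft(signal, n):
    """Yield the n-point DFT of every length-n window of `signal`.

    After the first window, each new spectrum costs O(n): one shared
    subtract/add, then one add and one complex rotation per bin.
    """
    twiddles = [cmath.exp(2j * cmath.pi * k / n) for k in range(n)]
    spectrum = direct_dft(signal[:n])
    yield list(spectrum)
    for t in range(n, len(signal)):
        # Swap the oldest sample for the newest, then rotate each bin.
        delta = signal[t] - signal[t - n]
        spectrum = [(x + delta) * w for x, w in zip(spectrum, twiddles)]
        yield list(spectrum)


signal = [math.sin(0.7 * t) + 0.25 * math.cos(2.1 * t) for t in range(40)]
spectra = list(sliding_dft(signal, 8))
reference = direct_dft(signal[-8:])
max_error = max(abs(a - b) for a, b in zip(spectra[-1], reference))
print(len(spectra), max_error < 1e-9)
```

Recomputing each window from scratch with an FFT would cost O(N log N) per output; the recursion reuses the previous spectrum, which only pays off when outputs are needed at (nearly) every sample.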
Covell, Lim, "Low data-rate video conferencing," Proc. SPIE - The International Society for Optical Engineering, Cambridge MA, Sept 15-16 1986, vol 707 p 75-90.
A system is presented which enforces very low data rates, ensures a low maximum system delay, and allows for inexpensive decoders. To achieve this, the gray-scale video conferencing sequences were converted to binary (black/white) images. Spatial and temporal differencing and statistical coding were then used for compression, as were "focus-of-attention" methods of partial update. In this manner, faces and scenes were successfully encoded and decoded using simple processors at data rates below 19.2 kbaud (hard maximum) with no more than two frames of delay. This research was completed at MIT, with funding from the National Science Foundation Fellowship Program.
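The sketch below illustrates, under stated assumptions, why binary conversion followed by temporal differencing helps a statistical coder: when little changes between frames, the difference image is mostly zeros and has low entropy. The fixed threshold of 128, the use of XOR as the temporal difference, the first-order entropy estimate, and all function names are assumptions for this example; the paper's actual differencing and coding schemes are not specified here.

```python
import math


def binarize(frame, threshold=128):
    """Convert a gray-scale frame (rows of 0-255 values) to 0/1 pixels."""
    return [[1 if p >= threshold else 0 for p in row] for row in frame]


def temporal_difference(prev_bits, curr_bits):
    """XOR each pixel with the previous frame; 1 marks a changed pixel."""
    return [[a ^ b for a, b in zip(pr, cr)] for pr, cr in zip(prev_bits, curr_bits)]


def bits_per_pixel(bit_frame):
    """First-order empirical entropy: a rough coding cost under a memoryless model."""
    pixels = [p for row in bit_frame for p in row]
    p1 = sum(pixels) / len(pixels)
    if p1 in (0.0, 1.0):
        return 0.0
    return -(p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))


# Hypothetical 8x8 frames: bright left half, dark right half, one pixel changes.
prev_frame = [[200] * 4 + [50] * 4 for _ in range(8)]
curr_frame = [row[:] for row in prev_frame]
curr_frame[3][4] = 220

prev_bits = binarize(prev_frame)
curr_bits = binarize(curr_frame)
diff_bits = temporal_difference(prev_bits, curr_bits)

print(bits_per_pixel(curr_bits), bits_per_pixel(diff_bits))
```

The raw binary frame costs close to one bit per pixel under this memoryless estimate, while the difference frame, with a single changed pixel, costs far less.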
Covell, et al. "Automated analysis of multiple performance characteristics in magnetic resonance imaging systems," Medical Physics, 13(6): 815-823, Nov-Dec 1986.
An imaging and analysis regime is presented for automatically estimating MRI quality measures, such as slice thickness, separation, orientation, and focus. Results of the automated analysis for MRI system examples are in good agreement with expectations from theory and with more manual tests. This research was completed at U. Michigan Department of Radiological Physics and Engineering.
Covell, "Likelihood Ratio Approximation for Mixed Gaussian and Poisson Processes," MIT-LL TR 27L-0019, MIT-Lincoln Laboratory, 1986.