Computational Models of Vision

Summary: Eero Simoncelli constructs computational models of vision that are consistent with the properties of the visual world, the requirements of visual tasks, and the constraints of biological implementation.
I study vision by constructing computational models of biological visual processes. A successful model should be consistent with the observed behaviors of biological systems but should also take into account the properties of the visual environment and be based on a specific (albeit often hypothetical) functional goal or task. To fully understand biological visual processing, I find it valuable to implement engineering systems that solve the same computational problems. Thus, my work is inherently interdisciplinary, requiring empirical study of the structure of visual environments, construction of mathematical theories for representation and processing of that structure, implementation and simulation of biologically plausible instantiations of these theories, and physiological or psychophysical investigations that are motivated by the theories. Below, I describe a few recent projects.
Statistical Modeling of Visual Images
Much of my research effort has focused on issues of image representation. It has long been assumed that visual systems are adapted, at evolutionary, developmental, and behavioral timescales, to the images to which they are exposed. Since not all images are equally likely, it is natural to assume that the system should best process those images that occur most frequently. Thus, it is the statistical properties of the environment that are relevant for sensory processing. Such concepts are fundamental in engineering disciplines: compression, transmission, and enhancement of images all rely heavily on statistical models.
How can we determine the likelihood of occurrence of a given image, or portion of an image? The problem is difficult, because the set of all images is enormous. So we start with a few simplifying assumptions. First, structures in images occur at arbitrary sizes (dependent on distance from the observer), and it is thus intuitively sensible to analyze image content simultaneously at multiple scales. My work in this area began many years ago with a study of multiscale, multiorientation representations (now commonly referred to as "wavelets"). The basic properties of these representations bear a strong resemblance to the receptive fields derived from physiological measurements of neurons in primary visual cortex. Second, we examine local properties of images when decomposed into multiple scales. That is, we look for regularly occurring structures in localized patches of an image. In examining such patches over a large collection of natural images, we have uncovered surprising regularities that may be described using parametric probability models. We have applied these models to classical engineering problems of compression and noise removal and achieved state-of-the-art results.
These probability models also have a direct implication for neural representation. If one assumes (following the British physiologist Horace Barlow) that neurons in a population strive to produce statistically independent responses, the model we have developed suggests that the optimal representation of an image should proceed by decomposing with multiscale-oriented functions, followed by a divisive gain-control mechanism. Specifically, the response of each "neuron" should be divided by a weighted linear combination of the responses of cells at adjacent locations, orientations, and scales. Such divisive mechanisms (often called normalization models) have been widely used to account for the nonlinear response properties of neurons in primary visual cortex. As such, our statistical model provides the first theoretical justification for these cortical normalization models.
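The divisive gain-control step can be illustrated with a minimal sketch. This is not the laboratory's implementation: the function name, the squaring of responses before pooling, and the additive constant `sigma` are assumptions made for illustration, and the weight matrix here is arbitrary rather than derived from image statistics.

```python
import numpy as np

def divisive_normalization(responses, weights, sigma=1.0):
    """Divide each squared response by a weighted sum of squared neighbor responses.

    responses: 1-D array of linear filter outputs, one per model "neuron".
    weights:   (n, n) nonnegative matrix; weights[i, j] sets how strongly
               neuron j contributes to the normalization pool of neuron i.
    sigma:     hypothetical constant that keeps the denominator positive.
    """
    energy = np.asarray(responses, dtype=float) ** 2
    pool = sigma ** 2 + np.asarray(weights, dtype=float) @ energy
    return energy / pool

# Two neurons, each normalized only by the other.
normalized = divisive_normalization([1.0, 2.0], [[0.0, 1.0], [1.0, 0.0]])
print(normalized)  # approximately [0.2, 2.0]
```

In this toy case the weaker response is suppressed by its stronger neighbor, while the stronger response is divided by a smaller pool.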
Perhaps more importantly, the statistical measurements from natural images can be used to determine the optimal parameters of the model. We have used this observation to "derive" models that can account for recent physiological data on suppression from beyond the classical receptive field. These models also make predictions about the physiological and perceptual effects of adaptation. We have found that the same structures are present in natural sounds and that analogous models may be derived and compared with neurons of the auditory system. Thus, these models provide an opportunity for us to test directly (through physiological predictions and comparisons) the ecological hypothesis that neural computations are optimally matched to the statistics of the environment.
Functional Characterization of Neural Response
The functional properties of sensory neurons have traditionally been summarized using "receptive fields." But these do not provide a complete description of the response properties unless one makes additional simplifying assumptions (e.g., linearity). Furthermore, as sensory neuroscience research has been extended to areas that are farther removed from the sensory input, it has become increasingly difficult to describe the receptive fields of neurons, because it is difficult to construct parametric stimuli that elicit responses. Recently my laboratory has been developing new forms of stimuli and data analysis techniques for probing and characterizing neurons, specifically techniques for identifying and estimating various forms of nonlinear response behavior, such as short-timescale gain adjustments or nonlinearities associated with spike generation. We have applied these methodologies to data from mammalian retina and have used them to refine and test models of spiking in retinal ganglion cells. We are also using the statistical models described above to explore the generation of stochastic stimuli with "naturalistic" properties, which are then used as stimuli to characterize neurons in the ventral stream of the mammalian cortex.
Visual Motion Estimation
When a person moves within his or her environment, the visual image projected onto the retina changes accordingly. These changes may be described as two-dimensional translations of the local intensity pattern. Physiological and psychophysical experiments have established that mammalian visual systems contain mechanisms that are sensitive to such local translational motions, and theoretical and computational studies have confirmed that such translations carry important information about the environment. The basic model for motion representation that I have developed comes from a classical estimation-theoretic formulation of the problem. Assuming that the light intensity pattern falling on the retina over time undergoes local translational motion, and assuming a slight preference for slower speed interpretations, one can derive an optimal method of estimating image velocities. A computer implementation of this method produces a state-of-the-art algorithm for visual motion estimation, useful in various image-processing or computer vision tasks. Surprisingly, the method also provides an excellent description of human perception of local image velocity. In particular, we have shown that this model can account for human psychophysical data regarding the perception of a variety of moving patterns, as well as motion aftereffects.
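One common way to formalize this kind of estimate is sketched below. It is an illustration under stated assumptions, not the published algorithm: translation is expressed through the gradient (brightness-constancy) constraint with Gaussian measurement noise, the preference for slow speeds is modeled as a zero-mean Gaussian prior on velocity, and the names `estimate_velocity` and `prior_weight` are hypothetical. Under those assumptions the most probable velocity is the solution of a regularized least-squares problem.

```python
import numpy as np

def estimate_velocity(frame0, frame1, prior_weight=1e-3):
    """Estimate one (vx, vy) translation, in pixels per frame, for a patch.

    prior_weight is an assumed ratio of noise variance to prior variance;
    larger values pull the estimate more strongly toward zero speed.
    """
    f0 = np.asarray(frame0, dtype=float)
    f1 = np.asarray(frame1, dtype=float)
    # Spatial derivatives of the average frame; np.gradient returns the
    # derivative along rows (y) first, then along columns (x).
    iy, ix = np.gradient(0.5 * (f0 + f1))
    it = f1 - f0  # temporal derivative as a simple frame difference
    g = np.stack([ix.ravel(), iy.ravel()], axis=1)  # (pixels, 2) gradients
    m = g.T @ g                # 2x2 sum of gradient outer products
    b = g.T @ it.ravel()       # correlation of spatial and temporal derivatives
    # The prior adds prior_weight * I, so the system is always solvable.
    return -np.linalg.solve(m + prior_weight * np.eye(2), b)

# Synthetic pattern translating by (0.3, -0.2) pixels per frame.
yy, xx = np.mgrid[0:64, 0:64].astype(float)
def pattern(t):
    return np.sin(0.2 * (xx - 0.3 * t)) + np.cos(0.15 * (yy + 0.2 * t))

print(estimate_velocity(pattern(0.0), pattern(1.0)))
```

When the patch varies along only one direction, the gradient matrix alone is singular; the prior term then selects the slowest velocity consistent with the measurements.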
The method may also be instantiated as a physiological model for motion representation. This model is constructed in two stages of identical architecture, corresponding to neurons in visual cortical areas known as V1 and MT. The commonality of structure in the two stages is an attractive feature of the model, since it is often noted that the pattern of connections is similar across a variety of cortical areas. Computations in each stage are based on a linear receptive field, followed by a rectifying nonlinearity, and divisive normalization (in which the response of each cell is divided by the summed responses of other cells). The population response of the V1 neurons provides a distributed encoding of local spatiotemporal orientation, and the population response of the MT neurons provides a distributed encoding of the local image velocity.
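The shared architecture of the two stages can be sketched as a single function applied twice. This is a schematic under assumptions, not the fitted model: the weights below are random placeholders rather than V1 or MT receptive fields, the rectifier is taken to be half-squaring, and the normalization pool includes the cell itself plus a hypothetical constant `sigma`.

```python
import numpy as np

def stage(inputs, weights, sigma=0.1):
    """One model stage: linear receptive fields, rectification, normalization."""
    drive = np.asarray(weights, dtype=float) @ np.asarray(inputs, dtype=float)
    rectified = np.maximum(drive, 0.0) ** 2  # half-squaring (assumed form)
    # Divide every response by the summed responses of the population.
    return rectified / (sigma ** 2 + rectified.sum())

def two_stage_response(stimulus, v1_weights, mt_weights):
    """Apply the same stage twice; the second stage reads the first's output."""
    v1 = stage(stimulus, v1_weights)
    mt = stage(v1, mt_weights)
    return v1, mt

# Placeholder sizes: 16 stimulus samples, 8 "V1" units, 4 "MT" units.
rng = np.random.default_rng(0)
stimulus = rng.standard_normal(16)
v1, mt = two_stage_response(stimulus,
                            rng.standard_normal((8, 16)),
                            rng.standard_normal((4, 8)))
print(v1.shape, mt.shape)
```

Because both stages call the same function, any difference between the two populations in this sketch comes entirely from their weights.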
Simulations demonstrate that this two-stage model is remarkably consistent with a broad set of single-cell physiological data recorded in area MT. In addition, we have developed novel stochastic stimuli for which this model is an ideal detector. We have used such stimuli to characterize and probe the response properties of MT neurons. By examining human detection performance for such stimuli, we have produced strong evidence for the existence of such mechanisms in the human visual system.
This research has also been partially supported by the National Institutes of Health, a National Science Foundation CAREER grant, an Alfred P. Sloan Fellowship, and the Sloan-Swartz Center for Theoretical Visual Neuroscience at New York University.
Last updated July 23, 2008