

Computational Models of Vision

Research Summary

Eero Simoncelli constructs computational models of vision that are consistent with the properties of the visual world, the requirements of visual tasks, and the constraints of biological implementation.

Our sensory systems provide us with a remarkably reliable interpretation of the world, allowing us to make predictions and perform difficult tasks with surprising accuracy. How do these capabilities arise from the underlying neural circuitry? Specifically, how do populations of neurons encode sensory information, and how do subsequent populations extract that information for recognition, decisions, and action? From a more theoretical perspective, why do sensory systems use these particular representations? Our research aims to answer these questions, through a combination of computational theory and modeling, coupled with perceptual and physiological experiments. Our endeavors can be categorized into three general classes.

Optimal Encoding of Visual Information

It has long been assumed that visual systems are adapted, at evolutionary, developmental, and behavioral timescales, to the images to which they are exposed. Since not all images are equally likely, it is natural to assume that these systems use their limited resources to process best those images that occur most frequently. Thus, it is the statistical properties of the environment that are relevant for sensory processing. Such concepts are fundamental in engineering disciplines—compression, transmission, and enhancement of images all rely heavily on statistical models. Since the mid-1990s, we have developed successively more powerful models describing the statistical properties of local regions of natural images, demonstrated the power of these models by using them to develop state-of-the-art solutions to classical engineering problems of compression and noise removal, and used them in parallel to understand the structure and function of sensory systems.

Generally speaking, efficient coding of visual information depends on the probabilistic structure of the input, as well as the noise properties and resource limitations of the system. In the retina, for example, we find that subpopulations of retinal ganglion cells, whose output fibers form the optic nerve, cover the full spatial extent of the retina in a highly efficient manner, producing responses with moderate correlation so as to compensate for the deleterious effects of noise. The spiking responses of these cells also appear to be highly efficient, reserving metabolically costly, high-output levels for infrequently occurring events.

We have also been able to show that local interactions between neurons play a critical role in enhancing coding efficiency. Decades ago, the British physiologist Horace Barlow pointed out that the simplest forms of coding efficiency imply that neurons in a population should strive to reduce the statistical dependencies present in their inputs. The structures of natural images are such that responses of independently computed model neurons exhibit striking statistical dependencies. We have shown that these dependencies cannot be removed through simple (i.e., linear) interactions and recombinations, but require a nonlinear operation whereby responses are normalized by the pooled responses of similar neurons. Such local gain-control mechanisms have been documented in response properties of neurons in the early stages of the visual system, and we have shown that the statistical measurements from natural images can be used to determine the optimal parameters of the model. We have used this observation to "derive" models that can account for a broad range of physiological data. We have found that the same structures are present in natural sounds and that analogous models may be derived and compared with neurons of the auditory system.
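A minimal sketch of this gain-control computation is given below. The population size, exponent, and semi-saturation constant are arbitrary placeholders, not values fitted to natural-image statistics.

```python
import numpy as np

def divisive_normalization(linear_responses, sigma=0.1, exponent=2.0):
    """Rectify each model neuron's linear response and divide it by the
    pooled (summed) rectified activity of the population plus a constant."""
    rectified = np.maximum(linear_responses, 0.0) ** exponent
    return rectified / (sigma ** exponent + rectified.sum())

# Example: responses of four hypothetical filters to one image patch.
print(divisive_normalization(np.array([0.8, 0.3, -0.2, 1.5])))
```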

Most recently, we have shown that in an efficient population, the selectivity of neurons for visual attributes should be inversely proportional to the probability of occurrence of those attributes. This arrangement also places strong limitations on perceptual capabilities: the discriminability of these attributes should be proportional to the probability of their occurrence. These results are remarkable for their simplicity and provide clear and testable predictions. We have verified, through comparisons to electrophysiological measurements in monkeys and perceptual measurements in humans, that these predictions are supported for a number of visual attributes (local orientation, spatial frequency, temporal frequency, and velocity), as well as two auditory attributes (frequency and modulation frequency). These findings provide strong support for the ecological hypothesis that neural computations are well matched to the statistics of the environment.
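One way to make this relationship concrete is to place tuning curves at equal steps of the cumulative distribution of the attribute, so that cells are packed densely where the attribute is common and sparsely where it is rare. The sketch below does this for a hypothetical orientation prior favoring the cardinal directions; the prior shape and population size are illustrative, not the published derivation.

```python
import numpy as np

# Hypothetical prior over local orientation, peaked at the cardinal
# directions (0 and +/- pi/2), discretized on a fine grid.
s = np.linspace(-np.pi / 2, np.pi / 2, 1001)
prior = 1.0 + 0.5 * np.cos(4 * s)
prior /= prior.sum() * (s[1] - s[0])

# Place tuning-curve centers at equal steps of the cumulative prior:
# frequently occurring orientations get many closely spaced cells, so small
# changes there are easier to discriminate than equal changes elsewhere.
cdf = np.cumsum(prior) * (s[1] - s[0])
centers = np.interp(np.linspace(0.05, 0.95, 12), cdf, s)
print(np.round(centers, 3))
```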

Experimental Characterization of Neural and Perceptual Responses

Our models of sensory processing serve as precise instantiations of scientific hypotheses and must therefore be tested and refined through comparison to experimental measurements. A component of our work is aimed at developing new experimental paradigms, including novel stimuli and analysis methods, for such experiments.

In the retina, we find that a generalized linear model (GLM), in which spiking responses arise from the superposition of a filtered stimulus signal, a feedback signal (embodying refractoriness and other forms of suppression or excitation derived from the spike history), and a lateral connectivity signal (embodying influences from the spiking activity of other cells), provides a remarkably precise account of spike timing. We have developed methods for efficient and accurate determination of model parameters, and in a collaboration with E. J. Chichilnisky's lab (Salk Institute), we have fit these models to data from simultaneously recorded populations and used the fitted models to study the nature of information embedded in these spike trains.
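A stripped-down version of this model structure is sketched below: a filtered stimulus, a spike-history feedback term, and a coupling term from a neighboring cell are summed, passed through an exponential nonlinearity, and used to draw spikes bin by bin. All filter shapes and constants here are invented for illustration; they are not fitted parameters from the recordings.

```python
import numpy as np

rng = np.random.default_rng(0)

T, dt = 2000, 0.001                                     # time bins and bin width (s)
stimulus = rng.standard_normal(T)

stim_filter = 0.5 * np.exp(-np.arange(30) / 10.0)       # stimulus filter
history_filter = -3.0 * np.exp(-np.arange(20) / 5.0)    # post-spike suppression
coupling_filter = 0.8 * np.exp(-np.arange(20) / 5.0)    # influence of a neighboring cell
baseline = 3.0                                          # log of the baseline rate (spikes/s)

neighbor_spikes = (rng.random(T) < 0.02).astype(float)  # toy neighbor spike train
stim_drive = np.convolve(stimulus, stim_filter)[:T]
coupling_drive = np.convolve(neighbor_spikes, coupling_filter)[:T]

spikes = np.zeros(T, dtype=bool)
for t in range(T):
    history_drive = sum(history_filter[lag - 1]
                        for lag in range(1, min(t, len(history_filter)) + 1)
                        if spikes[t - lag])
    rate = np.exp(baseline + stim_drive[t] + coupling_drive[t] + history_drive)
    spikes[t] = rng.random() < min(rate * dt, 1.0)      # Bernoulli approximation to Poisson

print(int(spikes.sum()), "spikes in", T * dt, "seconds")
```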

In V1, in collaboration with J. Anthony Movshon's lab (New York University), we have used stochastic stimuli to characterize both the stimulus selectivity and gain-control signals of cells. The result is a unified model that expresses the surprising richness of computation across the population of V1 neurons. We have used these properties to construct a model of visual discriminability (the structural similarity index) that has become a de facto standard for assessing the visibility of errors in photographic images. The model thus serves as an engine for advancing engineering methodology (e.g., building better image compression or transmission systems), as well as scientific progress.
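The core comparison behind the structural similarity index is compact enough to sketch: it compares local luminance, contrast, and structure between a reference and a distorted patch. The version below operates on a single pair of patches; the full published index applies this within local windows across the image and averages the result.

```python
import numpy as np

def ssim_patch(x, y, data_range=1.0):
    """Structural-similarity comparison for one pair of image patches,
    using the standard stabilizing constants C1 = (0.01 L)^2, C2 = (0.03 L)^2."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (x.var() + y.var() + c2))

rng = np.random.default_rng(1)
patch = rng.random((8, 8))
noisy = np.clip(patch + 0.05 * rng.standard_normal((8, 8)), 0.0, 1.0)
print(ssim_patch(patch, patch))   # identical patches give 1.0
print(ssim_patch(patch, noisy))   # distortion lowers the score
```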

We have also been vigorously pursuing methods for describing neural responses in visual areas beyond V1, which can be loosely partitioned into two "pathways"—dorsal and ventral. The dorsal pathway (often called the motion pathway) contains neurons that respond to moving stimuli. Since the early 1990s, we have been developing and refining a physiological model for motion representation in the middle temporal area (MT) of the dorsal pathway. This model is constructed in two stages of identical architecture, corresponding to neurons in V1 and MT. The commonality of structure in the two stages is an attractive feature of the model, since it is often noted that the local circuitry is similar across cortical areas. Selectivity of neurons in each stage is primarily determined by linear receptive fields, whose responses are then shaped by a rectifying nonlinearity and by divisive normalization (in which the response of each cell is divided by the summed responses of other cells). Simulations demonstrate that this two-stage model is remarkably consistent with a broad set of single-cell physiological data recorded in area MT. More recently, we have developed targeted stochastic motion stimuli that allow us to characterize the specific properties of individual MT neurons in terms of their V1 afferents. By examining human detection performance for these stimuli, we have produced strong evidence for the existence of such mechanisms in the human visual system.
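The architecture of each stage can be summarized in a few lines: a bank of linear receptive fields, a rectifying nonlinearity, and divisive normalization, with the second stage operating on the outputs of the first. The weights below are random placeholders standing in for tuned V1 and MT receptive fields, so this sketch captures only the structure of the cascade, not its motion selectivity.

```python
import numpy as np

def ln_n_stage(inputs, weights, sigma=0.1):
    """One stage of the cascade: linear receptive fields, rectification
    (squaring), and divisive normalization across the stage's population."""
    linear = weights @ inputs
    rectified = np.maximum(linear, 0.0) ** 2
    return rectified / (sigma ** 2 + rectified.sum())

rng = np.random.default_rng(2)
stimulus = rng.standard_normal(64)            # stand-in for a local spatiotemporal input
w_v1 = rng.standard_normal((32, 64))          # illustrative V1 receptive fields
w_mt = rng.standard_normal((16, 32))          # illustrative MT weights on V1 afferents

v1_responses = ln_n_stage(stimulus, w_v1)     # first stage (V1-like)
mt_responses = ln_n_stage(v1_responses, w_mt) # second stage (MT-like), same architecture
print(mt_responses.round(4))
```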

The representations of the ventral stream (often called the form pathway) are not as well understood but are generally believed to be involved in the analysis and recognition of visual patterns. We have developed a model for the representation of visual texture and used it to synthesize texture images that humans perceive as similar. By coupling this model with known receptive field properties of the ventral stream (specifically, the growth of receptive field size with eccentricity), we have been able to generate new forms of stimuli that exhibit severe peripheral distortion (scrambling of visual patterns, and a complete loss of recognizability) but are indistinguishable from intact photographs. We have used perceptual experiments to determine the sizes of neural receptive fields underlying these ambiguities, which allows us to identify the locus of this representation as area V2. Physiological experiments are currently under way (in collaboration with the Movshon lab) to elucidate these neural mechanisms.
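The logic of synthesis-by-statistic-matching can be illustrated with a deliberately reduced statistic set: generate noise, then constrain it to share the original texture's Fourier amplitude spectrum and pixel histogram. The published model matches a much richer set of joint wavelet statistics, so this sketch conveys the approach rather than reproducing it.

```python
import numpy as np

def synthesize_texture(texture, rng):
    """Constrain random noise to match two statistics of the original image:
    its Fourier amplitude spectrum and its pixel histogram."""
    amplitude = np.abs(np.fft.fft2(texture))
    noise_phase = np.angle(np.fft.fft2(rng.standard_normal(texture.shape)))
    synth = np.real(np.fft.ifft2(amplitude * np.exp(1j * noise_phase)))
    # Impose the original pixel histogram by rank matching.
    matched = np.empty(texture.size)
    matched[np.argsort(synth, axis=None)] = np.sort(texture, axis=None)
    return matched.reshape(texture.shape)

rng = np.random.default_rng(3)
texture = rng.random((64, 64))          # stand-in for a real texture photograph
print(synthesize_texture(texture, rng).shape)
```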

Finally, we have developed a model for the representation of auditory "textures": sounds that arise from the superposition of many similar acoustic events, such as rain, a swarm of insects, or an audience applauding. The model begins with known processing stages from the auditory periphery and culminates with the measurement of simple statistics of the responses of these stages. To test this model, we generate synthetic sounds engineered to match the statistics of real-world textures and demonstrate that humans can recognize them nearly as well as their real-world counterparts. We hypothesize that the statistical measurements in our model reflect the sensitivities of downstream neural populations, and thus provide a framework for further physiological investigations.
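A toy version of the analysis side of this model is sketched below: the sound is split into a few frequency bands, slow amplitude envelopes are extracted, and each envelope is summarized by a handful of marginal statistics. The band edges, smoothing window, and choice of statistics are illustrative simplifications of the peripheral model and statistic set described above.

```python
import numpy as np

def texture_statistics(sound, sample_rate,
                       bands=((100, 400), (400, 1600), (1600, 6400))):
    """Split a sound into frequency bands, extract slow amplitude envelopes,
    and summarize each envelope with simple marginal statistics."""
    spectrum = np.fft.rfft(sound)
    freqs = np.fft.rfftfreq(len(sound), d=1.0 / sample_rate)
    smoothing = np.ones(int(0.02 * sample_rate))         # ~20 ms envelope smoothing
    smoothing /= smoothing.sum()
    stats = []
    for lo, hi in bands:
        band = np.fft.irfft(np.where((freqs >= lo) & (freqs < hi), spectrum, 0),
                            n=len(sound))
        envelope = np.convolve(np.abs(band), smoothing, mode="same")
        stats.append((envelope.mean(), envelope.var(),
                      ((envelope - envelope.mean()) ** 3).mean()))  # mean, variance, 3rd moment
    return np.array(stats)

rng = np.random.default_rng(4)
fake_rain = rng.standard_normal(2 * 44100)                # stand-in for a recorded texture
print(texture_statistics(fake_rain, 44100).round(4))
```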

Decoding of Neural Representations: Perceptual Inference and Bias

Our everyday experience deludes us into believing that perception is a direct reflection of the physical world around us. But scientists have recognized for centuries that it is more akin to a process of inference, in which incoming measurements are fused with internal expectations. In the 20th century, this concept was formalized in theories of statistical inference, and since the early 1990s we have used this framework to understand the means by which percepts arise from neural responses. An interesting example arises in the perception of retinal motion. If one assumes that the light intensity pattern falling on a local patch of retina is undergoing translational motion, that the neural representation of this information is noisy, and that, prior to any visual measurement, the distribution of retinal velocities typically encountered is broad but centered at zero (no motion), one can derive an optimal method of estimating image velocities. Although optimal, the resulting estimates are strongly biased toward slower speeds when the incoming stimulus is weakened (e.g., at low contrast).
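With Gaussian stand-ins for both the prior and the measurement noise, the optimal (posterior-mean) estimate has a closed form that makes the bias explicit: the measured speed is shrunk toward zero, and the shrinkage is stronger when the measurement is noisier. The particular widths and speeds below are arbitrary; only their ratio matters for the size of the bias.

```python
def estimate_speed(measured_speed, noise_std, prior_std=2.0):
    """Posterior-mean speed under a broad zero-mean Gaussian prior on velocity
    and Gaussian measurement noise: estimate = m * s_p^2 / (s_p^2 + s_n^2)."""
    shrinkage = prior_std ** 2 / (prior_std ** 2 + noise_std ** 2)
    return shrinkage * measured_speed

print(estimate_speed(4.0, noise_std=0.5))   # high contrast (low noise): ~3.8 deg/s
print(estimate_speed(4.0, noise_std=4.0))   # low contrast (high noise): 0.8 deg/s
```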

This behavior is also seen in humans, and we have used perceptual measurements to determine the internal preferences of human observers. We have obtained analogous results for human perception of local orientation, where observer preferences for horizontal and vertical orientations are well matched to their prevalence in the natural world. The inferential computations required for these percepts are compatible with the simple neural models described above, and our current work (both theoretical and experimental) aims to elucidate the means by which prior preferences can be learned and embedded in neural populations.

This research has also been partially supported by the National Institutes of Health, a National Science Foundation CAREER grant, an Alfred P. Sloan Fellowship, and the Sloan-Swartz Center for Theoretical Visual Neuroscience at New York University.

As of May 30, 2012
