Eero Simoncelli constructs computational models of vision that are consistent with the properties of the visual world, the requirements of visual tasks, and the constraints of biological implementation.
Our sensory systems provide a stable and reliable interpretation of the world around us, allowing us to make predictions, recognize patterns, and perform difficult tasks with remarkable accuracy. How do these capabilities arise from the underlying neural circuitry? Specifically, how do populations of neurons encode sensory information, and how do subsequent populations extract that information for recognition, decisions, and action? From a more theoretical perspective, why do sensory systems use these particular representations? Our research aims to answer these questions through computational theory and modeling, coupled with perceptual and physiological experiments. Our endeavors can be categorized into three general classes of research.
Encoding of Visual Information
It has long been assumed that visual systems are adapted, at evolutionary, developmental, and behavioral timescales, to the images to which they are exposed. Since not all images are equally likely, it is natural to assume that these systems use their limited resources to process best those images that occur most frequently. Thus, it is the statistical properties of the environment that are relevant for sensory processing. Such concepts are fundamental in engineering disciplines—compression, transmission, and enhancement of images all rely heavily on statistical models. Decades ago, the British physiologist Horace Barlow pointed out that such considerations also play a role in shaping biological sensory systems. Since the mid 1990s, we have developed successively more powerful models describing the statistical properties of local regions of natural images, and used these models both to develop state-of-the-art solutions to engineering problems, and to understand the structure and function of biological sensory systems.
In the retina, for example, we find that optimizing a model of retinal ganglion cells (whose output fibers form the optic nerve) so as to maximize the transmission of visual information leads to a solution in which the cells are partitioned into two separate sub-populations. Each sub-population consists of cells that respond to light intensity in local spatial regions, and the cells within each sub-population cover the full spatial extent of the retina. One sub-population responds to light intensities brighter than the background level, and one to those below the background level, and the response amplitudes of the two sub-populations scale differently with signal intensity, as has been measured physiologically.
We have also been able to show that local interactions between cortical neurons play a critical role in enhancing coding efficiency. The structures of natural images are such that responses of independently computed model neurons exhibit striking statistical dependencies. We have shown that these dependencies cannot be removed through simple (i.e., linear) interactions and recombinations, but require a nonlinear operation whereby responses are normalized by the pooled responses of similar neurons. Such local gain-control mechanisms have been documented in response properties of neurons in the early stages of the visual system, and we have shown that the statistical measurements from natural images can be used to determine the optimal parameters of the model. We have used this observation to "derive" models that can account for a broad range of physiological data. We have found that the same structures are present in natural sounds and that analogous models may be derived and compared with neurons of the auditory system.
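The normalization described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the fitted model: the synthetic responses share a random scale factor (a stand-in for the dependencies seen in natural images), the pooling weights are uniform rather than derived from image statistics, and the names `divisive_normalization`, `magnitude_correlation`, `weights`, and `sigma` are hypothetical.

```python
import numpy as np

def divisive_normalization(responses, weights, sigma=0.1):
    """Divide each response by the square root of a weighted pool of squared responses.

    responses: (n_samples, n_units) outputs of linear model neurons.
    weights:   (n_units, n_units) nonnegative pooling weights; weights[i, j] is
               the contribution of unit j to the pool of unit i (hypothetical values).
    sigma:     small constant that keeps the denominator positive.
    """
    energy = responses ** 2
    # pool[n, i] = sigma^2 + sum_j weights[i, j] * energy[n, j]
    pool = sigma ** 2 + energy @ weights.T
    return responses / np.sqrt(pool)

def magnitude_correlation(r, i=0, j=1):
    """Correlation between the response magnitudes of units i and j."""
    return float(np.corrcoef(np.abs(r[:, i]), np.abs(r[:, j]))[0, 1])

rng = np.random.default_rng(0)
n_samples, n_units = 20000, 16

# Assumed generative model: independent Gaussian variables multiplied by a
# shared random scale. The responses are linearly uncorrelated, but their
# magnitudes rise and fall together.
shared_scale = np.exp(0.5 * rng.standard_normal(n_samples))
linear = shared_scale[:, None] * rng.standard_normal((n_samples, n_units))

# Each unit is normalized by the average energy of all other units.
weights = (1.0 - np.eye(n_units)) / (n_units - 1)
normalized = divisive_normalization(linear, weights)

before = magnitude_correlation(linear)
after = magnitude_correlation(normalized)
print(f"magnitude correlation before: {before:.3f}, after: {after:.3f}")
```

In this toy setting the pool acts as an estimate of the shared scale, so dividing by it removes most of the magnitude dependency that no linear recombination could remove.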
Most recently, we have shown that in an efficient population, the selectivity of neurons for visual attributes should be inversely proportional to the probability of occurrence of those attributes. This arrangement also places strong limitations on perceptual capabilities: the discriminability of these attributes should be proportional to the probability of their occurrence. These results are remarkable for their simplicity and provide clear and testable predictions. We have verified, through comparisons to electrophysiological measurements in monkeys and perceptual measurements in humans, that these predictions are supported for a number of visual attributes (local orientation, spatial frequency, temporal frequency, and velocity), as well as two auditory attributes (frequency and modulation frequency). These results provide strong support for the ecological hypothesis that neural computations are well matched to the statistics of the environment.
Experimental Characterization of Neural and Perceptual Responses
Our models of sensory processing serve as precise instantiations of scientific hypotheses and must therefore be tested and refined through comparison to experimental measurements. A component of our work is aimed at developing new experimental paradigms, including novel stimuli and analysis methods, for such experiments.
In the retina, we find that a generalized linear model (GLM), in which spiking responses arise from the superposition of a filtered stimulus signal, a feedback signal (embodying refractoriness and other forms of suppression or excitation derived from the spike history), and a lateral connectivity signal (embodying influences from the spiking activity of other cells), provides a remarkably precise account of spike timing. We have developed methods for efficient and accurate determination of model parameters, and in a collaboration with Liam Paninski (Columbia U) and E. J. Chichilnisky (Stanford U), we have fit these models to data from simultaneously recorded populations and used the fitted models to study the nature of information embedded in these spike trains.
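The three signals that drive spiking in such a model can be made concrete with a forward simulation. The following is a minimal sketch, not the fitted retinal model: it assumes an exponential nonlinearity and Poisson spike counts per time bin, and every filter value, the bin width `dt`, and the function name `simulate_coupled_glm` are hypothetical choices made for illustration. It simulates responses only and does not perform parameter fitting.

```python
import numpy as np

def simulate_coupled_glm(stimulus, stim_filters, history_filters, bias,
                         dt=0.001, seed=0):
    """Simulate a small population of coupled GLM neurons.

    stimulus:        (T,) one-dimensional stimulus time series.
    stim_filters:    (n_cells, K) weights on the K most recent stimulus
                     samples (index 0 = current sample).
    history_filters: (n_cells, n_cells, H); history_filters[i, j, m] is the
                     effect on cell i of a spike from cell j, m + 1 bins ago.
                     i == j is the feedback term, i != j is lateral coupling.
    bias:            (n_cells,) log of the baseline firing rate in Hz.
    Returns spike counts (T, n_cells) and firing rates in Hz (T, n_cells).
    """
    rng = np.random.default_rng(seed)
    n_cells, K = stim_filters.shape
    H = history_filters.shape[2]
    T = stimulus.shape[0]
    padded_stim = np.concatenate([np.zeros(K - 1), stimulus])
    spikes = np.zeros((T + H, n_cells))  # first H rows are zero padding
    rates = np.zeros((T, n_cells))
    for t in range(T):
        recent_stim = padded_stim[t:t + K][::-1]   # most recent sample first
        stim_drive = stim_filters @ recent_stim
        recent_spikes = spikes[t:t + H][::-1]      # most recent bin first
        hist_drive = np.einsum("ijm,mj->i", history_filters, recent_spikes)
        # Assumed exponential nonlinearity on the summed signals.
        rates[t] = np.exp(bias + stim_drive + hist_drive)
        spikes[t + H] = rng.poisson(rates[t] * dt)
    return spikes[H:].astype(int), rates

# Hypothetical parameters: one ON-like and one OFF-like cell.
rng = np.random.default_rng(1)
stimulus = rng.standard_normal(5000)
stim_filters = np.array([[0.5, 0.3, 0.1], [-0.5, -0.3, -0.1]])
history_filters = np.zeros((2, 2, 5))
refractory = np.array([-10.0, -5.0, -2.0, -1.0, -0.5])
history_filters[0, 0] = refractory   # suppression after cell 0's own spikes
history_filters[1, 1] = refractory   # suppression after cell 1's own spikes
history_filters[0, 1] = 0.2          # weak excitation of cell 0 by cell 1
bias = np.log([20.0, 20.0])

spikes, rates = simulate_coupled_glm(stimulus, stim_filters, history_filters, bias)
print("spike counts per cell:", spikes.sum(axis=0))
```

Because the three signals are summed before the nonlinearity, each one scales the firing rate multiplicatively, which is how a strongly negative feedback filter produces refractoriness in this sketch.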
In V1, in collaboration with J. Anthony Movshon's lab (New York University), we have used stochastic stimuli to characterize both the stimulus selectivity and gain-control signals of cells. The result is a unified model that expresses the surprising richness of computation across the population of V1 neurons.
We have also been vigorously pursuing methods for describing neural responses in visual areas beyond V1, which can be loosely partitioned into two "pathways": dorsal and ventral. The dorsal pathway (often called the motion pathway) contains neurons that respond to moving stimuli. Since the early 1990s, we have been developing and refining a physiological model for motion representation in the middle temporal (MT) area of the dorsal pathway. This model is constructed in two stages of identical architecture, corresponding to neurons in V1 and MT. The commonality of structure in the two stages is an attractive feature of the model, since it is often noted that the local circuitry is similar across cortical areas. Selectivity of neurons in each stage is primarily determined by linear receptive fields, whose responses are then shaped by a rectifying nonlinearity and divisive normalization (in which the response of each cell is divided by the summed responses of other cells). Simulations demonstrate that this two-stage model is remarkably consistent with a broad set of single-cell physiological data recorded in area MT. More recently, we have developed targeted stochastic motion stimuli that allow us to characterize the specific properties of individual MT neurons in terms of their V1 afferents. By examining human detection performance for such stimuli, we have produced strong evidence for the existence of such mechanisms in the human visual system.
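The shared architecture of the two stages can be sketched by applying one function twice. This is only an illustration of the cascade structure under stated assumptions: the weights are random rather than motion-selective receptive fields, the rectifier is assumed to be half-wave rectification followed by squaring, the normalization pool is assumed to include the cell itself, and the names `stage`, `v1_weights`, and `mt_weights` are hypothetical.

```python
import numpy as np

def stage(inputs, linear_weights, sigma=1.0):
    """One model stage: linear receptive fields, rectification, normalization.

    inputs:         (n_samples, n_inputs) afferent signals.
    linear_weights: (n_units, n_inputs) linear receptive fields.
    sigma:          constant that keeps the denominator positive.
    """
    drive = inputs @ linear_weights.T            # linear receptive fields
    rectified = np.maximum(drive, 0.0) ** 2      # assumed half-squaring rectifier
    # Assumption: the pool sums over all units in the stage, including the
    # unit being normalized.
    pooled = rectified.sum(axis=1, keepdims=True)
    return rectified / (sigma ** 2 + pooled)

rng = np.random.default_rng(0)
stimulus = rng.standard_normal((100, 32))        # hypothetical stimulus vectors
v1_weights = rng.standard_normal((12, 32))       # placeholder V1 receptive fields
mt_weights = rng.standard_normal((4, 12))        # placeholder MT weights on V1 afferents

v1 = stage(stimulus, v1_weights)                 # first stage
mt = stage(v1, mt_weights)                       # second stage, same architecture
print(v1.shape, mt.shape)
```

Reusing `stage` for both areas mirrors the point made above: only the weights differ between the stages, not the form of the computation.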
The representations of the ventral stream (often called the form pathway) are not as well understood but are generally believed to be involved in the analysis and recognition of visual patterns. We have developed a model for the representation of visual texture and used it to synthesize texture images that humans perceive as similar. By coupling this model with known receptive field properties of the ventral stream (specifically, the growth of receptive field size with eccentricity), we have been able to generate new forms of stimuli that exhibit severe peripheral distortion (scrambling of visual patterns, and a complete loss of recognizability) but are indistinguishable from intact photographs. We have used perceptual experiments to determine the sizes of neural receptive fields underlying these ambiguities, which allows us to identify the locus of this representation as area V2. This led to a series of physiological experiments (in collaboration with the Movshon lab) in which we have used these stimuli to expose response properties of V2 neurons that are different from those in area V1.
Finally, we have developed a model for the representation of auditory “textures”: sounds that arise from the superposition of many similar acoustic events (examples include rain, a swarm of insects, or an audience applauding). The model begins with known processing stages from the auditory periphery and culminates with the measurement of simple statistics of these stages. To test this model, we generate synthetic sounds engineered to match the statistics of real-world textures, and demonstrate that humans can recognize these nearly as well as their real-world counterparts. We hypothesize that the statistical measurements in our model reflect sensitivities of downstream neural populations, and thus provide a framework for further physiological investigations.
Decoding of Neural Representations: Perceptual Implications
Our everyday experience deludes us into believing that perception is a direct reflection of the physical world around us. But scientists have recognized for centuries that it is more akin to a process of inference, in which incoming measurements are fused with internal expectations. In the 20th century, this concept was formalized in theories of statistical inference, and since the early 1990s we have used this framework to understand the means by which percepts arise from neural responses. An interesting example arises in the perception of retinal motion. If one assumes that the light intensity pattern falling on a local patch of retina is undergoing translational motion, that the neural representation of this information is noisy, and that in the absence of visual information, the distribution of retinal velocities that are typically encountered is broad but centered at zero (no motion), one can derive an optimal method of estimating image velocities. Although optimal, the resulting estimates are strongly biased toward slower speeds when the incoming stimulus is weakened (e.g., at low contrast).
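The bias toward slow speeds follows directly from combining a noisy measurement with a zero-centered prior. The sketch below assumes the simplest version of this argument: a Gaussian prior over velocity, Gaussian measurement noise, and a noise level that grows in inverse proportion to contrast. These distributional choices, the numerical values, and the name `posterior_mean_velocity` are assumptions made for illustration, not measured quantities.

```python
def posterior_mean_velocity(measured_velocity, contrast,
                            prior_sd=2.0, noise_sd_at_full_contrast=0.5):
    """Posterior-mean velocity estimate under a zero-mean Gaussian prior.

    measured_velocity: noisy velocity measurement (arbitrary units).
    contrast:          stimulus contrast in (0, 1]; must be positive.
    Assumption: measurement noise is Gaussian with a standard deviation
    inversely proportional to contrast.
    """
    noise_sd = noise_sd_at_full_contrast / contrast
    # With Gaussian prior and likelihood, the estimate is the measurement
    # shrunk toward zero by the ratio of prior variance to total variance.
    shrinkage = prior_sd ** 2 / (prior_sd ** 2 + noise_sd ** 2)
    return shrinkage * measured_velocity

for contrast in (1.0, 0.5, 0.1):
    estimate = posterior_mean_velocity(4.0, contrast)
    print(f"contrast {contrast:.1f}: estimated velocity {estimate:.3f}")
```

As contrast falls, the measurement becomes less reliable relative to the prior, so the same measured velocity yields a progressively slower estimate.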
This behavior is also seen in humans, and we have used perceptual measurements to determine the internal preferences of human observers. We have obtained analogous results for human perception of local orientation, where observer preferences for horizontal and vertical orientations are well matched to their prevalence in the natural world. The inferential computations required for these percepts are compatible with the simple neural models described above, and our current work (both theoretical and experimental) aims to elucidate the means by which prior preferences can be learned and embedded in neural populations.
Finally, the structure of neural representations has direct implications for our ability to perceive changes or distortions in sensory stimuli. We have developed models for quantifying the perceptual quality of distorted images, and these have become widely adopted throughout the image and video processing engineering communities as a means of assessing, calibrating, and improving systems that must transmit, store, or manipulate visual images. In the fall of 2015, we received an Emmy Award in recognition of the impact this work has had on the television industry.
This research has also been partially supported by the National Institutes of Health, the National Science Foundation, and the Simons Foundation.
As of March 2016