
Sarah Schwettmann Thesis Defense: Generalizable Representations for Vision in Biological and Artificial Neural Networks
Description
Zoom link: https://mit.zoom.us/j/3639522280
Speaker: Sarah Schwettmann, Torralba & Tenenbaum Labs
Abstract:
This thesis makes empirical and methodological progress toward closing the representational gap between human perception and generative models.
Human vision is characteristically flexible and generalizable. One of the persistent challenges of vision science is understanding the underlying representations that allow us to recognize objects and scene attributes across a diversity of environments. A central framework for identifying such representations is inverse graphics, which hypothesizes that the brain achieves robust scene understanding from image data by inverting generative models to recover their latent parameters. I demonstrate that we can directly test the biological plausibility of generative models by uncovering relevant neural representations in the human brain. For instance, if physical reasoning were implemented in the brain as probabilistic simulations of a mental physics engine, we would expect neural representations of physical properties like object mass to be abstract and invariant: useful as inputs to a forward model of objects and their dynamics. I present the first evidence that this is indeed the case: fMRI decoding analyses in brain regions implicated in intuitive physics reveal mass representations that generalize across variations in physical scene, material, friction, and motion energy.
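To make the logic of such a generalization test concrete, here is a minimal sketch of cross-condition decoding, not the thesis pipeline itself: a linear classifier is trained on voxel patterns from one scenario and evaluated on patterns from a held-out scenario, so that above-chance accuracy suggests a mass representation invariant to the varied factor. The data arrays, dimensions, and classifier choice below are all illustrative assumptions.

```python
# Sketch of cross-condition decoding with hypothetical data (not real fMRI data).
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Hypothetical trials-x-voxels pattern matrices and binary mass labels
# (light vs. heavy) for two scenarios differing in, e.g., material or motion.
X_train = rng.standard_normal((80, 500))   # scenario A patterns
y_train = rng.integers(0, 2, 80)           # scenario A mass labels
X_test = rng.standard_normal((40, 500))    # scenario B patterns
y_test = rng.integers(0, 2, 40)            # scenario B mass labels

# Fit a linear decoder on scenario A only, then test on scenario B.
decoder = make_pipeline(StandardScaler(), LinearSVC())
decoder.fit(X_train, y_train)
cross_condition_accuracy = decoder.score(X_test, y_test)
print(f"cross-condition decoding accuracy: {cross_condition_accuracy:.2f}")
```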
We can describe real-world physical scene and object understanding as inverse graphics because we know how to formalize the forward graphics model in a meaningful way, e.g., as a physics engine that vision inverts. However, this is not the case for other attributes of visual scenes, such as their style or mood, where the relationship between what is experienced and what counts as the image data is difficult to formalize and is not sufficiently explained by optical principles or invertible models of physics. How do we begin to get traction on how humans experience higher-level aspects of visual scenes, or recognize and appreciate meaningful structure that may be difficult to articulate?
I argue that large, flexible generative models for computer vision, which learn structure entirely from data, offer a promising setting for probing computational representations of human-interpretable concepts at different levels of abstraction. Attempts to interpret deep networks have traditionally searched only for predetermined sets of concepts, limiting the representations they can discover. I introduce a more data-driven approach to the interpretation question: a framework for building shared vocabularies, represented by deep networks and salient to humans, from the ground up. I present a procedure that uses human annotations to discover an open-ended set of visual concepts, ranging from low-level features of individual objects to high-level attributes of visual scenes, in the same representational space. In a series of experiments with human participants, I show that concepts learned with this approach are reliable and freely composable: they generalize across scenes and observers, and enable fine-grained manipulation of image style and content. Next, I introduce a learned captioning model that maps patterns of neuron activation to natural language strings, making it possible to generate open-ended, compositional descriptions of neuron function. Together, these approaches enable us to map between visual concepts in model representations and human perception, analyze models, and synthesize novel scenes that extrapolate dimensions of visual experience meaningful to observers.
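As a rough illustration of how human annotations can pick out a manipulable concept in a generative model, the sketch below is my own simplification rather than the thesis procedure: latents whose generated images annotators labeled as showing a concept are contrasted with latents labeled as not showing it, and the resulting direction is used to edit new samples. The generator interface, annotation labels, and difference-of-means estimator are assumptions made for illustration.

```python
# Sketch: derive a concept direction in latent space from human annotations.
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 512  # hypothetical latent dimensionality

# Hypothetical annotations: latents whose generated images human observers
# labeled "concept present" vs. "concept absent".
z_present = rng.standard_normal((200, latent_dim)) + 0.5
z_absent = rng.standard_normal((200, latent_dim))

# Difference-of-means concept direction, normalized to unit length.
direction = z_present.mean(axis=0) - z_absent.mean(axis=0)
direction /= np.linalg.norm(direction)

def edit_latent(z, alpha):
    """Move a latent code along the concept direction by strength alpha."""
    return z + alpha * direction

z_new = rng.standard_normal(latent_dim)
z_edited = edit_latent(z_new, alpha=3.0)  # pass to the generator to render the edited scene
```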