Compositional simulation in perception and cognition
Description
Despite rapid recent progress in machine perception and models of biological
perception, fundamental questions remain open. In particular, the paradigm
underlying these advances, pattern recognition, requires large amounts of
training data and struggles to generalize to situations outside the domain of
training. In this thesis, I focus on a broad class of perceptual concepts --
those that are generated by the composition of multiple causal processes, in
this case certain physical interactions -- that humans use essentially and
effortlessly in making sense of the world, but for which any specific instance
is extremely rare in our experience. Pattern recognition, or any strongly
learning-based approach, might then be an inappropriate way to understand
people's perceptual inferences. I propose an alternative approach, compositional
simulation, that can in principle account for these inferences, and I show in
practice that it provides both qualitative and quantitative explanatory value
for several experimental settings.
Consider a box containing some number of marbles, and imagine trying to guess
how many there are from the sound produced when the box is shaken. I
demonstrate that human observers are quite good at this task, even for subtle
numerical differences. Compositional simulation hypothesizes that people succeed
by leveraging internal causal models: they simulate the physical collisions that
would result from shaking the box (in a particular way), and what those
collisions would sound like, for different numbers of marbles. They then compare
their simulated sounds with the sound they heard. Crucially, these simulation
models can generalize to a wide range of percepts, even those never before
experienced, by exploiting the compositional structure of the causal processes
being modeled, in terms of objects and their interactions, and physical dynamics
and auditory events. Because the motion of the box is a key ingredient in
physical simulation, I hypothesize that people can take motion cues into
account in this task; I give evidence that they do.
I also consider the domain of objects covered by cloth: a similar simulation
mechanism should enable successful recognition even for unfamiliar covered
objects (such as airplanes). I show that people can succeed at this recognition
task, even when an object's covered shape differs greatly from its uncovered
one.
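The same analysis-by-synthesis logic can be sketched for covered objects: drape
a cloth model over each candidate shape and pick the candidate whose simulated
covered appearance best matches what is observed. The sketch below is again
illustrative and assumes drastic simplifications (1-D height profiles and a
smoothed-envelope "cloth" rather than a real cloth simulator).

    import numpy as np

    def drape(profile, window=9):
        # Toy "cloth": the cover follows a smoothed upper envelope
        # of the object's 1-D height profile.
        pad = window // 2
        padded = np.pad(profile, pad, mode="edge")
        envelope = np.array([padded[i:i + window].max()
                             for i in range(profile.size)])
        return np.convolve(envelope, np.ones(window) / window, mode="same")

    def recognize_covered(observed, shape_library):
        # Analysis-by-synthesis: drape every candidate shape and pick
        # the one whose simulated covered profile matches best.
        dists = {name: np.linalg.norm(drape(shape) - observed)
                 for name, shape in shape_library.items()}
        return min(dists, key=dists.get)

    x = np.linspace(0.0, 1.0, 100)
    library = {  # hypothetical height profiles for two categories
        "airplane": 0.2 * (np.abs(x - 0.5) < 0.4) + 0.6 * (np.abs(x - 0.5) < 0.05),
        "ball": np.maximum(0.0, 0.5 - 4.0 * (x - 0.5) ** 2),
    }
    observed = drape(library["airplane"])
    print(recognize_covered(observed, library))  # -> airplane

Note that the candidate is chosen by matching the draped, not the bare,
profiles: the covered shape can look quite unlike the object itself, which is
exactly the situation the human experiments probe.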
Finally, I show how compositional simulation provides a way to "glue together"
the data received by perception (images and sounds) with the contents of
cognition (objects). I apply compositional simulation to two cognitive domains:
children's intuitive exploration (obtaining quantitative predictions of
exploration time) and causal inference from audiovisual information.