Yudi Xie Thesis Defense: Neural Network Models of Objects in Visual Cognition
Description
Time: April 15 (Wed), 2026, 2 pm - 3 pm
Location: Singleton Auditorium (MIT 46-3002)
Zoom (if needed): https://mit.zoom.us/j/99375917132
Thesis advisor: Prof. James DiCarlo
Abstract:
Objects play an important role in the way we perceive and understand the world through vision. My thesis presents three research projects that use task-optimized neural network models to gain new insights into how the brain and mind represent, remember, and reason about objects from visual inputs.
First, we showed that training convolutional neural networks to estimate object spatial information, such as position and pose, rather than category, can also lead to representations aligned with the primate ventral visual stream. Furthermore, networks trained on spatial tasks and networks trained on categorization developed similar representations, especially in their early and middle layers. Our results suggest that ventral stream function may be versatile, and that one should not assume it is optimized for object categorization alone.
Second, we studied the origin of capacity limitations in working memory by developing image-computable neural network models. We found that human-like capacity limitations in visual working memory can be qualitatively explained by models with visual encoders pre-trained on natural images. In contrast, models without realistic constraints on sensory encoding did not exhibit human-like memory limitations. Our results suggest that limitations in visual working memory capacity may arise in part from constraints imposed by realistic sensory encoding.
Third, we studied how humans reason about occluded objects in vision and which models can account for this process. We developed a novel visual reasoning task that intuitively invites strategies such as imagining compatible shapes and ruling out alternative hypotheses. We found that purely feedforward convolutional neural networks can perform this task well and capture human biases in some conditions without being explicitly trained to do so. Our findings suggest that purely feedforward models can be surprisingly powerful, challenging the view that mechanisms beyond feedforward processing are necessarily required.
Taken together, these projects illustrate several ways in which neural network models can provide new insights into problems in visual cognition and neuroscience across different levels of analysis.