Does the primate ventral stream need cortical feedback to compute rapid object identity inferences?
Description
The primate ventral visual stream for object recognition contains prominent corticocortical feedback connections. However, the most accurate models of online, rapid (<200 ms) inference in the ventral stream are largely feedforward (hierarchical convolutional neural networks, HCNNs). Might the appropriate inclusion of feedback connections in those models improve their explanatory power? We reasoned that the impact of feedback connections would be most easily revealed in neural population activity at the top of the ventral visual hierarchy (inferior temporal cortex, IT), because the IT representation benefits from feedback connections along the entire hierarchy.
Because prior work shows that linear decoders accurately model IT's estimate of object identity, we could look for a neural signature that would imply a computationally critical role of feedback in online inference. Specifically, we hypothesized that, for images that require feedback circuits to resolve objects, IT's estimate of object identity should emerge later than for other images. To discover such images, we behaviorally tested both synthetic images and photographs to obtain two groups of images: those for which object identity is easily extracted by the primate brain but not solved by HCNNs ("challenge" images), and those that primates and models easily solve ("control" images). We then recorded IT population activity in two monkeys while they performed core object identity estimation (100 ms viewing) on each image (1,360 images, 10 possible objects, randomly interleaved to neutralize attention).
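The per-time-bin linear decoding described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the population size, bin count, simulated response model, and the one-vs-rest least-squares decoder are all assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_neurons, n_bins, n_objects = 400, 50, 20, 10  # hypothetical sizes

labels = rng.integers(0, n_objects, size=n_trials)
# Simulated spike counts: each object's population signature ramps up
# across time bins, buried in trial-by-trial noise.
signatures = rng.normal(size=(n_objects, n_neurons))
ramp = np.linspace(0.0, 1.0, n_bins)
rates = signatures[labels][:, :, None] * ramp[None, None, :]
counts = rates + rng.normal(scale=1.0, size=rates.shape)  # (trials, neurons, bins)

onehot = np.eye(n_objects)[labels]

def decode_accuracy(X, onehot, labels):
    """One-vs-rest least-squares linear decoder; returns training accuracy."""
    Xb = np.column_stack([X, np.ones(len(X))])       # append a bias column
    W, *_ = np.linalg.lstsq(Xb, onehot, rcond=None)
    return float((np.argmax(Xb @ W, axis=1) == labels).mean())

# Decode object identity separately in each time bin: accuracy should rise
# as the object signal ramps up, tracing out a "solution" trajectory.
accuracy = np.array([decode_accuracy(counts[:, :, t], onehot, labels)
                     for t in range(n_bins)])
```

In practice one would use cross-validated accuracy on held-out trials and a regularized decoder; the training-accuracy shortcut here only keeps the sketch short.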
We found that, in both monkeys, IT's solution (linear decode) for challenge images took ~20 ms longer to emerge than for control images. This difference could not be explained by differences in neural latency, firing rates, or low-level image properties such as contrast. These results imply a computationally important role for feedback in ventral stream object inference, and the image-by-image differences constrain the next generation of ventral stream models.
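One simple way to quantify the ~20 ms lag is to define each image group's "solution time" as the first time bin at which decode accuracy crosses a threshold, then take the difference. The sketch below uses toy logistic accuracy trajectories and a 0.7 threshold; the curves, threshold, and bin spacing are illustrative assumptions, not the study's actual data or criterion.

```python
import numpy as np

def solution_time(accuracy, times, threshold=0.7):
    """Return the first time at which accuracy reaches threshold (nan if never)."""
    above = np.flatnonzero(np.asarray(accuracy) >= threshold)
    return float(times[above[0]]) if above.size else float("nan")

times = np.arange(70, 270, 10)                            # ms after image onset
control   = 1.0 / (1.0 + np.exp(-(times - 130) / 15.0))   # toy accuracy curve
challenge = 1.0 / (1.0 + np.exp(-(times - 150) / 15.0))   # shifted 20 ms later

lag = solution_time(challenge, times) - solution_time(control, times)
print(lag)  # 20.0 (ms): the challenge solution emerges later than control
```

Other latency definitions (e.g. time to half-maximum accuracy, or bootstrap estimates of the crossing time) would serve the same purpose; the threshold-crossing rule is just the most compact to state.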