Computational & Systems Neuroscience (Cosyne) 2021, 18, Online
Deep neural networks (DNNs) optimized for real-world object classification are the leading models for predicting neural responses in inferior temporal cortex (IT). However, further optimizing DNN classification accuracy yields saturating gains in IT predictivity -- potentially because pure performance optimization favors representations that explicitly encode object class at the expense of other sources of image-by-image variance. Here, we performed an extensive meta-analysis of current DNNs to identify the representational properties underlying neural predictivity.
By examining an array of representational metrics -- including classification performance, sparsity, and dimensionality -- we identified two properties of DNN representations that were highly predictive of their match to IT neural data: factorization of scene-to-scene variance arising from (1) viewpoint changes, induced by taking crops of an image or varying camera position in a video, and (2) appearance transforms, induced by varying lighting and color. Factorization of (as opposed to invariance to) scene viewpoint and appearance each matched or exceeded ImageNet classification accuracy in predicting which models best fit high-level visual cortex across the four datasets tested -- two monkey neural datasets and two human fMRI datasets. Importantly, this metric predictivity generalized across a diverse range of DNNs with varied architectures and objectives (n=47 models).
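The abstract does not specify how factorization is computed; one plausible sketch, assuming a variance-partition definition (the fraction of variance driven by one factor, e.g. viewpoint, that lies outside the principal subspace of variance driven by another, e.g. object identity), is:

```python
import numpy as np

def factorization_score(resp_a, resp_b, n_pcs=5):
    """Illustrative (hypothetical) factorization score.

    resp_a: (samples, units) responses as factor A varies (e.g. viewpoint)
    resp_b: (samples, units) responses as factor B varies (e.g. identity)
    Returns the fraction of factor-A variance lying OUTSIDE the top
    principal subspace of factor-B variance:
    1 = fully factorized, 0 = fully entangled.
    """
    a = resp_a - resp_a.mean(axis=0)
    b = resp_b - resp_b.mean(axis=0)
    # Principal subspace of factor-B-driven variance
    _, _, vt = np.linalg.svd(b, full_matrices=False)
    basis = vt[:n_pcs]                       # (n_pcs, units)
    proj = a @ basis.T @ basis               # A-variance inside B-subspace
    return 1.0 - np.sum(proj ** 2) / np.sum(a ** 2)
```

Note that a representation can score high here while remaining sensitive to factor A, which is exactly what distinguishes factorization from invariance (where factor-A variance would simply be suppressed).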
Consistent with these insights, we found that the models best matching neural data were self-supervised models trained via contrastive learning on objectives closely related to scene viewpoint and appearance factorization. These models improved upon the neural fits of architecture-matched controls trained for object classification. Guided by these observations, we simplified contrastive objective functions to bring them closer to biological plausibility while still yielding representations that predicted neural data well. Our results thus revise the view that IT is best explained through the lens of invariant object classification, suggesting new candidate normative principles guiding representations in high-level visual cortex.
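The abstract does not give the authors' simplified objectives; a generic sketch of the kind of contrastive loss involved (an InfoNCE-style loss over two augmented views, where the augmentations correspond to viewpoint crops or appearance jitters) is:

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """Sketch of an InfoNCE contrastive loss (not the authors' exact objective).

    z1, z2: (batch, dim) embeddings of two views of the same scenes,
    e.g. two crops (viewpoint) or two color jitters (appearance).
    Row i of z1 and row i of z2 form a positive pair; all other rows
    of z2 serve as negatives.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature              # cosine similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    # cross-entropy with the positive pairs on the diagonal
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Minimizing this loss pulls the two views of each scene together while pushing apart views of different scenes, which encourages the network to carve view- and appearance-driven variance away from scene identity.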