Hierarchical feedforward processing makes object identity explicit at the highest stages of the ventral visual stream. We leveraged this computational goal to study the fine-scale temporal dynamics of neural populations in posterior and anterior inferior temporal cortex (pIT, aIT) during face detection. As expected, we found that a neural spiking preference for natural over distorted face images was rapidly produced, first in pIT and then in aIT. Strikingly, in the next 30 milliseconds of processing, this pattern of selectivity in pIT completely reversed, while selectivity in aIT remained unchanged. Although these dynamics were difficult to explain from a purely feedforward perspective, a model class that computes errors through feedback closely matched the observed neural dynamics and parsimoniously explained a range of seemingly disparate IT neural response phenomena. This new perspective augments the standard model of online vision by suggesting that neural signals of states (e.g., the likelihood of a face being present) are intermixed with the error signals found in deep hierarchical networks.