Abstract
Our visual surroundings are highly complex. Despite this, we understand and navigate them effortlessly. This requires a complex series of transformations resulting in representations that not only span low- to high-level visual features (e.g., contours, textures, object parts and objects), but likely also reflect co-occurrence statistics of objects in real-world scenes. Here, so-called anchor objects reflect clustering statistics in real-world scenes, anchoring predictions towards frequently co-occurring smaller objects, while so-called diagnostic objects predict the larger semantic context. We investigate which of these properties underlie scene understanding across two dimensions, realism and categorisation, using scenes generated by Generative Adversarial Networks (GANs), which naturally vary along these dimensions. We show that anchor objects and, predominantly, high-level features extracted from a range of pre-trained deep neural networks (DNNs) drove realism both at first glance and after initial processing. Categorisation performance was mainly determined by diagnostic objects, regardless of realism and DNN features, again both at first glance and after initial processing. Our results are a testament to the visual system's ability to pick up on reliable, category-specific sources of information that are robust to disturbances across the visual feature hierarchy.