The First Perception Test Challenge

Workshop at ICCV, October 2023


With the rise of large multimodal models (e.g. Flamingo, BEiT-3, GPT-4), integrated perception systems that can achieve human-level scene understanding may be on the horizon. Making progress towards this ambitious goal requires robust and comprehensive evaluation benchmarks and strategies to reveal the strengths and weaknesses (including biases) of these models and to guide research. Many benchmarks in the multimodal space have driven remarkable progress in the field, but each targets only restricted aspects of perception: image benchmarks exclude temporal aspects; visual question-answering tends to focus on image-level semantic understanding; object tracking tasks generally capture the lower-level appearance of individual objects, such as colour or texture. Some important aspects, such as memory skills or physics understanding, remain poorly covered.

The proposed challenge-workshop aims to benchmark multimodal perception models by organising a competition around the Perception Test benchmark (blog, github). The Perception Test is a diagnostic benchmark created by DeepMind to address the limitations of existing benchmarks mentioned above by comprehensively probing the abilities of multimodal models across video, audio, and text modalities, in four skill areas (Memory, Abstraction, Physics, Semantics), four types of reasoning (descriptive, explanatory, predictive, counterfactual), and six computational tasks (multiple-choice video-QA, grounded video-QA, object tracking, point tracking, action localisation, sound localisation). The training and public test sets were released in October 2022, and the held-out test set will be released together with the evaluation server for this competition.
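As a concrete illustration of the simplest of the six tasks, multiple-choice video-QA is typically scored by top-1 accuracy: the model picks one option per question, and the score is the fraction of questions answered correctly. A minimal sketch (not the official evaluation code; the `question_id`-keyed dictionaries and field names are illustrative assumptions, not the benchmark's actual annotation schema):

```python
def multiple_choice_accuracy(predictions, ground_truth):
    """Top-1 accuracy for multiple-choice QA.

    predictions / ground_truth: dicts mapping a question id to the
    index of the chosen / correct answer option. Hypothetical format,
    shown only to illustrate the metric.
    """
    correct = sum(
        1 for qid, answer in ground_truth.items()
        if predictions.get(qid) == answer
    )
    return correct / len(ground_truth)


# Toy example: 2 of 3 questions answered correctly.
preds = {"q1": 0, "q2": 2, "q3": 1}
gt = {"q1": 0, "q2": 1, "q3": 1}
print(multiple_choice_accuracy(preds, gt))  # 0.666...
```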

You can try the Perception Test yourself here.

Check the Perception Test GitHub repo for details about the data and annotation formats, baselines, and metrics.

Check the Computer Perception workshop at ECCV 2022 for recorded talks and slides introducing the Perception Test benchmark.


We will host the first Perception Test challenge with the following tracks:

  • single object tracking
  • point tracking
  • temporal action localisation
  • temporal sound localisation
  • multiple-choice video question-answering
  • grounded video question-answering
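The two temporal localisation tracks score predicted segments against ground-truth segments, and metrics of this kind are commonly built on temporal intersection-over-union (IoU). A minimal sketch of that building block, under the assumption that segments are given as `(start, end)` times in seconds (this is not the official evaluation code; see the GitHub repo for the actual metrics):

```python
def temporal_iou(seg_a, seg_b):
    """Intersection-over-union between two (start, end) segments in seconds."""
    start_a, end_a = seg_a
    start_b, end_b = seg_b
    intersection = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - intersection
    return intersection / union if union > 0 else 0.0


# Toy example: segments overlap for 2 s out of a 6 s union.
print(temporal_iou((2.0, 6.0), (4.0, 8.0)))  # ~0.333
```

A prediction is usually counted as correct when its IoU with a ground-truth segment exceeds a threshold, and results are averaged over several thresholds.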

Challenge server and held-out test set coming soon.

Details about the prizes coming soon.


  • June 15th, 2023: Challenge server goes live with data from the validation split
  • Beginning of August 2023: Held-out test split released
  • September 15th, 2023: Deadline for submissions
  • September 22nd, 2023: Winners announced
  • October 3rd, 2023: Challenge-workshop at ICCV 2023, Paris


Viorica Pătrăucean

Research Scientist at DeepMind.
Research: computer vision, scalable learning, biologically plausible learning.

Joao Carreira

Research Scientist at DeepMind.
Research: video processing, general perception systems.

Dima Damen

University of Bristol
Professor of Computer Vision.
Research: computer vision, video understanding, perception benchmarks.

Andrew Zisserman

University of Oxford
Professor of Computer Vision Engineering at Oxford and a Royal Society Research Professor.
Research: computer vision, machine learning.