With the rise of large multimodal models (e.g. Flamingo, BEiT-3, GPT-4), integrated perception systems that can achieve human-level scene understanding may be on the horizon. Making progress towards this ambitious goal requires robust and comprehensive evaluation benchmarks and strategies to reveal the strengths and weaknesses (including biases) of these models and to guide research. Many benchmarks in the multimodal space have driven impressive progress in the field, but each targets a restricted aspect of perception: image benchmarks exclude temporal aspects; visual question-answering tends to focus on image-level semantic understanding; object tracking tasks generally capture the lower-level appearance of individual objects, such as colour or texture. Some important aspects, such as memory skills or physics understanding, remain poorly covered.
The proposed challenge-workshop aims to benchmark multimodal perception models by organising a competition around the Perception Test benchmark (blog, github). The Perception Test is a diagnostic benchmark created by DeepMind to address the limitations of existing benchmarks mentioned above by comprehensively probing the abilities of multimodal models across video, audio, and text modalities, in four skill areas (Memory, Abstraction, Physics, Semantics), four types of reasoning (descriptive, explanatory, predictive, counterfactual), and six computational tasks (multiple-choice video-QA, grounded video-QA, object tracking, point tracking, action localisation, sound localisation). The training and public test sets were released in October 2022, and the held-out test set will be released together with the evaluation server for this competition.
You can try the Perception Test yourself here.
Check the Perception Test GitHub repo for details about the data and annotation formats, baselines, and metrics.
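As a concrete illustration of how a track like multiple-choice video-QA can be scored, the minimal Python sketch below loads ground-truth answers from a JSON annotation file and computes top-1 accuracy against a dictionary of predicted option indices. The JSON layout, the field names (`mc_question`, `id`, `answer_id`) and the file path are assumptions made for illustration only; the actual annotation schema and evaluation code are documented in the GitHub repo.

```python
import json


def load_questions(annotation_path):
    """Load multiple-choice video-QA ground truth.

    Assumes (hypothetically) a JSON file mapping video ids to annotations,
    each containing a list of 'mc_question' entries with an 'id' and the
    index of the correct option in 'answer_id'.
    """
    with open(annotation_path) as f:
        annotations = json.load(f)
    questions = {}
    for video_id, video_ann in annotations.items():
        for q in video_ann.get("mc_question", []):
            # Key each question by (video id, question id) so predictions can be matched.
            questions[(video_id, q["id"])] = q["answer_id"]
    return questions


def top1_accuracy(ground_truth, predictions):
    """Fraction of questions where the predicted option matches the answer."""
    correct = sum(
        predictions.get(key) == answer for key, answer in ground_truth.items()
    )
    return correct / max(len(ground_truth), 1)


if __name__ == "__main__":
    gt = load_questions("valid_annotations.json")  # hypothetical file name
    preds = {key: 0 for key in gt}                 # trivial baseline: always pick option 0
    print(f"top-1 accuracy: {top1_accuracy(gt, preds):.3f}")
```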
Check the Computer Perception workshop at ECCV2022 for recorded talks and slides introducing the Perception Test benchmark.
We will host the first Perception Test challenge with the following tracks (check the links below to access the eval.ai challenges):
Prizes totalling 15k EUR are available across all challenges.
We received 475 submissions from 63 teams across all six tracks. We awarded best-performance and runner-up prizes per track, plus two awards for the most novel submissions across tracks.
Single Object Tracking
Single Point Tracking
Temporal Action Localisation
Temporal Sound Localisation
Multiple-Choice Video Question-Answering
Grounded Video Question-Answering
Most novel submissions