The First Perception Test Challenge

Workshop at ICCV, October 3rd (AM), 2023

Workshop Stream

Overview

With the rise of large multimodal models (e.g. Flamingo, BEiT-3, GPT-4), integrated perception systems that can achieve human-level scene understanding may be on the horizon. Making progress towards this ambitious goal requires robust and comprehensive evaluation benchmarks and strategies to reveal the strengths and weaknesses (including biases) of these models and to guide research. There are many benchmarks in the multimodal space that have led to amazing progress in the field, but each one targets restricted aspects of perception: image benchmarks exclude temporal aspects; visual question-answering tends to focus mostly on image-level semantic understanding; object tracking tasks generally capture lower-level appearance of individual objects, like colour or texture. Some important aspects are poorly covered (e.g. memory skills or physics understanding).

The proposed challenge-workshop aims to benchmark multimodal perception models by organising a competition around the Perception Test benchmark (blog, github). The Perception Test is a diagnostic benchmark created by DeepMind to counteract some of the limitations of existing benchmarks mentioned above by comprehensively probing the abilities of multimodal models across video, audio, and text modalities, in four skill areas (Memory, Abstraction, Physics, Semantics), four types of reasoning (descriptive, explanatory, predictive, counterfactual), and six computational tasks (multiple-choice video-QA, grounded video-QA, object tracking, point tracking, action localisation, sound localisation). The training and public test set were released in October 2022, and the held-out test set will be released together with the evaluation server for this competition.

You can try the Perception Test yourself here.

Check the Perception Test github repo for details about the data and annotations format, baselines, and metrics.

Check the Computer Perception workshop at ECCV2022 for recorded talks and slides introducing the Perception Test benchmark.

Challenge

We will host the first Perception Test challenge with the following tracks (check the links to access the eval.ai challenges).

Prizes totalling 15k EUR are available across all challenges.

The First Perception Test Challenge winners

We received 475 submissions from 63 teams across all six tracks. We awarded best-performance and runner-up prizes per track, plus two awards for the most novel submissions across tracks.

Single Object Tracking

  • Best performance: Team X-Works (Baojun Li, Jiamian Huang, Tao Liu) [report]
  • Runner-up: Team sth (Limin Wang, Gangshan Wu, Yutao Cui, Tianhui Song) [report]

Single Point Tracking

  • Best performance: Team NJUST_KMG_Point (Hongpeng Pan, Yang Yang, Zhongtian Fu, Yuxuan Zhang, Shian Du, Yi Xu, and Xiangyang Ji) [report]
  • Runner-up: Team THETEAM (Han Zang, Tianyang Xu, Xue-Feng Zhu, Xiao-Jun Wu, Josef Kittler) [report]

Temporal Action Localisation

  • Best performance: Team CTCV (Xinmeng Zuo, Yuting Zhang, Ruijie Zhao, Jiang Liu, Hao Sun) [report]
  • Runner-up: Team OpenGVLab (Jiashuo Yu, Guo Chen, Yizhuo Li, Yali Wang, Limin Wang, Yu Qiao) [report]

Temporal Sound Localisation

  • Best performance: Team OpenGVLab (Jiashuo Yu, Guo Chen, Yizhuo Li, Yali Wang, Limin Wang, Yu Qiao) [report]
  • Runner-up: Team NJUST_KMG (Yurui Huang, Shuo Chen, Xinyan Wang, Yang Yang) [report]

Multiple-Choice Video Question-Answering

  • Best performance: Team hsslab_inspur (Baoyu Fan, Runze Zhang, Xiaochuan Li, Lu Liu, Li Wang, Zhenhua Guo, Yaqian Zhao, Rengang Li) [report]
  • Runner-up: Team TTgogogo (Dongshuai Li, Chenglei Dai) [report]

Grounded Video Question-Answering

  • Best performance: Team NJUST--KMG (Hailiang Zhang, Dian Chao, Zhihao Guan, Weili Guo, Yang Yang) [report]
  • Runner-up: Not awarded

Most novel submissions

  • 2nd place in Single Point Tracking: Team THETEAM (Han Zang, Tianyang Xu, Xue-Feng Zhu, Xiao-Jun Wu, Josef Kittler) [report]
  • 3rd place in Temporal Sound Localisation: Team JNU_boat (Linze Li, Rongchang Li, Tianyang Xu, Xiao-Jun Wu, Josef Kittler) [report]

Timeline

  • June 15th - August 1st, 2023: Challenge server goes live with data from the validation split
  • August 1st, 2023: Held-out test split released
  • September 15th, 2023: Deadline for submissions
  • September 22nd, 2023: Winners announced
  • October 3rd, 2023: Challenge-workshop at ICCV2023, Paris

Workshop

Agenda

  • 09:00 - 09:15 Welcome and introduction
  • 09:15 - 09:45 Overview of Perception Test
  • 09:45 - 10:15 Keynote: Derek Hoiem - Measuring and Improving Learning Ability
  • 10:15 - 10:45 Coffee break
  • 10:45 - 11:00 Challenges overview and winner announcement
  • 11:00 - 11:45 Oral presentations from the challenge winners
  • 11:45 - 12:15 Keynote: Rohit Girdhar - Evaluating Next-Gen Perception Models
  • 12:15 - 13:00 Roundtable and closing notes

Speakers

Organizers

Viorica Pătrăucean

DeepMind
Research Scientist at DeepMind.
Research: computer vision, scalable learning, biologically plausible learning.

Joao Carreira

DeepMind
Research Scientist at DeepMind.
Research: video processing, general perception systems.

Dima Damen

University of Bristol
Professor of Computer Vision.
Research: computer vision, video understanding, perception benchmarks.

Andrew Zisserman

University of Oxford
Professor of Computer Vision Engineering at Oxford and a Royal Society Research Professor.
Research: computer vision, machine learning.

Joe Heyward

DeepMind
Research Engineer at DeepMind.
Research: computer vision.