The First Perception Test Challenge

Workshop at ICCV, October 3rd (AM), 2023

Workshop Stream

Overview

With the rise of large multimodal models (e.g. Flamingo, BEiT-3, GPT-4), integrated perception systems that can achieve human-level scene understanding may be on the horizon. Making progress towards this ambitious goal requires robust and comprehensive evaluation benchmarks and strategies to reveal the strengths and weaknesses (including biases) of these models and to guide research. There are many benchmarks in the multimodal space that have led to amazing progress in the field, but each one targets restricted aspects of perception: image benchmarks exclude temporal aspects; visual question-answering tends to focus mostly on image-level semantic understanding; object tracking tasks generally capture lower-level appearance of individual objects, like colour or texture. Some important aspects are poorly covered (e.g. memory skills or physics understanding).

The proposed challenge-workshop aims to benchmark multimodal perception models by organising a competition around the Perception Test benchmark (blog, github). The Perception Test is a diagnostic benchmark created by DeepMind to counteract some of the limitations of existing benchmarks mentioned above by comprehensively probing the abilities of multimodal models across video, audio, and text modalities, in four skill areas (Memory, Abstraction, Physics, Semantics), four types of reasoning (descriptive, explanatory, predictive, counterfactual), and six computational tasks (multiple-choice video-QA, grounded video-QA, object tracking, point tracking, action localisation, sound localisation). The training and public test set were released in October 2022, and the held-out test set will be released together with the evaluation server for this competition.

You can try the Perception Test yourself here.

Check the Perception Test github repo for details about the data and annotations format, baselines, and metrics.

Check the Computer Perception workshop at ECCV2022 for recorded talks and slides introducing the Perception Test benchmark.

Challenge

We will host the first Perception Test challenge with the following tracks (check the links to access the eval.ai challenges).

Prizes totalling 15k EUR are available across all challenges.

The First Perception Test Challenge winners

We received 475 submissions from 63 teams across all six tracks. We awarded best-performance and runner-up prizes per track, plus two awards for the most novel submissions across tracks.

Single Object Tracking

  • Best performance: Team X-Works (Baojun Li, Jiamian Huang, Tao Liu) [report]
  • Runner-up: Team sth (Limin Wang, Gangshan Wu, Yutao Cui, Tianhui Song) [report]

Single Point Tracking

  • Best performance: Team NJUST_KMG_Point (Hongpeng Pan, Yang Yang, Zhongtian Fu, Yuxuan Zhang, Shian Du, Yi Xu, and Xiangyang Ji) [report]
  • Runner-up: Team THETEAM (Han Zang, Tianyang Xu, Xue-Feng Zhu, Xiao-Jun Wu, Josef Kittler) [report]

Temporal Action Localisation

  • Best performance: Team CTCV (Xinmeng Zuo, Yuting Zhang, Ruijie Zhao, Jiang Liu, Hao Sun) [report]
  • Runner-up: Team OpenGVLab (Jiashuo Yu, Guo Chen, Yizhuo Li, Yali Wang, Limin Wang, Yu Qiao) [report]

Temporal Sound Localisation

  • Best performance: Team OpenGVLab (Jiashuo Yu, Guo Chen, Yizhuo Li, Yali Wang, Limin Wang, Yu Qiao) [report]
  • Runner-up: Team NJUST_KMG (Yurui Huang, Shuo Chen, Xinyan Wang, Yang Yang) [report]

Multiple-Choice Video Question-Answering

  • Best performance: Team hsslab_inspur (Baoyu Fan, Runze Zhang, Xiaochuan Li, Lu Liu, Li Wang, Zhenhua Guo, Yaqian Zhao, Rengang Li) [report]
  • Runner-up: Team TTgogogo (Dongshuai Li, Chenglei Dai) [report]

Grounded Video Question-Answering

  • Best performance: Team NJUST--KMG (Hailiang Zhang, Dian Chao, Zhihao Guan, Weili Guo, Yang Yang) [report]
  • Runner-up: Not awarded

Most novel submissions

  • 2nd place in Single Point Tracking: Team THETEAM (Han Zang, Tianyang Xu, Xue-Feng Zhu, Xiao-Jun Wu, Josef Kittler) [report]
  • 3rd place in Temporal Sound Localisation: Team JNU_boat (Linze Li, Rongchang Li, Tianyang Xu, Xiao-Jun Wu, Josef Kittler) [report]

Timeline

  • June 15th - August 1st, 2023: Challenge server goes live with data from the validation split
  • August 1st, 2023: Held-out test split released
  • September 15th, 2023: Deadline for submissions
  • September 22nd, 2023: Winners announced
  • October 3rd, 2023: Challenge-workshop at ICCV2023, Paris

Workshop

Agenda

  • 09:00 - 09:15 Welcome and introduction
  • 09:15 - 09:45 Overview of Perception Test
  • 09:45 - 10:15 Keynote: Derek Hoiem - Measuring and Improving Learning Ability
  • 10:15 - 10:45 Coffee break
  • 10:45 - 11:00 Challenges overview and winner announcement
  • 11:00 - 11:45 Oral presentations from the challenge winners
  • 11:45 - 12:15 Keynote: Rohit Girdhar - Evaluating Next-Gen Perception Models
  • 12:15 - 13:00 Roundtable and closing notes

Speakers

Organizers

Viorica Pătrăucean

DeepMind
Research Scientist at DeepMind.
Research: computer vision, scalable learning, biologically plausible learning.

Joao Carreira

DeepMind
Research Scientist at DeepMind.
Research: video processing, general perception systems.

Dima Damen

University of Bristol
Professor of Computer Vision.
Research: computer vision, video understanding, perception benchmarks.

Andrew Zisserman

University of Oxford
Professor of Computer Vision Engineering at Oxford and a Royal Society Research Professor.
Research: computer vision, machine learning.

Joe Heyward

DeepMind
Research Engineer at DeepMind.
Research: computer vision.