1. Center for Vision, Cognition, Learning, and Autonomy, UCLA
2. Jet Propulsion Laboratory, Caltech
3. School of EECS, Oregon State University
With the advent of drones, aerial video analysis is becoming increasingly important; yet, it has received scant attention in the literature. This project addresses a new problem of parsing low-resolution aerial videos of large spatial areas, in terms of grouping and assigning roles to people and objects engaged in events, and recognizing these events. Due to low resolution and top-down views, person detection and tracking, the standard input to recent approaches to event recognition, are very unreliable. We address these challenges with a novel framework that performs joint inference over the above tasks, as reasoning about each in isolation typically fails in our setting. Given noisy tracklets of people and detections of large objects and scene surfaces (e.g., building, grass), we use a spatiotemporal AND-OR graph to drive our joint inference, using Markov Chain Monte Carlo and dynamic programming. We introduce a new formalism of deformable templates characterizing latent sub-events. For evaluation, we have collected a new set of aerial videos using a hex-rotor flying over picnic areas rich with group events. Our results demonstrate that we successfully address the above inference tasks under challenging conditions.
Tianmin Shu, Dan Xie, Brandon Rothrock, Sinisa Todorovic and Song-Chun Zhu. Joint inference of groups, events and human roles in aerial videos. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. (Oral) [pdf] [supp] [slides] [video]
@inproceedings{ShuCVPR15,
  title     = {Joint Inference of Groups, Events and Human Roles in Aerial Videos},
  author    = {Tianmin Shu and Dan Xie and Brandon Rothrock and Sinisa Todorovic and Song-Chun Zhu},
  year      = {2015},
  booktitle = {CVPR}
}
At UCLA, we assembled a new low-cost hex-rotor with a GoPro camera, which eliminates high-frequency vibration of the camera and holds its position in the air autonomously using GPS and a barometer. It can fly 20 to 90 m above the ground and stay in the air for 5 minutes. We used this hex-rotor to record a set of videos with scripted plots at a park with interesting terrain: hiking routes, parking lots, camping sites, and picnic areas with shelters, restrooms, tables, trash bins, and BBQ ovens. By detecting and tracking humans and objects in the videos, we can recognize events such as BBQ, queuing, exchanging objects, and loading/unloading.
We collected scripted events involving interactions between humans and objects at two different sites. The original videos are pre-processed with camera calibration and frame registration. After pre-processing, the dataset contains 27 videos in total, ranging from 2 to 5 minutes in length. We annotate the hierarchical semantic information of objects, roles, events, and groups in the videos.
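To illustrate what frame registration provides, here is a minimal sketch that assumes each frame is related to a common reference frame by a 3x3 planar homography. That model is a common choice for near-planar aerial scenes, but the page does not say which registration method was actually used. The function names `apply_homography` and `register_tracklet` and the example matrix are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Matrix3 = Sequence[Sequence[float]]  # 3x3 homography, row-major (assumed model)


def apply_homography(H: Matrix3, point: Point) -> Point:
    """Map a pixel (x, y) into the reference frame using homography H."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to get back to pixel units.
    return (u / w, v / w)


def register_tracklet(H: Matrix3, tracklet: List[Point]) -> List[Point]:
    """Map every point of a tracklet observed in one frame into the reference frame."""
    return [apply_homography(H, p) for p in tracklet]


# Hypothetical example: a pure translation of (+5, -3) pixels.
H_shift = [[1.0, 0.0, 5.0],
           [0.0, 1.0, -3.0],
           [0.0, 0.0, 1.0]]
registered = register_tracklet(H_shift, [(10.0, 10.0), (0.0, 0.0)])
```

Once all frames are expressed in one coordinate system, person and object positions from different frames can be compared directly, which is what makes tracklets and group annotations meaningful despite camera motion.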
Image of our hex-rotor in the air with a GoPro camera.
A frame of the original aerial videos from site A.
A frame of the original aerial videos from site B.
The annotation in our dataset includes individuals, objects, groups, events, human roles, and goals (destinations). There are 12 events, 18 human roles, and 12 object categories. (In our CVPR 2015 paper, we did not investigate the "Inspection Hide" group events or individual goals.)
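The annotation hierarchy described above could be held in memory roughly as follows. This is only a sketch: the class names, field names, bounding-box convention, and role/category labels are all hypothetical and do not reflect the actual file format of the dataset.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x, y, width, height) in pixels (assumed)


@dataclass
class PersonTrack:
    track_id: int
    boxes: Dict[int, BBox]                       # frame index -> bounding box
    role: Optional[str] = None                   # one of the 18 human roles
    goal: Optional[Tuple[float, float]] = None   # destination, if annotated


@dataclass
class ObjectAnnotation:
    object_id: int
    category: str                                # one of the 12 object categories
    box: BBox


@dataclass
class GroupEvent:
    event: str                                   # one of the 12 event labels
    start_frame: int
    end_frame: int
    member_ids: List[int]                        # PersonTrack ids in the group
    object_ids: List[int] = field(default_factory=list)


def roles_in_group(group: GroupEvent, people: List[PersonTrack]) -> Dict[int, Optional[str]]:
    """Return the role of each annotated group member, keyed by track id."""
    by_id = {p.track_id: p for p in people}
    return {pid: by_id[pid].role for pid in group.member_ids if pid in by_id}


# Hypothetical example; the role and category strings are placeholders.
people = [
    PersonTrack(1, {0: (10.0, 20.0, 8.0, 8.0)}, role="role_a"),
    PersonTrack(2, {0: (30.0, 22.0, 8.0, 8.0)}, role="role_b"),
    PersonTrack(3, {0: (90.0, 70.0, 8.0, 8.0)}),
]
table = ObjectAnnotation(100, "table", (12.0, 18.0, 20.0, 10.0))
group = GroupEvent("queuing", start_frame=0, end_frame=120,
                   member_ids=[1, 2], object_ids=[table.object_id])
group_roles = roles_in_group(group, people)
```

Keeping roles on the person tracks and membership on the group keeps the hierarchy explicit: an event owns a group, a group references people and objects, and each person carries a role and an optional goal.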
The dataset is available for free to researchers from academic institutions (e.g., universities and government research labs) for non-commercial purposes. To obtain the dataset, please submit the request form with information about your affiliated organization. The information provided on this form will not be distributed. Upon receipt and approval of your request, we will send you the download instructions as soon as we can.
We greatly appreciate emails about bugs or suggestions.
Please cite this paper if you use the dataset:
Tianmin Shu, Dan Xie, Brandon Rothrock, Sinisa Todorovic and Song-Chun Zhu. Joint inference of groups, events and human roles in aerial videos. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.