HiEve
Tasks: 2D Box Tracking | Pose | 2D Keypoints | Action/Event Detection | ...
License: Unknown

Overview

This grand challenge aims to advance large-scale human-centric video analysis of complex events using multimedia techniques. We propose the largest existing dataset (named Human-in-Events, or HiEve) for understanding human motion, pose, and action in a variety of realistic events, especially crowded and complex events. Four challenging tasks are established on our dataset, encouraging researchers to address very challenging and realistic problems in human-centric analysis. The challenge will benefit research in a wide range of multimedia and computer vision areas, including multimedia content analysis.

Data Collection

We start by selecting several crowded places with complex and diverse events for video collection. In total, our video sequences are collected from 9 different scenes: airport, dining hall, indoor, jail, mall, square, school, station, and street. Most of these videos are selected from our own private sequences and contain complex interactions between persons. To further increase the variety and complexity of the behaviors in the videos, we also searched YouTube for videos recording unusual scenes (e.g. jail, factory) and anomalous events (e.g. fighting, earthquake, robbery). For each scene, we keep several videos captured at different sites and with different types of events to ensure the diversity of scenarios, and data redundancy is avoided through manual checking. To protect the privacy of the relevant people and organizations, we blurred faces and key text in the videos. Finally, 32 real-world video sequences in different scenes are collected, each containing one or more complex events. These sequences are carefully split into a training set of 19 videos and a testing set of 13 videos, so that both sets cover all the scenes but with different camera angles or sites.

Data Annotation

In our dataset, the bounding boxes, keypoint-based poses, human identities, and human actions are all manually annotated. The annotation procedure is as follows:
First, similar to the MOT dataset, we annotate bounding boxes for all moving pedestrians (e.g. running, walking, fighting, riding) and static people (e.g. standing, sitting, lying). A unique track ID is assigned to each person until they move out of the camera's field of view.
Second, we annotate poses for each person throughout the entire video. Different from PoseTrack and COCO, our annotated pose for each body contains 14 keypoints (Figure 2a): nose, chest, shoulders, elbows, wrists, hips, knees, and ankles. Specifically, we skip pose annotation in either of the following conditions: (1) heavy occlusion, or (2) the area of the bounding box is less than 500 pixels. Figure 2b presents some pose and bounding-box annotation examples.
Third, we annotate the actions of all individuals every 20 frames in a video. For group actions, we assign the action label to each group member involved in the group activity. In total, we define 14 action categories: walking-alone, walking-together, running-alone, running-together, riding, sitting-talking, sitting-alone, queuing, standing-alone, gathering, fighting, fall-over, walking-up-down-stairs, and crouching-bowing. Finally, all annotations are double-checked to ensure their quality. A sketch of the resulting per-person annotation record is given below.
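To make the annotation schema above concrete, the following is a minimal, illustrative Python sketch of how one per-person record could be represented. The dataclass layout, field names, and the left/right ordering of the keypoints are assumptions for exposition only; they do not reflect the dataset's actual file format.

```python
# Illustrative sketch only: one per-person annotation as described above.
# Field names, keypoint ordering, and the dataclass layout are assumptions,
# not HiEve's actual annotation file format.
from dataclasses import dataclass
from typing import List, Optional, Tuple

# 14 keypoints per pose (left/right ordering is an assumption).
KEYPOINT_NAMES = [
    "nose", "chest",
    "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow",
    "left_wrist", "right_wrist",
    "left_hip", "right_hip",
    "left_knee", "right_knee",
    "left_ankle", "right_ankle",
]

# 14 action categories, annotated every 20 frames.
ACTION_CATEGORIES = [
    "walking-alone", "walking-together", "running-alone", "running-together",
    "riding", "sitting-talking", "sitting-alone", "queuing", "standing-alone",
    "gathering", "fighting", "fall-over", "walking-up-down-stairs",
    "crouching-bowing",
]

@dataclass
class PersonAnnotation:
    """One person in one annotated frame."""
    frame_id: int
    track_id: int                                    # stable ID while the person is in view
    bbox: Tuple[float, float, float, float]          # (x, y, w, h) in pixels
    keypoints: Optional[List[Tuple[float, float]]]   # 14 (x, y) points, or None if the pose
                                                     # was skipped (heavy occlusion / box < 500 px)
    action: Optional[str] = None                     # one of ACTION_CATEGORIES, present only
                                                     # on frames with action annotations

# Example usage with made-up values:
person = PersonAnnotation(
    frame_id=40,
    track_id=7,
    bbox=(120.0, 80.0, 45.0, 110.0),
    keypoints=[(0.0, 0.0)] * len(KEYPOINT_NAMES),
    action="walking-together",
)
assert len(KEYPOINT_NAMES) == 14 and len(ACTION_CATEGORIES) == 14
```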

Citation

@misc{lin2020human,
      title={Human in Events: A Large-Scale Benchmark for Human-centric Video Analysis in Complex Events},
      author={Weiyao Lin and Huabin Liu and Shizhan Liu and Yuxi Li and Guo-Jun Qi and Rui Qian and Tao Wang and Nicu Sebe and Ning Xu and Hongkai Xiong and Mubarak Shah},
      year={2020},
      eprint={2005.04490},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}