Action Search: Spotting Actions in Videos and Its Application to Temporal Action Localization
KAUST Department: Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division; Computer Science Program; Electrical Engineering Program; Visual Computing Center (VCC)
KAUST Grant Number: OSR-CRG2017-3405
Online Publication Date: 2018-10-05
Print Publication Date: 2018
Permanent link to this record: http://hdl.handle.net/10754/630233
Abstract: State-of-the-art temporal action detectors inefficiently search the entire video for specific actions. Despite the encouraging progress these methods achieve, it is crucial to design automated approaches that explore only the parts of the video most relevant to the actions being searched for. To address this need, we propose the new problem of action spotting in video, which we define as finding a specific action in a video while observing only a small portion of that video. Inspired by the observation that humans are extremely efficient and accurate in spotting and finding action instances in video, we propose Action Search, a novel Recurrent Neural Network approach that mimics the way humans spot actions. Moreover, to address the absence of data recording the behavior of human annotators, we put forward the Human Searches dataset, which compiles the search sequences employed by human annotators spotting actions in the AVA and THUMOS14 datasets. We consider temporal action localization as an application of the action spotting problem. Experiments on the THUMOS14 dataset reveal that our model not only explores the video efficiently (observing on average 17.3% of the video) but also accurately finds human activities, achieving 30.8% mAP.
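The abstract describes Action Search as a recurrent model that, given what it has observed so far, predicts the next temporal location to inspect, so that only a small fraction of the video is ever seen. The sketch below is a minimal, hypothetical version of such a search loop; the feature dimension, hidden size, and convergence-based stopping rule are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class ActionSearchSketch(nn.Module):
    """Hypothetical sketch of an Action-Search-style spotting loop: an LSTM cell
    consumes the feature of the currently observed frame plus its normalized
    temporal position and emits the next position to observe."""

    def __init__(self, feat_dim=500, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTMCell(feat_dim + 1, hidden_dim)
        self.next_loc = nn.Linear(hidden_dim, 1)  # predicts a position in [0, 1]

    def search(self, frame_features, start=0.5, max_steps=30, tol=0.01):
        """frame_features: (T, feat_dim) tensor of per-frame features.
        Returns the list of visited normalized temporal positions."""
        T = frame_features.shape[0]
        h = torch.zeros(1, self.lstm.hidden_size)
        c = torch.zeros_like(h)
        loc, visited = start, [start]
        for _ in range(max_steps):
            idx = int(loc * (T - 1))                       # observe only this frame
            x = torch.cat([frame_features[idx], torch.tensor([loc])]).unsqueeze(0)
            h, c = self.lstm(x, (h, c))
            new_loc = torch.sigmoid(self.next_loc(h)).item()
            visited.append(new_loc)
            if abs(new_loc - loc) < tol:                   # assumed stopping rule: prediction converges
                break
            loc = new_loc
        return visited
```

Per the abstract, the recurrent model is trained on human search sequences from the Human Searches dataset; the loop above only sketches inference, and the spotted location can then seed a temporal localization step.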
Citation: Alwassel H, Caba Heilbron F, Ghanem B (2018) Action Search: Spotting Actions in Videos and Its Application to Temporal Action Localization. Lecture Notes in Computer Science: 253–269. Available: http://dx.doi.org/10.1007/978-3-030-01240-3_16.
Sponsors: This publication is based upon work supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. OSR-CRG2017-3405.
Conference/Event name: 15th European Conference on Computer Vision, ECCV 2018
Related items (by title, author, creator and subject):
DAPs: Deep Action Proposals for Action Understanding
Escorcia, Victor; Caba Heilbron, Fabian; Niebles, Juan Carlos; Ghanem, Bernard (Lecture Notes in Computer Science, Springer Nature, 2016-09-17) [Conference Paper]
Object proposals have contributed significantly to recent advances in object understanding in images. Inspired by the success of this approach, we introduce Deep Action Proposals (DAPs), an effective and efficient algorithm for generating temporal action proposals from long videos. We show how to take advantage of the vast capacity of deep learning models and memory cells to retrieve from untrimmed videos temporal segments that are likely to contain actions. A comprehensive evaluation indicates that our approach outperforms previous work on a large-scale action benchmark, runs at 134 FPS, making it practical for large-scale scenarios, and exhibits an appealing ability to generalize, i.e., to retrieve good-quality temporal proposals of actions unseen in training.
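As a rough illustration of the proposal mechanism this abstract describes (a recurrent model with memory cells scanning clip features and scoring candidate temporal segments), here is a hypothetical sketch; the feature dimension, number of anchor lengths, and scoring head are assumptions for illustration, not the published DAPs architecture.

```python
import torch
import torch.nn as nn

class TemporalProposalSketch(nn.Module):
    """Hypothetical DAPs-style module: an LSTM reads the clip features of one
    sliding window and scores K candidate segment lengths anchored at the
    window's end, yielding temporal action proposals."""

    def __init__(self, feat_dim=500, hidden_dim=256, num_anchors=16):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.scores = nn.Linear(hidden_dim, num_anchors)  # one actionness score per anchor length

    def forward(self, clip_features):
        # clip_features: (batch, T, feat_dim) features of one sliding window
        _, (h, _) = self.lstm(clip_features)
        return torch.sigmoid(self.scores(h[-1]))          # (batch, num_anchors) proposal confidences
```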
Trajectory-based Fisher kernel representation for action recognition in videos
Atmosukarto, Indriyati; Ghanem, Bernard; Ahuja, Narendra (Institute of Electrical and Electronics Engineers (IEEE), 2012) [Conference Paper]
Action recognition is an important computer vision problem with many applications, including video indexing and retrieval, event detection, and video summarization. In this paper, we propose to apply the Fisher kernel paradigm to action recognition. The Fisher kernel framework combines the strengths of generative and discriminative models. In this approach, given the trajectories extracted from a video and a generative Gaussian Mixture Model (GMM), we use the Fisher kernel method to describe how much the GMM parameters are modified to best fit the video trajectories. We experiment with using the Fisher kernel vector to create the video representation and to train an SVM classifier. We further extend our framework to select the most discriminative trajectories using a novel MIL-KNN framework. We compare the performance of our approach to the current state-of-the-art bag-of-features (BOF) approach on two benchmark datasets. Experimental results show that our proposed approach outperforms the state-of-the-art method and that the selected discriminative trajectories are descriptive of the action class.
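The Fisher kernel step above can be made concrete with a small sketch: fit a GMM on trajectory descriptors, then encode a video by the gradient of the log-likelihood with respect to the Gaussian means, which is one common variant of the Fisher vector. The use of scikit-learn, the mean-only encoding, and the random example data are assumptions for illustration, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector_means(trajectories, gmm):
    """Mean-gradient Fisher vector of a video's trajectory descriptors.
    trajectories: (N, D) array; gmm: fitted sklearn GaussianMixture with
    diagonal covariances. Returns a (K * D,) video encoding."""
    N = trajectories.shape[0]
    gamma = gmm.predict_proba(trajectories)        # (N, K) soft assignments
    sigma = np.sqrt(gmm.covariances_)              # (K, D) per-Gaussian std devs
    fv = []
    for k in range(gmm.n_components):
        diff = (trajectories - gmm.means_[k]) / sigma[k]           # whitened deviations
        grad = gamma[:, k:k + 1] * diff                            # weight by responsibility
        fv.append(grad.sum(axis=0) / (N * np.sqrt(gmm.weights_[k])))
    return np.concatenate(fv)

# usage sketch with synthetic descriptors: fit the GMM on training trajectories,
# encode each video, then train a linear SVM (e.g. sklearn.svm.LinearSVC) on the encodings
gmm = GaussianMixture(n_components=64, covariance_type="diag").fit(np.random.rand(1000, 30))
video_encoding = fisher_vector_means(np.random.rand(200, 30), gmm)
```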
Learning a strong detector for action localization in videos
Zhang, Yongqiang; Ding, Mingli; Bai, Yancheng; Liu, Dandan; Ghanem, Bernard (Pattern Recognition Letters, Elsevier BV, 2019-10-09) [Article]
In this paper, we address the problem of spatio-temporal action localization in videos. Current state-of-the-art methods for this challenging task rely on an object detector to first localize actors at the frame level and then link or track the detections across time. Most of these methods focus on leveraging the temporal context of videos for action detection while ignoring the importance of the object detector itself. In this paper, we demonstrate the importance of the object detector in the action localization pipeline and propose a strong object detector, based on the single shot multibox detector (SSD) framework, for better action localization in videos. Unlike SSD, we introduce an anchor refinement branch at the end of the backbone network to refine the input anchors, and we add a batch normalization layer before concatenating the intermediate feature maps at the frame level and after stacking feature maps at the clip level. The proposed detector makes two contributions: (1) it reduces missed target objects at the frame level; (2) it generates deformable anchor cuboids for modeling temporally dynamic actions. Extensive experiments on UCF-Sports, J-HMDB and UCF-101 validate our claims, and we outperform previous state-of-the-art methods by a large margin in terms of frame-mAP and video-mAP, especially at higher overlap thresholds.
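To make the two architectural changes described above more tangible, the sketch below shows a hypothetical anchor refinement branch that predicts per-anchor box offsets from a backbone feature map, together with a BatchNorm layer applied before feature maps are concatenated. Channel and anchor counts are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AnchorRefineSketch(nn.Module):
    """Hypothetical anchor refinement branch in an SSD-style detector:
    predicts (dx, dy, dw, dh) offsets for each of A anchors at every spatial
    location, so downstream detection heads start from refined anchors."""

    def __init__(self, in_channels=512, num_anchors=6):
        super().__init__()
        self.offsets = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm2d(in_channels)   # BN applied before feature maps are concatenated

    def forward(self, feat):
        # feat: (B, C, H, W) backbone feature map
        refined = self.offsets(feat)               # (B, A*4, H, W) anchor offsets
        normalized = self.norm(feat)               # normalized map, ready for concatenation
        return refined, normalized
```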