    Towards Richer Video Representation for Action Understanding

    Name: HumamAlwasselDissertation.pdf
    Size: 14.66 MB
    Format: PDF
    Description: PhD Dissertation
    Type: Dissertation
    Authors: Alwassel, Humam
    Advisors: Ghanem, Bernard
    Committee members: Zisserman, Andrew; Salama, Khaled N.; Elhoseiny, Mohamed
    Program: Computer Science
    KAUST Department: Computer, Electrical and Mathematical Science and Engineering (CEMSE) Division
    Date: 2023-01
    Permanent link to this record: http://hdl.handle.net/10754/690551
    
    Abstract
    With video data dominating internet traffic, it is crucial to develop automated models that can analyze and understand what humans do in videos. Such models must solve tasks such as action classification, temporal activity localization, spatiotemporal action detection, and video captioning. This dissertation aims to identify the challenges hindering progress in human action understanding and to propose novel solutions to overcome them. We identify three challenges: (i) the lack of tools to systematically profile algorithms' performance and understand their strengths and weaknesses, (ii) the high cost of large-scale video annotation, and (iii) the prohibitively large memory footprint of untrimmed videos, which forces localization algorithms to operate atop precomputed, temporally-insensitive clip features.

    To address the first challenge, we propose a novel diagnostic tool to analyze the performance of action detectors and compare different methods beyond a single scalar metric. We use our tool to analyze the top action localization algorithm and conclude that the most impactful aspects to work on are devising strategies to better handle the temporal context around instances, improving robustness with respect to the absolute and relative size of instances, and proposing ways to reduce localization errors. Moreover, our analysis finds that the lack of agreement among annotators is not a significant roadblock to attaining progress in the field.

    We tackle the second challenge by proposing novel frameworks and algorithms that learn from videos with incomplete annotations (weak supervision) or no labels (self-supervision). In the weakly-supervised scenario, we study the temporal action localization task on untrimmed videos where only a weak video-level label is available. We propose a novel weakly-supervised method that uses an iterative refinement approach, estimating and training on snippet-level pseudo ground truth at every iteration. In the self-supervised setup, we study learning from unlabeled videos by exploiting the strong correlation between the visual frames and the audio signal. We propose a novel self-supervised method that leverages unsupervised clustering in one modality (e.g., audio) as a supervisory signal for the other modality (e.g., video). This cross-modal supervision helps our model utilize the semantic correlation and the differences between the two modalities, resulting in the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.

    Finally, the third challenge stems from localization methods using precomputed clip features extracted from video encoders typically trained for trimmed action classification tasks. Such features tend to be temporally insensitive: background (no action) segments can have representations similar to foreground (action) segments from the same untrimmed video. These temporally-insensitive features make it harder for the localization algorithm to learn the target task and thus negatively impact the final performance. We propose to mitigate this temporal insensitivity through a novel supervised pretraining paradigm for clip features that not only trains to classify activities but also considers background clips and global video information.
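
    A minimal sketch of the cross-modal clustering idea described in the abstract, using NumPy and scikit-learn. The feature matrices, dimensions, and single-step linear classifiers below are illustrative placeholders; the dissertation's actual method trains deep audio and video encoders end-to-end and alternates clustering with training, which this sketch does not reproduce.

        # Cross-modal pseudo-label supervision: cluster one modality, then train the
        # other modality on the resulting cluster assignments.
        # All features here are random placeholders standing in for learned embeddings.
        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        n_clips, d_audio, d_video, k = 1000, 128, 256, 10

        # Stand-ins for per-clip embeddings of the same unlabeled videos in two modalities.
        audio_feats = rng.normal(size=(n_clips, d_audio))
        video_feats = rng.normal(size=(n_clips, d_video))

        # Cluster the audio features to obtain pseudo-labels ...
        audio_pseudo = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(audio_feats)
        # ... and use them as the supervisory signal for a classifier on the video features.
        video_head = LogisticRegression(max_iter=1000).fit(video_feats, audio_pseudo)

        # The symmetric direction: video-derived clusters supervise the audio classifier.
        video_pseudo = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(video_feats)
        audio_head = LogisticRegression(max_iter=1000).fit(audio_feats, video_pseudo)

        print("video head accuracy on audio-derived pseudo-labels:",
              video_head.score(video_feats, audio_pseudo))

    Because each model is trained on labels derived from the other modality, the supervisory signal is correlated with, but not identical to, the model's own input; this is how the "semantic correlation and differences between the two modalities" mentioned in the abstract enter the training loop.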
    Citation
    Alwassel, H. (2023). Towards Richer Video Representation for Action Understanding [KAUST Research Repository]. https://doi.org/10.25781/KAUST-6529U
    DOI: 10.25781/KAUST-6529U
    Collections: PhD Dissertations; Computer Science Program; Computer, Electrical and Mathematical Science and Engineering (CEMSE) Division
