Extracting Semantic and Geometric Information in Images and Videos using GANs
Permanent link to this record: http://hdl.handle.net/10754/690240
Abstract: The success of Generative Adversarial Networks (GANs) has resulted in unprecedented quality in both image generation and manipulation. Recent state-of-the-art GANs (e.g., the StyleGAN series) have demonstrated outstanding results in photo-realistic image generation. In this dissertation, we explore the properties of the latent space of StyleGAN and its derivative architectures, and use them for image manipulation, extraction of 3D properties, and various weakly supervised and unsupervised downstream tasks. First, we study the projection of images into StyleGAN's latent space and analyze the properties of embedded images in a proposed extended $W+$ latent space. Second, we demonstrate rich semantic interpretations of the images in the latent space, which indirectly yields a compelling semantic understanding of the underlying latent space. Specifically, we combine $W+$ space with noise space optimization and tensor manipulations to enable high-quality reconstruction and local editing. For example, we can perform image inpainting, where these regularized latent spaces reconstruct the image's content and the details of the missing regions are filled in by the GAN prior. Next, we study whether a 2D image-based GAN learns a meaningful semantic model and the 3D properties of an image. Using our analysis, we can extract a plausible interpretation of 3D geometry, lighting, materials, and other semantic attributes of the source images by modeling the latent space using conditional continuous normalizing flows. As a result, we can perform non-linear sequential edits on the source image without affecting the quality and identity of the image. Furthermore, we propose an unsupervised technique to extract underlying latent space properties, which generalizes our analysis to unseen datasets where human knowledge is limited. Specifically, we use an information-rich visual-linguistic model, CLIP, trained on internet-scale data of image-text pairs.
The proposed framework extracts, labels, and projects important directions into the GAN latent space without human supervision. Finally, inspired by the findings of our analysis, we investigate additional related unexplored questions: Can we perform foreground object segmentation? Can an image-based GAN be used to edit videos? Can we generate view-consistent editable 3D animations? Investigating these research questions helps us use GANs to tackle a spectrum of tasks outside the usual image generation task. Specifically, we propose a technique to segment foreground objects from the generated images using the information stored in the StyleGAN feature maps. This framework can be used to create synthetic datasets for training existing supervised segmentation networks. Then, we study the regularized $W+$, activation $S$, and Fourier feature $F_f$ spaces to embed and edit videos in the image-based StyleGAN3, a variant of StyleGAN. We can generate high-quality videos at $1024^2$ resolution using a single image and driving videos. Lastly, we propose a framework for domain adaptation in 3D-GANs that can link the latent spaces of different models together. We build upon EG3D, a 3D-GAN derived from StyleGAN, to enable the generation, editing, and animation of personalized 3D avatars. Technically, we first propose a method to align the camera distributions of two domains, i.e., faces and avatars. We then propose a method for domain adaptation in 3D-GANs using texture, geometric, and depth regularization, with an option to model more exaggerated geometries. We also propose a method to link and project real faces into the 3D artistic domain. These frameworks allow us to develop tools distilled from an unconditional GAN for unsupervised image segmentation, video editing, and personalized 3D animation generation and manipulation with state-of-the-art performance. We create these tools without needing extra annotated object segmentation, video, or 3D data.
Citation: Abdal, R. (2023). Extracting Semantic and Geometric Information in Images and Videos using GANs [KAUST Research Repository]. https://doi.org/10.25781/KAUST-X2Z79