Recent Submissions

  • Generalization of the Orthodiagonal Involutive Type of Kokotsakis Flexible Polyhedra

    Aikyn, Alisher; Liu, Yang; Lyakhov, Dmitry; Pottmann, Helmut; Michels, Dominik L. (arXiv, 2023-03-19) [Preprint]
    In this paper we introduce and study a remarkable class of mechanisms formed by a 3×3 arrangement of rigid and skew quadrilateral faces with revolute joints at the common edges. These Kokotsakis-type mechanisms with a quadrangular base and non-planar faces are a generalization of Izmestiev's orthodiagonal involutive type of Kokotsakis polyhedra formed by planar quadrilateral faces. Our algebraic approach yields a complete characterization of all complexes of the orthodiagonal involutive type. It is shown that one has 8 degrees of freedom to construct such mechanisms. This is illustrated by several examples, including cases that are not possible with planar faces.
  • Large Numerical Aperture Metalens with High Modulation Transfer Function

    Zhang, Jian; Dun, Xiong; Zhu, Jingyuan; Zhang, Zhanyi; Feng, Chao; Wang, Zhanshan; Heidrich, Wolfgang; Cheng, Xinbin (ACS Photonics, American Chemical Society (ACS), 2023-03-14) [Article]
    Large numerical aperture (NA) lenses with high modulation transfer functions (MTFs) promise high image resolution for advanced optical imaging. However, it is challenging to achieve a high MTF using traditional large-NA lenses, which are fundamentally limited by the amplitude mismatch. In contrast, metasurfaces are promising for realizing amplitude and phase matching for ideal lenses. However, current metalenses are mostly based on a phase-only (PO) profile because the strong coupling among the meta-atoms in large-NA lenses makes perfect amplitude matching quite challenging to realize. Here, we derive a phase-and-amplitude (PA) profile that approaches the theoretical MTF limit for large-NA lenses and use interferometric unit cells combined with a segmented sampling approach to achieve the desired amplitude and phase control. For the first time, we show that the amplitude does not require a perfect match; reproducing the trend of the required amplitude is sufficient to significantly increase the MTF of a large-NA lens. We demonstrated a 0.9 NA cylindrical metalens at 940 nm whose Struve ratio (SR), a measure of how close the MTF is to its upper limit, increases from 0.68 to 0.90 compared with the PO metalens. Experimentally, we achieved an SR of 0.77 for the 0.9 NA lens, which is even 0.09 higher than the simulated SR of the PO metalens. Our investigation provides new insights for large-NA lenses and has potential applications in high-image-resolution optical systems.
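    As a point of reference for the MTF discussion above, the following minimal Fourier-optics sketch (not the paper's derivation) computes the one-dimensional MTF of a cylindrical lens as the normalized autocorrelation of its pupil function; the amplitude apodization used here is invented purely to illustrate that shaping the pupil amplitude, not just its phase, changes the MTF.
```python
# Generic Fourier-optics sketch, not the paper's PA profile: the incoherent MTF is the
# normalized autocorrelation of the pupil function, so pupil amplitude shaping alters it.
import numpy as np

n = 2048
x = np.linspace(-2.0, 2.0, n)                      # normalized 1D pupil coordinate
pupil_po = (np.abs(x) <= 1.0).astype(float)        # phase-only lens: uniform amplitude
pupil_pa = pupil_po * (0.5 + 0.5 * np.abs(x))      # hypothetical amplitude apodization

def mtf(pupil: np.ndarray) -> np.ndarray:
    """MTF = |autocorrelation of the pupil|, normalized to 1 at zero frequency."""
    ac = np.correlate(pupil, pupil, mode="full")
    return np.abs(ac) / np.abs(ac).max()

mtf_po, mtf_pa = mtf(pupil_po), mtf(pupil_pa)
mid = (n - 1) + n // 2                             # index of a mid spatial frequency
print(mtf_po[mid], mtf_pa[mid])                    # the two pupils give different MTFs
```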
  • ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions

    Zhu, Deyao; Chen, Jun; Haydarov, Kilichbek; Shen, Xiaoqian; Zhang, Wenxuan; Elhoseiny, Mohamed (arXiv, 2023-03-12) [Preprint]
    Asking insightful questions is crucial for acquiring knowledge and expanding our understanding of the world. However, the importance of questioning has been largely overlooked in AI research, where models have been primarily developed to answer questions. With recent advances in large language models (LLMs) like ChatGPT, we discover their capability to ask high-quality questions when provided with a suitable prompt. This discovery presents a new opportunity to develop an automatic questioning system. In this paper, we introduce ChatCaptioner, a novel automatic-questioning method deployed in image captioning. Here, ChatGPT is prompted to ask a series of informative questions about images to BLIP-2, a strong vision question-answering model. By continually acquiring new visual information from BLIP-2's answers, ChatCaptioner is able to generate more enriched image descriptions. We conduct human-subject evaluations on common image caption datasets such as COCO, Conceptual Captions, and WikiArt, and compare ChatCaptioner with BLIP-2 as well as ground truth. Our results demonstrate that ChatCaptioner's captions are significantly more informative, receiving three times as many votes from human evaluators for providing the most image information. Moreover, ChatCaptioner identifies 53% more objects within the image than BLIP-2 alone, as measured by WordNet synset matching.
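    The questioning loop described above can be sketched in a few lines; llm_ask, vqa_answer, and llm_summarize below are hypothetical stand-ins for the ChatGPT and BLIP-2 calls (the actual prompts and APIs used by ChatCaptioner differ).
```python
# Sketch of an automatic-questioning captioning loop in the spirit of ChatCaptioner.
# The three callables are hypothetical stand-ins for ChatGPT / BLIP-2; real prompts differ.
def chat_caption(image, llm_ask, vqa_answer, llm_summarize, num_rounds=5):
    dialogue = []                                  # accumulated question/answer history
    for _ in range(num_rounds):
        question = llm_ask(dialogue)               # LLM asks an informative follow-up question
        answer = vqa_answer(image, question)       # VQA model answers from the image
        dialogue.append((question, answer))
    return llm_summarize(dialogue)                 # LLM condenses the dialogue into a caption
```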
  • ELF: Federated Langevin Algorithms with Primal, Dual and Bidirectional Compression

    Karagulyan, Avetik; Richtarik, Peter (arXiv, 2023-03-08) [Preprint]
    Federated sampling algorithms have recently gained great popularity in the machine learning and statistics communities. This paper studies variants of such algorithms called Error Feedback Langevin algorithms (ELF). In particular, we analyze the combinations of EF21 and EF21-P with federated Langevin Monte Carlo. We propose three algorithms, P-ELF, D-ELF, and B-ELF, that use primal, dual, and bidirectional compressors, respectively. We analyze the proposed methods under a log-Sobolev inequality and provide non-asymptotic convergence guarantees.
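    The basic structure, compressed client-to-server communication wrapped around a Langevin step, can be sketched as follows; the Top-k compressor, the toy quadratic potentials, and the step size are illustrative assumptions, and the update is only in the spirit of the dual-compression variant, not the paper's exact P-ELF/D-ELF/B-ELF algorithms.
```python
# Schematic federated Langevin step with EF21-style compressed gradient differences
# (illustrative only; not the exact P-ELF / D-ELF / B-ELF updates from the paper).
import numpy as np

rng = np.random.default_rng(0)
d, n_clients, gamma, k = 10, 5, 1e-2, 3
A = [rng.standard_normal((20, d)) for _ in range(n_clients)]
b = [rng.standard_normal(20) for _ in range(n_clients)]

def grad(i, x):                                   # gradient of client i's quadratic potential
    return A[i].T @ (A[i] @ x - b[i])

def topk(v, k):                                   # keep the k largest-magnitude entries
    idx = np.argsort(np.abs(v))[-k:]
    out = np.zeros_like(v)
    out[idx] = v[idx]
    return out

x = np.zeros(d)
g = [grad(i, x) for i in range(n_clients)]        # running gradient estimates on the server
for _ in range(1000):
    noise = rng.standard_normal(d)
    x = x - gamma * np.mean(g, axis=0) + np.sqrt(2.0 * gamma) * noise   # Langevin step
    for i in range(n_clients):                    # EF21: each client compresses the difference
        g[i] = g[i] + topk(grad(i, x) - g[i], k)
```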
  • Aberration-Aware Depth-from-Focus

    Yang, Xinge; Fu, Qiang; Elhoseiny, Mohamed; Heidrich, Wolfgang (arXiv, 2023-03-08) [Preprint]
    Computer vision methods for depth estimation usually use simple camera models with idealized optics. For modern machine learning approaches, this creates an issue when attempting to train deep networks with simulated data, especially for focus-sensitive tasks like Depth-from-Focus. In this work, we investigate the domain gap caused by off-axis aberrations that affect the selection of the best-focused frame in a focal stack. We then explore bridging this domain gap through aberration-aware training (AAT). Our approach involves a lightweight network that models lens aberrations at different positions and focus distances, which is then integrated into the conventional network training pipeline. We evaluate the generality of pretrained models on both synthetic and real-world data. Our experimental results demonstrate that the proposed AAT scheme can improve depth estimation accuracy without fine-tuning the model or modifying the network architecture.
  • Recognizing the Shape and Size of Tundra Lakes in Synthetic Aperture Radar (SAR) Images Using Deep Learning Segmentation

    Demchev, Denis; Sudakow, Ivan; Khodos, Alexander; Abramova, Irina; Lyakhov, Dmitry; Michels, Dominik L. (Remote Sensing, MDPI AG, 2023-02-26) [Article]
    Permafrost tundra contains more than twice as much carbon as is currently in the atmosphere, and it is warming six times as fast as the global mean. The dynamics of tundra lakes are a robust indicator of global climate processes, yet they are still not well understood. Satellite data, particularly from synthetic aperture radar (SAR), are a suitable tool for recognizing tundra lakes and monitoring their changes. However, manual analysis of lake boundaries can be slow and inefficient; therefore, reliable automated algorithms are required. To address this issue, we propose a two-stage approach comprising deep-learning-based semantic segmentation by U-Net, followed by a watershed algorithm that separates touching and overlapping lakes into individual instances. This separation step is essential for accurately estimating the size and shape of individual lakes. Here, we evaluated the performance of the proposed approach on lakes manually extracted from tens of C-band SAR images from Sentinel-1, collected in the Yamal Peninsula and Alaska areas in the summer months of 2015–2022. An accuracy of 0.73, in terms of the Jaccard similarity index, was achieved. Lake perimeters, areas, and fractal dimensions were estimated from the framework's output on hundreds of SAR images, and the size distribution was found to be lognormal. The evaluation of the results indicated the efficiency of the proposed approach for accurate automatic estimation of tundra lake shapes and sizes, and its potential for further studies of tundra lake dynamics in the context of global climate change, aimed at revealing new factors that could cause the planet to warm or cool.
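    The lake-separation stage can be illustrated with standard tooling; the sketch below assumes a binary lake mask is already available (e.g., from the U-Net) and uses a distance transform plus watershed to split touching lakes, leaving out SAR preprocessing and the network itself.
```python
# Sketch of the instance-separation stage only: split touching lakes in a binary mask
# via distance transform + watershed (SAR preprocessing and the U-Net are omitted).
import numpy as np
from scipy import ndimage as ndi
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def separate_lakes(mask: np.ndarray) -> np.ndarray:
    """mask: boolean lake/no-lake image -> integer label image, one label per lake."""
    distance = ndi.distance_transform_edt(mask)
    coords = peak_local_max(distance, labels=mask, footprint=np.ones((7, 7)))
    markers = np.zeros(mask.shape, dtype=int)
    markers[tuple(coords.T)] = np.arange(1, len(coords) + 1)   # one seed per lake center
    return watershed(-distance, markers, mask=mask)
```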
  • ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

    Bhat, Shariq Farooq; Birkl, Reiner; Wofk, Diana; Wonka, Peter; Müller, Matthias (arXiv, 2023-02-23) [Preprint]
    This paper tackles the problem of depth estimation from a single image. Existing work either focuses on generalization performance while disregarding metric scale, i.e. relative depth estimation, or on state-of-the-art results for specific datasets, i.e. metric depth estimation. We propose the first approach that combines both worlds, leading to a model with excellent generalization performance while maintaining metric scale. Our flagship model, ZoeD-M12-NK, is pre-trained on 12 datasets using relative depth and fine-tuned on two datasets using metric depth. We use a lightweight head with a novel bin adjustment design called the metric bins module for each domain. During inference, each input image is automatically routed to the appropriate head using a latent classifier. Our framework admits multiple configurations depending on the datasets used for relative depth pre-training and metric fine-tuning. Without pre-training, we can already significantly improve the state of the art (SOTA) on the NYU Depth v2 indoor dataset. Pre-training on twelve datasets and fine-tuning on the NYU Depth v2 indoor dataset, we further improve the SOTA by a total of 21% in terms of relative absolute error (REL). Finally, ZoeD-M12-NK is the first model that can jointly train on multiple datasets (NYU Depth v2 and KITTI) without a significant drop in performance and achieve unprecedented zero-shot generalization performance on eight unseen datasets from both indoor and outdoor domains.
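    The routing idea, a latent classifier that sends each image to a per-domain metric head, is small enough to sketch; the module sizes and names below are invented and this is not ZoeDepth's actual metric-bins head.
```python
# Minimal sketch of latent-classifier routing to per-domain heads (invented sizes/names;
# not ZoeDepth's metric bins module).
import torch
import torch.nn as nn

class RoutedMetricHead(nn.Module):
    def __init__(self, feat_dim=256, domains=("indoor", "outdoor")):
        super().__init__()
        self.router = nn.Linear(feat_dim, len(domains))      # latent domain classifier
        self.heads = nn.ModuleList(nn.Linear(feat_dim, 1) for _ in domains)

    def forward(self, feats):                                # feats: (B, feat_dim) pooled features
        domain = self.router(feats).argmax(dim=-1)           # pick a head per image
        return torch.stack([self.heads[d](f) for d, f in zip(domain.tolist(), feats)])

head = RoutedMetricHead()
print(head(torch.randn(4, 256)).shape)                       # torch.Size([4, 1])
```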
  • LC-NAS: Latency Constrained Neural Architecture Search for Point Cloud Networks

    Li, Guohao; Xu, Mengmeng; Giancola, Silvio; Thabet, Ali; Ghanem, Bernard (IEEE, 2023-02-22) [Conference Paper]
    Point cloud architecture design has become a crucial problem for deep learning in 3D. Several efforts have been made to manually design architectures targeting high accuracy in point cloud tasks such as classification, segmentation, and detection. Recent progress in automatic Neural Architecture Search (NAS) minimizes the human effort in network design and optimizes architectures for high performance. However, those efforts fail to consider crucial factors such as latency during inference, which is of high importance in time-critical and hardware-bounded applications like self-driving cars, robot navigation, and mobile applications. In this paper, we introduce a new NAS framework, dubbed LC-NAS, that searches for point cloud architectures constrained to a target latency. We implement a novel latency constraint formulation for the trade-off between accuracy and latency in our architecture search. Contrary to previous works, our latency loss enables us to find the best architecture with latency near a specific target value, which is crucial when the end task is to be deployed in a limited hardware setting. Extensive experiments show that LC-NAS is able to find state-of-the-art architectures for point cloud classification in ModelNet40 with a minimal computational cost. We also show how our searched architectures achieve any desired latency with a reasonably low drop in accuracy. Finally, we show how our searched architectures easily transfer to the part segmentation task on PartNet, where we achieve state-of-the-art results with significantly lower latency.
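    The core idea of constraining the search can be summarized with a simple penalized objective; the quadratic form and weight below are illustrative assumptions, not the paper's exact latency loss.
```python
# Sketch of a latency-constrained search objective (illustrative; not LC-NAS's exact loss):
# penalize candidate architectures whose estimated latency deviates from the target.
def search_objective(task_loss: float, est_latency_ms: float,
                     target_latency_ms: float, weight: float = 1.0) -> float:
    latency_penalty = (est_latency_ms - target_latency_ms) ** 2
    return task_loss + weight * latency_penalty

# Example: same task loss, two candidates with different estimated latencies.
print(search_objective(0.30, est_latency_ms=12.0, target_latency_ms=10.0))  # 4.30
print(search_objective(0.30, est_latency_ms=10.1, target_latency_ms=10.0))  # ~0.31
```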
  • TAMUNA: Accelerated Federated Learning with Local Training and Partial Participation

    Condat, Laurent Pierre; Malinovsky, Grigory; Richtarik, Peter (arXiv, 2023-02-20) [Preprint]
    In federated learning, a large number of users are involved in a global learning task, in a collaborative way. They alternate between local computation and communication with a distant server. Communication, which can be slow and costly, is the main bottleneck in this setting. To accelerate distributed gradient descent, the popular strategy of local training is to communicate less frequently; that is, to perform several iterations of local computation between the communication steps. A recent breakthrough in this field was made by Mishchenko et al. (2022): their Scaffnew algorithm is the first to provably benefit from local training, with accelerated communication complexity. However, it was an open and challenging question whether the powerful mechanism behind Scaffnew would be compatible with partial participation, the desirable feature that not all clients need to participate in every round of the training process. We answer this question positively and propose a new algorithm, which handles local training and partial participation, with state-of-the-art communication complexity.
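    For context, the two ingredients discussed above, local training and partial participation, combine as in the generic sketch below (plain local SGD with client sampling and periodic averaging); this is not TAMUNA itself, which adds further mechanisms to obtain its accelerated communication complexity.
```python
# Generic local-training + partial-participation loop (not TAMUNA itself): sampled clients
# run several local SGD steps on a toy least-squares problem, then the server averages.
import numpy as np

rng = np.random.default_rng(1)
n_clients, d, local_steps, lr, rounds, cohort = 20, 5, 10, 0.05, 50, 5
data = [(rng.standard_normal((30, d)), rng.standard_normal(30)) for _ in range(n_clients)]

def local_sgd(x, A, b):
    for _ in range(local_steps):                   # several local steps between communications
        x = x - lr * A.T @ (A @ x - b) / len(b)
    return x

x = np.zeros(d)
for _ in range(rounds):
    active = rng.choice(n_clients, size=cohort, replace=False)   # partial participation
    x = np.mean([local_sgd(x.copy(), *data[i]) for i in active], axis=0)
```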
  • Physics-Informed Deep Neural Network for Backward-in-Time Prediction: Application to Rayleigh–Bénard Convection

    Hammoud, Mohamad Abed El Rahman; Alwassel, Humam; Ghanem, Bernard; Knio, Omar; Hoteit, Ibrahim (Artificial Intelligence for the Earth Systems, American Meteorological Society, 2023-02-14) [Article]
    Backward-in-time predictions are needed to better understand the underlying dynamics of physical fluid flows and improve future forecasts. However, integrating fluid flows backward in time is challenging because of numerical instabilities caused by the diffusive nature of the fluid systems and the nonlinearities of the governing equations. Although this problem has long been addressed using a non-positive diffusion coefficient when integrating backward, that approach is notoriously inaccurate. In this study, a physics-informed deep neural network (PI-DNN) is presented to predict past states of a dissipative dynamical system from snapshots of the subsequent evolution of the system state. The performance of the PI-DNN is investigated through several systematic numerical experiments, and the accuracy of the backward-in-time predictions is evaluated in terms of different error metrics. The proposed PI-DNN can predict the previous state of the Rayleigh–Bénard convection with an 8-time-step average normalized ℓ2-error of less than 2% for a turbulent flow at a Rayleigh number of 10⁵.
  • Federated Learning with Regularized Client Participation

    Malinovsky, Grigory; Horváth, Samuel; Burlachenko, Konstantin; Richtarik, Peter (arXiv, 2023-02-07) [Preprint]
    Federated Learning (FL) is a distributed machine learning approach where multiple clients work together to solve a machine learning task. One of the key challenges in FL is the issue of partial participation, which occurs when a large number of clients are involved in the training process. The traditional method to address this problem is to randomly select a subset of clients at each communication round. In our research, we propose a new technique and design a novel regularized client participation scheme. Under this scheme, each client joins the learning process every R communication rounds, which we refer to as a meta epoch. We have found that this participation scheme leads to a reduction in the variance caused by client sampling. Combined with the popular FedAvg algorithm (McMahan et al., 2017), it results in superior rates under standard assumptions. For instance, the optimization term in our main convergence bound decreases linearly with the product of the number of communication rounds and the size of the local dataset of each client, and the statistical term scales quadratically with the step size instead of linearly (the case for client sampling with replacement), leading to a better convergence rate of O(1/T²) compared to O(1/T), where T is the total number of communication rounds. Furthermore, our results permit arbitrary client availability as long as each client is available for training once per meta epoch.
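    The participation scheme itself is easy to sketch: within each meta epoch of R rounds, a permutation of the clients is split into R cohorts so that every client trains exactly once per meta epoch (training is omitted below; the grouping is an illustrative reading of the scheme, not the paper's code).
```python
# Sketch of a regularized participation schedule: every client appears exactly once per
# meta epoch of R rounds (the training step itself is omitted).
import numpy as np

def meta_epoch_schedule(n_clients: int, R: int, rng: np.random.Generator):
    perm = rng.permutation(n_clients)
    return np.array_split(perm, R)                 # one cohort per communication round

rng = np.random.default_rng(0)
for round_idx, clients in enumerate(meta_epoch_schedule(12, R=4, rng=rng)):
    print(f"round {round_idx}: clients {clients.tolist()}")
```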
  • BUSIFusion: Blind Unsupervised Single Image Fusion of Hyperspectral and RGB Images

    Li, Jiabao; Li, Yuqi; Wang, Chong; Ye, Xulun; Heidrich, Wolfgang (IEEE Transactions on Computational Imaging, Institute of Electrical and Electronics Engineers (IEEE), 2023-02-06) [Article]
    Hyperspectral images (HSIs) provide rich spectral information that has been widely used in numerous computer vision tasks. However, their low spatial resolution often prevents their use in applications such as image segmentation and recognition. Fusing low-resolution HSIs with high-resolution RGB images to reconstruct high-resolution HSIs has attracted great research attention recently. In this paper, we propose an unsupervised blind fusion network that operates on a single HSI and RGB image pair and requires neither known degradation models nor any training data. Our method takes full advantage of an unrolling network and coordinate encoding to provide a state-of-the-art HSI reconstruction. It can also estimate the degradation parameters relatively accurately through the neural representation and implicit regularization of the degradation model. The experimental results demonstrate the effectiveness of our method both in simulations and in our real experiments. The proposed method outperforms other state-of-the-art nonblind and blind fusion methods on two popular HSI datasets.
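    The degradation model that such fusion methods (blind or non-blind) work with can be written compactly; the Gaussian blur, downsampling factor, and spectral response matrix below are illustrative placeholders for the operators that BUSIFusion estimates rather than assumes known.
```python
# Sketch of the standard HSI/RGB degradation operators (illustrative parameters): the
# low-res HSI is a blurred, downsampled high-res cube; the RGB image is its projection
# through a camera spectral response. BUSIFusion estimates such operators blindly.
import numpy as np
from scipy.ndimage import gaussian_filter

def spatial_degrade(hsi: np.ndarray, sigma: float = 1.5, scale: int = 4) -> np.ndarray:
    """(H, W, C) high-res cube -> (H/scale, W/scale, C) low-res cube."""
    blurred = gaussian_filter(hsi, sigma=(sigma, sigma, 0))    # blur spatial axes only
    return blurred[::scale, ::scale, :]

def spectral_degrade(hsi: np.ndarray, response: np.ndarray) -> np.ndarray:
    """Project C spectral bands to 3 RGB channels with a (C, 3) response matrix."""
    return hsi @ response
```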
  • Real-Time Evaluation in Online Continual Learning: A New Paradigm

    Ghunaim, Yasir; Bibi, Adel; Alhamoud, Kumail; Alfarra, Motasem; Hammoud, Hasan Abed Al Kader; Prabhu, Ameya; Torr, Philip H. S.; Ghanem, Bernard (arXiv, 2023-02-02) [Preprint]
    Current evaluations of Continual Learning (CL) methods typically assume that there is no constraint on training time and computation. This is an unrealistic assumption for any real-world setting, which motivates us to propose a practical real-time evaluation of continual learning, in which the stream does not wait for the model to complete training before revealing the next data for predictions. To do this, we evaluate current CL methods with respect to their computational costs. We hypothesize that under this new evaluation paradigm, computationally demanding CL approaches may perform poorly on streams with a varying distribution. We conduct extensive experiments on CLOC, a large-scale dataset containing 39 million time-stamped images with geolocation labels. We show that a simple baseline outperforms state-of-the-art CL methods under this evaluation, questioning the applicability of existing methods in realistic settings. In addition, we explore various CL components commonly used in the literature, including memory sampling strategies and regularization approaches. We find that all considered methods fail to be competitive against our simple baseline. This surprising result suggests that the majority of existing CL literature is tailored to a specific class of streams that is not practical. We hope that the evaluation we provide will be the first step towards a paradigm shift to consider the computational cost in the development of online continual learning methods.
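    The evaluation protocol can be pictured with a small simulation: if training on a batch costs k stream steps, the model must answer the next k batches with stale weights. The model interface below (predict, train_cost, update) is hypothetical and only illustrates the accounting, not the paper's benchmark code.
```python
# Toy accounting for real-time evaluation: while a method is still training, it keeps
# predicting with stale weights. The model interface here is hypothetical.
def realtime_eval(stream, model):
    correct, total, busy_until = 0, 0, 0
    for t, (x, y) in enumerate(stream):
        correct += int(model.predict(x) == y)        # predictions are always due on arrival
        total += 1
        if t >= busy_until:                          # can only start training when idle
            busy_until = t + model.train_cost(x, y)  # training delay, in stream steps
            model.update(x, y)
    return correct / max(total, 1)
```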
  • High-Probability Bounds for Stochastic Optimization and Variational Inequalities: the Case of Unbounded Variance

    Sadiev, Abdurakhmon; Danilova, Marina; Gorbunov, Eduard; Horváth, Samuel; Gidel, Gauthier; Dvurechensky, Pavel; Gasnikov, Alexander; Richtarik, Peter (arXiv, 2023-02-02) [Preprint]
    During recent years the interest of optimization and machine learning communities in high-probability convergence of stochastic optimization methods has been growing. One of the main reasons for this is that high-probability complexity bounds are more accurate and less studied than in-expectation ones. However, SOTA high-probability non-asymptotic convergence results are derived under strong assumptions such as the boundedness of the gradient noise variance or of the objective's gradient itself. In this paper, we propose several algorithms with high-probability convergence results under less restrictive assumptions. In particular, we derive new high-probability convergence results under the assumption that the gradient/operator noise has bounded central α-th moment for α∈(1,2] in the following setups: (i) smooth non-convex / Polyak-Lojasiewicz / convex / strongly convex / quasi-strongly convex minimization problems, (ii) Lipschitz / star-cocoercive and monotone / quasi-strongly monotone variational inequalities. These results justify the usage of the considered methods for solving problems that do not fit standard functional classes studied in stochastic optimization.
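    For concreteness, the bounded central α-th moment condition referenced above can be written as follows (the σ notation and exact formulation follow common usage and may differ in detail from the paper); at α = 2 it recovers the usual bounded-variance assumption, while α < 2 allows heavier-tailed noise.
```latex
% Bounded central \alpha-th moment of the gradient (or operator) noise:
\mathbb{E}\bigl[\|\nabla f(x,\xi) - \nabla f(x)\|^{\alpha}\bigr] \le \sigma^{\alpha},
\qquad \alpha \in (1,2].
```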
  • Curriculum Learning for ab initio Deep Learned Refractive Optics

    Yang, Xinge; Fu, Qiang; Heidrich, Wolfgang (arXiv, 2023-02-02) [Preprint]
    Deep lens optimization has recently emerged as a new paradigm for designing computational imaging systems; however, it has been limited to either simple optical systems consisting of a single DOE or metalens, or the fine-tuning of compound lenses from good initial designs. Here we present a deep lens design method based on curriculum learning, which is able to learn optical designs of compound lenses ab initio from randomly initialized surfaces, thereby overcoming the need for a good initial design. We demonstrate this approach with the fully automatic design of an extended depth-of-field computational camera in a cellphone-style form factor, with highly aspherical surfaces and a short back focal length.
  • 3DShape2VecSet: A 3D Shape Representation for Neural Fields and Generative Diffusion Models

    Zhang, Biao; Tang, Jiapeng; Niessner, Matthias; Wonka, Peter (arXiv, 2023-02-01) [Preprint]
    We introduce 3DShape2VecSet, a novel shape representation for neural fields designed for generative diffusion models. Our shape representation can encode 3D shapes given as surface models or point clouds, and represents them as neural fields. The concept of neural fields has previously been combined with a global latent vector, a regular grid of latent vectors, or an irregular grid of latent vectors. Our new representation encodes neural fields on top of a set of vectors. We draw from multiple concepts, such as radial basis function representations and cross-attention and self-attention mechanisms, to design a learnable representation that is especially suitable for processing with transformers. Our results show improved performance in 3D shape encoding and 3D shape generative modeling tasks. We demonstrate a wide variety of generative applications: unconditioned generation, category-conditioned generation, text-conditioned generation, point-cloud completion, and image-conditioned generation.
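    The core decoding step, querying a field value at a 3D point by cross-attending to the latent set, can be sketched as follows; dimensions, the coordinate embedding, and head count are invented and this is not the paper's exact architecture.
```python
# Minimal sketch of decoding a neural field from a latent *set* via cross-attention
# (invented dimensions; not the paper's exact architecture).
import torch
import torch.nn as nn

class LatentSetField(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.point_embed = nn.Linear(3, dim)                  # embed query coordinates
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Linear(dim, 1)                          # e.g., occupancy / SDF value

    def forward(self, points, latents):
        # points: (B, N, 3) query coordinates, latents: (B, M, dim) latent vector set
        q = self.point_embed(points)
        attended, _ = self.cross_attn(q, latents, latents)    # queries attend to the set
        return self.out(attended)                             # (B, N, 1) field values

field = LatentSetField()
print(field(torch.rand(2, 1024, 3), torch.randn(2, 64, 256)).shape)   # torch.Size([2, 1024, 1])
```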
  • Riemannian Geometry for Scientific Visualization

    Hadwiger, Markus; Theußl, Thomas; Rautek, Peter (ACM, 2023-01-31) [Conference Paper]
    This tutorial introduces the most important basics of Riemannian geometry and related concepts with a specific focus on applications in scientific visualization. The central concept in Riemannian geometry is the presence of a Riemannian metric on a differentiable manifold, a second-order tensor field that defines an inner product in each tangent space and varies smoothly from point to point. Technically, the metric is what allows defining and computing distances and angles in a coordinate-independent manner. Even more importantly, however, it is in a sense the major structure (on top of topological considerations) that defines the space in which scientific data, such as scalar, vector, and tensor fields, live.
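    Concretely, with metric components g_{ij} in local coordinates, the metric supplies exactly the quantities mentioned above, namely inner products, norms, angles, and curve lengths:
```latex
\langle u, v \rangle_g = g_{ij}\, u^i v^j, \qquad
\|u\|_g = \sqrt{g_{ij}\, u^i u^j}, \qquad
\cos\theta = \frac{\langle u, v \rangle_g}{\|u\|_g \, \|v\|_g}, \qquad
L(\gamma) = \int_a^b \sqrt{g_{ij}\,\dot\gamma^i(t)\,\dot\gamma^j(t)}\; dt .
```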
  • Guiding Online Reinforcement Learning with Action-Free Offline Pretraining

    Zhu, Deyao; Wang, Yuhui; Schmidhuber, Juergen; Elhoseiny, Mohamed (arXiv, 2023-01-30) [Preprint]
    Offline RL methods have been shown to reduce the need for environment interaction by training agents using offline collected episodes. However, these methods typically require action information to be logged during data collection, which can be difficult or even impossible in some practical cases. In this paper, we investigate the potential of using action-free offline datasets to improve online reinforcement learning, and name this problem Reinforcement Learning with Action-Free Offline Pretraining (AFP-RL). We introduce Action-Free Guide (AF-Guide), a method that guides online training by extracting knowledge from action-free offline datasets. AF-Guide consists of an Action-Free Decision Transformer (AFDT), which implements a variant of Upside-Down Reinforcement Learning and learns to plan the next states from the offline dataset, and a Guided Soft Actor-Critic (Guided SAC) that learns online with guidance from AFDT. Experimental results show that AF-Guide can improve sample efficiency and performance in online training thanks to the knowledge from the action-free offline dataset.
  • Catalyst Acceleration of Error Compensated Methods Leads to Better Communication Complexity

    Qian, Xun; Dong, Hanze; Zhang, Tong; Richtarik, Peter (arXiv, 2023-01-24) [Preprint]
    Communication overhead is well known to be a key bottleneck in large scale distributed learning, and a particularly successful class of methods which help to overcome this bottleneck is based on the idea of communication compression. Some of the most practically effective gradient compressors, such as TopK, are biased, which causes convergence issues unless one employs a well designed error compensation/feedback mechanism. Error compensation is therefore a fundamental technique in the distributed learning literature. In a recent development, Qian et al. (NeurIPS 2021) showed that the error-compensation mechanism can be combined with acceleration/momentum, which is another key and highly successful optimization technique. In particular, they developed the error-compensated loop-less Katyusha (ECLK) method, and proved an accelerated linear rate in the strongly convex case. However, the dependence of their rate on the compressor parameter does not match the best dependence obtainable in the non-accelerated error-compensated methods. Our work addresses this problem. We propose several new accelerated error-compensated methods using the catalyst acceleration technique, and obtain results that match the best dependence on the compressor parameter in non-accelerated error-compensated methods up to logarithmic terms.
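    The error compensation/feedback mechanism the abstract refers to is, in its classical form, only a few lines; the sketch below shows a biased Top-K compressor wrapped in error feedback for a single worker, not the accelerated ECLK or catalyst-based methods themselves.
```python
# Classical error feedback around a biased Top-K compressor (single worker, one step);
# illustrative only, not the accelerated ECLK / catalyst methods from the paper.
import numpy as np

def topk(v: np.ndarray, k: int) -> np.ndarray:
    idx = np.argsort(np.abs(v))[-k:]               # keep the k largest-magnitude entries
    out = np.zeros_like(v)
    out[idx] = v[idx]
    return out

def ef_step(x, grad_fn, error, lr=0.1, k=2):
    """Compress (step + carried error); carry over whatever the compressor dropped."""
    update = lr * grad_fn(x) + error
    compressed = topk(update, k)                   # what is actually communicated/applied
    return x - compressed, update - compressed     # new iterate, new error memory
```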
  • Knowledge-aware Global Reasoning for Situation Recognition

    Yu, Weijiang; Wang, Haofan; Li, Guohao; Xiao, Nong; Ghanem, Bernard (IEEE Transactions on Pattern Analysis and Machine Intelligence, Institute of Electrical and Electronics Engineers (IEEE), 2023-01-23) [Article]
    The task of situation recognition aims to solve a visual reasoning problem: predicting the activity happening in an image (the salient action) and the nouns of all semantic roles participating in that activity. This poses severe challenges due to long-tailed data distributions and local class ambiguities. Prior works only propagate local noun-level features within a single image without utilizing global information. We propose a Knowledge-aware Global Reasoning (KGR) framework to endow neural networks with the capability of adaptive global reasoning over nouns by exploiting diverse statistical knowledge. Our KGR is a local-global architecture, which consists of a local encoder that generates noun features using local relations and a global encoder that enhances the noun features via global reasoning supervised by an external global knowledge pool. The global knowledge pool is created by counting the pairwise relationships of nouns in the dataset. In this paper, we design action-guided pairwise knowledge as the global knowledge pool, based on the characteristics of the situation recognition task. Extensive experiments show that our KGR not only achieves state-of-the-art results on a large-scale situation recognition benchmark, but also effectively solves the long-tailed problem of noun classification through our global knowledge.
