### Recent Submissions

• #### Adaptive Tikhonov strategies for stochastic ensemble Kalman inversion

(arXiv, 2021-10-18) [Preprint]
• #### Mind the Gap: Domain Gap Control for Single Shot Domain Adaptation for Generative Adversarial Networks

(arXiv, 2021-10-15) [Preprint]
We present a new method for one shot domain adaptation. The input to our method is a trained GAN that can produce images in domain A and a single reference image I_B from domain B. The proposed algorithm can translate any output of the trained GAN from domain A to domain B. There are two main advantages of our method compared to the current state of the art: First, our solution achieves higher visual quality, e.g. by noticeably reducing overfitting. Second, our solution allows for more degrees of freedom to control the domain gap, i.e. what aspects of image I_B are used to define the domain B. Technically, we realize the new method by building on a pre-trained StyleGAN generator as GAN and a pre-trained CLIP model for representing the domain gap. We propose several new regularizers for controlling the domain gap to optimize the weights of the pre-trained StyleGAN generator to output images in domain B instead of domain A. The regularizers prevent the optimization from taking on too many attributes of the single reference image. Our results show significant visual improvements over the state of the art as well as multiple applications that highlight improved control.
• #### A High-Throughput Skim-sequencing Approach for Genotyping, Dosage Estimation and Identifying Translocations

(Research Square Platform LLC, 2021-10-15) [Preprint]
The development of next generation sequencing (NGS) enabled a shift from array-based genotyping to high-throughput genotyping by directly sequencing genomic libraries. Even though whole genome sequencing was initially too costly for routine analysis in large populations, such as those utilized for breeding or genetic studies, continued advancements in genome sequencing and bioinformatics have provided the opportunity to utilize whole-genome information. As new sequencing platforms can routinely provide high-quality sequencing data for sufficient genome coverage, a limitation comes in the time and high cost of library construction when multiplexing a large number of samples. Here we describe a high-throughput whole-genome skim-sequencing (skim-seq) approach that can be utilized for a broad range of genotyping and genomic characterization. Using optimized low-volume Illumina Nextera chemistry, we developed a skim-seq method and combined up to 960 samples in one multiplex library using dual-index barcoding. With dual-index barcoding, the number of samples for multiplexing can be adjusted depending on the amount of data required and extended to 3,072 samples or more. Panels of double haploid wheat lines (Triticum aestivum, CDC Stanley x CDC Landmark), wheat-barley (T. aestivum x Hordeum vulgare) and wheat-wheatgrass (Triticum durum x Thinopyrum intermedium) introgression lines as well as known monosomic wheat stocks were genotyped using the skim-seq approach. Bioinformatics pipelines were developed for various applications where sequencing coverage ranged from 1x down to 0.01x per sample. Using reference genomes, we detected chromosome dosage, identified aneuploidy, and karyotyped introgression lines from the low coverage skim-seq data.
Leveraging the recent advancements in genome sequencing, skim-seq provides an effective and low-cost tool for routine genotyping and genetic analysis, which can track and identify introgressions and genomic regions of interest in genetics research and applied breeding programs.
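
To illustrate how chromosome dosage might be read off low-coverage data, here is a minimal sketch, not the authors' pipeline. It assumes reads have already been aligned and counted per chromosome; the function name `relative_dosage` and the toy counts and lengths are hypothetical. Each chromosome's read depth is divided by the sample-wide median depth, so a monosomic chromosome shows up near 0.5.

```python
from statistics import median


def relative_dosage(read_counts, chrom_lengths):
    """Depth per chromosome relative to the sample's median depth.

    read_counts and chrom_lengths are dicts keyed by chromosome name
    (hypothetical inputs; a real pipeline would derive them from alignments).
    """
    # Depth proxy: reads per unit of chromosome length.
    depth = {c: read_counts[c] / chrom_lengths[c] for c in read_counts}
    # The median is robust to a few aneuploid chromosomes.
    baseline = median(depth.values())
    return {c: depth[c] / baseline for c in depth}


# Toy example: chromosome 1D carries half the expected reads.
counts = {"1A": 1000, "1B": 1000, "1D": 500}
lengths = {"1A": 100, "1B": 100, "1D": 100}
print(relative_dosage(counts, lengths))
```

A value near 1.0 is consistent with the normal copy number and a value near 0.5 with a missing copy; at 0.01x coverage real data would be noisy, so thresholds would need calibration.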
• #### On the Formation of Hydrogen Peroxide in Water Microdroplets

(arXiv, 2021-10-14) [Preprint]
Recent reports on the formation of hydrogen peroxide (H2O2) in water microdroplets produced via pneumatic spraying or capillary condensation have garnered significant attention. How covalent bonds in water could break under such conditions challenges our textbook understanding of physical chemistry and the water substance. While there is no definitive answer, it has been speculated that ultrahigh electric fields at the air-water interface are responsible for this chemical transformation. Here, we resolve this mystery via a comprehensive experimental investigation of H2O2 formation in (i) water microdroplets sprayed over a range of liquid flow rates, (shearing) air flow rates, and air compositions, and (ii) water microdroplets condensed on hydrophobic substrates, formed via hot water or a humidifier under controlled air composition. Specifically, we assessed the contributions of the evaporative concentration and shock waves in sprays and the effects of trace O3(g) on the H2O2 formation. Glovebox experiments revealed that the H2O2 formation in water microdroplets was most sensitive to the air-borne ozone (O3) concentration. In the absence of O3(g), we could not detect H2O2(aq) in sprays or condensates (detection limit ≥250 nM). In contrast, microdroplets exposed to atmospherically relevant O3(g) concentration (10–100 ppb) formed 2–30 μM H2O2(aq); increasing the gas–liquid surface area, mixing, and contact duration increased H2O2(aq) concentration. Thus, the mystery is resolved: the water surface facilitates the O3(g) mass transfer, which is followed by the chemical transformation of O3(aq) into H2O2(aq). These findings should also help us understand the implications of this chemistry in natural and applied contexts.
• #### The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization

(arXiv, 2021-10-14) [Preprint]
Despite successes across a broad range of applications, Transformers have limited success in systematic generalization. The situation is especially frustrating in the case of algorithmic tasks, where they often fail to find intuitive solutions that route relevant information to the right node/operation at the right time in the grid represented by Transformer columns. To facilitate the learning of useful control flow, we propose two modifications to the Transformer architecture, copy gate and geometric attention. Our novel Neural Data Router (NDR) achieves 100% length generalization accuracy on the classic compositional table lookup task, as well as near-perfect accuracy on the simple arithmetic task and a new variant of ListOps testing for generalization across computational depth. NDR's attention and gating patterns tend to be interpretable as an intuitive form of neural routing. Our code is public.
• #### Study on the Effect of Size on InGaN Red Micro-LEDs

(Research Square Platform LLC, 2021-10-13) [Preprint]
In this research, five sizes (100×100, 75×75, 50×50, 25×25, 10×10 µm²) of InGaN red micro-light emitting diode (LED) dies are produced using laser-based direct writing and maskless technology. It is observed that with increasing injection current, the smaller the size of the micro-LED, the more obvious the blue shift of the emission wavelength. When the injection current is increased from 0.1 to 1 mA, the emission wavelength of the 10×10 µm² micro-LED is shifted from 617.15 to 576.87 nm. The obvious blue shift is attributed to the stress release and high current density injection. Moreover, the output power density is very similar for smaller chip micro-LEDs at the same injection current density. This behavior is different from AlGaInP micro-LEDs. The sidewall defect is more easily repaired by passivation, which is similar to the behavior of blue micro-LEDs. The results indicate that the red InGaN epilayer structure provides an opportunity to realize the full color LEDs fabricated by GaN-based LEDs.
• #### Ego4D: Around the World in 3,000 Hours of Egocentric Video

(arXiv, 2021-10-13) [Preprint]
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,025 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 855 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception. Project page: https://ego4d-data.org/
• #### Relation-aware Video Reading Comprehension for Temporal Language Grounding

(arXiv, 2021-10-12) [Preprint]
Temporal language grounding in videos aims to localize the temporal span relevant to the given query sentence. Previous methods treat it either as a boundary regression task or a span extraction task. This paper formulates temporal language grounding as video reading comprehension and proposes a Relation-aware Network (RaNet) to address it. This framework aims to select a video moment choice from the predefined answer set with the aid of coarse-and-fine choice-query interaction and choice-choice relation construction. A choice-query interactor is proposed to match the visual and textual information simultaneously at the sentence-moment and token-moment levels, leading to a coarse-and-fine cross-modal interaction. Moreover, a novel multi-choice relation constructor is introduced by leveraging graph convolution to capture the dependencies among video moment choices for the best choice selection. Extensive experiments on ActivityNet-Captions, TACoS, and Charades-STA demonstrate the effectiveness of our solution. Code will be released soon.
• #### Sanctuary lost: a cyber-physical warfare in space

(arXiv, 2021-10-12) [Preprint]
Over the last decades, space has grown from a purely scientific struggle, fueled by the desire to demonstrate superiority of one regime over the other, to an anchor point of the economies of essentially all developed countries. Many businesses depend crucially on satellite communication or data acquisition, not only for defense purposes, but increasingly also for day-to-day applications. However, although spacefaring nations have so far refrained from extending their earth-bound conflicts into space, this critical infrastructure is not as invulnerable as common knowledge suggests. In this paper, we analyze the threats space vehicles are exposed to and what must change to mitigate them. In particular, we shall focus on cyber threats, which may well be mounted by small countries and terrorist organizations, whose incentives do not necessarily include sustainability of the space domain and who may not be susceptible to the threat of mutual retaliation on the ground. We survey incidents, highlight threats and raise awareness from general preparedness for accidental faults, which is already widespread within the space community, to preparedness and tolerance of both accidental and malicious faults (such as targeted attacks by cyber terrorists and nation-state hackers).
• #### SportsSum2.0: Generating High-Quality Sports News from Live Text Commentary

(arXiv, 2021-10-12) [Preprint]
Sports game summarization aims to generate news articles from live text commentaries. A recent state-of-the-art work, SportsSum, not only constructs a large benchmark dataset, but also proposes a two-step framework. Despite its great contributions, the work has three main drawbacks: 1) the noise in the SportsSum dataset degrades the summarization performance; 2) the neglect of lexical overlap between news and commentaries results in a low-quality pseudo-labeling algorithm; 3) directly concatenating rewritten sentences to form news limits its practicability. In this paper, we publish a new benchmark dataset SportsSum2.0, together with a modified summarization framework. In particular, to obtain a clean dataset, we employ crowd workers to manually clean the original dataset. Moreover, the degree of lexical overlap is incorporated into the generation of pseudo labels. Further, we introduce a reranker-enhanced summarizer to take into account the fluency and expressiveness of the summarized news. Extensive experiments show that our model outperforms the state-of-the-art baseline.
• #### Weak-strong uniqueness for Maxwell-Stefan systems

(arXiv, 2021-10-11) [Preprint]
The weak-strong uniqueness for Maxwell-Stefan systems and some generalized systems is proved. The corresponding parabolic cross-diffusion equations are considered in a bounded domain with no-flux boundary conditions. The key points of the proofs are various inequalities for the relative entropy associated to the systems and the analysis of the spectrum of a quadratic form capturing the frictional dissipation. The latter task is complicated by the singular nature of the diffusion matrix. This difficulty is addressed by proving its positive definiteness on a subspace and using the Bott-Duffin matrix inverse. The generalized Maxwell-Stefan systems are shown to cover several known cross-diffusion systems for the description of tumor growth and physical vapor deposition processes.
• #### Enhancing Fracture Network Characterization: A Data-Driven, Outcrop-Based Analysis

(Wiley, 2021-10-11) [Preprint]
The stochastic discrete fracture network (SDFN) model is a practical approach to model complex fracture systems in the subsurface. However, it is impossible to validate the correctness and quality of an SDFN model because the comprehensive subsurface structure is never known. We utilize a pixel-based fracture detection algorithm to digitize 80 published outcrop maps of different scales at different locations. The key fracture properties, including fracture lengths, orientations, intensities, topological structures, clusters and flow, are then analyzed. Our findings provide significant justifications for statistical distributions used in SDFN modeling. In addition, the shortcomings of current SDFN models are discussed. We find that fracture lengths follow multiple (instead of single) power-law distributions with varying exponents. Large fractures tend to have large exponents, possibly because of a small coalescence probability. Most small-scale natural fracture networks have scattered orientations, corresponding to a small κ value (κ<3) in a von Mises-Fisher distribution. Large fracture systems collected in this research usually have more concentrated orientations with large κ values. Fracture intensities are spatially clustered at all scales. A fractal spatial density distribution, which introduces clustered fracture positions, can better capture the spatial clustering than a uniform distribution. Natural fracture networks usually have a significant proportion of T-type nodes, which is unavailable in conventional SDFN models. Thus a rule-based algorithm to mimic the fracture growth and form T-type nodes is necessary. Most outcrop maps show good topological connectivity. However, sealing patterns and stress impact must be considered to evaluate the hydraulic connectivity of fracture networks.
• #### Graph Models for Biological Pathway Visualization: State of the Art and Future Challenges

(arXiv, 2021-10-10) [Preprint]
The concept of multilayer networks has recently become integrated into complex systems modeling since it encapsulates a very general concept of complex relationships. Biological pathways are an example of complex real-world networks, where vertices represent biological entities, and edges indicate the underlying connectivity. For this reason, using multilayer networks to model biological knowledge allows us to formally cover essential properties and theories in the field, which also raises challenges in visualization. This is because, in the early days of pathway visualization research, only restricted types of graphs, such as simple graphs, clustered graphs, and others were adopted. In this paper, we revisit a heterogeneous definition of biological networks and aim to provide an overview to see the gaps between data modeling and visual representation. The contribution will, therefore, lie in providing guidelines and challenges of using multilayer networks as a unified data structure for biological pathway visualization.
• #### Genomic and metabolic adaptations of biofilms to ecological windows of opportunities in glacier-fed streams

(Cold Spring Harbor Laboratory, 2021-10-08) [Preprint]
Microorganisms dominate life in cryospheric ecosystems. In glacier-fed streams (GFSs), ecological windows of opportunities allow complex microbial biofilms to develop and transiently form the basis of the food web, thereby controlling key ecosystem processes. Here, using high-resolution metagenomics, we unravel strategies that allow biofilms to seize this opportunity in an ecosystem otherwise characterized by harsh environmental conditions. We found a diverse microbiome spanning the entire tree of life and including a rich virome. Various and co-existing energy acquisition pathways point to diverse niches and the simultaneous exploitation of available resources, likely fostering the establishment of complex biofilms in GFSs during windows of opportunity. The wide occurrence of rhodopsins across metagenome-assembled genomes (MAGs), besides chlorophyll, highlights the role of solar energy capture in these biofilms. Concomitantly, internal carbon and nutrient cycling between photoautotrophs and heterotrophs may help overcome constraints imposed by the high oligotrophy in GFSs. MAGs also revealed mechanisms potentially protecting bacteria against low temperatures and high UV-radiation. The selective pressure of the GFS environment is further highlighted by the phylogenomic analysis, differentiating the representatives of the genus Polaromonas, an important component of the GFS microbiome, from those found in other ecosystems. Our findings reveal key genomic underpinnings of adaptive traits that contribute to the success of complex biofilms to exploit environmental opportunities in GFSs, now rapidly changing owing to global warming.
• #### Distributed Methods with Compressed Communication for Solving Variational Inequalities, with Theoretical Guarantees

(arXiv, 2021-10-07) [Preprint]
Variational inequalities in general and saddle point problems in particular are increasingly relevant in machine learning applications, including adversarial learning, GANs, transport and robust optimization. With increasing data and problem sizes necessary to train high performing models across these and other applications, it is necessary to rely on parallel and distributed computing. However, in distributed training, communication among the compute nodes is a key bottleneck, and this problem is exacerbated for high dimensional and over-parameterized models. Due to these considerations, it is important to equip existing methods with strategies that reduce the volume of transmitted information during training while obtaining a model of comparable quality. In this paper, we present the first theoretically grounded distributed methods for solving variational inequalities and saddle point problems using compressed communication: MASHA1 and MASHA2. Our theory and methods allow for the use of both unbiased (such as Randk; MASHA1) and contractive (such as Topk; MASHA2) compressors. We empirically validate our conclusions using two experimental setups: a standard bilinear min-max problem, and large-scale distributed adversarial training of transformers.
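
For readers unfamiliar with the two compressor families named in the abstract, the following is a minimal sketch of their usual textbook definitions, not code from the paper. The function names `top_k` and `rand_k` are illustrative. Topk keeps the k largest-magnitude coordinates (a contractive, biased compressor), while Randk keeps k uniformly chosen coordinates and rescales them by d/k so the result is unbiased in expectation.

```python
import random


def top_k(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    keep = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k]
    out = [0.0] * len(x)
    for i in keep:
        out[i] = x[i]
    return out


def rand_k(x, k, rng):
    """Keep k uniformly chosen entries, scaled by d/k so that E[rand_k(x)] = x."""
    d = len(x)
    keep = rng.sample(range(d), k)
    out = [0.0] * d
    for i in keep:
        out[i] = x[i] * d / k
    return out


x = [1.0, -5.0, 3.0, 0.5]
print(top_k(x, 2))                     # deterministic: keeps -5.0 and 3.0
print(rand_k(x, 2, random.Random(0)))  # random: two entries, each doubled
```

In both cases only k of the d coordinates (plus their indices) need to be transmitted, which is where the communication saving comes from.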
• #### EF21 with Bells & Whistles: Practical Algorithmic Extensions of Modern Error Feedback

(arXiv, 2021-10-07) [Preprint]
• #### Permutation Compressors for Provably Faster Distributed Nonconvex Optimization

(arXiv, 2021-10-07) [Preprint]
We study the MARINA method of Gorbunov et al. (2021) – the current state-of-the-art distributed non-convex optimization method in terms of theoretical communication complexity. Theoretical superiority of this method can be largely attributed to two sources: the use of a carefully engineered biased stochastic gradient estimator, which leads to a reduction in the number of communication rounds, and the reliance on independent stochastic communication compression operators, which leads to a reduction in the number of transmitted bits within each communication round. In this paper we i) extend the theory of MARINA to support a much wider class of potentially correlated compressors, extending the reach of the method beyond the classical independent compressors setting, ii) show that a new quantity, for which we coin the name Hessian variance, allows us to significantly refine the original analysis of MARINA without any additional assumptions, and iii) identify a special class of correlated compressors based on the idea of random permutations, for which we coin the term PermK, the use of which leads to O(√n) (resp. O(1 + d/√n)) improvement in the theoretical communication complexity of MARINA in the low Hessian variance regime when d ≥ n (resp. d ≤ n), where n is the number of workers and d is the number of parameters describing the model we are learning. We corroborate our theoretical results with carefully engineered synthetic experiments with minimizing the average of nonconvex quadratics, and on autoencoder training with the MNIST dataset.
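
The abstract does not spell out the PermK construction, so the sketch below is only one plausible reading of a permutation-based correlated compressor, under the assumption that d is a multiple of n. All workers share one random permutation of the coordinates, worker i sends only its own block of d/n coordinates, and each sent value is scaled by n. The names `perm_k` and `average` are hypothetical, and the paper's exact definition may differ.

```python
import random


def perm_k(vectors, rng):
    """Compress one vector per worker using a shared random permutation.

    Worker i keeps only the i-th block of permuted coordinates, scaled by n.
    Assumption of this sketch: d is a multiple of n.
    """
    n = len(vectors)
    d = len(vectors[0])
    assert d % n == 0, "sketch assumes d is a multiple of n"
    q = d // n
    perm = list(range(d))
    rng.shuffle(perm)  # the same permutation is used by every worker
    compressed = []
    for i, x in enumerate(vectors):
        out = [0.0] * d
        for j in perm[i * q:(i + 1) * q]:
            out[j] = n * x[j]
        compressed.append(out)
    return compressed


def average(vectors):
    """Coordinate-wise mean, as a server would compute it."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


x = [1.0, 2.0, 3.0, 4.0]
sent = perm_k([x, x], random.Random(1))
print(sent)           # each worker transmits only d/n = 2 nonzero entries
print(average(sent))  # identical inputs are recovered exactly
```

Because the blocks are disjoint and cover every coordinate, the server's average reproduces the input exactly when all workers hold the same vector, which independent compressors cannot guarantee.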
• #### Run Time Assurance for Safety-Critical Systems: An Introduction to Safety Filtering Approaches for Complex Control Systems

(arXiv, 2021-10-07) [Preprint]
Run Time Assurance (RTA) Systems are online verification mechanisms that filter an unverified primary controller output to ensure system safety. The primary control may come from a human operator, an advanced control approach, or an autonomous control approach that cannot be verified to the same level as simpler control systems designs. The critical feature of RTA systems is their ability to alter unsafe control inputs explicitly to assure safety. In many cases, RTA systems can functionally be described as containing a monitor that watches the state of the system and output of a primary controller, and a backup controller that replaces or modifies control input when necessary to assure safety. An important quality of an RTA system is that the assurance mechanism is constructed in a way that is entirely agnostic to the underlying structure of the primary controller. By effectively decoupling the enforcement of safety constraints from performance-related objectives, RTA offers a number of useful advantages over traditional (offline) verification. This article provides a tutorial on developing RTA systems.
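
The monitor-plus-backup structure described above can be sketched for a toy one-dimensional system. This is an illustrative assumption, not an example from the tutorial: the state is a position `x` that must stay at or below `x_max`, the control input is a velocity, and the backup controller simply commands zero velocity. The function name `rta_filter` and all parameter values are hypothetical.

```python
def rta_filter(x, u_primary, x_max=10.0, dt=0.1):
    """Pass the primary control through if the next state stays safe.

    Monitor: one-step prediction x + u * dt checked against x_max.
    Backup controller: command zero velocity (stop).
    """
    x_next = x + u_primary * dt
    if x_next <= x_max:
        return u_primary  # primary controller output is safe, leave it alone
    return 0.0            # unsafe: replace with the backup input


print(rta_filter(0.0, 5.0))  # far from the limit, primary input passes
print(rta_filter(9.9, 5.0))  # would overshoot x_max, backup takes over
```

The filter only sees the primary controller's output, never its internals, which mirrors the point that the assurance mechanism is agnostic to how the primary controller is built.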
• #### A Lagged Particle Filter for Stable Filtering of certain High-Dimensional State-Space Models

(arXiv, 2021-10-02) [Preprint]
We consider the problem of high-dimensional filtering of state-space models (SSMs) at discrete times. This problem is particularly challenging as analytical solutions are typically not available and many numerical approximation methods can have a cost that scales exponentially with the dimension of the hidden state. Inspired by lag-approximation methods for the smoothing problem, we introduce a lagged approximation of the smoothing distribution that is necessarily biased. For certain classes of SSMs, particularly those that forget the initial condition exponentially fast in time, the bias of our approximation is shown to be uniformly controlled in the dimension and exponentially small in time. We develop a sequential Monte Carlo (SMC) method to recursively estimate expectations with respect to our biased filtering distributions. Moreover, we prove for a class of non-i.i.d. SSMs that as the dimension $d\rightarrow\infty$ the cost to achieve a stable mean square error in estimation, for classes of expectations, is of $\mathcal{O}(Nd^2)$ per unit time, where $N$ is the number of simulated samples in the SMC algorithm. Our methodology is implemented on several challenging high-dimensional examples including the conservative shallow-water model.