Recent Submissions

• In silico exploration of Red Sea Bacillus genomes for natural product biosynthetic gene clusters

(Springer Nature, 2018-05-22)
Background: The increasing spectrum of multidrug-resistant bacteria is a major global public health concern, necessitating the discovery of novel antimicrobial agents. Here, members of the genus Bacillus are investigated as a potentially attractive source of novel antibiotics due to their broad spectrum of antimicrobial activities. We specifically focus on a computational analysis of the distinctive biosynthetic potential of Bacillus paralicheniformis strains isolated from the Red Sea, an ecosystem exposed to adverse, highly saline and hot conditions. Results: We report the complete circular and annotated genomes of two Red Sea strains, B. paralicheniformis Bac48, isolated from mangrove mud, and B. paralicheniformis Bac84, isolated from a microbial mat collected from Rabigh Harbor Lagoon in Saudi Arabia. Comparing the genomes of B. paralicheniformis Bac48 and B. paralicheniformis Bac84 with nine publicly available complete genomes of B. licheniformis and three genomes of B. paralicheniformis revealed that all of the B. paralicheniformis strains in this study are more enriched in nonribosomal peptides (NRPs). We further report the first computationally identified trans-acyltransferase (trans-AT) nonribosomal peptide synthetase/polyketide synthase (PKS/NRPS) cluster in strains of this species. Conclusions: B. paralicheniformis species have more genes associated with the biosynthesis of antimicrobial bioactive compounds than previously characterized species of B. licheniformis, which suggests that these species are better potential sources of novel antibiotics. Moreover, the genome of the Red Sea strain B. paralicheniformis Bac48 is more enriched in modular PKS genes than those of B. licheniformis strains and other B. paralicheniformis strains. This may be linked to adaptations these strains underwent to survive in the relatively hot and saline ecosystems of the Red Sea.
• GLAM: Glycogen-derived Lactate Absorption Map for visual analysis of dense and sparse surface reconstructions of rodent brain structures on desktop systems and virtual environments

(Elsevier BV, 2018-05-21)
The human brain contains about one hundred billion neurons, but these cannot work properly without ultrastructural and metabolic support. For this reason, mammalian brains host another type of cell, glial cells, whose role is to maintain proper conditions for efficient neuronal function. One type of glial cell, the astrocyte, is involved in particular in the metabolic support of neurons: astrocytes feed neurons with lactate, a byproduct of the glucose that they take up from blood vessels and store in another form, glycogen granules. These energy-storage molecules, roughly spherical with diameters of about 10–80 nanometers, can be readily recognized using electron microscopy, the only technique whose resolution is high enough to resolve them. Understanding and quantifying their distribution is of particular relevance to neuroscientists seeking to understand where and when neurons use energy in this form. To answer this question, we developed a visualization technique, dubbed GLAM (Glycogen-derived Lactate Absorption Map), customized for analyzing the interaction of astrocytic glycogen with surrounding neurites in order to formulate hypotheses on the energy absorption mechanisms. The method integrates high-resolution surface reconstructions of neurites, astrocytes, and the energy sources in the form of glycogen granules, obtained from automated serial electron microscopy methods such as focused ion beam scanning electron microscopy (FIB-SEM) or serial block-face electron microscopy (SBEM), with an absorption map computed as a radiance transfer mechanism. The resulting visual representation provides an immediate and comprehensible illustration of the areas in which the probability of lactate shuttling is higher. The computed dataset can then be explored and quantified in 3D space, using either 3D modeling software or virtual reality environments.
Domain scientists have evaluated the technique, either by using the computed maps to formulate functional hypotheses or by planning sparse reconstructions to avoid excessive occlusion. Furthermore, we conducted a pioneering user study showing that immersive VR setups can ease the investigation of the areas of interest and the analysis of the absorption patterns in the cellular structures.
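The core of the absorption map — each surface point accumulating influence from nearby glycogen granules, with the influence falling off with distance — can be sketched in a few lines. This is only a toy illustration under an assumed inverse-square kernel; the function and variable names are ours, not the paper's, and the actual GLAM computation is a full radiance-transfer mechanism over reconstructed surfaces.

```python
def absorption_map(surface_points, granules, eps=1e-6):
    """Toy GLAM-style map: each 3D surface point accumulates influence
    from every glycogen granule with inverse-square distance falloff.
    (Illustrative kernel only; the paper's transfer function may differ.)"""
    values = []
    for p in surface_points:
        v = 0.0
        for g in granules:
            d2 = sum((a - b) ** 2 for a, b in zip(p, g))  # squared distance
            v += 1.0 / (d2 + eps)  # eps avoids division by zero at contact
        values.append(v)
    return values

# A neurite vertex right next to a granule scores higher than a distant one.
granules = [(0.0, 0.0, 0.0)]
near, far = absorption_map([(0.1, 0.0, 0.0), (5.0, 0.0, 0.0)], granules)
```

Coloring each surface vertex by its accumulated value yields the kind of probability-of-shuttling heat map the abstract describes.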
• Isotropic Surface Remeshing without Large and Small Angles

(Institute of Electrical and Electronics Engineers (IEEE), 2018-05-18)
We introduce a novel algorithm for isotropic surface remeshing that progressively eliminates obtuse triangles and improves small angles. The main novelty of the proposed approach is a simple vertex insertion scheme that facilitates the removal of large angles, and a vertex removal operation that improves the distribution of small angles. In combination with other standard local mesh operators, e.g., connectivity optimization and local tangential smoothing, our algorithm can efficiently remesh a low-quality mesh surface. Our approach can be applied directly or used as a post-processing step following other remeshing approaches. Our method has computational efficiency similar to the fastest available approach, i.e., real-time adaptive remeshing [1]. In comparison with state-of-the-art approaches, our method consistently generates better results based on evaluations using different metrics.
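The quality criterion targeted above — no large (obtuse) angles and no tiny angles — can be checked with a few lines of plane geometry. The sketch below (function names are ours, not the paper's) computes interior angles via the law of cosines and reports the extreme angles of a toy 2D mesh:

```python
import math

def triangle_angles(a, b, c):
    """Interior angles (degrees) of the triangle with 2D vertices a, b, c."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    la, lb, lc = dist(b, c), dist(a, c), dist(a, b)  # side opposite each vertex
    # Law of cosines for two angles; the third follows from the angle sum.
    A = math.degrees(math.acos((lb**2 + lc**2 - la**2) / (2 * lb * lc)))
    B = math.degrees(math.acos((la**2 + lc**2 - lb**2) / (2 * la * lc)))
    return A, B, 180.0 - A - B

def mesh_angle_quality(triangles):
    """Smallest and largest interior angle over a list of triangles."""
    angles = [ang for tri in triangles for ang in triangle_angles(*tri)]
    return min(angles), max(angles)

# A right triangle has a 90-degree angle (not acute); an equilateral one is ideal.
tri_bad = [((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))]
tri_good = [((0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2))]
```

A remesher in the spirit of the abstract would drive `mesh_angle_quality` toward the equilateral case by local vertex insertions and removals.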
• Battling Latency in Modern Wireless Networks

(Institute of Electrical and Electronics Engineers (IEEE), 2018-05-15)
Buffer sizing has a tremendous effect on the performance of Wi-Fi based networks. Choosing the right buffer size is challenging due to the dynamic nature of the wireless environment. Overbuffering, or ‘bufferbloat’, may produce unacceptable end-to-end delays, while small buffers may limit the performance gains that can be obtained with various IEEE 802.11n/ac enhancements, such as frame aggregation. We propose Wireless Queue Management (WQM), a novel, practical, and lightweight queue management scheme for wireless networks. WQM adapts the buffer size based on the wireless link characteristics and the network load, and it accounts for aggregate length when deciding on the optimal buffer size. We evaluate WQM using our 10-node wireless testbed. WQM reduces the end-to-end delay by an order of magnitude compared to the default buffer size in Linux while achieving similar network throughput. WQM also outperforms state-of-the-art bufferbloat solutions, namely CoDel and PIE, achieving 7× lower latency than PIE and 2× lower than CoDel, at the cost of an 8% drop in goodput in the worst case. Further, WQM improves network fairness, as it limits the ability of a single flow to saturate the buffers.
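The sizing logic described above can be sketched as a back-of-the-envelope calculation: bound the buffer by how much data the current link rate can drain within a delay target, then round to whole aggregates so frame aggregation is not starved. All names and numbers below are illustrative assumptions, not WQM's actual algorithm.

```python
import math

def adaptive_buffer_size(link_rate_mbps, target_delay_ms, aggregate_kb,
                         min_aggregates=2):
    """Hypothetical WQM-style sizing: cap the buffer so that draining it
    at the current link rate takes no longer than the delay target, then
    round up to a whole number of aggregates (1 KB = 1000 bytes here)."""
    drainable_kb = link_rate_mbps / 8 * 1000 * target_delay_ms / 1000
    aggregates = max(min_aggregates, math.ceil(drainable_kb / aggregate_kb))
    return aggregates * aggregate_kb  # buffer size in KB

# A faster link can sustain a larger buffer under the same delay target.
small = adaptive_buffer_size(link_rate_mbps=6, target_delay_ms=20, aggregate_kb=4)
large = adaptive_buffer_size(link_rate_mbps=300, target_delay_ms=20, aggregate_kb=4)
```

The key design point the abstract makes is that the right size is a moving target: as the link rate or load changes, the same delay budget implies a different buffer.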
• DroidEnsemble: Detecting Android Malicious Applications with Ensemble of String and Structural Static Features

(Institute of Electrical and Electronics Engineers (IEEE), 2018-05-11)
The Android platform now dominates mobile device operating systems. However, the dramatic increase in Android malicious applications (malapps) has caused serious software failures in the Android system and poses a great threat to users. Effective detection of Android malapps has thus become an emerging yet crucial issue. Characterizing the behaviors of Android applications (apps) is essential to detecting malapps. Most existing work on detecting Android malapps is based mainly on string static features, such as permissions and API usage, extracted from apps. There is also work on detecting Android malapps with structural features, such as the Control Flow Graph (CFG) and Data Flow Graph (DFG). As Android malapps have become increasingly polymorphic and sophisticated, using only one type of static feature may result in false negatives. In this work, we propose DroidEnsemble, which takes advantage of both string features and structural features to systematically and comprehensively characterize the static behaviors of Android apps and thus build a more accurate model for the detection of Android malapps. We extract each app's string features, including permissions, hardware features, filter intents, restricted API calls, used permissions, and code patterns, as well as structural features such as the function call graph. We then use three machine learning algorithms, namely Support Vector Machine (SVM), k-Nearest Neighbor (kNN), and Random Forest (RF), to evaluate the performance of these two types of features and of their ensemble. In the experiments, we evaluate our methods and models with 1386 benign apps and 1296 malapps. Extensive experimental results demonstrate the effectiveness of DroidEnsemble: it achieves a detection accuracy of 95.8% with only string features and 90.68% with only structural features, and reaches 98.4% with the ensemble of both types of features, reducing the number of false positives by 9 and false negatives by 12 compared to the results with only string features.
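Feature-level ensembling of the two static views can be sketched as vector concatenation followed by any off-the-shelf classifier. The toy stand-in below (a 1-nearest-neighbor classifier over made-up feature values; the paper uses SVM, kNN, and RF over real extracted features) illustrates only the combination step:

```python
def ensemble_features(string_feats, structural_feats):
    """Concatenate the two static-feature views into one vector
    (a sketch of feature-level ensembling; the paper's fusion may differ)."""
    return string_feats + structural_feats

def nearest_neighbor_label(x, train):
    """Toy 1-NN stand-in for the paper's SVM/kNN/RF classifiers."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(train, key=lambda item: dist(x, item[0]))[1]

# Hypothetical apps: (permission-style string features, call-graph-style
# structural features) -> label. Values are invented for illustration.
train = [
    (ensemble_features([1, 1, 0], [0.9, 0.1]), "malapp"),
    (ensemble_features([0, 0, 1], [0.1, 0.8]), "benign"),
]
query = ensemble_features([1, 0, 0], [0.8, 0.2])
```

Because each view can miss a polymorphic sample the other catches, the concatenated vector gives the classifier both signals at once, which is the intuition behind the accuracy gain reported above.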
• SupportNet: a novel incremental learning framework through deep learning and support data

(Cold Spring Harbor Laboratory, 2018-05-08)
Motivation: In most biological datasets, the amount of data is continually growing and the number of classes is continuously increasing. To deal with new data from new classes, one approach is to train a classification model, e.g., a deep learning model, from scratch on both the old and new data. This approach is highly computationally costly, and the extracted features are likely to differ greatly from those extracted by the model trained on the old data alone, which leads to poor model robustness. Another approach is to fine-tune the model trained on the old data using the new data. However, this approach often cannot learn new knowledge without forgetting the previously learned knowledge, a problem known as catastrophic forgetting. To our knowledge, this problem has not been studied in the field of bioinformatics despite its existence in many bioinformatics problems. Results: Here we propose a novel method, SupportNet, to solve the catastrophic forgetting problem efficiently and effectively. SupportNet combines the strengths of deep learning and the support vector machine (SVM): the SVM is used to identify the support data from the old data, which are fed to the deep learning model together with the new data for further training, so that the model can review the essential information of the old data when learning the new information. Two powerful consolidation regularizers are applied to ensure the robustness of the learned model. Comprehensive experiments on various tasks, including enzyme function prediction, subcellular structure classification, and breast tumor classification, show that SupportNet drastically outperforms state-of-the-art incremental learning methods and reaches performance similar to that of a deep learning model trained from scratch on both the old and new data. Availability: Our program is accessible at https://github.com/lykaust15/SupportNet.
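The support-data idea can be sketched in a few lines: keep the old examples closest to the decision boundary (the analogue of SVM support vectors) as a compact review set for incremental training. The scorer and names below are hypothetical illustrations, not SupportNet's implementation.

```python
def select_support_data(old_data, score, k):
    """Keep the k old examples with the smallest |score|, i.e. closest to
    the decision boundary -- the analogue of SVM support vectors that
    SupportNet retains as a 'review set' for incremental training."""
    return sorted(old_data, key=lambda x: abs(score(x)))[:k]

# Hypothetical 1-D scorer: decision boundary at 0, old examples around it.
old_data = [-3.0, -0.4, 0.2, 1.5, 4.0]
score = lambda x: x  # identity scorer: distance to the boundary is |x|
support = select_support_data(old_data, score, k=2)
# Training then continues on support + new-class data, so the essential
# old information is 'reviewed' while the new classes are learned.
```

The design rationale is that boundary-adjacent examples carry most of the information needed to preserve the old decision surface, so rehearsing only them keeps memory and compute costs low.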
• Use of unmanned aerial vehicles for efficient beach litter monitoring

(Elsevier BV, 2018-05-05)
A global beach litter assessment is challenged by the use of low-efficiency methodologies and incomparable protocols that impede data integration and acquisition at a national scale. The implementation of an objective, reproducible and efficient approach is therefore required. Here we show the application of a remote-sensing-based methodology using a test beach located on the Saudi Arabian Red Sea coastline. Litter was recorded via image acquisition from an Unmanned Aerial Vehicle, while automatic processing of the high volume of imagery was developed through machine learning and employed for debris detection and classification into three categories. Application of the method covered the beach almost 40 times faster than a standard visual-census approach. While the machine learning tool faced some challenges in correctly detecting objects of interest, first classification results are promising and motivate efforts to further develop the technique and implement it at much larger scales.
• DeepPVP: phenotype-based prioritization of causative variants using deep learning

(Cold Spring Harbor Laboratory, 2018-05-02)
Background: Prioritization of variants in personal genomic data is a major challenge. Recently, computational methods that rely on comparing phenotype similarity have been shown to be useful for identifying causative variants. In these methods, pathogenicity prediction is combined with a semantic similarity measure to prioritize not only variants that are likely to be dysfunctional but also those that are likely involved in the pathogenesis of a patient's phenotype. Results: We have developed DeepPVP, a variant prioritization method that combines automated inference with deep neural networks to identify the likely causative variants in whole exome or whole genome sequence data. We demonstrate that DeepPVP performs significantly better than existing methods, including phenotype-based methods that use similar features. DeepPVP is freely available at https://github.com/bio-ontology-research-group/phenomenet-vp. Conclusions: DeepPVP further improves on existing variant prioritization methods in terms of both speed and accuracy.
• OligoPVP: Phenotype-driven analysis of individual genomic information to prioritize oligogenic disease variants

(Cold Spring Harbor Laboratory, 2018-05-02)
Purpose: An increasing number of Mendelian disorders have been identified for which two or more variants in one or more genes are required to cause the disease, or to significantly modify its severity or phenotype. It is difficult to discover such interactions using existing approaches. The purpose of our work is to develop and evaluate a system that can identify combinations of variants underlying oligogenic diseases in individual whole exome or whole genome sequences. Methods: Information that links patient phenotypes to databases of gene-phenotype associations observed in clinical research can improve variant prioritization for Mendelian diseases. Additionally, background knowledge about interactions between genes can be utilized to guide and restrict the selection of candidate disease modules. Results: We developed OligoPVP, an algorithm that identifies variants in oligogenic diseases and their interactions, using whole exome or whole genome sequences together with patient phenotypes as input. We demonstrate that OligoPVP significantly improves performance when compared to state-of-the-art pathogenicity detection methods. Conclusions: Our results show that OligoPVP can efficiently detect oligogenic interactions using a phenotype-driven approach and identify etiologically important variants in whole genomes.
• Semantic Disease Gene Embeddings (SmuDGE): phenotype-based disease gene prioritization without phenotypes

(Cold Spring Harbor Laboratory, 2018-04-30)
In recent years, several methods have been developed to incorporate information about phenotypes into computational disease gene prioritization methods. These methods commonly compute the similarity between a disease's (or patient's) phenotypes and a database of gene-to-phenotype associations to find the phenotypically most similar match. A key limitation of these methods is their reliance on knowledge about the phenotypes associated with particular genes, which is highly incomplete in humans as well as in many model organisms such as the mouse. Results: We developed SmuDGE, a method that uses feature learning to generate vector-based representations of the phenotypes associated with an entity. SmuDGE can be used as a trainable semantic similarity measure to compare two sets of phenotypes (such as those of a disease and a gene, or a disease and a patient). More importantly, SmuDGE can generate phenotype representations for entities that are only indirectly associated with phenotypes through an interaction network; for this purpose, SmuDGE exploits background knowledge in interaction networks comprising multiple types of interactions. We demonstrate that SmuDGE can match or outperform semantic similarity in phenotype-based disease gene prioritization, and that it significantly extends the coverage of phenotype-based methods to all genes in a connected interaction network.
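The indirect-association idea — giving an unannotated gene phenotype signal through its network neighbors — can be illustrated with a simple set union. SmuDGE itself learns vector embeddings over the network; the sketch below, with invented gene and HPO identifiers, only shows where the indirect signal comes from.

```python
def inherit_phenotypes(gene, gene_phenotypes, interactions):
    """Sketch of indirect association: a gene with no phenotype annotations
    borrows the phenotypes of its interaction-network neighbors.
    (Illustration only; SmuDGE learns embeddings rather than set unions.)"""
    if gene in gene_phenotypes:
        return set(gene_phenotypes[gene])
    # Collect neighbors from an undirected edge list of (gene_a, gene_b) pairs.
    neighbors = {b for a, b in interactions if a == gene}
    neighbors |= {a for a, b in interactions if b == gene}
    phenos = set()
    for n in neighbors:
        phenos |= set(gene_phenotypes.get(n, ()))
    return phenos

# GENE3 has no annotations but interacts with two annotated genes.
annotations = {"GENE1": {"HP:0001250"}, "GENE2": {"HP:0004322"}}
network = [("GENE1", "GENE3"), ("GENE3", "GENE2")]
inferred = inherit_phenotypes("GENE3", annotations, network)
```

This is what lets a phenotype-based method cover every gene in a connected interaction network, as the abstract claims, rather than only the annotated ones.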
• Discriminative Transfer Learning for General Image Restoration

(Institute of Electrical and Electronics Engineers (IEEE), 2018-04-30)
Recently, several discriminative learning approaches have been proposed for effective image restoration, achieving a convincing trade-off between image quality and computational efficiency. However, these methods require separate training for each restoration task (e.g., denoising, deblurring, demosaicing) and problem condition (e.g., the noise level of input images), which makes it time-consuming and difficult to encompass all tasks and conditions during training. In this paper, we propose a discriminative transfer learning method that incorporates formal proximal optimization and discriminative learning for general image restoration. The method requires only single-pass discriminative training and allows for reuse across various problems and conditions while achieving efficiency comparable to previous discriminative approaches. Furthermore, after being trained, our model can easily be transferred to new likelihood terms to solve untrained tasks, or be combined with existing priors to further improve image restoration quality.
• RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning

(Public Library of Science (PLoS), 2018-04-26)
Anonymized electronic medical records are an increasingly popular source of research data. However, these datasets often lack race and ethnicity information. This creates problems for researchers modeling human disease, as race and ethnicity are powerful confounders for many health exposures and treatment outcomes and are closely linked to population-specific genetic variation. We showed that deep neural networks generate more accurate estimates for missing racial and ethnic information than competing methods (e.g., logistic regression, random forest, support vector machines, and gradient-boosted decision trees). RIDDLE yielded significantly better classification performance across all metrics that were considered: accuracy, cross-entropy loss (error), precision, recall, and area under the receiver operating characteristic curve (all p < 10⁻⁹). We made specific efforts to interpret the trained neural network models to identify, quantify, and visualize medical features that are predictive of race and ethnicity. We used these characterizations of informative features to perform a systematic comparison of differential disease patterns by race and ethnicity. The fact that clinical histories are informative for imputing race and ethnicity could reflect (1) a skewed distribution of blue- and white-collar professions across racial and ethnic groups, (2) uneven accessibility and subjective importance of prophylactic health, (3) possible variation in lifestyle, such as dietary habits, and (4) differences in background genetic variation which predispose to diseases.
• Transcriptional landscape of Mycobacterium tuberculosis infection in macrophages

(Springer Nature, 2018-04-24)
Mycobacterium tuberculosis (Mtb) infection involves complex and dynamic host-pathogen interactions, leading to host protection or pathogenesis. Using a unique transcriptome technology (CAGE), we investigated the promoter-based transcriptional landscape of IFNγ (M1)- or IL-4/IL-13 (M2)-stimulated macrophages during Mtb infection in a time-kinetic manner. Mtb infection widely and drastically altered macrophage-specific gene expression, to an extent far greater than that of M1 or M2 activation. Gene Ontology enrichment analysis of the Mtb-induced differentially expressed genes revealed various terms related to host protection and inflammation enriched in up-regulated genes, whereas terms related to dysregulation of cellular functions were enriched in down-regulated genes. Differential expression analysis revealed known as well as novel transcription factor genes in Mtb infection, many of them significantly down-regulated. IFNγ or IL-4/IL-13 pre-stimulation induced additional differentially expressed genes in Mtb-infected macrophages, and cluster analysis uncovered a significant number of genes with prolonged expression changes. Furthermore, Mtb infection augmented cytokine-mediated M1 and M2 pre-activation. In addition, we identified unique transcriptional features of Mtb-mediated differentially expressed lncRNAs. In summary, we provide a comprehensive, in-depth gene expression/regulation profile of Mtb-infected macrophages, an important step toward a better understanding of host-pathogen interaction dynamics in Mtb infection.
• 665 Nail lesions in 30 old inbred mouse strains

(Elsevier BV, 2018-04-19)
• Weighted Low-Rank Approximation of Matrices and Background Modeling

(arXiv, 2018-04-15)
We primarily study a special weighted low-rank approximation of matrices and then apply it to the background modeling problem. We propose two algorithms for this purpose: one operates in batch mode on the entire data, while the other operates in batch-incremental mode, naturally captures more background variation, and is computationally more effective. Moreover, we propose a robust technique that learns the background frame indices from the data and does not require any training frames. We demonstrate through extensive experiments that inserting a simple weight in the Frobenius norm makes it robust to outliers, similar to the $\ell_1$ norm. Our methods match or outperform several state-of-the-art online and batch background modeling methods in virtually all quantitative and qualitative measures.
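The weighted-Frobenius idea can be demonstrated on the smallest possible case: a rank-1 fit minimizing sum over i,j of W[i][j] * (A[i][j] - u[i]*v[j])**2 by alternating closed-form least-squares updates. This is a generic textbook sketch, not the paper's algorithms; the point it illustrates is that down-weighting a suspected outlier entry keeps it from distorting the fit.

```python
def weighted_rank1(A, W, iters=50):
    """Alternating least squares for a weighted rank-1 approximation:
    minimize sum_ij W[i][j] * (A[i][j] - u[i]*v[j])**2.
    Small weights on outlier entries yield an l1-like robustness."""
    m, n = len(A), len(A[0])
    u, v = [1.0] * m, [1.0] * n
    for _ in range(iters):
        for i in range(m):  # closed-form update of u with v fixed
            num = sum(W[i][j] * A[i][j] * v[j] for j in range(n))
            den = sum(W[i][j] * v[j] ** 2 for j in range(n)) or 1.0
            u[i] = num / den
        for j in range(n):  # closed-form update of v with u fixed
            num = sum(W[i][j] * A[i][j] * u[i] for i in range(m))
            den = sum(W[i][j] * u[i] ** 2 for i in range(m)) or 1.0
            v[j] = num / den
    return u, v

# A would be rank-1 ([[1, 2], [2, 4]]) except one corrupted entry.
A = [[1.0, 2.0], [2.0, 100.0]]   # A[1][1] should be 4.0 but is an outlier
W = [[1.0, 1.0], [1.0, 0.0]]     # zero weight on the corrupted entry
u, v = weighted_rank1(A, W)      # u[1]*v[1] recovers ~4.0, ignoring the outlier
```

With uniform weights the same fit would be dragged toward the corrupted value 100; the zero weight lets the rank-1 structure of the clean entries dictate the reconstruction.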
• A Multilayer Perceptron-Based Impulsive Noise Detector with Application to Power-Line-Based Sensor Networks

(Institute of Electrical and Electronics Engineers (IEEE), 2018-04-10)
For power-line-based sensor networks, impulsive noise (IN) dramatically degrades the data transmission rate over the power line. In this paper, we present a multilayer perceptron (MLP)-based approach to detecting IN in orthogonal frequency-division multiplexing (OFDM)-based baseband power line communications (PLC). Combining the MLP-based IN detection method with outlier detection theory allows more accurate identification of the harmful residual IN. For OFDM-based PLC systems, the high peak-to-average power ratio (PAPR) of the received signal makes detection of harmful residual IN more challenging. The detection mechanism works in an iterative receiver that contains a pre-IN mitigation stage and a post-IN mitigation stage: the pre-IN mitigation nulls the stronger portion of the IN, while the post-IN mitigation suppresses the residual portion of the IN through an iterative process. Simulation results show that our MLP-based IN detector improves the resulting bit error rate (BER) performance compared with previously reported IN detectors.
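The pre-mitigation stage — nulling the strong portion of the impulsive noise before the receiver iteratively cleans up the residual — can be illustrated with a simple magnitude threshold. This is a toy stand-in for the MLP detector; the threshold value and names are assumptions for illustration.

```python
def pre_in_mitigation(samples, threshold):
    """Null any sample whose magnitude exceeds the threshold, removing
    the strong portion of impulsive noise; the weaker residual IN is
    left for a later, more sensitive detection stage."""
    return [0.0 if abs(s) > threshold else s for s in samples]

# Typical received samples with one strong impulsive-noise hit.
received = [0.5, -0.3, 9.0, 0.7]
cleaned = pre_in_mitigation(received, threshold=3.0)  # -> [0.5, -0.3, 0.0, 0.7]
```

The abstract's point is that a fixed threshold like this struggles with residual IN buried under a high-PAPR OFDM signal, which is where the learned MLP detector takes over.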
• Supervised Convolutional Sparse Coding

(arXiv, 2018-04-08)
Convolutional Sparse Coding (CSC) is a well-established image representation model especially suited for image restoration tasks. In this work, we extend the applicability of this model by proposing a supervised approach to convolutional sparse coding that aims at learning discriminative dictionaries instead of purely reconstructive ones. We incorporate a supervised regularization term into the traditional unsupervised CSC objective to encourage the final dictionary elements to be discriminative. Experimental results show that supervised convolutional learning yields two key advantages: first, we learn more semantically relevant filters in the dictionary, and second, we achieve improved image reconstruction on unseen data.
• Parallel trajectory similarity joins in spatial networks

(Springer Nature, 2018-04-04)
The matching of similar pairs of objects, called similarity join, is fundamental functionality in data management. We consider two cases of trajectory similarity joins (TS-Joins): a threshold-based join (Tb-TS-Join) and a top-k TS-Join (k-TS-Join), where the objects are trajectories of vehicles moving in road networks. Given two sets of trajectories and a threshold θ, the Tb-TS-Join returns all pairs of trajectories from the two sets with similarity above θ. In contrast, the k-TS-Join takes no threshold parameter and returns the top-k most similar trajectory pairs from the two sets. The TS-Joins target diverse applications such as trajectory near-duplicate detection, data cleaning, ridesharing recommendation, and traffic congestion prediction. With these applications in mind, we provide purposeful definitions of similarity. To enable efficient processing of the TS-Joins on large sets of trajectories, we develop search space pruning techniques and exploit the parallel processing capabilities of modern processors. Specifically, we present a two-phase divide-and-conquer search framework that lays the foundation for the Tb-TS-Join and k-TS-Join algorithms, which rely on different pruning techniques to achieve efficiency. For each trajectory, the algorithms first find similar trajectories and then merge the results to obtain the final result. The algorithms for the two joins exploit different upper and lower bounds on the spatiotemporal trajectory similarity and different heuristic scheduling strategies for search space pruning. Their per-trajectory searches are independent of each other and can be performed in parallel, and the merge steps have constant cost. An empirical study with real data offers insight into the performance of the algorithms and demonstrates that they are capable of outperforming well-designed baseline algorithms by an order of magnitude.
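The per-trajectory-search-then-merge structure described above can be sketched directly: each trajectory's search is an independent task, so the searches parallelize trivially. The similarity function below is a toy stand-in (inverse mean pointwise distance over equal-length point lists) for the paper's network-based spatiotemporal measure, and all names are ours.

```python
from concurrent.futures import ThreadPoolExecutor

def similarity(t1, t2):
    """Toy similarity in (0, 1]: inverse of the mean pointwise Manhattan
    distance (stand-in for the paper's road-network-aware measure)."""
    d = sum(abs(a[0] - b[0]) + abs(a[1] - b[1]) for a, b in zip(t1, t2)) / len(t1)
    return 1.0 / (1.0 + d)

def per_trajectory_search(t, candidates, theta):
    """One independent task: all candidates similar to t above theta."""
    return [(t, c) for c in candidates if similarity(t, c) >= theta]

def tb_ts_join(P, Q, theta, workers=4):
    """Threshold join: run per-trajectory searches in parallel, then
    merge the partial results (the cheap final step)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda t: per_trajectory_search(t, Q, theta), P))
    return [pair for part in parts for pair in part]

# Two toy trajectory sets; only the first pair is close enough.
P = [[(0, 0), (1, 1)], [(10, 10), (11, 11)]]
Q = [[(0, 0), (1, 2)], [(50, 50), (51, 51)]]
pairs = tb_ts_join(P, Q, theta=0.5)
```

The real algorithms add upper/lower similarity bounds so most candidate pairs are pruned before the full similarity is ever computed; this brute-force sketch shows only the parallel search-and-merge skeleton.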
• Protecting multi-party privacy in location-aware social point-of-interest recommendation

(Springer Nature, 2018-04-04)
Point-of-interest (POI) recommendation has attracted much interest recently because of its significant business potential. Data used in POI recommendation (e.g., the user-location check-in matrix) are much sparser than data used in traditional item (e.g., book and movie) recommendation, which leads to a more serious cold-start problem. Social POI recommendation has proved to be an effective solution, but most existing works assume that recommenders have access to all required data. This is rare in practice, because these data are generally owned by different entities that are unwilling to share them due to privacy and legal concerns. In this paper, we first propose PLAS, a protocol that enables effective POI recommendation without disclosing the sensitive data of any party involved in the recommendation. We formally show that PLAS is secure in the semi-honest adversary model. To improve its performance, we then adopt the technique of cloaking areas, by which expensive distance computation over encrypted data is replaced by cheap operations over plaintext. In addition, we utilize the sparsity of check-ins to selectively publish data, thus reducing encryption cost and avoiding unnecessary computation over ciphertext. Experiments on two real datasets show that our protocol is feasible and can scale to large POI recommendation problems in practice.
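The cloaking-area trick can be sketched as grid snapping: replace an exact location with its containing cell, so closeness can be tested on coarse plaintext cells instead of computing distances over encrypted exact coordinates. This is a minimal sketch of the idea only; the paper's protocol additionally involves encryption and the PLAS machinery, all omitted here, and the names are ours.

```python
def cloak(location, cell_size):
    """Replace an exact (x, y) location with the grid cell containing it,
    hiding the precise coordinates behind a coarse cloaking area."""
    x, y = location
    return (int(x // cell_size), int(y // cell_size))

def cells_within(cell_a, cell_b, max_cell_dist):
    """Cheap plaintext test: are two cloaked cells close enough that the
    underlying points could be within range of each other?"""
    return (abs(cell_a[0] - cell_b[0]) <= max_cell_dist
            and abs(cell_a[1] - cell_b[1]) <= max_cell_dist)

user_cell = cloak((12.3, 45.6), cell_size=10)  # -> (1, 4)
poi_near = cloak((18.0, 41.0), cell_size=10)   # -> (1, 4)
poi_far = cloak((80.0, 90.0), cell_size=10)    # -> (8, 9)
```

Only candidates that pass the cheap cell-level filter would need any expensive (e.g., encrypted) exact-distance computation, which is where the performance gain comes from.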
• UHD Video Transmission over Bi-Directional Underwater Wireless Optical Communication

(Institute of Electrical and Electronics Engineers (IEEE), 2018-04-02)