Recent Submissions

• Decision and Inhibitory Trees for Decision Tables with Many-Valued Decisions

(2018-06-06) [Dissertation]
Committee members: Bajic, Vladimir B.; Zhang, Xiangliang; Boros, Endre
Decision trees are one of the most commonly used tools in decision analysis, knowledge representation, machine learning, etc., for their simplicity and interpretability. We consider an extension of the dynamic programming approach to process the whole set of decision trees for a given decision table, which was previously only attainable by brute-force algorithms. We study decision tables with many-valued decisions (each row may contain multiple decisions) because they are more reasonable models of data in many cases. To address this problem in a broad sense, we consider not only decision trees but also inhibitory trees, where terminal nodes are labeled with “≠ decision”. Inhibitory trees can sometimes describe more knowledge from datasets than decision trees. As for cost functions, we consider depth or average depth to minimize the time complexity of trees, and the number of nodes or the number of terminal or nonterminal nodes to minimize the space complexity of trees. We investigate the multi-stage optimization of trees relative to some cost functions, and also the possibility to describe the whole set of strictly optimal trees. Furthermore, we study the bi-criteria optimization cost vs. cost and cost vs. uncertainty for decision trees, and cost vs. cost and cost vs. completeness for inhibitory trees. The most interesting application of the developed technique is the creation of multi-pruning and restricted multi-pruning approaches, which are useful for knowledge representation and prediction. The experimental results show that decision trees constructed by these approaches can often outperform the decision trees constructed by the CART algorithm. Another application includes the comparison of 12 greedy heuristics for single- and bi-criteria optimization (cost vs. cost) of trees.
We also study the three approaches (decision tables with many-valued decisions, decision tables with most common decisions, and decision tables with generalized decisions) to handle inconsistency of decision tables. We also analyze the time complexity of decision and inhibitory trees over arbitrary sets of attributes represented by information systems in the frameworks of local (when we can use in trees only attributes from problem description) and global (when we can use in trees arbitrary attributes from the information system) approaches.
• Exploiting Data Sparsity In Covariance Matrix Computations on Heterogeneous Systems

(2018-05-24) [Dissertation]
Committee members: Genton, Marc G.; Hadwiger, Markus; Ltaief, Hatem; Elster, Ann C.
Covariance matrices are ubiquitous in computational sciences, typically describing the correlation of elements of large multivariate spatial data sets. For example, covariance matrices are employed in climate/weather modeling for the maximum likelihood estimation to improve prediction, as well as in computational ground-based astronomy to enhance the observed image quality by filtering out noise produced by the adaptive optics instruments and atmospheric turbulence. The structure of these covariance matrices is dense, symmetric, positive-definite, and often data-sparse, and therefore hierarchically low-rank. This thesis investigates the performance limit of dense matrix computations (e.g., Cholesky factorization) on covariance matrix problems as the number of unknowns grows, and in the context of the aforementioned applications. We employ recursive formulations of some of the basic linear algebra subroutines (BLAS) to accelerate the covariance matrix computation further, while reducing data traffic across the memory subsystem layers. However, dealing with large data sets (i.e., covariance matrices of billions in size) can rapidly become prohibitive in memory footprint and algorithmic complexity. Most importantly, this thesis investigates the tile low-rank data format (TLR), a new compressed data structure and layout, which is valuable in exploiting data sparsity by approximating the operator. The TLR compressed data structure allows approximating the original problem up to user-defined numerical accuracy. This comes at the expense of dealing with tasks with much lower arithmetic intensities than traditional dense computations. In fact, this thesis consolidates the two trends of dense and data-sparse linear algebra for HPC.
Not only does the thesis leverage recursive formulations for dense Cholesky-based matrix algorithms, but it also implements a novel TLR-Cholesky factorization using batched linear algebra operations to increase hardware occupancy and reduce the overhead of the API. The reported performance of the dense and TLR-Cholesky factorizations shows many-fold speedups against state-of-the-art implementations on various systems equipped with GPUs. Additionally, the TLR implementation gives the user flexibility to select the desired accuracy. This trade-off between performance and accuracy is currently a well-established leading trend in the convergence of the third and fourth paradigms, i.e., HPC and Big Data, when moving forward with the exascale software roadmap.
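As a rough illustration of the tile low-rank idea described in this abstract, one off-diagonal tile of size $b \times b$ can be approximated by a product of two thin factors. The symbols $A_{ij}$, $U_{ij}$, $V_{ij}$, $k_{ij}$, $b$ and $\varepsilon$ are notation assumed here, and the relative-error form of the accuracy criterion is an assumption rather than the thesis's stated definition:

$$
A_{ij} \approx U_{ij} V_{ij}^{T}, \qquad U_{ij}, V_{ij} \in \mathbb{R}^{b \times k_{ij}}, \qquad \| A_{ij} - U_{ij} V_{ij}^{T} \|_F \le \varepsilon \, \| A_{ij} \|_F .
$$

Under this sketch a compressed tile stores $2 b k_{ij}$ entries instead of $b^2$, so the saving depends on how small the rank $k_{ij}$ is for the user-chosen accuracy $\varepsilon$.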
• A Study of Recurrent and Convolutional Neural Networks in the Native Language Identification Task

(2018-05-24) [Thesis]
Committee members: Moshkov, Mikhail; Gao, Xin
Native Language Identification (NLI) is the task of predicting the native language of an author from their text written in a second language. The idea is to find writing habits that transfer from an author’s native language to their second language. Many approaches to this task have been studied, from simple word frequency analysis to analyzing grammatical and spelling mistakes to find patterns and traits that are common between different authors of the same native language. This can be a very complex task, depending on the native language and the author’s proficiency in the second language. The most common approach that has seen very good results is based on the usage of n-gram features of words and characters. In this thesis, we attempt to extract lexical, grammatical, and semantic features from the sentences of non-native English essays using neural networks. The training and testing data was obtained from a large corpus of publicly available essays written by authors from several countries around the world. The neural network models consisted of Long Short-Term Memory and Convolutional networks using the sentences of each document as the input. Additional statistical features were generated from the text to complement the predictions of the neural networks, which were then used as feature inputs to a Support Vector Machine, making the final prediction. Results show that a Long Short-Term Memory neural network can improve performance over a naive bag-of-words approach, but with a much smaller feature set. With more fine-tuning of neural network hyperparameters, these results will likely improve significantly.
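The n-gram features mentioned in this abstract can be sketched as follows. This is a minimal illustration of counting character and word n-grams, not the thesis's feature pipeline; the function names `char_ngrams` and `word_ngrams`, the lowercasing, and the whitespace tokenization are all assumptions made for the example.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (hypothetical helper)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n=2):
    """Count overlapping word n-grams using naive whitespace tokenization."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


if __name__ == "__main__":
    sentence = "I am agree with you"
    print(char_ngrams(sentence).most_common(3))
    print(word_ngrams(sentence))
```

Counts like these would typically be turned into a fixed-length vector per document before being passed to a classifier.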
• In silico exploration of Red Sea Bacillus genomes for natural product biosynthetic gene clusters

(BMC Genomics, Springer Nature, 2018-05-22) [Article]
Background: The increasing spectrum of multidrug-resistant bacteria is a major global public health concern, necessitating discovery of novel antimicrobial agents. Here, members of the genus Bacillus are investigated as a potentially attractive source of novel antibiotics due to their broad spectrum of antimicrobial activities. We specifically focus on a computational analysis of the distinctive biosynthetic potential of Bacillus paralicheniformis strains isolated from the Red Sea, an ecosystem exposed to adverse, highly saline and hot conditions. Results: We report the complete circular and annotated genomes of two Red Sea strains, B. paralicheniformis Bac48 isolated from mangrove mud and B. paralicheniformis Bac84 isolated from microbial mat collected from Rabigh Harbor Lagoon in Saudi Arabia. Comparing the genomes of B. paralicheniformis Bac48 and B. paralicheniformis Bac84 with nine publicly available complete genomes of B. licheniformis and three genomes of B. paralicheniformis revealed that all of the B. paralicheniformis strains in this study are more enriched in nonribosomal peptides (NRPs). We further report the first computationally identified trans-acyltransferase (trans-AT) nonribosomal peptide synthetase/polyketide synthase (PKS/NRPS) cluster in strains of this species. Conclusions: B. paralicheniformis species have more genes associated with biosynthesis of antimicrobial bioactive compounds than other previously characterized species of B. licheniformis, which suggests that these species are better potential sources for novel antibiotics. Moreover, the genome of the Red Sea strain B. paralicheniformis Bac48 is more enriched in modular PKS genes compared to B. licheniformis strains and other B. paralicheniformis strains. This may be linked to adaptations that strains surviving in the Red Sea underwent to survive in the relatively hot and saline ecosystems.
• Neural Inductive Matrix Completion for Predicting Disease-Gene Associations

(2018-05-21) [Thesis]
Committee members: Bajic, Vladimir B.; Hoehndorf, Robert
In silico prioritization of undiscovered associations can help find causal genes of newly discovered diseases. Some existing methods are based on known associations and side information of diseases and genes. We explore the possibility of using a neural network model, Neural Inductive Matrix Completion (NIMC), in disease-gene prediction. Compared to the state-of-the-art inductive matrix completion method, using neural networks allows us to learn latent features from non-linear functions of input features. Previous methods use disease features only from mining text. Compared to text mining, disease ontology is a more informative way of discovering correlation of diseases, from which we can calculate the similarities between diseases and help increase the performance of predicting disease-gene associations. We compare the proposed method with other state-of-the-art methods for predicting associated genes for diseases from the Online Mendelian Inheritance in Man (OMIM) database. Results show that both new features and the proposed NIMC model can improve the chance of recovering an unknown associated gene in the top 100 predicted genes. Best results are obtained by using both the new features and the new model. Results also show the proposed method does better in predicting associated genes for newly discovered diseases.
• GLAM: Glycogen-derived Lactate Absorption Map for visual analysis of dense and sparse surface reconstructions of rodent brain structures on desktop systems and virtual environments

(Computers & Graphics, Elsevier BV, 2018-05-21) [Article]
The human brain contains about one hundred billion neurons, but they cannot work properly without ultrastructural and metabolic support. For this reason, mammalian brains host another type of cells called “glial cells”, whose role is to maintain proper conditions for efficient neuronal function. One type of glial cell, astrocytes, are involved in particular in the metabolic support of neurons, by feeding them with lactate, one byproduct of glucose metabolism that they can take up from blood vessels, and store it under another form, glycogen granules. These energy-storage molecules, whose morphology resembles spheres with a diameter of roughly 10–80 nanometers, can be easily recognized using electron microscopy, the only technique whose resolution is high enough to resolve them. Understanding and quantifying their distribution is of particular relevance for neuroscientists, in order to understand where and when neurons use energy under this form. To answer this question, we developed a visualization technique, dubbed GLAM (Glycogen-derived Lactate Absorption Map), customized for the analysis of the interaction of astrocytic glycogen on surrounding neurites in order to formulate hypotheses on the energy absorption mechanisms. The method integrates high-resolution surface reconstruction of neurites, astrocytes, and the energy sources in the form of glycogen granules from different automated serial electron microscopy methods, like focused ion beam scanning electron microscopy (FIB-SEM) or serial block face electron microscopy (SBEM), together with an absorption map computed as a radiance transfer mechanism. The resulting visual representation provides an immediate and comprehensible illustration of the areas in which the probability of lactate shuttling is higher. The computed dataset can then be explored and quantified in a 3D space, either using 3D modeling software or virtual reality environments.
Domain scientists have evaluated the technique by either using the computed maps for formulating functional hypotheses or for planning sparse reconstructions to avoid excessive occlusion. Furthermore, we conducted a pioneering user study showing that immersive VR setups can ease the investigation of the areas of interest and the analysis of the absorption patterns in the cellular structures.
• Isotropic Surface Remeshing without Large and Small Angles

(IEEE Transactions on Visualization and Computer Graphics, Institute of Electrical and Electronics Engineers (IEEE), 2018-05-18) [Article]
We introduce a novel algorithm for isotropic surface remeshing which progressively eliminates obtuse triangles and improves small angles. The main novelty of the proposed approach is a simple vertex insertion scheme that facilitates the removal of large angles, and a vertex removal operation that improves the distribution of small angles. In combination with other standard local mesh operators, e.g., connectivity optimization and local tangential smoothing, our algorithm is able to remesh efficiently a low-quality mesh surface. Our approach can be applied directly or used as a post-processing step following other remeshing approaches. Our method has a similar computational efficiency to the fastest approach available, i.e., real-time adaptive remeshing [1]. In comparison with state-of-the-art approaches, our method consistently generates better results based on evaluations using different metrics.
• Large-scale Comparative Study of Hi-C-based Chromatin 3D Structure Modeling Methods

(2018-05-17) [Thesis]
Committee members: Hoehndorf, Robert; Fischle, Wolfgang
Chromatin is a complex polymer molecule in eukaryotic cells, primarily consisting of DNA and histones. Many works have shown that the 3D folding of chromatin structure plays an important role in DNA expression. The recently proposed Chromosome Conformation Capture technologies, especially the Hi-C assays, provide us an opportunity to study how the 3D structures of the chromatin are organized. Based on the data from Hi-C experiments, many chromatin 3D structure modeling methods have been proposed. However, there is limited ground truth to validate these methods and no robust chromatin structure alignment algorithms to evaluate the performance of these methods. In our work, we first made a thorough literature review of 25 publicly available population Hi-C-based chromatin 3D structure modeling methods. Furthermore, to evaluate and to compare the performance of these methods, we proposed a novel data simulation method, which combined the population Hi-C data and single-cell Hi-C data without ad hoc parameters. Also, we designed global and local alignment algorithms to measure the similarity between the templates and the chromatin structures predicted by different modeling methods. Finally, the results from large-scale comparative tests indicated that our alignment algorithms significantly outperform the algorithms in the literature.
• Battling Latency in Modern Wireless Networks

(IEEE Access, Institute of Electrical and Electronics Engineers (IEEE), 2018-05-15) [Article]
Buffer sizing has a tremendous effect on the performance of Wi-Fi based networks. Choosing the right buffer size is challenging due to the dynamic nature of the wireless environment. Over-buffering, or ‘bufferbloat’, may produce unacceptable end-to-end delays. On the other hand, small buffers may limit the performance gains that can be obtained with various IEEE 802.11n/ac enhancements, such as frame aggregation. We propose Wireless Queue Management (WQM), a novel, practical, and lightweight queue management scheme for wireless networks. WQM adapts the buffer size based on the wireless link characteristics and the network load. Furthermore, it accounts for aggregate length when deciding on the optimal buffer size. We evaluate WQM using our 10-node wireless testbed. WQM reduces the end-to-end delay by an order of magnitude compared to the default buffer size in Linux while achieving similar network throughput. Also, WQM outperforms state-of-the-art bufferbloat solutions, namely CoDel and PIE. WQM achieves 7× less latency compared to PIE, and 2× compared to CoDel, at the cost of an 8% drop in goodput in the worst case. Further, WQM improves network fairness as it limits the ability of a single flow to saturate the buffers.
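To make the trade-off in this abstract concrete, the sketch below sizes a buffer so that it drains within a target queuing delay at the current link rate, while never dropping below one full aggregate. This is an illustrative rule of thumb, not the WQM algorithm; the function name `buffer_size_packets`, its parameters, and the default packet size are assumptions introduced for the example.

```python
def buffer_size_packets(link_rate_bps, target_delay_ms,
                        packet_bytes=1500, aggregate_packets=1):
    """Hypothetical sizing rule, not the WQM scheme itself.

    Returns the number of packets the link can drain within
    target_delay_ms, but at least one full frame aggregate so that
    aggregation is not starved by a too-small buffer.
    """
    bits_per_packet = packet_bytes * 8
    # Integer arithmetic: bits drained in the delay budget / bits per packet.
    drainable = (link_rate_bps * target_delay_ms) // (1000 * bits_per_packet)
    return max(drainable, aggregate_packets)


if __name__ == "__main__":
    # Fast link: the delay budget dominates.
    print(buffer_size_packets(54_000_000, 10))
    # Slow link with large aggregates: the aggregate length dominates.
    print(buffer_size_packets(6_000_000, 10, aggregate_packets=32))
```

A rule of this shape shrinks the buffer when the link slows down, which bounds queuing delay, and keeps room for a whole aggregate, which preserves the throughput gains of frame aggregation.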
• Enhancing Network Data Obliviousness in Trusted Execution Environment-based Stream Processing Systems

(2018-05-15) [Thesis]
Committee members: Kalnis, Panos; Keyes, David E.
Cloud computing usage is increasing and a common concern is the privacy and security of the data and computation. Third party cloud environments are not considered fit for processing private information because the data will be revealed to the cloud provider. However, Trusted Execution Environments (TEEs), such as Intel SGX, provide a way for applications to run privately and securely on untrusted platforms. Nonetheless, using a TEE by itself for stream processing systems is not sufficient since network communication patterns may leak properties of the data under processing. This work addresses leaky topology structures and suggests mitigation techniques for each of these. We create specific metrics to evaluate leaks occurring from the network patterns; the metrics measure information leaked when the stream processing system is running. We consider routing techniques for inter-stage communication in a streaming application to mitigate this data leakage. We consider a dynamic policy to change the mitigation technique depending on how much information is currently leaking. Additionally, we consider techniques to hide irregularities resulting from a filtering stage in a topology. We also consider leakages resulting from applications containing cycles. For each of the techniques, we explore their effectiveness in terms of the advantage they provide in overcoming the network leakage. The techniques are tested partly using simulations and some were implemented in a prototype SGX-based stream processing system.
• Ontology Design Patterns for Combining Pathology and Anatomy: Application to Study Aging and Longevity in Inbred Mouse Strains

(2018-05-13) [Thesis]
Committee members: Gao, Xin; Bajic, Vladimir B.
In biomedical research, ontologies are widely used to represent knowledge as well as to annotate datasets. Many of the existing ontologies cover a single type of phenomena, such as a process, cell type, gene, pathological entity or anatomical structure. Consequently, there is a requirement to use multiple ontologies to fully characterize the observations in the datasets. Although this allows precise annotation of different aspects of a given dataset, it limits our ability to use the ontologies in data analysis, as the ontologies are usually disconnected and their combinations cannot be exploited. Motivated by this, here we present novel ontology design methods for combining pathology and anatomy concepts. To this end, we use a dataset of mouse models which has been characterized through two ontologies: one of them is the mouse pathology ontology (MPATH) covering pathological lesions while the other is the mouse anatomy ontology (MA) covering the anatomical site of the lesions. We propose four novel ontology design patterns for combining these ontologies, and use these patterns to generate four ontologies in a data-driven way. To evaluate the generated ontologies, we utilize these in ontology-based data analysis, including ontology enrichment analysis and computation of semantic similarity. We demonstrate that there are significant differences between the four ontologies in different analysis approaches. In addition, when using semantic similarity to confirm the hypothesis that genetically identical mice should develop more similar diseases, the generated combined ontologies lead to significantly better analysis results compared to using each ontology individually. Our results reveal that using ontology design patterns to combine different facets characterizing a dataset can improve established analysis methods.
• DroidEnsemble: Detecting Android Malicious Applications with Ensemble of String and Structural Static Features

(IEEE Access, Institute of Electrical and Electronics Engineers (IEEE), 2018-05-11) [Article]
The Android platform dominates the operating system market for mobile devices. However, the dramatic increase of Android malicious applications (malapps) has caused serious software failures in the Android system and posed a great threat to users. The effective detection of Android malapps has thus become an emerging yet crucial issue. Characterizing the behaviors of Android applications (apps) is essential to detecting malapps. Most existing work on detecting Android malapps is based on string static features, such as permissions and API usage extracted from apps. There also exists work on the detection of Android malapps with structural features, such as Control Flow Graph (CFG) and Data Flow Graph (DFG). As Android malapps have become increasingly polymorphic and sophisticated, using only one type of static feature may result in false negatives. In this work, we propose DroidEnsemble, which takes advantage of both string features and structural features to systematically and comprehensively characterize the static behaviors of Android apps and thus build a more accurate model for the detection of Android malapps. We extract each app’s string features, including permissions, hardware features, filter intents, restricted API calls, used permissions, and code patterns, as well as structural features like the function call graph. We then use three machine learning algorithms, namely, Support Vector Machine (SVM), k-Nearest Neighbor (kNN) and Random Forest (RF), to evaluate the performance of these two types of features and of their ensemble. In the experiments, we evaluate our methods and models with 1386 benign apps and 1296 malapps. Extensive experimental results demonstrate the effectiveness of DroidEnsemble. It achieves a detection accuracy of 95.8% with only string features and 90.68% with only structural features.
DroidEnsemble reaches a detection accuracy of 98.4% with the ensemble of both types of features, reducing 9 false positives and 12 false negatives compared to the results with only string features.
• SupportNet: a novel incremental learning framework through deep learning and support data

(Cold Spring Harbor Laboratory, 2018-05-08) [Working Paper]
Motivation: In most biological data sets, the amount of data is regularly growing and the number of classes is continuously increasing. To deal with the new data from the new classes, one approach is to train a classification model, e.g., a deep learning model, from scratch based on both old and new data. This approach is highly computationally costly and the extracted features are likely very different from the ones extracted by the model trained on the old data alone, which leads to poor model robustness. Another approach is to fine tune the trained model from the old data on the new data. However, this approach often does not have the ability to learn new knowledge without forgetting the previously learned knowledge, which is known as the catastrophic forgetting problem. To our knowledge, this problem has not been studied in the field of bioinformatics despite its existence in many bioinformatic problems. Results: Here we propose a novel method, SupportNet, to solve the catastrophic forgetting problem efficiently and effectively. SupportNet combines the strength of deep learning and support vector machine (SVM), where SVM is used to identify the support data from the old data, which are fed to the deep learning model together with the new data for further training so that the model can review the essential information of the old data when learning the new information. Two powerful consolidation regularizers are applied to ensure the robustness of the learned model. Comprehensive experiments on various tasks, including enzyme function prediction, subcellular structure classification and breast tumor classification, show that SupportNet drastically outperforms the state-of-the-art incremental learning methods and reaches similar performance as the deep learning model trained from scratch on both old and new data. Availability: Our program is accessible at: https://github.com/lykaust15/SupportNet.
• Use of unmanned aerial vehicles for efficient beach litter monitoring

(Marine Pollution Bulletin, Elsevier BV, 2018-05-05) [Article]
A global beach litter assessment is challenged by use of low-efficiency methodologies and incomparable protocols that impede data integration and acquisition at a national scale. The implementation of an objective, reproducible and efficient approach is therefore required. Here we show the application of a remote sensing based methodology using a test beach located on the Saudi Arabian Red Sea coastline. Litter was recorded via image acquisition from an Unmanned Aerial Vehicle, while an automatic processing of the high volume of imagery was developed through machine learning, employed for debris detection and classification in three categories. Application of the method resulted in an almost 40 times faster beach coverage when compared to a standard visual-census approach. While the machine learning tool faced some challenges in correctly detecting objects of interest, first classification results are promising and motivate efforts to further develop the technique and implement it at much larger scales.
• Discriminative Transfer Learning for General Image Restoration

(IEEE Transactions on Image Processing, Institute of Electrical and Electronics Engineers (IEEE), 2018-04-30) [Article]
Recently, several discriminative learning approaches have been proposed for effective image restoration, achieving convincing trade-off between image quality and computational efficiency. However, these methods require separate training for each restoration task (e.g., denoising, deblurring, demosaicing) and problem condition (e.g., noise level of input images). This makes it time-consuming and difficult to encompass all tasks and conditions during training. In this paper, we propose a discriminative transfer learning method that incorporates formal proximal optimization and discriminative learning for general image restoration. The method requires a single-pass discriminative training and allows for reuse across various problems and conditions while achieving an efficiency comparable to previous discriminative approaches. Furthermore, after being trained, our model can be easily transferred to new likelihood terms to solve untrained tasks, or be combined with existing priors to further improve image restoration quality.
• RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning

(PLOS Computational Biology, Public Library of Science (PLoS), 2018-04-26) [Article]
Anonymized electronic medical records are an increasingly popular source of research data. However, these datasets often lack race and ethnicity information. This creates problems for researchers modeling human disease, as race and ethnicity are powerful confounders for many health exposures and treatment outcomes; race and ethnicity are closely linked to population-specific genetic variation. We showed that deep neural networks generate more accurate estimates for missing racial and ethnic information than competing methods (e.g., logistic regression, random forest, support vector machines, and gradient-boosted decision trees). RIDDLE yielded significantly better classification performance across all metrics that were considered: accuracy, cross-entropy loss (error), precision, recall, and area under the curve for receiver operating characteristic plots (all $p < 10^{-9}$). We made specific efforts to interpret the trained neural network models to identify, quantify, and visualize medical features which are predictive of race and ethnicity. We used these characterizations of informative features to perform a systematic comparison of differential disease patterns by race and ethnicity. The fact that clinical histories are informative for imputing race and ethnicity could reflect (1) a skewed distribution of blue- and white-collar professions across racial and ethnic groups, (2) uneven accessibility and subjective importance of prophylactic health, (3) possible variation in lifestyle, such as dietary habits, and (4) differences in background genetic variation which predispose to diseases.
• Transcriptional landscape of Mycobacterium tuberculosis infection in macrophages

(Scientific Reports, Springer Nature, 2018-04-24) [Article]
Mycobacterium tuberculosis (Mtb) infection reveals complex and dynamic host-pathogen interactions, leading to host protection or pathogenesis. Using a unique transcriptome technology (CAGE), we investigated the promoter-based transcriptional landscape of IFNγ (M1) or IL-4/IL-13 (M2) stimulated macrophages during Mtb infection in a time-kinetic manner. Mtb infection widely and drastically altered macrophage-specific gene expression, which is far larger than that of M1 or M2 activations. Gene Ontology enrichment analysis for Mtb-induced differentially expressed genes revealed various terms, related to host-protection and inflammation, enriched in up-regulated genes. On the other hand, terms related to dis-regulation of cellular functions were enriched in down-regulated genes. Differential expression analysis revealed known as well as novel transcription factor genes in Mtb infection, many of them significantly down-regulated. IFNγ or IL-4/IL-13 pre-stimulation induce additional differentially expressed genes in Mtb-infected macrophages. Cluster analysis uncovered significant numbers, prolonging their expressional changes. Furthermore, Mtb infection augmented cytokine-mediated M1 and M2 pre-activations. In addition, we identified unique transcriptional features of Mtb-mediated differentially expressed lncRNAs. In summary we provide a comprehensive in depth gene expression/regulation profile in Mtb-infected macrophages, an important step forward for a better understanding of host-pathogen interaction dynamics in Mtb infection.
• 665 Nail lesions in 30 old inbred mouse strains

(Journal of Investigative Dermatology, Elsevier BV, 2018-04-19) [Poster]
• Efficient Temporal Action Localization in Videos

(2018-04-17) [Thesis]
We primarily study a special weighted low-rank approximation of matrices and then apply it to solve the background modeling problem. We propose two algorithms for this purpose: one operates in the batch mode on the entire data, and the other operates in the batch-incremental mode on the data, naturally captures more background variations, and is computationally more efficient. Moreover, we propose a robust technique that learns the background frame indices from the data and does not require any training frames. We demonstrate through extensive experiments that by inserting a simple weight in the Frobenius norm, it can be made robust to outliers, similar to the $\ell_1$ norm. Our methods match or outperform several state-of-the-art online and batch background modeling methods in virtually all quantitative and qualitative measures.
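For orientation, a weighted low-rank approximation problem is commonly written in the following general form. The symbols $A$ (the data matrix, for instance with video frames as columns), $W$ (a matrix of nonnegative weights), $X$, $r$, and the entrywise product $\odot$ are notation assumed here; the abstract concerns a special case whose exact weighting scheme is not stated, so this is a generic sketch rather than the author's formulation:

$$
\min_{X} \; \| (A - X) \odot W \|_F^2 \quad \text{subject to} \quad \operatorname{rank}(X) \le r .
$$

When every entry of $W$ equals one, this reduces to the ordinary low-rank approximation problem, which is solved by the truncated singular value decomposition; nonuniform weights let some entries, such as those believed to belong to the background, count more than others.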