Recent Submissions

  • kaust-library/DoiMinter: This release includes a license and metadata for authors

    (Zenodo, 2020-02-26) [Software]
    This code tracks a DSpace repository via OAI-PMH and mints DOIs for items if requested.
  • Treating coral bleaching as weather: a framework to validate and optimize prediction skill

    De Carlo, Thomas Mario (Code Ocean, 2020) [Software]
    Data and code to analyze the statistical skill of capturing observed bleaching events globally using sea surface temperatures. The analysis here quantifies the hits, misses, false alarms, and correct negatives, and then applies these results to a variety of weather-forecasting metrics to quantify the skill of predicting bleaching. Finally, adjustments to sea surface temperature-based heat stress metrics are evaluated to test if prediction skill can be improved above typical approaches.
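    The four outcome counts named above are the cells of a standard 2x2 forecast contingency table. The sketch below is only an illustration, not the author's code: the function names and sample data are hypothetical, and the scores shown (probability of detection, false alarm ratio, critical success index, accuracy) are common weather-forecasting metrics that may or may not match the ones used in the deposited analysis.

```python
def contingency_counts(predicted, observed):
    """Tally the four outcomes of a yes/no forecast against observations."""
    hits = misses = false_alarms = correct_negatives = 0
    for was_predicted, was_observed in zip(predicted, observed):
        if was_predicted and was_observed:
            hits += 1
        elif was_observed:
            misses += 1
        elif was_predicted:
            false_alarms += 1
        else:
            correct_negatives += 1
    return hits, misses, false_alarms, correct_negatives


def skill_scores(hits, misses, false_alarms, correct_negatives):
    """Return a few standard verification scores (None when undefined)."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    total = hits + misses + false_alarms + correct_negatives
    return {
        "probability_of_detection": ratio(hits, hits + misses),
        "false_alarm_ratio": ratio(false_alarms, hits + false_alarms),
        "critical_success_index": ratio(hits, hits + misses + false_alarms),
        "accuracy": ratio(hits + correct_negatives, total),
    }


# Hypothetical data: 1 = bleaching predicted / observed, 0 = not.
predicted = [1, 1, 0, 0, 1, 0]
observed = [1, 0, 0, 1, 1, 0]
counts = contingency_counts(predicted, observed)
scores = skill_scores(*counts)
```

    Comparing these scores before and after adjusting a heat stress threshold is one way to check whether prediction skill has improved.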
  • Hierarchical and View-invariant Light Field Segmentation by Maximizing Entropy Rate on 4D Ray Graphs (Supplement)

    Li, Rui; Heidrich, Wolfgang (2019-09-01) [Software]
    Supplementary material for the paper "Rui Li, Wolfgang Heidrich, Hierarchical and View-invariant Light Field Segmentation by Maximizing Entropy Rate on 4D Ray Graphs. In SIGGRAPH Asia, 2019".
  • GenRot - Software to generate rotamers of molecules from Gaussian input files

    Thulin, Michael; Munkerup, Kristin; Huang, Kuo-Wei (2019-06-13) [Software]
    The GenRot program takes a typical Gaussian .gjf file (connectivity written after XYZ) as input and generates new XYZ data based on input from the user. After opening the file in the program, the user clicks on four atoms to define a dihedral angle and sets the rotation step, in degrees, for that angle. This is repeated until all the dihedral angles that the user wants to rotate have been defined. The program rotates each dihedral angle 360/n times, where n is the rotation step in degrees entered by the user, and rounds down when the quotient is not an integer: if n is, for example, 33, then 360/33 = 10.9, which is rounded down to 10. If the user wants to rotate four dihedral angles by 120 degrees each, the program generates (360/120)^4 = 81 rotamers. The user can also change the route section of the Gaussian input file, in case there was an error, and append text to the end of the file. When all the dihedral angles have been chosen and the desired rotations defined, the user clicks “Generate files”, and all files are written to the folder specified by the user. Although the program reads only Gaussian-type input files, it can in principle generate files with any information before and after the XYZ data, and may be suited to generating input files for other quantum calculation software.
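    The rotamer count described above can be checked with a few lines of Python. This is a sketch of the arithmetic only, not GenRot's own code: the function names are hypothetical, and the choice of offsets 0, n, 2n, ... for each dihedral is an assumption about how the rotations are enumerated.

```python
from itertools import product


def rotation_steps(step_degrees):
    """Number of rotations for one dihedral: 360/n, rounded down."""
    if step_degrees <= 0:
        raise ValueError("rotation step must be positive")
    return int(360 // step_degrees)


def rotamer_offsets(steps_in_degrees):
    """Every combination of rotation offsets, one tuple per rotamer.

    Assumes each dihedral takes the offsets 0, n, 2n, ... (hypothetical).
    """
    per_dihedral = [
        [k * step for k in range(rotation_steps(step))]
        for step in steps_in_degrees
    ]
    return list(product(*per_dihedral))


# Four dihedral angles, each rotated in 120-degree steps.
offsets = rotamer_offsets([120, 120, 120, 120])
```

    Because the count is a product over all dihedrals, it grows quickly: a fifth dihedral at 120 degrees would triple the number of generated files.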
  • Code for: PolyA Prediction using Logistic Regression Model (LRM) and Deep Neural Networks (DNN)

    Albalawi, Fahad; Chahid, Abderrazak; Guo, Xingang; Albaradei, Somayah; Magana-Mora, Arturo; Jankovic, Boris R.; Uludag, Mahmut; Van Neste, Christophe Marc; Essack, Magbubah; Laleg-Kirati, Taous-Meriem; Bajic, Vladimir B. (GitHub, 2018-11-15) [Software]
    PolyA_Predicion_LRM_DNN is a novel method for predicting poly(A) signals (PAS) in human genomic DNA. It first uses signal processing transforms (Fourier-based and wavelet-based), statistics, and position weight matrices (PWM) to generate sets of features that help poly(A) prediction perform better because of the different aspects these features capture. It then uses deep neural networks (DNN) and a logistic regression model (LRM) to distinguish true PAS from pseudo PAS efficiently. This repository contains the scripts used to generate three sets of features (signal processing-based, statistics-based, and PWM-based), which are then used to train and test the DNN and LRM models.
  • DeepGSR: An optimized deep-learning structure for the recognition of genomic signals and regions

    Kalkatawi, Manal M.; Magana-Mora, Arturo; Jankovic, Boris; Bajic, Vladimir B. (Zenodo, 2017-12-16) [Software]
    Recognition of different genomic signals and regions (GSRs) in the DNA is helpful in gaining knowledge to understand genome organization and gene regulation as well as gene function. Accurate recognition of GSRs enables better genome and gene annotation. Although many methods have been developed to recognize GSRs, their purely computational identification remains challenging. Moreover, various GSRs usually require a specialized set of features for developing robust recognition models. Recently, deep learning (DL) methods have been shown to generate more accurate prediction models than the ‘shallow’ methods without the need to develop specialized features for the problems in question. Here, we explore the potential use of DL for the recognition of GSRs. We developed DeepGSR, an optimized DL architecture for the prediction of different types of GSRs. The performance of the DeepGSR structure is evaluated on the recognition of polyadenylation signals (PAS) and translation initiation sites (TIS) of different organisms: human, mouse, bovine and fruit fly. The results show that DeepGSR outperformed the state-of-the-art methods, reducing the classification error rate of the PAS and TIS prediction in the human genome by up to 29% and 86%, respectively. Moreover, the cross-organism and genome-wide analyses we performed confirmed the robustness of DeepGSR and provided new insights into the conservation of the examined GSRs across species.
  • Code for "DDR: a method to predict drug target interactions using multiple similarities"

    Olayan, Rawan S.; Ashoor, Haitham; Bajic, Vladimir B. (Bitbucket, 2017-09-09) [Software]
    Motivation: Computationally finding drug-target interactions (DTIs) is a convenient strategy to identify new DTIs at low cost with reasonable accuracy. However, current DTI prediction methods suffer from a high false-positive prediction rate.
    Results: We developed DDR, a novel method that improves DTI prediction accuracy. DDR is based on a heterogeneous graph that contains known DTIs with multiple similarities between drugs and multiple similarities between target proteins. DDR applies a non-linear similarity fusion method to combine different similarities. Before fusion, DDR performs a pre-processing step in which a subset of similarities is selected heuristically to obtain an optimized combination of similarities. DDR then applies a random forest model using different graph-based features extracted from the DTI heterogeneous graph. Using 5 repeats of 10-fold cross-validation, three testing setups, and the weighted average of area under the precision-recall curve (AUPR) scores, we show that DDR significantly reduces the AUPR score error relative to the next best state-of-the-art method for predicting DTIs: by 31% when the drugs are new, by 23% when the targets are new, and by 34% when the drugs and the targets are known but not all DTIs between them are known. Using independent sources of evidence, we verify as correct 22 out of the top 25 novel DDR predictions. This suggests that DDR can be used as an efficient method to identify correct DTIs.
    Availability and implementation: The data and code are provided at
    Dependencies: Python 2.7, numpy, scikit-learn
    Input format and files: DDR expects all network files to be in the form of an adjacency list file. For relation files, DDR expects a tuple of drug and target on each line. For similarity files, DDR expects a tuple of drug (target) and drug (target) and their similarity.
    Usage:
        usage: [-h] --interaction R_FILE --DSimilarity D_SIM_FILE --TSimilarity T_SIM_FILE --outfile OUT_FILE [--no_of_splits NO_OF_SPLITS] [--K K] [--K_SNF K_SNF] [--T_SNF T_SNF] [--N NO_OF_TREES] [--s SPLIT]
    Optional arguments:
        -h, --help                    show this help message and exit
        --no_of_splits NO_OF_SPLITS   Number of parts to split unknown interactions. Default: 10
        --K K                         Number of nearest neighbors for drugs and targets neighborhood. Default: 5
        --K_SNF K_SNF                 Number of neighbors for similarity fusion. Default: 3
        --T_SNF T_SNF                 Number of iterations for similarity fusion. Default: 10
        --N NO_OF_TREES               Number of trees for the random forest. Default: 100
        --s SPLIT                     Split criteria for random forest trees. Default: gini
    Required named arguments:
        --interaction R_FILE          Name of the file containing drug-target interaction tuples
        --DSimilarity D_SIM_FILE      Name of the file containing drug similarity file names
        --TSimilarity T_SIM_FILE      Name of the file containing target similarity file names
        --outfile OUT_FILE            Output file to write predictions
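    The input formats described for DDR (one drug-target tuple per line for relation files; two entities and a similarity value per line for similarity files) could be read with a sketch like the one below. It is not taken from the DDR code base: the function names and sample identifiers are hypothetical, and whitespace-separated fields are an assumption, since the delimiter is not stated above.

```python
def parse_interactions(lines):
    """Read (drug, target) tuples, one per line.

    Assumes whitespace-separated fields (not confirmed by the source).
    """
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        pairs.append((fields[0], fields[1]))
    return pairs


def parse_similarities(lines):
    """Read (entity, entity, similarity) tuples into a dictionary."""
    similarities = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        similarities[(fields[0], fields[1])] = float(fields[2])
    return similarities


# Hypothetical identifiers, for illustration only.
interactions = parse_interactions(["D1 T1", "D2 T3", ""])
drug_similarity = parse_similarities(["D1 D2 0.75", "D1 D1 1.0"])
```

    When reading from disk, an open file handle can be passed directly in place of the list of lines.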
  • Code for: Semantic prioritization of novel causative genomic variants

    Rozaimi B. Mahamad, Razali; Kulmanov, Maxat; Hashish, Yasmeen; Bajic, Vladimir B.; Goncalves-Serra, Eva; Schoenmakers, Nadia; Gkoutos, Georgios V; Schofield, Paul N.; Hoehndorf, Robert; Boudellioua, Imane (GitHub, 2017-04-17) [Software]
    Abstract: Discriminating the causative disease variant(s) for individuals with inherited or de novo mutations presents one of the main challenges faced by the clinical genetics community today. Computational approaches for variant prioritization include machine learning methods utilizing a large number of features, including molecular information, interaction networks, or phenotypes. Here, we demonstrate the PhenomeNET Variant Predictor (PVP) system that exploits semantic technologies and automated reasoning over genotype-phenotype relations to filter and prioritize variants in whole exome and whole genome sequencing datasets. We demonstrate the performance of PVP in identifying causative variants on a large number of synthetic whole-exome and whole-genome sequences, covering a wide range of diseases and syndromes. In a retrospective study, we further illustrate the application of PVP for the interpretation of whole-exome sequencing data in patients suffering from congenital hypothyroidism. We find that PVP accurately identifies causative variants in whole exome and whole genome sequencing datasets and provides a powerful resource for the discovery of causal variants.

    PhenomeNet Variant Predictor (PVP) - User Guide
    A phenotype-based tool to annotate and prioritize disease variants in WES and WGS data.
    This user guide has been tested on Ubuntu version 16.04. For details regarding model training and evaluation, please refer to the dev/ directory above.

    Hardware requirements
      - At least 32 GB RAM.
      - At least 1 TB free disk space to process and accommodate the necessary databases for annotation.

    Software requirements (for native installation)
      - Any Unix-based operating system
      - Java 8
      - Python 2.7 (as the system default version), with the dependencies (for Python 2.7) installed with: pip install -r requirements.txt
      - Run the script (available above) with Python 2 to test the installation of the Python dependencies. If the script fails, please try again to install the required dependencies (using "pip2" instead of "pip", checking for permissions, or trying the docker image instead).

    Native Installation
      1. Download the distribution file.
      2. Download the data files.
      3. Extract the distribution files.
      4. Extract the data files data.tar.gz inside the directory phenomenet-vp-2.1.
      5. cd phenomenet-vp-2.1
      6. Run the command bin/phenomenet-vp to display help and parameters.

    Database requirements
      1. Download the CADD database file.
      2. Download and run the script (requires TABIX).
      3. Copy the generated files cadd.txt.gz and cadd.txt.gz.tbi to the directory phenomenet-vp-1.0/data/db.
      4. Download the DANN database file and its indexed file to the directory phenomenet-vp-1.0/data/db.
      5. Rename the DANN files as dann.txt.gz and dann.txt.gz.tbi respectively.

    Docker Container
      1. Install Docker.
      2. Download the data files and database requirements.
      3. Build the phenomenet-vp docker image: docker build -t phenomenet-vp .
      4. Run phenomenet: docker run -v $(pwd)/data:/data phenomenet-vp -f data/Miller.vcf -o OMIM:263750

    Parameters
      --file, -f          Path to VCF file
      --outfile, -of      Path to results file
      --inh, -i           Mode of inheritance. Default: unknown
      --json, -j          Path to PhenoTips JSON file containing phenotypes
      --omim, -o          OMIM ID
      --phenotypes, -p    List of phenotype ids separated by commas
      --human, -h         Propagate human disease phenotypes to genes only. Default: false
      --sp, -s            Propagate mouse and fish disease phenotypes to genes only. Default: false
      --digenic, -d       Rank digenic combinations. Default: false
      --trigenic, -t      Rank trigenic combinations. Default: false
      --combination, -c   Maximum number of variant combinations to prioritize (for digenic and trigenic cases only). Default: 1000
      --ngenes, -n        Number of genes in oligogenic combinations (more than three). Default: 4
      --oligogenic, -og   Rank oligogenic combinations. Default: false
      --python, -y        Path to Python executable (ex. /usr/bin/python). Default: python

    Usage
    To run the tool, the user needs to provide a VCF file along with either an OMIM ID of the disease or a list of phenotypes (HPO or MPO terms).
      a) Prioritize disease-causing variants using an OMIM ID:
         bin/phenomenet-vp -f data/Miller.vcf -o OMIM:263750
      b) Prioritize digenic disease-causing variants using an OMIM ID and gene-to-phenotype data from human studies only:
         bin/phenomenet-vp -f data/Miller.vcf -o OMIM:263750 --human --digenic
      c) Prioritize disease-causing variants using a set of phenotypes and recessive inheritance mode:
         bin/phenomenet-vp -f data/Miller.vcf -p HP:0000007,HP:0000028,HP:0000054,HP:0000077,HP:0000175 -i recessive
    The result file is written to the directory containing the input file. The output file has the same name as the input file with a .res extension. For digenic, trigenic or oligogenic prioritization, the result file has a .digenic, .trigenic, or .oligogenic extension respectively.

    Analysis of Rare Variants
    To analyze rare variants effectively, it is strongly recommended to filter the input VCF files by MAF prior to running phenomenet-vp on them. To do so, follow the instructions below:
      a) Install VCFtools.
      b) Run the following command using VCFtools on your input VCF file to filter out variants with MAF > 1%:
         vcftools --vcf input_file.vcf --recode --max-maf 0.01 --out filtered
      c) Run PVP on the output file filtered.recode.vcf generated by the command above.

    PVP 1.0
    The original random-forest-based PVP tool is available to download here along with its required data files here. The prepared set of exomes and genomes used for the analysis and results are provided here.

    DeepPVP
    The updated neural-network model, DeepPVP, is available to download here along with its required data files here. The prepared set of exomes used for the analysis and comparative results are provided here. The comparison with PVP is based on PVP-1.1, available here along with its required data files here.

    OligoPVP
    OligoPVP is provided as part of the DeepPVP tool using the parameters --digenic, --trigenic, and --oligogenic for ranking candidate disease-causing variant pairs and triples. Our prepared set of synthetic genomes with digenic combinations is available here, using data from the DIgenic diseases DAtabase (DIDA). The comparison results with other methods are also provided. Results were obtained using DeepPVP v2.0.

    People
    PVP is jointly developed by researchers at the University of Birmingham (Prof George Gkoutos and his team), University of Cambridge (Dr Paul Schofield and his team), and King Abdullah University of Science and Technology (Prof Vladimir Bajic, Robert Hoehndorf, and teams).

    Publications
      [1] Boudellioua I, Mahamad Razali RB, Kulmanov M, Hashish Y, Bajic VB, Goncalves-Serra E, Schoenmakers N, Gkoutos GV, Schofield PN, and Hoehndorf R. (2017) Semantic prioritization of novel causative genomic variants. PLOS Computational Biology.
      [2] Boudellioua I, Kulmanov M, Schofield PN, Gkoutos GV, and Hoehndorf R. (2018) OligoPVP: Phenotype-driven analysis of individual genomic information to prioritize oligogenic disease variants. Scientific Reports.
      [3] Boudellioua I, Kulmanov M, Schofield PN, Gkoutos GV, and Hoehndorf R. (2019) DeepPVP: phenotype-based prioritization of causative variants using deep learning. BMC Bioinformatics.

    License
    Copyright (c) 2016-2018, King Abdullah University of Science and Technology. All rights reserved.
    Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
      1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
      2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
      3. All advertising materials mentioning features or use of this software must display the following acknowledgment: This product includes software developed by the King Abdullah University of Science and Technology.
      4. Neither the name of the King Abdullah University of Science and Technology nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
    THIS SOFTWARE IS PROVIDED BY KING ABDULLAH UNIVERSITY OF SCIENCE AND TECHNOLOGY ''AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL KING ABDULLAH UNIVERSITY OF SCIENCE AND TECHNOLOGY BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
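    For scripted runs, the PVP command lines shown in the user guide can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of the PVP distribution; it uses only flags listed in the guide (-f, -o, -p, -i, --human, --digenic) and enforces the stated requirement that a VCF file be accompanied by an OMIM ID or a list of phenotypes.

```python
def build_pvp_command(vcf_path, omim_id=None, phenotypes=None,
                      inheritance=None, human_only=False, digenic=False,
                      executable="bin/phenomenet-vp"):
    """Assemble an argument list for phenomenet-vp (hypothetical helper)."""
    if not omim_id and not phenotypes:
        raise ValueError("provide an OMIM ID or a list of phenotype IDs")
    command = [executable, "-f", vcf_path]
    if omim_id:
        command += ["-o", omim_id]
    if phenotypes:
        # The guide specifies phenotype ids separated by commas.
        command += ["-p", ",".join(phenotypes)]
    if inheritance:
        command += ["-i", inheritance]
    if human_only:
        command.append("--human")
    if digenic:
        command.append("--digenic")
    return command


# Mirrors usage examples (a) and (c) from the guide.
by_omim = build_pvp_command("data/Miller.vcf", omim_id="OMIM:263750")
by_phenotypes = build_pvp_command(
    "data/Miller.vcf",
    phenotypes=["HP:0000007", "HP:0000028"],
    inheritance="recessive",
)
```

    The resulting list can be passed to subprocess.run(command, check=True) from the installation directory.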
  • Database: TcoF-DB v2: update of the database of human and mouse transcription co-factors and transcription factor interactions

    Schmeier, Sebastian; Alam, Tanvir; Essack, Magbubah; Bajic, Vladimir B. (2017-01-01) [Database]
    Abstract: Transcription factors (TFs) play a pivotal role in transcriptional regulation, making them crucial for cell survival and important biological functions. For the regulation of transcription, interactions of different regulatory proteins known as transcription co-factors (TcoFs) and TFs are essential in forming necessary protein complexes. Although TcoFs themselves do not bind DNA directly, their influence on transcriptional regulation and initiation, although indirect, has been shown to be significant, with the functionality of TFs strongly influenced by the presence of TcoFs. In the TcoF-DB v2 database, we collect information on TcoFs. In this article, we describe updates and improvements implemented in TcoF-DB v2. TcoF-DB v2 provides several new features that enable exploration of the roles of TcoFs. The content of the database has significantly expanded and is enriched with information from Gene Ontology, biological pathways, diseases, and molecular signatures. TcoF-DB v2 now includes many more TFs, has substantially increased the number of human TcoFs to 958, and now includes information on mouse (418 new TcoFs). TcoF-DB v2 enables the exploration of information on TcoFs and allows investigations into their influence on transcriptional regulation in humans and mice. TcoF-DB v2 can be accessed at
  • Code for "DRABAL: novel method to mine large high-throughput screening assays using Bayesian active learning"

    Soufan, Othman; Ba Alawi, Wail; Afeef, Moataz A.; Essack, Magbubah; Kalnis, Panos; Bajic, Vladimir B. (Figshare, 2016-05-07) [Software]
    Background: Mining high-throughput screening (HTS) assays is key for enhancing decisions in the area of drug repositioning and drug discovery. However, many challenges are encountered in the process of developing suitable and accurate methods for extracting useful information from these assays. Virtual screening and the wide variety of databases, methods, and solutions proposed to date have not completely overcome these challenges. This study is based on a multi-label classification (MLC) technique for modeling correlations between several HTS assays, meaning that a single prediction represents a subset of assigned correlated labels instead of one label. Thus, the devised method provides an increased probability of more accurate predictions for compounds that were not tested in particular assays.
    Results: Here we present DRABAL, a novel MLC solution that incorporates structure learning of a Bayesian network as a step to model the dependency between the HTS assays. In this study, DRABAL was used to process more than 1.4 million interactions of over 400,000 compounds and analyze the existing relationships between five large HTS assays from the PubChem BioAssay Database. Compared to different MLC methods, DRABAL significantly improves the F1Score by about 22%, on average. We further illustrated the usefulness and utility of DRABAL by screening FDA-approved drugs and reporting those that have a high probability of interacting with several targets, thus enabling drug-multi-target repositioning. Specifically, DRABAL suggests the Thiabendazole drug as a common activator of the NCP1 and Rab-9A proteins, both of which are designed to identify treatment modalities for the Niemann–Pick type C disease.
    Conclusion: We developed a novel MLC solution based on a Bayesian active learning framework to overcome the challenge of lacking fully labeled training data and to exploit actual dependencies between the HTS assays. The solution is motivated by the need to model dependencies between existing experimental confirmatory HTS assays and improve prediction performance. We have pursued extensive experiments over several HTS assays and have shown the advantages of DRABAL. The datasets and programs can be downloaded from