Code for: Semantic prioritization of novel causative genomic variants
Authors: Mahamad Razali, Rozaimi B.
Bajic, Vladimir B.
Gkoutos, Georgios V.
Schofield, Paul N.
KAUST Department: Applied Mathematics and Computational Science Program
Bio-Ontology Research Group (BORG)
Computational Bioscience Research Center (CBRC)
Computer Science Program
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Permanent link to this record: http://hdl.handle.net/10754/656603
Description

Abstract: Discriminating the causative disease variant(s) for individuals with inherited or de novo mutations presents one of the main challenges faced by the clinical genetics community today. Computational approaches for variant prioritization include machine learning methods utilizing a large number of features, including molecular information, interaction networks, or phenotypes. Here, we demonstrate the PhenomeNET Variant Predictor (PVP) system that exploits semantic technologies and automated reasoning over genotype-phenotype relations to filter and prioritize variants in whole exome and whole genome sequencing datasets. We demonstrate the performance of PVP in identifying causative variants on a large number of synthetic whole-exome and whole-genome sequences, covering a wide range of diseases and syndromes. In a retrospective study, we further illustrate the application of PVP for the interpretation of whole-exome sequencing data in patients suffering from congenital hypothyroidism. We find that PVP accurately identifies causative variants in whole exome and whole genome sequencing datasets and provides a powerful resource for the discovery of causal variants.

PhenomeNET Variant Predictor (PVP) - User Guide

A phenotype-based tool to annotate and prioritize disease variants in WES and WGS data

This user guide has been tested on Ubuntu version 16.04. For details regarding model training and evaluation, please refer to the dev/ directory.

Hardware requirements

- At least 32 GB RAM.
- At least 1 TB of free disk space to process and accommodate the necessary databases for annotation.

Software requirements (for native installation)

- Any Unix-based operating system
- Java 8
- Python 2.7 (as the system default version). Install the Python 2.7 dependencies with:

      pip install -r requirements.txt

Run the script test.py with Python 2 to test the installation of the Python dependencies. If the script fails, try installing the required dependencies again (use "pip2" instead of "pip", check permissions, or use the Docker image instead).

Native Installation

1. Download the distribution file phenomenet-vp-2.1.zip
2. Download the data files phenomenet-vp-2.1-data.zip
3. Extract the distribution files phenomenet-vp-2.1.zip
4. Extract the data files data.tar.gz inside the directory phenomenet-vp-2.1
5. cd phenomenet-vp-2.1
6. Run the command bin/phenomenet-vp to display help and parameters.

Database requirements

1. Download the CADD database file.
2. Download and run the script generate.sh (requires TABIX).
3. Copy the generated files cadd.txt.gz and cadd.txt.gz.tbi to the directory phenomenet-vp-1.0/data/db.
4. Download the DANN database file and its index file to the directory phenomenet-vp-1.0/data/db.
5. Rename the DANN files to dann.txt.gz and dann.txt.gz.tbi respectively.

Docker Container

1. Install Docker
2. Download the data files phenomenet-vp-2.1-data.zip and the database requirements
3. Build the phenomenet-vp Docker image:

       docker build -t phenomenet-vp .

4. Run phenomenet-vp:

       docker run -v $(pwd)/data:/data phenomenet-vp -f data/Miller.vcf -o OMIM:263750

Parameters

    --file, -f            Path to VCF file
    --outfile, -of        Path to results file
    --inh, -i             Mode of inheritance. Default: unknown
    --json, -j            Path to PhenoTips JSON file containing phenotypes
    --omim, -o            OMIM ID
    --phenotypes, -p      List of phenotype IDs separated by commas
    --human, -h           Propagate human disease phenotypes to genes only. Default: false
    --sp, -s              Propagate mouse and fish disease phenotypes to genes only. Default: false
    --digenic, -d         Rank digenic combinations. Default: false
    --trigenic, -t        Rank trigenic combinations. Default: false
    --combination, -c     Maximum number of variant combinations to prioritize (for digenic and trigenic cases only). Default: 1000
    --ngenes, -n          Number of genes in oligogenic combinations (more than three). Default: 4
    --oligogenic, -og     Rank oligogenic combinations. Default: false
    --python, -y          Path to Python executable (e.g. /usr/bin/python). Default: python

Usage

To run the tool, the user needs to provide a VCF file along with either an OMIM ID of the disease or a list of phenotypes (HPO or MPO terms).

a) Prioritize disease-causing variants using an OMIM ID:

    bin/phenomenet-vp -f data/Miller.vcf -o OMIM:263750

b) Prioritize digenic disease-causing variants using an OMIM ID and gene-to-phenotype data from human studies only:

    bin/phenomenet-vp -f data/Miller.vcf -o OMIM:263750 --human --digenic

c) Prioritize disease-causing variants using a set of phenotypes and recessive inheritance mode:

    bin/phenomenet-vp -f data/Miller.vcf -p HP:0000007,HP:0000028,HP:0000054,HP:0000077,HP:0000175 -i recessive

The result file is written to the directory containing the input file. It has the same name as the input file with the .res extension. For digenic, trigenic, or oligogenic prioritization, the result file has the .digenic, .trigenic, or .oligogenic extension respectively.

Analysis of Rare Variants

To analyze rare variants effectively, it is strongly recommended to filter the input VCF file by minor allele frequency (MAF) before running phenomenet-vp on it:

a) Install VCFtools.

b) Run the following VCFtools command on your input VCF file to filter out variants with MAF > 1%:

    vcftools --vcf input_file.vcf --recode --max-maf 0.01 --out filtered

c) Run PVP on the output file filtered.recode.vcf generated by the command above.

PVP 1.0

The original random-forest-based PVP tool is available to download here along with its required data files here. The prepared set of exomes and genomes used for the analysis and results are provided here.

DeepPVP

The updated neural-network model, DeepPVP, is available to download here along with its required data files here. The prepared set of exomes used for the analysis and comparative results are provided here. The comparison with PVP is based on PVP-1.1, available here along with its required data files here.
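The rare-variant filtering step and the usage examples above can be chained from a small driver script. The following is a minimal sketch, not part of the PVP distribution: the function names (vcftools_maf_filter_cmd, pvp_cmd, run_pipeline) are hypothetical, and it assumes that vcftools is on the PATH and that the script is started from the phenomenet-vp installation directory. It uses only the flags documented in this guide and defaults to a dry run that prints the commands.

```python
import subprocess


def vcftools_maf_filter_cmd(vcf_path, out_prefix="filtered", max_maf=0.01):
    """Build the VCFtools command from step (b) of the rare-variant section."""
    return ["vcftools", "--vcf", vcf_path, "--recode",
            "--max-maf", str(max_maf), "--out", out_prefix]


def pvp_cmd(vcf_path, omim=None, phenotypes=None, inheritance=None,
            human=False, digenic=False, executable="bin/phenomenet-vp"):
    """Build a phenomenet-vp command from flags listed under Parameters.

    Assumption: at least one of `omim` or `phenotypes` must be given,
    following the Usage section of this guide.
    """
    if omim is None and not phenotypes:
        raise ValueError("provide an OMIM ID or a list of phenotype IDs")
    cmd = [executable, "-f", vcf_path]
    if omim is not None:
        cmd += ["-o", omim]
    if phenotypes:
        # The -p flag takes phenotype IDs separated by commas.
        cmd += ["-p", ",".join(phenotypes)]
    if inheritance is not None:
        cmd += ["-i", inheritance]
    if human:
        cmd.append("--human")
    if digenic:
        cmd.append("--digenic")
    return cmd


def run_pipeline(vcf_path, omim, out_prefix="filtered", dry_run=True):
    """Filter by MAF, then rank variants in <out_prefix>.recode.vcf."""
    filter_cmd = vcftools_maf_filter_cmd(vcf_path, out_prefix)
    rank_cmd = pvp_cmd(out_prefix + ".recode.vcf", omim=omim)
    for cmd in (filter_cmd, rank_cmd):
        print(" ".join(cmd))
        if not dry_run:
            # Raises CalledProcessError if the tool exits with an error.
            subprocess.check_call(cmd)
    return filter_cmd, rank_cmd


if __name__ == "__main__":
    # Dry run: prints the two commands without executing them.
    run_pipeline("input_file.vcf", "OMIM:263750")
```

Passing dry_run=False executes both commands in order. Building the commands as argument lists rather than shell strings avoids quoting problems with file paths.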

OligoPVP

OligoPVP is provided as part of the DeepPVP tool using the parameters --digenic, --trigenic, and --oligogenic for ranking candidate disease-causing variant pairs and triples. Our prepared set of synthetic genomes with digenic combinations is available here, using data from the DIgenic diseases DAtabase (DIDA). The comparison results with other methods are also provided. Results were obtained using DeepPVP v2.0.

People

PVP is jointly developed by researchers at the University of Birmingham (Prof George Gkoutos and his team), the University of Cambridge (Dr Paul Schofield and his team), and King Abdullah University of Science and Technology (Prof Vladimir Bajic, Robert Hoehndorf, and teams).

Publications

- Boudellioua I, Mahamad Razali RB, Kulmanov M, Hashish Y, Bajic VB, Goncalves-Serra E, Schoenmakers N, Gkoutos GV, Schofield PN, and Hoehndorf R. (2017) Semantic prioritization of novel causative genomic variants. PLOS Computational Biology. https://doi.org/10.1371/journal.pcbi.1005500
- Boudellioua I, Kulmanov M, Schofield PN, Gkoutos GV, and Hoehndorf R. (2018) OligoPVP: Phenotype-driven analysis of individual genomic information to prioritize oligogenic disease variants. Scientific Reports. https://doi.org/10.1038/s41598-018-32876-3
- Boudellioua I, Kulmanov M, Schofield PN, Gkoutos GV, and Hoehndorf R. (2019) DeepPVP: phenotype-based prioritization of causative variants using deep learning. BMC Bioinformatics. https://doi.org/10.1186/s12859-019-2633-8

License

Copyright (c) 2016-2018, King Abdullah University of Science and Technology. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. All advertising materials mentioning features or use of this software must display the following acknowledgment: This product includes software developed by the King Abdullah University of Science and Technology.
4. Neither the name of the King Abdullah University of Science and Technology nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY KING ABDULLAH UNIVERSITY OF SCIENCE AND TECHNOLOGY ''AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL KING ABDULLAH UNIVERSITY OF SCIENCE AND TECHNOLOGY BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Sponsors: NS was funded by the Wellcome Trust (Grant 100585/Z/12/Z) and the National Institute for Health Research Cambridge Biomedical Research Centre. IB, RBMR, MK, YH, VBB, and RH were funded by the King Abdullah University of Science and Technology. GVG acknowledges funding from the National Science Foundation (NSF grant number: IOS-1340112) and the European Commission H2020 (Grant Agreement No. 731075).