We are all witnessing an explosion in the volume of biological data generated by the latest high-throughput technologies, and the rate of growth is likely to accelerate further. The sheer volume, complexity and interdependence of the data all pose analytical challenges. What can we make of the data? How should we analyze it? What knowledge lies hidden beneath this complexity? These are just some of the questions that contemporary life sciences in general, and bioinformatics in particular, have to grapple with in the quest to improve lives through knowledge discovery. Challenges arising from the size and complexity of data sets are not unique to the life sciences; other examples include high-energy physics, climate science, astrophysics and national security data. What is common to all these fields is that the analysis of data requires a new approach within the paradigm of Big Data and exascale computing. In this seminar, some of the state-of-the-art approaches to Big Data challenges will be discussed. The seminar presents an opportunity for those who generate data and those who analyze it to discuss possible ways forward towards more efficient analysis, knowledge discovery and modeling. Conference web site: http://www.cbrc.kaust.edu.sa/cbrcweb/sp/bd2016.php

Recent Submissions

  • Knowledge Exploration from Big Data in Biomedicine

    Bajic, Vladimir B. (2016-01-27) [Presentation]
    The last few decades have witnessed an enormous accumulation of data and information in various forms in the domain of Biomedicine. Searching for accurate and rich information on any particular topic in this domain is challenging. The main reasons are that a) useful pieces of information are scattered across numerous sources, b) data are stored in a variety of formats, c) data and information are not indexed with standard identifiers, d) a lot of information is in free-text format, and e) frequently the information needed is not explicitly presented in any single source. This situation requires new approaches to search for, extract and explore the desired information. We will present a system developed at KAUST that addresses some of these challenges. This system represents a technological solution of the kind that can be called Next Generation Knowledge Mining Systems for the biomedical domain.
  • Function and Phenotype prediction through Data and Knowledge Fusion

    Vespoor, Karen (2016-01-27) [Presentation]
    The biomedical literature captures the most current biomedical knowledge and is a tremendously rich resource for research. With over 24 million publications currently indexed in the US National Library of Medicine’s PubMed index, however, it is becoming increasingly challenging for biomedical researchers to keep up with this literature. Automated strategies for extracting information from it are required. Large-scale processing of the literature enables direct biomedical knowledge discovery. In this presentation, I will introduce the use of text mining techniques to support analysis of biological data sets, and will specifically discuss applications in protein function and phenotype prediction, as well as analysis of genetic variants that are supported by analysis of the literature and integration with complementary structured resources.
  • Knowledge-based analysis of phenotypes

    Hoehndorf, Robert (2016-01-27) [Presentation]
    Phenotypes are the observable characteristics of an organism, and they are widely recorded in biology and medicine. To facilitate data integration, ontologies that formally describe phenotypes are being developed in several domains. I will describe a formal framework to describe phenotypes. A formalized theory of phenotypes is not only useful for domain analysis, but can also be applied to assist in the diagnosis of rare genetic diseases, and I will show how our results on the ontology of phenotypes are now applied in biomedical research.
  • Machine learning and complex-network for personalized and systems biomedicine

    Cannistraci, Carlo Vittorio (2016-01-27) [Presentation]
    The talk will begin with an introduction on using machine learning to discover hidden information and unexpected patterns in large biomedical datasets. Then, recent results on the use of complex network theory in biomedicine and neuroscience will be discussed. In particular, metagenomics and metabolomics data, approaches for drug-target repositioning, functional/structural MR connectomes and gut-brain axis data will be presented. The conclusion will outline the novel and exciting perspectives offered by the translation of these methods from systems biology to systems medicine.
  • The Glory and Misery of Electronic Health Records

    Smith, Barry (2016-01-27) [Presentation]
    While bioinformatics has witnessed enormous technological advances since the turn of the millennium, progress in the EHR field has been stymied by outdated approaches entrenched through ill-conceived government mandates. In the US, especially, the dominant EHR systems are expensive, difficult to use, fail to ensure even a minimal level of interoperability, and detract from patient care. I will outline the reasons for some of these failures, and sketch an evolutionary path towards the sort of EHR landscape that will be needed in the future, in which consistency with biomedical ontologies will play a central role.
  • Protein phosphorylation in bacterial signaling and regulation

    Mijakovic, Ivan (2016-01-26) [Presentation]
    In 2003, it was demonstrated for the first time that bacteria possess protein-tyrosine kinases (BY-kinases), capable of phosphorylating other cellular proteins and regulating their activity. It soon became apparent that these kinases phosphorylate a number of protein substrates involved in different cellular processes. More recently, we found that BY-kinases can be activated by several distinct protein interactants, and are capable of engaging in cross-phosphorylation with other kinases. Evolutionary studies based on genome comparison indicate that BY-kinases exist only in bacteria. They are non-essential (present in about 40% of bacterial genomes), and their knockouts lead to pleiotropic phenotypes, since they phosphorylate many substrates. Surprisingly, BY-kinase genes accumulate mutations at an increased rate (their non-synonymous substitution rate is significantly higher than that of other bacterial genes). One direct consequence of this phenomenon is that there is no detectable co-evolution between kinases and their substrates. Their promiscuity towards substrates thus seems to be “hard-wired”, but why would bacteria maintain such promiscuous regulatory devices? One explanation is the maintenance of BY-kinases as rapidly evolving regulators, which can readily adopt new substrates when environmental changes impose selective pressure for quick evolution of new regulatory modules. Their role is clearly not to act as master regulators, dedicated to triggering a single response; rather, they might be employed to contribute to fine-tuning and improving the robustness of various cellular responses. This unique feature makes BY-kinases a potentially useful tool in synthetic biology. While other bacterial kinases are very specific and their signaling pathways insulated, BY-kinases can relatively easily be engineered to adopt new substrates and control new biosynthetic processes. Since they are absent in humans, and regulate some key functions in pathogenic bacteria, they are also very promising targets for new antibacterial drugs.
  • Deep Learning and Applications in Computational Biology

    Zeng, Jianyang (2016-01-26) [Presentation]
    RNA-binding proteins (RBPs) play important roles in the post-transcriptional control of RNAs. Identifying RBP binding sites and characterizing RBP binding preferences are key steps toward understanding the basic mechanisms of post-transcriptional gene regulation. Though numerous computational methods have been developed for modeling RBP binding preferences, discovering a complete structural representation of the RBP targets by integrating their available structural features in all three dimensions is still a challenging task. In this work, we develop a general and flexible deep learning framework for modeling structural binding preferences and predicting binding sites of RBPs, which takes (predicted) RNA tertiary structural information into account for the first time. Our framework constructs a unified representation that characterizes the structural specificities of RBP targets in all three dimensions, which can be further used to predict novel candidate binding sites and discover potential binding motifs. Through testing on real CLIP-seq datasets, we have demonstrated that our deep learning framework can automatically extract effective hidden structural features from the encoded raw sequence and structural profiles, and predict accurate RBP binding sites. In addition, we have conducted the first study to show that integrating the additional RNA tertiary structural features can improve the model performance in predicting RBP binding sites, especially for the polypyrimidine tract-binding protein (PTB), which also provides new evidence to support the view that RBPs may have specific tertiary structural binding preferences. In particular, the tests on the internal ribosome entry site (IRES) segments yield satisfactory results with experimental support from the literature and further demonstrate the necessity of incorporating RNA tertiary structural information into the prediction model. The source code of our approach can be found at https://github.com/thucombio/deepnet-rbp.
  • Diversity Indices as Measures of Functional Annotation Methods in Metagenomics Studies

    Jankovic, Boris R. (2016-01-26) [Presentation]
    Applications of high-throughput techniques in metagenomics studies produce massive amounts of data. Fragments of genomic, transcriptomic and proteomic molecules are all found in metagenomics samples. Laborious and meticulous effort in sequencing and functional annotation is then required to, amongst other objectives, reconstruct a taxonomic map of the environment that the metagenomics samples were taken from. In addition to the computational challenges faced by metagenomics studies, the analysis is further complicated by the presence of contaminants in the samples, potentially resulting in skewed taxonomic analysis. Functional annotation in metagenomics can utilize all available omics data and therefore different methods, each associated with a particular type of data. For example, protein-coding DNA, non-coding RNA or ribosomal RNA data can be used in such an analysis. These methods have their advantages and disadvantages, and the question of comparison among them naturally arises. There are several criteria that can be used when performing such a comparison. Loosely speaking, methods can be evaluated in terms of computational complexity or in terms of the expected biological accuracy. We propose that the concept of diversity that is used in ecosystem and species diversity studies can be successfully used in evaluating certain aspects of the methods employed in metagenomics studies. We show that when applying the concept of Hill’s diversity, the analysis of variations in the diversity order provides valuable clues into the robustness of methods used in the taxonomical analysis.
  • High throughput comparisons and profiling of metagenomes for industrially relevant enzymes

    Alam, Intikhab (2016-01-26) [Presentation]
    More and more genomes and metagenomes are being sequenced since the advent of Next Generation Sequencing (NGS) technologies. Many metagenomic samples are collected from a variety of environments, each exhibiting a different environmental profile, e.g. temperature or environmental chemistry. These metagenomes can be profiled to unearth enzymes relevant to several industries based on specific enzyme properties, such as the ability to work under extreme conditions, e.g. extreme temperatures, high salinity or anaerobic environments. In this work, we present the DMAP platform, comprising a high-throughput metagenomic annotation pipeline and a data warehouse for comparisons and profiling across a large number of metagenomes. We developed two reference databases for profiling of important genes, one containing enzymes related to different industries and the other containing genes with potential bioactivity roles. In this presentation we describe an example analysis of a large number of publicly available metagenomic samples from the TARA Oceans study (Science 2015), which covers a significant part of the world's oceans.
  • Comparative metagenomics of the Red Sea

    Mineta, Katsuhiko (2016-01-26) [Presentation]
    Metagenomics produces a tremendous amount of data that comes from the organisms living in the environment. This big data enables us to examine not only microbial genes but also the community structure, interaction and adaptation mechanisms at a specific location and condition. The Red Sea has several unique characteristics such as high salinity, high temperature and low nutrient levels. These features must have contributed to the formation of its unique microbial community during the evolutionary process. Since 2014, we have carried out monthly sampling of metagenomes in the Red Sea under the KAUST-CCF project. In collaboration with Kitasato University, we also collected metagenome data from the ocean in Japan, which shows contrasting features to the Red Sea. Therefore, the comparative metagenomics of those data provides a comprehensive view of the Red Sea microbes, leading to the identification of key microbes, genes and networks related to those environmental differences.
  • Three-Dimensional Structures of Autophosphorylation Complexes in Crystals of Protein Kinases

    Dumbrack, Roland (2016-01-26) [Presentation]
    Protein kinase autophosphorylation is a common regulatory mechanism in cell signaling pathways. Several autophosphorylation complexes have been identified in crystals of protein kinases, with a known serine, threonine, or tyrosine autophosphorylation site of one kinase monomer sitting in the active site of another monomer of the same protein in the crystal. We utilized a structural bioinformatics method to identify all such autophosphorylation complexes in X-ray crystallographic structures in the Protein Data Bank (PDB) by generating all unique kinase/kinase interfaces within and between asymmetric units of each crystal and measuring the distance between the hydroxyl oxygen of potential autophosphorylation sites and the oxygen atoms of the active site aspartic acid residue side chain. We have identified 15 unique autophosphorylation complexes in the PDB, of which 5 complexes have not previously been described in the relevant publications on the crystal structures (N-terminal juxtamembrane regions of CSF1R and EPHA2, activation loop tyrosines of LCK and IGF1R, and a serine in a nuclear localization signal region of CLK2). Mutation of residues in the autophosphorylation complex interface of LCK either severely impaired autophosphorylation or increased it. Taking the autophosphorylation complexes as a whole and comparing them with peptide-substrate/kinase complexes, we observe a number of important features among them. The novel and previously observed autophosphorylation sites are conserved in many kinases, indicating that by homology we can extend the relevance of these complexes to many other clinically relevant drug targets.
  • Big data integration: scalability and sustainability

    Zhang, Zhang (2016-01-26) [Presentation]
    Integration of various types of omics data is indispensable for addressing the most important and complex biological questions. In the era of big data, however, data integration becomes increasingly tedious, time-consuming and expensive, posing a significant obstacle to fully exploiting the wealth of big biological data. Here we propose a scalable and sustainable architecture that integrates big omics data through community-contributed modules. Community modules are contributed and maintained by different committed groups; each module corresponds to a specific data type, deals with data collection, processing and visualization, and delivers data on demand via web services. Based on this community-based architecture, we build Information Commons for Rice (IC4R; http://ic4r.org), a rice knowledgebase that integrates a variety of rice omics data from multiple community modules, including genome-wide expression profiles derived entirely from RNA-Seq data, genomic variations obtained from re-sequencing data of thousands of rice varieties, plant homologous genes covering multiple diverse plant species, post-translational modifications, rice-related literature, and community annotations. Taken together, such an architecture achieves integration of different types of data from multiple community-contributed modules and accordingly features scalable, sustainable and collaborative integration of big data as well as low costs for database update and maintenance. It is thus helpful for building IC4R into a comprehensive knowledgebase covering all aspects of rice data and beneficial for both basic and translational research.
  • Modeling structure of G protein-coupled receptors in human genome

    Zhang, Yang (2016-01-26) [Presentation]
    G protein-coupled receptors (or GPCRs) are integral transmembrane proteins responsible for various cellular signal transductions. Human GPCR proteins are encoded by 5% of human genes but account for the targets of 40% of the FDA approved drugs. Due to difficulties in crystallization, experimental structure determination remains extremely difficult for human GPCRs, which has been a major barrier in modern structure-based drug discovery. We proposed a new hybrid protocol, GPCR-I-TASSER, to construct GPCR structure models by integrating experimental mutagenesis data with ab initio transmembrane-helix assembly simulations, assisted by the predicted transmembrane-helix interaction networks. The method was tested in recent community-wide GPCRDock experiments and constructed models with a root mean square deviation of 1.26 Å for Dopamine-3 and 2.08 Å for Chemokine-4 receptors in the transmembrane domain regions, which were significantly closer to the native structures than the best templates available in the PDB. GPCR-I-TASSER has been applied to model all 1,026 putative GPCRs in the human genome, of which 923 are found to have correct folds based on the confidence score analysis and mutagenesis data comparison. The successfully modeled GPCRs contain many pharmaceutically important families that do not have previously solved structures, including Trace amine, Prostanoids, Releasing hormones, Melanocortins, Vasopressin and Neuropeptide Y receptors. All the human GPCR models have been made publicly available through the GPCR-HGmod database at http://zhanglab.ccmb.med.umich.edu/GPCR-HGmod/. The results demonstrate new progress on genome-wide structure modeling of transmembrane proteins, which should have a useful impact on the effort of GPCR-targeted drug discovery.
  • Network-based analysis of proteomic profiles

    Wong, Limsoon (2016-01-26) [Presentation]
    Mass spectrometry (MS)-based proteomics is a widely used and powerful tool for profiling systems-wide protein expression changes. It can be applied for various purposes, e.g. biomarker discovery in diseases and study of drug responses. Although RNA-based high-throughput methods have been useful in providing glimpses into the underlying molecular processes, the evidence they provide is indirect. Furthermore, RNA and corresponding protein levels are known to correlate poorly. On the other hand, MS-based proteomics tends to have consistency issues (poor reproducibility and inter-sample agreement) and coverage issues (inability to detect the entire proteome) that need to be urgently addressed. In this talk, I will discuss how these issues can be addressed by proteomic profile analysis techniques that use biological networks (especially protein complexes) as the biological context. In particular, I will describe several techniques that we have been developing for network-based analysis of proteomic profiles. I will also present evidence that these techniques are useful in identifying proteomic-profile analysis results that are more consistent, more reproducible, and more biologically coherent, and that these techniques allow expansion of the detected proteome to uncover and/or discover novel proteins.
  • Finding a Leucine in a Haystack: Searching the Proteome for ambiguous Leucine-Aspartic Acid motifs

    Arold, Stefan T. (2016-01-25) [Presentation]
    Leucine-aspartic acid (LD) motifs are short helical protein-protein interaction motifs involved in cell motility, survival and communication. LD motif interactions are also implicated in cancer metastasis and are targeted by several viruses. LD motifs are notoriously difficult to detect because sequence pattern searches lead to an excessively high number of false positives. Hence, despite 20 years of research, only six LD motif–containing proteins are known in humans, three of which are close homologues of the paxillin family. To enable the proteome-wide discovery of LD motifs, we developed LD Motif Finder (LDMF), a web tool based on machine learning that combines sequence information with structural predictions to detect LD motifs with high accuracy. LDMF predicted 13 new LD motifs in humans. Using biophysical assays, we experimentally confirmed in vitro interactions for four novel LD motif proteins. Thus, LDMF allows proteome-wide discovery of LD motifs, despite a highly ambiguous sequence pattern. Functional implications will be discussed.
  • Emerging experimental and computational technologies for purpose designed engineering of photosynthetic prokaryotes

    Lindblad, Peter (2016-01-25) [Presentation]
    With recent advances in synthetic molecular tools to be used in photosynthetic prokaryotes, like cyanobacteria, it is possible to custom design and construct microbial cells for specific metabolic functions. This cross-disciplinary area of research has emerged within the interfaces of advanced genetic engineering, computational science, and molecular biotechnology. We have initiated the development of a genetic toolbox, using a synthetic biology approach, to custom design, engineer and construct cyanobacteria for selected function and metabolism. One major bottleneck is the controlled transcription and translation of introduced genetic constructs. An additional major issue is genetic stability. I will present and discuss recent progress in our development of genetic tools for advanced cyanobacterial biotechnology. Progress on understanding the electron pathways in native and engineered cyanobacterial enzymes and heterologous expression of non-native enzymes in cyanobacterial cells will be highlighted. Finally, I will discuss our attempts to merge synthetic biology with synthetic chemistry to explore fundamental questions of protein design and function.
  • Trusted Allies with New Benefits: Repositioning Existing Drugs

    Gao, Xin (2016-01-25) [Presentation]
    The classical assumption that one drug cures a single disease by binding to a single drug-target has been shown to be inaccurate. Recent studies estimate that each drug on average binds to at least six known and several unknown targets. Identifying the “off-targets” can help understand the side effects and toxicity of the drug. Moreover, off-targets for a given drug may inspire “drug repositioning”, where a drug already approved for one condition is redirected to treat another condition, thereby overcoming delays and costs associated with clinical trials and drug approval. In this talk, I will introduce our work along this direction. We have developed a structural alignment method that can precisely identify structural similarities between arbitrary types of interaction interfaces, such as the drug-target interaction. We have further developed a novel computational framework, iDTP, that constructs the structural signatures of approved and experimental drugs, based on which we predict new targets for these drugs. Our method combines information from several sources including sequence-independent structural alignment, sequence similarity, drug-target tissue expression data, and text mining. In a cross-validation study, we used iDTP to predict the known targets of 11 drugs, with 63% sensitivity and 81% specificity. We then predicted novel targets for these drugs; two that are of high pharmacological interest, the peroxisome proliferator-activated receptor gamma and the oncogene B-cell lymphoma 2, were successfully validated through in vitro binding experiments.
  • Welcome Address

    Frechet, Jean (2016-01-25) [Presentation]
  • Big Data and HPC: A Happy Marriage

    Mehmood, Rashid (2016-01-25) [Presentation]
    International Data Corporation (IDC) defines Big Data technologies as “a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data produced every day, by enabling high velocity capture, discovery, and/or analysis”. High Performance Computing (HPC) most generally refers to “the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation in order to solve large problems in science, engineering, or business”. Big data platforms are built primarily considering the economics and capacity of the system for dealing with the 4V characteristics of data. HPC traditionally has been more focussed on the speed of digesting (computing) the data. For these reasons, the two domains (HPC and Big Data) have developed their own paradigms and technologies. However, recently, these two have grown fond of each other. HPC technologies are needed by Big Data to deal with the ever-increasing Vs of data in order to forecast and extract insights from existing and new domains, faster, and with greater accuracy. More and more data is being produced by scientific experiments in areas such as bioscience, physics, and climate, and therefore HPC needs to adopt data-driven paradigms. Moreover, there are synergies between them with unimaginable potential for developing new computing paradigms, solving long-standing grand challenges, and making new explorations and discoveries. Therefore, they must get married to each other. In this talk, we will trace the HPC and big data landscapes through time, including their respective technologies, paradigms and major application areas. Subsequently, we will present the factors that are driving the convergence of the two technologies, the synergies between them, as well as the benefits of their convergence to the biosciences field. The opportunities and challenges of the computing paradigm resulting from this convergence will be discussed.
  • Big Data Analysis of Human Genome Variations

    Gojobori, Takashi (2016-01-25) [Presentation]
    Since the human genome draft sequence was made public for the first time in 2000, genomic analyses have been intensively extended to the population level. The following three international projects are good examples of large-scale studies of human genome variations: 1) HapMap Data (1,417 individuals) (http://hapmap.ncbi.nlm.nih.gov/downloads/genotypes/2010-08_phaseII+III/forward/), 2) HGDP (Human Genome Diversity Project) Data (940 individuals) (http://www.hagsc.org/hgdp/files.html), 3) 1000 Genomes Data (2,504 individuals) (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/). If we can integrate all three data sets into a single volume of data, we should be able to conduct a more detailed analysis of human genome variations for a total of 4,861 individuals (= 1,417+940+2,504 individuals). In fact, we successfully integrated these three data sets by use of information on the reference human genome sequence, and we conducted the big data analysis. In particular, we constructed a phylogenetic tree of about 5,000 human individuals at the genome level. As a result, we were able to identify clusters of ethnic groups, with detectable admixture, which was not possible through analysis of each of the three data sets alone. Here, we report the outcome of this kind of big data analysis and discuss the evolutionary significance of human genomic variations. Note that the present study was conducted in collaboration with Katsuhiko Mineta and Kosuke Goto at KAUST.
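
As a small illustration of one quantitative idea in the listing above, the abstract on diversity indices in metagenomics refers to Hill's diversity and to varying the diversity order. The sketch below shows the standard Hill number of order q computed from taxon counts. It is not the presenter's code: the function names `hill_diversity` and `diversity_profile` and the example counts are hypothetical, chosen only for illustration.

```python
import math


def hill_diversity(counts, q):
    """Hill number (effective number of taxa) of order q for raw taxon counts.

    q = 0 gives richness, q = 1 the exponential of Shannon entropy,
    q = 2 the inverse Simpson index.
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")
    # Relative abundances; taxa with zero count do not contribute.
    props = [c / total for c in counts if c > 0]
    if math.isclose(q, 1.0):
        # Limit of the general formula as q approaches 1.
        return math.exp(-sum(p * math.log(p) for p in props))
    return sum(p ** q for p in props) ** (1.0 / (1.0 - q))


def diversity_profile(counts, orders=(0, 1, 2)):
    """Hill numbers over several orders, i.e. a diversity profile."""
    return {q: hill_diversity(counts, q) for q in orders}


if __name__ == "__main__":
    # Hypothetical taxon counts produced by two annotation methods
    # applied to the same sample.
    method_a = [90, 5, 5]
    method_b = [40, 30, 30]
    print(diversity_profile(method_a))
    print(diversity_profile(method_b))
```

Comparing how the profile changes with q across annotation methods is one possible way to read the "variations in the diversity order" mentioned in that abstract; low orders weight rare taxa heavily, while higher orders emphasize dominant taxa.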