We are all witnessing an explosion in the volume of biological data generated by the latest high-throughput technologies, and the rate of growth is likely to accelerate further. The sheer volume, complexity and interdependence of these data all pose analytical challenges. What should we make of the data? How should we analyze them? What knowledge lies buried under this complexity? These are just some of the questions that the contemporary life sciences in general, and bioinformatics in particular, must grapple with in the quest to improve lives through knowledge discovery. Challenges arising from the size and complexity of data sets are not unique to the life sciences; high-energy physics, climate science, astrophysics and national security face them as well. What these fields have in common is that analyzing such data requires a new approach within the paradigm of Big Data and exascale computing. This seminar will discuss some of the state-of-the-art approaches to Big Data challenges, and presents an opportunity for those who generate data and those who analyze it to discuss possible ways forward towards more efficient analysis, knowledge discovery and modeling. Conference web site: http://www.cbrc.kaust.edu.sa/cbrcweb/sp/bd2016.php

Recent Submissions

  • The Glory and Misery of Electronic Health Records

    Smith, Barry (2016-01-27) [Presentation]
    While bioinformatics has witnessed enormous technological advances since the turn of the millennium, progress in the EHR field has been stymied by outdated approaches entrenched through ill-conceived government mandates. In the US, especially, the dominant EHR systems are expensive, difficult to use, fail to ensure even a minimal level of interoperability, and detract from patient care. I will outline the reasons for some of these failures, and sketch an evolutionary path towards the sort of EHR landscape that will be needed in the future, in which consistency with biomedical ontologies will play a central role.
  • Function and Phenotype prediction through Data and Knowledge Fusion

    Verspoor, Karin (2016-01-27) [Presentation]
    The biomedical literature captures the most current biomedical knowledge and is a tremendously rich resource for research. With over 24 million publications currently indexed in the US National Library of Medicine’s PubMed index, however, it is becoming increasingly challenging for biomedical researchers to keep up with this literature. Automated strategies for extracting information from it are required. Large-scale processing of the literature enables direct biomedical knowledge discovery. In this presentation, I will introduce the use of text mining techniques to support analysis of biological data sets, and will specifically discuss applications in protein function and phenotype prediction, as well as analysis of genetic variants that are supported by analysis of the literature and integration with complementary structured resources.
  • Machine learning and complex-network for personalized and systems biomedicine

    Cannistraci, Carlo Vittorio (2016-01-27) [Presentation]
    The talk will begin with an introduction on using machine learning to discover hidden information and unexpected patterns in large biomedical datasets. Then, recent results on the use of complex network theory in biomedicine and neuroscience will be discussed. In particular, metagenomics and metabolomics data, approaches for drug-target repositioning, functional/structural MR connectomes and gut-brain axis data will be presented. The conclusion will outline the novel and exciting perspectives offered by the translation of these methods from systems biology to systems medicine.
  • Knowledge Exploration from Big Data in Biomedicine

    Bajic, Vladimir B. (2016-01-27) [Presentation]
    The last few decades have witnessed an enormous accumulation of data and information in various forms in the domain of Biomedicine. Searching for accurate and rich information on any particular topic in this domain is challenging. The main reasons are that a) useful pieces of information are scattered across numerous sources, b) data are stored in a variety of formats, c) data and information are not indexed with standard identifiers, d) much of the information exists only as free text, and e) frequently the information needed is not explicitly present in any single data or information source. This situation requires new approaches to searching for, extracting and exploring the desired information. We will present a system developed at KAUST that addresses some of these challenges. This system is representative of a technological solution to what might be called Next Generation Knowledge Mining Systems for the biomedical domain.
  • Knowledge-based analysis of phenotypes

    Hoehndorf, Robert (2016-01-27) [Presentation]
    Phenotypes are the observable characteristics of an organism, and they are widely recorded in biology and medicine. To facilitate data integration, ontologies that formally describe phenotypes are being developed in several domains. I will present a formal framework for describing phenotypes. A formalized theory of phenotypes is not only useful for domain analysis, but can also be applied to assist in the diagnosis of rare genetic diseases, and I will show how our results on the ontology of phenotypes are now being applied in biomedical research.
  • Diversity Indices as Measures of Functional Annotation Methods in Metagenomics Studies

    Jankovic, Boris R. (2016-01-26) [Presentation]
    Applications of high-throughput techniques in metagenomics studies produce massive amounts of data. Fragments of genomic, transcriptomic and proteomic molecules are all found in metagenomics samples. Laborious and meticulous effort in sequencing and functional annotation is then required to, among other objectives, reconstruct a taxonomic map of the environment the samples were taken from. In addition to the computational challenges metagenomics studies face, the analysis is further complicated by the presence of contaminants in the samples, which can skew the taxonomic analysis. Functional annotation in metagenomics can utilize all available omics data, and therefore different methods associated with each particular data type; for example, protein-coding DNA, non-coding RNA or ribosomal RNA data can all be used in such an analysis. Each method has its advantages and disadvantages, and the question of how to compare them naturally arises. Several criteria can be used for such a comparison: loosely speaking, methods can be evaluated in terms of computational complexity or of expected biological accuracy. We propose that the concept of diversity used in ecosystem and species diversity studies can be successfully applied to evaluating certain aspects of the methods employed in metagenomics studies. We show that, when applying the concept of Hill's diversity, the analysis of variations in the diversity order provides valuable clues to the robustness of methods used in the taxonomic analysis.
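The abstract invokes Hill's diversity without stating it. For reference, the Hill number of order q is (Σ pᵢ^q)^(1/(1−q)), with the q→1 limit equal to the exponential of the Shannon entropy. A minimal sketch follows; the example abundance profiles are hypothetical and not taken from the study:

```python
import math

def hill_diversity(counts, q):
    """Hill diversity (effective number of taxa) of order q.

    counts: per-taxon abundances (e.g. annotated read counts).
    q = 0 -> species richness; q -> 1 -> exp(Shannon entropy);
    q = 2 -> inverse Simpson index.
    """
    total = sum(counts)
    p = [c / total for c in counts if c > 0]
    if q == 1:
        # Limit case: exponential of the Shannon entropy
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1.0 / (1.0 - q))

# Two hypothetical annotation methods profiling the same sample
method_a = [50, 30, 10, 5, 5]   # relatively even taxonomic profile
method_b = [88, 6, 3, 2, 1]     # dominated by a single taxon
for q in (0, 1, 2):
    print(q, hill_diversity(method_a, q), hill_diversity(method_b, q))
```

Scanning the diversity order q, as the abstract suggests, shows how sensitive each method's taxonomic profile is to dominant versus rare taxa: the profiles agree at q = 0 (richness) but diverge increasingly at higher orders.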
  • Protein phosphorylation in bacterial signaling and regulation

    Mijakovic, Ivan (2016-01-26) [Presentation]
    In 2003, it was demonstrated for the first time that bacteria possess protein-tyrosine kinases (BY-kinases), capable of phosphorylating other cellular proteins and regulating their activity. It soon became apparent that these kinases phosphorylate a number of protein substrates involved in different cellular processes. More recently, we found that BY-kinases can be activated by several distinct protein interactants and are capable of engaging in cross-phosphorylation with other kinases. Evolutionary studies based on genome comparison indicate that BY-kinases exist only in bacteria. They are non-essential (present in about 40% of bacterial genomes), and their knockouts lead to pleiotropic phenotypes, since they phosphorylate many substrates. Surprisingly, BY-kinase genes accumulate mutations at an increased rate (a non-synonymous substitution rate significantly higher than that of other bacterial genes). One direct consequence of this phenomenon is that there is no detectable co-evolution between the kinases and their substrates. Their promiscuity towards substrates thus seems to be “hard-wired”, but why would bacteria maintain such promiscuous regulatory devices? One explanation is the maintenance of BY-kinases as rapidly evolving regulators, which can readily adopt new substrates when environmental changes impose selective pressure for quick evolution of new regulatory modules. Their role is clearly not to act as master regulators dedicated to triggering a single response; rather, they may be employed to fine-tune and improve the robustness of various cellular responses. This unique feature makes BY-kinases a potentially useful tool in synthetic biology. While other bacterial kinases are very specific and their signaling pathways insulated, BY-kinases can relatively easily be engineered to adopt new substrates and control new biosynthetic processes.
Since they are absent in humans, and regulate some key functions in pathogenic bacteria, they are also very promising targets for new antibacterial drugs.
  • Network-based analysis of proteomic profiles

    Wong, Limsoon (2016-01-26) [Presentation]
    Mass spectrometry (MS)-based proteomics is a widely used and powerful tool for profiling systems-wide protein expression changes. It can be applied for various purposes, e.g. biomarker discovery in diseases and the study of drug responses. Although RNA-based high-throughput methods have been useful in providing glimpses into the underlying molecular processes, the evidence they provide is indirect. Furthermore, RNA and corresponding protein levels are known to correlate poorly. On the other hand, MS-based proteomics tends to have consistency issues (poor reproducibility and inter-sample agreement) and coverage issues (inability to detect the entire proteome) that need to be urgently addressed. In this talk, I will discuss how these issues can be addressed by proteomic profile analysis techniques that use biological networks (especially protein complexes) as the biological context. In particular, I will describe several techniques that we have been developing for network-based analysis of proteomic profiles, and I will present evidence that these techniques are useful in identifying proteomic-profile analysis results that are more consistent, more reproducible, and more biologically coherent, and that they allow expansion of the detected proteome to uncover and/or discover novel proteins.
  • Deep Learning and Applications in Computational Biology

    Zeng, Jianyang (2016-01-26) [Presentation]
    RNA-binding proteins (RBPs) play important roles in the post-transcriptional control of RNAs. Identifying RBP binding sites and characterizing RBP binding preferences are key steps toward understanding the basic mechanisms of post-transcriptional gene regulation. Though numerous computational methods have been developed for modeling RBP binding preferences, discovering a complete structural representation of the RBP targets by integrating their available structural features in all three dimensions is still a challenging task. In this work, we develop a general and flexible deep learning framework for modeling structural binding preferences and predicting binding sites of RBPs, which takes (predicted) RNA tertiary structural information into account for the first time. Our framework constructs a unified representation that characterizes the structural specificities of RBP targets in all three dimensions, which can be further used to predict novel candidate binding sites and discover potential binding motifs. Through testing on real CLIP-seq datasets, we have demonstrated that our deep learning framework can automatically extract effective hidden structural features from the encoded raw sequence and structural profiles, and predict RBP binding sites accurately. In addition, we have conducted the first study to show that integrating additional RNA tertiary structural features can improve model performance in predicting RBP binding sites, especially for the polypyrimidine tract-binding protein (PTB), which also provides new evidence to support the view that RBPs may have specific tertiary structural binding preferences. In particular, tests on internal ribosome entry site (IRES) segments yield satisfactory results with experimental support from the literature and further demonstrate the necessity of incorporating RNA tertiary structural information into the prediction model.
The source code of our approach is available at https://github.com/thucombio/deepnet-rbp.
  • Modeling structures of G protein-coupled receptors in the human genome

    Zhang, Yang (2016-01-26) [Presentation]
    G protein-coupled receptors (GPCRs) are integral transmembrane proteins responsible for various cellular signal transduction processes. Human GPCRs are encoded by 5% of human genes but account for the targets of 40% of FDA-approved drugs. Owing to difficulties in crystallization, experimental structure determination remains extremely difficult for human GPCRs, which has been a major barrier in modern structure-based drug discovery. We proposed a new hybrid protocol, GPCR-I-TASSER, to construct GPCR structure models by integrating experimental mutagenesis data with ab initio transmembrane-helix assembly simulations, assisted by predicted transmembrane-helix interaction networks. The method was tested in recent community-wide GPCRDock experiments and constructed models with root-mean-square deviations of 1.26 Å for the Dopamine-3 and 2.08 Å for the Chemokine-4 receptor in the transmembrane domain regions, significantly closer to the native structures than the best templates available in the PDB. GPCR-I-TASSER has been applied to model all 1,026 putative GPCRs in the human genome, of which 923 are found to have correct folds based on confidence score analysis and comparison with mutagenesis data. The successfully modeled GPCRs include many pharmaceutically important families without previously solved structures, including the Trace amine, Prostanoid, Releasing hormone, Melanocortin, Vasopressin and Neuropeptide Y receptors. All the human GPCR models have been made publicly available through the GPCR-HGmod database at http://zhanglab.ccmb.med.umich.edu/GPCR-HGmod/. The results demonstrate new progress in genome-wide structure modeling of transmembrane proteins, which should benefit the effort of GPCR-targeted drug discovery.
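The Å figures above are root-mean-square deviations, the standard measure of model accuracy: the square root of the mean squared distance between corresponding atoms of two superposed structures. A minimal sketch with hypothetical coordinates; the superposition step itself (e.g. the Kabsch algorithm) is assumed to have been done already:

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD (in angstroms) between two already-superposed coordinate
    sets, given as equal-length lists of (x, y, z) tuples."""
    assert len(coords_a) == len(coords_b), "coordinate sets must match"
    sq = sum(sum((x - y) ** 2 for x, y in zip(p, q))
             for p, q in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))

# Hypothetical model vs. native: every atom displaced by 1 A along z
model = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
native = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (3.0, 0.0, 1.0)]
print(rmsd(model, native))  # uniform 1 A displacement -> RMSD of 1.0
```

In practice RMSD for model assessment is computed over selected atom subsets (e.g. Cα atoms of the transmembrane regions, as in the abstract), not over all atoms.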
  • High throughput comparisons and profiling of metagenomes for industrially relevant enzymes

    Alam, Intikhab (2016-01-26) [Presentation]
    More and more genomes and metagenomes are being sequenced since the advent of next-generation sequencing (NGS) technologies. Metagenomic samples are collected from a variety of environments, each with a different environmental profile, e.g. temperature and water chemistry. These metagenomes can be profiled to unearth enzymes relevant to several industries, based on specific enzyme properties such as the ability to function under extreme conditions: extreme temperatures, high salinity, anaerobic environments, and so on. In this work, we present the DMAP platform, comprising a high-throughput metagenomic annotation pipeline and a data warehouse for comparisons and profiling across large numbers of metagenomes. We developed two reference databases for the profiling of important genes, one containing enzymes relevant to different industries and the other containing genes with potential bioactivity roles. In this presentation we describe an example analysis of a large number of publicly available metagenomic samples from the TARA Oceans study (Science, 2015), which covers a significant part of the world's oceans.
  • Comparative metagenomics of the Red Sea

    Mineta, Katsuhiko (2016-01-26) [Presentation]
    Metagenomics produces tremendous amounts of data from the organisms living in an environment. These big data enable us to examine not only microbial genes but also community structure, interactions and adaptation mechanisms at a specific location and under specific conditions. The Red Sea has several unique characteristics, such as high salinity, high temperature and low nutrient levels, and these features must have contributed to the formation of its unique microbial community over evolutionary time. Since 2014, we have carried out monthly sampling of metagenomes in the Red Sea under the KAUST-CCF project. In collaboration with Kitasato University, we also collected metagenome data from the ocean around Japan, which shows contrasting features to the Red Sea. Comparative metagenomics of these data therefore provides a comprehensive view of Red Sea microbes, helping to identify key microbes, genes and networks related to these environmental differences.
  • Three-Dimensional Structures of Autophosphorylation Complexes in Crystals of Protein Kinases

    Dunbrack, Roland (2016-01-26) [Presentation]
    Protein kinase autophosphorylation is a common regulatory mechanism in cell signaling pathways. Several autophosphorylation complexes have been identified in crystals of protein kinases, with a known serine, threonine, or tyrosine autophosphorylation site of one kinase monomer sitting in the active site of another monomer of the same protein in the crystal. We utilized a structural bioinformatics method to identify all such autophosphorylation complexes in X-ray crystallographic structures in the Protein Data Bank (PDB), by generating all unique kinase/kinase interfaces within and between asymmetric units of each crystal and measuring the distance between the hydroxyl oxygen of each potential autophosphorylation site and the oxygen atoms of the active-site aspartic acid side chain. We have identified 15 unique autophosphorylation complexes in the PDB, of which 5 have not previously been described in the publications on the corresponding crystal structures (N-terminal juxtamembrane regions of CSF1R and EPHA2, activation loop tyrosines of LCK and IGF1R, and a serine in a nuclear localization signal region of CLK2). Mutation of residues in the autophosphorylation complex interface of LCK either severely impaired autophosphorylation or increased it. Taking the autophosphorylation complexes as a whole and comparing them with peptide-substrate/kinase complexes, we observe a number of important shared features. The novel and previously observed autophosphorylation sites are conserved in many kinases, indicating that by homology we can extend the relevance of these complexes to many other clinically relevant drug targets.
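The geometric screen described above, flagging interfaces where a substrate hydroxyl oxygen approaches the catalytic aspartate's carboxylate oxygens, can be sketched as follows. The 6 Å cutoff and the bare-coordinate interface are illustrative assumptions, not the authors' actual parameters, and a real pipeline would extract these atom coordinates from PDB files (e.g. with Biopython) and enumerate crystal-symmetry interfaces:

```python
import math

def dist(a, b):
    """Euclidean distance between two 3-D coordinates (angstroms)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def is_candidate_autophosphorylation(hydroxyl_o, asp_od1, asp_od2,
                                     cutoff=6.0):
    """Flag a kinase/kinase interface as a candidate autophosphorylation
    complex when the substrate hydroxyl oxygen (Ser OG, Thr OG1, or
    Tyr OH of one monomer) lies within `cutoff` angstroms of either
    carboxylate oxygen (OD1/OD2) of the other monomer's active-site
    aspartate. The cutoff value here is an illustrative choice."""
    return min(dist(hydroxyl_o, asp_od1),
               dist(hydroxyl_o, asp_od2)) <= cutoff

# Hypothetical coordinates: a hydroxyl 3 A from the nearer Asp oxygen
print(is_candidate_autophosphorylation((0.0, 0.0, 0.0),
                                       (3.0, 0.0, 0.0),
                                       (4.5, 1.0, 0.0)))
```

Candidates passing such a distance filter would still need manual inspection to confirm a catalytically plausible geometry, as the abstract's comparison with peptide-substrate/kinase complexes suggests.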
  • Big data integration: scalability and sustainability

    Zhang, Zhang (2016-01-26) [Presentation]
    Integration of various types of omics data is critically indispensable for addressing the most important and complex biological questions. In the era of big data, however, data integration becomes increasingly tedious, time-consuming and expensive, posing a significant obstacle to fully exploiting the wealth of big biological data. Here we propose a scalable and sustainable architecture that integrates big omics data through community-contributed modules. Community modules are contributed and maintained by different committed groups; each module corresponds to a specific data type, handles data collection, processing and visualization, and delivers data on demand via web services. Based on this community-based architecture, we built Information Commons for Rice (IC4R; http://ic4r.org), a rice knowledgebase that integrates a variety of rice omics data from multiple community modules, including genome-wide expression profiles derived entirely from RNA-Seq data, genomic variations obtained from re-sequencing data of thousands of rice varieties, plant homologous genes covering multiple diverse plant species, post-translational modifications, rice-related literature, and community annotations. Taken together, this architecture achieves integration of different data types from multiple community-contributed modules and accordingly features scalable, sustainable and collaborative integration of big data, as well as low costs for database updates and maintenance, helping to build IC4R into a comprehensive knowledgebase covering all aspects of rice data that benefits both basic and translational research.
  • Molecular Genetic Diversity of Date (Phoenix dactylifera) Germplasm in Qatar based on Microsatellite Markers

    Ahmed, Talaat (2016-01-25) [Presentation]
    Studying the genetic diversity of date palm from morphological traits alone is very difficult, since morphological characteristics are highly affected by the environment. DNA markers are an excellent option that can enhance the discriminatory power of morphological characteristics. To study the genetic diversity among date palm cultivars grown in Qatar, fifteen date palm samples were collected from the Qatar University Experimental Farm. DNA was extracted from fresh leaves using the commercial DNeasy Plant System Kit (Qiagen, Inc., Valencia, CA). A total of 18 inter-simple sequence repeat (ISSR) single primers were used to amplify DNA fragments from the genomic DNA of the 15 samples. A first screening tested the ability of these primers to amplify clear bands from date palm genomic DNA, and all 18 ISSR primers succeeded. Each primer was then used separately to genotype the whole set of 15 samples. A total of 4,794 bands were generated by the 18 ISSR primers across the 15 samples; on average, each primer generated 400 bands. The number of amplified bands varied from cultivar to cultivar. The highest numbers of bands were obtained with primers 2, 5 and 12 across the 15 samples (470 bands), while the lowest were obtained with primers 1, 7 and 8, which produced only 329 bands. Markers were scored for the presence or absence of the corresponding band among the different cultivars. The data were subjected to cluster analysis: a similarity matrix was constructed and the similarity values were used for clustering.
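The scoring-and-clustering step described above can be sketched as a presence/absence similarity computation. The Jaccard coefficient and the example band profiles below are illustrative assumptions; the study does not state which similarity coefficient was used:

```python
def jaccard(a, b):
    """Jaccard similarity between two 0/1 band-presence vectors:
    shared bands divided by bands present in either cultivar."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either if either else 1.0

def similarity_matrix(profiles):
    """Pairwise similarity matrix over named cultivar band profiles,
    returned as a dict keyed by (name, name) pairs."""
    names = sorted(profiles)
    return {(i, j): jaccard(profiles[i], profiles[j])
            for i in names for j in names}

# Hypothetical presence/absence scores for three cultivars, six bands
profiles = {
    "cv1": [1, 1, 0, 1, 0, 1],
    "cv2": [1, 1, 0, 0, 0, 1],
    "cv3": [0, 0, 1, 1, 1, 0],
}
m = similarity_matrix(profiles)
print(m[("cv1", "cv2")])  # shares 3 of 4 scored bands -> 0.75
```

The resulting matrix is exactly the input expected by standard hierarchical clustering routines (e.g. UPGMA), which produce the dendrogram typically reported in such diversity studies.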
  • The power of data: structural bioinformatics yesterday and today

    Tramontano, Anna (2016-01-25) [Presentation]
    The protein structure database was established in 1971. At the time it contained seven structures; today there are more than 100,000. The improvement is not only a matter of quantity but also of quality. Have we effectively exploited this information to gain knowledge? The answer is certainly affirmative. I will illustrate how this wealth of experimental data has allowed us to explore the landscape of macromolecular structures on one side, and to uncover the properties of specific protein families on the other. The latter plays an essential role in pursuing exciting new avenues in the biomedical and biotechnological sciences. Experimental data are also part of a virtuous cycle whereby they reinforce and guide our ability to infer unknown macromolecular structures, which, while providing relevant information to scientists, allows us to gauge the level of our understanding of the complex problem of protein folding. A paradigmatic example of the latter is the “Critical Assessment of Techniques for Protein Structure Prediction” (CASP) initiative, which I will briefly discuss.
  • The Genomic Code: Genome Evolution and Potential Applications

    Bernardi, Giorgio (2016-01-25) [Presentation]
    The genome of metazoans is organized according to a genomic code which comprises three laws: 1) Compositional correlations hold between contiguous coding and non-coding sequences, as well as among the three codon positions of protein-coding genes; these correlations are the consequence of the fact that the genomes under consideration consist of fairly homogeneous, long (≥200Kb) sequences, the isochores; 2) Although isochores are defined on the basis of purely compositional properties, GC levels of isochores are correlated with all tested structural and functional properties of the genome; 3) GC levels of isochores are correlated with chromosome architecture from interphase to metaphase; in the case of interphase the correlation concerns isochores and the three-dimensional “topological associated domains” (TADs); in the case of mitotic chromosomes, the correlation concerns isochores and chromosomal bands. Finally, the genomic code is the fourth and last pillar of molecular biology, the first three pillars being 1) the double helix structure of DNA; 2) the regulation of gene expression in prokaryotes; and 3) the genetic code.
  • Emerging experimental and computational technologies for purpose designed engineering of photosynthetic prokaryotes

    Lindblad, Peter (2016-01-25) [Presentation]
    With recent advances in synthetic molecular tools for photosynthetic prokaryotes such as cyanobacteria, it is possible to custom design and construct microbial cells for specific metabolic functions. This cross-disciplinary area of research has emerged at the interface of advanced genetic engineering, computational science, and molecular biotechnology. We have initiated the development of a genetic toolbox, using a synthetic biology approach, to custom design, engineer and construct cyanobacteria for selected functions and metabolism. One major bottleneck is the controlled transcription and translation of introduced genetic constructs; another major issue is genetic stability. I will present and discuss recent progress in our development of genetic tools for advanced cyanobacterial biotechnology. Progress on understanding the electron pathways in native and engineered cyanobacterial enzymes, and on heterologous expression of non-native enzymes in cyanobacterial cells, will be highlighted. Finally, I will discuss our attempts to merge synthetic biology with synthetic chemistry to explore fundamental questions of protein design and function.
  • Systems Biology for Mapping Genotype-Phenotype Relations in Yeast

    Nielsen, Jens (2016-01-25) [Presentation]
    The yeast Saccharomyces cerevisiae is widely used for the production of fuels, chemicals, pharmaceuticals and materials, and through metabolic engineering of this yeast a number of novel industrial processes have been developed over the last 10 years. Besides its wide industrial use, S. cerevisiae serves as a eukaryotic model organism, and many systems biology tools have therefore been developed for it. Among these, genome-scale metabolic models have proven the most successful, as they integrate easily with omics data and have been shown to have excellent predictive power. Despite our extensive knowledge of yeast metabolism and its regulation, we still face challenges when we want to engineer complex traits, such as improved tolerance to toxic metabolites like butanol or to elevated temperatures, or when we want to engineer the highly complex protein secretory pathway. In this presentation it will be demonstrated how we can combine directed evolution with systems biology analysis to identify novel targets for the rational design-build-test of yeast strains with improved phenotypic properties. An overview of systems biology of yeast will be presented, together with examples of how genome-scale metabolic modeling can be used to predict cellular growth under different conditions. Examples will also be given of how adaptive laboratory evolution can be used to identify targets for improving tolerance to butanol, increased temperature and low pH, and for improving secretion of heterologous proteins.
  • Big Data and HPC: A Happy Marriage

    Mehmood, Rashid (2016-01-25) [Presentation]
    International Data Corporation (IDC) defines Big Data technologies as “a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data produced every day, by enabling high velocity capture, discovery, and/or analysis”. High Performance Computing (HPC) most generally refers to “the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation in order to solve large problems in science, engineering, or business”. Big data platforms are built primarily with the economics and capacity of the system in mind, for dealing with the 4V characteristics of data; HPC has traditionally focused more on the speed of digesting (computing) the data. For these reasons, the two domains have developed their own paradigms and technologies. Recently, however, the two have grown fond of each other. HPC technologies are needed by Big Data to deal with the ever-increasing Vs of data in order to forecast and extract insights from existing and new domains faster and with greater accuracy. Ever more data is being produced by scientific experiments in areas such as bioscience, physics, and climate, and therefore HPC needs to adopt data-driven paradigms. Moreover, there are synergies between them with unimaginable potential for developing new computing paradigms, solving long-standing grand challenges, and making new explorations and discoveries. Therefore, they must get married to each other. In this talk, we will trace the HPC and big data landscapes through time, including their respective technologies, paradigms and major application areas. Subsequently, we will present the factors that are driving the convergence of the two technologies, the synergies between them, and the benefits of their convergence for the biosciences field.
The opportunities and challenges of the computing paradigm resulting from this convergence will be discussed.