Recent Submissions

  • Synthesis of Fluoroalkoxy Substituted Arylboronic Esters by Iridium-Catalyzed Aromatic C–H Borylation

    Batool, Farhat; Parveen, Shehla; Emwas, Abdul-Hamid M.; Sioud, Salim; Gao, Xin; Munawar, Munawar A.; Chotana, Ghayoor A. (American Chemical Society (ACS), 2015-08-17)
    The preparation of fluoroalkoxy arylboronic esters by iridium-catalyzed aromatic C–H borylation is described. The fluoroalkoxy groups employed include trifluoromethoxy, difluoromethoxy, 1,1,2,2-tetrafluoroethoxy, and 2,2-difluoro-1,3-benzodioxole. The borylation reactions were carried out neat without the use of a glovebox or Schlenk line. The regioselectivities available through the iridium-catalyzed C–H borylation are complementary to those obtained by the electrophilic aromatic substitution reactions of fluoroalkoxy arenes. Fluoroalkoxy arylboronic esters can serve as versatile building blocks.
  • Modeling DNA affinity landscape through two-round support vector regression with weighted degree kernels

    Wang, Xiaolei; Kuwahara, Hiroyuki; Gao, Xin (Springer Nature, 2014-12-12)
    Background: A quantitative understanding of interactions between transcription factors (TFs) and their DNA binding sites is key to the rational design of gene regulatory networks. Recent advances in high-throughput technologies have enabled high-resolution measurements of protein-DNA binding affinity. Importantly, such experiments revealed the complex nature of TF-DNA interactions, whereby the effects of nucleotide changes on the binding affinity were observed to be context dependent. A systematic method to give high-quality estimates of such complex affinity landscapes is, thus, essential to the control of gene expression and the advance of synthetic biology. Results: Here, we propose a two-round prediction method that is based on support vector regression (SVR) with weighted degree (WD) kernels. In the first round, a WD kernel with shifts and mismatches is used with SVR to detect the importance of subsequences with different lengths at different positions. The subsequences identified as important in the first round are then fed into a second WD kernel to fit the experimentally measured affinities. To our knowledge, this is the first attempt to increase the accuracy of the affinity prediction by applying two rounds of string kernels and by identifying a small number of crucial k-mers. The proposed method was tested by predicting the binding affinity landscape of Gcn4p in Saccharomyces cerevisiae using datasets from HiTS-FLIP. Our method explicitly identified important subsequences and showed significant performance improvements when compared with other state-of-the-art methods. Based on the identified important subsequences, we discovered two surprisingly stable 10-mers and one sensitive 10-mer which were not reported before. Further tests on four other TFs in S. cerevisiae demonstrated the generality of our method. Conclusion: We proposed in this paper a two-round method to quantitatively model the DNA binding affinity landscape. Since the ability to modify genetic parts to fine-tune gene expression rates is crucial to the design of biological systems, such a tool may play an important role in the success of synthetic biology going forward.
  • Quick Mining of Isomorphic Exact Large Patterns from Large Graphs

    Almasri, Islam; Gao, Xin; Fedoroff, Nina V. (Institute of Electrical and Electronics Engineers (IEEE), 2014-12)
    The applications of the subgraph isomorphism search are growing with the growing number of areas that model their systems using graphs or networks. Specifically, many biological systems, such as protein interaction networks, molecular structures and protein contact maps, are modeled as graphs. The subgraph isomorphism search is concerned with finding all subgraphs that are isomorphic to a relevant query graph; the existence of such subgraphs can reflect on the characteristics of the modeled system. The most computationally expensive step in the search for isomorphic subgraphs is the backtracking algorithm that traverses the nodes of the target graph. In this paper, we propose a pruning approach that is inspired by the minimum remaining value heuristic that achieves greater scalability over large query and target graphs. Our testing on various biological networks shows that performance enhancement of our approach over existing state-of-the-art approaches varies between 6x and 53x. © 2014 IEEE.
  • An automated framework for NMR resonance assignment through simultaneous slice picking and spin system forming

    Abbas, Ahmed; Guo, Xianrong; Jing, Bingyi; Gao, Xin (Springer Science + Business Media, 2014-04-19)
    Despite significant advances in automated nuclear magnetic resonance-based protein structure determination, the high numbers of false positives and false negatives among the peaks selected by fully automated methods remain a problem. These false positives and negatives impair the performance of resonance assignment methods. One of the main reasons for this problem is that the computational research community often considers peak picking and resonance assignment to be two separate problems, whereas spectroscopists use expert knowledge to pick peaks and assign their resonances at the same time. We propose a novel framework that simultaneously conducts slice picking and spin system forming, an essential step in resonance assignment. Our framework then employs a genetic algorithm, directed by both connectivity information and amino acid typing information from the spin systems, to assign the spin systems to residues. The inputs to our framework can be as few as two commonly used spectra, i.e., CBCA(CO)NH and HNCACB. Different from the existing peak picking and resonance assignment methods that treat peaks as the units, our method is based on 'slices', which are one-dimensional vectors in three-dimensional spectra that correspond to certain (N, H) values. Experimental results on both benchmark simulated data sets and four real protein data sets demonstrate that our method significantly outperforms the state-of-the-art methods while using fewer spectra than those methods. Our method is freely available at http://sfb.kaust.edu.sa/Pages/Software.aspx. © 2014 Springer Science+Business Media.
  • Sparse structure regularized ranking

    Wang, Jim Jing-Yan; Sun, Yijun; Gao, Xin (Springer Nature, 2014-04-17)
    Learning ranking scores is critical for the multimedia database retrieval problem. In this paper, we propose a novel ranking score learning algorithm by exploring the sparse structure and using it to regularize ranking scores. To explore the sparse structure, we assume that each multimedia object could be represented as a sparse linear combination of all other objects, and combination coefficients are regarded as a similarity measure between objects and used to regularize their ranking scores. Moreover, we propose to learn the sparse combination coefficients and the ranking scores simultaneously. A unified objective function is constructed with regard to both the combination coefficients and the ranking scores, and is optimized by an iterative algorithm. Experiments on two multimedia database retrieval data sets demonstrate the significant improvements of the proposed algorithm over state-of-the-art ranking score learning algorithms.
  • Feature selection and multi-kernel learning for sparse representation on a manifold

    Wang, Jim Jing-Yan; Bensmail, Halima; Gao, Xin (Elsevier BV, 2014-03)
    Sparse representation has been widely studied as a part-based data representation method and applied in many scientific and engineering fields, such as bioinformatics and medical imaging. It seeks to represent a data sample as a sparse linear combination of some basic items in a dictionary. Gao et al. (2013) recently proposed Laplacian sparse coding by regularizing the sparse codes with an affinity graph. However, due to the noisy features and nonlinear distribution of the data samples, the affinity graph constructed directly from the original feature space is not necessarily a reliable reflection of the intrinsic manifold of the data samples. To overcome this problem, we integrate feature selection and multiple kernel learning into the sparse coding on the manifold. To this end, unified objectives are defined for feature selection, multiple kernel learning, sparse coding, and graph regularization. By optimizing the objective functions iteratively, we develop novel data representation algorithms with feature selection and multiple kernel learning respectively. Experimental results on two challenging tasks, N-linked glycosylation prediction and mammogram retrieval, demonstrate that the proposed algorithms outperform the traditional sparse coding methods. © 2013 Elsevier Ltd.
  • Beyond cross-domain learning: Multiple-domain nonnegative matrix factorization

    Wang, Jim Jing-Yan; Gao, Xin (Elsevier BV, 2014-02)
    Traditional cross-domain learning methods transfer learning from a source domain to a target domain. In this paper, we propose the multiple-domain learning problem for several equally treated domains. The multiple-domain learning problem assumes that samples from different domains have different distributions, but share the same feature and class label spaces. Each domain could be a target domain, while also being a source domain for other domains. A novel multiple-domain representation method is proposed for the multiple-domain learning problem. This method is based on nonnegative matrix factorization (NMF), and tries to learn a basis matrix and coding vectors for samples, so that the domain distribution mismatch among different domains will be reduced under an extended variation of the maximum mean discrepancy (MMD) criterion. The novel algorithm - multiple-domain NMF (MDNMF) - was evaluated on two challenging multiple-domain learning problems - multiple user spam email detection and multiple-domain glioma diagnosis. The effectiveness of the proposed algorithm is experimentally verified. © 2013 Elsevier Ltd. All rights reserved.
  • mir-300 promotes self-renewal and inhibits the differentiation of glioma stem-like cells

    Zhang, Daming; Yang, Guang; Chen, Xin; Li, Chunmei; Wang, Lu; Liu, Yaohua; Han, Dayong; Liu, Huailei; Hou, Xu; Zhang, Weiguang; Li, Chenguang; Han, Zhanqiang; Gao, Xin; Zhao, Shiguang (Springer Science + Business Media, 2014-01-28)
    MicroRNAs (miRNAs) are small noncoding RNAs that have been critically implicated in several human cancers. miRNAs are thought to participate in various biological processes, including proliferation, cell cycle, apoptosis, and even the regulation of the stemness properties of cancer stem cells. In this study, we explore the potential role of miR-300 in glioma stem-like cells (GSLCs). We isolated GSLCs from glioma biopsy specimens and identified the stemness properties of the cells through neurosphere formation assays, multilineage differentiation ability analysis, and immunofluorescence analysis of glioma stem cell markers. We found that miR-300 is commonly upregulated in glioma tissues, and the expression of miR-300 was higher in GSLCs. The results of functional experiments demonstrated that miR-300 can enhance the self-renewal of GSLCs and reduce differentiation toward both astrocyte and neural fates. In addition, LZTS2 is a direct target of miR-300. In conclusion, our results demonstrate the critical role of miR-300 in GSLCs and its functions in LZTS2 inhibition and describe a new approach for the molecular regulation of tumor stem cells. © 2014 Springer Science+Business Media.
  • MiR-196a exerts its oncogenic effect in glioblastoma multiforme by inhibition of IκBα both in vitro and in vivo

    Yang, Guang; Han, Dayong; Chen, Xin; Zhang, Daming; Wang, Lu; Shi, Chen; Zhang, Weiguang; Li, Chenguang; Chen, Xiaofeng; Liu, Huailei; Zhang, Dongzhi; Kang, Jianhao; Peng, Fei; Liu, Ziyi; Qi, Jiping; Gao, Xin; Ai, Jing; Shi, Changbin; Zhao, Shiguang (Oxford University Press (OUP), 2014-01-23)
    Background: Recent studies have revealed that miR-196a is upregulated in glioblastoma multiforme (GBM) and that it correlates with the clinical outcome of patients with GBM. However, its potential regulatory mechanisms in GBM have never been reported. Methods: We used quantitative real-time PCR to assess miR-196a expression levels in 132 GBM specimens in a single institution. Oncogenic capability of miR-196a was detected by apoptosis and proliferation assays in U87MG and T98G cells. Immunohistochemistry was used to determine the expression of IκBα in GBM tissues, and a luciferase reporter assay was carried out to confirm whether IκBα is a direct target of miR-196a. In vivo, xenograft tumors were examined for an antiglioma effect of miR-196a inhibitors. Results: We present for the first time evidence that miR-196a could directly interact with IκBα 3′-UTR to suppress IκBα expression and subsequently promote activation of NF-κB, consequently promoting proliferation of and suppressing apoptosis in GBM cells both in vitro and in vivo. Our study confirmed that miR-196a was upregulated in GBM specimens and that high levels of miR-196a were significantly correlated with poor outcome in a large cohort of GBM patients. Our data from human tumor xenografts in nude mice treated with miR-196 inhibitors demonstrated that inhibition of miR-196a could ameliorate tumor growth in vivo. Conclusions: MiR-196a exerts its oncogenic effect in GBM by inhibiting IκBα both in vitro and in vivo. Our findings provide new insights into the pathogenesis of GBM and indicate that miR-196a may predict clinical outcome of GBM patients and serve as a new therapeutic target for GBM. © The Author(s) 2014. Published by Oxford University Press on behalf of the Society for Neuro-Oncology. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
  • Adding Robustness to Support Vector Machines Against Adversarial Reverse Engineering

    Alabdulmohsin, Ibrahim; Gao, Xin; Zhang, Xiangliang (Association for Computing Machinery (ACM), 2014)
    Many classification algorithms have been successfully deployed in security-sensitive applications including spam filters and intrusion detection systems. Under such adversarial environments, adversaries can generate exploratory attacks against the defender such as evasion and reverse engineering. In this paper, we discuss why reverse engineering attacks can be carried out quite efficiently against fixed classifiers, and investigate the use of randomization as a suitable strategy for mitigating their risk. In particular, we derive a semidefinite programming (SDP) formulation for learning a distribution of classifiers subject to the constraint that any single classifier picked at random from such distribution provides reliable predictions with a high probability. We analyze the tradeoff between variance of the distribution and its predictive accuracy, and establish that one can almost always incorporate randomization with large variance without incurring a loss in accuracy. In other words, the conventional approach of using a fixed classifier in adversarial environments is generally Pareto suboptimal. Finally, we validate such conclusions on both synthetic and real-world classification problems. Copyright 2014 ACM.
  • Joint learning and weighting of visual vocabulary for bag-of-feature based tissue classification

    Wang, Jim Jing-Yan; Bensmail, Halima; Gao, Xin (Elsevier BV, 2013-12)
    Automated classification of tissue types of Region of Interest (ROI) in medical images has been an important application in Computer-Aided Diagnosis (CAD). Recently, bag-of-feature methods which treat each ROI as a set of local features have shown their power in this field. Two important issues of bag-of-feature strategy for tissue classification are investigated in this paper: the visual vocabulary learning and weighting, which are always considered independently in traditional methods by neglecting the inner relationship between the visual words and their weights. To overcome this problem, we develop a novel algorithm, Joint-ViVo, which learns the vocabulary and visual word weights jointly. A unified objective function based on large margin is defined for learning of both visual vocabulary and visual word weights, and optimized alternately in the iterative algorithm. We test our algorithm on three tissue classification tasks: classifying breast tissue density in mammograms, classifying lung tissue in High-Resolution Computed Tomography (HRCT) images, and identifying brain tissue type in Magnetic Resonance Imaging (MRI). The results show that Joint-ViVo outperforms the state-of-the-art methods on tissue classification problems. © 2013 Elsevier Ltd.
  • Multiple graph regularized nonnegative matrix factorization

    Wang, Jim Jing-Yan; Bensmail, Halima; Gao, Xin (Elsevier BV, 2013-10)
    Non-negative matrix factorization (NMF) has been widely used as a data representation method based on components. To overcome the disadvantage of NMF in failing to consider the manifold structure of a data set, graph regularized NMF (GrNMF) has been proposed by Cai et al. by constructing an affinity graph and searching for a matrix factorization that respects graph structure. Selecting a graph model and its corresponding parameters is critical for this strategy. This process is usually carried out by cross-validation or discrete grid search, which are time consuming and prone to overfitting. In this paper, we propose a GrNMF, called MultiGrNMF, in which the intrinsic manifold is approximated by a linear combination of several graphs with different models and parameters inspired by ensemble manifold regularization. Factorization metrics and linear combination coefficients of graphs are determined simultaneously within a unified objective function. They are alternately optimized in an iterative algorithm, thus resulting in a novel data representation algorithm. Extensive experiments on a protein subcellular localization task and an Alzheimer's disease diagnosis task demonstrate the effectiveness of the proposed algorithm. © 2013 Elsevier Ltd. All rights reserved.
  • Accurate prediction of hot spot residues through physicochemical characteristics of amino acid sequences

    Chen, Peng; Li, Jinyan; Limsoon, Wong; Kuwahara, Hiroyuki; Huang, Jianhua Z.; Gao, Xin (Wiley-Blackwell, 2013-07-23)
    Hot spot residues of proteins are fundamental interface residues that help proteins perform their functions. Detecting hot spots by experimental methods is costly and time-consuming. Sequential and structural information has been widely used in the computational prediction of hot spots. However, structural information is not always available. In this article, we investigated the problem of identifying hot spots using only physicochemical characteristics extracted from amino acid sequences. We first extracted 132 relatively independent physicochemical features from a set of the 544 properties in AAindex1, an amino acid index database. Each feature was utilized to train a classification model with a novel encoding schema for hot spot prediction by the IBk algorithm, an extension of the K-nearest neighbor algorithm. The combinations of the individual classifiers were explored and the classifiers that appeared frequently in the top performing combinations were selected. The hot spot predictor was built based on an ensemble of these classifiers and to work in a voting manner. Experimental results demonstrated that our method effectively exploited the feature space and allowed flexible weights of features for different queries. On the commonly used hot spot benchmark sets, our method significantly outperformed other machine learning algorithms and state-of-the-art hot spot predictors. The program is available at http://sfb.kaust.edu.sa/pages/software.aspx. © 2013 Wiley Periodicals, Inc.
  • Assessing protein conformational sampling methods based on bivariate lag-distributions of backbone angles

    Maadooliat, Mehdi; Gao, Xin; Huang, Jianhua Z. (Oxford University Press (OUP), 2012-08-27)
    Despite considerable progress in the past decades, protein structure prediction remains one of the major unsolved problems in computational biology. Angular-sampling-based methods have been extensively studied recently due to their ability to capture the continuous conformational space of protein structures. The literature has focused on using a variety of parametric models of the sequential dependencies between angle pairs along the protein chains. In this article, we present a thorough review of angular-sampling-based methods by assessing three main questions: What is the best distribution type to model the protein angles? What is a reasonable number of components in a mixture model that should be considered to accurately parameterize the joint distribution of the angles? and What is the order of the local sequence-structure dependency that should be considered by a prediction method? We assess the model fits for different methods using bivariate lag-distributions of the dihedral/planar angles. Moreover, the main information across the lags can be extracted using a technique called Lag singular value decomposition (LagSVD), which considers the joint distribution of the dihedral/planar angles over different lags using a nonparametric approach and monitors the behavior of the lag-distribution of the angles using singular value decomposition. As a result, we developed graphical tools and numerical measurements to compare and evaluate the performance of different model fits. Furthermore, we developed a web-tool (http://www.stat.tamu.edu/~madoliat/LagSVD) that can be used to produce informative animations. © The Author 2012. Published by Oxford University Press.
  • ProClusEnsem: Predicting membrane protein types by fusing different modes of pseudo amino acid composition

    Wang, Jim Jing-Yan; Li, Yongping; Wang, Quanquan; You, Xinge; Man, Jiaju; Wang, Chao; Gao, Xin (Elsevier BV, 2012-05)
    Knowing the type of an uncharacterized membrane protein often provides a useful clue in both basic research and drug discovery. With the explosion of protein sequences generated in the post genomic era, determination of membrane protein types by experimental methods is expensive and time consuming. It therefore becomes important to develop an automated method to find the possible types of membrane proteins. In view of this, various computational membrane protein prediction methods have been proposed. They extract protein feature vectors, such as PseAAC (pseudo amino acid composition) and PsePSSM (pseudo position-specific scoring matrix) for representation of protein sequence, and then learn a distance metric for the KNN (K nearest neighbor) or NN (nearest neighbor) classifier to predict the final type. Most of the metrics are learned using linear dimensionality reduction algorithms like Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). Such metrics are common to all the proteins in the dataset. In fact, they assume that the proteins lie on a uniform distribution, which can be captured by the linear dimensionality reduction algorithm. We doubt this assumption, and learn local metrics which are optimized for local subsets of the whole protein set. The learning procedure is iterated with the protein clustering. Then a novel ensemble distance metric is given by combining the local metrics through Tikhonov regularization. The experimental results on a benchmark dataset demonstrate the feasibility and effectiveness of the proposed algorithm named ProClusEnsem. © 2012 Elsevier Ltd.
  • Bag-of-features based medical image retrieval via multiple assignment and visual words weighting

    Wang, Jingyan; Li, Yongping; Zhang, Ying; Wang, Chao; Xie, Honglan; Chen, Guoling; Gao, Xin (Institute of Electrical and Electronics Engineers (IEEE), 2011-11)
    Bag-of-features based approaches have become prominent for image retrieval and image classification tasks in the past decade. Such methods represent an image as a collection of local features, such as image patches and key points with scale invariant feature transform (SIFT) descriptors. To improve the bag-of-features methods, we first model the assignments of local descriptors as contribution functions, and then propose a novel multiple assignment strategy. Assuming the local features can be reconstructed by their neighboring visual words in a vocabulary, reconstruction weights can be solved by quadratic programming. The weights are then used to build contribution functions, resulting in a novel assignment method, called quadratic programming (QP) assignment. We further propose a novel visual word weighting method. The discriminative power of each visual word is analyzed by the sub-similarity function in the bin that corresponds to the visual word. Each sub-similarity function is then treated as a weak classifier. A strong classifier is learned by boosting methods that combine those weak classifiers. The weighting factors of the visual words are learned accordingly. We evaluate the proposed methods on medical image retrieval tasks. The methods are tested on three well-known data sets, i.e., the ImageCLEFmed data set, the 304 CT Set, and the basal-cell carcinoma image set. Experimental results demonstrate that the proposed QP assignment outperforms the traditional nearest neighbor assignment, the multiple assignment, and the soft assignment, whereas the proposed boosting based weighting strategy outperforms the state-of-the-art weighting methods, such as the term frequency weights and the term frequency-inverse document frequency weights. © 2011 IEEE.
  • 3DSwap: Curated knowledgebase of proteins involved in 3D domain swapping

    Shameer, Khader; Shingate, Prashant N.; Manjunath, S. C. P.; Karthika, M.; Pugalenthi, Ganesan; Sowdhamini, Ramanathan (Oxford University Press (OUP), 2011-09-29)
    Three-dimensional domain swapping is a unique protein structural phenomenon where two or more protein chains in a protein oligomer share a common structural segment between individual chains. This phenomenon is observed in an array of protein structures in oligomeric conformation. Protein structures in swapped conformations perform diverse functional roles and are also associated with deposition diseases in humans. We have performed in-depth literature curation and structural bioinformatics analyses to develop an integrated knowledgebase of proteins involved in 3D domain swapping. The hallmark of 3D domain swapping is the presence of distinct structural segments such as the hinge and swapped regions. We have curated the literature to delineate the boundaries of these regions. In addition, we have defined several new concepts like 'secondary major interface' to represent the interface properties arising as a result of 3D domain swapping, and a new quantitative measure for the 'extent of swapping' in structures. The catalog of proteins reported in 3DSwap knowledgebase has been generated using an integrated structural bioinformatics workflow of database searches, literature curation, structure visualization and sequence-structure-function analyses. The current version of the 3DSwap knowledgebase reports 293 protein structures; the analysis of such a compendium of protein structures will further the understanding of the molecular factors driving 3D domain swapping. © The Author(s) 2011.
  • BLProt: Prediction of bioluminescent proteins based on support vector machine and ReliefF feature selection

    Kandaswamy, Krishna Kumar; Pugalenthi, Ganesan; Hazrati, Mehrnaz Khodam; Kalies, Kai-Uwe; Martinetz, Thomas (Springer Nature, 2011-08-17)
    Background: Bioluminescence is a process in which light is emitted by a living organism. Most creatures that emit light are sea creatures, but some insects, plants, fungi, etc. also emit light. The biotechnological application of bioluminescence has become routine and is considered essential for many medical and general technological advances. Identification of bioluminescent proteins is more challenging due to their poor similarity in sequence. So far, no specific method has been reported to identify bioluminescent proteins from primary sequence. Results: In this paper, we propose a novel predictive method that uses a Support Vector Machine (SVM) and physicochemical properties to predict bioluminescent proteins. BLProt was trained using a dataset consisting of 300 bioluminescent proteins and 300 non-bioluminescent proteins, and evaluated by an independent set of 141 bioluminescent proteins and 18202 non-bioluminescent proteins. To identify the most prominent features, we carried out feature selection with three different filter approaches, ReliefF, infogain, and mRMR. We selected five different feature subsets by decreasing the number of features, and the performance of each feature subset was evaluated. Conclusion: BLProt achieves 80% accuracy from training (5 fold cross-validations) and 80.06% accuracy from testing. The performance of BLProt was compared with BLAST and HMM. High prediction accuracy and successful prediction of hypothetical proteins suggest that BLProt can be a useful approach to identify bioluminescent proteins from sequence information, irrespective of their sequence similarity. © 2011 Kandaswamy et al.; licensee BioMed Central Ltd.
  • Towards fully automated structure-based NMR resonance assignment of 15N-labeled proteins from automatically picked peaks

    Jang, Richard; Gao, Xin; Li, Ming (Mary Ann Liebert Inc, 2011-03)
    In NMR resonance assignment, an indispensable step in NMR protein studies, manually processed peaks from both 15N-labeled and 13C-labeled spectra are typically used as inputs. However, the use of homologous structures can allow one to use only 15N-labeled NMR data and avoid the added expense of using 13C-labeled data. We propose a novel integer programming framework for structure-based backbone resonance assignment using 15N-labeled data. The core consists of a pair of integer programming models: one for spin system forming and amino acid typing, and the other for backbone resonance assignment. The goal is to perform the assignment directly from spectra without any manual intervention via automatically picked peaks, which are much noisier than manually picked peaks, so methods must be error-tolerant. In the case of semi-automated/manually processed peak data, we compare our system with the Xiong-Pandurangan-Bailey-Kellogg contact replacement (CR) method, which is the most error-tolerant method for structure-based resonance assignment. Our system, on average, reduces the error rate of the CR method fivefold on their data set. In addition, by using an iterative algorithm, our system has the added capability of using the NOESY data to correct assignment errors due to errors in predicting the amino acid and secondary structure type of each spin system. On a publicly available data set for human ubiquitin, where the typing accuracy is 83%, we achieve 91% accuracy, compared to the 59% accuracy obtained without correcting for such errors. In the case of automatically picked peaks, using assignment information from yeast ubiquitin, we achieve a fully automatic assignment with 97% accuracy. To our knowledge, this is the first system that can achieve fully automatic structure-based assignment directly from spectra. This has implications in NMR protein mutant studies, where the assignment step is repeated for each mutant. © Copyright 2011, Mary Ann Liebert, Inc.
  • Combining ambiguous chemical shift mapping with structure-based backbone and NOE assignment from 15N-NOESY

    Jang, Richard; Gao, Xin; Li, Ming (Association for Computing Machinery (ACM), 2011)
    Chemical shift mapping is an important technique in NMR-based drug screening for identifying the atoms of a target protein that potentially bind to a drug molecule upon the molecule's introduction in increasing concentrations. The goal is to obtain a mapping of peaks with known residue assignment from the reference spectrum of the unbound protein to peaks with unknown assignment in the target spectrum of the bound protein. Although a series of perturbed spectra help to trace a path from reference peaks to target peaks, a one-to-one mapping generally is not possible, especially for large proteins, due to errors, such as noise peaks, missing peaks, missing but then reappearing, overlapped, and new peaks not associated with any peaks in the reference. Due to these difficulties, the mapping is typically done manually or semi-automatically. However, automated methods are necessary for high-throughput drug screening. We present PeakWalker, a novel peak walking algorithm for fast-exchange systems that models the errors explicitly and performs many-to-one mapping. On the proteins: hBclXL, UbcH5B, and histone H1, it achieves an average accuracy of over 95% with less than 1.5 residues predicted per target peak. Given these mappings as input, we present PeakAssigner, a novel combined structure-based backbone resonance and NOE assignment algorithm that uses just 15N-NOESY, while avoiding TOCSY experiments and 13C-labeling, to resolve the ambiguities for a one-to-one mapping. On the three proteins, it achieves an average accuracy of 94% or better. Copyright © 2011 ACM.
