Zero in on Key Open Problems in Automated NMR Protein Structure Determination
Permanent link to this record: http://hdl.handle.net/10754/582299
Abstract

Nuclear magnetic resonance (NMR) is one of the main approaches for protein structure determination. The biggest advantage of this approach is that it can determine the three-dimensional structure of the protein in the solution phase. Thus, the natural dynamics of the protein can be studied. However, NMR protein structure determination is an expertise-intensive and time-consuming process. If the structure determination process can be accelerated or even automated by computational methods, that will significantly advance the structural biology field. Our goal in this dissertation is to propose highly efficient and error-tolerant methods that work well on real, noisy NMR data sets.

Our first contribution in this dissertation is the development of a novel peak picking method (WaVPeak). First, WaVPeak denoises the NMR spectra using wavelet smoothing. A brute-force method is then used to identify all the candidate peaks. After that, the volume of each candidate peak is estimated. Finally, the peaks are sorted according to their volumes. WaVPeak is tested on the same benchmark data set that was used to test the state-of-the-art method, PICKY. WaVPeak shows significantly better performance than PICKY in terms of recall and precision.

Our second contribution is to propose an automatic method to select peaks produced by peak picking methods. This automatic method is used to overcome the limitations of fixed number-based methods. Our method is based on the Benjamini-Hochberg (B-H) algorithm. The method is used with both WaVPeak and PICKY to automatically select the number of peaks to return from hundreds of candidate peaks. The volume (in WaVPeak) and the intensity (in PICKY) are converted into p-values. Peaks whose p-values fall below a given threshold are selected. Experimental results show that the new method is better than the fixed number-based method in terms of recall.
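The selection step can be illustrated with the standard Benjamini-Hochberg step-up procedure. This is a minimal sketch, not the dissertation's implementation: the abstract does not say how volumes or intensities are mapped to p-values, so the sketch starts from p-values that are assumed to be already computed. The function name `benjamini_hochberg` and the default `alpha` are hypothetical choices made for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of candidates kept by the B-H step-up procedure.

    p_values: one p-value per candidate peak (how they are derived from
              peak volume or intensity is NOT specified here; assumed given).
    alpha:    target false discovery rate (hypothetical default).
    """
    m = len(p_values)
    if m == 0:
        return []

    # Rank candidates from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= alpha * k / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank

    # Keep every candidate ranked at or below the cutoff, even if some
    # of them individually missed their own per-rank threshold.
    return sorted(order[:cutoff_rank])


if __name__ == "__main__":
    candidate_p = [0.039, 0.001, 0.2, 0.008]
    print(benjamini_hochberg(candidate_p))
```

The number of returned peaks is driven by the data rather than fixed in advance, which is the property the fixed number-based approach lacks.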
To improve precision, we eliminate false peaks by taking the consensus of the B-H selected peaks from both PICKY and WaVPeak. On average, the consensus method is able to identify more than 88% of the expected true peaks, whereas less than 17% of the selected peaks are false ones.

Our third contribution is to propose, for the first time, the 3D extension of the Median-Modified-Wiener-Filter (MMWF) and its novel variation, named MMWF*. These spatial filters have only one parameter to tune: the window size. Unlike wavelet denoising, the higher-dimensional extension of the newly proposed filters is relatively easy. Thus, they can be applied to denoise multi-dimensional NMR spectra. We tested the proposed filters and the Wiener filter, an adaptive variant of the mean filter, on a benchmark set that contains 16 two-dimensional and three-dimensional NMR spectra extracted from eight proteins. Our results demonstrate that the adaptive spatial filters significantly outperform their non-adaptive versions. The performance of the new MMWF* on 2D/3D spectra is even better than wavelet denoising.

Finally, we propose a novel framework that simultaneously conducts slice picking and spin system forming, an essential step in resonance assignment. Our framework then employs a genetic algorithm, directed by both connectivity information and amino acid typing information from the spin systems, to assign the spin systems to residues. The inputs to our framework can be as few as two commonly used spectra, i.e., CBCA(CO)NH and HNCACB. Unlike existing peak picking and resonance assignment methods, which treat peaks as the units, our method is based on slices, which are one-dimensional vectors in three-dimensional spectra that correspond to certain (N, H) values.
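To make the notion of a slice concrete, the following sketch treats a 3D spectrum as a nested list indexed `[n][h][c]` and pulls out the one-dimensional vector at a fixed (N, H) grid point. The axis order, the function names, and the ranking rule (largest absolute intensity above a user-chosen `min_peak`) are all assumptions made for illustration; the abstract does not describe how the framework scores or picks slices.

```python
def extract_slice(spectrum, n_idx, h_idx):
    """Return the 1D vector along the third axis at a fixed (N, H) grid point.

    spectrum is assumed to be a nested list indexed as spectrum[n][h][c].
    """
    return list(spectrum[n_idx][h_idx])


def rank_slices(spectrum, min_peak=0.0):
    """List (n_idx, h_idx, strongest) for slices with signal, strongest first.

    A slice counts as having signal when its largest absolute intensity
    exceeds min_peak. This scoring rule is a placeholder, not the
    dissertation's slice picking criterion.
    """
    scored = []
    for n_idx, plane in enumerate(spectrum):
        for h_idx, vec in enumerate(plane):
            # Use the absolute value so negative peaks also count.
            strongest = max(abs(v) for v in vec)
            if strongest > min_peak:
                scored.append((n_idx, h_idx, strongest))
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored


if __name__ == "__main__":
    toy_spectrum = [
        [[0, 1, 0], [0, 0, 0]],
        [[0, -5, 2], [0.5, 0, 0]],
    ]
    print(extract_slice(toy_spectrum, 1, 0))
    print(rank_slices(toy_spectrum, min_peak=0.6))
```

Working on whole slices keeps all intensities at one (N, H) position together, so weak signals are not discarded before spin systems are formed.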
Experimental results on both benchmark simulated data sets and four real protein data sets demonstrate that our method significantly outperforms the state-of-the-art methods, especially on the more challenging real protein data sets, while using fewer spectra than those methods. Furthermore, we show that the chemical shift assignments predicted by our method for the four real proteins lead to accurate calculation of their final three-dimensional structures using the CS-ROSETTA server.