Read length and repeat resolution: Exploring prokaryote genomes using next-generation sequencing technologies
KAUST Department: Biological and Environmental Sciences and Engineering (BESE) Division
Computational Bioscience Research Center (CBRC)
Permanent link to this record: http://hdl.handle.net/10754/325284
Abstract
Background: There are a growing number of next-generation sequencing technologies. At present, the most cost-effective options also produce the shortest reads. However, even for prokaryotes, there is uncertainty concerning the utility of these technologies for the de novo assembly of complete genomes. This reflects an expectation that short reads will be unable to resolve small, but presumably abundant, repeats.
Methodology/Principal Findings: Using a simple model of repeat assembly, we develop and test a technique that, for any read length, can estimate the occurrence of unresolvable repeats in a genome, and thus predict the number of gaps that would need to be closed to produce a complete sequence. We apply this technique to 818 prokaryote genome sequences. This provides a quantitative assessment of the relative performance of various lengths. Notably, unpaired reads of only 150 nt can reconstruct approximately 50% of the analysed genomes with fewer than 96 repeat-induced gaps. Nonetheless, there is considerable variation amongst prokaryotes. Some genomes can be assembled to near contiguity using very short reads while others require much longer reads.
Conclusions: Given the diversity of prokaryote genomes, a sequencing strategy should be tailored to the organism under study. Our results will provide researchers with a practical resource to guide the selection of the appropriate read length.
© 2010 Cahill et al.
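The abstract does not spell out the repeat model, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes that an exact repeat is unresolvable when no read can span it with at least one flanking base on each side, that is, when the repeat is at least `read_length - 1` bases long. The function name `count_unresolvable_repeat_regions` is hypothetical, and the sketch ignores reverse complements, circular chromosomes, and inexact repeats.

```python
from collections import Counter


def count_unresolvable_repeat_regions(genome: str, read_length: int) -> int:
    """Count maximal runs of positions whose k-mer (k = read_length - 1)
    occurs more than once in the genome.

    Assumption (not taken from the paper): each such run marks one copy of
    an exact repeat that reads of this length cannot span.
    """
    k = read_length - 1
    if k < 1 or k > len(genome):
        return 0

    n_kmers = len(genome) - k + 1
    # Tally every k-mer on the forward strand only.
    counts = Counter(genome[i:i + k] for i in range(n_kmers))

    regions = 0
    in_region = False
    for i in range(n_kmers):
        repeated = counts[genome[i:i + k]] > 1
        if repeated and not in_region:
            regions += 1  # a new repeat copy starts here
        in_region = repeated
    return regions


# Toy sequence: two copies of the 7-base repeat GATTACA in unique flanks.
toy = "CCGT" + "GATTACA" + "TTGC" + "GATTACA" + "AGGC"
print(count_unresolvable_repeat_regions(toy, 5))  # 2: both copies are unresolved
print(count_unresolvable_repeat_regions(toy, 9))  # 0: 9-base reads span the repeat
```

Under this simplified model the count falls as read length grows, which mirrors the qualitative trade-off described in the abstract; translating repeat copies into a predicted number of assembly gaps would need the model from the full text.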
Citation: Cahill MJ, Köser CU, Ross NE, Archer JAC (2010) Read Length and Repeat Resolution: Exploring Prokaryote Genomes Using Next-Generation Sequencing Technologies. PLoS ONE 5: e11518. doi:10.1371/journal.pone.0011518.
Publisher: Public Library of Science (PLoS)
PubMed Central ID: PMC2902515
- SeqEntropy: genome-wide assessment of repeats for short read sequencing.
- Authors: Chu HT, Hsiao WW, Tsao TT, Hsu DF, Chen CC, Lee SA, Kao CY
- Issue date: 2013
- Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de novo assembler.
- Authors: Zerbino DR, McEwen GK, Margulies EH, Birney E
- Issue date: 2009 Dec 22
- Efficient frequency-based de novo short-read clustering for error trimming in next-generation sequencing.
- Authors: Qu W, Hashimoto S, Morishita S
- Issue date: 2009 Jul
- De novo sequencing of plant genomes using second-generation technologies.
- Authors: Imelfort M, Edwards D
- Issue date: 2009 Nov
- 6-10× pyrosequencing is a practical approach for whole prokaryote genome studies.
- Authors: Li J, Jiang J, Leung FC
- Issue date: 2012 Feb 15
Showing items related by title, author, creator and subject.
Data for: Poly(A) Dataset for PAS sequences and pseudo-PAS sequences Classification (fasta format)
Albalawi, Fahad; Chahid, Abderrazak; Guo, Xingang; Albaradei, Somayah; Magana-Mora, Arturo; Jankovic, Boris R.; Uludag, Mahmut; Van Neste, Christophe; Essack, Magbubah; Laleg-Kirati, Taous-Meriem; Bajic, Vladimir B. (2018-11-15) [Dataset]
This dataset contains DNA sequences of the human genome hg38 from the GENCODE folder at the EBI FTP server (ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/GRCh38.primary_assembly.genome.fa.gz).
A - Positive set (PAS sequences): Using the GENCODE annotation for poly(A) (ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.polyAs.gff3.gz), we selected the poly(A) signal annotation. Using the bedtools-slop option, we found regions extended 300 bp upstream and 300 bp downstream of the poly(A) hexamer. With the bedtools-getfasta option, we extracted 606 bp fasta sequences from these regions. After eliminating duplicates, we obtained 37’516 presumed true functional poly(A) signal (PAS) sequences. Sequences from this set will be denoted as positive.
B - Negative set (pseudo-PAS sequences): For the negative set, we looked for regions outside the region covering 1’000 bp upstream and downstream of the positive poly(A) hexamer signal using bedtools-complement. The Homer tool was used to find matches for the 12 most frequent human poly(A) variants. Since the number of matches was huge, sampling was used to select 37’516 pseudo-PAS sequences. Sampling was done from each chromosome proportionally to the lengths of the chromosomes and also to the expected frequency of the poly(A) variants. Out of these predictions, for each PAS hexamer, we selected the same number of pseudo-PAS sequences as in the positive set.
Training and testing sets: We randomly selected 20% of the sequences from each of the positive and negative datasets as independent test data. The testing set thus consisted of 15’020 sequences. The remaining data represented the training set, which consisted of 60’012 sequences. Both datasets are balanced relative to the true PAS and pseudo-PAS sequences.
Viral metagenomics: Analysis of begomoviruses by illumina high-throughput sequencing
Idris, Ali; Al-Saleh, Mohammed; Piatek, Marek J.; Al-Shahwan, Ibrahim; Ali, Shahjahan; Brown, Judith K. (Viruses, MDPI AG, 2014-03-12) [Article]
Traditional DNA sequencing methods are inefficient, lack the ability to discern the least abundant viral sequences, and are ineffective for determining the extent of variability in viral populations. Here, populations of single-stranded DNA plant begomoviral genomes and their associated beta- and alpha-satellite molecules (virus-satellite complexes) (genus, Begomovirus; family, Geminiviridae) were enriched from total nucleic acids isolated from symptomatic, field-infected plants, using rolling circle amplification (RCA). Enriched virus-satellite complexes were subjected to Illumina-Next Generation Sequencing (NGS). CASAVA and SeqMan NGen programs were implemented, respectively, for quality control and for de novo and reference-guided contig assembly of viral-satellite sequences. The authenticity of the begomoviral sequences, and the reproducibility of the Illumina-NGS approach for begomoviral deep sequencing projects, were validated by comparing NGS results with those obtained using traditional molecular cloning and Sanger sequencing of viral components and satellite DNAs, also enriched by RCA or amplified by polymerase chain reaction. As the use of NGS approaches, together with advances in software development, makes deep sequence coverage possible at a lower cost, the approach described herein will streamline the exploration of begomovirus diversity and population structure from naturally infected plants, irrespective of viral abundance. This is the first report of the implementation of Illumina-NGS to explore the diversity and identify begomoviral-satellite SNPs directly from plants naturally infected with begomoviruses under field conditions.
© 2014 by the authors; licensee MDPI, Basel, Switzerland.