BlockPolish: accurate polishing of long-read assembly via block divide-and-conquer.
KAUST Department: Computer Science Program; Computational Bioscience Research Center (CBRC); Computer, Electrical and Mathematical Science and Engineering (CEMSE) Division
Embargo End Date: 2022-10-07
Permanent link to this record: http://hdl.handle.net/10754/672539
Abstract: Long-read sequencing technology has enabled significant progress in de novo genome assembly. However, the high error rate and wide error distribution of raw reads introduce a large number of errors into the assembly. Polishing is a procedure that fixes errors in the draft assembly and improves the reliability of genomic analysis. Existing methods treat all regions of the assembly equally, although the error distributions of these regions differ fundamentally, so achieving very high accuracy in genome assembly remains a challenging problem. Motivated by the uneven distribution of errors across regions of the assembly, we propose a novel polishing workflow named BlockPolish. In this method, we divide contigs into low-complexity and high-complexity blocks according to statistics of the aligned nucleotide bases. Multiple sequence alignment is applied to realign raw reads in complex blocks and optimize the alignment result. Because error rates are distributed differently in trivial and complex blocks, two multitask bidirectional long short-term memory (LSTM) networks are proposed to predict the consensus sequences. In whole-genome assemblies of NA12878 produced by Wtdbg2 and Flye from Nanopore data, BlockPolish achieves higher polishing accuracy than other state-of-the-art tools, including Racon, Medaka and MarginPolish & HELEN. In all assemblies, errors are predominantly indels, and BlockPolish performs well in correcting them. In addition to the Nanopore assemblies, we further demonstrate that BlockPolish can also reduce errors in PacBio assemblies. The source code of BlockPolish is freely available on GitHub (https://github.com/huangnengCSU/BlockPolish).
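The block-division idea in the abstract can be illustrated with a minimal sketch. This is not the BlockPolish implementation: the abstract does not say which alignment statistics or cutoffs are used, so the per-column agreement measure, the 0.8 threshold, and the function names `column_agreement` and `split_into_blocks` are all assumptions chosen for illustration. Each string stands for one pileup column, with `-` marking a gap.

```python
from collections import Counter
from typing import List, Tuple


def column_agreement(column: str) -> float:
    """Fraction of reads in one pileup column that support its most common symbol."""
    if not column:
        return 0.0
    counts = Counter(column)
    return counts.most_common(1)[0][1] / len(column)


def split_into_blocks(columns: List[str], threshold: float = 0.8) -> List[Tuple[int, int, str]]:
    """Group consecutive columns into half-open (start, end, label) blocks.

    A column is labelled "trivial" when its agreement reaches the threshold
    and "complex" otherwise. The threshold is a hypothetical value, not one
    taken from the paper.
    """
    blocks: List[Tuple[int, int, str]] = []
    for i, col in enumerate(columns):
        label = "trivial" if column_agreement(col) >= threshold else "complex"
        if blocks and blocks[-1][2] == label:
            # Extend the current block when the label does not change.
            start, _, _ = blocks[-1]
            blocks[-1] = (start, i + 1, label)
        else:
            blocks.append((i, i + 1, label))
    return blocks


# Toy pileup: five reads covering five reference positions.
toy_columns = ["AAAAA", "CCCCC", "GG-TG", "A-AT-", "TTTTT"]
toy_blocks = split_into_blocks(toy_columns)
```

In this toy input the two middle columns show disagreement and gaps, so they would form a single complex block flanked by trivial blocks. In the workflow described above, complex blocks are the ones whose reads are realigned by multiple sequence alignment before consensus prediction.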
Citation: OUP accepted manuscript. (2021). Briefings in Bioinformatics. doi:10.1093/bib/bbab405
Sponsors: This work was supported in part by the National Natural Science Foundation of China under grants (Nos. U1909208 and 61772557); 111 Project (No. B18059); Hunan Provincial Science and Technology Program (No. 2018wk4001 to J.W.); the US National Institute of Food and Agriculture (NIFA) under grant (2017-70016-26051 to F.L.) and the US National Science Foundation (NSF) under grant (ABI-1759856 to F.L.); the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. (FCC/1/1976-26-01, URF/1/3412-01-01, URF/1/4098-01-01, REI/1/4742-01-01 and REI/1/4473-01-01 to X.G.).
Publisher: Oxford University Press (OUP)
Journal: Briefings in Bioinformatics
- NeuralPolish: a novel Nanopore polishing method based on alignment matrix construction and orthogonal Bi-GRU Networks.
- Authors: Huang N, Nie F, Ni P, Luo F, Gao X, Wang J
- Issue date: 2021 May 11
- Polishing the Oxford Nanopore long-read assemblies of bacterial pathogens with Illumina short reads to improve genomic analyses.
- Authors: Chen Z, Erickson DL, Meng J
- Issue date: 2021 May
- Apollo: a sequencing-technology-independent, scalable and accurate assembly polishing algorithm.
- Authors: Firtina C, Kim JS, Alser M, Senol Cali D, Cicek AE, Alkan C, Mutlu O
- Issue date: 2020 Jun 1
- Evaluation of assembly methods combining long-reads and short-reads to obtain Paenibacillus sp. R4 high-quality complete genome.
- Authors: Shin SC, Choi W, Lee J, Kim HJ, Kim HW
- Issue date: 2020 Nov
- Assembly of chloroplast genomes with long- and short-read data: a comparison of approaches using Eucalyptus pauciflora as a test case.
- Authors: Wang W, Schalamun M, Morales-Suarez A, Kainer D, Schwessinger B, Lanfear R
- Issue date: 2018 Dec 29