Searching and mapping genomic subsequences in nanopore raw signals through novel dynamic time warping algorithms
KAUST Department: Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Computer Science Program
Computational Bioscience Research Center (CBRC)
Permanent link to this record: http://hdl.handle.net/10754/630282
Abstract: Nanopore sequencing is a promising technology to generate ultra-long reads based on the direct measurement of electrical current signals when a DNA molecule passes through a nanopore. These ultra-long reads are critical for detecting large structural variations in the genome. However, it is challenging to use nanopore sequencing to identify single nucleotide polymorphisms (SNPs) or other modifications such as methylations, especially at a low sequencing coverage, due to the high error rate in the base-called reads. It is possible to correct base-calling errors through subsequence search, that is, by mapping a SNP-containing genomic region to the long nanopore raw signal sequences that contain this region and taking the consensus of these signals. Nevertheless, the ultra-long raw signals and an order of magnitude difference in the sampling speed between the two sequences make traditional algorithms infeasible for this problem. Here we propose two novel algorithms, the direct subsequence dynamic time warping for nanopore raw signal search (DSDTWnano) and the continuous wavelet subsequence dynamic time warping for nanopore raw signal search (cwSDTWnano), to enable direct subsequence searching and exact mapping in nanopore raw signals. The proposed algorithms are based on the idea of subsequence-extended dynamic time warping and operate directly on the raw signals, without any loss of information. DSDTWnano can ensure highly accurate query results, and cwSDTWnano is an accelerated version of DSDTWnano that uses seeding and multi-scale coarsening of signals based on the continuous wavelet transform. Furthermore, a novel error function is proposed to specify the mapping accuracy between a genomic sequence and an electrical current signal sequence, which may serve as the standard criterion for further genome-to-signal mapping studies.
Comprehensive experiments on three real-world nanopore datasets (human and lambda phage) demonstrate the efficiency and effectiveness of the proposed algorithms. Finally, we show the power of our algorithms in SNP detection under a low coverage (20x) on E. coli, with >95% detection rate. Our program is available at https://github.com/icthrm/cwSDTWnano.git.
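For readers unfamiliar with the underlying idea, the following is a minimal sketch of textbook subsequence dynamic time warping, not the authors' DSDTWnano or cwSDTWnano implementation. It assumes the genomic query has already been converted to an expected signal, uses the absolute difference as the local cost, and omits the seeding, wavelet-based coarsening, and error function described in the abstract. The function name `subsequence_dtw` and the toy values are hypothetical.

```python
import math


def subsequence_dtw(query, signal):
    """Find the stretch of `signal` that best matches `query` under DTW.

    Returns (cost, start, end), where start and end are inclusive
    0-based indices into `signal`. Illustrative sketch only.
    """
    n, m = len(query), len(signal)
    if n == 0 or m == 0:
        raise ValueError("query and signal must be non-empty")

    # cost[i][j]: best cost of aligning query[:i] to a stretch of
    # signal that ends at signal[j - 1].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    # Free start: the match may begin anywhere in the signal.
    cost[0] = [0.0] * (m + 1)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(query[i - 1] - signal[j - 1])  # assumed local cost
            cost[i][j] = local + min(
                cost[i - 1][j - 1],  # advance both sequences
                cost[i - 1][j],      # advance the query only
                cost[i][j - 1],      # advance the signal only
            )

    # Free end: take the cheapest cell in the last row.
    end = min(range(1, m + 1), key=lambda j: cost[n][j])

    # Backtrack to recover where the matched stretch begins.
    i, j = n, end
    start = j - 1
    while i > 0:
        start = j - 1
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates, key=lambda c: c[0])

    return cost[n][end], start, end - 1


# Toy example: the signal is sampled faster than the query,
# so one query value spans several signal samples.
query = [1.0, 5.0, 2.0]
signal = [9.0, 9.0, 1.0, 5.0, 5.0, 5.0, 2.0, 9.0, 9.0]
result = subsequence_dtw(query, signal)
```

The full table costs O(nm) time and memory, which is why a direct approach becomes expensive on ultra-long raw signals and why an accelerated variant is attractive.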
Citation: Han R, Wang S, Gao X (2018) Searching and mapping genomic subsequences in nanopore raw signals through novel dynamic time warping algorithms. Available: http://dx.doi.org/10.1101/491456.
Sponsors: The authors thank Minh Duc Cao, Lachlan J.M. Coin, Louise Roddam and Tania Duarte for providing the nanopore sequencing data. This work was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Awards No. FCC/1/1976-04, URF/1/2601-01, URF/1/3007-01, URF/1/3412-01, and URF/1/3450-01.
Publisher: Cold Spring Harbor Laboratory