Parallel motif extraction from very long sequences

Handle URI:
http://hdl.handle.net/10754/564651
Title:
Parallel motif extraction from very long sequences
Authors:
Sahli, Majed (0000-0002-4576-9708); Mansour, Essam; Kalnis, Panos (0000-0002-5060-1360)
Abstract:
Motifs are frequent patterns used to identify biological functionality in genomic sequences, periodicity in time series, or user trends in web logs. In contrast to a lot of existing work that focuses on collections of many short sequences, modern applications require mining of motifs in one very long sequence (i.e., in the order of several gigabytes). For this case, there exist statistical approaches that are fast but inaccurate; or combinatorial methods that are sound and complete. Unfortunately, existing combinatorial methods are serial and very slow. Consequently, they are limited to very short sequences (i.e., a few megabytes), small alphabets (typically 4 symbols for DNA sequences), and restricted types of motifs. This paper presents ACME, a combinatorial method for extracting motifs from a single very long sequence. ACME arranges the search space in contiguous blocks that take advantage of the cache hierarchy in modern architectures, and achieves almost an order of magnitude performance gain in serial execution. It also decomposes the search space in a smart way that allows scalability to thousands of processors with more than 90% speedup. ACME is the only method that: (i) scales to gigabyte-long sequences; (ii) handles large alphabets; (iii) supports interesting types of motifs with minimal additional cost; and (iv) is optimized for a variety of architectures such as multi-core systems, clusters in the cloud, and supercomputers. ACME reduces the extraction time for an exact-length query from 4 hours to 7 minutes on a typical workstation; handles 3 orders of magnitude longer sequences; and scales up to 16,384 cores on a supercomputer. Copyright is held by the owner/author(s).
KAUST Department:
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division; Computer Science Program
Publisher:
Association for Computing Machinery (ACM)
Journal:
Proceedings of the 22nd ACM International Conference on Information & Knowledge Management - CIKM '13
Conference/Event name:
22nd ACM International Conference on Information and Knowledge Management, CIKM 2013
Issue Date:
2013
DOI:
10.1145/2505515.2505575
Type:
Conference Paper
ISBN:
9781450322638
Appears in Collections:
Conference Papers; Computer Science Program; Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division

Full metadata record

DC Field | Value | Language
dc.contributor.author | Sahli, Majed | en
dc.contributor.author | Mansour, Essam | en
dc.contributor.author | Kalnis, Panos | en
dc.date.accessioned | 2015-08-04T07:10:58Z | en
dc.date.available | 2015-08-04T07:10:58Z | en
dc.date.issued | 2013 | en
dc.identifier.isbn | 9781450322638 | en
dc.identifier.doi | 10.1145/2505515.2505575 | en
dc.identifier.uri | http://hdl.handle.net/10754/564651 | en
dc.description.abstract | Motifs are frequent patterns used to identify biological functionality in genomic sequences, periodicity in time series, or user trends in web logs. In contrast to a lot of existing work that focuses on collections of many short sequences, modern applications require mining of motifs in one very long sequence (i.e., in the order of several gigabytes). For this case, there exist statistical approaches that are fast but inaccurate; or combinatorial methods that are sound and complete. Unfortunately, existing combinatorial methods are serial and very slow. Consequently, they are limited to very short sequences (i.e., a few megabytes), small alphabets (typically 4 symbols for DNA sequences), and restricted types of motifs. This paper presents ACME, a combinatorial method for extracting motifs from a single very long sequence. ACME arranges the search space in contiguous blocks that take advantage of the cache hierarchy in modern architectures, and achieves almost an order of magnitude performance gain in serial execution. It also decomposes the search space in a smart way that allows scalability to thousands of processors with more than 90% speedup. ACME is the only method that: (i) scales to gigabyte-long sequences; (ii) handles large alphabets; (iii) supports interesting types of motifs with minimal additional cost; and (iv) is optimized for a variety of architectures such as multi-core systems, clusters in the cloud, and supercomputers. ACME reduces the extraction time for an exact-length query from 4 hours to 7 minutes on a typical workstation; handles 3 orders of magnitude longer sequences; and scales up to 16,384 cores on a supercomputer. Copyright is held by the owner/author(s). | en
dc.publisher | Association for Computing Machinery (ACM) | en
dc.subject | Cache efficiency | en
dc.subject | In-memory | en
dc.subject | Motif | en
dc.subject | Parallel | en
dc.subject | Suffix tree | en
dc.title | Parallel motif extraction from very long sequences | en
dc.type | Conference Paper | en
dc.contributor.department | Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division | en
dc.contributor.department | Computer Science Program | en
dc.identifier.journal | Proceedings of the 22nd ACM International Conference on Information & Knowledge Management - CIKM '13 | en
dc.conference.date | 27 October 2013 through 1 November 2013 | en
dc.conference.name | 22nd ACM International Conference on Information and Knowledge Management, CIKM 2013 | en
dc.conference.location | San Francisco, CA | en
dc.contributor.institution | Qatar Computing Research Institute (QCRI), Doha, Qatar | en
kaust.author | Kalnis, Panos | en
kaust.author | Sahli, Majed | en