Towards a Database System for Large-scale Analytics on Strings

Handle URI:
http://hdl.handle.net/10754/575502
Title:
Towards a Database System for Large-scale Analytics on Strings
Authors:
Sahli, Majed A. ( 0000-0002-4576-9708 )
Abstract:
Recent technological advances are causing an explosion in the production of sequential data. Biological sequences, web logs and time series are represented as strings. Currently, strings are stored, managed and queried in an ad-hoc fashion because they lack a standardized data model and query language. String queries are computationally demanding, especially when strings are long and numerous. Existing approaches cannot handle the growing number of strings produced by environmental, healthcare, bioinformatic, and space applications. There is a trade- off between performing analytics efficiently and scaling to thousands of cores to finish in reasonable times. In this thesis, we introduce a data model that unifies the input and output representations of core string operations. We define a declarative query language for strings where operators can be pipelined to form complex queries. A rich set of core string operators is described to support string analytics. We then demonstrate a database system for string analytics based on our model and query language. In particular, we propose the use of a novel data structure augmented by efficient parallel computation to strike a balance between preprocessing overheads and query execution times. Next, we delve into repeated motifs extraction as a core string operation for large-scale string analytics. Motifs are frequent patterns used, for example, to identify biological functionality, periodic trends, or malicious activities. Statistical approaches are fast but inexact while combinatorial methods are sound but slow. We introduce ACME, a combinatorial repeated motifs extractor. We study the spatial and temporal locality of motif extraction and devise a cache-aware search space traversal technique. ACME is the only method that scales to gigabyte- long strings, handles large alphabets, and supports interesting motif types with minimal overhead. While ACME is cache-efficient, it is limited by being serial. We devise a lightweight parallel space traversal technique, called FAST, that enables ACME to scale to thousands of cores. High degree of concurrency is achieved by partition- ing the search space horizontally and balancing the workload among cores with minimal communication overhead. Consequently, complex queries are solved in minutes instead of days. ACME is a versatile system that runs on workstations, clusters, and supercomputers. It is the first to utilize a supercomputer and scale to 16 thousand CPUs. Merely using more cores does not guarantee efficiency, because of the related overheads. To this end, we introduce an automatic tuning mechanism that suggests the appropriate number of cores to meet user constraints in terms of runtime while minimizing the financial cost of cloud resources. Particularly, we study workload frequency distributions then build a model that finds the best problem decomposition and estimates serial and parallel runtimes. Finally, we generalize our automatic tuning method as a general method, called APlug. APlug can be used in other applications and we integrate it with systems for molecular docking and multiple sequence alignment.
Advisors:
Kalnis, Panos ( 0000-0002-5060-1360 )
Committee Member:
Moshkov, Mikhail ( 0000-0003-0085-9483 ) ; Wonka, Peter ( 0000-0003-0627-9746 ) ; Salama, Khaled N. ( 0000-0001-7742-1282 ) ; Palpanas, Themis
KAUST Department:
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Program:
Computer Science
Issue Date:
23-Jul-2015
Type:
Dissertation
Appears in Collections:
Dissertations; Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division

Full metadata record

DC FieldValue Language
dc.contributor.advisorKalnis, Panosen
dc.contributor.authorSahli, Majed A.en
dc.date.accessioned2015-08-23T10:04:49Zen
dc.date.available2015-08-23T10:04:49Zen
dc.date.issued2015-07-23en
dc.identifier.urihttp://hdl.handle.net/10754/575502en
dc.description.abstractRecent technological advances are causing an explosion in the production of sequential data. Biological sequences, web logs and time series are represented as strings. Currently, strings are stored, managed and queried in an ad-hoc fashion because they lack a standardized data model and query language. String queries are computationally demanding, especially when strings are long and numerous. Existing approaches cannot handle the growing number of strings produced by environmental, healthcare, bioinformatic, and space applications. There is a trade- off between performing analytics efficiently and scaling to thousands of cores to finish in reasonable times. In this thesis, we introduce a data model that unifies the input and output representations of core string operations. We define a declarative query language for strings where operators can be pipelined to form complex queries. A rich set of core string operators is described to support string analytics. We then demonstrate a database system for string analytics based on our model and query language. In particular, we propose the use of a novel data structure augmented by efficient parallel computation to strike a balance between preprocessing overheads and query execution times. Next, we delve into repeated motifs extraction as a core string operation for large-scale string analytics. Motifs are frequent patterns used, for example, to identify biological functionality, periodic trends, or malicious activities. Statistical approaches are fast but inexact while combinatorial methods are sound but slow. We introduce ACME, a combinatorial repeated motifs extractor. We study the spatial and temporal locality of motif extraction and devise a cache-aware search space traversal technique. ACME is the only method that scales to gigabyte- long strings, handles large alphabets, and supports interesting motif types with minimal overhead. While ACME is cache-efficient, it is limited by being serial. We devise a lightweight parallel space traversal technique, called FAST, that enables ACME to scale to thousands of cores. High degree of concurrency is achieved by partition- ing the search space horizontally and balancing the workload among cores with minimal communication overhead. Consequently, complex queries are solved in minutes instead of days. ACME is a versatile system that runs on workstations, clusters, and supercomputers. It is the first to utilize a supercomputer and scale to 16 thousand CPUs. Merely using more cores does not guarantee efficiency, because of the related overheads. To this end, we introduce an automatic tuning mechanism that suggests the appropriate number of cores to meet user constraints in terms of runtime while minimizing the financial cost of cloud resources. Particularly, we study workload frequency distributions then build a model that finds the best problem decomposition and estimates serial and parallel runtimes. Finally, we generalize our automatic tuning method as a general method, called APlug. APlug can be used in other applications and we integrate it with systems for molecular docking and multiple sequence alignment.en
dc.language.isoenen
dc.subjectData Managementen
dc.subjectParallele Processingen
dc.subjectDBMS for stringsen
dc.titleTowards a Database System for Large-scale Analytics on Stringsen
dc.typeDissertationen
dc.contributor.departmentComputer, Electrical and Mathematical Sciences and Engineering (CEMSE) Divisionen
thesis.degree.grantorKing Abdullah University of Science and Technologyen_GB
dc.contributor.committeememberMoshkov, Mikhailen
dc.contributor.committeememberWonka, Peteren
dc.contributor.committeememberSalama, Khaled N.en
dc.contributor.committeememberPalpanas, Themisen
thesis.degree.disciplineComputer Scienceen
thesis.degree.nameDoctor of Philosophyen
dc.person.id113627en
All Items in KAUST are protected by copyright, with all rights reserved, unless otherwise indicated.