Efficient Disk-Based Techniques for Manipulating Very Large String Databases

Handle URI:
http://hdl.handle.net/10754/623691
Title:
Efficient Disk-Based Techniques for Manipulating Very Large String Databases
Authors:
Allam, Amin ( 0000-0001-5137-0990 )
Abstract:
Indexing and processing strings are very important topics in database management. Strings can be database records, DNA sequences, protein sequences, or plain text. Various string operations are required in several application categories, such as bioinformatics and entity resolution. When the number of strings or their sizes become very large, several state-of-the-art techniques for indexing and processing such strings may fail or behave very inefficiently. Modifying an existing technique to overcome these issues is usually not straightforward and may not even be possible. One category of string operations can be facilitated by the suffix tree data structure, which indexes a long string so that any of its substrings can be found efficiently, and which can be used in other operations as well, such as approximate string matching. In this document, we introduce a novel efficient method to construct the suffix tree index for very long strings using parallel architectures, which is a major challenge in this category. Another category of string operations requires clustering similar strings in order to perform application-specific processing on the resulting, possibly overlapping, clusters. In this document, based on clustering similar strings, we introduce a novel efficient technique for record linkage and entity resolution, and a novel method for correcting errors in a large number of small strings (read sequences) generated by DNA sequencing machines.
Advisors:
Kalnis, Panos ( 0000-0002-5060-1360 )
Committee Member:
Gao, Xin ( 0000-0002-7108-3574 ) ; Moshkov, Mikhail ( 0000-0003-0085-9483 ) ; Mokbel, Mohamed
KAUST Department:
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Program:
Computer Science
Issue Date:
18-May-2017
Type:
Dissertation
Appears in Collections:
Dissertations

Full metadata record

DC Field | Value | Language
dc.contributor.advisor | Kalnis, Panos | en
dc.contributor.author | Allam, Amin | en
dc.date.accessioned | 2017-05-22T08:41:24Z | -
dc.date.available | 2017-05-22T08:41:24Z | -
dc.date.issued | 2017-05-18 | -
dc.identifier.uri | http://hdl.handle.net/10754/623691 | -
dc.description.abstract | Indexing and processing strings are very important topics in database management. Strings can be database records, DNA sequences, protein sequences, or plain text. Various string operations are required in several application categories, such as bioinformatics and entity resolution. When the number of strings or their sizes become very large, several state-of-the-art techniques for indexing and processing such strings may fail or behave very inefficiently. Modifying an existing technique to overcome these issues is usually not straightforward and may not even be possible. One category of string operations can be facilitated by the suffix tree data structure, which indexes a long string so that any of its substrings can be found efficiently, and which can be used in other operations as well, such as approximate string matching. In this document, we introduce a novel efficient method to construct the suffix tree index for very long strings using parallel architectures, which is a major challenge in this category. Another category of string operations requires clustering similar strings in order to perform application-specific processing on the resulting, possibly overlapping, clusters. In this document, based on clustering similar strings, we introduce a novel efficient technique for record linkage and entity resolution, and a novel method for correcting errors in a large number of small strings (read sequences) generated by DNA sequencing machines. | en
dc.language.iso | en | en
dc.subject | large databases | en
dc.subject | string processing | en
dc.subject | disk-based | en
dc.subject | Suffix tree | en
dc.subject | record linkage | en
dc.subject | error correction | en
dc.title | Efficient Disk-Based Techniques for Manipulating Very Large String Databases | en
dc.type | Dissertation | en
dc.contributor.department | Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division | en
thesis.degree.grantor | King Abdullah University of Science and Technology | en_GB
dc.contributor.committeemember | Gao, Xin | en
dc.contributor.committeemember | Moshkov, Mikhail | en
dc.contributor.committeemember | Mokbel, Mohamed | en
thesis.degree.discipline | Computer Science | en
thesis.degree.name | Doctor of Philosophy | en
dc.person.id | 113324 | en