Evaluating SPARQL queries on massive RDF datasets

Handle URI:
http://hdl.handle.net/10754/578875
Title:
Evaluating SPARQL queries on massive RDF datasets
Authors:
Al-Harbi, Razen (0000-0001-7298-5484); Abdelaziz, Ibrahim (0000-0003-1449-5115); Kalnis, Panos (0000-0002-5060-1360); Mamoulis, Nikos
Abstract:
Distributed RDF systems partition data across multiple computer nodes. Partitioning is typically based on heuristics that minimize inter-node communication and is performed in an initial data pre-processing phase. Therefore, the resulting partitions are static and do not adapt to changes in the query workload; as a result, existing systems are unable to consistently avoid communication for queries that are not favored by the initial data partitioning. Furthermore, for very large RDF knowledge bases, the partitioning phase becomes prohibitively expensive, leading to high startup costs. In this paper, we propose AdHash, a distributed RDF system which addresses the shortcomings of previous work. First, AdHash initially applies lightweight hash partitioning, which drastically minimizes the startup cost, while favoring the parallel processing of join patterns on subjects, without any data communication. Using a locality-aware planner, queries that cannot be processed in parallel are evaluated with minimal communication. Second, AdHash monitors the data access patterns and adapts dynamically to the query load by incrementally redistributing and replicating frequently accessed data. As a result, the communication cost for future queries is drastically reduced or even eliminated. Our experiments with synthetic and real data verify that AdHash (i) starts faster than all existing systems, (ii) processes thousands of queries before other systems become online, and (iii) gracefully adapts to the query load, being able to evaluate queries on billion-scale RDF data in sub-second time. In this demonstration, the audience can use a graphical interface of AdHash to verify its performance superiority over state-of-the-art distributed RDF systems.
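The abstract's first point, that hash partitioning on subjects lets subject-based joins run in parallel with no data exchange, can be illustrated with a small sketch. This is not AdHash's implementation: the function names (`worker_for`, `partition_by_subject`, `local_star_join`, `star_join`), the CRC32 hash, and the sample triples are all assumptions chosen for illustration. The idea shown is only that when every triple is placed by hashing its subject, all triples sharing a subject land on the same worker, so a join on a common subject variable can be answered per worker and the results simply concatenated.

```python
import itertools
import zlib
from collections import defaultdict


def worker_for(subject, num_workers):
    # Deterministic hash; Python's built-in hash() is salted per process.
    return zlib.crc32(subject.encode("utf-8")) % num_workers


def partition_by_subject(triples, num_workers):
    # Place each (subject, predicate, object) triple by hashing its subject.
    partitions = [[] for _ in range(num_workers)]
    for s, p, o in triples:
        partitions[worker_for(s, num_workers)].append((s, p, o))
    return partitions


def local_star_join(partition, predicates):
    # Evaluate "?x p1 ?o1 . ?x p2 ?o2 ..." using only one worker's triples.
    by_subject = defaultdict(lambda: defaultdict(list))
    for s, p, o in partition:
        by_subject[s][p].append(o)
    results = []
    for s, props in by_subject.items():
        if all(p in props for p in predicates):
            # One result row per combination of matching objects.
            for combo in itertools.product(*(props[p] for p in predicates)):
                results.append((s,) + combo)
    return results


def star_join(triples, predicates, num_workers):
    # Each partition is processed independently; no triples cross workers.
    results = []
    for partition in partition_by_subject(triples, num_workers):
        results.extend(local_star_join(partition, predicates))
    return sorted(results)


# Hypothetical sample data, not taken from the paper.
TRIPLES = [
    ("ex:alice", "ex:worksFor", "ex:uniA"),
    ("ex:alice", "ex:advisor", "ex:dave"),
    ("ex:bob", "ex:worksFor", "ex:uniA"),
    ("ex:carol", "ex:advisor", "ex:erin"),
    ("ex:carol", "ex:worksFor", "ex:uniB"),
]

if __name__ == "__main__":
    for row in star_join(TRIPLES, ["ex:worksFor", "ex:advisor"], num_workers=4):
        print(row)
```

In this sketch the answer is the same for any number of workers, because the join variable is the partitioning key. A join that connects the object of one pattern to the subject of another would generally need triples from different workers, which is the case the abstract says is handled by the locality-aware planner and by adaptive redistribution.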
KAUST Department:
Computer Science Program; Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Publisher:
VLDB Endowment
Journal:
Proceedings of the VLDB Endowment
Conference/Event name:
Proceedings of the 41st International Conference on Very Large Data Bases
Issue Date:
1-Aug-2015
DOI:
10.14778/2824032.2824083
Type:
Conference Paper
Citation:
Evaluating SPARQL queries on massive RDF datasets 2015, 8 (12):1848 Proceedings of the VLDB Endowment
ISSN:
2150-8097
Additional Links:
http://dl.acm.org/citation.cfm?doid=2824032.2824083
Appears in Collections:
Conference Papers; Computer Science Program; Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division

Full metadata record

DC Field | Value | Language
dc.contributor.author | Al-Harbi, Razen | en
dc.contributor.author | Abdelaziz, Ibrahim | en
dc.contributor.author | Kalnis, Panos | en
dc.contributor.author | Mamoulis, Nikos | en
dc.date.accessioned | 2015-09-29T10:23:21Z | en
dc.date.available | 2015-09-29T10:23:21Z | en
dc.date.issued | 2015-08-01 | en
dc.identifier.issn | Evaluating SPARQL queries on massive RDF datasets 2015, 8 (12):1848 Proceedings of the VLDB Endowment | en
dc.identifier.issn | 2150-8097 | en
dc.identifier.doi | 10.14778/2824032.2824083 | en
dc.identifier.uri | http://hdl.handle.net/10754/578875 | en
dc.description.abstract | Distributed RDF systems partition data across multiple computer nodes. Partitioning is typically based on heuristics that minimize inter-node communication and is performed in an initial data pre-processing phase. Therefore, the resulting partitions are static and do not adapt to changes in the query workload; as a result, existing systems are unable to consistently avoid communication for queries that are not favored by the initial data partitioning. Furthermore, for very large RDF knowledge bases, the partitioning phase becomes prohibitively expensive, leading to high startup costs. In this paper, we propose AdHash, a distributed RDF system which addresses the shortcomings of previous work. First, AdHash initially applies lightweight hash partitioning, which drastically minimizes the startup cost, while favoring the parallel processing of join patterns on subjects, without any data communication. Using a locality-aware planner, queries that cannot be processed in parallel are evaluated with minimal communication. Second, AdHash monitors the data access patterns and adapts dynamically to the query load by incrementally redistributing and replicating frequently accessed data. As a result, the communication cost for future queries is drastically reduced or even eliminated. Our experiments with synthetic and real data verify that AdHash (i) starts faster than all existing systems, (ii) processes thousands of queries before other systems become online, and (iii) gracefully adapts to the query load, being able to evaluate queries on billion-scale RDF data in sub-second time. In this demonstration, the audience can use a graphical interface of AdHash to verify its performance superiority over state-of-the-art distributed RDF systems. | en
dc.publisher | VLDB Endowment | en
dc.relation.url | http://dl.acm.org/citation.cfm?doid=2824032.2824083 | en
dc.rights | This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain permission prior to any use beyond those covered by the license. Contact copyright holder by emailing info@vldb.org. | en
dc.title | Evaluating SPARQL queries on massive RDF datasets | en
dc.type | Conference Paper | en
dc.contributor.department | Computer Science Program | en
dc.contributor.department | Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division | en
dc.identifier.journal | Proceedings of the VLDB Endowment | en
dc.conference.date | August 1, 2015 | en
dc.conference.name | Proceedings of the 41st International Conference on Very Large Data Bases | en
dc.conference.location | Kohala Coast, Hawaii | en
dc.eprint.version | Publisher's Version/PDF | en
dc.contributor.institution | University of Ioannina, Greece | en
kaust.author | Al-Harbi, Razen | en
kaust.author | Abdelaziz, Ibrahim | en
kaust.author | Kalnis, Panos | en