Accelerating SPARQL queries by exploiting hash-based locality and adaptive partitioning

Handle URI:
http://hdl.handle.net/10754/621375
Title:
Accelerating SPARQL queries by exploiting hash-based locality and adaptive partitioning
Authors:
Al-Harbi, Razen (0000-0001-7298-5484); Abdelaziz, Ibrahim (0000-0003-1449-5115); Kalnis, Panos (0000-0002-5060-1360); Mamoulis, Nikos; Ebrahim, Yasser; Sahli, Majed (0000-0002-4576-9708)
Abstract:
State-of-the-art distributed RDF systems partition data across multiple computer nodes (workers). Some systems perform cheap hash partitioning, which may result in expensive query evaluation. Others try to minimize inter-node communication, which requires an expensive data preprocessing phase, leading to a high startup cost. A priori knowledge of the query workload has also been used to create partitions, which, however, are static and do not adapt to workload changes. In this paper, we propose AdPart, a distributed RDF system, which addresses the shortcomings of previous work. First, AdPart applies lightweight partitioning on the initial data, which distributes triples by hashing on their subjects; this renders its startup overhead low. At the same time, the locality-aware query optimizer of AdPart takes full advantage of the partitioning to (1) support the fully parallel processing of join patterns on subjects and (2) minimize data communication for general queries by applying hash distribution of intermediate results instead of broadcasting, wherever possible. Second, AdPart monitors the data access patterns and dynamically redistributes and replicates the instances of the most frequent ones among workers. As a result, the communication cost for future queries is drastically reduced or even eliminated. To control replication, AdPart implements an eviction policy for the redistributed patterns. Our experiments with synthetic and real data verify that AdPart: (1) starts faster than all existing systems; (2) processes thousands of queries before other systems come online; and (3) gracefully adapts to the query load, being able to evaluate queries on billion-scale RDF data in under a second.
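The subject-hashing idea in the abstract can be illustrated with a minimal sketch. This is not AdPart's implementation: the function names, the use of crc32 as the hash, the worker count, and the sample triples are all assumptions made for illustration. It only shows why a join whose patterns share the same subject variable can run on every worker independently, with no data exchanged between workers.

```python
from collections import defaultdict
from zlib import crc32


def worker_for(subject, num_workers):
    # crc32 is a stand-in for a deterministic hash (an assumption, not the paper's choice).
    return crc32(subject.encode("utf-8")) % num_workers


def partition_by_subject(triples, num_workers):
    # Every triple goes to the worker selected by hashing its subject,
    # so all triples sharing a subject land on the same worker.
    partitions = defaultdict(list)
    for s, p, o in triples:
        partitions[worker_for(s, num_workers)].append((s, p, o))
    return partitions


def local_star_join(partition, pred_a, pred_b):
    # Evaluates the pattern  ?s pred_a ?x . ?s pred_b ?y  (pred_a != pred_b)
    # using only the triples held by a single worker.
    by_subject = defaultdict(lambda: ([], []))
    for s, p, o in partition:
        if p == pred_a:
            by_subject[s][0].append(o)
        elif p == pred_b:
            by_subject[s][1].append(o)
    return [(s, x, y)
            for s, (xs, ys) in by_subject.items()
            for x in xs
            for y in ys]


# Hypothetical sample data.
triples = [
    ("alice", "worksFor", "kaust"),
    ("alice", "advisor", "bob"),
    ("bob", "worksFor", "kaust"),
    ("carol", "worksFor", "hku"),
    ("carol", "advisor", "dave"),
]
NUM_WORKERS = 3

partitions = partition_by_subject(triples, NUM_WORKERS)

# Each worker joins its own partition; the final answer is the plain union.
results = sorted(
    row
    for part in partitions.values()
    for row in local_star_join(part, "worksFor", "advisor")
)
```

Because both patterns bind the same subject, the union of the per-worker results equals the full answer whichever worker each subject hashes to. Joins on other positions (for example subject to object) do not have this property, which is where the abstract's hash distribution of intermediate results and adaptive redistribution come in.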
KAUST Department:
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Citation:
Harbi R, Abdelaziz I, Kalnis P, Mamoulis N, Ebrahim Y, et al. (2016) Accelerating SPARQL queries by exploiting hash-based locality and adaptive partitioning. The VLDB Journal 25: 355–380. Available: http://dx.doi.org/10.1007/s00778-016-0420-y.
Publisher:
Springer Nature
Journal:
The VLDB Journal
Issue Date:
8-Feb-2016
DOI:
10.1007/s00778-016-0420-y
Type:
Article
ISSN:
1066-8888; 0949-877X
Appears in Collections:
Articles

Full metadata record

DC Field | Value | Language
dc.contributor.author | Al-Harbi, Razen | en
dc.contributor.author | Abdelaziz, Ibrahim | en
dc.contributor.author | Kalnis, Panos | en
dc.contributor.author | Mamoulis, Nikos | en
dc.contributor.author | Ebrahim, Yasser | en
dc.contributor.author | Sahli, Majed | en
dc.date.accessioned | 2016-11-03T08:27:50Z | -
dc.date.available | 2016-11-03T08:27:50Z | -
dc.date.issued | 2016-02-08 | en
dc.identifier.citation | Harbi R, Abdelaziz I, Kalnis P, Mamoulis N, Ebrahim Y, et al. (2016) Accelerating SPARQL queries by exploiting hash-based locality and adaptive partitioning. The VLDB Journal 25: 355–380. Available: http://dx.doi.org/10.1007/s00778-016-0420-y. | en
dc.identifier.issn | 1066-8888 | en
dc.identifier.issn | 0949-877X | en
dc.identifier.doi | 10.1007/s00778-016-0420-y | en
dc.identifier.uri | http://hdl.handle.net/10754/621375 | -
dc.description.abstract | (same text as the Abstract field above) | en
dc.publisher | Springer Nature | en
dc.subject | Main memory engines | en
dc.subject | Parallel and distributed RDF systems | en
dc.subject | SPARQL query processing | en
dc.title | Accelerating SPARQL queries by exploiting hash-based locality and adaptive partitioning | en
dc.type | Article | en
dc.contributor.department | King Abdullah University of Science and Technology, Thuwal, Saudi Arabia | en
dc.identifier.journal | The VLDB Journal | en
dc.contributor.institution | University of Ioannina, Ioannina, Greece | en
dc.contributor.institution | Microsoft Corporation, Redmond, WA, United States | en
kaust.author | Al-Harbi, Razen | en
kaust.author | Abdelaziz, Ibrahim | en
kaust.author | Kalnis, Panos | en
kaust.author | Sahli, Majed | en