Embargo End Date2019-07-17
Permanent link to this recordhttp://hdl.handle.net/10754/628043
MetadataShow full item record
Access RestrictionsAt the time of archiving, the student author of this dissertation opted to temporarily restrict access to it. The full text of this dissertation became available to the public after the expiration of the embargo on 2019-07-17.
AbstractResource Description Framework (RDF) provides a simple way for expressing facts across the web, leading to Web linked data. Several distributed and federated RDF systems have emerged to handle the massive amounts of RDF data available nowadays. Distributed systems are optimized to query massive datasets that appear as a single graph, while federated systems are designed to query hundreds of decentralized and interlinked graphs. This thesis starts with a comprehensive experimental study of the state-of-the-art RDF systems. It identifies a set of research problems for improving the state-of-the-art, including: supporting the emerging RDF analytics required by many modern applications, querying linked data at scale, and enabling discovery on linked data. Addressing these problems is the focus of this thesis. First, we propose Spartex; a versatile framework for complex RDF analytics. Spartex extends SPARQL to seamlessly combine generic graph algorithms with SPARQL queries. Spartex implements a generic SPARQL operator as a vertex-centric program that interprets SPARQL queries and executes them efficiently using a built-in optimizer. We demonstrate that Spartex scales to datasets with billions of edges, and is at least as fast as the state-of-the-art specialized RDF engines. For analytical tasks, Spartex is an order of magnitude faster than existing alternatives. To address the scalability limitation of federated RDF engines, we propose Lusail; a scalable system for querying geo-distributed RDF graphs. Lusail follows a two-tier strategy: (i) locality-aware decomposition of the query into subqueries to maximize the computations at the endpoints and minimize intermediary results, and (ii) selectivity-aware execution to reduce network latency and increase parallelism. Our experiments on billions of triples show that Lusail outperforms existing systems by orders of magnitude in scalability and response time. Finally, enabling discovery on linked data is challenging due to the prior knowledge required to formulate SPARQL queries. To address these challenges; we develop novel techniques to (i) predict semantically equivalent SPARQL queries from a set of keywords by leveraging word embeddings, and (ii) generate fine-grained and non-blocking query plans to get fast and early results.