Scalable Frequent Subgraph Mining

Handle URI:
http://hdl.handle.net/10754/625049
Title:
Scalable Frequent Subgraph Mining
Authors:
Abdelhamid, Ehab ( 0000-0002-8708-0642 )
Abstract:
A graph is a data structure that contains a set of nodes and a set of edges connecting these nodes. Nodes represent objects while edges model relationships among these objects. Graphs are used in various domains due to their ability to model complex relations among several objects. Given an input graph, the Frequent Subgraph Mining (FSM) task finds all subgraphs with frequencies exceeding a given threshold. FSM is crucial for graph analysis, and it is an essential building block in a variety of applications, such as graph clustering and indexing. FSM is computationally expensive, and its existing solutions are extremely slow. Consequently, these solutions are incapable of mining modern large graphs. This slowness is caused by the underlying approaches of these solutions which require finding and storing an excessive amount of subgraph matches. This dissertation proposes a scalable solution for FSM that avoids the limitations of previous work. This solution is composed of four components. The first component is a single-threaded technique which, for each candidate subgraph, needs to find only a minimal number of matches. The second component is a scalable parallel FSM technique that utilizes a novel two-phase approach. The first phase quickly builds an approximate search space, which is then used by the second phase to optimize and balance the workload of the FSM task. The third component focuses on accelerating frequency evaluation, which is a critical step in FSM. To do so, a machine learning model is employed to predict the type of each graph node, and accordingly, an optimized method is selected to evaluate that node. The fourth component focuses on mining dynamic graphs, such as social networks. To this end, an incremental index is maintained during the dynamic updates. Only this index is processed and updated for the majority of graph updates. Consequently, search space is significantly pruned and efficiency is improved. The empirical evaluation shows that the proposed components significantly outperform existing solutions, scale to a large number of processors and process graphs that previous techniques cannot handle, such as large and dynamic graphs.
Advisors:
Kalnis, Panos ( 0000-0002-5060-1360 )
Committee Member:
Gomes, Diogo A. ( 0000-0002-3129-3956 ) ; Gao, Xin ( 0000-0002-7108-3574 ) ; Zaki, Mohammed J.
KAUST Department:
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Program:
Computer Science
Issue Date:
19-Jun-2017
Type:
Dissertation
Appears in Collections:
Dissertations

Full metadata record

DC FieldValue Language
dc.contributor.advisorKalnis, Panosen
dc.contributor.authorAbdelhamid, Ehaben
dc.date.accessioned2017-06-19T07:47:13Z-
dc.date.available2017-06-19T07:47:13Z-
dc.date.issued2017-06-19-
dc.identifier.urihttp://hdl.handle.net/10754/625049-
dc.description.abstractA graph is a data structure that contains a set of nodes and a set of edges connecting these nodes. Nodes represent objects while edges model relationships among these objects. Graphs are used in various domains due to their ability to model complex relations among several objects. Given an input graph, the Frequent Subgraph Mining (FSM) task finds all subgraphs with frequencies exceeding a given threshold. FSM is crucial for graph analysis, and it is an essential building block in a variety of applications, such as graph clustering and indexing. FSM is computationally expensive, and its existing solutions are extremely slow. Consequently, these solutions are incapable of mining modern large graphs. This slowness is caused by the underlying approaches of these solutions which require finding and storing an excessive amount of subgraph matches. This dissertation proposes a scalable solution for FSM that avoids the limitations of previous work. This solution is composed of four components. The first component is a single-threaded technique which, for each candidate subgraph, needs to find only a minimal number of matches. The second component is a scalable parallel FSM technique that utilizes a novel two-phase approach. The first phase quickly builds an approximate search space, which is then used by the second phase to optimize and balance the workload of the FSM task. The third component focuses on accelerating frequency evaluation, which is a critical step in FSM. To do so, a machine learning model is employed to predict the type of each graph node, and accordingly, an optimized method is selected to evaluate that node. The fourth component focuses on mining dynamic graphs, such as social networks. To this end, an incremental index is maintained during the dynamic updates. Only this index is processed and updated for the majority of graph updates. Consequently, search space is significantly pruned and efficiency is improved. The empirical evaluation shows that the proposed components significantly outperform existing solutions, scale to a large number of processors and process graphs that previous techniques cannot handle, such as large and dynamic graphs.en
dc.language.isoenen
dc.subjectgraphen
dc.subjectparallel processingen
dc.subjectFrequent subgraph miningen
dc.subjectincremental indexingen
dc.titleScalable Frequent Subgraph Miningen
dc.typeDissertationen
dc.contributor.departmentComputer, Electrical and Mathematical Sciences and Engineering (CEMSE) Divisionen
thesis.degree.grantorKing Abdullah University of Science and Technologyen_GB
dc.contributor.committeememberGomes, Diogo A.en
dc.contributor.committeememberGao, Xinen
dc.contributor.committeememberZaki, Mohammed J.en
thesis.degree.disciplineComputer Scienceen
thesis.degree.nameDoctor of Philosophyen
dc.person.id113059en
All Items in KAUST are protected by copyright, with all rights reserved, unless otherwise indicated.