Fast crawling methods of exploring content distributed over large graphs

Handle URI:
http://hdl.handle.net/10754/627359
Title:
Fast crawling methods of exploring content distributed over large graphs
Authors:
Wang, Pinghui; Zhao, Junzhou; Lui, John C. S.; Towsley, Don; Guan, Xiaohong
Abstract:
Despite recent efforts to estimate topology characteristics of large graphs (e.g., online social networks and peer-to-peer networks), little attention has been given to developing a formal crawling methodology to characterize the vast amount of content distributed over these networks. Due to the large-scale nature of these networks and a limited query rate imposed by network service providers, exhaustively crawling and enumerating content maintained by each vertex is computationally prohibitive. In this paper, we show how one can obtain content properties by crawling only a small fraction of vertices and collecting their content. We first show that when sampling is naively applied, this can produce a huge bias in content statistics (i.e., average number of content replicas). To remove this bias, one may use maximum likelihood estimation to estimate content characteristics. However, our experimental results show that this straightforward method requires sampling most vertices to obtain accurate estimates. To address this challenge, we propose two efficient estimators: special copy estimator (SCE) and weighted copy estimator (WCE) to estimate content characteristics using available information in sampled content. SCE uses the special content copy indicator to compute the estimate, while WCE derives the estimate based on meta-information in sampled vertices. We conduct experiments on a variety of real-world and synthetic datasets, and the results show that WCE and SCE are cost-effective and also “asymptotically unbiased”. Our methodology provides a new tool for researchers to efficiently query content distributed in large-scale networks.
KAUST Department:
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Citation:
Wang P, Zhao J, Lui JCS, Towsley D, Guan X (2018) Fast crawling methods of exploring content distributed over large graphs. Knowledge and Information Systems. Available: http://dx.doi.org/10.1007/s10115-018-1178-x.
Publisher:
Springer Nature
Journal:
Knowledge and Information Systems
Issue Date:
15-Mar-2018
DOI:
10.1007/s10115-018-1178-x
Type:
Article
ISSN:
0219-1377; 0219-3116
Sponsors:
The authors wish to thank the anonymous reviewers for their helpful feedback. This work was supported in part by Army Research Office Contract W911NF-12-1-0385, and ARL under Cooperative Agreement W911NF-09-2-0053. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied of the ARL, or the U.S. Government. The work was also supported in part by National Natural Science Foundation of China (61603290, 61602371, U1301254), Ministry of Education and China Mobile Research Fund (MCM20160311), China Postdoctoral Science Foundation (2015M582663), Natural Science Basic Research Plan in Zhejiang Province of China (LGG18F020016), Natural Science Basic Research Plan in Shaanxi Province of China (2016JQ6034, 2017JM6095), Shenzhen Basic Research Grant (JCYJ20160229195940462).
Additional Links:
http://link.springer.com/article/10.1007/s10115-018-1178-x
Appears in Collections:
Articles; Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
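
Illustrative sketch (not from the paper): the abstract's point that naive sampling biases content statistics can be reproduced on synthetic data. The Python sketch below is a minimal illustration under stated assumptions, and it is not the authors' SCE or WCE, whose details are not given in this record. It assumes vertices are sampled uniformly without replacement and that every observed content item reports its own replica count; the function name simulate, the replica distribution, and all parameter values are hypothetical. The correction shown is a generic inverse-probability (Hajek-style) reweighting, used here only to show why reweighting helps.

```python
import math
import random


def simulate(num_vertices=2000, num_items=5000, sample_frac=0.05, seed=0):
    """Compare a naive and a reweighted estimate of the mean replica count.

    All data here is synthetic and hypothetical; it only illustrates the bias.
    """
    rng = random.Random(seed)

    # Place each content item on k distinct vertices (k = its replica count).
    replicas = {}
    holdings = [[] for _ in range(num_vertices)]
    for item in range(num_items):
        k = rng.choice([1, 1, 1, 1, 2, 2, 5, 20])
        replicas[item] = k
        for v in rng.sample(range(num_vertices), k):
            holdings[v].append(item)
    true_mean = sum(replicas.values()) / num_items

    # Crawl a uniform sample of vertices and collect the distinct items seen.
    n = int(sample_frac * num_vertices)
    sampled = rng.sample(range(num_vertices), n)
    seen = {item for v in sampled for item in holdings[v]}

    # Naive estimate: plain average over observed items. Items with many
    # replicas are more likely to be observed, so this overestimates.
    naive = sum(replicas[i] for i in seen) / len(seen)

    # Probability that an item with k replicas is seen at least once when
    # n of num_vertices vertices are drawn without replacement.
    def inclusion_prob(k):
        missed = math.comb(num_vertices - k, n) / math.comb(num_vertices, n)
        return 1.0 - missed

    # Reweighted estimate: weight each observed item by 1 / inclusion_prob.
    weights = {i: 1.0 / inclusion_prob(replicas[i]) for i in seen}
    corrected = sum(w * replicas[i] for i, w in weights.items()) / sum(weights.values())

    return true_mean, naive, corrected


if __name__ == "__main__":
    true_mean, naive, corrected = simulate()
    print(f"true mean: {true_mean:.3f}")
    print(f"naive estimate: {naive:.3f}")
    print(f"reweighted estimate: {corrected:.3f}")
```

In this setup the naive average is pulled upward because heavily replicated items dominate the sample, while the reweighted ratio estimate lands much closer to the true mean. A real crawl would rarely allow uniform vertex sampling or expose exact replica counts, which is the gap the paper's estimators are meant to address.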

Full metadata record

DC Field | Value | Language
dc.contributor.author | Wang, Pinghui | en
dc.contributor.author | Zhao, Junzhou | en
dc.contributor.author | Lui, John C. S. | en
dc.contributor.author | Towsley, Don | en
dc.contributor.author | Guan, Xiaohong | en
dc.date.accessioned | 2018-03-20T08:50:23Z | -
dc.date.available | 2018-03-20T08:50:23Z | -
dc.date.issued | 2018-03-15 | en
dc.identifier.citation | Wang P, Zhao J, Lui JCS, Towsley D, Guan X (2018) Fast crawling methods of exploring content distributed over large graphs. Knowledge and Information Systems. Available: http://dx.doi.org/10.1007/s10115-018-1178-x. | en
dc.identifier.issn | 0219-1377 | en
dc.identifier.issn | 0219-3116 | en
dc.identifier.doi | 10.1007/s10115-018-1178-x | en
dc.identifier.uri | http://hdl.handle.net/10754/627359 | -
dc.description.abstract | Despite recent efforts to estimate topology characteristics of large graphs (e.g., online social networks and peer-to-peer networks), little attention has been given to developing a formal crawling methodology to characterize the vast amount of content distributed over these networks. Due to the large-scale nature of these networks and a limited query rate imposed by network service providers, exhaustively crawling and enumerating content maintained by each vertex is computationally prohibitive. In this paper, we show how one can obtain content properties by crawling only a small fraction of vertices and collecting their content. We first show that when sampling is naively applied, this can produce a huge bias in content statistics (i.e., average number of content replicas). To remove this bias, one may use maximum likelihood estimation to estimate content characteristics. However, our experimental results show that this straightforward method requires sampling most vertices to obtain accurate estimates. To address this challenge, we propose two efficient estimators: special copy estimator (SCE) and weighted copy estimator (WCE) to estimate content characteristics using available information in sampled content. SCE uses the special content copy indicator to compute the estimate, while WCE derives the estimate based on meta-information in sampled vertices. We conduct experiments on a variety of real-world and synthetic datasets, and the results show that WCE and SCE are cost-effective and also “asymptotically unbiased”. Our methodology provides a new tool for researchers to efficiently query content distributed in large-scale networks. | en
dc.description.sponsorship | The authors wish to thank the anonymous reviewers for their helpful feedback. This work was supported in part by Army Research Office Contract W911NF-12-1-0385, and ARL under Cooperative Agreement W911NF-09-2-0053. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied of the ARL, or the U.S. Government. The work was also supported in part by National Natural Science Foundation of China (61603290, 61602371, U1301254), Ministry of Education and China Mobile Research Fund (MCM20160311), China Postdoctoral Science Foundation (2015M582663), Natural Science Basic Research Plan in Zhejiang Province of China (LGG18F020016), Natural Science Basic Research Plan in Shaanxi Province of China (2016JQ6034, 2017JM6095), Shenzhen Basic Research Grant (JCYJ20160229195940462). | en
dc.publisher | Springer Nature | en
dc.relation.url | http://link.springer.com/article/10.1007/s10115-018-1178-x | en
dc.rights | The final publication is available at Springer via http://dx.doi.org/10.1007/s10115-018-1178-x | en
dc.subject | Crawling | en
dc.subject | Online social networks | en
dc.subject | Sampling | en
dc.subject | Random walks | en
dc.title | Fast crawling methods of exploring content distributed over large graphs | en
dc.type | Article | en
dc.contributor.department | Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division | en
dc.identifier.journal | Knowledge and Information Systems | en
dc.eprint.version | Post-print | en
dc.contributor.institution | Shenzhen Research Institute of Xi’an Jiaotong University, Shenzhen, China | en
dc.contributor.institution | MOE Key Laboratory for Intelligent Networks and Network Security, Xi’an Jiaotong University, Xi’an, China | en
dc.contributor.institution | Department of Computer Science and Engineering, The Chinese University of Hong Kong, Sha Tin, Hong Kong | en
dc.contributor.institution | Department of Computer Science, University of Massachusetts Amherst, Amherst, US | en
dc.contributor.institution | Center for Intelligent and Networked Systems, Tsinghua University, Beijing, China | en
kaust.author | Zhao, Junzhou | en