dc.contributor.author: Wang, Pinghui
dc.contributor.author: Zhao, Junzhou
dc.contributor.author: Lui, John C. S.
dc.contributor.author: Towsley, Don
dc.contributor.author: Guan, Xiaohong
dc.date.accessioned: 2018-03-20T08:50:23Z
dc.date.available: 2018-03-20T08:50:23Z
dc.date.issued: 2018-03-15
dc.identifier.citation: Wang P, Zhao J, Lui JCS, Towsley D, Guan X (2018) Fast crawling methods of exploring content distributed over large graphs. Knowledge and Information Systems. Available: http://dx.doi.org/10.1007/s10115-018-1178-x.
dc.identifier.issn: 0219-1377
dc.identifier.issn: 0219-3116
dc.identifier.doi: 10.1007/s10115-018-1178-x
dc.identifier.uri: http://hdl.handle.net/10754/627359
dc.description.abstract: Despite recent efforts to estimate topology characteristics of large graphs (e.g., online social networks and peer-to-peer networks), little attention has been given to developing a formal crawling methodology to characterize the vast amount of content distributed over these networks. Due to the large-scale nature of these networks and a limited query rate imposed by network service providers, exhaustively crawling and enumerating content maintained by each vertex is computationally prohibitive. In this paper, we show how one can obtain content properties by crawling only a small fraction of vertices and collecting their content. We first show that when sampling is naively applied, this can produce a huge bias in content statistics (i.e., average number of content replicas). To remove this bias, one may use maximum likelihood estimation to estimate content characteristics. However, our experimental results show that this straightforward method requires sampling most vertices to obtain accurate estimates. To address this challenge, we propose two efficient estimators, the special copy estimator (SCE) and the weighted copy estimator (WCE), which estimate content characteristics using information available in sampled content. SCE uses the special content copy indicator to compute the estimate, while WCE derives the estimate based on meta-information in sampled vertices. We conduct experiments on a variety of real-world and synthetic datasets, and the results show that WCE and SCE are cost-effective and also “asymptotically unbiased”. Our methodology provides a new tool for researchers to efficiently query content distributed in large-scale networks.
dc.description.sponsorship: The authors wish to thank the anonymous reviewers for their helpful feedback. This work was supported in part by Army Research Office Contract W911NF-12-1-0385, and ARL under Cooperative Agreement W911NF-09-2-0053. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the ARL or the U.S. Government. The work was also supported in part by National Natural Science Foundation of China (61603290, 61602371, U1301254), Ministry of Education and China Mobile Research Fund (MCM20160311), China Postdoctoral Science Foundation (2015M582663), Natural Science Basic Research Plan in Zhejiang Province of China (LGG18F020016), Natural Science Basic Research Plan in Shaanxi Province of China (2016JQ6034, 2017JM6095), and Shenzhen Basic Research Grant (JCYJ20160229195940462).
dc.publisher: Springer Nature
dc.relation.url: http://link.springer.com/article/10.1007/s10115-018-1178-x
dc.rights: The final publication is available at Springer via http://dx.doi.org/10.1007/s10115-018-1178-x
dc.subject: Crawling
dc.subject: Online social networks
dc.subject: Sampling
dc.subject: Random walks
dc.title: Fast crawling methods of exploring content distributed over large graphs
dc.type: Article
dc.contributor.department: Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
dc.identifier.journal: Knowledge and Information Systems
dc.eprint.version: Post-print
dc.contributor.institution: Shenzhen Research Institute of Xi’an Jiaotong University, Shenzhen, China
dc.contributor.institution: MOE Key Laboratory for Intelligent Networks and Network Security, Xi’an Jiaotong University, Xi’an, China
dc.contributor.institution: Department of Computer Science and Engineering, The Chinese University of Hong Kong, Sha Tin, Hong Kong
dc.contributor.institution: Department of Computer Science, University of Massachusetts Amherst, Amherst, US
dc.contributor.institution: Center for Intelligent and Networked Systems, Tsinghua University, Beijing, China
kaust.person: Zhao, Junzhou
refterms.dateFOA: 2019-03-15T00:00:00Z
dc.date.published-online: 2018-03-15
dc.date.published-print: 2019-04


Files in this item

Name: KAIS2018-1.pdf
Size: 2.282Mb
Format: PDF
Description: Accepted Manuscript
