Efficient and accurate nearest neighbor and closest pair search in high-dimensional space

Handle URI:
http://hdl.handle.net/10754/561504
Title:
Efficient and accurate nearest neighbor and closest pair search in high-dimensional space
Authors:
Tao, Yufei; Yi, Ke; Sheng, Cheng; Kalnis, Panos (0000-0002-5060-1360)
Abstract:
Nearest Neighbor (NN) search in high-dimensional space is an important problem in many applications. From the database perspective, a good solution needs to have two properties: (i) it can be easily incorporated in a relational database, and (ii) its query cost should increase sublinearly with the dataset size, regardless of the data and query distributions. Locality-Sensitive Hashing (LSH) is a well-known methodology fulfilling both requirements, but its current implementations either incur expensive space and query cost, or abandon its theoretical guarantee on the quality of query results. Motivated by this, we improve LSH by proposing an access method called the Locality-Sensitive B-tree (LSB-tree) to enable fast, accurate, high-dimensional NN search in relational databases. The combination of several LSB-trees forms an LSB-forest that has strong quality guarantees, yet dramatically improves the efficiency of the previous LSH implementation having the same guarantees. In practice, the LSB-tree itself is also an effective index which consumes linear space, supports efficient updates, and provides accurate query results. In our experiments, the LSB-tree was faster than (i) iDistance (a famous technique for exact NN search) by two orders of magnitude, and (ii) MedRank (a recent approximate method with nontrivial quality guarantees) by one order of magnitude, while returning much better results. As a second step, we extend our LSB technique to solve another classic problem, called Closest Pair (CP) search, in high-dimensional space. The long-standing challenge for this problem has been to achieve subquadratic running time at very high dimensionalities, where most of the existing solutions fail. We show that, using an LSB-forest, CP search can be accomplished in (worst-case) time significantly lower than the quadratic complexity, while still ensuring very good quality. In practice, accurate answers can be found using just two LSB-trees, thus giving a substantial reduction in the space and running time. In our experiments, our technique was faster than (i) distance browsing (a well-known method for solving the problem exactly) by several orders of magnitude, and (ii) D-shift (an approximate approach with theoretical guarantees in low-dimensional space) by one order of magnitude, and at the same time output better results. © 2010 ACM.
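The abstract names the LSB-tree but does not describe its internals. As a rough illustration of the general idea the name suggests (locality-sensitive hash values kept in an ordered, B-tree-like index), the Python sketch below hashes each point with a few random-projection LSH functions, interleaves the resulting grid cells into a single Z-order key, and keeps the keys in a sorted list that stands in for a B-tree. Everything here is an assumption made for illustration: the names `make_hash`, `z_order` and `ZOrderLSHIndex`, the parameter values, the Z-order encoding, and the window-based query. It is not the authors' algorithm and carries none of its quality guarantees.

```python
import bisect
import math
import random


def make_hash(dim, width, rng):
    """One random-projection LSH function h(x) = floor((a.x + b) / width)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, width)
    return lambda x: math.floor((sum(ai * xi for ai, xi in zip(a, x)) + b) / width)


def z_order(cells, bits):
    """Interleave the low `bits` bits of each non-negative cell coordinate."""
    z = 0
    for bit in range(bits - 1, -1, -1):
        for c in cells:
            z = (z << 1) | ((c >> bit) & 1)
    return z


class ZOrderLSHIndex:
    """Hypothetical index: LSH grid cells -> Z-order key -> sorted list."""

    def __init__(self, points, num_hashes=4, width=4.0, bits=8, seed=0):
        rng = random.Random(seed)
        dim = len(points[0])
        self.points = points
        self.bits = bits
        self.offset = 1 << (bits - 1)  # shifts signed cells into [0, 2^bits)
        self.hashes = [make_hash(dim, width, rng) for _ in range(num_hashes)]
        # The sorted (key, point id) list plays the role of the B-tree leaves.
        self.entries = sorted((self.key(p), i) for i, p in enumerate(points))
        self.keys = [k for k, _ in self.entries]

    def key(self, x):
        top = (1 << self.bits) - 1
        # Clamp each shifted cell so it fits in `bits` bits.
        cells = [min(max(h(x) + self.offset, 0), top) for h in self.hashes]
        return z_order(cells, self.bits)

    def query(self, q, candidates=16):
        """Return the id of the closest point among the entries whose keys
        lie within `candidates` positions of q's key in sorted order."""
        pos = bisect.bisect_left(self.keys, self.key(q))
        lo = max(0, pos - candidates)
        hi = min(len(self.entries), pos + candidates)
        best = min(self.entries[lo:hi],
                   key=lambda e: math.dist(self.points[e[1]], q))
        return best[1]
```

A query only computes exact distances for a small window of entries around the query's key, which is where the sublinear behaviour of such a scheme would come from. Because the keys are ordinary integers, the same structure could in principle live in a standard B-tree index of a relational database, which is the integration property the abstract emphasises.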
KAUST Department:
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division; Computer Science Program
Publisher:
Association for Computing Machinery (ACM)
Journal:
ACM Transactions on Database Systems
Issue Date:
1-Jul-2010
DOI:
10.1145/1806907.1806912
Type:
Article
ISSN:
0362-5915
Sponsors:
Y. Tao and C. Sheng were supported by Grants GRF 4161/07, GRF 4173/08, and GRF 4169/09 from HKRGC, and a direct grant (2050395) from CUHK. K. Yi was supported by a Hong Kong Direct Allocation grant (DAG07/08).
Appears in Collections:
Articles; Computer Science Program; Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division

Full metadata record

DC Field | Value | Language
dc.contributor.author | Tao, Yufei | en
dc.contributor.author | Yi, Ke | en
dc.contributor.author | Sheng, Cheng | en
dc.contributor.author | Kalnis, Panos | en
dc.date.accessioned | 2015-08-02T09:12:59Z | en
dc.date.available | 2015-08-02T09:12:59Z | en
dc.date.issued | 2010-07-01 | en
dc.identifier.issn | 0362-5915 | en
dc.identifier.doi | 10.1145/1806907.1806912 | en
dc.identifier.uri | http://hdl.handle.net/10754/561504 | en
dc.description.abstract | Nearest Neighbor (NN) search in high-dimensional space is an important problem in many applications. From the database perspective, a good solution needs to have two properties: (i) it can be easily incorporated in a relational database, and (ii) its query cost should increase sublinearly with the dataset size, regardless of the data and query distributions. Locality-Sensitive Hashing (LSH) is a well-known methodology fulfilling both requirements, but its current implementations either incur expensive space and query cost, or abandon its theoretical guarantee on the quality of query results. Motivated by this, we improve LSH by proposing an access method called the Locality-Sensitive B-tree (LSB-tree) to enable fast, accurate, high-dimensional NN search in relational databases. The combination of several LSB-trees forms an LSB-forest that has strong quality guarantees, yet dramatically improves the efficiency of the previous LSH implementation having the same guarantees. In practice, the LSB-tree itself is also an effective index which consumes linear space, supports efficient updates, and provides accurate query results. In our experiments, the LSB-tree was faster than (i) iDistance (a famous technique for exact NN search) by two orders of magnitude, and (ii) MedRank (a recent approximate method with nontrivial quality guarantees) by one order of magnitude, while returning much better results. As a second step, we extend our LSB technique to solve another classic problem, called Closest Pair (CP) search, in high-dimensional space. The long-standing challenge for this problem has been to achieve subquadratic running time at very high dimensionalities, where most of the existing solutions fail. We show that, using an LSB-forest, CP search can be accomplished in (worst-case) time significantly lower than the quadratic complexity, while still ensuring very good quality. In practice, accurate answers can be found using just two LSB-trees, thus giving a substantial reduction in the space and running time. In our experiments, our technique was faster than (i) distance browsing (a well-known method for solving the problem exactly) by several orders of magnitude, and (ii) D-shift (an approximate approach with theoretical guarantees in low-dimensional space) by one order of magnitude, and at the same time output better results. © 2010 ACM. | en
dc.description.sponsorship | Y. Tao and C. Sheng were supported by Grants GRF 4161/07, GRF 4173/08, and GRF 4169/09 from HKRGC, and a direct grant (2050395) from CUHK. K. Yi was supported by a Hong Kong Direct Allocation grant (DAG07/08). | en
dc.publisher | Association for Computing Machinery (ACM) | en
dc.subject | Algorithms | en
dc.subject | Experimentation | en
dc.subject | Theory | en
dc.title | Efficient and accurate nearest neighbor and closest pair search in high-dimensional space | en
dc.type | Article | en
dc.contributor.department | Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division | en
dc.contributor.department | Computer Science Program | en
dc.identifier.journal | ACM Transactions on Database Systems | en
dc.contributor.institution | Department of Computer Science and Engineering, Chinese University of Hong Kong, Sha Tin, Hong Kong | en
dc.contributor.institution | Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong | en
kaust.author | Kalnis, Panos | en
kaust.author | Sheng, Cheng | en