Handle URI:
http://hdl.handle.net/10754/594265
Title:
No free lunch
Authors:
Ture, Ferhan; Elsayed, Tamer; Lin, Jimmy
Abstract:
This work explores the problem of cross-lingual pairwise similarity, where the task is to extract similar pairs of documents across two different languages. Solutions to this problem are of general interest for text mining in the multilingual context and have specific applications in statistical machine translation. Our approach takes advantage of cross-language information retrieval (CLIR) techniques to project feature vectors from one language into another, and then uses locality-sensitive hashing (LSH) to extract similar pairs. We show that effective cross-lingual pairwise similarity requires working with similarity thresholds that are much lower than in typical monolingual applications, making the problem quite challenging. We present a parallel, scalable MapReduce implementation of the sort-based sliding window algorithm, which is compared to a brute-force approach on German and English Wikipedia collections. Our central finding can be summarized as "no free lunch": there is no single optimal solution. Instead, we characterize effectiveness-efficiency tradeoffs in the solution space, which can guide the developer to locate a desirable operating point based on application- and resource-specific constraints.
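The LSH and sort-based sliding window steps mentioned in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' MapReduce implementation. It assumes random-projection bit signatures, assumes document vectors have already been projected into a shared feature space (the CLIR projection step is not shown), and uses hypothetical names throughout (`signature`, `sliding_window_pairs`, `num_perms`, `window`, `max_dist`).

```python
import random


def signature(vec, hyperplanes):
    """Random-projection LSH: one bit per hyperplane (sign of the dot product)."""
    return tuple(
        1 if sum(v * h for v, h in zip(vec, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming(a, b):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def sliding_window_pairs(sigs, num_perms, window, max_dist, seed=0):
    """Sort-based sliding window over permuted signatures (illustrative only).

    sigs maps a (language, doc_id) key to a bit tuple. For each random bit
    permutation, signatures are sorted and each one is compared with the
    next `window - 1` entries in sorted order. Only cross-language pairs
    within `max_dist` Hamming distance are kept.
    """
    rng = random.Random(seed)
    num_bits = len(next(iter(sigs.values())))
    pairs = set()
    for _ in range(num_perms):
        perm = list(range(num_bits))
        rng.shuffle(perm)
        permuted = sorted(
            (tuple(sig[i] for i in perm), doc) for doc, sig in sigs.items()
        )
        for i, (sig_i, doc_i) in enumerate(permuted):
            for sig_j, doc_j in permuted[i + 1 : i + window]:
                if doc_i[0] != doc_j[0] and hamming(sig_i, sig_j) <= max_dist:
                    pairs.add(tuple(sorted((doc_i, doc_j))))
    return pairs


# Toy usage: two identical vectors in different languages, plus an opposite one.
rng = random.Random(42)
planes = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(16)]
docs = {
    ("de", 1): [1.0, 2.0, 3.0],
    ("en", 1): [1.0, 2.0, 3.0],
    ("en", 2): [-1.0, -2.0, -3.0],
}
sigs = {doc: signature(vec, planes) for doc, vec in docs.items()}
found = sliding_window_pairs(sigs, num_perms=4, window=2, max_dist=0)
```

In this sketch, more permutations or a larger window raise the chance of finding a true pair at the cost of more comparisons, which is the kind of effectiveness-efficiency tradeoff the abstract describes.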
KAUST Department:
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Citation:
Ture F, Elsayed T, Lin J (2011) No free lunch. Proceedings of the 34th international ACM SIGIR conference on Research and development in Information - SIGIR ’11. Available: http://dx.doi.org/10.1145/2009916.2010042.
Publisher:
Association for Computing Machinery (ACM)
Journal:
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information - SIGIR '11
Issue Date:
2011
DOI:
10.1145/2009916.2010042
Type:
Conference Paper
ISBN:
9781450309349
Appears in Collections:
Conference Papers; Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division

Full metadata record

DC Field | Value | Language
dc.contributor.author | Ture, Ferhan | en
dc.contributor.author | Elsayed, Tamer | en
dc.contributor.author | Lin, Jimmy | en
dc.date.accessioned | 2016-01-19T14:44:43Z | en
dc.date.available | 2016-01-19T14:44:43Z | en
dc.date.issued | 2011 | en
dc.identifier.citation | Ture F, Elsayed T, Lin J (2011) No free lunch. Proceedings of the 34th international ACM SIGIR conference on Research and development in Information - SIGIR ’11. Available: http://dx.doi.org/10.1145/2009916.2010042. | en
dc.identifier.isbn | 9781450309349 | en
dc.identifier.doi | 10.1145/2009916.2010042 | en
dc.identifier.uri | http://hdl.handle.net/10754/594265 | en
dc.description.abstract | This work explores the problem of cross-lingual pairwise similarity, where the task is to extract similar pairs of documents across two different languages. Solutions to this problem are of general interest for text mining in the multilingual context and have specific applications in statistical machine translation. Our approach takes advantage of cross-language information retrieval (CLIR) techniques to project feature vectors from one language into another, and then uses locality-sensitive hashing (LSH) to extract similar pairs. We show that effective cross-lingual pairwise similarity requires working with similarity thresholds that are much lower than in typical monolingual applications, making the problem quite challenging. We present a parallel, scalable MapReduce implementation of the sort-based sliding window algorithm, which is compared to a brute-force approach on German and English Wikipedia collections. Our central finding can be summarized as "no free lunch": there is no single optimal solution. Instead, we characterize effectiveness-efficiency tradeoffs in the solution space, which can guide the developer to locate a desirable operating point based on application- and resource-specific constraints. | en
dc.publisher | Association for Computing Machinery (ACM) | en
dc.subject | LSH | en
dc.subject | Machine translation | en
dc.subject | Wikipedia | en
dc.title | No free lunch | en
dc.type | Conference Paper | en
dc.contributor.department | Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division | en
dc.identifier.journal | Proceedings of the 34th international ACM SIGIR conference on Research and development in Information - SIGIR '11 | en
dc.contributor.institution | Dept. of Computer Science, United States | en
dc.contributor.institution | ISchool, University of Maryland, United States | en
kaust.author | Lin, Jimmy | en