Handle URI:
http://hdl.handle.net/10754/623134
Title:
BigDansing
Authors:
Khayyat, Zuhair (ORCID: 0000-0003-3650-6997); Ilyas, Ihab F.; Jindal, Alekh; Madden, Samuel; Ouzzani, Mourad; Papotti, Paolo; Quiané-Ruiz, Jorge-Arnulfo; Tang, Nan; Yin, Si
Abstract:
Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, without needing to be aware of the underlying distributed platform. BigDansing translates these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized join operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems by up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.
KAUST Department:
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
Citation:
Khayyat Z, Ilyas IF, Jindal A, Madden S, Ouzzani M, et al. (2015) BigDansing. Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data - SIGMOD ’15. Available: http://dx.doi.org/10.1145/2723372.2747646.
Publisher:
Association for Computing Machinery (ACM)
Journal:
Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data - SIGMOD '15
Issue Date:
2-Jun-2015
DOI:
10.1145/2723372.2747646
Type:
Conference Paper
Appears in Collections:
Conference Papers

Full metadata record

DC Field | Value | Language
dc.contributor.author | Khayyat, Zuhair | en
dc.contributor.author | Ilyas, Ihab F. | en
dc.contributor.author | Jindal, Alekh | en
dc.contributor.author | Madden, Samuel | en
dc.contributor.author | Ouzzani, Mourad | en
dc.contributor.author | Papotti, Paolo | en
dc.contributor.author | Quiané-Ruiz, Jorge-Arnulfo | en
dc.contributor.author | Tang, Nan | en
dc.contributor.author | Yin, Si | en
dc.date.accessioned | 2017-04-13T11:50:56Z | -
dc.date.available | 2017-04-13T11:50:56Z | -
dc.date.issued | 2015-06-02 | en
dc.identifier.citation | Khayyat Z, Ilyas IF, Jindal A, Madden S, Ouzzani M, et al. (2015) BigDansing. Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data - SIGMOD ’15. Available: http://dx.doi.org/10.1145/2723372.2747646. | en
dc.identifier.doi | 10.1145/2723372.2747646 | en
dc.identifier.uri | http://hdl.handle.net/10754/623134 | -
dc.description.abstract | Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, without needing to be aware of the underlying distributed platform. BigDansing translates these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized join operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems by up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms. | en
dc.publisher | Association for Computing Machinery (ACM) | en
dc.title | BigDansing | en
dc.type | Conference Paper | en
dc.contributor.department | King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia | en
dc.identifier.journal | Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data - SIGMOD '15 | en
dc.contributor.institution | University of Waterloo, Waterloo, Canada | en
dc.contributor.institution | MIT, Cambridge, MA, USA | en
dc.contributor.institution | Qatar Computing Research Institute, Doha, Qatar | en
kaust.author | Khayyat, Zuhair | en