Efficient Estimation of Dynamic Density Functions with Applications in Streaming Data

Handle URI:
http://hdl.handle.net/10754/609049
Title:
Efficient Estimation of Dynamic Density Functions with Applications in Streaming Data
Authors:
Qahtan, Abdulhakim ( 0000-0001-8254-1764 )
Abstract:
Recent advances in computing technology allow vast amounts of data to be collected as they arrive continuously in the form of streams. Mining data streams is challenging because of the speed and volume of the arriving data. Furthermore, the underlying distribution of the data changes over time in unpredictable ways. To reduce the computational cost, data streams are often studied in condensed representations, e.g., the Probability Density Function (PDF). This thesis aims at developing an online density estimator that builds a model called KDE-Track for characterizing the dynamic density of data streams. KDE-Track estimates the PDF of the stream at a set of resampling points and uses interpolation to estimate the density at any given point. To reduce the interpolation error and computational complexity, we introduce adaptive resampling, in which more resampling points are used in regions where the PDF has high curvature and fewer in regions where it has low curvature. The PDF values at the resampling points are updated online to provide an up-to-date model of the data stream. Compared with other existing online density estimators, KDE-Track is often more accurate (as reflected by smaller error values) and more computationally efficient (as reflected by shorter running times). The PDF estimated by KDE-Track, which is available at any time, can be applied to visualizing the dynamic density of data streams, outlier detection, and change detection in data streams. In this thesis, the first application is to visualize the taxi traffic volume in New York City. Utilizing KDE-Track allows the traffic flow to be visualized and monitored in real time without extra overhead and provides insight into pick-up demand that service providers can use to improve service availability. The second application is to detect outliers in data streams from sensor networks based on the estimated PDF. The method detects outliers accurately and outperforms baseline methods designed for detecting and cleaning outliers in sensor data. The third application is to detect changes in data streams. We propose a framework based on Principal Component Analysis (PCA) that reduces the problem of detecting changes in multidimensional data to the problem of detecting changes in the data projected onto the principal components. We provide a theoretical analysis, supported by experimental results, showing that with PCA, different types of changes in data streams are reflected in the projected data on one or more principal components. Our framework detects changes accurately at low computational cost and scales well to high-dimensional data.
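The idea of keeping PDF values at a set of points and interpolating between them can be illustrated with a minimal sketch. This is not the thesis's KDE-Track implementation: the class name `GridDensitySketch`, the fixed uniform grid, the Gaussian kernel, the fixed bandwidth, and the running-mean update are all assumptions made for illustration, and adaptive resampling is not modeled.

```python
import numpy as np


class GridDensitySketch:
    """Toy online density model (hypothetical, not KDE-Track itself).

    Density values are stored at fixed grid points and refreshed with
    every arriving sample; queries use linear interpolation.
    """

    def __init__(self, lo, hi, num_points=200, bandwidth=0.3):
        # Assumption: a fixed, uniformly spaced grid over [lo, hi].
        self.grid = np.linspace(lo, hi, num_points)
        self.bandwidth = bandwidth
        self.density = np.zeros(num_points)
        self.count = 0

    def update(self, x):
        # Gaussian kernel centered at the new sample, evaluated on the grid.
        z = (self.grid - x) / self.bandwidth
        kernel = np.exp(-0.5 * z**2) / (self.bandwidth * np.sqrt(2.0 * np.pi))
        self.count += 1
        # Running mean of the kernels seen so far (no forgetting of old data).
        self.density += (kernel - self.density) / self.count

    def pdf(self, x):
        # Linear interpolation between grid points; zero outside the grid.
        return np.interp(x, self.grid, self.density, left=0.0, right=0.0)


rng = np.random.default_rng(0)
model = GridDensitySketch(-5.0, 5.0)
for sample in rng.normal(0.0, 1.0, size=2000):
    model.update(sample)

print(model.pdf(0.0), model.pdf(3.0))
```

Each update costs time proportional to the number of grid points, and a query costs only an interpolation lookup. A stream whose distribution drifts would need some form of forgetting in place of the plain running mean used here.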
Advisors:
Zhang, Xiangliang ( 0000-0002-3574-5665 )
Committee Member:
Wang, Suojin; Gama, Joao; Moshkov, Mikhail ( 0000-0003-0085-9483 ) ; Gao, Xin ( 0000-0002-7108-3574 )
KAUST Department:
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division; Computer Science
Program:
Computer Science
Issue Date:
11-May-2016
Type:
Dissertation
Appears in Collections:
Dissertations; Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division

Full metadata record

DC Field | Value | Language
dc.contributor.advisor | Zhang, Xiangliang | en
dc.contributor.author | Qahtan, Abdulhakim | en
dc.date.accessioned | 2016-05-11T12:19:23Z | en
dc.date.available | 2016-05-11T12:19:23Z | en
dc.date.issued | 2016-05-11 | en
dc.identifier.uri | http://hdl.handle.net/10754/609049 | en
dc.language.iso | en | en
dc.subject | data streams | en
dc.subject | density estimation | en
dc.subject | dynamic density | en
dc.subject | change detection | en
dc.subject | outlier detection | en
dc.title | Efficient Estimation of Dynamic Density Functions with Applications in Streaming Data | en
dc.type | Dissertation | en
dc.contributor.department | Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division | en
dc.contributor.department | Computer Science | en
thesis.degree.grantor | King Abdullah University of Science and Technology | en_GB
dc.contributor.committeemember | Wang, Suojin | en
dc.contributor.committeemember | Gama, Joao | en
dc.contributor.committeemember | Moshkov, Mikhail | en
dc.contributor.committeemember | Gao, Xin | en
thesis.degree.discipline | Computer Science | en
thesis.degree.name | Doctor of Philosophy | en
dc.person.id | 113057 | en