AuthorsAbdel Messih, Mario A.
AdvisorsBajic, Vladimir B.
Permanent link to this recordhttp://hdl.handle.net/10754/248712
MetadataShow full item record
AbstractThis study develops a novel hybrid method for outlier detection (HMOD) that combines the idea of distance based and density based methods. The proposed method has two main advantages over most of the other outlier detection methods. The first advantage is that it works well on both dense and sparse datasets. The second advantage is that, unlike most other outlier detection methods that require careful parameter setting and prior knowledge of the data, HMOD is not very sensitive to small changes in parameter values within certain parameter ranges. The only required parameter to set is the number of nearest neighbors. In addition, we made a fully parallelized implementation of HMOD that made it very efficient in applications. Moreover, we proposed a new way of using the outlier detection for redundancy reduction in datasets where the confidence level that evaluates how accurate the less redundant dataset can be used to represent the original dataset can be specified by users. HMOD is evaluated on synthetic datasets (dense and mixed “dense and sparse”) and a bioinformatics problem of redundancy reduction of dataset of position weight matrices (PWMs) of transcription factor binding sites. In addition, in the process of assessing the performance of our redundancy reduction method, we developed a simple tool that can be used to evaluate the confidence level of reduced dataset representing the original dataset. The evaluation of the results shows that our method can be used in a wide range of problems.