Anomaly detection based on zero appearances in subspaces
Thesis posted on 2017-03-01, 04:02, authored by Pang, Guansong
Anomaly detection is regarded as one of the most important tasks in data mining due to its wide application in various domains, such as finance, information security, healthcare and earth science. With advancements in data collection techniques, the volume and dimensionality of anomaly detection data sets have increased explosively, and diverse attribute types occur within these data sets. Also, in many data sets, anomalies can be detected in some attributes only, while other attributes are irrelevant to anomaly detection. All these characteristics pose new challenges to existing anomaly detection techniques. Motivated by this fact, this research aims to design an anomaly detection method which scales up to large and high-dimensional data, identifies anomalies in data sets with different types of attributes, and tolerates irrelevant attributes. This thesis posits that anomalies are instances with low probabilities in subspaces of a data set. Consequently, in a random subset of the data set, anomalies have higher probabilities of having zero appearances in the subspaces than normal instances. Based on this property, this thesis proposes a novel anomaly detection method called ZERO++, which employs the number of zero appearances in subspaces to detect anomalies. As far as we know, ZERO++ is the only anomaly detector based on zero appearances in subspaces. It is unique in that it works in regions of subspaces that are not occupied by data, whereas other methods work in regions occupied by data. Utilising the anti-monotone property, "if an instance has zero appearances in a subspace, it must also have zero appearances in subspaces containing this subspace", we show that only a small number of subspaces with low dimensionality need to be considered to identify anomalies effectively.
ZERO++ is an efficient algorithm with time complexity linear in both data size and data dimensionality, and it works effectively on data sets with different types of attributes and with a low percentage of relevant attributes.
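The zero-appearance idea described above can be illustrated with a minimal sketch. This is not the thesis's implementation of ZERO++; it is a simplified illustration under stated assumptions: all attributes are categorical, every subspace of a fixed low dimensionality is enumerated rather than a selected few, and the function name `zero_appearance_scores` and its parameters (`subspace_dim`, `n_subsamples`, `subsample_size`) are hypothetical. An instance's score is the number of (subsample, subspace) pairs in which its value combination does not appear.

```python
import random
from itertools import combinations


def zero_appearance_scores(data, subspace_dim=2, n_subsamples=50,
                           subsample_size=8, seed=0):
    """Count, per instance, how often its value combination is absent
    from a random subsample, summed over low-dimensional subspaces.

    Hypothetical helper for illustration only (not the thesis's code).
    """
    rng = random.Random(seed)
    n_attrs = len(data[0])
    # Assumption: enumerate all subspaces of the given dimensionality.
    subspaces = list(combinations(range(n_attrs), subspace_dim))
    scores = [0] * len(data)
    for _ in range(n_subsamples):
        sample = rng.sample(data, min(subsample_size, len(data)))
        for attrs in subspaces:
            # Value combinations that occupy this subspace in the subsample.
            seen = {tuple(row[a] for a in attrs) for row in sample}
            for i, row in enumerate(data):
                if tuple(row[a] for a in attrs) not in seen:
                    scores[i] += 1  # zero appearances in this subspace
    return scores


# Two normal patterns plus one instance whose individual values are all
# common but whose pairwise combinations ("a","y") and ("y","p") are rare.
data = [("a", "x", "p")] * 20 + [("b", "y", "q")] * 20 + [("a", "y", "p")]
scores = zero_appearance_scores(data)
top = max(range(len(data)), key=lambda i: scores[i])
print(top, scores[top])
```

In this toy data set the last instance is unremarkable in every single attribute, so it only stands out in two-dimensional subspaces. Numeric attributes would need to be discretised before a sketch like this could be applied to them.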