## File(s) under permanent embargo

**Reason:** Restricted by author. A copy can be supplied under Section 51(2) of the Australian Copyright Act 1968 by submitting a document delivery request through your library or by emailing document.delivery@monash.edu

# New generative classifiers with mass-based likelihood estimation

thesis

posted on 14.02.2017, 00:19 by Aryal, Sunil

The learning task in classification is to learn a classification model from the labelled training data that maps each data instance (x) to one of the predefined classes (y). Discriminative classifiers model the class separation boundaries and make predictions directly through a mapping of x → y, whereas generative classifiers model the class distributions to estimate the likelihood p(x|y) and the class prior p(y), and use the Bayes rule to predict the most probable class. In order to estimate p(x|y), generative classifiers require a density estimator. Current density estimators, such as the kernel density estimator and the k-nearest neighbour density estimator, have high time and space complexities. Thus, it is difficult to estimate p(x|y) directly, even in data sets with a moderate number of dimensions and a moderate data size. To mitigate this difficulty, existing generative classifiers make some assumptions about the probabilistic dependencies between attributes (for example, that attributes are conditionally independent given the class label), and estimate simplified surrogates of p(x|y) from the training data. Though this type of generative classifier has been shown to perform well, the assumptions made are often violated in practice, which can result in poor predictive accuracy. This thesis extends the previous work in mass and mass-based density estimation, and suggests an improved implementation to model a multi-dimensional distribution effectively in supervised learning. It presents two new likelihood estimators, one based on mass estimation and the other based on mass-based density estimation. Unlike current density estimators, the proposed methods do not employ any distance calculations. They estimate the multi-dimensional likelihood from the observed data efficiently.
Based on the two proposed likelihood estimators, this thesis introduces two new generative classifiers called MassBayes and DEMassBayes, which estimate p(x|y) through mass estimation and mass-based density estimation, respectively. Unlike existing generative classifiers, they do not make any assumptions about the inter-dependencies among the attributes. They estimate the multi-dimensional likelihood p(x|y) directly from the training data. Empirical evaluations show that the proposed generative classifiers yield better predictive accuracy than existing generative classifiers on benchmark data sets, especially on large data sets. They can work with sub-samples of the training data to estimate p(x|y) in large data sets. They have constant time and space complexities in training a classification model. Hence, they scale better than existing generative classifiers to large data sets.
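
For reference, the Bayes-rule prediction that the abstract describes can be written out as follows. This is the standard textbook form rather than notation taken from the thesis; the symbols ŷ (predicted class), d (number of attributes) and x_j (the j-th attribute of x) are assumptions introduced here for illustration.

$$
\hat{y} \;=\; \arg\max_{y}\, p(y \mid x) \;=\; \arg\max_{y}\, \frac{p(x \mid y)\, p(y)}{p(x)} \;=\; \arg\max_{y}\, p(x \mid y)\, p(y)
$$

The denominator p(x) does not depend on y, so it can be dropped from the maximisation. Under the conditional-independence assumption mentioned above, an existing generative classifier would replace the full likelihood with the surrogate

$$
p(x \mid y) \;\approx\; \prod_{j=1}^{d} p(x_j \mid y),
$$

whereas the classifiers proposed in the abstract are described as estimating the multi-dimensional p(x|y) directly, without such a factorisation.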