MML Inference of Decision Trees, Graphs and Forests

2017-03-16T04:38:21Z (GMT) by Peter J. Tan
The Minimum Message Length (MML) principle is a general inductive inference framework which recasts inductive inference as a coding process. Using assumptions identical to those from the well-known Bayesian inference, prediction and modelling, MML represents general inductive and statistical inference problems as a data encoding process which conveys the data to a receiver in a two-part message. The first part is the message to encode the inferred model. The second part is the message to encode the data in light of the inferred model. MML states that the model with the shortest two-part message is the best model approximating the true model.
   Decision trees, decision graphs and decision forests are popular supervised learning methods in machine learning. In this dissertation, the MML principle is applied to build machine learning schemes for decision trees, decision graphs and decision forests. Two novel MML inference schemes are developed. One is a MML coding scheme for Oblique decision trees, which are decision trees with linear discriminate functions at their internal nodes. Another is a MML coding scheme for decision graphs with multi-way joins and dynamic attributes. A decision forests learning scheme based on MML oblique decision trees is also presented.
   Experiments were conducted across a range of problems using data from University of California Machine Learning Repository and the Singapore Data Mining Centre. These experiments showed that compared to other popular decision tree models such C4.5 and C5, models generated by MML inference schemes achieved favourable results in both classification and probabilistic predictive accuracy. The study showed that MML inference schemes are able to find the optimal trade-off between the complexity of these structure models and goodness-of-fit for a given set of data.