Multivariate alternating decision tree

2017-02-21T02:30:25Z (GMT) by Sok, Hong Kuan
Decision trees are characterized by fast induction time and comprehensible classification rules. However, their classification accuracy is relatively low compared with black-box classifiers such as support vector machines. It is often possible to improve decision tree accuracy by combining trees via boosting or bagging into an ensemble (i.e., a forest). Unfortunately, ensemble approaches cause decision trees to lose their comprehensibility and significantly lengthen their induction time. The invention of the alternating decision tree (ADTree) allows boosting to be incorporated within a single decision tree, retaining comprehensibility. However, the existing ADTree is univariate in nature, which limits its potential for further improvements in accuracy and induction time. This thesis presents the multivariate alternating decision tree, in which multivariate decision nodes are incorporated into the ADTree learning algorithm; it can be considered a generalization of the existing univariate ADTree. Three variants of the multivariate ADTree are presented in this thesis: the Fisher's ADTree, the Sparse ADTree, and the regularized LogitBoost ADTree (rLADTree). These algorithms were benchmarked against existing univariate, multivariate, and ensemble-based decision trees using real-world datasets from the University of California, Irvine Machine Learning Repository and the University of Eastern Finland Spectral Database. It is shown that the Fisher's ADTree improves the accuracy of multivariate decision trees through boosting while remaining significantly smaller than boosted multivariate decision trees. It is also shown that the Sparse ADTree is a non-parametric extension of Sparse Linear Discriminant Analysis (SLDA): it can linearly partition the data when it is beneficial to do so, or grow a tree to improve classification accuracy when necessary.
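To make the distinction concrete, the following is a minimal, flattened sketch of ADTree-style scoring in Python. It is illustrative only, not the thesis implementation: a full ADTree nests decision nodes under prediction nodes and only evaluates the nodes an instance reaches, whereas this one-level sketch simply sums the root prediction value with the prediction value of each rule's matched branch, taking the sign of the total as the class. The function and variable names are assumptions. The key difference between the univariate and multivariate variants is visible in the node conditions: a univariate node tests a single feature against a threshold, while a multivariate node tests a linear combination of features.

```python
def univariate_condition(feature_index, threshold):
    """Univariate decision node: test one feature against a threshold."""
    return lambda x: x[feature_index] > threshold

def multivariate_condition(weights, threshold):
    """Multivariate decision node: test a linear combination of features."""
    return lambda x: sum(w * xi for w, xi in zip(weights, x)) > threshold

def adtree_score(x, root_value, rules):
    """Sum the root prediction value with the prediction value of the
    branch selected by each decision node's condition; the sign of the
    total gives the predicted class."""
    score = root_value
    for condition, true_value, false_value in rules:
        score += true_value if condition(x) else false_value
    return score

# Toy example with one univariate and one multivariate decision node.
rules = [
    (univariate_condition(0, 0.5), +0.4, -0.2),
    (multivariate_condition([1.0, -1.0], 0.0), +0.3, -0.5),
]
x = [0.8, 0.1]
score = adtree_score(x, root_value=0.1, rules=rules)  # 0.1 + 0.4 + 0.3 = 0.8
label = 1 if score > 0 else -1
```

Because each prediction value contributes additively to the final score, the contribution of every node to a particular classification can be read directly off the tree, which is the comprehensibility property the abstract refers to.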
The most notable multivariate ADTree variant is the rLADTree, which shows no statistically significant differences in any performance metric, while offering comprehensibility, when compared with univariate, unboosted decision trees such as C4.5 and CART on general datasets. For datasets with highly correlated features, the rLADTree outperforms existing decision trees in both accuracy and comprehensibility, making it a top choice among decision tree classifiers.