Scalability and Performance of Random Forest based Learning-to-Rank for Information Retrieval
thesis
posted on 2016-12-15, 05:26 authored by Muhammad Ibrahim

For a query submitted by a user, the goal of an information retrieval system is to return a list of documents that are highly relevant to that query. Traditionally, various scoring methods, ranging from simple heuristic models to probabilistic models, have been used for this task. More recently, researchers have applied supervised machine learning techniques to this problem, which is then known as the learning-to-rank (LtR) problem. Many supervised learning methods have been tested so far, with empirical success over conventional methods.
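To make the LtR framing concrete, the sketch below shows a pointwise setup in Python: each query-document pair is a feature vector with a graded relevance label, a supervised regressor learns a scoring function, and documents are ranked by predicted score. The data, feature dimension, and choice of learner here are illustrative assumptions, not the thesis's experimental setup, which uses standard LtR benchmark datasets.

    # Minimal pointwise learning-to-rank sketch (hypothetical toy data).
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X_train = rng.random((1000, 46))    # query-document feature vectors
    y_train = rng.integers(0, 3, 1000)  # graded relevance labels (0..2)

    # Any supervised regressor can serve as the scoring function.
    model = LinearRegression().fit(X_train, y_train)

    # At query time: score the candidate documents for one query, sort.
    X_query_docs = rng.random((20, 46))
    scores = model.predict(X_query_docs)
    ranking = np.argsort(-scores)       # highest predicted relevance first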
The random forest is a relatively simple yet effective and efficient learning algorithm which aggregates the predictions of a large number of independent, diverse base learners, namely decision trees. Its major benefits over other state-of-the-art methods include inherent parallelizability, ease of tuning, and competitive performance. These benefits have made random forests a popular choice across many disciplines. However, for the LtR task, random forests have not been thoroughly investigated.
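The following sketch illustrates the aggregation idea described above, under assumed toy interfaces: each tree is trained independently on a bootstrap sample with random feature subsampling, and the forest's prediction is the average of the per-tree predictions. Because each fit depends only on its own sample, the loop is embarrassingly parallel, which is the source of the parallelizability benefit.

    # Minimal sketch of a regression random forest (illustrative only).
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def fit_forest(X, y, n_trees=50, seed=0):
        rng = np.random.default_rng(seed)
        trees = []
        for _ in range(n_trees):
            idx = rng.integers(0, len(X), len(X))  # bootstrap sample
            tree = DecisionTreeRegressor(
                max_features="sqrt",               # random feature subsampling
                random_state=int(rng.integers(1 << 31)))
            trees.append(tree.fit(X[idx], y[idx])) # each fit is independent
        return trees

    def predict_forest(trees, X):
        # Aggregate by averaging the independent per-tree predictions.
        return np.mean([t.predict(X) for t in trees], axis=0)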
In this research, we investigate random forest based LtR algorithms, aiming to improve their efficiency, effectiveness, and understanding. With respect to the first goal, we employ undersampling techniques and leverage the inherent structure of a random forest to achieve better scalability, especially on highly imbalanced datasets; we also reduce the correlation among the trees to shorten learning time and improve performance. With respect to the second goal, we investigate various objective functions, ranging from completely randomized splitting criteria to so-called listwise splitting, and conduct a thorough study of random forest based pointwise algorithms. With respect to the third goal, we develop methods for estimating the bias and variance of rank-learning algorithms, and examine their empirical behaviour as the parameters of the learning algorithm vary.
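As background for the third goal, the sketch below shows a generic Monte-Carlo estimate of bias and variance for a learner's pointwise scores under the squared-loss decomposition. The thesis develops estimators suited specifically to rank-learning; this is only a standard illustration on hypothetical data, with the resampling scheme and learner chosen for brevity.

    # Generic Monte-Carlo bias-variance sketch (not the thesis's method).
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def bias_variance(X, y, X_test, y_test, n_runs=30, seed=0):
        rng = np.random.default_rng(seed)
        preds = []
        for _ in range(n_runs):
            idx = rng.integers(0, len(X), len(X))     # resample training set
            model = DecisionTreeRegressor().fit(X[idx], y[idx])
            preds.append(model.predict(X_test))
        preds = np.array(preds)
        mean_pred = preds.mean(axis=0)
        bias_sq = np.mean((mean_pred - y_test) ** 2)  # squared bias (plus noise)
        variance = np.mean(preds.var(axis=0))         # average prediction variance
        return bias_sq, variance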