Monash University
Embargoed and Restricted Access

Reason: Under embargo until June 2010. After this date a copy can be supplied under Section 51(2) of the Australian Copyright Act 1968 by submitting a document delivery request through your library

Differential prioritization in feature selection for multiclass molecular classification

thesis
posted on 2017-01-09, 02:51 authored by Ooi, Chia Huey
The aim of this thesis is to develop a filter-based feature selection (FS) technique for multiclass molecular classification. Molecular classification is the classification of samples into groups of biological phenotypes based on high-dimensional gene expression data obtained from microarray experiments. The multiclass nature of these classification problems demands work in two specific areas: (a) differential prioritization and (b) combinations of different decomposition paradigms for FS and classification. FS aims to form, from the larger set of features in the dataset, a smaller subset of features capable of producing the best classification accuracy. This subset is called the predictor set. Relevance and redundancy have long been acknowledged as important criteria in forming the predictor set in filter-based FS. These two criteria are included as elements of the predictor set score, which measures the goodness of the predictor set. However, we propose that a third criterion is necessary for forming the predictor set, especially in a multiclass problem. This third criterion is differential prioritization, a novel criterion that dictates the priority of maximizing relevance relative to the priority of minimizing redundancy. Differential prioritization ensures that the optimal balance between relevance and redundancy is achieved based on the number of classes in the classification problem, because the relative importance of minimizing redundancy increases with the number of classes. For instance, to achieve the best accuracy, minimizing redundancy matters more in a 14-class problem than in a two-class problem. An outcome of the work on differential prioritization is the development of a superior measure of redundancy. Redundancy in the predictor set is defined as the amount of similarity or repetition of information among the members of the predictor set.
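As a rough illustration of how a single parameter could set the priority of relevance against redundancy in a predictor set score, the sketch below uses one plausible form, which the abstract itself does not specify. The function names, the exponent `alpha`, the sum-of-squares relevance measure, and the correlation-based antiredundancy measure are all assumptions made for illustration.

```python
import numpy as np


def relevance(X, y):
    """Mean over features of the between-class to within-class sum-of-squares ratio."""
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        rows = X[y == label]
        class_mean = rows.mean(axis=0)
        between += len(rows) * (class_mean - overall_mean) ** 2
        within += ((rows - class_mean) ** 2).sum(axis=0)
    return float(np.mean(between / within))


def antiredundancy(X):
    """Mean of 1 - |Pearson correlation| over all feature pairs (needs >= 2 features)."""
    r = np.corrcoef(X, rowvar=False)
    return float(np.mean(np.clip(1.0 - np.abs(r), 0.0, 1.0)))


def predictor_set_score(X, y, subset, alpha):
    """Hypothetical score: alpha near 1 favours relevance, alpha near 0 favours low redundancy."""
    Xs = X[:, subset]
    return relevance(Xs, y) ** alpha * antiredundancy(Xs) ** (1.0 - alpha)


# Toy data: 6 samples, 3 classes. Feature 1 duplicates feature 0;
# feature 2 is less relevant but carries different information.
y = np.array([0, 0, 1, 1, 2, 2])
X = np.array([
    [0, 0, 0],
    [1, 1, 2],
    [4, 4, 1],
    [5, 5, 3],
    [8, 8, 5],
    [9, 9, 1],
], dtype=float)

for alpha in (1.0, 0.5):
    redundant_pair = predictor_set_score(X, y, [0, 1], alpha)
    diverse_pair = predictor_set_score(X, y, [0, 2], alpha)
    print(alpha, redundant_pair, diverse_pair)
```

With `alpha = 1.0` the score ignores redundancy, so the duplicated pair ranks higher on relevance alone. With `alpha = 0.5` the duplicated pair is penalized heavily and the more diverse pair ranks higher. In this sketch, a problem with more classes would be given a smaller `alpha`.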
Traditionally, redundancy is measured by directly summing the pairwise similarities among the members of the predictor set. It is then minimized by placing it in the denominator of a ratio-based predictor set score. This way of measuring and minimizing redundancy suffers from a singularity at near-minimum redundancy: as the denominator approaches zero the score grows without bound, which gives a skewed representation of the goodness of the predictor set. This motivates an alternative measure of redundancy that circumvents the problem. In multiclass problems, following the 'divide-and-conquer' philosophy, FS may be decomposed into several two-class sub-problems. The manner of the decomposition determines the decomposition paradigm for the FS problem. The same is true of multiclass classification, which may also be decomposed into several two-class sub-problems. The problems of FS and classification are inevitably linked, since one of the aims of FS is to aid classification. However, there exists no formal approach for systematically combining the twin problems of FS and classification based on the decomposition paradigm used in each. Hence, we propose a system for combining the FS and classification problems, which enables us to examine the effect of different combinations of decomposition paradigms for FS and classification on accuracy in multiclass molecular classification.
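To make the idea of decomposition paradigms concrete, the following minimal sketch splits a multiclass label vector into two-class sub-problems in two common ways (one-vs-all and pairwise) and then enumerates the possible pairings of an FS paradigm with a classification paradigm. The abstract does not name the paradigms the thesis studies, so the choice of paradigms, the function names, and the `PARADIGMS` tuple are assumptions for illustration only.

```python
from itertools import combinations, product


def one_vs_all(labels):
    """One binary sub-problem per class: that class (1) against all others (0)."""
    return {
        c: [1 if y == c else 0 for y in labels]
        for c in sorted(set(labels))
    }


def pairwise(labels):
    """One binary sub-problem per class pair, using only samples from those two classes.

    Each sub-problem is a list of (sample_index, binary_label) tuples.
    """
    problems = {}
    for a, b in combinations(sorted(set(labels)), 2):
        problems[(a, b)] = [
            (i, 1 if y == a else 0)
            for i, y in enumerate(labels)
            if y in (a, b)
        ]
    return problems


# Hypothetical set of paradigms; "none" means the multiclass problem is left whole.
PARADIGMS = ("none", "one_vs_all", "pairwise")

# Every (FS paradigm, classification paradigm) combination that could be compared.
combos = list(product(PARADIGMS, repeat=2))

labels = ["A", "A", "B", "C", "C", "B"]
print(one_vs_all(labels)["A"])
print(pairwise(labels)[("A", "B")])
print(len(combos))
```

A K-class problem yields K sub-problems under one-vs-all and K(K-1)/2 under the pairwise scheme, so the choice of paradigm affects both the number of predictor sets to form and the number of classifiers to train.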

Campus location

Australia

Principal supervisor

Madhu Chetty

Year of Award

2007

Department, School or Centre

Information Technology (Monash University Gippsland)

Course

Doctor of Philosophy

Degree Type

DOCTORATE

Faculty

Faculty of Information Technology
