Learning discriminative relational features for sequence labeling
Thesis posted on 28.02.2017 by Nair, Naveen Sudhakaran
Sequence labeling is the task of assigning a class (state) label to each instance in a sequence of observations; it is generally grouped under structured output classification problems. Typical sequence labeling algorithms learn probabilistic information about neighboring states (along with probabilistic information about the inputs) from the training data and find the globally best assignment for the entire query sequence at once. Hidden Markov Models (HMM), Conditional Random Fields (CRF) and Support Vector Machines on Structured Output Spaces (StructSVM) are some of the most popular sequence labeling approaches. All these models learn parameters for the state-observation relationships in a sequence and for the transition relationships between states at successive steps. Inference is generally performed using a dynamic programming algorithm called the Viterbi algorithm. One limitation of conventional sequence labeling approaches is their limited ability to discover interactions among inputs. Typical approaches tend to assume conditional independence between individual inputs, given the class label. Although this enables a naive factorization of the observation distribution, it results in a loss of accuracy in the several cases where there are non-linear relationships among input variables. Discovering the relational structure in the input space could give a more meaningful representation of the model and thereby improve its labeling accuracy. In this work, we propose to learn useful relational features that capture the relationships in the input space. The space of relational features in such settings is exponential in the number of basic inputs. For instance, in the simple case of learning features that are conjunctions of basic inputs at any single sequence position, the feature space is of size 2^N for N basic inputs. The size would be much larger if we consider complex relational features built from inputs at different relative positions.
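To make the size of this feature space concrete, the following minimal sketch enumerates every conjunction of basic inputs at a single position and confirms that there are 2^N of them (counting the empty conjunction). The function name all_conjunctions and the sensor-style input names are hypothetical and chosen only for illustration; they do not come from the thesis.

```python
from itertools import combinations


def all_conjunctions(basic_inputs):
    """Return every conjunction (subset) of the given basic inputs.

    Each conjunction is represented as a frozenset of input names;
    the empty frozenset stands for the empty conjunction.
    """
    items = sorted(basic_inputs)
    return [
        frozenset(combo)
        for size in range(len(items) + 1)
        for combo in combinations(items, size)
    ]


# Hypothetical basic inputs for an activity recognition setting.
inputs = ["fridge_open", "in_kitchen", "stove_on"]
conjunctions = all_conjunctions(inputs)
print(len(conjunctions))  # 8, i.e. 2**3

# The count doubles with every additional basic input.
for n in (10, 20, 30):
    print(n, 2 ** n)
```

Even at N = 30 the space already exceeds a billion candidate conjunctions, which is why explicit enumeration is not a practical search strategy.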
Since an exhaustive search in this exponentially large feature space is infeasible, most relational feature learning systems for sequence labeling, such as tildeCRF, follow a greedy search strategy. In this thesis, we study the possibility of efficiently learning and using discriminative relational features for sequence labeling. We pose the problem as learning relational features/rules in the form of definite clauses. To this end, we identify classes of relational features based on their complexity and develop efficient learning approaches for the feature classes that we identify as relevant and useful. We first investigate the problem of learning simple conjunctions of basic (propositional) input features for any given position in a sequence. Features of this type are referred to as Simple Conjuncts (SC). We start by developing a greedy feature induction approach for sequence labeling, which incrementally discovers the best model by employing a greedy hill climbing search in the space of features. In each iteration of the search, we derive a candidate model from the previous model, combine it with transition rules, evaluate it in a custom implementation of an HMM, prune low-scoring candidate models (and their refinements) and select the best-scoring model. A few other approaches similar to ours, but in different learning settings, also learn composite features for sequence labeling. Although these approaches perform better than conventional approaches, they are greedy and therefore cannot guarantee optimal solutions. We therefore propose and develop a Hierarchical Kernels based approach for learning optimal SCs relevant for each output label.
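The greedy hill climbing search described above can be illustrated with a much simplified sketch. It refines a single conjunction one basic input at a time and stops when no refinement improves the score. The scoring function here is plain training accuracy on toy data, a stand-in assumed for illustration only; the thesis instead combines candidates with transition rules, evaluates them in a custom HMM implementation and prunes low-scoring candidates, none of which is reproduced here. All names (accuracy, greedy_conjunction, the toy data) are hypothetical.

```python
def accuracy(conjunction, data):
    """Fraction of examples where 'conjunction is satisfied' equals the label."""
    hits = sum((conjunction <= active) == label for active, label in data)
    return hits / len(data)


def greedy_conjunction(basic_inputs, data, score=accuracy):
    """Hill-climb from the empty conjunction, adding one input per step."""
    current = frozenset()
    best = score(current, data)
    while True:
        # Refinements: the current conjunction extended by one unused input.
        candidates = [current | {b} for b in sorted(basic_inputs) if b not in current]
        if not candidates:
            break
        scored = [(score(c, data), c) for c in candidates]
        top_score = max(s for s, _ in scored)
        if top_score <= best:
            break  # no refinement improves the score: local optimum reached
        # Ties are broken by input name order, since candidates are sorted.
        current = next(c for s, c in scored if s == top_score)
        best = top_score
    return current, best


# Toy data: (set of active basic inputs, label); the label is true iff a and b hold.
toy_data = [
    ({"a", "b"}, True), ({"a"}, False), ({"b"}, False), ({"a", "b", "c"}, True),
    ({"c"}, False), ({"a", "c"}, False), ({"b", "c"}, False),
]
found, found_score = greedy_conjunction({"a", "b", "c"}, toy_data)
print(sorted(found), found_score)  # ['a', 'b'] 1.0
```

On this toy data the greedy search happens to reach the best conjunction, but because it only ever keeps one refinement per step, it can in general stop at a local optimum.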
The Hierarchical Kernels approach, referred to as Hierarchical Kernel Learning for Structured Output Spaces (StructHKL), optimally and efficiently explores the hierarchical structure in the feature space for problems with structured output spaces such as sequence labeling. Here we extend the Hierarchical Kernel Learning (HKL) approach, originally introduced by Bach and Jawanpuria et al., to learn feature conjunctions for multi-class structured output classification. We build on the Support Vector Machines for Structured Output Spaces (StructSVM) model for sequence prediction problems and employ a p-norm hierarchical regularizer for input/observation features and a conventional 2-norm regularizer for the state transition features. The hierarchical regularizer penalizes large features and thereby selects a small set of short features. StructHKL learns the input features and their weights simultaneously in an efficient way.
We next consider the problem of learning complex relational features that are derived from inputs at multiple sequence positions. Although the StructHKL algorithm optimally solves the objective of learning the most discriminative SCs for sequence labeling, some theoretical requirements on the feature space make its application to learning complex relational features derived from inputs at different relative positions non-trivial and challenging. Therefore, we determine feature classes that can be composed to yield complex ones, with the goal of formulating efficient yet effective relational feature learning procedures. We identify a self-contained class of features called Absolute Features (AF), whose (unary/multiple) compositions yield complex relational features in another class called Composite Features (CF).
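Why a hierarchical regularizer favors short conjunctions can be seen in a small numeric sketch. It assumes the form of the regularizer commonly used in the HKL literature, namely a sum over all lattice nodes v of d_v times the p-norm of the weights of v and its descendants (the conjunctions that contain v). That form, the uniform choice d_v = 1, the value p = 1.5, the inclusion of the empty conjunction as the lattice root and the function names are all assumptions made for illustration, not details taken from the thesis.

```python
from itertools import combinations


def lattice(basic_inputs):
    """All conjunctions over the basic inputs, as frozensets (root = empty set)."""
    items = sorted(basic_inputs)
    return [
        frozenset(combo)
        for size in range(len(items) + 1)
        for combo in combinations(items, size)
    ]


def hierarchical_norm(weights, nodes, p=1.5, d=1.0):
    """Sum over nodes v of d * ||w restricted to v and its descendants||_p."""
    total = 0.0
    for v in nodes:
        # Descendants of v: every conjunction that contains all inputs of v.
        descendants = [u for u in nodes if v <= u]
        group = sum(abs(weights.get(u, 0.0)) ** p for u in descendants) ** (1.0 / p)
        total += d * group
    return total


nodes = lattice({"a", "b"})
short_only = {frozenset({"a"}): 1.0}      # unit weight on the short feature a
long_only = {frozenset({"a", "b"}): 1.0}  # unit weight on the long feature a AND b
print(hierarchical_norm(short_only, nodes))  # 2.0
print(hierarchical_norm(long_only, nodes))   # 4.0
```

The longer conjunction is a descendant of more nodes, so the same weight is counted in more groups and costs more, which pushes the learner toward a small set of short features.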
We seek to leverage optimal feature learning in all the steps of relational feature induction, which can be addressed either by (i) enumerating AFs and discovering their compositions (CFs) using StructHKL or by (ii) developing methods to learn optimal AFs (or CFs directly). As for the first option, the space of AFs is prohibitively large, which makes enumeration in that space impractical. We thus selectively filter AFs based on a relevance criterion (minimum support) and then use the StructHKL algorithm to learn compositions of the selected features. However, the partial ordering of AFs does not comply with the requirement of StructHKL that the descendant kernels in the partial ordering of features be summable in polynomial time. Consequently, leveraging StructHKL to optimally learn features in the space of AFs (and its super-space of CFs) is infeasible. For the second option, learning optimal CFs directly in the structured output classification model, we leverage a relational kernel that computes the similarity between instances in an implicit feature space of CFs. To this end, we employ the relational subsequence kernel at each sequence position (over a time window of inputs around the pivot position) for the classification model. While this way of modeling does not yield an interpretable model, relational subsequence kernels do efficiently capture the relational sequential information in the inputs.
Although the main contribution of the thesis is feature learning for sequence labeling, we have also contributed to two related problem domains, which we briefly introduce in the following paragraphs. In general classification settings (with or without structured outputs), where it is not feasible to ground all variables, dynamic programming approaches have limitations in performing inference.
We derive a satisfiability-based approach for fast and memory-efficient inference in general Horn clause settings, which prunes a major part of the possible groundings and performs inference in a small restricted space. Our approach finds a model in polynomial time, if one exists; otherwise it finds a most likely interpretation given the evidence.
We now briefly introduce our second related contribution, which is performing dimensionality reduction in classification settings by leveraging Hierarchical Kernel Learning. Many real-world classification problems are characterized by a large set of features that may contain a non-trivial amount of redundant and irrelevant information. Using the entire feature space as is often leads to over-fitting and therefore to less effective models. Dimensionality reduction techniques are typically used to reduce the dimension of the data either by projecting the features onto a collapsed space or by selecting a subset of features, both as preprocessing steps. These approaches suffer from the drawback that the dimensionality reduction objective and the classifier training objective are decoupled (optimized one after the other), and the approach used for dimensionality reduction is often greedy. A few approaches have recently been proposed to address the two tasks in a combined manner by attempting to solve an upper bound on a single objective function. However, the main drawback of these methods is that the number of reduced dimensions is not learned, but taken as an input to the system. In this work, we propose an integrated learning approach for non-parametric dimension reduction that projects the features from the original feature space to the space of disjunctions and discovers a sparse set of important disjunctions among them.
Here, in order to discover good disjunctive features, we employ hierarchical kernels, which efficiently and optimally perform feature selection and classifier training simultaneously in a maximum margin framework. We demonstrate the efficiency of our feature induction approaches in improving prediction accuracy in the domain of activity recognition. The proposed satisfiability-based inference approach and the dimensionality reduction approach are also evaluated on standard datasets.
Thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy of the Indian Institute of Technology Bombay, India and Monash University, Australia.