A cognition inspired approach to capturing data sequences
Thesis posted on 2017-02-23, 04:32, authored by Gunasinghe, Upuli Pushpika
Data in the form of sequences accumulate in many domains such as engineering, health, finance and marketing. Therefore, it is important that models and techniques are developed and utilised to effectively capture and analyse sequential information. Capturing sequences of variable length, capturing the substructure of sequences and extracting useful frequent sequential patterns are three main challenges in the domain of sequence analysis. Furthermore, it is important that the developed techniques can handle sequences with diverse characteristics. It can be observed that humans have the ability to effortlessly comprehend, capture and utilise sequential information in everyday cognitive tasks such as vision, language, motor control and problem solving. It has also been demonstrated in the literature that one of the key factors behind human intelligence is the ability to store and utilise sequences. The work undertaken and reported on in this thesis focuses on building learning models and techniques for sequence analysis through incorporating theories on human cognition. In addition, the application of the proposed techniques to effectively capture and analyse sequences in multiple and diverse application areas is also demonstrated. Addressing the problems of capturing frequent, variable-length sequences and their substructure, the Adaptive Suffix Trie (ASTrie) algorithm is first introduced in the thesis. The ASTrie algorithm incorporates the biologically inspired Hebbian learning rule into the suffix trie data structure and transforms it into a flexible learning tool for capturing sequences. Next, the Adaptive Suffix Tree (ASTree) algorithm is introduced as a space-efficient successor to the ASTrie. Both algorithms can capture discrete, long/short, dense/sparse and single-dimensional sequences. These are based on the suffix trie and suffix tree data structures, which can capture variable-length sequences and their substructure.
However, these are static data structures which store all suffixes of a given sequence. For most data analysis and data mining tasks, capturing all sequences is not required. Rather, the focus is on capturing the interesting or frequent patterns of occurrence. Most sequences indexed by time, such as time series data, are continuous in nature. In addition, elements in sequences could consist of multiple dimensions or attributes. In order to analyse continuous, multidimensional sequences, the ASTrie and ASTree algorithms are extended and the Continuous ASTrie (CASTrie) and Continuous ASTree (CASTree) algorithms are proposed. This is carried out by integrating into the ASTrie and ASTree algorithms a discretisation layer composed of the Growing Self Organising Map (GSOM), an unsupervised clustering algorithm which can handle continuous and multidimensional elements. One of the main practical problems in sequence analysis techniques is the high processing time requirement. This is due to the exponential increase in the number of sequences as the length of sequences increases. In order to increase the efficiency of sequence analysis techniques, a measure is introduced for evaluating the quality of sequences and extracting only a subset of high-quality sequences for analysis. The thesis also reports on the application and the efficiency investigations of the proposed models and techniques in diverse domains. First, the proposed algorithms and the quality measure are utilised in the domain of bioinformatics, for improving the efficiency of alignment-free sequence comparison methods. Next, a novel sequence-based text clustering model is proposed and it is demonstrated that the proposed model improves both the accuracy and the efficiency of the text clustering process while capturing better semantics. The proposed techniques are also applied to the analysis of geometric datasets at multiple levels of granularity.
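To illustrate the role of the discretisation layer mentioned above, the following is a minimal sketch of turning continuous, multidimensional elements into discrete symbols that a trie could then index. It deliberately substitutes a fixed nearest-centroid quantiser for the GSOM, which would learn and grow its own map from the data. The function name `nearest_centroid_symbols` and the hand-picked centroids are assumptions made for illustration only and are not taken from the thesis.

```python
import math


def nearest_centroid_symbols(points, centroids):
    """Map each continuous point to the index of its closest centroid.

    Stand-in for a learned map such as the GSOM: here the centroids are
    fixed and supplied by the caller (an assumption for illustration).
    """
    symbols = []
    for point in points:
        # Euclidean distance to every centroid; keep the closest index.
        closest = min(
            range(len(centroids)),
            key=lambda i: math.dist(point, centroids[i]),
        )
        symbols.append(closest)
    return symbols


# A short two-dimensional sequence and two hypothetical centroids.
readings = [(1.0, 0.5), (9.0, 11.0), (0.2, -1.0), (8.0, 8.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
discrete_sequence = nearest_centroid_symbols(readings, centroids)
```

The resulting list of integer symbols is a discrete sequence, so it can be passed to any structure that expects discrete elements.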
Finally, all components proposed in the thesis are brought together into a single framework for an integrated sequence capture and analysis suite of tools which could be used in diverse domains.
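As an illustration of the idea behind the ASTrie summarised above, the following is a minimal sketch of a suffix trie whose nodes carry weights that are strengthened when a path is observed and weakened otherwise, so that only frequent subsequences persist. The update rule shown (additive reinforcement with multiplicative decay and pruning) is only a Hebbian-style assumption; the abstract does not specify the rule used in the thesis. The class name `AdaptiveSuffixTrieSketch` and the parameters `max_depth`, `reinforce`, `decay` and `prune_below` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    weight: float = 0.0
    children: dict = field(default_factory=dict)


class AdaptiveSuffixTrieSketch:
    """Suffix trie with decaying node weights (illustrative assumption)."""

    def __init__(self, max_depth=4, reinforce=1.0, decay=0.1, prune_below=0.05):
        self.max_depth = max_depth      # longest subsequence stored
        self.reinforce = reinforce      # weight added when a path is seen
        self.decay = decay              # fraction of weight lost per observation
        self.prune_below = prune_below  # nodes weaker than this are removed
        self.root = Node()

    def observe(self, sequence):
        """Weaken all stored paths, then strengthen those in `sequence`."""
        self._decay(self.root)
        for start in range(len(sequence)):
            node = self.root
            # Insert the suffix starting at `start`, truncated to max_depth.
            for symbol in sequence[start:start + self.max_depth]:
                node = node.children.setdefault(symbol, Node())
                node.weight += self.reinforce

    def _decay(self, node):
        for symbol in list(node.children):
            child = node.children[symbol]
            child.weight *= 1.0 - self.decay
            self._decay(child)
            if child.weight < self.prune_below:
                del node.children[symbol]  # forget rarely seen paths

    def patterns(self, min_weight):
        """Return {subsequence tuple: weight} for sufficiently strong paths."""
        found = {}
        stack = [((), self.root)]
        while stack:
            prefix, node = stack.pop()
            for symbol, child in node.children.items():
                path = prefix + (symbol,)
                if child.weight >= min_weight:
                    found[path] = child.weight
                stack.append((path, child))
        return found


trie = AdaptiveSuffixTrieSketch(max_depth=3, decay=0.5, prune_below=0.2)
trie.observe("xy")          # seen once, then never again
for _ in range(3):
    trie.observe("ab")      # seen repeatedly
frequent = trie.patterns(min_weight=1.0)
```

In this sketch a path that stops recurring fades below the pruning threshold and is dropped, which is one simple way to avoid the static behaviour of a plain suffix trie that stores every suffix.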