Reason: Access restricted by the author. A copy can be requested for private research and study by contacting your institution's library service. This copy cannot be republished.
An autonomous incremental learning model for efficient mining of text data
Thesis posted on 2017-02-14, 02:55, authored by Matharage, Sumith Shantha
The proliferation of the World Wide Web has massively increased the availability of textual data in recent years, presenting a challenge for researchers to maximise the usage of this data with minimum human intervention. The field of text mining research has emerged in response, focusing on the development of new techniques to discover useful knowledge from these large volumes of text data. The main research challenges in the text mining field are: (a) the unstructured nature of text; (b) capturing semantic information; and (c) coping with a large number of words and the structure of natural language. Many techniques have been proposed in the text mining literature to address these challenges individually or in combination. The Self Organizing Feature Map (SOM) algorithm is one of the most successful and widely used of these techniques and has been extended for diverse text mining tasks. The primary aim of this thesis is to provide a more efficient autonomous incremental text clustering model. Improving the semantic aspects of the text clustering process is also examined. A Fast Scalable Growing Self Organizing Map (FSGSOM) algorithm is proposed to provide more efficient autonomous clustering of text, based on the dynamic topology preservation capabilities of the Growing Self Organizing Map (GSOM) algorithm. To enrich the semantic capabilities, a dynamic variable-length sequence based feature selection model is integrated into the feature selection phase. As an additional method of incorporating semantics, Wikipedia is used as a background information source in result interpretation. As most of the available text information is not stationary, an incremental learning model based on FSGSOM clustering is proposed to handle non-stationary text information.
The proposed model consists of a semi-continuous text processing model together with an evolving hierarchy of concepts to generalise and preserve the learning outcomes for future training. A template based document selection mechanism is utilised to form lateral connections across the different phases of learning. In summary, this thesis proposes a more efficient incremental text clustering and knowledge preservation model, contributing to the field of text mining research.