Automatic blog classification using concept-category vectorization
thesis
posted on 2017-02-06, 04:23 authored by Ayyasamy, Ramesh Kumar
Blog classification is the task of assigning blogs to pre-defined categories. It can be
approached by considering either the textual content or the surrounding features of the blogs. This
thesis focuses on the textual content, using the blog title and posts for
classification. Categorised blogs are maintained in a blog directory that serves
users searching for information online. Such online blog directories rely on human
indexers to categorise blog pages, and manual classification of blogs tends to be labour
intensive and time consuming. In related fields such as text mining and web mining, various
supervised, semi-supervised and unsupervised classification methods have been
proposed. These studies use a Bag-of-Words representation of text documents, indexed
with a term weighting scheme, which does not capture semantic relatedness.
In this thesis, we devise a novel framework for automatic topic-based blog classification,
called the Terms-to-Concepts-to-Category framework. Our framework uses Wikipedia’s
categorical index for the blog classification task. Wikipedia article titles are treated as concepts
and are used for term-to-concept substitution in order to compute a weighted semantic path connecting
concepts to their main category. An n-gram based concept extraction technique is then used
to extract concepts from blog titles and posts. Our study improves upon previous
studies by using a concept-based representation and a concept weighting scheme that embeds
semantic information for blog classification.
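The terms-to-concepts-to-category idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tiny `CONCEPT_TO_CATEGORY` table stands in for Wikipedia's categorical index, the greedy longest-match n-gram lookup is one simple way to extract concepts, and the `title_weight` parameter is a hypothetical stand-in for the concept weighting scheme, which the abstract does not specify.

```python
from collections import Counter

# Hypothetical toy index: concept (lowercased article title) -> main category.
# A real system would derive this from Wikipedia's category structure.
CONCEPT_TO_CATEGORY = {
    "machine learning": "Technology",
    "neural network": "Technology",
    "football": "Sports",
    "world cup": "Sports",
}


def extract_concepts(text, max_n=3):
    """Greedy longest-match n-gram lookup against the concept index.

    Tokenisation is a plain whitespace split, so punctuation is not handled.
    """
    tokens = text.lower().split()
    concepts = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest n-gram starting at position i first.
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in CONCEPT_TO_CATEGORY:
                concepts.append(candidate)
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return concepts


def classify(title, post, title_weight=2.0):
    """Score categories by matched concepts; title concepts count more (assumed)."""
    scores = Counter()
    for concept in extract_concepts(title):
        scores[CONCEPT_TO_CATEGORY[concept]] += title_weight
    for concept in extract_concepts(post):
        scores[CONCEPT_TO_CATEGORY[concept]] += 1.0
    return scores.most_common(1)[0][0] if scores else None
```

Matching multi-word n-grams before single words lets a phrase such as "world cup" map to one concept instead of two unrelated terms, which is the kind of semantic information a Bag-of-Words representation loses.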
We also propose clustering of blogs as an improvement to the Terms-to-Concepts-to-Category
framework, and show that clustering can improve the accuracy of blog classification by
enhancing our framework with two clustering approaches: coarse clustering and fine
clustering. The first approach uses a tripartite spectral graph with three
vertices: document context similarity, concept similarity and content similarity, to categorise blog posts.
In the second approach, to achieve better classification and to cluster multiple categories, we
enhance our framework using fuzzy c-means clustering and fuzzy similarity. Experimental
results show that our framework classifies multiple categories more accurately
than the existing tripartite and fuzzy c-means clustering techniques.
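To show why fuzzy clustering suits posts that belong to more than one category, the sketch below computes the standard fuzzy c-means membership of a single point in each cluster. It is not the thesis's method: Euclidean distance is assumed here in place of the fuzzy similarity mentioned above, and the names `fuzzy_memberships`, `assign_categories` and `threshold` are hypothetical.

```python
import math


def fuzzy_memberships(point, centers, m=2.0):
    """Standard fuzzy c-means membership of one point in each cluster.

    m > 1 is the fuzzifier; Euclidean distance is an assumption of this sketch.
    """
    dists = [math.dist(point, center) for center in centers]
    # A point lying exactly on a center belongs fully to that cluster
    # (shared equally if several centers coincide with it).
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
        for d_i in dists
    ]


def assign_categories(memberships, labels, threshold=0.3):
    """Keep every label whose membership reaches the (assumed) threshold."""
    return [label for u, label in zip(memberships, labels) if u >= threshold]
```

Because the memberships sum to one and are graded rather than binary, a post lying between two cluster centers can be assigned to both categories, which a hard clustering would not allow.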
History
Campus location
Australia
Principal supervisor
Siew Eugene
Additional supervisor 1
Saadat M. Alhashmi
Year of Award
2012
Department, School or Centre
School of Information Technology (Monash University Malaysia)