Monash University

Automatic blog classification using concept-category vectorization

thesis
posted on 2017-02-06, 04:23 authored by Ayyasamy, Ramesh Kumar
Blog classification is the task of assigning blogs to pre-defined categories. It can be approached through the textual content of blogs or through their surrounding features. This thesis focuses on textual content, using blog titles and posts for classification. Categorised blogs are maintained in blog directories that serve users searching for information online. Such online blog directories use human indexers to categorize blog pages, and manual classification of blogs tends to be labour intensive and time consuming. In related fields such as text mining and web mining, various supervised, semi-supervised and unsupervised classification methods have been proposed. These studies use a Bag-of-Words representation of text documents and index them with term weighting schemes, which do not capture semantic relatedness. In this thesis, we devise a novel framework for automatic topic-based blog classification, denoted the Terms-to-Concepts-to-Category framework. Our framework utilizes Wikipedia's categorical index for the blog classification task. Wikipedia article titles are treated as concepts and are used for term-to-concept substitution, in order to compute a weighted semantic path connecting concepts to their main category. An n-gram based concept extraction technique is then used to extract concepts from the titles and posts of blogs. Our study improves upon previous studies by using a concept-based representation and a concept weighting scheme that embeds semantic information for blog classification. We also propose clustering of blogs as an improvement to the Terms-to-Concepts-to-Category framework, and show that clustering can improve the accuracy of blog classification by enhancing our framework with two clustering approaches: coarse clustering and fine clustering.
The first approach uses a tripartite spectral graph whose three vertices are document context similarity, concept similarity and content similarity to categorize blog posts. In the second approach, to achieve better classification and to cluster multiple categories, we enhance our framework using fuzzy c-means clustering and fuzzy similarity. Experimental results show that our framework classifies multiple categories more accurately than the existing tripartite and fuzzy c-means clustering techniques.
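
The terms-to-concepts-to-category idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the toy `CONCEPT_INDEX`, its weights, the greedy longest-match rule and the function names `extract_concepts`, `score_categories` and `classify` are all assumptions made for illustration. In the thesis the concepts come from Wikipedia article titles and the weights from a weighted semantic path, neither of which is reproduced here.

```python
import re
from collections import defaultdict

# Hypothetical toy index: concept (standing in for a Wikipedia article title)
# mapped to assumed category weights. Real weights would come from the
# concept-to-category path described in the abstract.
CONCEPT_INDEX = {
    "machine learning": {"Technology": 0.9, "Science": 0.4},
    "neural network": {"Technology": 0.8, "Science": 0.5},
    "stock market": {"Business": 0.9},
    "learning": {"Education": 0.6},
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_concepts(text, index, max_n=3):
    """Scan the text for n-grams (longest first) that match a known concept."""
    tokens = tokenize(text)
    concepts = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in index:
                concepts.append(candidate)
                i += n  # skip the tokens consumed by this concept
                break
        else:
            i += 1  # no concept starts at this token
    return concepts


def score_categories(concepts, index):
    """Sum the assumed concept-to-category weights for each category."""
    scores = defaultdict(float)
    for concept in concepts:
        for category, weight in index[concept].items():
            scores[category] += weight
    return dict(scores)


def classify(title, post, index=CONCEPT_INDEX):
    """Return (best category or None, extracted concepts) for one blog entry."""
    # Title and post are scanned separately so no n-gram spans both.
    concepts = extract_concepts(title, index) + extract_concepts(post, index)
    scores = score_categories(concepts, index)
    if not scores:
        return None, concepts
    return max(scores, key=scores.get), concepts


if __name__ == "__main__":
    print(classify("Machine learning basics",
                   "A neural network can predict the stock market"))
```

Matching the longest n-gram first means that "machine learning" is treated as one concept rather than being split into the separate concept "learning", which is the kind of semantic information a Bag-of-Words representation loses.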

Campus location

Australia

Principal supervisor

Siew Eugene

Additional supervisor 1

Saadat M. Alhashmi

Year of Award

2012

Department, School or Centre

School of Information Technology (Monash University Malaysia)

Course

Doctor of Philosophy

Degree Type

DOCTORATE

Faculty

Faculty of Information Technology
