Reason: Access restricted by the author. A copy can be requested for private research and study by contacting your institution's library service. This copy cannot be republished.
Sentiment analysis under resource constraints
thesis
posted on 2017-02-16, 05:14 authored by Andiyakkal Rajendran, Balamurali
Sentiment Analysis (SA) deals with detecting the sentiment of textual content from
a speaker’s perspective. Both supervised and unsupervised approaches exist for this task.
Previous studies show that supervised approaches perform better than unsupervised approaches. However, supervised approaches heavily depend on the availability of training
data. We present two resource constraints with respect to training data for SA, one in
the language of operation and the other in the domain of operation. In this thesis, we
propose approaches which can alleviate the problems caused by these constraints.
Most research on SA is on English. This has skewed resource development in favour of the popular language of the web. Two SA resources are (i) sentiment lexicons and (ii) annotated corpora. In this thesis, we address the problem of unavailability or inadequacy of annotated corpora. We present an approach to leverage data from languages which have annotated data.
Our approach uses wordnet senses (also known as synsets) and is based on
the premise that semantics influences sentiment. We compared the results of sense-based and
lexeme-based features for sentiment analysis in a monolingual setting. We found that
sense-based features perform better than lexeme-based features. Also, as we move from
the lexeme feature space to the sense feature space, the dimensionality decreases. This dimensionality
reduction additionally addresses the data sparsity problem. Under this approach, we replace
synsets in the test set that are not present in the training set with similar synsets from
the training set using a wordnet similarity metric. A significant improvement in the classification accuracy is
obtained through this approach.
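The replacement step described above can be illustrated with a minimal sketch. It assumes a tiny hand-made hypernym tree and a simple path-based similarity in place of a real wordnet and whichever similarity metric the thesis uses; the synset names, `TOY_HYPERNYMS`, `path_similarity`, and `replace_unseen` are all hypothetical and serve only to show the idea.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy taxonomy (child synset -> parent synset); not real wordnet data.
TOY_HYPERNYMS: Dict[str, Optional[str]] = {
    "feeling.n": None,
    "joy.n": "feeling.n",
    "delight.n": "joy.n",
    "elation.n": "joy.n",
    "sadness.n": "feeling.n",
    "gloom.n": "sadness.n",
}


def path_to_root(synset: str) -> List[str]:
    """Return the synset followed by its ancestors up to the root."""
    path = []
    node: Optional[str] = synset
    while node is not None:
        path.append(node)
        node = TOY_HYPERNYMS.get(node)
    return path


def path_similarity(a: str, b: str) -> float:
    """1 / (1 + number of edges between a and b); 0.0 if they share no ancestor."""
    depth_a = {node: i for i, node in enumerate(path_to_root(a))}
    for j, node in enumerate(path_to_root(b)):
        if node in depth_a:
            return 1.0 / (1 + depth_a[node] + j)
    return 0.0


def replace_unseen(doc: List[str], train_vocab: Set[str]) -> List[str]:
    """Map each test synset unseen in training to its most similar training synset."""
    if not train_vocab:
        return list(doc)
    out = []
    for synset in doc:
        if synset in train_vocab:
            out.append(synset)
            continue
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(train_vocab), key=lambda t: path_similarity(synset, t))
        # Keep the original synset if nothing in training is related to it.
        out.append(best if path_similarity(synset, best) > 0 else synset)
    return out


train_vocab = {"delight.n", "gloom.n"}
test_doc = ["elation.n", "gloom.n"]
mapped_doc = replace_unseen(test_doc, train_vocab)
```

Here `elation.n` never occurs in training, so it is mapped to `delight.n`, its closest neighbour in the toy tree, while `gloom.n` is left unchanged because the classifier has already seen it.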
Sense identifiers for the same concept in different languages are identical if their
wordnets are developed using the merge method. We leverage this fact to address the problem
of unavailability or inadequacy of annotated corpora in a language. A document in the test
language (L_Test) is tested for polarity through a classifier trained on sense-marked and
polarity-labeled corpora of the training language (L_Train). We perform our experiments on
two widely spoken Indian languages, Hindi and Marathi. Results show that wordnet senses
can bridge the language gap for SA. However, sense annotation is an additional task in a
sentiment analysis system. Hence, to study the cost of annotation and its benefit to the
end application, we introduce an economic model. Our model suggests that annotation is
beneficial in terms of the performance achieved vis-a-vis the cost associated with developing
the system.
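The cross-lingual setup described above, in which a classifier trained on sense-marked documents in L_Train is applied to sense-marked documents in L_Test, might be sketched as follows. This is an illustration only: the synset identifiers (`S101` and so on), the toy documents, and the choice of a small multinomial Naive Bayes classifier are assumptions made for the example, not details taken from the thesis.

```python
import math
from collections import Counter, defaultdict
from typing import List, Tuple


def train_nb(docs: List[Tuple[List[str], str]]):
    """Collect per-label synset-ID counts from (synset_ids, label) pairs."""
    label_counts = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for ids, label in docs:
        label_counts[label] += 1
        feat_counts[label].update(ids)
        vocab.update(ids)
    return label_counts, feat_counts, vocab


def predict_nb(model, ids: List[str]) -> str:
    """Multinomial Naive Bayes with add-one smoothing; unseen IDs are ignored."""
    label_counts, feat_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        total = sum(feat_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for synset_id in ids:
            if synset_id in vocab:
                count = feat_counts[label][synset_id]
                score += math.log((count + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical sense-marked training documents in L_Train (e.g. Hindi).
l_train_docs = [
    (["S101", "S205"], "pos"),
    (["S309", "S205"], "neg"),
]
model = train_nb(l_train_docs)

# A hypothetical L_Test (e.g. Marathi) document, already mapped to the same synset IDs.
l_test_doc = ["S101"]
prediction = predict_nb(model, l_test_doc)
```

Because both languages share the synset-ID feature space in this sketch, no translation step appears anywhere: the only language-specific work is the sense annotation that turns words into IDs.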
Existing approaches to reduce resource constraints based on the language of operation
depend on machine translation. However, we question the efficacy of these approaches
since machine translation is very resource intensive. To test this, we convert data in a
resource-scarce language, RL_Test, to a resource-rich language, RL_Train, using various machine translation techniques. We perform our analysis on four European languages (English,
French, German, Russian). Our study shows that such a strategy ignores the fact that
a machine translation system is much more demanding in terms of resources than an SA
engine. Moreover, these approaches fail to take into account the divergence in the expression of sentiments across languages. We provide strong experimental evidence to prove
that the performance of such systems comes nowhere close to that obtained by using only
a few polarity annotated documents in the target language.
A drop in accuracy due to a shift in domain is a common problem for all NLP tasks,
including sentiment analysis. To address resource constraints in the domain of operation,
we present an approach for cross-domain sentiment analysis. The idea is to use a group
of classifiers trained on the source domain to generate noisy tagged data for the target
domain. A small amount of hand-labeled target domain data is then used to decide a
confidence threshold for filtering out the noise. The remaining data, which is tagged with
high confidence, is then used to train a high-accuracy sentiment tagger for the target
domain. When the training domain is similar to the target domain, our system performs on par
with or even better than a classifier trained using in-domain data.
Thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy of the Indian Institute of Technology Bombay, India and Monash University, Australia.
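The cross-domain procedure outlined in the abstract (noisy tagging by a group of source-domain classifiers, then threshold selection on a small hand-labeled target set) could look roughly like the sketch below. Everything concrete in it is an assumption for illustration: the keyword-based stand-in classifiers, the use of vote agreement as the confidence score, the `min_accuracy` criterion for choosing the threshold, and the toy texts.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Classifier = Callable[[str], str]


def vote(classifiers: Sequence[Classifier], text: str) -> Tuple[str, float]:
    """Majority label and the fraction of classifiers that agree with it."""
    votes = Counter(clf(text) for clf in classifiers)
    label, n = max(sorted(votes.items()), key=lambda kv: kv[1])
    return label, n / len(classifiers)


def pick_threshold(classifiers, labeled, min_accuracy: float = 0.9) -> float:
    """Lowest confidence at which the kept hand-labeled items reach min_accuracy."""
    scored = [(vote(classifiers, text), gold) for text, gold in labeled]
    for thr in sorted({conf for (_, conf), _ in scored}):
        kept = [(pred, gold) for (pred, conf), gold in scored if conf >= thr]
        accuracy = sum(pred == gold for pred, gold in kept) / len(kept)
        if accuracy >= min_accuracy:
            return thr
    return 1.0


def filter_noisy(classifiers, unlabeled: List[str], thr: float):
    """Keep only the automatically tagged target documents above the threshold."""
    kept = []
    for text in unlabeled:
        label, conf = vote(classifiers, text)
        if conf >= thr:
            kept.append((text, label))
    return kept


# Hypothetical stand-ins for classifiers trained on the source domain.
source_classifiers = [
    lambda t: "pos" if "good" in t else "neg",
    lambda t: "pos" if "great" in t or "good" in t else "neg",
    lambda t: "neg" if "bad" in t else "pos",
]

# Small hand-labeled target-domain sample (toy data).
target_labeled = [
    ("good battery", "pos"),
    ("bad screen", "neg"),
    ("light weight", "pos"),
]
threshold = pick_threshold(source_classifiers, target_labeled)

target_unlabeled = ["good zoom", "quiet shutter", "bad flash"]
target_training_data = filter_noisy(source_classifiers, target_unlabeled, threshold)
```

In this toy run, only documents on which all three classifiers agree survive the filter; the resulting `target_training_data` is what would then be used to train the target-domain tagger.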
History
Campus location
Australia
Principal supervisor
Ingrid Zukerman
Year of Award
2014
Department, School or Centre
Information Technology (Monash University Clayton)
Additional Institution or Organisation
Indian Institute of Technology Bombay, India (IITB)