Inferring social behavior and interaction on Twitter by combining metadata about users & messages

Cheong

Marc

2017

Social media, and microblogging in particular, is fast becoming important in today's world. A good example is Twitter, a rich source of readily available information by, and about, people. Real-life happenings are constantly reported on Twitter; thus, it functions as a 'mirror' of the real world. These happenings range from the banal (individual thoughts, opinions, and observations), to the dramatic (celebrity announcements, scandals, and Internet memes), to real-world events with serious consequences (riots, coordination during natural disasters, response to terrorism, and political dissent). Most extant literature treats the message and user domains on Twitter independently of one another: research typically focuses on a single domain and consists mostly of specialized techniques, such as opinion and sentiment mining, community detection, social network analysis, and trend mining, which are merely applied to Twitter data. Rarely are metadata from both the user and message domains analyzed in tandem. My thesis uses metadata from both Twitter users and messages as its raw material and transforms them into inferences that reveal hidden patterns. Such patterns and inferences, in turn, can be combined with data mining techniques to unearth a wealth of knowledge about Twitter users in particular, and people in general. In this thesis, I investigate two aspects. First, I introduce a new framework for the large-scale gathering and collation of Twitter user and message metadata. Second, I introduce and investigate new inference algorithms, inspired by current literature, that combine metadata from both domains, an approach hitherto absent in research.
In doing so, I contributed novel inference algorithms, along with frameworks that harvest raw metadata from Twitter to provide ample data for evaluating those algorithms. From the wealth of metadata in the two domains on Twitter, my new algorithms produce three categories of inferences: social demographics, exhibition of online presence by users, and messaging (tweeting) behavior of users. Using these new inference algorithms, I tested my findings on a large-scale real-world dataset, collected from Twitter using the data gathering frameworks I developed. Consequently, I was able to draw conclusions about the current 'state of the Twitterverse'. Following that, I introduced a novel application of pattern detection and clustering to the inferences generated by my algorithms, in order to detect latent traits and identify non-obvious patterns across the three categories of inferences. To conclude my thesis, I showed that my approaches provide useful insights into serious real-world phenomena captured on Twitter (environmental activism, terrorism events, and public disorder), all of which are of interest to researchers, governments, and the media alike. Using the approaches proposed throughout my thesis, I was able to discover the behavior of people in the real world, and illustrated how such real-life behavior translates into expression and social communication in the online realm. The results of these studies led to a better understanding of who social media consumers are, how they communicate online, and how the behavioral patterns of these users 'mirror' the real world.