Comparing Topological Communities and Communities of Interest Using Topic Modeling

Abstract

In this thesis I propose repurposing Latent Dirichlet Allocation (LDA), a topic modeling algorithm, to discover communities of interest. To test the approach, I apply it to the social news and entertainment website reddit. I then compare the composition of communities of interest with that of topological communities, that is, communities discovered from the topology of a social graph. To do so, I apply both methods to the Enron email corpus and compare their results using cluster evaluation methods.
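
As a rough illustration of the kind of cluster evaluation involved, the sketch below computes normalised mutual information (NMI) between two community assignments and the Jaccard index between two member sets. It is a minimal sketch under stated assumptions, not the implementation used in the thesis: every member is assumed to belong to exactly one community in each assignment (LDA itself yields mixed memberships, so a hard assignment would have to be derived first), NMI is normalised by the arithmetic mean of the two entropies (one of several common conventions), and the function names are hypothetical.

```python
from collections import Counter
from math import log


def normalised_mutual_information(labels_a, labels_b):
    """NMI between two hard community assignments over the same members.

    Assumption: normalised by the arithmetic mean of the two entropies,
    i.e. NMI = 2 * I(A; B) / (H(A) + H(B)).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both assignments must cover the same members")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    # Mutual information from the joint and marginal label frequencies.
    mutual_info = sum(
        (c / n) * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in count_ab.items()
    )
    entropy_a = -sum((c / n) * log(c / n) for c in count_a.values())
    entropy_b = -sum((c / n) * log(c / n) for c in count_b.values())

    # Both assignments put everyone in a single community: treat as identical.
    if entropy_a + entropy_b == 0:
        return 1.0
    return 2 * mutual_info / (entropy_a + entropy_b)


def jaccard_index(members_a, members_b):
    """Jaccard index between the member sets of two communities."""
    a, b = set(members_a), set(members_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Under these assumptions, two assignments that group members identically score an NMI of 1 even if the community labels differ, while statistically independent assignments score 0.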

Keywords

topic modeling; latent Dirichlet allocation; LDA; machine learning; unsupervised learning; communities; community of interest; topological community; graph; social graph; reddit; Enron; mutual information; normalised mutual information; NMI; Jaccard index; cluster validation; information theory

Citation