NLP - Reddit user Analytics

Using NLP to cluster reddit user comments by topics

My goal for this project was to apply NLP techniques I had recently learned. I wanted to create a tool that analyzes a media platform on a macro level to see what people are, or were, talking about. A huge amount of Reddit data is available online as zip files, so I used a portion of it: about 15 million user comments from 2005 to 2012. I loaded this data into MongoDB and used AWS to process, prepare, and model it. Specifically, I built a pipeline that cleans the user comments with NLTK, builds a matrix of term frequency–inverse document frequency (TF-IDF) features, and uses Latent Dirichlet Allocation (LDA) to cluster the comments into topics. LDA is a model that treats each document (here, a user comment) as a mixture of several topics, with each word in the document attributed to one of those topics. Some of the major topics found were:

Politics, War, Programming, Internet, Religion
Election / voting, Life/Family, Globalism, Economy, Science

After identifying every user's main topics, I created a web application that suggests new connections and topics for an individual user.
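
One way such suggestions could be computed is to compare users by their topic weights. The sketch below is an assumption for illustration only: the user names, the topic vectors, and the use of cosine similarity are hypothetical and not taken from the project.

```python
import math

# Hypothetical per-user topic weights, e.g. LDA output averaged over a user's comments.
topic_names = ["Politics", "Programming", "Science"]
user_topics = {
    "alice": [0.70, 0.20, 0.10],
    "bob": [0.65, 0.25, 0.10],
    "carol": [0.05, 0.15, 0.80],
}


def cosine(a, b):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def suggest_connections(user, k=1):
    """Return the k other users whose topic mix is closest to this user's."""
    scores = [
        (other, cosine(user_topics[user], weights))
        for other, weights in user_topics.items()
        if other != user
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in scores[:k]]


def main_topic(user):
    """Return the name of the user's highest-weighted topic."""
    weights = user_topics[user]
    return topic_names[weights.index(max(weights))]


print(suggest_connections("alice"))
print(main_topic("carol"))
```

Cosine similarity is used here because it compares the shape of two topic mixtures rather than their magnitude. With millions of comments, a real application would likely need an indexed or approximate nearest-neighbor search instead of this pairwise loop.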

Short summary:

Data used for this project: 25 GB of Reddit text data / 15 million comments

Objectives:
Create a tool to summarize user activities using NLP
Topics of interest
Connection to users with similar interest
Suggestions for new connections and topics

Tools used for storage, processing, visualizations:

MongoDB
AWS
D3, Flask

Data Exploring / Topic Modeling:

Text cleaning with NLTK
Count / TFIDF Vectorizer
LSA / SVD
LDA
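
As a rough illustration of the text-cleaning step, the sketch below lowercases a comment, strips URLs, tokenizes, drops stop words, and stems with NLTK. The specific cleaning rules, the tiny stop-word set, and the function name `clean_comment` are assumptions made for this example. A fuller pipeline would probably use NLTK's stop-word corpus, which requires a separate download.

```python
import re

from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Tiny illustrative stop-word set (an assumption; not NLTK's full list).
STOP_WORDS = {
    "a", "an", "and", "the", "is", "are", "was", "to",
    "of", "in", "it", "this", "that", "i",
}

tokenizer = RegexpTokenizer(r"[a-z]+")  # keep runs of lowercase letters only
stemmer = PorterStemmer()


def clean_comment(text):
    """Lowercase, strip URLs, tokenize, drop stop words and short tokens, then stem."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    tokens = tokenizer.tokenize(text)
    return [
        stemmer.stem(token)
        for token in tokens
        if token not in STOP_WORDS and len(token) > 2
    ]


tokens = clean_comment("The senators are voting on this: http://example.com/bill")
print(tokens)
```

The cleaned tokens can be joined back into a string and passed to a count or TF-IDF vectorizer.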
