Game of Thrones Fan Comments: A Sentiment Analysis

WARNING: This README and this repository contain spoilers for the Game of Thrones television series.

What can Reddit comments from fans of HBO’s Game of Thrones tell us about how their emotional responses to the show’s characters changed over time? This repository contains scripts and data files from my attempt to answer that question.

User Guide

Requirements

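This project requires that you have the following software installed on your system, both of which are used in the installation steps below:

  • Git for cloning the repository
  • Python with Pipenv for installing dependencies and running the Python scripts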

In addition, the following software is recommended for using some of the project’s optional features:

  • yarn for running scripts and managing JavaScript dependencies
  • R and RStudio for quickly analyzing the data and making visualizations

Installation

Clone this repository using Git:

$ git clone https://github.com/CMessinides/thrones-sentiment-analysis.git

Then navigate into the project directory:

$ cd thrones-sentiment-analysis

Use Pipenv to install Python dependencies:

$ pipenv install --dev

Use Yarn (or npm) to install JavaScript dependencies:

$ yarn install
# npm install

Configuration

If you plan on scraping comments with scripts/scrape.py (see “Scripts”), you will have to configure authentication for the Reddit API.

First, you will need to register a script app through your Reddit account. You can find instructions here. Be sure to note your app's client ID and client secret when you are done.

Next, create a .env file in the root project directory. Add your Reddit username and password, along with the client ID and secret for the app you created earlier, as variables:

# .env
REDDIT_USERNAME= # your username
REDDIT_PASSWORD= # your password
REDDIT_CLIENT_ID= # your client ID
REDDIT_CLIENT_SECRET= # your client secret
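
A minimal sketch of how a script could check these settings before calling the Reddit API. It assumes the variables are already in the process environment (Pipenv loads a project's .env file automatically for `pipenv run` and `pipenv shell`). The helper name `load_reddit_credentials` is hypothetical and is not taken from this repository.

```python
import os

# Variable names taken from the .env example above.
REQUIRED_VARS = (
    "REDDIT_USERNAME",
    "REDDIT_PASSWORD",
    "REDDIT_CLIENT_ID",
    "REDDIT_CLIENT_SECRET",
)


def load_reddit_credentials(env=None):
    """Return the four Reddit settings, or raise if any is missing or empty.

    Hypothetical helper for illustration; not part of this repository.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
    if missing:
        raise RuntimeError("Missing .env variables: " + ", ".join(missing))
    return {name: env[name].strip() for name in REQUIRED_VARS}
```

Failing early with the names of the missing variables is usually easier to debug than an authentication error returned later by the API.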

You can optionally add a LOG_LEVEL variable to this file to configure logging in the Python scripts (see the list of Python log levels). LOG_LEVEL=20, which corresponds to Python's INFO level, will give you detailed logs for debugging or for monitoring the scripts' progress.

# .env
LOG_LEVEL=20 # INFO
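
As a rough illustration of how a numeric LOG_LEVEL can be applied with Python's standard `logging` module, the sketch below reads the variable and sets the root logger's level. The function name `configure_logging` and the fallback to WARNING are assumptions for this example; the project's scripts may handle the variable differently.

```python
import logging
import os


def configure_logging(env=None):
    """Set the root logger level from LOG_LEVEL and return the level used.

    Hypothetical helper; falls back to WARNING (30) when LOG_LEVEL is
    unset or not a number.
    """
    env = os.environ if env is None else env
    # Tolerate an inline comment such as "20 # INFO" in case the loader keeps it.
    raw = env.get("LOG_LEVEL", "").split("#", 1)[0].strip()
    level = int(raw) if raw.isdigit() else logging.WARNING
    logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s")
    logging.getLogger().setLevel(level)
    return level
```

With LOG_LEVEL=20, calls to `logging.info(...)` are emitted; with the variable unset, only warnings and errors are.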

Running scripts

For convenience, you can use Yarn or npm as a script runner for the Python scripts. You can run all scripts in sequence or each individually.

# Run scripts/scrape.py (WARNING: This is slow!)
yarn scripts:scrape

# Run scripts/analyze.py
yarn scripts:analyze

# Run scripts/transform.py
yarn scripts:transform

# Run all scripts in sequence
yarn scripts:all

You can also run the Python scripts directly using Pipenv (e.g. pipenv run python scripts/scrape.py).

Project Overview

Data sets (./data)

  • comments.csv contains the corpus of Reddit comments used for this analysis. There are approximately 300,000 comments from episode discussion threads posted in three Game of Thrones-related subreddits: /r/gameofthrones (all seasons), /r/asoiaf (all seasons), and /r/freefolk (Season 8 only). These data are collected automatically by scripts/scrape.py (see “Scripts”).

  • threads.csv contains the list of episode discussion threads to scrape for comments. This file is updated by hand.

  • characters.json contains the list of Game of Thrones characters to search for in the comments. The list includes the characters’ canonical (i.e. “real”) names and any corresponding aliases (first names, nicknames, common misspellings, etc.) they might go by. This file is updated by hand.

  • comment-sentiments.csv records the sentiment analysis score of each comment. This score is calculated using VADER, a sentiment analysis tool designed specifically for social media content like Reddit comments. Each row includes the comment ID along with the compound sentiment score output by VADER. This file is updated automatically by scripts/analyze.py (see “Scripts”).

  • character-mentions.csv records every mention of a character in the corpus of comments. Each row includes the character’s canonical name, the ID of the comment that contains the mention, the sentence in the comment that mentions the character, and the compound sentiment score for that sentence. This file is updated automatically by scripts/analyze.py (see “Scripts”).

  • mean-scores.json records the mean compound sentiment score of every episode, both overall and by character, along with a confidence interval for each mean. For each episode, it also includes each character's share of the total character mentions, limited to characters with at least 30 mentions. This file is updated automatically by scripts/transform.py (see “Scripts”).
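
The per-episode summary stored in mean-scores.json can be sketched roughly as follows. This README does not state how the confidence intervals are computed, so the 95% normal-approximation interval here is an assumption, as are the function name `summarize_episode`, the input shape, and the choice to divide by all mentions in the episode when computing proportions.

```python
import math
import statistics
from collections import defaultdict

MIN_MENTIONS = 30  # threshold stated in this README
Z_95 = 1.96        # assumption: 95% normal-approximation interval


def summarize_episode(mentions):
    """Summarize one episode.

    mentions: iterable of (character, compound_score) pairs (assumed shape).
    Returns {character: {"mean", "ci", "proportion"}} for characters with
    at least MIN_MENTIONS mentions.
    """
    by_character = defaultdict(list)
    for character, score in mentions:
        by_character[character].append(score)

    total = sum(len(scores) for scores in by_character.values())
    summary = {}
    for character, scores in by_character.items():
        n = len(scores)
        if n < MIN_MENTIONS:
            continue  # too few mentions for a stable estimate
        mean = statistics.mean(scores)
        margin = Z_95 * statistics.stdev(scores) / math.sqrt(n)
        summary[character] = {
            "mean": mean,
            "ci": (mean - margin, mean + margin),
            "proportion": n / total,
        }
    return summary
```

Characters below the threshold are dropped from the output but still count toward the total, so the reported proportions need not sum to 1.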

Scripts (./scripts)

  • scrape.py uses the Reddit API to collect comments from the episode discussion threads in data/threads.csv (see “Data sets”). Note that due to time and storage limitations, this script does not collect all comments from every thread. This script is responsible for updating the corpus of comments in data/comments.csv.

  • analyze.py uses VADER to calculate sentiment scores from all the scraped comments as well as from any sentence in those comments which mentions one of the characters in data/characters.json (see “Data sets”). This script is responsible for updating the records in data/comment-sentiments.csv and data/character-mentions.csv.

  • transform.py uses the prior sentiment analysis to produce a summary of mean scores, confidence intervals, and proportions of character mentions that went to each character for each episode. This script is responsible for updating data/mean-scores.json (see “Data sets”).
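
To make the character-mention step more concrete, here is a minimal sketch of matching aliases to canonical names sentence by sentence. The schema shown for the character list (canonical name mapped to a list of aliases), the sample entries, the naive sentence splitter, and the function names are all assumptions for illustration; the actual layout of data/characters.json and the logic in scripts/analyze.py may differ. The scoring function is passed in so the example runs without extra packages.

```python
import re

# Hypothetical schema and sample entries: canonical name -> list of aliases.
CHARACTERS = {
    "Daenerys Targaryen": ["Daenerys", "Dany", "Khaleesi"],
    "Jon Snow": ["Jon"],
}


def build_patterns(characters):
    """Compile one case-insensitive, whole-word pattern per character."""
    patterns = {}
    for canonical, aliases in characters.items():
        # Longest names first so full names are tried before short aliases.
        names = sorted({canonical, *aliases}, key=len, reverse=True)
        alternation = "|".join(re.escape(name) for name in names)
        patterns[canonical] = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return patterns


def find_mentions(comment, patterns, score):
    """Yield (canonical name, sentence, score) for each sentence naming a character."""
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", comment.strip())
    for sentence in sentences:
        for canonical, pattern in patterns.items():
            if pattern.search(sentence):
                yield canonical, sentence, score(sentence)
```

With the vaderSentiment package installed, the `score` argument could be `lambda s: SentimentIntensityAnalyzer().polarity_scores(s)["compound"]`, which returns VADER's compound score for the sentence.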

Citations

Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

License

The datasets in this repository are available under the Creative Commons Attribution 4.0 International License, and the code is available under the MIT License.
