WARNING: This README and this repository contain spoilers for the Game of Thrones television series.
What can Reddit comments from fans of HBO’s Game of Thrones tell us about how their emotional responses to the show’s characters changed over time? This repository contains scripts and data files from my attempt to answer that question.
This project requires that you have the following software installed on your system:
- Git
- Python ≥ 3.7
- Pipenv
In addition, the following software is recommended for using some of the project’s optional features:
- Yarn for running scripts and managing JavaScript dependencies
- R and RStudio for quickly analyzing the data and making visualizations
Clone this repository using Git:
$ git clone https://github.com/CMessinides/thrones-sentiment-analysis.git
Then navigate into the project directory:
$ cd thrones-sentiment-analysis
Use Pipenv to install Python dependencies:
$ pipenv install --dev
Use Yarn (or npm) to install JavaScript dependencies:
$ yarn install
# npm install
If you plan on scraping comments with scripts/scrape.py (see “Scripts”), you will have to configure authentication for the Reddit API.
First, you will need to register a script app through your Reddit account. You can find instructions here. Be sure to note your app's client ID and client secret when you are done.
Next, create a .env file in the root project directory. Add your Reddit username and password, along with the client ID and secret for the app you created earlier, as variables:
# .env
REDDIT_USERNAME= # your username
REDDIT_PASSWORD= # your password
REDDIT_CLIENT_ID= # your client ID
REDDIT_CLIENT_SECRET= # your client secret
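If you are curious how a file like this can be consumed at runtime, here is a minimal, illustrative sketch of a .env reader in stdlib Python. This is only an assumption about the mechanism; the project's actual scripts may load these variables differently (for example, via the python-dotenv package).

```python
from pathlib import Path

def load_env(path=".env"):
    """Minimal .env reader (illustrative sketch only; the real scripts
    may use a library such as python-dotenv instead)."""
    env = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        key, _, value = line.partition("=")
        # Drop any trailing inline comment and surrounding whitespace
        env[key.strip()] = value.split("#", 1)[0].strip()
    return env
```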
You can optionally add a LOG_LEVEL variable to this file to configure logging in your Python scripts (list of Python log levels). LOG_LEVEL=20 (INFO) will give you detailed logs for debugging or monitoring the progress of this project's Python scripts.
# .env
LOG_LEVEL=20 # INFO
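As a sketch of how a script can honor this variable, the standard-library logging module accepts Python's numeric levels (10 = DEBUG, 20 = INFO, 30 = WARNING) directly. This is an assumed wiring for illustration; the project's scripts may configure logging differently.

```python
import logging
import os

# Read LOG_LEVEL from the environment, defaulting to WARNING (30) if unset.
# (Hypothetical sketch -- the actual scripts may set this up differently.)
level = int(os.environ.get("LOG_LEVEL", "30"))
logging.basicConfig(level=level, format="%(levelname)s %(message)s")

logger = logging.getLogger("scrape")
logger.info("visible when LOG_LEVEL is 20 (INFO) or lower")
logger.debug("visible only when LOG_LEVEL is 10 (DEBUG) or lower")
```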
For convenience, you can use Yarn or npm as a script runner for the Python scripts. You can run all scripts in sequence or each individually.
# Run scripts/scrape.py (WARNING: This is slow!)
yarn scripts:scrape
# Run scripts/analyze.py
yarn scripts:analyze
# Run scripts/transform.py
yarn scripts:transform
# Run all scripts in sequence
yarn scripts:all
You can also run the Python scripts directly using Pipenv (e.g. pipenv run python scripts/scrape.py).
- comments.csv contains the corpus of Reddit comments used for this analysis. There are approximately 300,000 comments from episode discussion threads posted in three Game of Thrones-related subreddits: /r/gameofthrones (all seasons), /r/asoiaf (all seasons), and /r/freefolk (Season 8 only). These data are collected automatically by scripts/scrape.py (see “Scripts”).
- threads.csv contains the list of episode discussion threads to scrape for comments. This file is updated by hand.
- characters.json contains the list of Game of Thrones characters to search for in the comments. The list includes each character’s canonical (i.e. “real”) name and any corresponding aliases (first names, nicknames, common misspellings, etc.) they might go by. This file is updated by hand.
- comment-sentiments.csv records the sentiment analysis score of each comment. This score is calculated using VADER, a sentiment analysis tool designed specifically for social media content like Reddit comments. Each row includes the comment ID along with the compound sentiment score output by VADER. This file is updated automatically by scripts/analyze.py (see “Scripts”).
- character-mentions.csv records every mention of a character in the corpus of comments. Each row includes the character’s canonical name, the ID of the comment that contains the mention, the sentence in the comment that mentions the character, and the compound sentiment score for that sentence. This file is updated automatically by scripts/analyze.py (see “Scripts”).
- mean-scores.json records the mean compound sentiment score of every episode, both overall and by character. It also contains confidence intervals for each mean, and for each episode it includes the proportion of total character mentions that went to each character (among characters with at least 30 mentions). This file is updated automatically by scripts/transform.py (see “Scripts”).
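To make the character-mentions idea concrete, here is a simplified, hypothetical sketch of how sentences might be matched against canonical names and aliases. The character data and the naive sentence splitter below are illustrative stand-ins, not the actual contents of characters.json or the logic in scripts/analyze.py.

```python
import re

# Hypothetical alias table in the spirit of characters.json
# (illustrative values only).
characters = {
    "Daenerys Targaryen": ["daenerys", "dany", "khaleesi"],
    "Jon Snow": ["jon"],
}

def find_mentions(comment):
    """Yield (canonical_name, sentence) pairs for each character mention."""
    # Naive sentence splitter -- good enough for an illustration.
    sentences = re.split(r"(?<=[.!?])\s+", comment)
    for sentence in sentences:
        words = set(re.findall(r"[a-z']+", sentence.lower()))
        for name, aliases in characters.items():
            if words & set(aliases):
                yield (name, sentence)

mentions = list(find_mentions("Dany burned it all. Jon did nothing!"))
```

In the real pipeline, each matched sentence would then be scored with VADER to produce the compound sentiment recorded in character-mentions.csv.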
- scrape.py uses the Reddit API to collect comments from the episode discussion threads in data/threads.csv (see “Data sets”). Note that due to time and storage limitations, this script does not collect all comments from every thread. This script is responsible for updating the corpus of comments in data/comments.csv.
- analyze.py uses VADER to calculate sentiment scores for all the scraped comments, as well as for any sentence in those comments that mentions one of the characters in data/characters.json (see “Data sets”). This script is responsible for updating the records in data/comment-sentiments.csv and data/character-mentions.csv.
- transform.py uses the prior sentiment analysis to produce a summary of mean scores, confidence intervals, and proportions of character mentions for each episode. This script is responsible for updating data/mean-scores.json (see “Data sets”).
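The summarization step can be sketched with standard-library tools: a mean plus a normal-approximation 95% confidence interval over a list of compound scores. This is only an assumption about the statistics involved; scripts/transform.py may compute its intervals differently.

```python
from math import sqrt
from statistics import mean, stdev

def summarize(scores):
    """Return the mean compound score with a normal-approximation 95% CI.
    (Illustrative sketch; transform.py may use a different method.)"""
    m = mean(scores)
    margin = 1.96 * stdev(scores) / sqrt(len(scores))
    return {"mean": m, "ci": [m - margin, m + margin]}

summary = summarize([0.4, -0.2, 0.6, 0.1, 0.3])
```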
Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
The datasets in this repository are available under the Creative Commons Attribution 4.0 International License, and the code is available under the MIT License.