Developing and Comparing Similarity Functions for the News Recommender Domain Using Human Judgements
- Scripts for processing the 2017 version of the Washington Post Corpus, provided by TREC (https://trec.nist.gov/data/wapost/)
- Scripts for downloading embedded images
- Various methods for computing similarity over a number of article properties (see sim/similarity_functions.py)
- Various methods for creating graphs of the resulting dataset/sample, e.g. graphs of date of publication distributions
- Methods for sampling using a multi-stage approach that ignores articles with NaN values, excludes articles with corrupted images, and more
- Various methods for processing images, e.g. checking for corrupted images using multiprocessing (takes about 10 minutes for 650K images on an SSD)
And many more small methods to help explore the dataset.
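The similarity functions in sim/similarity_functions.py are not shown here, but two common choices for comparing article properties can be sketched as follows. The function names (`jaccard_similarity`, `date_similarity`) and the 7-day half-life are illustrative assumptions, not the repo's actual API:

```python
import math
from datetime import date

def jaccard_similarity(a, b):
    """Set overlap between two keyword/tag collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are trivially identical
    return len(a & b) / len(a | b)

def date_similarity(d1, d2, half_life_days=7.0):
    """Exponential decay on publication-date distance: articles published
    far apart score near 0, same-day articles score 1."""
    delta = abs((d1 - d2).days)
    return math.exp(-math.log(2) * delta / half_life_days)
```

For example, two articles sharing one of three distinct tags score 1/3, and articles published one half-life (7 days) apart score 0.5.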
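The multi-stage sampling described above (ignore articles with NaNs, exclude those with corrupted images, then draw the sample) might look roughly like this. The function name, field names, and dict-based article representation are assumptions for illustration; the repo's actual implementation may work on a pandas DataFrame instead:

```python
import random

def multi_stage_sample(articles, n, corrupted_ids,
                       required_fields=("title", "body"), seed=42):
    """Multi-stage sampling sketch over a list of article dicts."""
    # Stage 1: ignore articles with missing (NaN-like) required fields
    clean = [a for a in articles
             if all(a.get(f) is not None for f in required_fields)]
    # Stage 2: exclude articles whose embedded images are known to be corrupted
    clean = [a for a in clean if a["id"] not in corrupted_ids]
    # Stage 3: draw a reproducible random sample from the remaining pool
    rng = random.Random(seed)
    return rng.sample(clean, min(n, len(clean)))
```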
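The parallel corruption check can be sketched with the standard library alone. This version uses a cheap JPEG marker heuristic (start-of-image / end-of-image bytes) rather than fully decoding each file; the repo's actual check may instead attempt a real decode, e.g. via Pillow. Function names are illustrative:

```python
import os
from multiprocessing import Pool

JPEG_SOI = b"\xff\xd8"  # JPEG start-of-image marker
JPEG_EOI = b"\xff\xd9"  # JPEG end-of-image marker

def is_corrupted(path):
    """Heuristic: a well-formed JPEG starts with SOI and ends with EOI."""
    try:
        with open(path, "rb") as f:
            head = f.read(2)
            f.seek(-2, os.SEEK_END)
            tail = f.read(2)
        return head != JPEG_SOI or tail != JPEG_EOI
    except OSError:
        return True  # unreadable or too-short files count as corrupted

def find_corrupted(paths, workers=8):
    """Check many images in parallel; returns the paths flagged as corrupted."""
    with Pool(workers) as pool:
        flags = pool.map(is_corrupted, paths)
    return [p for p, bad in zip(paths, flags) if bad]
```

Because the per-file check is I/O-bound and trivially independent, spreading it over worker processes is what makes a 650K-image scan feasible in minutes.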