Developing and Comparing Similarity Functions for the News Recommender Domain Using Human Judgements
- Scripts for processing the 2017 version of the Washington Post Corpus, provided by TREC (https://trec.nist.gov/data/wapost/)
- Scripts for downloading embedded images
- Various methods for computing similarity over a number of article properties (see sim/similarity_functions.py)
- Various methods for creating graphs of the resulting dataset/sample, e.g. graphs of date of publication distributions
- Methods for sampling using a multi-stage approach that ignores articles with NaN values, excludes articles with corrupted images, and more
- Various methods for processing images, e.g. checking for corrupted images using multiprocessing (takes about 10 minutes for 650K images on an SSD)
And many more small methods to help explore the dataset.
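The similarity functions in sim/similarity_functions.py are not shown here, but two common choices for comparing article properties can be sketched as follows. The function names (`jaccard_similarity`, `date_similarity`) and the 7-day half-life are illustrative assumptions, not the repo's actual API:

```python
import math
from datetime import date

def jaccard_similarity(a, b):
    """Set overlap between two keyword/tag collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are trivially identical
    return len(a & b) / len(a | b)

def date_similarity(d1, d2, half_life_days=7.0):
    """Exponential decay on publication-date distance: articles published
    far apart score near 0, same-day articles score 1."""
    delta = abs((d1 - d2).days)
    return math.exp(-math.log(2) * delta / half_life_days)
```

For example, two articles sharing one of three distinct tags score 1/3, and articles published one half-life (7 days) apart score 0.5.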
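The multi-stage sampling described above (ignore articles with NaNs, exclude those with corrupted images, then draw the sample) might look roughly like this. The function name, field names, and dict-based article representation are assumptions for illustration; the repo's actual implementation may work on a pandas DataFrame instead:

```python
import random

def multi_stage_sample(articles, n, corrupted_ids,
                       required_fields=("title", "body"), seed=42):
    """Multi-stage sampling sketch over a list of article dicts."""
    # Stage 1: ignore articles with missing (NaN-like) required fields
    clean = [a for a in articles
             if all(a.get(f) is not None for f in required_fields)]
    # Stage 2: exclude articles whose embedded images are known to be corrupted
    clean = [a for a in clean if a["id"] not in corrupted_ids]
    # Stage 3: draw a reproducible random sample from the remaining pool
    rng = random.Random(seed)
    return rng.sample(clean, min(n, len(clean)))
```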
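The parallel corruption check can be sketched with the standard library alone. This version uses a cheap JPEG marker heuristic (start-of-image / end-of-image bytes) rather than fully decoding each file; the repo's actual check may instead attempt a real decode, e.g. via Pillow. Function names are illustrative:

```python
import os
from multiprocessing import Pool

JPEG_SOI = b"\xff\xd8"  # JPEG start-of-image marker
JPEG_EOI = b"\xff\xd9"  # JPEG end-of-image marker

def is_corrupted(path):
    """Heuristic: a well-formed JPEG starts with SOI and ends with EOI."""
    try:
        with open(path, "rb") as f:
            head = f.read(2)
            f.seek(-2, os.SEEK_END)
            tail = f.read(2)
        return head != JPEG_SOI or tail != JPEG_EOI
    except OSError:
        return True  # unreadable or too-short files count as corrupted

def find_corrupted(paths, workers=8):
    """Check many images in parallel; returns the paths flagged as corrupted."""
    with Pool(workers) as pool:
        flags = pool.map(is_corrupted, paths)
    return [p for p, bad in zip(paths, flags) if bad]
```

Because the per-file check is I/O-bound and trivially independent, spreading it over worker processes is what makes a 650K-image scan feasible in minutes.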