README

To run the project, use the following command line arguments:

python [LAUNCHER] [MAPPER] [DATASET_TO_COMPARE_WITH] [DATASET_TO_READ_FROM]

By default, the script runs locally. With a properly configured .mrjob file, the flag -emr can be used to run the script on Amazon EMR. For example:

python mr_launcher2.py matrix_mr.py s3://temp-205/large_dataset.csv large_dataset.csv -emr

To change the number of mapper instances you need to go into mr_launcher2.py. While there, you can also set a limit on the number of times to launch a job on EMR.

The result of a successful run is the creation of file scoreDict.txt and a line in the file perflog.csv. Lines in perflog.csv have the following format:

[ROWS COMPUTED],[TOTAL RUNTIME],[AVGRUNTIME],[TIMESTAMP]

You can also just directly run one of the scripts in the typical mr.job way using the following form:

python [MAPPER] [-r emr] [--jobconfmapred.red.tasks=n] [DATASET_TO_COMPARE_WITH] &gt; [OUTPUT_FILE]

However, you need to be careful that you give mappers the correctly formatted data file.

The files serial_example.py, n2_inside_mr.py and foremr_headts.py all need to be run locally and need some hard coded values to be altered in order to run properly. They are run with the following command line arguments:

python [MAPPER] [DATABASE] e.g. python foremr_headts.py user2_export.csv

serial_example.py needs the path to the database hard-coded to the variable filename

n2_inside_mr.py needs the path to the database hard-coded to the variable filename.

foremr_headts.py needs the path to the database of pre-computed teachers and the list of results from that computation. The output of n2_inside_mr.py can serve as this list as it is the correct format. Running n2_inside_mr.py on a subset of the database was how we generated our list.