To run the project, use the following command line arguments:
python [LAUNCHER] [MAPPER] [DATASET_TO_COMPARE_WITH] [DATASET_TO_READ_FROM]
By default, the script runs locally. With a properly configured .mrjob file, the flag -emr can be used to run the script on Amazon EMR. For example:
python mr_launcher2.py matrix_mr.py s3://temp-205/large_dataset.csv large_dataset.csv -emr
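If the launcher has to be started from another Python script, the command above can be assembled as an argument list. This is only a sketch: build_launcher_command is a hypothetical helper that is not part of the project, and it assumes the argument order shown in the usage line above.

```python
import sys


def build_launcher_command(launcher, mapper, compare_with, read_from, emr=False):
    """Return the argument list for the launcher command described above.

    Hypothetical helper: the argument order follows the usage line
    python [LAUNCHER] [MAPPER] [DATASET_TO_COMPARE_WITH] [DATASET_TO_READ_FROM]
    """
    cmd = [sys.executable, launcher, mapper, compare_with, read_from]
    if emr:
        # -emr asks the launcher to run on Amazon EMR instead of locally.
        cmd.append("-emr")
    return cmd


cmd = build_launcher_command(
    "mr_launcher2.py",
    "matrix_mr.py",
    "s3://temp-205/large_dataset.csv",
    "large_dataset.csv",
    emr=True,
)
# Show the arguments that follow the interpreter path.
print(" ".join(cmd[1:]))
```

The resulting list can be passed to subprocess.run(cmd, check=True) to start the run.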
To change the number of mapper instances, edit mr_launcher2.py. There you can also set a limit on the number of times a job is launched on EMR.
A successful run creates the file scoreDict.txt and adds a line to the file perflog.csv. Lines in perflog.csv have the following format:
[ROWS COMPUTED],[TOTAL RUNTIME],[AVGRUNTIME],[TIMESTAMP]
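A line in this format can be read back for later analysis. The following is a minimal sketch, not code from the project: PerfRecord and parse_perflog_line are hypothetical names, the sample line is made up, and it assumes that the row count is an integer and both runtimes are plain numbers. The timestamp is kept as text because its format is not documented here.

```python
import csv
from typing import NamedTuple


class PerfRecord(NamedTuple):
    rows_computed: int
    total_runtime: float
    avg_runtime: float
    timestamp: str  # kept as text; the timestamp format is an unknown


def parse_perflog_line(line):
    """Parse one line of the form
    [ROWS COMPUTED],[TOTAL RUNTIME],[AVGRUNTIME],[TIMESTAMP]
    """
    fields = next(csv.reader([line]))
    if len(fields) != 4:
        raise ValueError("expected 4 fields, got %d" % len(fields))
    rows, total, avg, stamp = fields
    return PerfRecord(int(rows), float(total), float(avg), stamp.strip())


def read_perflog(path="perflog.csv"):
    """Return a list of PerfRecord, skipping blank lines."""
    with open(path) as handle:
        return [parse_perflog_line(line) for line in handle if line.strip()]


# Made-up sample line, for illustration only.
record = parse_perflog_line("1000,12.5,0.0125,2014-05-01 10:00:00")
print(record.rows_computed, record.total_runtime, record.avg_runtime)
```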
You can also run one of the scripts directly, in the usual mrjob way, using the following form:
python [MAPPER] [-r emr] [--jobconf mapred.red.tasks=n] [DATASET_TO_COMPARE_WITH] > [OUTPUT_FILE]
However, make sure that the data file you give the mappers is correctly formatted.
The files serial_example.py, n2_inside_mr.py and foremr_headts.py must all be run locally, and each needs some hard-coded values changed before it will run properly. They are run with the following command line arguments:
python [MAPPER] [DATABASE]
For example:
python foremr_headts.py user2_export.csv
serial_example.py needs the path to the database hard-coded in the variable filename.
n2_inside_mr.py needs the path to the database hard-coded in the variable filename.
foremr_headts.py needs the path to the database of pre-computed teachers and the list of results from that computation. The output of n2_inside_mr.py can serve as this list, since it is already in the correct format. We generated our list by running n2_inside_mr.py on a subset of the database.