Analyze historical cluster traces and synthesize representative workload
Standard first steps in the Hadoop workload analysis process involve extracting per-job statistics from job history logs and synthesizing representative test workloads from the historical trace data. These shorter-duration test workloads can then drive performance comparisons on real-life production systems.
This tool, parse-hadoop-jobhistory.pl, implements Step 1: it parses Hadoop job history logs created in the default Hadoop logging format.
Usage:
perl parse-hadoop-jobhistory.pl [job history dir] > outputFile.tsv
The script prints its output to STDOUT, which can then be piped to a file or consumed for further analysis. The output format is tab-separated values (.tsv), one row per job. The output fields are:
1. unique_job_id
2. job_name
3. map_input_bytes
4. shuffle_bytes
5. reduce_output_bytes
6. submit_time_seconds (epoch format)
7. duration_seconds
8. map_time_task_seconds (2 tasks of 10 seconds = 20 task-seconds)
9. red_time_task_seconds
10. total_time_task_seconds
11. map_tasks_count
12. reduce_tasks_count
13. hdfs_input_path (if available in the history log directory)
14. hdfs_output_path (if available in the history log directory)
For example, one could parse the Facebook history logs by calling
perl parse-hadoop-jobhistory.pl FB-2009_LogRepository > FacebookTrace.tsv
The resulting output can be fed directly into Step 2. Alternatively, analysts can use it to distill other insights; for example, the sketch below reports the time span of the trace, which also supplies the --traceStart and --traceEnd values needed in Step 2.
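As a minimal sketch (not part of this repository), the hypothetical helper below scans a parsed trace on STDIN and reports the job count together with the earliest and latest submit times, taken from field 6 (submit_time_seconds):
#!/usr/bin/perl
# trace-span.pl -- hypothetical helper, not part of this repository.
# Reads the .tsv produced by parse-hadoop-jobhistory.pl from STDIN and
# reports the job count and the earliest/latest submit times (field 6,
# epoch seconds), which can later serve as --traceStart and --traceEnd.
use strict;
use warnings;

my ($count, $min, $max) = (0, undef, undef);
while (my $line = <STDIN>) {
    chomp $line;
    my @fields = split /\t/, $line;
    my $submit = $fields[5];                # submit_time_seconds (epoch)
    next unless defined $submit && $submit =~ /^\d+$/;
    $count++;
    $min = $submit if !defined $min || $submit < $min;
    $max = $submit if !defined $max || $submit > $max;
}
die "no valid rows found\n" if $count == 0;
print "jobs=$count traceStart=$min traceEnd=$max\n";
For example: perl trace-span.pl < FacebookTrace.tsv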
This step synthesizes representative workloads of short duration from the trace data parsed in Step 1, which may span months. The key synthesis technique is continuous time-window sampling in multiple dimensions; our IEEE MASCOTS 2011 paper explains this method in more detail, and a sketch of the sampling idea appears after the argument list below.
Usage:
perl WorkloadSynthesis.pl
--inPath=STRING
--outPrefix=STRING
--repeats=INT
--samples=INT
--length=INT
--traceStart=INT
--traceEnd=INT
The script samples the Hadoop trace at inPath and generates several synthetic workloads, stored under file names prefixed by outPrefix.
Arguments in the above:
- inPath: Path to the file containing the full trace. Expected format: tab-separated fields, one row per job, with fields in the following order: unique_job_id, job_name, map_input_bytes, shuffle_bytes, reduce_output_bytes, submit_time_seconds (epoch format), duration_seconds, map_time_task_seconds (2 tasks of 10 seconds = 20 task-seconds), red_time_task_seconds, total_time_task_seconds, map_tasks_count (optional), reduce_tasks_count (optional), hdfs_input_path (optional), hdfs_output_path (optional). Additional fields do not trigger an error but are ignored.
- outPrefix: Prefix of the synthetic workload output files. Output format: tab-separated fields, one row per job, with fields in the following order: new_unique_job_id, submit_time_seconds (relative to the start of the workload), inter_job_submit_gap_seconds, map_input_bytes, shuffle_bytes, reduce_output_bytes. A replay sketch after the example at the end of this page shows one way to consume these files.
- repeats: The number of synthetic workloads to create.
- samples: The number of samples in each synthetic workload.
- length: The time length of each sample window, in seconds.
- traceStart: Start time of the trace (epoch format).
- traceEnd: End time of the trace (epoch format).
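To make the sampling idea concrete, here is a minimal sketch of continuous time-window sampling. It is not the WorkloadSynthesis.pl implementation: it produces a single output workload, hard-codes placeholder parameter values, and skips details such as multiple repeats, so treat the script name and the numbers as assumptions.
#!/usr/bin/perl
# sample-windows.pl -- illustrative sketch of continuous time-window sampling,
# not the actual WorkloadSynthesis.pl implementation. Parameter values below
# are placeholders.
use strict;
use warnings;

my ($inPath, $outPrefix)    = ('FacebookTrace.tsv', 'example_workload_');
my ($samples, $length)      = (24, 3600);                 # 24 one-hour windows
my ($traceStart, $traceEnd) = (1254000000, 1256600000);   # placeholder epochs

# Load the parsed trace: submit time plus the three data sizes.
open my $in, '<', $inPath or die "cannot open $inPath: $!";
my @jobs;
while (<$in>) {
    chomp;
    my @f = split /\t/;
    push @jobs, [ $f[5], $f[2], $f[3], $f[4] ];   # submit, map_in, shuffle, reduce_out
}
close $in;

open my $out, '>', "${outPrefix}0.tsv" or die "cannot open output: $!";
my ($id, $prev_submit, $clock) = (0, 0, 0);
for my $s (1 .. $samples) {
    # Pick a random window [start, start + length) inside the trace.
    my $start = $traceStart + int(rand($traceEnd - $traceStart - $length));
    my @window = sort { $a->[0] <=> $b->[0] }
                 grep { $_->[0] >= $start && $_->[0] < $start + $length } @jobs;
    for my $job (@window) {
        # Shift submit times so the synthetic workload starts at time 0.
        my $submit = $clock + ($job->[0] - $start);
        my $gap    = $submit - $prev_submit;
        print $out join("\t", "job$id", $submit, $gap,
                        $job->[1], $job->[2], $job->[3]), "\n";
        $prev_submit = $submit;
        $id++;
    }
    $clock += $length;   # sampled windows are concatenated back to back
}
close $out;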
E.g., FB-2009_samples_24_times_1hr_0.tsv and FB-2009_samples_24_times_1hr_1.tsv were created by running
perl WorkloadSynthesis.pl
--inPath=FacebookTrace.tsv
--outPrefix=FB-2009_samples_24_times_1hr_
--repeats=2
--samples=24
--length=3600
--traceStart=FacebookTraceStart
--traceEnd=FacebookTraceEnd
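One way to consume a synthesized workload is to pace job submissions by the recorded inter_job_submit_gap_seconds. The hypothetical sketch below (not a script in this repository) only prints the jobs it would submit; a real driver would launch MapReduce jobs sized to the recorded byte counts.
#!/usr/bin/perl
# replay-workload.pl -- hypothetical example of consuming a synthesized
# workload file; it prints each job it would submit, paced by the recorded
# inter-job gaps, instead of launching real MapReduce jobs.
use strict;
use warnings;

my $workload = shift @ARGV or die "usage: perl replay-workload.pl WORKLOAD.tsv\n";
open my $fh, '<', $workload or die "cannot open $workload: $!";
while (<$fh>) {
    chomp;
    my ($id, $submit, $gap, $map_bytes, $shuffle_bytes, $reduce_bytes) = split /\t/;
    sleep $gap;   # wait out the recorded inter-job submit gap
    # A real driver would submit a MapReduce job sized to these bytes here.
    print "t=$submit submit $id: map_input=$map_bytes shuffle=$shuffle_bytes reduce_output=$reduce_bytes\n";
}
close $fh;
For example: perl replay-workload.pl FB-2009_samples_24_times_1hr_0.tsv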