---
title: Home
menu_order: 1
---
<div class="row" id="home-top">
<div class="col-lg-6">
<img id="home-logo" src="images/d3l.png">
<div id="home-name">
<span class="acronym-letter">D</span>econstructing<br>
<span class="acronym-letter">D</span>istributed<br>
<span class="acronym-letter">D</span>eep
<span class="acronym-letter">L</span>earning
</div>
</div>
<div class="col-lg-6">
<div class="postit">
Our project develops performance models and scheduling
algorithms for distributed deep learning jobs. We analyze the
incentives for federations of datacenters with different
resource affinities for this type of job. Key research results
are put into practice through Kubernetes and TensorFlow.
</div>
</div>
</div>
<div class="row" id="home-desc">
<div class="col-lg-4">
<h2>Motivation</h2>
<p>
While machine learning frameworks (such as TensorFlow, PyTorch
or Caffe) ease the development of DNNs and training jobs, they
do not assist the user in provisioning and sharing cloud
resources, or in integrating DNN training workloads into
existing datacenters.
</p>
<p>
In fact, most users need to try different configurations of a
job (e.g., number of server/worker nodes, mini-batch size,
network capacity) to check the resulting training performance
(e.g., throughput measured as examples/second).
</p>
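<p>
As a toy illustration of this trial-and-error loop (a hypothetical
sketch with a small dense model and synthetic data, not the
project's code), one could time a fixed number of training steps
and report the resulting examples/second:
</p>
<pre>
import time
import numpy as np
import tensorflow as tf

# Toy stand-ins for a real DNN and dataset.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(10),
])
model.compile(optimizer="sgd",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

BATCH_SIZE = 64   # one configuration knob to vary
STEPS = 100       # number of timed training steps

x = np.random.rand(BATCH_SIZE * STEPS, 32).astype("float32")
y = np.random.randint(0, 10, size=BATCH_SIZE * STEPS)

start = time.time()
model.fit(x, y, batch_size=BATCH_SIZE, epochs=1, verbose=0)
elapsed = time.time() - start
print("throughput: %.1f examples/second" % (BATCH_SIZE * STEPS / elapsed))
</pre>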
<p>
At scale, when resources must be shared among hundreds of jobs,
this approach quickly becomes infeasible. At a larger scale,
when multiple datacenters need to manage deep learning
workloads, different degrees of <i>affinity</i> of their
resources create economic incentives to collaborate, as in
<i>cloud federations</i>.
</p>
</div>
<div class="col-lg-4">
<h2>Research Plan</h2>
<p>
Our approach is to develop models that predict metrics (such as
training throughput) needed to guide the allocation of job
resources.
</p>
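<p>
As a rough illustration (with assumed parameters, not the
project's actual model), the per-step time of synchronous
distributed SGD can be decomposed into compute and communication
terms, with throughput predicted as the global batch size over
the step time:
</p>
<pre>
# Hypothetical first-order throughput model for synchronous SGD.
def predicted_throughput(workers, batch_size, t_example=1e-3,
                         model_mb=100.0, bandwidth_mbps=10000.0):
    """Predicted training throughput in examples/second."""
    compute = batch_size * t_example          # local gradient computation (s)
    comm = 2 * model_mb * 8 / bandwidth_mbps  # gradient exchange (s)
    return workers * batch_size / (compute + comm)

# Compare candidate configurations before launching any job.
for w in (1, 4, 16):
    print(w, "workers:", round(predicted_throughput(w, batch_size=64)), "ex/s")
</pre>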
<p>
Our project plans to design scheduling algorithms for parallel
jobs (such as deep learning training jobs) and to evaluate
them in the field, running Kubernetes in clusters at USC and in
the cloud.
</p>
<p>
Workload characterization and strategic models of individual
datacenters will allow us to evaluate the incentives for their
cooperation.
</p>
<p>
This project is committed to diversity in research and
education: it involves undergraduate and graduate students and
builds on an existing, extensive K-12 outreach effort. Our
experimental testbed will support both research and education at
USC.
</p>
</div>
<div class="col-lg-4 ">
<h2>News</h2>
<ul>
<li>Dec. 18, 2019: Our work on throughput prediction of
<a href="http://arxiv.org/abs/1911.04650">distributed SGD in TensorFlow</a>
was accepted at <a href="https://icpe2020.spec.org/">ICPE</a>.
</li>
<li>Sep. 28, 2018: Our work on performance analysis and
scheduling of
<a href="http://qed.usc.edu/paolieri/papers/2018_mascots_sgd_tensorflow_scheduling.pdf">asynchronous SGD</a>
was presented at MASCOTS.
</li>
</ul>
<p class="text-center mt-5">
<img src="/images/nsf.png" class="mb-2"><br>
Project supported by the<br>
<a href="https://www.nsf.gov/awardsearch/showAward?AWD_ID=1816887&HistoricalAwards=false">NSF CCF-1816887</a>
award
</p>
</div>
</div>