This is the flow to run experiments on Cloudlab's c6525-25g machines, which are usually immediately available.
- Reserve N machines from the Cloudlab Utah cluster.
- Create a profile with N bare metal machines connected together on a TOR switch
- Note the SSH names of the machines you got. Optionally set a shortened alias in .ssh/config:
Host amd*
HostName [HOSTNAME PROVIDED FROM CLOUDLAB]
User [YOUR CLOUDLAB USERNAME]
The following steps mainly involve the setup repo (TODO LINK). In params.py:
- Set the variable SERVERS as:
SERVERS = ["HOSTNAME1","HOSTNAME2",...]
where each HOSTNAMEX is a name you defined in .ssh/config.
- Use the values of REMOTE_SCRIPT, NIC_IFACE_CALADAN, NIC_IFACE_SSH, NIC_PCI_CALADAN, and NUM_CONFIG_CORES that correspond to the c6525-25g:
#### For c6525-25g (amd*) cloudlab machines
REMOTE_SCRIPT = "open_c6525-25.sh"
NIC_IFACE_CALADAN = "enp65s0f0np0"
NIC_IFACE_SSH = "eno33np0"
NIC_PCI_CALADAN = "0000:41:00.0"
NUM_CONFIG_CORES = 26
- Download MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu22.04-x86_64 from here (Archive Versions -> 24.04-0.6.6.0 -> Ubuntu -> Ubuntu 24.04 -> x86_64 -> tgz).
- Point OFED_DIR to the directory containing the (local copy of the) .tgz file.
  - i.e., if the tgz is ~/sw/MLNX[...].tgz, you would just provide ~/sw
- Point CALADAN_DIR to the parent directory of the (local copy of the) cloned SIRD/Caladan repo (i.e., the repo containing this readme).
  - i.e., if the repo is ~/code/SIRD-Caladan, you would just provide ~/code
- Set LOCAL_USER to your local system's username.
- Set REMOTE_USER to your cloudlab username.
- Set PUBKEY to the type of pubkey file you have (i.e., id_ed25519.pub or id_rsa.pub).
- Note: after setting up, check that the machines use the expected NICs/interfaces as defined in params.py. For c6525-25g machines these are:
  - Caladan interface: enp65s0f0np0 (ip addr show enp65s0f0np0)
  - PCIe address of Caladan interface: 0000:41:00.0 (ethtool -i enp65s0f0np0 | grep bus-info)
  - SSH interface: eno33np0
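The PCIe address check can also be scripted. The following is a minimal sketch, not part of the setup repo: the helper names `parse_bus_info` and `read_bus_info` are hypothetical, and the sample text is made-up output in the `key: value` layout that `ethtool -i` prints.

```python
import subprocess
from typing import Optional

# Expected PCIe address for c6525-25g machines (NIC_PCI_CALADAN in params.py).
EXPECTED_PCI = "0000:41:00.0"


def parse_bus_info(ethtool_output: str) -> Optional[str]:
    """Return the bus-info value from `ethtool -i` output, or None if absent."""
    for line in ethtool_output.splitlines():
        # Split only on the first colon: the PCI address itself contains colons.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "bus-info":
            return value.strip()
    return None


def read_bus_info(iface: str) -> Optional[str]:
    """Run `ethtool -i <iface>` on this machine and extract bus-info."""
    result = subprocess.run(
        ["ethtool", "-i", iface], capture_output=True, text=True, check=True
    )
    return parse_bus_info(result.stdout)


# Illustrative (made-up) output, used only to exercise the parser.
SAMPLE_OUTPUT = """driver: exampledrv
version: 1.0
bus-info: 0000:41:00.0
supports-statistics: yes
"""

if __name__ == "__main__":
    found = parse_bus_info(SAMPLE_OUTPUT)
    print("match" if found == EXPECTED_PCI else f"mismatch: {found}")
```

On a provisioned machine, `read_bus_info("enp65s0f0np0")` would be compared against NIC_PCI_CALADAN in the same way.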
- Then run
python3 provision-script.py setup-cloudlab "amd182","amd184","amd190","amd192" &> setup.txt
which will do all the setup steps for the listed machines. Check setup.txt for connection errors etc. If there are any, a reboot of the affected machine usually fixes them.
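Scanning setup.txt by hand gets tedious with many machines. Here is a small sketch of a log scanner, under the assumption that failures show up as common OpenSSH messages; the pattern list and the function name `find_connection_errors` are illustrative, and the provisioning script may log other errors that this does not catch.

```python
from typing import List, Tuple

# Assumed patterns: typical OpenSSH failure messages. Extend as needed.
ERROR_PATTERNS = (
    "Connection refused",
    "Connection timed out",
    "Permission denied",
    "Could not resolve hostname",
)


def find_connection_errors(log_text: str) -> List[Tuple[int, str]]:
    """Return (line number, line) pairs for log lines matching a known pattern."""
    hits = []
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        lowered = line.lower()
        if any(pattern.lower() in lowered for pattern in ERROR_PATTERNS):
            hits.append((lineno, line.strip()))
    return hits


if __name__ == "__main__":
    # Reads the log produced by the provisioning command.
    with open("setup.txt", encoding="utf-8", errors="replace") as f:
        for lineno, line in find_connection_errors(f.read()):
            print(f"setup.txt:{lineno}: {line}")
```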
- Now run an experiment, for example:
python3 run-script.py manual configs/machine-25g-amd/6machines/4-to-1-srpt.txt
- Plotting is done using the plotting repo (TODO LINK).
- First you must use the correct variables from params.py: NIC_IFACE_SSH, ASSUMED_LINK_SPEED, and SERVERS.
- Run a command like this:
python3 sird-graph-accum.py res/machine-25g-amd/6machines/4-to-1-srpt 1 0
- This will create plots in res/machine-25g-amd/6machines/4-to-1-srpt
- The first argument after the directory (1 here) is whether the script should actually fetch the data from servers or run on the data already existing in the specified directory.
- The second (0 here) decides whether to run a best-effort algo that tries to find how many racks were involved in the experiment.
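To keep the two positional flags straight, the plotting command can be assembled from named values. This is only a sketch: `build_plot_command` is a hypothetical helper, and it assumes that 1 means "yes" and 0 means "no" for both flags, which the text above does not state explicitly.

```python
from typing import List


def build_plot_command(result_dir: str, fetch_data: bool, detect_racks: bool) -> List[str]:
    """Assemble the sird-graph-accum.py invocation as an argument list.

    Assumption: each flag is passed as 1 (enabled) or 0 (disabled).
    """
    return [
        "python3",
        "sird-graph-accum.py",
        result_dir,
        str(int(fetch_data)),    # fetch data from servers vs. reuse local data
        str(int(detect_racks)),  # run the best-effort rack-count detection
    ]


if __name__ == "__main__":
    cmd = build_plot_command(
        "res/machine-25g-amd/6machines/4-to-1-srpt",
        fetch_data=True,
        detect_racks=False,
    )
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` from inside the plotting repo.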
Caladan is a system that enables servers in datacenters to simultaneously provide low tail latency and high CPU efficiency, by rapidly reallocating cores across applications.
For any questions about Caladan, please email caladan@csail.mit.edu.
- Clone the Caladan repository.
- Install dependencies.
sudo apt install make gcc cmake pkg-config libnl-3-dev libnl-route-3-dev libnuma-dev uuid-dev libssl-dev libaio-dev libcunit1-dev libclang-dev libncurses-dev meson python3-pyelftools
- Set up submodules (e.g., DPDK, SPDK, and rdma-core).
make submodules
- Build the scheduler (IOKernel), the Caladan runtime, and Ksched and perform some machine setup. Before building, set the parameters in build/config (e.g., CONFIG_SPDK=y to use storage, CONFIG_DIRECTPATH=y to use directpath, and the MLX4 or MLX5 flags to use MLX4 or MLX5 NICs, respectively). To enable debugging, set CONFIG_DEBUG=y before building.
make clean && make
pushd ksched
make clean && make
popd
sudo ./scripts/setup_machine.sh
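Editing build/config can be automated when the same flags are applied to several machines. The sketch below assumes the file consists of simple `KEY=value` lines, as the `CONFIG_SPDK=y` examples suggest; `set_config_flags` is a hypothetical helper, not something shipped with Caladan.

```python
from typing import Dict


def set_config_flags(config_text: str, flags: Dict[str, str]) -> str:
    """Return config_text with the given KEY=value flags set.

    Existing keys are rewritten in place; keys that are missing are appended.
    All other lines are kept unchanged.
    """
    remaining = dict(flags)
    out = []
    for line in config_text.splitlines():
        key, sep, _ = line.partition("=")
        key = key.strip()
        if sep and key in remaining:
            out.append(f"{key}={remaining.pop(key)}")
        else:
            out.append(line)
    # Append any flags that were not already present in the file.
    out.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    example = "CONFIG_SPDK=n\nCONFIG_DEBUG=n\n"
    print(set_config_flags(example, {"CONFIG_SPDK": "y", "CONFIG_DIRECTPATH": "y"}), end="")
```

To apply it, read build/config, pass its contents through the function, and write the result back before running make.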
- Install Rust and build a synthetic client-server application.
curl https://sh.rustup.rs -sSf | sh -s -- -y --default-toolchain=nightly
cd apps/synthetic
cargo clean
cargo update
cargo build --release
- Run the synthetic application with a client and server. The client sends requests to the server, which performs a specified amount of fake work (e.g., computing square roots for 10us), before responding.
On the server:
sudo ./iokerneld
./apps/synthetic/target/release/synthetic 192.168.1.3:5000 --config server.config --mode spawner-server
On the client:
sudo ./iokerneld
./apps/synthetic/target/release/synthetic 192.168.1.3:5000 --config client.config --mode runtime-client
This code has been tested most thoroughly on Ubuntu 18.04 with kernel 5.2.0 and Ubuntu 20.04 with kernel 5.4.0.
This code has been tested with Intel 82599ES 10 Gbits/s NICs, Mellanox ConnectX-3 Pro 10 Gbits/s NICs, and Mellanox ConnectX-5 40 Gbits/s NICs. If you use Mellanox NICs, you should install the Mellanox OFED as described in DPDK's documentation. If you use Intel NICs, you should insert the IGB UIO module and bind your NIC interface to it (e.g., using the script ./dpdk/usertools/dpdk-setup.sh).
To enable Jumbo Frames for higher throughput, first enable them in Linux on the relevant interface like so:
ip link set eth0 mtu 9000
Then use the host_mtu option in the config file of each runtime to set the MTU to the value you'd like, up to the size of the MTU set for the interface.
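Since host_mtu must not exceed the interface MTU, it can help to compare the two before launching a runtime. The following is a minimal sketch with hypothetical helper names; it reads the interface MTU from the standard Linux sysfs path `/sys/class/net/<iface>/mtu`.

```python
from pathlib import Path


def read_iface_mtu(iface: str) -> int:
    """Read the MTU currently configured in Linux for the given interface."""
    return int(Path(f"/sys/class/net/{iface}/mtu").read_text().strip())


def check_host_mtu(host_mtu: int, iface_mtu: int) -> int:
    """Return host_mtu if it fits within the interface MTU, else raise ValueError."""
    if host_mtu > iface_mtu:
        raise ValueError(
            f"host_mtu {host_mtu} exceeds interface MTU {iface_mtu}; "
            "raise the interface MTU first (ip link set <iface> mtu <value>)"
        )
    return host_mtu


if __name__ == "__main__":
    # eth0 matches the interface name used in the example command above.
    print(check_host_mtu(9000, read_iface_mtu("eth0")))
```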
Directpath allows runtime cores to directly send packets to/receive packets from the NIC, enabling higher throughput than when the IOKernel handles all packets. Directpath is currently only supported with Mellanox ConnectX-5 using Mellanox OFED v4.6 or newer. NIC firmware must include support for User Context Objects (DEVX) and Software Managed Steering Tables. For the ConnectX-5, the firmware version must be at least 16.26.1040. Additionally, directpath requires Linux kernel version 5.0.0 or newer.
To enable directpath, set CONFIG_DIRECTPATH=y in build/config before building and add enable_directpath to the config file for all runtimes that should use directpath. Each runtime launched with directpath must currently run as root and have a unique IP address.
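The directpath version requirements (ConnectX-5 firmware at least 16.26.1040, Linux kernel at least 5.0.0) are easy to get wrong with plain string comparison, because "16.100.0" sorts before "16.26.1040" lexically. Here is a sketch of a numeric comparison; the helper names are hypothetical, and obtaining the actual firmware string (for example from the firmware-version line of `ethtool -i`) is left to the caller.

```python
import re
from typing import Tuple

MIN_FIRMWARE = "16.26.1040"  # minimum ConnectX-5 firmware for directpath
MIN_KERNEL = "5.0.0"         # minimum Linux kernel for directpath


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse the leading dotted-number part, e.g. '5.4.0-42-generic' -> (5, 4, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version.strip())
    if match is None:
        raise ValueError(f"no version number found in {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(actual: str, minimum: str) -> bool:
    """Compare versions component by component as integers."""
    return parse_version(actual) >= parse_version(minimum)


if __name__ == "__main__":
    import platform

    kernel = platform.release()
    print(f"kernel {kernel} ok for directpath: {meets_minimum(kernel, MIN_KERNEL)}")
```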
This code has been tested with an Intel Optane SSD 900P Series NVMe device. If your device has op latencies that are greater than 10us, consider updating the device_latency_us variable (or the known_devices list) in runtime/storage.c.
Ensure that you have compiled Caladan with storage support by setting the appropriate flag in build/config, and that you have built the synthetic client application.
Compile the C++ bindings and the storage server:
make -C bindings/cc
make -C apps/storage_service
On the server:
sudo ./iokerneld
sudo spdk/scripts/setup.sh
sudo apps/storage_service/storage_server storage_server.config
On the client:
sudo ./iokerneld
sudo apps/synthetic/target/release/synthetic --config=storage_client.config --mode=runtime-client --mpps=0.55 --protocol=reflex --runtime=10 --samples=10 --threads=20 --transport=tcp 192.168.1.3:5000
Ensure that you have built the synthetic application on client and server.
Compile the C++ bindings and the memory/cache antagonist:
make -C bindings/cc
make -C apps/netbench
On the server, run the IOKernel with the interference-aware scheduler (ias), the synthetic application, and the cache antagonist:
sudo ./iokerneld ias
./apps/synthetic/target/release/synthetic 192.168.1.8:5000 --config victim.config --mode spawner-server
./apps/netbench/stress antagonist.config 20 10 cacheantagonist:4090880
On the client:
sudo ./iokerneld
./apps/synthetic/target/release/synthetic 192.168.1.8:5000 --config client.config --mode runtime-client
You should observe that you can stop and start the antagonist and that the synthetic application's latency is not impacted. In contrast, if you use Shenango's default scheduler (sudo ./iokerneld) on the server, when you run the antagonist with the synthetic application, the synthetic application's latency degrades.