This directory contains scripts to generate CSV files compatible with the scripts in
`03-dns-empirical/plot`, which utilize the Pandas library. It contains five scripts:

- `scan_iot_data.py` transforms collected IoT data in PCAP files into Pandas-compatible CSV files.
- `run_parallel_ixp_dns.sh` transforms collected remote IXP data in PCAP files into anonymized CSV files.
- `reformat_dns_week_2022_2.py` takes the IXP data reformatted by `run_parallel_ixp_dns.sh` and parses it into a Pandas-compatible CSV format.
- `count_names.sh` can be used to gauge the number of unique names in the CSV files generated by `scan_iot_data.py`.
- `count_tshark_names.sh` was used to confirm that using TShark generates the same set of unique names as `scan_iot_data.py`.
You can skip the steps in this section if you are using the pre-configured Vagrant VM.

All scripts were tested on Ubuntu 22.04. While it should be possible to run them on other
operating systems (especially the Python scripts), we do not guarantee successful execution.
To run the commands described below, first install the dependencies, e.g., with `apt`
on Ubuntu 22.04:

```sh
sudo apt update
sudo apt install python3-pip python3-virtualenv
```
All commands in these instructions are assumed to be run from the `03-dns-empirical/collect`
directory.
All required Python libraries are listed in `requirements.txt`. They can be
installed using pip with the commands below.
We recommend installing them into a Virtualenv as shown, but this is not strictly necessary.

```sh
virtualenv env
. env/bin/activate
pip install -r requirements.txt
```
To generate the IXP data set, you also need TShark, the sFlow Toolkit, as well as access to the collected sFlow samples of an IXP.

To execute `count_tshark_names.sh`, TShark and GNU Parallel need to be
installed. On Ubuntu 22.04 this can be done, e.g., with `apt`:

```sh
sudo apt install tshark parallel
```
The Python scripts are tested for Python versions 3.7 to 3.11 using tox. To test and lint the
code, run the following in this directory (`03-dns-empirical/collect`). If the Python version
under test is installed, the tests for it will be executed:

```sh
tox
```

The pytest test cases can be found in the `tests/` directory.
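To run the tests for a single Python version only, tox's `-e` option selects one environment. The environment name below is an assumption based on tox's default `pyXY` naming; adapt it to the versions you have installed:

```sh
# run only the Python 3.11 test environment (assumes default tox env naming)
tox -e py311
```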
`scan_iot_data.py` transforms the PCAP files provided by the YourThings, IoTFinder, and
MonIoTr studies into CSV files that can be parsed by the Pandas scripts in
`03-dns-empirical/plot`. It expects either a tar file or a directory as input. Depending
on the source data set, the execution may take a long time.

```sh
./scan_iot_data.py ../results/iotfinder-iot-data/
# resulting CSV will be named ../results/iotfinder-iot-data.csv
./scan_iot_data.py ../results/moniotr-iot-data.tgz
# resulting CSV will be named ../results/moniotr-iot-data.tgz.csv
```
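For a quick sanity check of a generated file, standard shell tools suffice (the file name below is taken from the example above):

```sh
head -n 3 ../results/iotfinder-iot-data.csv   # peek at the first rows
wc -l ../results/iotfinder-iot-data.csv       # total number of lines
```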
Specifically, the following files are needed from the websites of the respective studies:

For YourThings, from their data webpage, gather in a single directory:

- `iot_traffic20180320.tgz`
- `iot_traffic20180321.tgz`
- `iot_traffic20180328.tgz`
- `iot_traffic20180410.tgz`
- `iot_traffic20180411.tgz`
- `iot_traffic20180412.tgz`
- `iot_traffic20180413.tgz`
- `iot_traffic20180414.tgz`
- `iot_traffic20180415.tgz`
- `iot_traffic20180416.tgz`
- `iot_traffic20180417.tgz`
- `iot_traffic20180418.tgz`
- `iot_traffic20180419.tgz`
- `device_mapping.csv`

For IoTFinder, from their data webpage, gather in a single directory:

- `dns_2019_08.tgz`
- `dns_2019_09.tgz`
- `device_mapping.csv` from the YourThings study

For MonIoTr, download:

- `iot-data.tgz` (you need to agree to the sharing agreement of the authors to obtain the storage link to the data; see their website)
`run_parallel_ixp_dns.sh` transforms sampling data from an IXP into an intermediate, pseudonymized CSV
format. The IXP data is expected to be stored as PCAP files in `/mnt/data01/tcpdumpFiles`. The
location can be changed by modifying the `LOGDIR` environment variable.

The script only takes files into account that were modified between 2022-01-10 and 2022-01-17 (the second
calendar week of 2022). For another time span, modify the `TS_START` and `TS_END`
environment variables: `TS_START` is the first date and `TS_END` the last date of the
PCAPs to take into account.

```sh
LOGDIR=./myTCPdumps/ TS_START=2022-01-10 TS_END=2022-01-17 ./run_parallel_ixp_dns.sh
```

The resulting `.csv.gz` file will be stored to `dns_packets_ixp_2022_week.csv.gz`.
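To peek at the intermediate file without decompressing it to disk (the exact column layout is determined by the script):

```sh
# print the first five lines of the compressed CSV
zcat dns_packets_ixp_2022_week.csv.gz | head -n 5
```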
`reformat_dns_week_2022_2.py` provides the IXP data generated with `run_parallel_ixp_dns.sh` in
a format that can be parsed by the Pandas scripts in `03-dns-empirical/plot`. It
expects the `.csv.gz` file generated with `run_parallel_ixp_dns.sh` as input (which is also provided
by us in `results/ixp-data-set`):

```sh
./reformat_dns_week_2022_2.py ../results/ixp-data-set/dns_packets_ixp_2022_week.csv.gz
```

The resulting `.csv.gz` file will be stored in `../results/dns_packets_ixp_2022_week.csv.gz`.
`count_names.sh` can be used to gauge the number of unique names in the CSV files generated by
`scan_iot_data.py`. It expects the CSV file as input and prints the number of
unique names with various filters to stdout.
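For example, assuming the IoTFinder CSV generated in the `scan_iot_data.py` example above:

```sh
# count unique names in the generated CSV (file name from the earlier example)
./count_names.sh ../results/iotfinder-iot-data.csv
```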
`count_tshark_names.sh` was used to confirm that our `scan_iot_data.py` script
generates the same set of unique names when using TShark instead of Scapy. This method was used
in a yet-to-be-published follow-up study of ours.
It requires TShark and GNU Parallel to be installed and the following directory structure in the directory in which the script is executed (a sketch for creating it follows the list):

- `iotfinder`: The unpacked contents of the tarballs of the IoTFinder study
- `moniotr`: The unpacked contents of the tarball of the MonIoTr study
- `yourthings`: The unpacked contents of the tarballs of the YourThings study

Those directories are called the input directories in the following.
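A minimal sketch of how this structure could be created from the downloaded tarballs; the download location `~/Downloads` is an assumption, adapt it to where you stored the files:

```sh
mkdir -p iotfinder moniotr yourthings
# IoTFinder tarballs (download location is an assumption)
for f in ~/Downloads/dns_2019_*.tgz; do tar -xzf "$f" -C iotfinder; done
# MonIoTr tarball
tar -xzf ~/Downloads/iot-data.tgz -C moniotr
# YourThings tarballs
for f in ~/Downloads/iot_traffic2018*.tgz; do tar -xzf "$f" -C yourthings; done
```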
With the selected tools, and since the contents of the tarballs are required to be unpacked, it
is significantly faster than the Scapy-based `scan_iot_data.py` script, but the input data
requires more hard disk space and the script itself yields a smaller number of data points (see the
intermediary results below).

To execute, run:

```sh
./count_tshark_names.sh
```
As intermediary results it generates the following files in the directory in which the script is executed:

- a directory `names_addr_w_mdns/` that contains one CSV for each PCAP found in the input directories. The CSVs have the following columns, which are separated by semicolons (`;`) and named based on TShark field designators:
  - `dataset`: The name of the input directory the original PCAP was found in.
  - `_ws.col.Source`: The IP source address of a DNS message.
  - `_ws.col.Destination`: The IP destination address of a DNS message.
  - `dns.qry.name`: The queried name(s) in the DNS message. If more than one name was queried by the DNS message, they are separated by a `|` delimiter.
- a file `names_addr_w_mdns.csv` which is the concatenation of all CSVs in `names_addr_w_mdns/`.
- a file `names_w_mdns_filtered.csv` which only contains the `dns.qry.name` column of `names_addr_w_mdns.csv` after filtering the source and destination addresses for `EXCLUDED_DEVICES`.
The script outputs the number of unique queried names for all devices (before filtering for
`EXCLUDED_DEVICES`) and without (w/o) the excluded devices (after filtering for `EXCLUDED_DEVICES`).
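To spot-check these numbers by hand, the unique names can also be counted directly from `names_addr_w_mdns.csv` with standard shell tools (a sketch assuming the semicolon-separated column layout described above, with `dns.qry.name` as the fourth column):

```sh
# extract the name column, split multi-name entries on '|', count unique names;
# if the file starts with a header line, it is counted as one extra "name"
cut -d';' -f4 names_addr_w_mdns.csv | tr '|' '\n' | sort -u | wc -l
```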