This directory contains scripts to generate CSV files compatible with the scripts in
`03-dns-empirical/plot`, which utilize the Pandas library. It contains five scripts:

- `scan_iot_data.py` transforms collected IoT data in PCAP files into Pandas-compatible CSV files.
- `run_parallel_ixp_dns.sh` transforms collected remote IXP data in PCAP files into anonymized CSV files.
- `reformat_dns_week_2022_2.py` takes the IXP data reformatted by `run_parallel_ixp_dns.sh` and parses it into a Pandas-compatible CSV format.
- `count_names.sh` can be used to gauge the number of unique names in the CSV files generated by `scan_iot_data.py`.
- `count_tshark_names.sh` was used to confirm that using TShark generates the same set of unique names as `scan_iot_data.py`.
You can skip the steps in this section if you are using the pre-configured Vagrant VM.

All scripts were tested on Ubuntu 22.04. While it should be possible to run them on other
operating systems (especially the Python scripts), we do not guarantee successful execution.
To run the commands described below, first install the dependencies, e.g., with `apt`
on Ubuntu 22.04:

```sh
sudo apt update
sudo apt install python3-pip python3-virtualenv
```
All commands in these instructions are assumed to be run from the `03-dns-empirical/collect`
directory.
All required Python libraries are listed in `requirements.txt`. They can be
installed using pip with the commands below.
We recommend installing them into a Virtualenv as shown, but this is not strictly necessary.

```sh
virtualenv env
. env/bin/activate
pip install -r requirements.txt
```
To generate the IXP data set, you also need TShark, the sFlow Toolkit, as well as access to the collected sFlow samples of an IXP.

To execute `count_tshark_names.sh`, TShark and GNU Parallel need to be
installed. On Ubuntu 22.04 this can be done, e.g., with `apt`:

```sh
sudo apt install tshark parallel
```
The Python scripts are tested for Python versions 3.7 to 3.11 using tox. To test and lint the
code, run the following in this directory (`03-dns-empirical/collect`). If the Python version
under test is installed, the tests for it will be executed:

```sh
tox
```

The pytest test cases can be found in the `tests/` directory.
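To run the tests for a single Python version only, tox's `-e` option selects one environment. The environment name below is an assumption based on tox's default `pyXY` naming; adapt it to the versions you have installed:

```sh
# run only the Python 3.11 test environment (assumes default tox env naming)
tox -e py311
```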
`scan_iot_data.py` transforms the PCAP files provided by the YourThings, IoTFinder, and
MonIoTr studies into CSV files that can be parsed by the Pandas scripts in
`03-dns-empirical/plot`. It expects either a tar file or a directory as input. Depending
on the source data set, the execution may take a long time.

```sh
./scan_iot_data.py ../results/iotfinder-iot-data/
# resulting CSV will be named ../results/iotfinder-iot-data.csv
./scan_iot_data.py ../results/moniotr-iot-data.tgz
# resulting CSV will be named ../results/moniotr-iot-data.tgz.csv
```
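For a quick sanity check of a generated file, standard shell tools suffice (the file name below is taken from the example above):

```sh
head -n 3 ../results/iotfinder-iot-data.csv   # peek at the first rows
wc -l ../results/iotfinder-iot-data.csv       # total number of lines
```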
Specifically, the following files are needed from the websites of the respective studies:

For YourThings, from their data webpage, gather in a single directory:

- `iot_traffic20180320.tgz`
- `iot_traffic20180321.tgz`
- `iot_traffic20180328.tgz`
- `iot_traffic20180410.tgz`
- `iot_traffic20180411.tgz`
- `iot_traffic20180412.tgz`
- `iot_traffic20180413.tgz`
- `iot_traffic20180414.tgz`
- `iot_traffic20180415.tgz`
- `iot_traffic20180416.tgz`
- `iot_traffic20180417.tgz`
- `iot_traffic20180418.tgz`
- `iot_traffic20180419.tgz`
- `device_mapping.csv`

For IoTFinder, from their data webpage, gather in a single directory:

- `dns_2019_08.tgz`
- `dns_2019_09.tgz`
- `device_mapping.csv` from the YourThings study

For MonIoTr, download:

- `iot-data.tgz` (you need to agree to the sharing agreement of the authors to obtain the storage link to the data; see their website)
`run_parallel_ixp_dns.sh` transforms sampling data from an IXP into an intermediate, pseudonymized CSV
format. The IXP data is expected to be stored as PCAP files in `/mnt/data01/tcpdumpFiles`. The
location can be changed by modifying the `LOGDIR` environment variable.

The script only takes files into account that were modified between 2022-01-10 and 2022-01-17 (the second
calendar week of 2022). For another time span, modify the `TS_START` and `TS_END`
environment variables: `TS_START` is the first date and `TS_END` the last date of the
PCAPs to take into account.

```sh
LOGDIR=./myTCPdumps/ TS_START=2022-01-10 TS_END=2022-01-17 ./run_parallel_ixp_dns.sh
```

The resulting `.csv.gz` file will be stored to `dns_packets_ixp_2022_week.csv.gz`.
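To peek at the intermediate file without decompressing it to disk (the exact column layout is determined by the script):

```sh
# print the first five lines of the compressed CSV
zcat dns_packets_ixp_2022_week.csv.gz | head -n 5
```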
`reformat_dns_week_2022_2.py` provides the IXP data generated with `run_parallel_ixp_dns.sh` in
a format that can be parsed by the Pandas scripts in `03-dns-empirical/plot`. It
expects the `.csv.gz` file generated with `run_parallel_ixp_dns.sh` as input (which is also provided
by us in `results/ixp-data-set`):

```sh
./reformat_dns_week_2022_2.py ../results/ixp-data-set/dns_packets_ixp_2022_week.csv.gz
```

The resulting `.csv.gz` file will be stored in `../results/dns_packets_ixp_2022_week.csv.gz`.
`count_names.sh` can be used to gauge the number of unique names in the CSV files generated by
`scan_iot_data.py`. It expects the CSV file as input and prints the number of
unique names with various filters to stdout.
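For example, assuming the IoTFinder CSV generated in the `scan_iot_data.py` example above:

```sh
# count unique names in the generated CSV (file name from the earlier example)
./count_names.sh ../results/iotfinder-iot-data.csv
```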
`count_tshark_names.sh` was used to confirm that our `scan_iot_data.py` script
generates the same set of unique names when using TShark instead of Scapy. This method was used
in a yet-to-be-published follow-up study of ours.
It requires TShark and GNU Parallel to be installed and the following directory structure in the directory in which the script is executed (a sketch for creating it follows the list):

- `iotfinder`: The unpacked contents of the tarballs of the IoTFinder study
- `moniotr`: The unpacked contents of the tarball of the MonIoTr study
- `yourthings`: The unpacked contents of the tarballs of the YourThings study

Those directories are called the input directories in the following.
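A minimal sketch of how this structure could be created from the downloaded tarballs; the download location `~/Downloads` is an assumption, adapt it to where you stored the files:

```sh
mkdir -p iotfinder moniotr yourthings
# IoTFinder tarballs (download location is an assumption)
for f in ~/Downloads/dns_2019_*.tgz; do tar -xzf "$f" -C iotfinder; done
# MonIoTr tarball
tar -xzf ~/Downloads/iot-data.tgz -C moniotr
# YourThings tarballs
for f in ~/Downloads/iot_traffic2018*.tgz; do tar -xzf "$f" -C yourthings; done
```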
With the selected tools, and since the contents of the tarballs are required to be unpacked, it
is significantly faster than the Scapy-based `scan_iot_data.py` script, but the input data
requires more hard disk space and the script itself yields a smaller number of data points (see the
intermediary results below).

To execute, run:

```sh
./count_tshark_names.sh
```
As intermediary results it generates the following files in the directory in which the script is executed:

- a directory `names_addr_w_mdns/` that contains one CSV for each PCAP found in the input directories. The CSVs have the following columns, which are separated by semicolons (`;`) and named based on TShark field designators:
  - `dataset`: The name of the input directory the original PCAP was found in.
  - `_ws.col.Source`: The IP source address of a DNS message.
  - `_ws.col.Destination`: The IP destination address of a DNS message.
  - `dns.qry.name`: The queried name(s) in the DNS message. If more than one name was queried by the DNS message, they are separated by a `|` delimiter.
- a file `names_addr_w_mdns.csv` which is the concatenation of all CSVs in `names_addr_w_mdns/`.
- a file `names_w_mdns_filtered.csv` which only contains the `dns.qry.name` column of `names_addr_w_mdns.csv` after filtering the source and destination addresses for `EXCLUDED_DEVICES`.
The script outputs the number of unique queried names for all devices (before filtering for
`EXCLUDED_DEVICES`) and without (w/o) the excluded devices (after filtering for `EXCLUDED_DEVICES`).
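To spot-check these numbers by hand, the unique names can also be counted directly from `names_addr_w_mdns.csv` with standard shell tools (a sketch assuming the semicolon-separated column layout described above, with `dns.qry.name` as the fourth column):

```sh
# extract the name column, split multi-name entries on '|', count unique names;
# if the file starts with a header line, it is counted as one extra "name"
cut -d';' -f4 names_addr_w_mdns.csv | tr '|' '\n' | sort -u | wc -l
```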