A tool for downloading and managing arXiv documents in bulk using Amazon S3.
- Amazon AWS Account - required for accessing arXiv's bulk data on Amazon S3
- Python 2.x to use the
s3cmd
package - Python 3.x for manifest file analysis
-
Install s3cmd, a command-line tool for interacting with Amazon S3:
pip install s3cmd # Python 2 only
-
Configure s3cmd with your AWS credentials:
s3cmd --configure
Note: You'll need your AWS credentials from the Account Management tab on the AWS website.
-
Install required Python packages for manifest file analysis:
pip install pandas # For Python 3.x
First, download the manifest files containing the complete list of available arXiv files:
For PDF documents:
s3cmd get --requester-pays \
s3://arxiv/pdf/arXiv_pdf_manifest.xml \
local-directory/arXiv_pdf_manifest.xml
For source documents:
s3cmd get --requester-pays \
s3://arxiv/src/arXiv_src_manifest.xml \
local-directory/arXiv_src_manifest.xml
Use the eda_manifest.py
script to analyze the manifest files:
python eda_manifest.py
This script provides useful statistics about the arXiv dataset:
- Total size of the dataset in bytes, MB, and GB
- Total number of articles
- Average file sizes
- Number of files per time period
- Detailed statistics for recent years (2022-2023)
Use the download.py
script to fetch the actual files:
For PDF files:
python download.py \
--manifest_file /path/to/pdf-manifest \
--mode pdf \
--output_dir /path/to/output
For source files:
python download.py \
--manifest_file /path/to/src-manifest \
--mode src \
--output_dir /path/to/output
The files will be downloaded to your specified output directory. Each file is in .tar
format and approximately 500MB in size.
After downloading the tar files, use extract_pdfs.py
to extract PDFs into organized directories:
python extract_pdfs.py \
--data_dir /path/to/tar/files \
--output_dir /path/to/output \
[--keep_tars] # Optional: keep original tar files
The script will create and extract pdf files to year-month subdirectories (e.g., "2310" for October 2023). Example output structure:
output_dir/
2310/ # October 2023
paper1.pdf
paper2.pdf
2311/ # November 2023
paper3.pdf
paper4.pdf
- For metadata downloads, consider using metha
- The arXiv files are stored in requester-pays buckets on Amazon S3
- Each archive file is approximately 500MB in size and uses the
.tar
format