A customizable Python web scraper that downloads a page's HTML along with the images, JavaScript, and CSS files it references. This tool lets you scrape and save entire web pages, making it well suited for data analysis, offline viewing, or site mirroring.
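Conceptually, a scraper like this fetches the page, collects the asset references it finds, and writes each file to disk. The snippet below is a minimal, hypothetical sketch of that flow using requests and BeautifulSoup; the function name `scrape_page` and the exact parsing logic are illustrative assumptions, not the project's actual main.py.

```python
# Illustrative sketch only; assumes requests and beautifulsoup4 are installed.
import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def scrape_page(url: str, out_dir: str) -> None:
    """Download a page's HTML plus the images, scripts, and stylesheets it references."""
    os.makedirs(out_dir, exist_ok=True)
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # Collect asset URLs from <img src>, <script src>, and <link href> tags.
    refs = [tag.get("src") or tag.get("href")
            for tag in soup.find_all(["img", "script", "link"])]

    for ref in filter(None, refs):
        asset_url = urljoin(url, ref)  # resolve relative references against the page URL
        filename = os.path.basename(urlparse(asset_url).path) or "index"
        resp = requests.get(asset_url, timeout=30)
        with open(os.path.join(out_dir, filename), "wb") as fh:
            fh.write(resp.content)

    # Save the page itself alongside its assets.
    with open(os.path.join(out_dir, "index.html"), "w", encoding="utf-8") as fh:
        fh.write(html)
```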
- Scrape HTML content, images, JavaScript, and CSS files.
- Save the scraped content in a specified directory.
- Simple configuration via environment variables.
- Python 3.7 or higher
- Virtual environment (optional but recommended)
git clone https://github.com/moinulict/python-html-web-scraper.git
cd python-html-web-scraper
It's recommended to use a virtual environment to manage dependencies. Here's how you can set it up using venv:
python3 -m venv venv
source venv/bin/activate
Note: On Debian/Ubuntu systems, if the venv module isn't available, you can install it with:
sudo apt-get update
sudo apt-get install python3-venv
With the virtual environment activated, install the required Python packages:
pip install -r requirements.txt
Set up the environment variables required for the scraper. Create a .env file in the project's root directory with the following content:
SCRAPE_URL=https://your-website.com
SCRAPE_DIR=website_content
- SCRAPE_URL: Replace https://your-website.com with the URL of the website you want to scrape.
- SCRAPE_DIR: The directory where the scraped content will be saved. You can change website_content to any directory name you prefer.
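For reference, a common way to read these settings at runtime is with python-dotenv. The sketch below is a hedged illustration of that approach; it only assumes the two variables shown above, not the project's actual loading code.

```python
# Hypothetical example of reading the .env settings; assumes python-dotenv is installed.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root into the process environment

scrape_url = os.getenv("SCRAPE_URL")
scrape_dir = os.getenv("SCRAPE_DIR", "website_content")  # fall back to the default directory

if not scrape_url:
    raise SystemExit("SCRAPE_URL is not set; add it to your .env file")
```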
After completing the above steps, you can run the scraper using:
python main.py
The scraper will start downloading the content from the specified SCRAPE_URL and save it in the SCRAPE_DIR directory.
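If you scraped the site for offline viewing, one quick way to check the result is to serve the output directory with Python's built-in web server (this assumes the default website_content value for SCRAPE_DIR):
python -m http.server 8000 --directory website_content
Then open http://localhost:8000 in your browser.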
Once you're done, you can deactivate the virtual environment by running:
deactivate
Contributions are welcome! Please feel free to submit a pull request or open an issue.
This project is licensed under the MIT License - see the LICENSE file for details.