This project is a Django-based web scraping API that extracts product URLs from various eCommerce platforms. The API takes domain URLs as input and asynchronously crawls them using Selenium and BeautifulSoup.
- Accepts a list of domain URLs to scrape.
- Uses Selenium to load web pages dynamically and extract product URLs.
- Supports multiple platforms with predefined URL patterns (see the sketch after this list).
- Runs scraping tasks asynchronously.
- Stores extracted product URLs in a MySQL database.
- Provides an API to scrape data and return the results.
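The "predefined URL patterns" are per-platform rules that distinguish product pages from search or category pages. A minimal sketch of what such a mapping might look like (the platform names and regexes here are illustrative assumptions, not the repo's actual rules):

```python
import re

# Hypothetical per-platform product-URL patterns.
PLATFORM_PATTERNS = {
    "amazon": re.compile(r"/dp/\w+"),      # e.g. https://www.amazon.in/.../dp/B0XXXXXXXX
    "flipkart": re.compile(r"/p/itm\w+"),  # e.g. https://www.flipkart.com/.../p/itmXXXXXXXX
}

def is_product_url(url: str) -> bool:
    """True if the URL matches any known product-page pattern."""
    return any(pattern.search(url) for pattern in PLATFORM_PATTERNS.values())
```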
- Django 4.2 (Web framework)
- Selenium (For automated web scraping)
- BeautifulSoup (For parsing HTML content)
- MySQL (Database for storing extracted URLs)
- Threading (For asynchronous task execution)
- Python 3.10+
- MySQL database setup
- Chrome browser installed
- Chromedriver (managed automatically by `webdriver-manager`)
- Clone the repository:

  ```bash
  git clone https://github.com/Aman7818/ecommerce_crawler.git
  cd <project-directory>
  ```
- Create a virtual environment and activate it:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Configure the database settings in `settings.py`, reading the credentials from environment variables:

  ```python
  DATABASES = {
      'default': {
          'ENGINE': 'django.db.backends.mysql',
          'NAME': os.getenv('DB_NAME'),
          'USER': os.getenv('DB_USER'),
          'PASSWORD': os.getenv('DB_PASSWORD'),
          'HOST': os.getenv('DB_HOST'),
          'PORT': os.getenv('DB_PORT'),
      }
  }
  ```

  Add the following environment variables to your `.env` file:

  ```
  DB_NAME=your_db_name
  DB_USER=your_db_user
  DB_PASSWORD=your_db_password
  DB_HOST=your_db_host
  DB_PORT=3306
  ```
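  Note that Django does not read `.env` files by itself; one common approach (an assumption here, the repo may wire this differently) is to load them with `python-dotenv` near the top of `settings.py`:

  ```python
  import os
  from pathlib import Path

  from dotenv import load_dotenv  # assumes `python-dotenv` is in requirements

  BASE_DIR = Path(__file__).resolve().parent.parent
  load_dotenv(BASE_DIR / ".env")  # makes DB_NAME etc. visible to os.getenv()
  ```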
- Run database migrations:

  ```bash
  python manage.py migrate
  ```
- Start the Django server:

  ```bash
  python manage.py runserver
  ```
Endpoint: `/crawl/`
Method: POST
Request Body:

```json
{
  "domains": [
    "https://www.amazon.in/s?k=laptops",
    "https://www.flipkart.com/search?q=smartphones"
  ]
}
```
Response:

```json
{
  "message": "Crawling started in background"
}
```
This endpoint starts the web scraping task for a list of domains asynchronously.
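For example, with the dev server running on Django's default local port:

```bash
curl -X POST http://127.0.0.1:8000/crawl/ \
  -H "Content-Type: application/json" \
  -d '{"domains": ["https://www.amazon.in/s?k=laptops"]}'
```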
Endpoint: `/scrape/`
Method: POST
Request Body:

```json
{
  "domains": [
    "https://www.amazon.in/s?k=laptops",
    "https://www.flipkart.com/search?q=smartphones"
  ]
}
```
Response:

```json
{
  "message": "Crawling completed",
  "data": [
    "https://www.amazon.in/product1",
    "https://www.flipkart.com/product2"
  ]
}
```
This endpoint returns the scraped data (product URLs) from the provided domain URLs.
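Since this endpoint blocks until scraping finishes, allow a generous timeout when calling it; a sketch using `requests` (local dev server assumed):

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8000/scrape/",
    json={"domains": ["https://www.amazon.in/s?k=laptops"]},
    timeout=300,  # synchronous scraping can take minutes
)
resp.raise_for_status()
print(resp.json()["data"])  # list of extracted product URLs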
- The API receives a list of domain URLs.
- For the `/crawl/` endpoint, each domain is processed asynchronously to avoid blocking the request.
- Selenium loads the webpage, scrolls to load dynamic content, and extracts product URLs.
- URLs matching predefined patterns are saved to the database.
- For the `/scrape/` endpoint, the scraping process completes synchronously, and the product URLs are returned in the response. A sketch of the whole flow follows below.
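A minimal sketch of that flow, assuming headless Chrome; the function names, patterns, and waits here are illustrative, not the repo's actual code, and saving to the database is omitted:

```python
import re
import threading
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Illustrative patterns only; the project defines its own per-platform rules.
PRODUCT_PATTERNS = [
    re.compile(r"/dp/\w+"),    # Amazon-style product path
    re.compile(r"/p/itm\w+"),  # Flipkart-style product path
]

def extract_product_urls(domain: str) -> set[str]:
    """Load a page with Selenium, scroll for lazy content, parse links with BeautifulSoup."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()), options=options
    )
    try:
        driver.get(domain)
        # Scroll down to trigger dynamically loaded product cards.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # crude wait; explicit waits would be more robust
        soup = BeautifulSoup(driver.page_source, "html.parser")
        return {
            a["href"]
            for a in soup.find_all("a", href=True)
            if any(p.search(a["href"]) for p in PRODUCT_PATTERNS)
        }
    finally:
        driver.quit()

def crawl_in_background(domains: list[str]) -> None:
    """Fire-and-forget crawl so the /crawl/ view can respond immediately."""
    threading.Thread(
        target=lambda: [extract_product_urls(d) for d in domains],
        daemon=True,
    ).start()
```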
- Make sure any scraping of eCommerce websites complies with their Terms of Service.
- Running multiple requests can consume significant system resources.
- Ensure that your Chrome browser and Chromedriver versions are compatible.
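If you hit a version mismatch, you can confirm which versions Selenium actually launched by inspecting the driver's capabilities (capability keys shown are the standard ones for Selenium 4 with Chrome):

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# webdriver-manager downloads a chromedriver matching the installed Chrome.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
print(driver.capabilities["browserVersion"])                 # Chrome version
print(driver.capabilities["chrome"]["chromedriverVersion"])  # chromedriver version
driver.quit()
```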