Django Web Scraping API

Overview

This project is a Django-based web scraping API that extracts product URLs from eCommerce platforms. It takes a list of domain URLs as input and crawls them with Selenium and BeautifulSoup, either in the background (`/crawl/`) or synchronously (`/scrape/`).

Features

Accepts a list of domain URLs to scrape.
Uses Selenium to load web pages dynamically and extract product URLs.
Supports multiple platforms with predefined URL patterns.
Runs scraping tasks asynchronously.
Stores extracted product URLs in a MySQL database.
Provides an API to scrape data and return the results.
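
As a rough illustration of the "predefined URL patterns" feature, the sketch below filters candidate links against one regular expression per platform. The `PRODUCT_PATTERNS` table and both helper functions are hypothetical names introduced here, and the patterns themselves are assumptions; the ones the project ships may differ.

```python
import re
from urllib.parse import urlparse

# Hypothetical patterns, chosen for illustration only.
PRODUCT_PATTERNS = {
    "amazon.in": re.compile(r"/dp/[A-Z0-9]{10}"),
    "flipkart.com": re.compile(r"/p/itm[0-9a-z]+"),
}


def is_product_url(url: str) -> bool:
    """Return True if the URL path matches the pattern for its platform."""
    parsed = urlparse(url)
    host = parsed.netloc.lower().removeprefix("www.")
    pattern = PRODUCT_PATTERNS.get(host)
    return pattern is not None and pattern.search(parsed.path) is not None


def filter_product_urls(urls):
    """Keep product URLs, dropping duplicates while preserving order."""
    seen = set()
    result = []
    for url in urls:
        if url not in seen and is_product_url(url):
            seen.add(url)
            result.append(url)
    return result
```

Keeping the patterns in a single table means supporting another platform only requires adding one entry.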

Technologies Used

Django 4.2 (Web framework)
Selenium (For automated web scraping)
BeautifulSoup (For parsing HTML content)
MySQL (Database for storing extracted URLs)
Threading (For asynchronous task execution)

Installation

Prerequisites

Python 3.10+
MySQL database setup
Chrome browser installed
Chromedriver (Managed automatically by webdriver-manager)

Steps

Clone the repository:

git clone https://github.com/Aman7818/ecommerce_crawler.git
cd ecommerce_crawler

Create a virtual environment and activate it:

python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Install dependencies:
```
pip install -r requirements.txt
```

Configure the database settings in settings.py. Use the environment variables for database credentials:

import os  # needed for os.getenv below

# Database credentials are read from environment variables.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': os.getenv('DB_NAME'),
        'USER': os.getenv('DB_USER'),
        'PASSWORD': os.getenv('DB_PASSWORD'),
        'HOST': os.getenv('DB_HOST'),
        'PORT': os.getenv('DB_PORT'),
    }
}

Add the following environment variables to your .env file:

DB_NAME=your_db_name
DB_USER=your_db_user
DB_PASSWORD=your_db_password
DB_HOST=your_db_host
DB_PORT=3306
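
Django does not read `.env` files on its own, so these variables have to be present in the process environment (for example exported in the shell or loaded by a dotenv-style package) before `settings.py` runs. The sketch below is one possible way to build the same `DATABASES` dict so that a missing credential fails early. The `build_database_settings` helper and the fallback host and port are assumptions, not part of the project.

```python
import os


def build_database_settings(env=os.environ):
    """Build the DATABASES dict from environment variables.

    Raises KeyError if a required variable is missing, instead of
    silently passing None to the MySQL backend.
    """
    return {
        "default": {
            "ENGINE": "django.db.backends.mysql",
            "NAME": env["DB_NAME"],
            "USER": env["DB_USER"],
            "PASSWORD": env["DB_PASSWORD"],
            # Assumed fallbacks for a local MySQL server.
            "HOST": env.get("DB_HOST", "127.0.0.1"),
            "PORT": env.get("DB_PORT", "3306"),
        }
    }
```

In `settings.py` this would be used as `DATABASES = build_database_settings()`.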

Run database migrations:
```
python manage.py migrate
```
Start the Django server:
```
python manage.py runserver
```

API Usage

Endpoint: `/crawl/`

Method: POST

Request Body:

{
    "domains": [
        "https://www.amazon.in/s?k=laptops",
        "https://www.flipkart.com/search?q=smartphones"
    ]
}

Response:

{
    "message": "Crawling started in background"
}

This endpoint starts the scraping task for the listed domains in the background and returns immediately. Matching product URLs are saved to the database as they are found.
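
Since the project lists threading for asynchronous execution, the background hand-off might look roughly like the sketch below. `crawl_domain` and `start_background_crawl` are hypothetical names, and the placeholder worker stands in for the real Selenium crawl; the project's actual view code may be organized differently.

```python
import threading


def crawl_domain(domain, results):
    """Placeholder for the real Selenium-based crawl (hypothetical)."""
    results.append(f"crawled {domain}")


def start_background_crawl(domains, worker=crawl_domain):
    """Start one daemon thread per domain and return without waiting."""
    results = []
    threads = []
    for domain in domains:
        thread = threading.Thread(
            target=worker, args=(domain, results), daemon=True
        )
        thread.start()
        threads.append(thread)
    return threads, results
```

A view using this approach would call `start_background_crawl(...)` and respond right away, without joining the threads. Daemon threads stop when the server process exits, so a crawl that is still running at shutdown would be cut short.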

Endpoint: `/scrape/`

Method: POST

Request Body:

{
    "domains": [
        "https://www.amazon.in/s?k=laptops",
        "https://www.flipkart.com/search?q=smartphones"
    ]
}

Response:

{
    "message": "Crawling completed",
    "data": [
        "https://www.amazon.in/product1",
        "https://www.flipkart.com/product2"
    ]
}

This endpoint scrapes the provided URLs synchronously and returns the extracted product URLs in the response.
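
A minimal client sketch for either endpoint, using only the standard library. It assumes the server is running at Django's default `runserver` address, `http://127.0.0.1:8000`; `build_request` and `post_domains` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumed: default runserver address


def build_request(endpoint, domains, base_url=BASE_URL):
    """Build a JSON POST request for /crawl/ or /scrape/ without sending it."""
    body = json.dumps({"domains": domains}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}{endpoint}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_domains(endpoint, domains, base_url=BASE_URL):
    """Send the request and return the decoded JSON response."""
    request = build_request(endpoint, domains, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `post_domains("/scrape/", ["https://www.amazon.in/s?k=laptops"])` would return the parsed response body. Because `/scrape/` works synchronously, the call can take a while to return.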

How It Works

The API receives a list of domain URLs.
For the /crawl/ endpoint, each domain is processed asynchronously to avoid blocking the request.
Selenium loads the webpage, scrolls to load dynamic content, and extracts product URLs.
URLs matching predefined patterns are saved to the database.
For the /scrape/ endpoint, the scraping process completes synchronously, and the product URLs are returned in the response.
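
The scrolling step above can be sketched as a loop that keeps scrolling until the page height stops growing. This is an assumption about how the scrolling could be done, not the project's confirmed code. `scroll_to_bottom` is a hypothetical helper; it accepts any object that exposes Selenium's `execute_script` method, so it can be tried without launching a browser.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=10):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_number
        last_height = new_height
    return max_rounds
```

The `max_rounds` cap keeps pages with endless feeds from scrolling forever. After scrolling, `driver.page_source` can be handed to BeautifulSoup to collect the links.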

Notes

Scraping eCommerce websites must comply with their Terms of Service.
Running many requests at once can consume significant CPU and memory, because the scraping is done by driving a real Chrome browser through Selenium.
Ensure that your Chrome browser and Chromedriver versions are compatible.
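
One way to keep resource use bounded is to cap how many crawls run at the same time. The sketch below is offered as an assumption, not as something the project currently does. `crawl_all` and its `crawl_one` parameter are hypothetical names, and `crawl_one` stands in for whatever function crawls a single domain.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_all(domains, crawl_one, max_workers=2):
    """Run crawl_one over domains with at most max_workers at a time.

    Results come back in the same order as the input domains.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(crawl_one, domains))
```

With `max_workers=2`, at most two crawls (and therefore two browser sessions) would be active at any moment, however many domains are submitted.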
