scraping-boilerplate

📍 Scrape with ease using this Boilerplate!

⚙️ Developed with the software and tools below:

Python


📚 Table of Contents

- 📍 Overview
- 📂 Project Structure
- 🧩 Modules
- 🚀 Getting Started
  - 🖥 Installation
  - 🤖 Using scraping-boilerplate

📍 Overview

This codebase provides a web crawling and data extraction tool for scraping websites using regular expressions. The main script initiates the crawl from a CSV input file, uses a Parser object to parse the web data, and saves the crawler's progress as it goes.

The Crawler class crawls websites and processes the extracted data, while the Parser class uses regular expressions to pull emails, phone numbers, LinkedIn URLs, and Facebook URLs out of a website's HTML.

The project automates the scraping process and extracts relevant data efficiently and accurately, but, most importantly, it provides a boilerplate for future scraping projects.
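
Concretely, a minimal main.py could wire the pieces together along the lines of the sketch below. The constructor argument names (parser, input_file, output_folder, rate_limit) and the input/output paths are assumptions derived from the module summaries further down, not the actual signatures:

from utils.Crawler import Crawler
from utils.Parser import Parser

try:
    # Hypothetical wiring: argument names mirror the Crawler description
    # (parser, input file, output folder, rate limit) but are assumptions.
    my_parser = Parser()
    my_crawler = Crawler(
        parser=my_parser,
        input_file="input.csv",    # CSV listing the websites to crawl (assumed name)
        output_folder="output",    # where results and progress are saved (assumed name)
        rate_limit=1,              # pause between requests (assumed unit: seconds)
    )
    my_crawler.crawl(limit=5)      # the page limit discussed in the usage section
except Exception as error:
    print(f"Crawling failed: {error}")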


📂 Project Structure

scraping-boilerplate
├── main.py
├── requirements.txt
└── utils
    ├── Crawler.py
    ├── Parser.py
    └── utilities.py

🧩 Modules

Root

main.py (main.py)
Initiates a web crawler from the provided CSV input file, using a Parser object to parse web data. It limits the crawl to 5 web pages (the limit is configurable) and saves the crawler's progress. In case of any exception, it prints an error message.

Crawler.py (utils\Crawler.py)
Defines a Crawler class that crawls websites and extracts data from them. The class takes parameters such as the parser to use, the input file, the output folder, and the rate limit. It contains methods to compute header indexes, process extracted data, log progression, and save the crawling progress. The crawl method scrapes websites and writes the extracted data to an output file.

Parser.py (utils\Parser.py)
Defines a Parser class that uses regular expressions to extract emails, phone numbers, LinkedIn URLs, and Facebook URLs from a website's HTML. The extraction methods are decorated with @set_func_headers to specify their headers and order. The Parser class also has methods to get the headers and to run every extractor method, in order, against an HTML page.

utilities.py (utils\utilities.py)
Contains two utility functions: a decorator that assigns attributes such as headers, func_type, and order to a function, and a helper that normalizes a phone number based on predefined rules, such as removing spaces and adding country codes (see the sketch below).
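
To make utilities.py more concrete, here is a minimal sketch of what the two helpers could look like. The function names, the exact attributes, and the normalization rules (a French +33 country code is assumed here) are illustrative, not the repository's actual implementation:

def set_func_headers(headers, func_type, order):
    """Sketch of the decorator: attach metadata attributes to an extractor function."""
    def decorator(func):
        func.headers = headers
        func.func_type = func_type
        func.order = order
        return func
    return decorator

def normalize_phone_number(raw_number):
    """Sketch of the normalizer: strip separators and prepend a country code (assumed +33)."""
    cleaned = raw_number.replace(" ", "").replace(".", "").replace("-", "")
    if cleaned.startswith("0"):
        cleaned = "+33" + cleaned[1:]
    return cleaned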

🚀 Getting Started

🖥 Installation

  1. Clone the scraping-boilerplate repository:
git clone https://github.com/bvelitchkine/scraping-boilerplate.git
  2. Change to the project directory:
cd scraping-boilerplate
  3. Install the dependencies:
pip install -r requirements.txt

🤖 Using scraping-boilerplate

Add your own extractors to the Parser class, following the same structure as the existing LinkedIn and Facebook extractors, for instance:

@set_func_headers(headers=["linkedin_urls"], func_type="extractor", order=3)
def __extract_linkedin_urls(self, html_page_content):
    """
    Extract linkedin urls from the website HTML.

    Args:
        html_page_content: The content of the website HTML page

    Returns:
        A dictionary with the key 'linkedin_urls' and the urls found as value
    """
    # Regular expression pattern to match linkedin urls
    linkedin_regex_pattern = (
        r"https?:\/\/(www\.)?linkedin\.com\/[a-zA-Z%\däëüïöâêûîôàèùìòé\-_,\/]{4,}"
    )

    # Extract linkedin_urls from the website HTML using regular expressions
    matches = re.finditer(linkedin_regex_pattern, html_page_content, re.IGNORECASE)
    return {"linkedin_urls": [match.group() for match in matches]}

Change the limit of the crawler in the main.py script. It's currently set to 5, but you can change it to any number you want.

my_crawler.crawl(limit=5)

Then, run the main.py script to start the crawler:

python main.py
