Data extraction package which supports CLI and API requests.

lsegg/scraper-api-challenge

Scraper API and CLI

This is a step-by-step guide to using this simple data extraction package, which supports both CLI and API requests.

Built with 🛠️

  • Axios - Promise based HTTP client
  • Cheerio - Library for parsing and manipulating HTML
  • Express - Web framework for Node.js
  • Jest - JavaScript Testing Framework
  • Node.js - JavaScript runtime environment
  • NPM - Package manager for Node.js

Installation ⚙️

  1. Run npm i to install the package dependencies.

CLI Usage ✅

Run:

  node cli-scraper.js <htmlSource> <selectorSource>
  • htmlSource can be either an HTML file or a web URL.
  • selectorSource is a JSON file that maps keys to CSS selectors.
  • For repeated data, the __root property is required.

E.g. node cli-scraper.js examples/input1.html examples/selector1.json

The results are logged to the console and written to the scrapedData.json file inside the examples folder.
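
As a rough illustration of the selectorSource shape described above, the sketch below builds a selector map and prints the JSON text that would be saved as the selector file. The key names (name, price) and every selector are hypothetical, and the reading of __root as the selector for each repeating element is an assumption, not something this guide confirms.

```javascript
// Hypothetical selector map: key names and selectors are illustrative only.
const selectors = {
  // Assumption: __root matches each repeating element (here, a table row)
  // and the other selectors are resolved relative to it.
  __root: 'table tbody tr',
  name: 'td:nth-child(1)',
  price: 'td:nth-child(2)',
};

// Serialize to the JSON text that would be stored in the selector file.
const selectorJson = JSON.stringify(selectors, null, 2);
console.log(selectorJson);
```

The printed JSON could be saved to a file and passed as the second CLI argument.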

API Usage ✅

  1. Run npm run dev to start the server.
  2. Use curl, Postman or another API testing tool to make your API requests.
  3. The HTTP method should be POST and the body should be a JSON object with html and selectors properties:
  • html can be either a stringified HTML document or a web URL.
  • selectors is an object that maps keys to CSS selectors.
  • For repeated data, the __root property is required.
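
Since the html property accepts either raw HTML or a web URL, something has to tell the two apart. The sketch below shows one possible check using the built-in URL class. The helper name isWebUrl is hypothetical, and this is not necessarily how the package itself makes the decision.

```javascript
// Hypothetical helper: returns true when the input parses as an absolute
// http(s) URL, and false for anything else (raw HTML, file paths, etc.).
function isWebUrl(source) {
  try {
    const { protocol } = new URL(source);
    return protocol === 'http:' || protocol === 'https:';
  } catch {
    // new URL() throws on input that is not an absolute URL.
    return false;
  }
}

console.log(isWebUrl('https://github.com/')); // true
console.log(isWebUrl('<h1>Hello</h1>'));      // false
```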

Example request:

curl -X POST http://localhost:3000/scrape -H "Content-Type: application/json" -d '{"html": "https://github.com/", "selectors": {"title": "h1:first-child"}}'
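
The same request can also be sent from Node.js. The sketch below assumes Node.js 18 or later (for the global fetch function) and a server already running on localhost:3000. The helper names buildScrapeRequest and scrape are hypothetical, and the assumption that the server replies with JSON is not confirmed by this guide.

```javascript
// Build the same POST request as the curl command above.
function buildScrapeRequest(html, selectors, baseUrl = 'http://localhost:3000') {
  return {
    url: new URL('/scrape', baseUrl).href,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ html, selectors }),
    },
  };
}

// Send the request. Needs Node.js 18+ and a running server.
async function scrape(html, selectors) {
  const { url, options } = buildScrapeRequest(html, selectors);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Scrape request failed with status ${response.status}`);
  }
  // Assumption: the server answers with a JSON body.
  return response.json();
}

// Inspect the request without sending it.
const request = buildScrapeRequest('https://github.com/', { title: 'h1:first-child' });
console.log(request.url);
console.log(request.options.body);
```

With the server started, calling scrape('https://github.com/', { title: 'h1:first-child' }) sends the request and resolves with the parsed response.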

Requirements ⚙️

Notes 📋

  • I chose the libraries based on the most popular and most downloaded npm options.
  • The first example provided in the challenge description is wrong, since there is no "p" element that is a child of "h1".
  • The second example provided in the challenge description was modified to include tbody because of the behaviour of the Cheerio load function.
