This is a step-by-step guide to this simple data extraction package, which supports both CLI usage and API requests. It is built with:
- Axios - Promise-based HTTP client
- Cheerio - Library for parsing and manipulating HTML
- Express - Web framework for Node.js
- Jest - JavaScript testing framework
- Node.js - JavaScript runtime environment
- NPM - Package manager for Node.js
- Run `npm i` to install the package dependencies.
- Run `node cli-scraper.js <htmlSource> <selectorSource>` to scrape from the command line:
- `htmlSource` can be either a path to an HTML file or a web URL.
- `selectorSource` is a path to a JSON file whose keys map to CSS selectors (a sketch of one follows this list).
- In case of repetitive data, the `__root` property is required.
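For illustration, here is a minimal sketch of what a selector file might look like. The file name, the selectors, and the exact semantics of `__root` are assumptions for this example, not files shipped with the package:

```json
{
  "__root": "ul.products li",
  "name": "h2",
  "price": ".price"
}
```

The assumption here is that `__root` selects each repeating element, and the remaining selectors are applied relative to it.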
E.g. `node cli-scraper.js examples/input1.html examples/selector1.json`
The results will be logged to the console and written to the `scrapedData.json` file inside the `examples` folder.
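Assuming a selector file like the sketch above and a page listing two products, the contents of `scrapedData.json` would plausibly look like this (hypothetical data, shown only to illustrate the shape of the output):

```json
[
  { "name": "Apples", "price": "$2.00" },
  { "name": "Pears", "price": "$3.50" }
]
```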
- Run `npm run dev` to start the server.
- Use curl, Postman, or another API testing tool to make your API requests.
- The HTTP method should be POST and the body should be JSON with `html` and `selectors` properties:
- `html` can be either a stringified HTML document or a web URL.
- `selectors` is an object whose keys map to CSS selectors.
- In case of repetitive data, the `__root` property is required (see the sketch after this list).
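As a sketch, a request body for repetitive data might look like the following; the URL and selectors are illustrative assumptions:

```json
{
  "html": "https://example.com/products",
  "selectors": {
    "__root": "ul.products li",
    "name": "h2",
    "price": ".price"
  }
}
```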
E.g. `curl -X POST http://localhost:3000/scrape -H "Content-Type: application/json" -d '{"html": "https://github.com/", "selectors": {"title": "h1:first-child"}}'`
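Since the package already depends on Axios, the same request can also be made programmatically. A minimal sketch, assuming the server is running locally on port 3000 as in the curl example above:

```js
const axios = require('axios');

async function scrape() {
  // POST the html source and the selector map to the /scrape endpoint
  const response = await axios.post('http://localhost:3000/scrape', {
    html: 'https://github.com/',
    selectors: { title: 'h1:first-child' },
  });
  console.log(response.data); // the extracted data as JSON
}

scrape().catch(console.error);
```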
To follow this guide you will need:
- Node.js
- NPM
- A text editor like Visual Studio Code
- An API testing platform like Postman
- I based my choice of libraries on the most popular and most downloaded npm options.
- The first example provided in the challenge description is wrong, since there is no `p` element that is a child of an `h1`.
- The second example provided in the challenge description was modified to include `tbody` because of the behaviour of Cheerio's `load` function, which parses markup the way a browser does and wraps table rows in an implicit `tbody`, as the snippet below demonstrates.
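A quick way to see that behaviour for yourself (a standalone snippet, not part of the package):

```js
const cheerio = require('cheerio');

// load() parses markup the way a browser does, so the bare <tr>
// inside <table> gets wrapped in an implicit <tbody>.
const $ = cheerio.load('<table><tr><td>cell</td></tr></table>');

console.log($.html());
// => <html><head></head><body><table><tbody><tr><td>cell</td></tr></tbody></table></body></html>
```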