As described in the README, a strategy requires an extractor, a transformer, or both:

- every run of the `update` function of the extractor should pull a chunk of the data from the source, be it a page, a time range, etc.
- once the extraction process is finished, the transformation part begins. This is executed on a per-line basis (`onLine`) and should prepare the data to be compliant with the schema of the final dataset.
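
The rest of this guide focuses on the extraction side. To give an idea of the transformation side, an `onLine` hook could look roughly like this (a sketch only: the exact signature is not covered here, and it assumes each stored line is a JSON string whose fields map onto the final schema):

```js
// Hypothetical onLine transformer: parse one stored line and reshape it
// into the fields of the final dataset (field names are illustrative).
const onLine = (line) => {
  const record = JSON.parse(line);
  return {
    id: record.id,
    title: record.title,
  };
};
```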
Depending on the data source, different initial configurations may be needed.
A web 2 strategy crawls data from an HTTPS source (RSS feeds, APIs, etc.). This guide will show you the steps to crawl XKCD.
First off, the `core` expects, both for `init` and `update`, a message compliant with the schema definition for https strategies.

The `init` function message should provide the starting state of the crawling. For XKCD, we would write something like:
```js
{
  write: null,
  messages: [
    {
      type: "https",
      version,
      options: {
        url: "https://xkcd.com/1/info.0.json",
        method: "GET",
        headers: null,
        body: null
      },
    },
  ],
};
```
- `write`: specifies the data to be written, in string format. `null` during init, as there is no data yet.
- `type`: the protocol. Must be `https` for this type of strategy.
- `version`: can be anything, it's currently not used.
- `options`:
  - `url`: the URL pointing to the first chunk to crawl. The first page on XKCD is `1`.
  - `method`: `GET`, `POST`
  - `headers`: if any specific header should be sent to the source, it shall be specified here (for instance authorization headers).
  - `body`: same, the value must be of type `String`.
- `results`: this field will be used by the `core` to pass the output of the crawling to the `update` function.
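
To make the `headers` and `body` fields concrete: for a source that requires authentication, the same message could be filled in along these lines (the endpoint, token, and header shape are purely illustrative; XKCD itself needs none of this):

```js
{
  write: null,
  messages: [
    {
      type: "https",
      version,
      options: {
        url: "https://api.example.com/items?page=1", // hypothetical endpoint
        method: "POST",
        headers: { Authorization: "Bearer <token>" }, // illustrative auth header
        body: JSON.stringify({ pageSize: 100 }) // body must be a String
      },
    },
  ],
};
```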
Once the `init` message has been returned, the `core` will fetch the first chunk of data and pass it to `update`. The `update` function can perform operations on the result, store it, and/or fetch more data by returning messages in the same way as `init`. For each non-exit message returned by `update`, the function is called again with the results of the crawl.

The function, ideally, should take care of the following:
- validate the outcome of the crawl
- validate the data against a schema
- prepare the data for storing. This step should make sure that any future consumer of the data will find all that's required to process it.
- define if and what to crawl next
The message provided by the `core` will contain the following data, along with the original message:
```js
{
  "error": string,
  "results": object
}
```
- `error`: `null` or the crawling error message
- `results`: the direct output of the crawler
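
For the first XKCD page, the message passed to `update` would therefore look roughly like this (the `results` object is abridged; it simply carries whatever JSON the source returned):

```js
{
  error: null,
  results: {
    num: 1,
    title: "Barrel - Part 1",
    img: "https://imgs.xkcd.com/comics/barrel_cropped_(1).jpg",
    // ...the rest of the XKCD JSON payload (alt, year, month, day, ...)
  },
  // ...along with the fields of the original message (type, version, options)
}
```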
First, the strategy has to validate the outcome of the crawl:
```js
if (message.error) {
  // handle the error
  console.error(message.error);
  // continue the crawling if possible (in this case, we are not able to retrieve the next page)
  return {
    write: null,
    messages: [],
  };
}
```
Then, validate the data:
```js
const data = message.results;
if (!validate(data)) {
  console.error(validate.errors);
  return {
    type: "exit",
    version: "1.0"
  };
}
```
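
The guide doesn't show where `validate` comes from; assuming a JSON Schema validator such as Ajv, it could be compiled once at the top of the strategy (the schema below is a minimal, illustrative one):

```js
// Hypothetical schema validation setup using Ajv (any JSON Schema validator would do).
const Ajv = require("ajv");
const ajv = new Ajv();

// Minimal illustrative schema: the strategy only relies on `num` further down.
const validate = ajv.compile({
  type: "object",
  properties: {
    num: { type: "integer" },
    title: { type: "string" }
  },
  required: ["num"]
});
```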
Prepare the data for storage:

```js
// Simply dumping the data into a string is sufficient in this case
const toBeStored = JSON.stringify(data);
```
Define if and what to crawl next:

```js
// Let's assume we want to crawl up to MAX_PAGE
const { num } = message.results;
if (num >= MAX_PAGE) {
  return {
    type: "exit",
    version: "1.0"
  };
}

// Instruct core to crawl next page
let options = {
  url: templateURI(num + 1),
  method: "GET"
};
```
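
`templateURI` is not defined in the snippets above; presumably it just builds the XKCD JSON URL for a given comic number, along the lines of:

```js
// Hypothetical helper: build the XKCD JSON URL for a given page number.
const templateURI = (num) => `https://xkcd.com/${num}/info.0.json`;
```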
Finally, return the message back to the `core`:
```js
return {
  write: toBeStored,
  messages: [
    {
      type: "https",
      version,
      options: options,
    },
  ],
};
```
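
Putting the steps together, the `update` function for this strategy could look roughly like this (a sketch that relies on the `MAX_PAGE`, `validate`, and `templateURI` helpers assumed above):

```js
// Sketch of the full update function assembled from the steps above.
const update = (message) => {
  // 1. Validate the outcome of the crawl
  if (message.error) {
    console.error(message.error);
    return { write: null, messages: [] };
  }

  // 2. Validate the data against the schema
  const data = message.results;
  if (!validate(data)) {
    console.error(validate.errors);
    return { type: "exit", version: "1.0" };
  }

  // 3. Prepare the data for storage
  const toBeStored = JSON.stringify(data);

  // 4. Define if and what to crawl next
  if (data.num >= MAX_PAGE) {
    return { type: "exit", version: "1.0" };
  }

  // 5. Return the next message to the core
  return {
    write: toBeStored,
    messages: [
      {
        type: "https",
        version: "1.0",
        options: { url: templateURI(data.num + 1), method: "GET" },
      },
    ],
  };
};
```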
Check out the code for more details.