
As a developer, I want to write a simple graph generation algorithm so that we can effectively generate a graph of all references #520

Closed
dpriskorn opened this issue Jan 10, 2023 · 3 comments

Comments

@dpriskorn
Collaborator

dpriskorn commented Jan 10, 2023

WIP algorithm version 0:

Generation phase:

Hash the article wikitext (article_wikitext_hash).
Parse the article wikitext.
Generate the article_hash.
Generate the base item using WBI.
Store the JSON data in SSDB using the hash as key.
Hash the wikitext of all the references found (reference_wikitext_hash).
Generate the reference item if an identifier was found.
Store the generated reference JSON in SSDB with the reference_hash as key.
Store the reference wikitext in SSDB using the reference_wikitext_hash as key.
Keep a record of which articles have which raw reference hashes in SSDB, with key=article_hash+"refs" and a list of reference_wikitext_hash values as value, if any.
Keep a record of hashed references for each article in SSDB, with key=article_hash+reference_hash and a list of identifier hashes as value, if any.

We intentionally do not generate website items, nor handle the non-hashable references, in this first iteration. A rough sketch of the bookkeeping for this phase follows below.
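A minimal Python sketch of the generation-phase bookkeeping, assuming SHA-256 as the hash function and a plain dict standing in for SSDB. Item generation with WBI and identifier extraction are out of scope here, so the item JSON is passed in by the caller, and the article_hash+reference_hash identifier record is omitted; key names and helpers are illustrative, not the project's actual code.

```python
import hashlib
import json


def sha256(text: str) -> str:
    """Return the hex SHA-256 digest of a string (assumed hash function)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def store_article(
    ssdb: dict,
    article_wikitext: str,
    article_item_json: dict,
    references: list[tuple[str, dict | None]],
) -> str:
    """Store an article item and its references under hash-based keys.

    `references` is a list of (reference_wikitext, reference_item_json) pairs,
    where reference_item_json is None when no identifier was found.
    """
    # hash the article wikitext and store the generated item JSON under it
    article_hash = sha256(article_wikitext)
    ssdb[article_hash] = json.dumps(article_item_json)

    reference_wikitext_hashes = []
    for ref_wikitext, ref_item_json in references:
        ref_hash = sha256(ref_wikitext)
        reference_wikitext_hashes.append(ref_hash)
        # store the raw reference wikitext under its own hash
        ssdb[ref_hash] = ref_wikitext
        if ref_item_json is not None:
            # store the generated reference item JSON; the issue distinguishes
            # reference_hash from reference_wikitext_hash, but here both are
            # derived from the wikitext for simplicity
            ssdb[ref_hash + "json"] = json.dumps(ref_item_json)

    # record which raw reference hashes belong to this article
    ssdb[article_hash + "refs"] = json.dumps(reference_wikitext_hashes)
    return article_hash
```

With an empty dict as the store, `store_article({}, wikitext, item_json, refs)` returns the article_hash and leaves the reference bookkeeping keys behind for the upload phase to pick up.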

Upload phase:

Open a connection to Wikibase using WikibaseIntegrator.
Loop over all references and upload the JSON to Wikibase for each unique reference.
Store the resulting WCDQID in SSDB (key=reference_hash+"wcdqid", value=wcdqid).
Loop over all articles and finish generating each item using the unihash list, fetching the WCDQIDs for the references from SSDB.
    Upload at most 500 references per article in one go; discard any above that (see the sketch below).
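A hedged sketch of the upload phase, assuming WikibaseIntegrator 0.12-style calls and the same dict stand-in for SSDB as above. The endpoint URL, credentials, label content and helper names are placeholders; in the real bot the item would be rebuilt from the reference JSON stored during generation rather than given a bare label.

```python
import json

from wikibaseintegrator import WikibaseIntegrator, wbi_login
from wikibaseintegrator.wbi_config import config as wbi_config

MAX_REFERENCES_PER_ARTICLE = 500


def connect() -> WikibaseIntegrator:
    """Open a connection to Wikibase using WikibaseIntegrator (placeholder endpoint/credentials)."""
    wbi_config["MEDIAWIKI_API_URL"] = "https://example.wikibase.cloud/w/api.php"
    login = wbi_login.Login(user="bot-user", password="bot-password")
    return WikibaseIntegrator(login=login)


def upload_references(wbi: WikibaseIntegrator, ssdb: dict, reference_hashes: list[str]) -> None:
    """Upload each unique reference once and remember its WCDQID in the store."""
    for ref_hash in reference_hashes:
        if ref_hash + "wcdqid" in ssdb:
            continue  # already uploaded
        item = wbi.item.new()
        # placeholder content; the real bot would rebuild the item from the
        # reference JSON stored during the generation phase
        item.labels.set("en", f"reference {ref_hash[:8]}")
        written = item.write()
        ssdb[ref_hash + "wcdqid"] = written.id


def finish_article(ssdb: dict, article_hash: str) -> list[str]:
    """Resolve the WCDQIDs for an article's references, capped at 500 per article."""
    ref_hashes = json.loads(ssdb.get(article_hash + "refs", "[]"))
    wcdqids = [ssdb[h + "wcdqid"] for h in ref_hashes if h + "wcdqid" in ssdb]
    # keep at most 500 references per article; anything beyond that is discarded here
    return wcdqids[:MAX_REFERENCES_PER_ARTICLE]
```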

Improvements for next iteration:

Add any surplus references using addclaim to avoid throwing away good data (a sketch follows below).
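A hedged sketch of the surplus-reference idea: instead of discarding references above the 500 cap, append them one claim at a time to the already uploaded article item. The property number "P1" and the claim semantics are assumptions for illustration, not the project's confirmed data model.

```python
from wikibaseintegrator import WikibaseIntegrator
from wikibaseintegrator.datatypes import Item


def add_surplus_references(
    wbi: WikibaseIntegrator,
    article_qid: str,
    surplus_wcdqids: list[str],
    prop_nr: str = "P1",  # placeholder property, not the real data model
) -> None:
    """Append one item claim per surplus reference to an already uploaded article item."""
    item = wbi.item.get(entity_id=article_qid)
    for wcdqid in surplus_wcdqids:
        item.claims.add(Item(value=wcdqid, prop_nr=prop_nr))
    item.write()
```

This keeps the initial bulk upload small while still preserving the data that would otherwise be dropped by the 500-reference cap.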
@dpriskorn
Collaborator Author

Asked for help in https://t.me/c/1478172663/10276 (Wikibase community):
"One of the main issues is that Wikibase is very slow. Therefore I have split the process into two parts: generation and upload. This enables me to switch out Wikibase.cloud with a local Wikibase or perhaps another (faster) graph backend.
We have 40 million articles with 200 million references to handle in total, but I'm starting on a small scale with a small sample from enwiki.
See https://github.com/internetarchive/wcdimportbot for background and the code that analyzes the articles and extracts the different types of references."

@dpriskorn removed this from the 3.0.0-alpha3 milestone Jan 18, 2023
@dpriskorn
Collaborator Author

This is put on ice for now.

@dpriskorn
Collaborator Author

abandoned

@github-project-automation bot moved this from Save for future sprint to Done in Internet Archive Reference Inventory May 10, 2023
@dpriskorn moved this from Done to Abandoned in Internet Archive Reference Inventory May 10, 2023