WIP algorithm version 0:
Generation phase:
Hash the article wikitext (article_wikitext_hash).
Parse the article wikitext.
Generate the article_hash.
Generate the base item using WBI (WikibaseIntegrator).
Store the JSON data in SSDB using that hash as key.
Hash the wikitext of all references found (reference_wikitext_hash).
Generate the reference item if an identifier was found.
Store the generated reference JSON in SSDB with the reference_hash as key.
Store the reference wikitext in SSDB with the reference_wikitext_hash as key.
Keep a record of which articles have which raw reference hashes in SSDB, with article_hash+"refs" as key and a list of reference_wikitext_hash values as value, if any.
Keep a record of hashed references for each article in SSDB, with article_hash+reference_hash as key and a list of identifier hashes as value, if any (a sketch of this generation phase follows below).
We intentionally do not generate website items, nor handle the non-hashable references in this first iteration.
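The steps above boil down to hashing wikitext and writing key/value pairs. A minimal sketch, assuming SHA-256 as the hash function, a plain dict standing in for the SSDB connection, and references passed in as dicts carrying their raw wikitext, an optional identifier (e.g. a DOI) and any already-generated item JSON; the function and key names are illustrative only, not the actual wcdimportbot code:

```python
# Illustrative sketch only, not the wcdimportbot implementation.
# Assumptions: SHA-256 hashing, a dict standing in for SSDB, and references
# passed as dicts with keys "wikitext", "identifier" and "item_json".
import hashlib
import json


def hash_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def store_article(ssdb: dict, article_wikitext: str, base_item_json: dict,
                  references: list[dict]) -> str:
    # Hash the article wikitext and store the base item JSON under that hash.
    article_hash = hash_text(article_wikitext)
    ssdb[article_hash] = json.dumps(base_item_json)

    reference_wikitext_hashes = []
    for ref in references:
        # Hash and store the raw wikitext of every reference found.
        ref_wikitext_hash = hash_text(ref["wikitext"])
        reference_wikitext_hashes.append(ref_wikitext_hash)
        ssdb[ref_wikitext_hash] = ref["wikitext"]

        if ref.get("identifier") and ref.get("item_json"):
            # A reference item is only generated when an identifier was found.
            # Deriving reference_hash from the identifier is an assumption here.
            reference_hash = hash_text(ref["identifier"])
            ssdb[reference_hash] = json.dumps(ref["item_json"])
            # Record the identifier hashes for this article + reference pair.
            ssdb[article_hash + reference_hash] = json.dumps(
                [hash_text(ref["identifier"])]
            )

    # Record which raw reference wikitext hashes belong to this article.
    ssdb[article_hash + "refs"] = json.dumps(reference_wikitext_hashes)
    return article_hash
```

In the real pipeline the dict would be replaced by an SSDB client and the parsing/item generation would come from wcdimportbot and WikibaseIntegrator.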
Upload phase:
Open a connection to Wikibase using WikibaseIntegrator.
Loop over all references and upload the JSON to Wikibase for each unique reference.
Store the resulting WCDQID in SSDB (key=reference_hash+"wcdqid", value=wcdqid).
Loop over all articles and finish generating each item using the unihash list, fetching the WCDQIDs for the references from SSDB.
Upload up to a maximum of 500 references per article in one go and discard any above that (see the sketch below).
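A hedged sketch of the reference-upload loop, assuming a WikibaseIntegrator 0.12-style API (wbi.item.new(), from_json(), write()), a bot-password login, and the same dict-as-SSDB stand-in as above; the API URL, user agent and credentials are placeholders:

```python
# Illustrative sketch, assuming a WikibaseIntegrator >= 0.12 style API;
# the URL, user agent and credentials below are placeholders.
import json

from wikibaseintegrator import WikibaseIntegrator, wbi_login
from wikibaseintegrator.wbi_config import config as wbi_config

wbi_config["MEDIAWIKI_API_URL"] = "https://example.wikibase.cloud/w/api.php"
wbi_config["USER_AGENT"] = "wcdimportbot-sketch/0.0"


def upload_unique_references(ssdb: dict, reference_hashes: list[str]) -> None:
    login = wbi_login.Login(user="BotUser@botname", password="bot-password")
    wbi = WikibaseIntegrator(login=login)

    for reference_hash in reference_hashes:
        wcdqid_key = reference_hash + "wcdqid"
        if wcdqid_key in ssdb:
            # Already uploaded: reuse the stored WCDQID instead of re-writing.
            continue
        # Rehydrate the stored item JSON; from_json() is assumed to accept the
        # representation that was stored during the generation phase.
        item = wbi.item.new().from_json(json.loads(ssdb[reference_hash]))
        written = item.write()
        # Store the resulting WCDQID so the article items can point to it later.
        ssdb[wcdqid_key] = written.id
```

Splitting generation from upload like this is what makes it possible to swap the Wikibase backend later without regenerating the items.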
Improvements for next iteration:
Add any surplus references using addclaim to avoid throwing away good data (sketched below).
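A possible shape for that improvement, again assuming the WikibaseIntegrator claims API; the property number "P1" stands in for whatever "cites work"-style property the target Wikibase uses, and the helper name is illustrative:

```python
# Hedged sketch of appending surplus reference claims to an existing article
# item instead of discarding them; prop_nr "P1" is a placeholder.
from wikibaseintegrator import WikibaseIntegrator
from wikibaseintegrator.datatypes import Item as ItemClaim
from wikibaseintegrator.wbi_enums import ActionIfExists


def append_surplus_references(wbi: WikibaseIntegrator, article_wcdqid: str,
                              surplus_wcdqids: list[str]) -> None:
    item = wbi.item.get(article_wcdqid)
    for wcdqid in surplus_wcdqids:
        item.claims.add(
            ItemClaim(prop_nr="P1", value=wcdqid),
            action_if_exists=ActionIfExists.APPEND_OR_REPLACE,
        )
    item.write()
```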
Asked for help in https://t.me/c/1478172663/10276 (Wikibase community):
"One of the main issues is that Wikibase is very slow. Therefore I have split the process into two parts: generation and upload. This enables me to swap out Wikibase.cloud for a local Wikibase or perhaps another (faster) graph backend.
We have 40 million articles with 200 million references to handle in total, but I'm starting on a small scale with a small sample from enwiki.
See https://github.com/internetarchive/wcdimportbot for background and the code that analyzes the articles and extracts the different types of references."