
As a developer, I want to write a simple graph generation algorithm so that we can effectively generate a graph of all references #520

Closed
dpriskorn opened this issue Jan 10, 2023 · 3 comments

Comments

@dpriskorn
Collaborator

dpriskorn commented Jan 10, 2023

WIP algorithm version 0:

Generation phase:

Hash the article wikitext (article_wikitext_hash).
Parse the article wikitext.
Generate the article_hash.
Generate the base item using WBI.
Store the JSON data in SSDB using the hash as key.
Hash the wikitext of all the references found (reference_wikitext_hash).
Generate the reference item if an identifier was found.
Store the generated reference JSON in SSDB with the reference_hash as key.
Store the reference wikitext in SSDB using the reference_wikitext_hash as key.
Keep a record of which articles have which raw reference hashes in SSDB, with key=article_hash+"refs" and a list of reference_wikitext_hash values as value, if any.
Keep a record of hashed references for each article in SSDB, with key=article_hash+reference_hash and a list of identifier hashes as value, if any.

We intentionally do not generate website items, nor handle the non-hashable references, in this first iteration. A rough sketch of the bookkeeping for this phase follows below.
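A minimal Python sketch of the generation-phase bookkeeping, assuming SHA-256 as the hash function and a plain dict standing in for SSDB. Item generation with WBI and identifier extraction are out of scope here, so the item JSON is passed in by the caller, and the article_hash+reference_hash identifier record is omitted; key names and helpers are illustrative, not the project's actual code.

```python
import hashlib
import json


def sha256(text: str) -> str:
    """Return the hex SHA-256 digest of a string (assumed hash function)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def store_article(
    ssdb: dict,
    article_wikitext: str,
    article_item_json: dict,
    references: list[tuple[str, dict | None]],
) -> str:
    """Store an article item and its references under hash-based keys.

    `references` is a list of (reference_wikitext, reference_item_json) pairs,
    where reference_item_json is None when no identifier was found.
    """
    # hash the article wikitext and store the generated item JSON under it
    article_hash = sha256(article_wikitext)
    ssdb[article_hash] = json.dumps(article_item_json)

    reference_wikitext_hashes = []
    for ref_wikitext, ref_item_json in references:
        ref_hash = sha256(ref_wikitext)
        reference_wikitext_hashes.append(ref_hash)
        # store the raw reference wikitext under its own hash
        ssdb[ref_hash] = ref_wikitext
        if ref_item_json is not None:
            # store the generated reference item JSON; the issue distinguishes
            # reference_hash from reference_wikitext_hash, but here both are
            # derived from the wikitext for simplicity
            ssdb[ref_hash + "json"] = json.dumps(ref_item_json)

    # record which raw reference hashes belong to this article
    ssdb[article_hash + "refs"] = json.dumps(reference_wikitext_hashes)
    return article_hash
```

With an empty dict as the store, `store_article({}, wikitext, item_json, refs)` returns the article_hash and leaves the reference bookkeeping keys behind for the upload phase to pick up.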

Upload phase:

Open a connection to Wikibase using WikibaseIntegrator.
Loop over all references and upload the JSON to Wikibase for each unique reference.
Store the resulting WCDQID in SSDB (key=reference_hash+"wcdqid", value=wcdqid).
Loop over all articles and finish generating each item using the unihash list, fetching the WCDQIDs for the references from SSDB.
    Upload at most 500 references per article in one go; discard any above that (see the sketch below).
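A hedged sketch of the upload phase, assuming WikibaseIntegrator 0.12-style calls and the same dict stand-in for SSDB as above. The endpoint URL, credentials, label content and helper names are placeholders; in the real bot the item would be rebuilt from the reference JSON stored during generation rather than given a bare label.

```python
import json

from wikibaseintegrator import WikibaseIntegrator, wbi_login
from wikibaseintegrator.wbi_config import config as wbi_config

MAX_REFERENCES_PER_ARTICLE = 500


def connect() -> WikibaseIntegrator:
    """Open a connection to Wikibase using WikibaseIntegrator (placeholder endpoint/credentials)."""
    wbi_config["MEDIAWIKI_API_URL"] = "https://example.wikibase.cloud/w/api.php"
    login = wbi_login.Login(user="bot-user", password="bot-password")
    return WikibaseIntegrator(login=login)


def upload_references(wbi: WikibaseIntegrator, ssdb: dict, reference_hashes: list[str]) -> None:
    """Upload each unique reference once and remember its WCDQID in the store."""
    for ref_hash in reference_hashes:
        if ref_hash + "wcdqid" in ssdb:
            continue  # already uploaded
        item = wbi.item.new()
        # placeholder content; the real bot would rebuild the item from the
        # reference JSON stored during the generation phase
        item.labels.set("en", f"reference {ref_hash[:8]}")
        written = item.write()
        ssdb[ref_hash + "wcdqid"] = written.id


def finish_article(ssdb: dict, article_hash: str) -> list[str]:
    """Resolve the WCDQIDs for an article's references, capped at 500 per article."""
    ref_hashes = json.loads(ssdb.get(article_hash + "refs", "[]"))
    wcdqids = [ssdb[h + "wcdqid"] for h in ref_hashes if h + "wcdqid" in ssdb]
    # keep at most 500 references per article; anything beyond that is discarded here
    return wcdqids[:MAX_REFERENCES_PER_ARTICLE]
```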

Improvements for next iteration:

Add any surplus references using addclaim to avoid throwing away good data (a sketch follows below).
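A hedged sketch of the surplus-reference idea: instead of discarding references above the 500 cap, append them one claim at a time to the already uploaded article item. The property number "P1" and the claim semantics are assumptions for illustration, not the project's confirmed data model.

```python
from wikibaseintegrator import WikibaseIntegrator
from wikibaseintegrator.datatypes import Item


def add_surplus_references(
    wbi: WikibaseIntegrator,
    article_qid: str,
    surplus_wcdqids: list[str],
    prop_nr: str = "P1",  # placeholder property, not the real data model
) -> None:
    """Append one item claim per surplus reference to an already uploaded article item."""
    item = wbi.item.get(entity_id=article_qid)
    for wcdqid in surplus_wcdqids:
        item.claims.add(Item(value=wcdqid, prop_nr=prop_nr))
    item.write()
```

This keeps the initial bulk upload small while still preserving the data that would otherwise be dropped by the 500-reference cap.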
@dpriskorn
Collaborator Author

Asked for help in https://t.me/c/1478172663/10276 (Wikibase community):
"One of the main issues is that Wikibase is very slow. Therefore I have split the process into two parts: generation and upload. This enables me to switch out Wikibase.cloud with a local Wikibase or perhaps another (faster) graph backend.
We have 40 million articles with 200 million references to handle in total, but I'm starting on a small scale with a small sample from enwiki.
See https://github.com/internetarchive/wcdimportbot for background and the code that analyzes the articles and extracts the different types of references."

@dpriskorn removed this from the 3.0.0-alpha3 milestone Jan 18, 2023
@dpriskorn
Collaborator Author

This is put on ice for now.

@dpriskorn
Collaborator Author

abandoned

@github-project-automation bot moved this from Save for future sprint to Done in Internet Archive Reference Inventory May 10, 2023
@dpriskorn moved this from Done to Abandoned in Internet Archive Reference Inventory May 10, 2023