Support Storage and Retrieval of Large & Arbitrary IPLD DAGs in Filecoin #22
At the moment, any tool wishing to support storing IPFS files/directories larger than 32GiB will need to store these IPFS files/directories as "raw blocks", throwing away all the DAG structural information. This will make future retrieval deals for subsets of this data infeasible and will make IPFS interop extremely difficult.

This is only one 🔥 because there are plenty of useful sub-32GiB datasets and non-IPFS datasets.
This is true, although there is an additional impact here: it enables people to store compositions of datasets.
If deals already exist on Filecoin for a dataset and someone wants to reference that dataset (or some part of it) within theirs, the data has to be duplicated and stored in two separate deals. With this feature, as long as there is a way to discover mappings of CID -> miner with CID (currently out of band, but a required part of the retrieval market work), users don't need to store the same data twice (or worry about compositions exceeding 32GiB).
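A rough sketch of the deduplication idea in this comment, assuming a hypothetical in-memory index from CID to the miners that already hold it. No such index API is described in this thread (discovery is currently out of band), and the CIDs and miner IDs below are placeholders.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical index: CID string -> set of miner IDs already storing that CID.
CidIndex = Dict[str, Set[str]]


def plan_composition(
    wanted_cids: Iterable[str], index: CidIndex
) -> Tuple[Dict[str, Set[str]], List[str]]:
    """Split the CIDs of a composed dataset into those that can be
    referenced from existing deals and those that still need a new deal."""
    reusable: Dict[str, Set[str]] = {}
    to_store: List[str] = []
    for cid in wanted_cids:
        miners = index.get(cid)
        if miners:
            # Already stored somewhere: reference it instead of duplicating it.
            reusable[cid] = miners
        else:
            # Unknown to the index: this part needs its own storage deal.
            to_store.append(cid)
    return reusable, to_store


# Placeholder identifiers, for illustration only.
index = {"cid-dataset-a": {"miner-1", "miner-2"}}
reusable, to_store = plan_composition(["cid-dataset-a", "cid-new-root"], index)
```

Only `cid-new-root` would need a new deal here; `cid-dataset-a` is referenced from the existing ones.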
Co-authored-by: Vasco Santos <vasco.santos@ua.pt> Co-authored-by: Marcin Rataj <lidel@lidel.org>
_How might this project’s intent be realized in other ways (other than this project proposal)? What other potential solutions can address the same need?_

1. Don't support datasets > 32GiB.
2. Store large datasets as raw objects instead of IPFS files and accept the fact that these datasets
I guess a third option along these lines could be to let users send a parallel DAG structure that contains only links if they want their data to be queryable, accepting that our selector options will be limited and that dealing with this manifest may be a pain.
Title could do with some work. "in Filecoin" would be helpful, but this is about partial DAGs too, so maybe it's "Support Storage and Retrieval of Large & Arbitrary IPLD DAGs in Filecoin". But maybe this is up to three separate projects:
Yeah, this could be split into 3 mini projects. But the overarching goal is to be able to support large datasets, both for storage and retrieval. I'm not sure what breaking it into three parts would give us.
#### Dependencies/prerequisites
<!--List any other projects that are dependencies/prerequisites for this project that is being pitched.-->

None.
We should think about whether this can be deferred or done in parallel with having the lotus client / market work using ipld-prime.
Just using ipld-prime doesn't get us much. I need to be able to (a) make a deal over a selector and (b) retrieve a selector.
#27 is probably a dependency.
#### Counterpoints & pre-mortem
_Why might this project be lower impact than expected? How could this project fail to complete, or fail to be successful?_

The primary risk is that there may be a lack of demand to store large IPFS-formatted datasets in Filecoin. That is, users storing large datasets (> 32GiB) may all be using custom formats and may not care about IPFS files/directories, partial retrieval, etc.
Is this the required path for partial retrievability, or is that a somewhat orthogonal (if related) problem?
I'm not sure how the comment relates to the paragraph so I may be misinterpreting it.
Step 2 of the "plan of attack" is required for partial retrieval.
@Kubuxu could you review this please?
data for both storage and retrieval. This is especially true when interacting with IPFS.
2. This workaround requires storing an "overlay" DAG in Filecoin (paying for that storage).

Second, it should be possible to retrieve subsets of DAGs. While the underlying protocols support
The protocols support this - I think this is referring to graphsync and the other IPLD pieces down to the data storage - but the CLI doesn't. What about the miner side of this? The wording of this suggests that it's just the client CLI that's blocked on this, is that true? Can an alternative retrieval client use the protocols today to retrieve an arbitrary sub-DAG from a miner or is there more to be done on that side too?
I tested selector-based retrievals way back in August (using a hardcoded selector in the client directly). They worked, in the context of everything else being flaky.
It's not a CLI issue; rather, we do not have a decent selector interchange format in general (a gob of CBOR is not something to use over API/CLI).
In other words:
- if today I want to specify a CID - I usually get to do the funny `{ "/":"baf..." }` thing
- if today I want to express a selector - I do... ❓
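To make the gap concrete, here is a minimal sketch of what a JSON-carried selector could look like next to the JSON CID form. It is an illustration under assumptions, not a proposal taken from this thread: the short selector keys (`R`, `l`, `:>`, `a`, `>`, `@`) follow one reading of the IPLD selector schema and should be checked against the spec, and the `root`/`selector` field names are hypothetical.

```python
import json

# DAG-JSON link form for a CID; "baf..." is the elided placeholder used above.
cid_link = {"/": "baf..."}

# Assumed DAG-JSON rendering of a "walk the whole DAG" selector:
# ExploreRecursive ("R") with no depth limit ("l": {"none": {}}), whose
# sequence (":>") is ExploreAll ("a") followed by a recursive edge ("@").
explore_all = {"R": {"l": {"none": {}}, ":>": {"a": {">": {"@": {}}}}}}


def encode_request(root: dict, selector: dict) -> str:
    """Serialize a hypothetical retrieval request as plain JSON text."""
    return json.dumps({"root": root, "selector": selector}, sort_keys=True)


def decode_request(text: str):
    """Parse the hypothetical request back into its root link and selector."""
    req = json.loads(text)
    return req["root"], req["selector"]


wire = encode_request(cid_link, explore_all)
root, selector = decode_request(wire)
```

The selector survives a JSON round trip, but it is far less readable than the CID form, which is the ergonomics problem raised here.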
ahhhhh back to the "selector syntax" problem, we should just solve that properly eh? so close ipld/specs#239