Support Storage and Retrieval of Large & Arbitrary IPLD DAGs in Filecoin #22

Open
wants to merge 3 commits into
base: main
from

Stebalien
Member

No description provided.


At the moment, any tool wishing to support storing IPFS files/directories larger than 32GiB will need to store these IPFS files/directories as "raw blocks", throwing away all the DAG structural information. This will make future retrieval deals for subsets of this data infeasible and will make IPFS interop extremely difficult.

This is only one 🔥 because there are plenty of useful sub-32GiB datasets and non-IPFS datasets.
Contributor

This is true, although there is an additional impact here: enabling people to store compositions of datasets.

If deals already exist on Filecoin for a dataset and someone wants to reference that dataset (or some part of it) within theirs, the data has to be duplicated and stored in two separate deals. With this feature, as long as there is a way to discover mappings of CID -> miner with CID (currently out of band, but a required part of the retrieval market work), users don't need to store the same data twice (or worry about compositions exceeding 32GiB).
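
To make the CID -> miner mapping concrete, here is a minimal sketch of how a client might use such an index to avoid storing data twice. Everything in it is an assumption for illustration: the `CidIndex` class, its method names, and the placeholder CID and miner strings are hypothetical, and no such index exists in Filecoin today (the discovery step is out of band).

```python
from collections import defaultdict


class CidIndex:
    """Hypothetical out-of-band index: CID -> miners holding that CID in a deal."""

    def __init__(self):
        self._miners_by_cid = defaultdict(set)

    def record_deal(self, miner, cids):
        # Remember that `miner` stores every CID covered by a deal.
        for cid in cids:
            self._miners_by_cid[cid].add(miner)

    def miners_for(self, cid):
        # Return a copy so callers cannot mutate the index.
        return set(self._miners_by_cid.get(cid, set()))

    def plan_storage(self, wanted_cids):
        """Split CIDs into those already stored and those needing a new deal."""
        already, missing = {}, []
        for cid in wanted_cids:
            miners = self.miners_for(cid)
            if miners:
                already[cid] = miners
            else:
                missing.append(cid)
        return already, missing


# Placeholder identifiers, not real CIDs or miner addresses.
index = CidIndex()
index.record_deal("f01000", ["bafy-dataset-a-root", "bafy-dataset-a-chunk1"])
already, missing = index.plan_storage(["bafy-dataset-a-root", "bafy-my-new-root"])
```

In this sketch only `bafy-my-new-root` would need a new deal; the referenced dataset is already covered by the existing one.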

Co-authored-by: Vasco Santos <vasco.santos@ua.pt>
Co-authored-by: Marcin Rataj <lidel@lidel.org>
_How might this project’s intent be realized in other ways (other than this project proposal)? What other potential solutions can address the same need?_

1. Don't support datasets > 32GiB.
2. Store large datasets as raw objects instead of IPFS files and accept the fact that these datasets
Contributor

I guess a 3rd option along these lines could be to allow users to send a parallel DAG structure that contains only links if they want their data to be queryable, accepting that our selector options will be limited and that dealing with this manifest may be a pain.
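
As a rough sketch of what such a links-only parallel DAG could look like: each manifest node records the CID of the node it mirrors plus links to the manifest nodes of its children, and carries no payload. The node layout (`payload`, `links`, `target`, `children`) and the `fake_cid` helper are assumptions made up for this example; `fake_cid` is a plain SHA-256 hex digest standing in for a real CID.

```python
import hashlib
import json


def fake_cid(node):
    # Stand-in for a real CID: hex SHA-256 of canonical JSON (NOT a real multihash).
    return hashlib.sha256(json.dumps(node, sort_keys=True).encode()).hexdigest()


def put(store, node):
    # Store a node under its content address and return that address.
    cid = fake_cid(node)
    store[cid] = node
    return cid


def build_manifest(cid, data_store, manifest_store):
    """Mirror the DAG rooted at `cid`, keeping only links and dropping payloads."""
    node = data_store[cid]
    children = [build_manifest(c, data_store, manifest_store) for c in node["links"]]
    return put(manifest_store, {"target": cid, "children": children})


# A tiny two-leaf DAG with hypothetical payloads.
data = {}
leaf_a = put(data, {"payload": "aaaa", "links": []})
leaf_b = put(data, {"payload": "bbbb", "links": []})
root = put(data, {"payload": "", "links": [leaf_a, leaf_b]})

manifest = {}
manifest_root = build_manifest(root, data, manifest)
```

The manifest stays small because it holds no payloads, but it has to be kept in sync with the data it describes, which is the pain point mentioned above.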

@rvagg
Contributor

rvagg commented Feb 18, 2021

Title could do with some work. "in Filecoin" would be helpful, but this is about partial DAGs too, so maybe it's "Support Storage and Retrieval of Large & Arbitrary IPLD DAGs in Filecoin".

But maybe this is up to three separate projects:

  • Support arbitrarily large DAGs in Filecoin
  • Support arbitrary and incomplete DAGs natively in Filecoin
  • Selector support for retrieval of partial DAGs

@Stebalien changed the title from "Support Large IPLD/IPFS DAGs" to "Support Large IPLD/IPFS DAGs in Filecoin" on Feb 18, 2021
@Stebalien
Member Author

Yeah, this could be split into 3 mini projects. But the overarching goal is to be able to support large datasets, both for storage and retrieval. I'm not sure what breaking it into three parts would give us.

@Stebalien changed the title from "Support Large IPLD/IPFS DAGs in Filecoin" to "Support Storage and Retrieval of Large & Arbitrary IPLD DAGs in Filecoin" on Feb 18, 2021
#### Dependencies/prerequisites
<!--List any other projects that are dependencies/prerequisites for this project that is being pitched.-->

None.
Contributor

We should think about whether this can be deferred or done in parallel with having the lotus client / market work use ipld-prime.

Member Author

Just using ipld-prime doesn't get us much. I need to be able to (a) make a deal over a selector and (b) retrieve a selector.

#27 is probably a dependency.

#### Counterpoints & pre-mortem
_Why might this project be lower impact than expected? How could this project fail to complete, or fail to be successful?_

The primary risk is that there may be a lack of demand to store large IPFS-formatted datasets in Filecoin. That is, users storing large datasets (> 32GiB) may all be using custom formats and may not care about IPFS files/directories, partial retrieval, etc.
Contributor

Is this the required path for partial retrievability, or is that a somewhat orthogonal (if related) problem?

Member Author

I'm not sure how the comment relates to the paragraph so I may be misinterpreting it.

Step 2 of the "plan of attack" is required for partial retrieval.

@momack2
Contributor

momack2 commented Apr 1, 2021

@Kubuxu could you review this please?

data for both storage and retrieval. This is especially true when interacting with IPFS.
2. This workaround requires storing an "overlay" DAG in Filecoin (paying for that storage).

Second, it should be possible to retrieve subsets of DAGs. While the underlying protocols support
Contributor

The protocols support this - I think this is referring to graphsync and the other IPLD pieces down to the data storage - but the CLI doesn't. What about the miner side of this? The wording of this suggests that it's just the client CLI that's blocked on this, is that true? Can an alternative retrieval client use the protocols today to retrieve an arbitrary sub-DAG from a miner or is there more to be done on that side too?

I tested selector-based retrievals way back in August ( using a hardcoded selector in the client directly ) - they worked, in the context of everything else being flaky.

It's not a CLI issue, rather we do not have a decent selector interchange format in general ( a gob of cbor is not something to use over API/CLI )

In other words:

  • if today I want to specify a cid - I usually get to do the funny { "/":"baf..." } thing
  • if today I want to express a selector - I do... ❓
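
For comparison, here is a sketch of the two forms side by side as JSON. The `{"/": ...}` link form is the one mentioned above. The selector side uses the short keys from the IPLD selector spec's DAG-JSON rendering as I understand it (`R`, `l`, `:>`, `a`, `>`, `@`, `f`, `f>`, `.`). The helper names and the `request` envelope are hypothetical, not an existing API, and `baf...` is left elided as in the comment.

```python
import json


def cid_link(cid_str):
    # DAG-JSON link form: {"/": "<cid>"}.
    return {"/": cid_str}


def explore_all_recursively(depth=None):
    # ExploreRecursive ("R") with a limit ("l") over a sequence (":>") that
    # explores all children ("a"), recursing at the edge marker ("@").
    limit = {"none": {}} if depth is None else {"depth": depth}
    return {"R": {"l": limit, ":>": {"a": {">": {"@": {}}}}}}


def explore_path(fields, then):
    # Nest ExploreFields ("f" / "f>") selectors so they walk `fields` in order,
    # then apply the selector `then` at the end of the path.
    selector = then
    for name in reversed(fields):
        selector = {"f": {"f>": {name: selector}}}
    return selector


matcher = {".": {}}  # Matcher: select the node reached.

# Hypothetical request envelope pairing a root link with a selector.
request = {"root": cid_link("baf..."), "selector": explore_all_recursively()}
encoded = json.dumps(request, separators=(",", ":"))
```

Even in this form the selector is hard to write by hand, which is the interchange-format gap being described.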

Contributor

ahhhhh back to the "selector syntax" problem, we should just solve that properly eh? so close ipld/specs#239
