
lockup while running files cp #7844

Open
RubenKelevra opened this issue Dec 26, 2020 · 16 comments
Assignees
Labels
kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization

Comments

@RubenKelevra
Contributor

RubenKelevra commented Dec 26, 2020

Version information:

go-ipfs version: 0.9.0-dev-2ed925442
Repo version: 11
System version: amd64/linux
Golang version: go1.15.6

Description:

I have a script that had been running successfully for 160 days. I have now updated to 0.8-rc1 and run the garbage collection, and it seems that the garbage collection damaged the MFS.

When I run the following commands, the IPFS API call just gets stuck:

$ ipfs files rm /x86-64.archlinux.pkg.pacman.store/lastsync
$ ipfs cat bafykbzaceafqnjhgsdwwz3bm66y7ti7z6te4773vgh4pn4axiqoluxjgydmhc
1608780003
$ ipfs files cp /ipfs/bafykbzaceafqnjhgsdwwz3bm66y7ti7z6te4773vgh4pn4axiqoluxjgydmhc /x86-64.archlinux.pkg.pacman.store/lastsync

I aborted it after 3 minutes.

Btw: it would be nice if there were a timeout for such operations, so that the call at least fails with a timeout on its own instead of hanging indefinitely.
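
As a client-side workaround, the call can at least be bounded, for example with the daemon's global --timeout option or a coreutils timeout wrapper (a sketch, not a fix for the underlying hang):

# abort client-side after 30 seconds instead of hanging indefinitely
$ ipfs --timeout=30s files cp /ipfs/bafykbzaceafqnjhgsdwwz3bm66y7ti7z6te4773vgh4pn4axiqoluxjgydmhc /x86-64.archlinux.pkg.pacman.store/lastsync
# or wrap the call externally
$ timeout 30s ipfs files cp /ipfs/bafykbzaceafqnjhgsdwwz3bm66y7ti7z6te4773vgh4pn4axiqoluxjgydmhc /x86-64.archlinux.pkg.pacman.store/lastsync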

@RubenKelevra RubenKelevra added kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization labels Dec 26, 2020
@RubenKelevra
Contributor Author

It isn't a database issue, since the issue is fixed as soon as the daemon is restarted.

It happened again. This time I collected the debug information and killed the daemon with SIGABRT to get the stack trace, but it was too long for the system log.

Hope that's helpful.

debug_info.tar.gz

@gammazero
Contributor

@RubenKelevra We are looking into this, and I have a couple of questions.

  1. When you reported, "It happened again, this time ...", was this after another upgrade attempt, or during normal operation with the ipfs daemon running?
  2. Was GC running when you saw this?

If GC were to run between:
$ ipfs cat bafykbzaceafqnjhgsdwwz3bm66y7ti7z6te4773vgh4pn4axiqoluxjgydmhc
and
$ ipfs files cp ...
then the block for bafykbzaceafqnjhgsdwwz3bm66y7ti7z6te4773vgh4pn4axiqoluxjgydmhc could have been removed, causing ipfs files cp ... to try to reacquire it from the IPFS network and wait for it.
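
If that is what happened, the race should be reproducible by forcing a GC between the two commands, roughly like this (assuming the block is not pinned and not yet referenced from MFS, so the GC is allowed to remove it):

$ ipfs cat bafykbzaceafqnjhgsdwwz3bm66y7ti7z6te4773vgh4pn4axiqoluxjgydmhc
# a manual GC may now remove the unreferenced block
$ ipfs repo gc
# this copy would then block while trying to refetch the block from the network
$ ipfs files cp /ipfs/bafykbzaceafqnjhgsdwwz3bm66y7ti7z6te4773vgh4pn4axiqoluxjgydmhc /x86-64.archlinux.pkg.pacman.store/lastsync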

@RubenKelevra
Contributor Author

@RubenKelevra We are looking into this, and I have a couple of questions.

Thanks!

  1. When you reported, "It happened again, this time ...", was this after another upgrade attempt, or during normal operation with the ipfs daemon running?

Normal operation. I've rebooted the machine because I did a minor kernel update.

  2. Was GC running when you saw this?

GC isn't expected to run, since there is plenty of free space configured for IPFS. But auto-GC is enabled and set to run once 90% of the configured storage is filled.
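
For reference, these are roughly the knobs involved (the values here are illustrative, not my exact settings):

# auto-GC only runs at all when the daemon is started with --enable-gc
$ ipfs daemon --enable-gc
# repo size limit, and the watermark (in percent of StorageMax) at which auto-GC triggers
$ ipfs config Datastore.StorageMax 100GB
$ ipfs config --json Datastore.StorageGCWatermark 90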


The GC was run after the daemon was upgraded, while the cluster daemon was turned off.

So no operations should have been running against the IPFS daemon at that point.

After the update and the GC run, normal cluster operation resumed.

I can't say exactly what the cluster was doing at the time this happened, but the daemon got stuck multiple times when these commands were run. So I'm not sure how the block could disappear between the two operations.

Tbh I think the MFS is corrupted since I'm now no longer able to start the daemon, see #7845.

I've removed the startup timeout (which was 15 minutes) that had killed the daemon multiple times when it was reached. Even without the timeout, the daemon won't start within 24 hours on a pretty high-performance machine with flatfs storage on a fast SSD.
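
(For anyone with a similar setup where the daemon runs under systemd, dropping the start timeout boils down to an override along these lines; the unit name ipfs.service is just an example:)

$ sudo systemctl edit ipfs.service
# in the override file:
# [Service]
# TimeoutStartSec=infinity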

If GC were to run between:
$ ipfs cat bafykbzaceafqnjhgsdwwz3bm66y7ti7z6te4773vgh4pn4axiqoluxjgydmhc
and
$ ipfs files cp ...
then the block for bafykbzaceafqnjhgsdwwz3bm66y7ti7z6te4773vgh4pn4axiqoluxjgydmhc could have been removed, causing ipfs files cp ... to try to reacquire it from the IPFS network and wait for it.

The network cannot provide this block, since I've only just added this file with this content and am trying to move it within the MFS before recursively sharing the folder that contains it in the cluster.

@RubenKelevra
Contributor Author

RubenKelevra commented Jan 18, 2021

I forgot to make clear that cat is always successful: even after the files cp operation got stuck, I can still successfully cat the file.

I currently see three possible scenarios (a quick check to tell them apart is sketched below the list):

  • The two commands disagree on whether the block is locally available, because they take different code paths.

  • They run on different threads that hold different (cached) data about the availability of this block.

  • The MFS cp operation itself gets stuck creating the file, rather than the fetching of the block.
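
A quick check to tell the third scenario apart from the first two (a sketch; with the global --offline flag no network fetch is attempted, so a genuinely missing block should fail fast instead of hanging):

# should fail immediately if the daemon no longer considers the block local
$ ipfs --offline block stat bafykbzaceafqnjhgsdwwz3bm66y7ti7z6te4773vgh4pn4axiqoluxjgydmhc
# if this still hangs even offline, the lockup is in the MFS cp itself, not in fetching the block
$ ipfs --offline files cp /ipfs/bafykbzaceafqnjhgsdwwz3bm66y7ti7z6te4773vgh4pn4axiqoluxjgydmhc /x86-64.archlinux.pkg.pacman.store/lastsync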


Edit:

Since IPFS runs on ZFS here, I can snapshot the current state and then run some commands to try to work around it.

Additionally, it's all publicly available data, so I'm happy to share the IPFS database and the flatfs content as a tar.gz if that helps the debugging efforts, too.

@gammazero
Contributor

So far, I have not been able to reproduce this problem. It does appear that the cause is that MFS is somehow corrupted, particularly given the related issues. At this point, I think it would be useful to get your db and flat fs content -- if possible a minimum dataset that still exhibits the problem. Hopefully, the nature of any corruption found will give some indication of the possible cause.

@RubenKelevra
Contributor Author

@gammazero wrote:

So far, I have not been able to reproduce this problem. It does appear that the cause is that MFS is somehow corrupted, particularly given the related issues. At this point, I think it would be useful to get your db and flat fs content -- if possible a minimum dataset that still exhibits the problem. Hopefully, the nature of any corruption found will give some indication of the possible cause.

I've packed the whole ipfs folder and just removed the key files and the identity from the config.

The server providing it is a bit slow, hope this works for you.

/ipfs/QmVx4BqSsQnhiYdnLbqA3zCXzteXBb7hvj6rQXDfyqxRJ8

@bqv

bqv commented Jan 28, 2021

Will pin to my cluster, to help deliver :)

@bqv

bqv commented Jan 28, 2021

@RubenKelevra could you see if you can get a gateway to see it? My node's been searching for yours for ages now

@RubenKelevra
Contributor Author

RubenKelevra commented Jan 28, 2021

@bqv

@RubenKelevra could you see if you can get a gateway to see it? My node's been searching for yours for ages now

Just connect to the node, I guess with ipfs swarm connect <address>, since it is probably still in the providing step.

"Addresses": [
  "/ip4/94.176.233.122/tcp/443/p2p/QmVBVA4wNqXXqLWjft8WWf2YLNH2xdq2iCVjGS9dPtA6JJ",
  "/ip4/94.176.233.122/udp/443/quic/p2p/QmVBVA4wNqXXqLWjft8WWf2YLNH2xdq2iCVjGS9dPtA6JJ",
  "/ip6/2a02:7b40:5eb0:e97a::1/tcp/443/p2p/QmVBVA4wNqXXqLWjft8WWf2YLNH2xdq2iCVjGS9dPtA6JJ",
  "/ip6/2a02:7b40:5eb0:e97a::1/udp/443/quic/p2p/QmVBVA4wNqXXqLWjft8WWf2YLNH2xdq2iCVjGS9dPtA6JJ"
]
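
For example, using the first address above (any of them should work, this is just a sketch):

$ ipfs swarm connect /ip4/94.176.233.122/tcp/443/p2p/QmVBVA4wNqXXqLWjft8WWf2YLNH2xdq2iCVjGS9dPtA6JJ
$ ipfs pin add --progress QmVx4BqSsQnhiYdnLbqA3zCXzteXBb7hvj6rQXDfyqxRJ8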

I'll also run ipfs dht provide manually on the root ids, so it should be available soon. :)

@bqv

bqv commented Jan 28, 2021

Got it. Pinned!

@bqv

bqv commented Jan 28, 2021

@RubenKelevra correct me if I'm wrong but is your repo 140GB?! I may have to pin it on only one machine, and temporarily at best, if so

@RubenKelevra
Contributor Author

RubenKelevra commented Jan 28, 2021

@RubenKelevra correct me if I'm wrong but is your repo 140GB?! I may have to pin it on only one machine, and temporarily at best, if so

yeah, it's around that size - that's why I had to put it on a slower server ;D

You can just pin it and unpin it again afterwards; we only need it to help provide the data a bit faster :)

@alexandreteles

alexandreteles commented May 23, 2021

I just ran into this problem as well, trying to copy files from libgen into my local storage using ipfs files cp /ipfs/<cid> /file.pdf. ipfs diag cmds -v shows the files/cp command as running (for minutes, even for small files). Even if I let it run for half an hour it still won't copy the file, and a daemon restart is required. After the restart I can run the files cp command again and it fetches the file in a couple of seconds, but after two or three successful copies it hangs again, requiring another restart. I'm on Windows 10, running v0.8.0 with plenty of space for the local storage.

EDIT: for clarification, if I restart the daemon and wait a couple of minutes before making a copy, it will hang. Copies apparently only work for me right after the restart.

EDIT 2: after some tests, I've noticed that sometimes copies are impossible even right after a daemon restart. Maybe this is a problem with the transport protocols? Is there a way to deactivate protocols like QUIC so I can test and report on it?

Thank you!

@schomatis
Contributor

Probably related to #6113.

@schomatis schomatis self-assigned this Dec 23, 2021
@RubenKelevra
Contributor Author

@schomatis maybe just block any ipfs files ... operations until a GC run has completed? While the GC is running, everything is extremely slow anyway, so blocking these operations completely might actually help here.

@RubenKelevra
Contributor Author

RubenKelevra commented Dec 23, 2021

I decided to remove the automatic GC and instead do it after completing my loop of tasks: if the repo gets too big, I run a GC manually:

https://github.com/RubenKelevra/rsync2ipfs-cluster/blob/1fd9712371f0315a35a80e9680340655ba751d7a/bin/rsync2cluster.sh#L659
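
In essence the check boils down to something like this (a simplified sketch, not the exact code from the linked script; the threshold is illustrative):

# after the sync loop: run a manual GC only once the repo exceeds a size limit
repo_size=$(ipfs repo stat --size-only | awk '/RepoSize/ {print $2}')
limit=$((100 * 1024 * 1024 * 1024))  # 100 GiB, illustrative
if [ "$repo_size" -gt "$limit" ]; then
    ipfs repo gc
fi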
