
lockup while running files cp #7844

Open
RubenKelevra opened this issue Dec 26, 2020 · 16 comments
Assignees
Labels
kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization

Comments

@RubenKelevra
Contributor

RubenKelevra commented Dec 26, 2020

Version information:

go-ipfs version: 0.9.0-dev-2ed925442
Repo version: 11
System version: amd64/linux
Golang version: go1.15.6

Description:

I have a script that had been running successfully for 160 days. I have now updated to 0.8-rc1 and run the garbage collection, and it seems that the garbage collection damaged the MFS.

When I run the following commands, the IPFS API call just gets stuck:

$ ipfs files rm /x86-64.archlinux.pkg.pacman.store/lastsync
$ ipfs cat bafykbzaceafqnjhgsdwwz3bm66y7ti7z6te4773vgh4pn4axiqoluxjgydmhc
1608780003
$ ipfs files cp /ipfs/bafykbzaceafqnjhgsdwwz3bm66y7ti7z6te4773vgh4pn4axiqoluxjgydmhc /x86-64.archlinux.pkg.pacman.store/lastsync

I aborted it after 3 minutes.

Btw: it would be nice if there were a timeout for such operations, so that the call at least fails with a timeout on its own instead of hanging indefinitely.
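
As a client-side workaround, the call can at least be bounded, for example with the daemon's global --timeout option or a coreutils timeout wrapper (a sketch, not a fix for the underlying hang):

# abort client-side after 30 seconds instead of hanging indefinitely
$ ipfs --timeout=30s files cp /ipfs/bafykbzaceafqnjhgsdwwz3bm66y7ti7z6te4773vgh4pn4axiqoluxjgydmhc /x86-64.archlinux.pkg.pacman.store/lastsync
# or wrap the call externally
$ timeout 30s ipfs files cp /ipfs/bafykbzaceafqnjhgsdwwz3bm66y7ti7z6te4773vgh4pn4axiqoluxjgydmhc /x86-64.archlinux.pkg.pacman.store/lastsync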

@RubenKelevra RubenKelevra added kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization labels Dec 26, 2020
@RubenKelevra
Contributor Author

It isn't a database issue, since the issue is fixed as soon as the daemon is restarted.

It happened again. This time I collected the debug information and killed the daemon with SIGABRT to get the stack trace, but it was too long for the system log.

Hope that's helpful.

debug_info.tar.gz

@gammazero
Contributor

@RubenKelevra We are looking into this, and I have a couple of questions.

  1. When you reported, "It happened again, this time ...", was this after another upgrade attempt, or during normal operation with the ipfs daemon running?
  2. Was GC running when you saw this?

If GC were to run between:
$ ipfs cat bafykbzaceafqnjhgsdwwz3bm66y7ti7z6te4773vgh4pn4axiqoluxjgydmhc
and
$ ipfs files cp ...
then the block for bafykbzaceafqnjhgsdwwz3bm66y7ti7z6te4773vgh4pn4axiqoluxjgydmhc could have been removed, causing ipfs files cp ... to try to reacquire it from the IPFS network and wait for it.
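
If that is what happened, the race should be reproducible by forcing a GC between the two commands, roughly like this (assuming the block is not pinned and not yet referenced from MFS, so the GC is allowed to remove it):

$ ipfs cat bafykbzaceafqnjhgsdwwz3bm66y7ti7z6te4773vgh4pn4axiqoluxjgydmhc
# a manual GC may now remove the unreferenced block
$ ipfs repo gc
# this copy would then block while trying to refetch the block from the network
$ ipfs files cp /ipfs/bafykbzaceafqnjhgsdwwz3bm66y7ti7z6te4773vgh4pn4axiqoluxjgydmhc /x86-64.archlinux.pkg.pacman.store/lastsync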

@RubenKelevra
Contributor Author

@RubenKelevra We are looking into this, and I have a couple of questions.

Thanks!

  1. When you reported, "It happened again, this time ...", was this after another upgrade attempt, or during normal operation with the ipfs daemon running?

Normal operation. I've rebooted the machine because I did a minor kernel update.

  2. Was GC running when you saw this?

GC isn't expected to run, since there is plenty of free space configured for IPFS. But auto-GC is enabled and set to run once 90% of the configured storage is filled.
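
For reference, these are roughly the knobs involved (the values here are illustrative, not my exact settings):

# auto-GC only runs at all when the daemon is started with --enable-gc
$ ipfs daemon --enable-gc
# repo size limit, and the watermark (in percent of StorageMax) at which auto-GC triggers
$ ipfs config Datastore.StorageMax 100GB
$ ipfs config --json Datastore.StorageGCWatermark 90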


The GC was run after the daemon was upgraded, while the cluster daemon was turned off.

So no operations should have been running against the IPFS daemon at that point.

After the update and the GC run, normal cluster operation resumed.

I can't say exactly what the cluster was doing at the time this happened, but the daemon got stuck multiple times when these commands were run. So I'm not sure how the block could disappear between the two operations.

Tbh I think the MFS is corrupted since I'm now no longer able to start the daemon, see #7845.

I've removed the startup timeout (which was 15 minutes) that had killed the daemon multiple times when it was reached. Even without the timeout, the daemon won't start within 24 hours on a pretty high-performance machine with flatfs storage on a fast SSD.
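
(For anyone with a similar setup where the daemon runs under systemd, dropping the start timeout boils down to an override along these lines; the unit name ipfs.service is just an example:)

$ sudo systemctl edit ipfs.service
# in the override file:
# [Service]
# TimeoutStartSec=infinity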

If GC were to run between:
$ ipfs cat bafykbzaceafqnjhgsdwwz3bm66y7ti7z6te4773vgh4pn4axiqoluxjgydmhc
and
$ ipfs files cp ...
then the block for bafykbzaceafqnjhgsdwwz3bm66y7ti7z6te4773vgh4pn4axiqoluxjgydmhc could have been removed, causing ipfs files cp ... to try to reacquire it from the IPFS network and wait for it.

The network cannot provide this block, since I've only just added this file with this content and am trying to move it within the MFS before recursively sharing the folder that contains it in the cluster.

@RubenKelevra
Contributor Author

RubenKelevra commented Jan 18, 2021

I forgot to make clear that cat is always successful: even after the files cp operation got stuck, I can still successfully cat the file.

I currently see three possible scenarios (a quick check to tell them apart is sketched below the list):

  • The two commands disagree on whether the block is locally available, because they take different code paths.

  • They run on different threads that hold different (cached) data about the availability of this block.

  • The MFS cp operation itself gets stuck creating the file, rather than the fetching of the block.
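
A quick check to tell the third scenario apart from the first two (a sketch; with the global --offline flag no network fetch is attempted, so a genuinely missing block should fail fast instead of hanging):

# should fail immediately if the daemon no longer considers the block local
$ ipfs --offline block stat bafykbzaceafqnjhgsdwwz3bm66y7ti7z6te4773vgh4pn4axiqoluxjgydmhc
# if this still hangs even offline, the lockup is in the MFS cp itself, not in fetching the block
$ ipfs --offline files cp /ipfs/bafykbzaceafqnjhgsdwwz3bm66y7ti7z6te4773vgh4pn4axiqoluxjgydmhc /x86-64.archlinux.pkg.pacman.store/lastsync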


Edit:

Since IPFS runs on ZFS here, I can snapshot the current state and then run some commands to try to work around it.

Additionally, it's all publicly available data, so I'm happy to share the IPFS database and the flatfs content as a tar.gz if that helps the debugging efforts, too.

@gammazero
Contributor

So far, I have not been able to reproduce this problem. It does appear that the cause is that MFS is somehow corrupted, particularly given the related issues. At this point, I think it would be useful to get your db and flat fs content -- if possible a minimum dataset that still exhibits the problem. Hopefully, the nature of any corruption found will give some indication of the possible cause.

@RubenKelevra
Contributor Author

@gammazero wrote:

So far, I have not been able to reproduce this problem. It does appear that the cause is that MFS is somehow corrupted, particularly given the related issues. At this point, I think it would be useful to get your db and flat fs content -- if possible a minimum dataset that still exhibits the problem. Hopefully, the nature of any corruption found will give some indication of the possible cause.

I've packed the whole ipfs folder and just removed the key files and the identity from the config.

The server providing it is a bit slow, hope this works for you.

/ipfs/QmVx4BqSsQnhiYdnLbqA3zCXzteXBb7hvj6rQXDfyqxRJ8

@bqv

bqv commented Jan 28, 2021

Will pin to my cluster, to help deliver :)

@bqv

bqv commented Jan 28, 2021

@RubenKelevra could you see if you can get a gateway to see it? My node's been searching for yours for ages now

@RubenKelevra
Contributor Author

RubenKelevra commented Jan 28, 2021

@bqv

@RubenKelevra could you see if you can get a gateway to see it? My node's been searching for yours for ages now

Just connect to the node, I guess with ipfs swarm connect <address>, since it is probably still in the providing step.

"Addresses": [
  "/ip4/94.176.233.122/tcp/443/p2p/QmVBVA4wNqXXqLWjft8WWf2YLNH2xdq2iCVjGS9dPtA6JJ",
  "/ip4/94.176.233.122/udp/443/quic/p2p/QmVBVA4wNqXXqLWjft8WWf2YLNH2xdq2iCVjGS9dPtA6JJ",
  "/ip6/2a02:7b40:5eb0:e97a::1/tcp/443/p2p/QmVBVA4wNqXXqLWjft8WWf2YLNH2xdq2iCVjGS9dPtA6JJ",
  "/ip6/2a02:7b40:5eb0:e97a::1/udp/443/quic/p2p/QmVBVA4wNqXXqLWjft8WWf2YLNH2xdq2iCVjGS9dPtA6JJ"
]
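
For example, using the first address above (any of them should work, this is just a sketch):

$ ipfs swarm connect /ip4/94.176.233.122/tcp/443/p2p/QmVBVA4wNqXXqLWjft8WWf2YLNH2xdq2iCVjGS9dPtA6JJ
$ ipfs pin add --progress QmVx4BqSsQnhiYdnLbqA3zCXzteXBb7hvj6rQXDfyqxRJ8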

I'll also run ipfs dht provide manually on the root ids, so it should be available soon. :)

@bqv

bqv commented Jan 28, 2021

Got it. Pinned!

@bqv

bqv commented Jan 28, 2021

@RubenKelevra correct me if I'm wrong but is your repo 140GB?! I may have to pin it on only one machine, and temporarily at best, if so

@RubenKelevra
Contributor Author

RubenKelevra commented Jan 28, 2021

@RubenKelevra correct me if I'm wrong but is your repo 140GB?! I may have to pin it on only one machine, and temporarily at best, if so

yeah, it's around that size - that's why I had to put it on a slower server ;D

You can just pin it and unpin it again afterwards; we only need it to help provide the data a bit faster :)

@alexandreteles

alexandreteles commented May 23, 2021

I just ran into this problem as well, trying to copy files from libgen into my local storage using ipfs files cp /ipfs/<cid> /file.pdf. ipfs diag cmds -v shows the files/cp command as running (for minutes, even for small files). Even if I let it run for half an hour it still won't copy the file, and a daemon restart is required. After the restart I can run the files cp command again and it fetches the file in a couple of seconds, but after two or three successful copies it hangs again, requiring another restart. I'm on Windows 10, running v0.8.0 with plenty of space for the local storage.

EDIT: for clarification, if I restart the daemon and wait a couple of minutes before making a copy, it will hang. Copies apparently only work for me right after the restart.

EDIT 2: after some tests, I've noticed that sometimes copies are impossible even right after a daemon restart. Maybe this is a problem with the transport protocols? Is there a way to deactivate protocols like QUIC so I can test and report on it?

Thank you!

@schomatis
Contributor

Probably related to #6113.

@schomatis schomatis self-assigned this Dec 23, 2021
@RubenKelevra
Contributor Author

@schomatis maybe just block any ipfs files ... operations until a GC run has completed? While the GC is running, everything is extremely slow anyway, so blocking these operations completely might actually help here.

@RubenKelevra
Contributor Author

RubenKelevra commented Dec 23, 2021

I decided to remove the automatic GC and instead do it after completing my loop of tasks: if the repo gets too big, I run a GC manually:

https://github.com/RubenKelevra/rsync2ipfs-cluster/blob/1fd9712371f0315a35a80e9680340655ba751d7a/bin/rsync2cluster.sh#L659
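
In essence the check boils down to something like this (a simplified sketch, not the exact code from the linked script; the threshold is illustrative):

# after the sync loop: run a manual GC only once the repo exceeds a size limit
repo_size=$(ipfs repo stat --size-only | awk '/RepoSize/ {print $2}')
limit=$((100 * 1024 * 1024 * 1024))  # 100 GiB, illustrative
if [ "$repo_size" -gt "$limit" ]; then
    ipfs repo gc
fi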
