PostgreSQL process segfault when running drop_chunks #2143
Comments
Thanks for the detailed report with the stacktrace, @akamensky.
Added the ending of that backtrace. It appears a bit too deep, with repeated blocks. Seems to be an unexpected loop?
Thanks @akamensky for the backtrace. Is this the same situation as in #1986?
@mkindahl not sure what situation you are referring to. It is the same system, but a different environment where we see this one. We see this in our prod only (which is the busiest and most loaded environment, obviously). The setup is the same with the exception of how frequently we drop chunks -- in prod it is every 24 hours, so there is a lot of data that needs to be dropped all at once, while #1986 still happens in our staging environment where drop_chunks is called every hour.
@akamensky Since these are often race conditions of some sort, it is good to know what other jobs are running on the database. AIUI, you have a connection doing inserts at a high rate running in parallel with the drop_chunks calls?
@mkindahl as described in #1986, the database is written to by kafkaconnect-based applications. It is a continuous stream of writes. There are multiple databases on the host (same postgres process), each has multiple hypertables in it, and each hypertable is the destination of such a kafkaconnect application writing a stream of events. To give you an idea of the amount of writes -- disk writes are a constant flow of about 5 Gbps of data, nearly 24/7.
Happened once again in our prod environment, though the stacktrace looks similar (we are on 1.7.3 now):

[stack trace omitted]
And another one from just a bit earlier (the top of the stack looks a bit different, though):

[stack trace omitted]
And one more, also earlier:

[stack trace omitted]

And another one, also a bit earlier:

[stack trace omitted]
Today (last night, actually) this crashed our production timescaledb instance. It did not restart and stayed down overnight, so we have about 10 hours of data lost (as it was not written to the DB)...
We had a crash similar to the original post. It occurred just after we finished moving lots of old data to TimescaleDB hypertables. The crash seemingly occurred during the execution of an unrelated query. When we moved the data, we disabled the compression job, did …
Perhaps it's actually #2200 that wasn't fixed after all?
Thanks for the continued reports @akamensky @WGH-. Just to confirm: are you running 1.7.3? We had thought this was fixed in that release, but at a quick glance it seems this might be a slightly different issue, not triggered by drop_chunks. We believe the original issue was fixed by #2259 (which was in 1.7.3), particularly because cache_invalidate_callback() -> ts_extension_is_loaded() is part of the trace.
@mfreed we have all our timescale servers updated to 1.7.4. As you can see above (my last stack trace), it is still happening there. We have all DBs updated as per the documentation using "alter extension timescaledb update;". On every upgrade we always check that "\dx" shows the correct extension version and that the postgres logs don't show any warnings or errors.
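For reference, a minimal psql sketch of the update-and-verify steps described in the comment above (the `-X` flag and per-database loop are assumptions about the setup; TimescaleDB's upgrade docs generally recommend running the update as the first command in a fresh session):

```sql
-- Run against each database that has the extension installed, ideally in a
-- fresh session started with `psql -X` so the update is the first command.
ALTER EXTENSION timescaledb UPDATE;

-- Then confirm that the reported extension version matches the installed binaries.
\dx timescaledb
```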
Hit the same issue again in our production environment (missing all data over the weekend). If the issue is supposed to be fixed in 1.7.4 (and we are currently running 1.7.4 in production), is there any issue with the fix? Or do the published RPMs not contain this fix?
Inside the `cache_invalidate_callback`, the OID of the namespace is read using `get_namespace_oid`, which usually goes to the cache. However, if there is a cache miss inside that function, it will attempt to read the OID from the heap, which involves invalidating the cache, which further leads to `cache_invalidate_callback` being called recursively. This commit prevents infinite recursion by not doing a recursive call and instead considering the cache invalidated. This will allow `get_namespace_oid` (and other functions) to proceed with reading the OID from the heap tables, fill in the cache, and return it to callers. Closes timescale#2143
@akamensky If you have any ability to test against the source, it would be welcome =)
@mfreed I don't have the capacity to test against the source (I would need to set up a build env and spend time building this). I would appreciate it if there were either RPMs or drop-in patched binaries to test with.
Are you going to release a bugfix 1.7.5 or something? We're kinda hesitant to upgrade all the way to 2.x, fearing other regressions.
Relevant system information:
Describe the bug
Segfault when running drop_chunks on the entire database. Conditions on the system:
- drop_chunks is run by crond for the entire database (which effectively means sequentially for all hypertables in it)
- drop_chunks is executed once a day, at the same time, to retain only the last 24 hours of data (see the sketch after this list)
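A minimal sketch of what such a retention call could look like, assuming the 1.7.x API where, as I understand it, drop_chunks() invoked without a table name applies to all hypertables in the current database (the 24-hour interval comes from the description above; the 2.x API differs):

```sql
-- Hypothetical daily retention job, e.g. invoked from cron via psql.
-- On TimescaleDB 1.x, omitting the hypertable name makes drop_chunks()
-- operate on every hypertable in the current database.
SELECT drop_chunks(interval '24 hours');
```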
To Reproduce
No idea. It happens inconsistently.
Expected behavior
No segfault
Actual behavior
Segfault
Additional context
Saved core dump from crash: [details omitted]