Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Postgresql process segfault when running drop_chunks #2143

Closed
akamensky opened this issue Jul 28, 2020 · 19 comments · Fixed by #2628
Closed

Postgresql process segfault when running drop_chunks #2143

akamensky opened this issue Jul 28, 2020 · 19 comments · Fixed by #2628
Assignees
Labels
bug segfault Segmentation fault

Comments

@akamensky
Copy link

akamensky commented Jul 28, 2020

Relevant system information:

  • OS: Centos 7.5
  • PostgreSQL version: 12.3
  • TimescaleDB version: 1.7.2
  • Installation method: YUM

Describe the bug
Segfault when running drop_chunks on entire database. Conditions on system:

  1. Master/slave replication configured between two hosts
  2. Instance has multiple databases with hypertables in each
  3. drop_chunks is ran by crond for entire database (which sequentially means fo all hypertables in it)
  4. All drop_chunks executed once a day at the same time to retain only last 24 hours of data
  5. Data is being written all the time (even when running drop_chunks)

To Reproduce
No idea. It happens inconsistently

Expected behavior
No segfault

Actual behavior
Segfault

Additional context
Saved core dump from crash:

GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/pgsql-12/bin/postgres...Reading symbols from /usr/pgsql-12/bin/postgres...(no debugging symbols found)...done.
(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New LWP 20329]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: kafkaconnect dbname 10.2.50.250(43588) BI'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000000074bb4e in LockAcquireExtended ()
Missing separate debuginfos, use: debuginfo-install postgresql12-server-12.3-1PGDG.rhel7.x86_64
(gdb) where
#0  0x000000000074bb4e in LockAcquireExtended ()
#1  0x00000000007492b6 in LockRelationOid ()
#2  0x0000000000490615 in relation_open ()
#3  0x00000000004f5136 in table_open ()
#4  0x0000000000866e74 in SearchCatCacheMiss ()
#5  0x00000000008676c9 in SearchCatCache ()
#6  0x0000000000878746 in GetSysCacheOid ()
#7  0x000000000052ded5 in get_namespace_oid ()
#8  0x00007fe9af03d303 in extension_update_state () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#9  0x00007fe9af03d867 in ts_extension_is_loaded () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#10 0x00007fe9af028078 in cache_invalidate_callback () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#11 0x00000000008698ca in LocalExecuteInvalidationMessage ()
#12 0x0000000000746188 in ReceiveSharedInvalidMessages ()
#13 0x00000000007492c0 in LockRelationOid ()
#14 0x0000000000490615 in relation_open ()
#15 0x00000000004f5136 in table_open ()
#16 0x0000000000866e74 in SearchCatCacheMiss ()
#17 0x00000000008676c9 in SearchCatCache ()
#18 0x0000000000878746 in GetSysCacheOid ()
#19 0x000000000052ded5 in get_namespace_oid ()
#20 0x00007fe9af03d303 in extension_update_state () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#21 0x00007fe9af03d867 in ts_extension_is_loaded () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#22 0x00007fe9af028078 in cache_invalidate_callback () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#23 0x00000000008698ca in LocalExecuteInvalidationMessage ()
#24 0x0000000000746188 in ReceiveSharedInvalidMessages ()
#25 0x00000000007492c0 in LockRelationOid ()
#26 0x0000000000490615 in relation_open ()
#27 0x00000000004f5136 in table_open ()
#28 0x0000000000866e74 in SearchCatCacheMiss ()
#29 0x00000000008676c9 in SearchCatCache ()
#30 0x0000000000878746 in GetSysCacheOid ()
#31 0x000000000052ded5 in get_namespace_oid ()
#32 0x00007fe9af03d303 in extension_update_state () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#33 0x00007fe9af03d867 in ts_extension_is_loaded () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#34 0x00007fe9af028078 in cache_invalidate_callback () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#35 0x00000000008698ca in LocalExecuteInvalidationMessage ()
#36 0x0000000000746188 in ReceiveSharedInvalidMessages ()
#37 0x00000000007492c0 in LockRelationOid ()
#38 0x0000000000490615 in relation_open ()
#39 0x00000000004f5136 in table_open ()
#40 0x0000000000866e74 in SearchCatCacheMiss ()
#41 0x00000000008676c9 in SearchCatCache ()
#42 0x0000000000878746 in GetSysCacheOid ()
#43 0x000000000052ded5 in get_namespace_oid ()
#44 0x00007fe9af03d303 in extension_update_state () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#45 0x00007fe9af03d867 in ts_extension_is_loaded () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#46 0x00007fe9af028078 in cache_invalidate_callback () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#47 0x00000000008698ca in LocalExecuteInvalidationMessage ()
#48 0x0000000000746188 in ReceiveSharedInvalidMessages ()
#49 0x00000000007492c0 in LockRelationOid ()
#50 0x0000000000490615 in relation_open ()
#51 0x00000000004f5136 in table_open ()
#52 0x0000000000866e74 in SearchCatCacheMiss ()
#53 0x00000000008676c9 in SearchCatCache ()
#54 0x0000000000878746 in GetSysCacheOid ()
#55 0x000000000052ded5 in get_namespace_oid ()
#56 0x00007fe9af03d303 in extension_update_state () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#57 0x00007fe9af03d867 in ts_extension_is_loaded () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#58 0x00007fe9af028078 in cache_invalidate_callback () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#59 0x00000000008698ca in LocalExecuteInvalidationMessage ()
#60 0x0000000000746188 in ReceiveSharedInvalidMessages ()
#61 0x00000000007492c0 in LockRelationOid ()
---Type <return> to continue, or q <return> to quit---
#62 0x0000000000490615 in relation_open ()
#63 0x00000000004f5136 in table_open ()
#64 0x0000000000866e74 in SearchCatCacheMiss ()
#65 0x00000000008676c9 in SearchCatCache ()
#66 0x0000000000878746 in GetSysCacheOid ()
#67 0x000000000052ded5 in get_namespace_oid ()
#68 0x00007fe9af03d303 in extension_update_state () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#69 0x00007fe9af03d867 in ts_extension_is_loaded () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#70 0x00007fe9af028078 in cache_invalidate_callback () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#71 0x00000000008698ca in LocalExecuteInvalidationMessage ()
#72 0x0000000000746188 in ReceiveSharedInvalidMessages ()
#73 0x00000000007492c0 in LockRelationOid ()
#74 0x0000000000490615 in relation_open ()
#75 0x00000000004f5136 in table_open ()
#76 0x0000000000866e74 in SearchCatCacheMiss ()
#77 0x00000000008676c9 in SearchCatCache ()
#78 0x0000000000878746 in GetSysCacheOid ()
#79 0x000000000052ded5 in get_namespace_oid ()
#80 0x00007fe9af03d303 in extension_update_state () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#81 0x00007fe9af03d867 in ts_extension_is_loaded () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#82 0x00007fe9af028078 in cache_invalidate_callback () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#83 0x00000000008698ca in LocalExecuteInvalidationMessage ()
#84 0x0000000000746188 in ReceiveSharedInvalidMessages ()
#85 0x00000000007492c0 in LockRelationOid ()
#86 0x0000000000490615 in relation_open ()
#87 0x00000000004f5136 in table_open ()
#88 0x0000000000866e74 in SearchCatCacheMiss ()
#89 0x00000000008676c9 in SearchCatCache ()
#90 0x0000000000878746 in GetSysCacheOid ()
#91 0x000000000052ded5 in get_namespace_oid ()
#92 0x00007fe9af03d303 in extension_update_state () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#93 0x00007fe9af03d867 in ts_extension_is_loaded () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#94 0x00007fe9af028078 in cache_invalidate_callback () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#95 0x00000000008698ca in LocalExecuteInvalidationMessage ()
#96 0x0000000000746188 in ReceiveSharedInvalidMessages ()
#97 0x00000000007492c0 in LockRelationOid ()
#98 0x0000000000490615 in relation_open ()
#99 0x00000000004f5136 in table_open ()
#100 0x0000000000866e74 in SearchCatCacheMiss ()
#101 0x00000000008676c9 in SearchCatCache ()
#102 0x0000000000878746 in GetSysCacheOid ()
#103 0x000000000052ded5 in get_namespace_oid ()
#104 0x00007fe9af03d303 in extension_update_state () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#105 0x00007fe9af03d867 in ts_extension_is_loaded () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#106 0x00007fe9af028078 in cache_invalidate_callback () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#107 0x00000000008698ca in LocalExecuteInvalidationMessage ()
#108 0x0000000000746228 in ReceiveSharedInvalidMessages ()
#109 0x00000000007492c0 in LockRelationOid ()
#110 0x0000000000490615 in relation_open ()
#111 0x00000000004f5136 in table_open ()
#112 0x0000000000866e74 in SearchCatCacheMiss ()
#113 0x00000000008676c9 in SearchCatCache ()
#114 0x0000000000878746 in GetSysCacheOid ()
#115 0x000000000052ded5 in get_namespace_oid ()
#116 0x00007fe9af03d303 in extension_update_state () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#117 0x00007fe9af03d867 in ts_extension_is_loaded () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#118 0x00007fe9af028078 in cache_invalidate_callback () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#119 0x00000000008698ca in LocalExecuteInvalidationMessage ()
#120 0x0000000000746188 in ReceiveSharedInvalidMessages ()
#121 0x00000000007492c0 in LockRelationOid ()
#122 0x0000000000490615 in relation_open ()
#123 0x00000000004f5136 in table_open ()
---Type <return> to continue, or q <return> to quit---


AND SO ON UNTIL:

#102956 0x00007fe9af03d303 in extension_update_state () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#102957 0x00007fe9af03d867 in ts_extension_is_loaded () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#102958 0x00007fe9af028078 in cache_invalidate_callback () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#102959 0x00000000008698ca in LocalExecuteInvalidationMessage ()
#102960 0x0000000000746188 in ReceiveSharedInvalidMessages ()
#102961 0x00000000007492c0 in LockRelationOid ()
#102962 0x0000000000490615 in relation_open ()
#102963 0x00000000004f5136 in table_open ()
#102964 0x0000000000866e74 in SearchCatCacheMiss ()
#102965 0x00000000008676c9 in SearchCatCache ()
#102966 0x0000000000878746 in GetSysCacheOid ()
#102967 0x000000000052ded5 in get_namespace_oid ()
#102968 0x00007fe9af03d303 in extension_update_state () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#102969 0x00007fe9af03d71f in ts_extension_invalidate () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#102970 0x00007fe9af02806f in cache_invalidate_callback () from /usr/pgsql-12/lib/timescaledb-1.7.2.so
#102971 0x00000000008698ca in LocalExecuteInvalidationMessage ()
#102972 0x0000000000746228 in ReceiveSharedInvalidMessages ()
#102973 0x00000000007492c0 in LockRelationOid ()
#102974 0x0000000000490615 in relation_open ()
#102975 0x00000000004f5136 in table_open ()
#102976 0x0000000000866e74 in SearchCatCacheMiss ()
#102977 0x00000000008676c9 in SearchCatCache ()
#102978 0x0000000000878746 in GetSysCacheOid ()
#102979 0x000000000052ded5 in get_namespace_oid ()
---Type <return> to continue, or q <return> to quit---
#102980 0x000000000052e0ed in recomputeNamespacePath ()
#102981 0x0000000000531679 in OverrideSearchPathMatchesCurrent ()
#102982 0x000000000086df71 in RevalidateCachedQuery ()
#102983 0x000000000086e316 in GetCachedPlan ()
#102984 0x000000000076237f in PostgresMain ()
#102985 0x0000000000484022 in ServerLoop ()
#102986 0x00000000006f14c3 in PostmasterMain ()
#102987 0x0000000000484f23 in main ()

@mfreed
Copy link
Member

mfreed commented Jul 28, 2020

Thanks for the detailed report with the stacktrace, @akamensky .

@akamensky
Copy link
Author

Added ending of that bt. It appears a bit too deep with repeated blocks. Seems to be unexpected loop?

@mkindahl
Copy link
Contributor

Thanks @akamensky for the backtrace. Is this the same situation as in #1986?

@mkindahl mkindahl added bug segfault Segmentation fault labels Jul 28, 2020
@mkindahl mkindahl self-assigned this Jul 28, 2020
@akamensky
Copy link
Author

@mkindahl not sure what situation you are referring to. It is same system, but different environment where we see this one. We see this in our prod only (which is the busiest and most loaded environment obv). Setup is the same with the exception of how frequently we drop chunks -- in prod it is every 24 hours, so it is a lot of data that needs to be dropped all at once, while #1986 still happens in our staging environment where drop_chunks called every hour.

@mkindahl
Copy link
Contributor

@akamensky Since these are often race conditions of some sort, it is good to know what other jobs are running on the database. AIUI, you have an connection doing inserts at a high rate running in parallel with the drop_chunks?

@akamensky
Copy link
Author

@mkindahl as described in #1986 database is being written by kafkaconnect based applications. It is continuous stream of writes.

There are multiple databases on the host (same postgres process), each has multiple hypertables in it, each hypertable is destination of such kafkaconnect application doing writes of stream of events.

To give you an idea of amount of writes -- disk writes are constant flow of about 5 Gbps data.

That is nearly 24*7.

@akamensky
Copy link
Author

akamensky commented Sep 23, 2020

Happened once again in our Prod environment. Though stacktrace looks similar (we are on 1.7.3 now):

GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/pgsql-12/bin/postgres...Reading symbols from /usr/pgsql-12/bin/postgres...(no debugging symbols found)...done.
(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New LWP 37956]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: kafkaconnect db_name 10.2.50.239(39310) BI'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000000074bb8e in LockAcquireExtended ()
Missing separate debuginfos, use: debuginfo-install postgresql12-server-12.3-1PGDG.rhel7.x86_64
(gdb) bt
#0  0x000000000074bb8e in LockAcquireExtended ()
#1  0x00000000007492b6 in LockRelationOid ()
#2  0x0000000000490615 in relation_open ()
#3  0x00000000004f5136 in table_open ()
#4  0x000000000086f0e6 in ScanPgRelation ()
#5  0x0000000000870d04 in RelationBuildDesc ()
#6  0x0000000000871e7b in RelationClearRelation ()
#7  0x00000000008733bd in RelationIdGetRelation ()
#8  0x00000000004905e3 in relation_open ()
#9  0x00000000004f5136 in table_open ()
#10 0x0000000000866e74 in SearchCatCacheMiss ()
#11 0x00000000008676c9 in SearchCatCache ()
#12 0x0000000000878746 in GetSysCacheOid ()
#13 0x000000000052ded5 in get_namespace_oid ()
#14 0x00007f1b7180e49f in ts_catalog_table_info_init () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#15 0x00007f1b7180e694 in ts_catalog_get () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#16 0x00007f1b7180e19d in cache_invalidate_callback () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#17 0x00000000008698ca in LocalExecuteInvalidationMessage ()
#18 0x0000000000746188 in ReceiveSharedInvalidMessages ()
#19 0x00000000007492c0 in LockRelationOid ()
#20 0x0000000000490615 in relation_open ()
#21 0x00000000004f5136 in table_open ()
#22 0x000000000086f0e6 in ScanPgRelation ()
#23 0x0000000000870d04 in RelationBuildDesc ()
#24 0x0000000000871e7b in RelationClearRelation ()
#25 0x00000000008733bd in RelationIdGetRelation ()
#26 0x00000000004905e3 in relation_open ()
#27 0x00000000004f5136 in table_open ()
#28 0x0000000000866e74 in SearchCatCacheMiss ()
#29 0x00000000008676c9 in SearchCatCache ()
#30 0x0000000000878746 in GetSysCacheOid ()
#31 0x000000000052ded5 in get_namespace_oid ()
#32 0x00007f1b7180e49f in ts_catalog_table_info_init () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#33 0x00007f1b7180e694 in ts_catalog_get () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#34 0x00007f1b7180e19d in cache_invalidate_callback () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#35 0x00000000008698ca in LocalExecuteInvalidationMessage ()
#36 0x0000000000746188 in ReceiveSharedInvalidMessages ()
#37 0x00000000007492c0 in LockRelationOid ()
#38 0x0000000000490615 in relation_open ()
#39 0x00000000004f5136 in table_open ()
#40 0x000000000086f0e6 in ScanPgRelation ()
#41 0x0000000000870d04 in RelationBuildDesc ()
#42 0x0000000000871e7b in RelationClearRelation ()
#43 0x00000000008733bd in RelationIdGetRelation ()
#44 0x00000000004905e3 in relation_open ()
#45 0x00000000004f5136 in table_open ()
#46 0x0000000000866e74 in SearchCatCacheMiss ()

@akamensky
Copy link
Author

And another one just a bit earlier (the top of the stack looks a bit different tho):

GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/pgsql-12/bin/postgres...Reading symbols from /usr/pgsql-12/bin/postgres...(no debugging symbols found)...done.
(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New LWP 46031]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: kafkaconnect db_name 10.2.50.146(45478) BI'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000004df08f in _bt_first ()
Missing separate debuginfos, use: debuginfo-install postgresql12-server-12.3-1PGDG.rhel7.x86_64
(gdb) bt
#0  0x00000000004df08f in _bt_first ()
#1  0x00000000004dc0db in btgettuple ()
#2  0x00000000004d64b1 in index_getnext_tid ()
#3  0x00000000004d65fb in index_getnext_slot ()
#4  0x00000000004d5a01 in systable_getnext ()
#5  0x000000000086f125 in ScanPgRelation ()
#6  0x0000000000870d04 in RelationBuildDesc ()
#7  0x00000000008733e0 in RelationIdGetRelation ()
#8  0x00000000004905e3 in relation_open ()
#9  0x00000000004d5e86 in index_open ()
#10 0x00000000004d5951 in systable_beginscan ()
#11 0x0000000000866e99 in SearchCatCacheMiss ()
#12 0x00000000008676c9 in SearchCatCache ()
#13 0x0000000000878746 in GetSysCacheOid ()
#14 0x000000000052ded5 in get_namespace_oid ()
#15 0x00007f1b7180e49f in ts_catalog_table_info_init () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#16 0x00007f1b7180e694 in ts_catalog_get () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#17 0x00007f1b7180e19d in cache_invalidate_callback () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#18 0x00000000008698ca in LocalExecuteInvalidationMessage ()
#19 0x0000000000746188 in ReceiveSharedInvalidMessages ()
#20 0x00000000007492c0 in LockRelationOid ()
#21 0x0000000000490615 in relation_open ()
#22 0x00000000004f5136 in table_open ()
#23 0x0000000000870ef5 in RelationBuildDesc ()
#24 0x00000000008733e0 in RelationIdGetRelation ()
#25 0x00000000004905e3 in relation_open ()
#26 0x00000000004d5e86 in index_open ()
#27 0x00000000004d5951 in systable_beginscan ()
#28 0x0000000000866e99 in SearchCatCacheMiss ()
#29 0x00000000008676c9 in SearchCatCache ()
#30 0x0000000000878746 in GetSysCacheOid ()
#31 0x000000000052ded5 in get_namespace_oid ()
#32 0x00007f1b7180e49f in ts_catalog_table_info_init () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#33 0x00007f1b7180e694 in ts_catalog_get () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#34 0x00007f1b7180e19d in cache_invalidate_callback () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#35 0x00000000008698ca in LocalExecuteInvalidationMessage ()
#36 0x0000000000746188 in ReceiveSharedInvalidMessages ()
#37 0x00000000007492c0 in LockRelationOid ()
#38 0x0000000000490615 in relation_open ()
#39 0x00000000004f5136 in table_open ()

@akamensky
Copy link
Author

And one more also earlier:

#0  0x000000000088935d in hash_search_with_hash_value ()
#1  0x000000000074bbe7 in LockAcquireExtended ()
#2  0x00000000007492b6 in LockRelationOid ()
#3  0x0000000000490615 in relation_open ()
#4  0x00000000004f5136 in table_open ()
#5  0x0000000000866e74 in SearchCatCacheMiss ()
#6  0x00000000008676c9 in SearchCatCache ()
#7  0x0000000000878746 in GetSysCacheOid ()
#8  0x000000000052ded5 in get_namespace_oid ()
#9  0x00007f1b7180e49f in ts_catalog_table_info_init () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#10 0x00007f1b7180e694 in ts_catalog_get () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#11 0x00007f1b7180e19d in cache_invalidate_callback () from /usr/pgsql-12/lib/timescaledb-1.7.3.so

and another one also a bit earlier:

#0  0x00000000004f5131 in table_open ()
#1  0x0000000000866e74 in SearchCatCacheMiss ()
#2  0x00000000008676c9 in SearchCatCache ()
#3  0x0000000000878746 in GetSysCacheOid ()
#4  0x000000000052ded5 in get_namespace_oid ()
#5  0x00007f1b7180e49f in ts_catalog_table_info_init () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#6  0x00007f1b7180e694 in ts_catalog_get () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#7  0x00007f1b7180e19d in cache_invalidate_callback () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#8  0x00000000008698ca in LocalExecuteInvalidationMessage ()
#9  0x0000000000746188 in ReceiveSharedInvalidMessages ()
#10 0x00000000007492c0 in LockRelationOid ()
#11 0x0000000000490615 in relation_open ()
#12 0x00000000004f5136 in table_open ()
#13 0x0000000000866e74 in SearchCatCacheMiss ()
#14 0x00000000008676c9 in SearchCatCache ()
#15 0x0000000000878746 in GetSysCacheOid ()
#16 0x000000000052ded5 in get_namespace_oid ()
#17 0x00007f1b7180e49f in ts_catalog_table_info_init () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#18 0x00007f1b7180e694 in ts_catalog_get () from /usr/pgsql-12/lib/timescaledb-1.7.3.so
#19 0x00007f1b7180e19d in cache_invalidate_callback () from /usr/pgsql-12/lib/timescaledb-1.7.3.so

@akamensky
Copy link
Author

akamensky commented Oct 21, 2020

Today (last night actually) this crashed our production timescaledb instance. It did not restart and stayed down overnight, so we have about 10 hours data lost (as not written to DB)...

gdb /usr/pgsql-12/bin/postmaster /data/dump/core.postmaster.1603206004.43480 
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/pgsql-12/bin/postgres...Reading symbols from /usr/pgsql-12/bin/postgres...(no debugging symbols found)...done.
(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New LWP 43480]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: kafkaconnect db_name 10.2.50.146(51318) BI'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000000088935d in hash_search_with_hash_value ()
Missing separate debuginfos, use: debuginfo-install postgresql12-server-12.3-1PGDG.rhel7.x86_64
(gdb) where
#0  0x000000000088935d in hash_search_with_hash_value ()
#1  0x00000000007317aa in BufTableLookup ()
#2  0x000000000073419c in ReadBuffer_common ()
#3  0x0000000000734b30 in ReadBufferExtended ()
#4  0x00000000004d9af9 in _bt_getbuf ()
#5  0x00000000004d9f5f in _bt_getroot ()
#6  0x00000000004dea38 in _bt_search ()
#7  0x00000000004df701 in _bt_first ()
#8  0x00000000004dc0db in btgettuple ()
#9  0x00000000004d64b1 in index_getnext_tid ()
#10 0x00000000004d65fb in index_getnext_slot ()
#11 0x00000000004d5a01 in systable_getnext ()
#12 0x000000000086f125 in ScanPgRelation ()
#13 0x0000000000870d04 in RelationBuildDesc ()
#14 0x0000000000871e7b in RelationClearRelation ()
#15 0x00000000008733bd in RelationIdGetRelation ()
#16 0x00000000004905e3 in relation_open ()
#17 0x00000000004f5136 in table_open ()
#18 0x0000000000866e74 in SearchCatCacheMiss ()
#19 0x00000000008676c9 in SearchCatCache ()
#20 0x0000000000878746 in GetSysCacheOid ()
#21 0x000000000052ded5 in get_namespace_oid ()
#22 0x00007f33d461d4ef in ts_catalog_table_info_init () from /usr/pgsql-12/lib/timescaledb-1.7.4.so
#23 0x00007f33d461d6e4 in ts_catalog_get () from /usr/pgsql-12/lib/timescaledb-1.7.4.so
#24 0x00007f33d461d1ed in cache_invalidate_callback () from /usr/pgsql-12/lib/timescaledb-1.7.4.so
#25 0x00000000008698ca in LocalExecuteInvalidationMessage ()
#26 0x0000000000746188 in ReceiveSharedInvalidMessages ()
#27 0x00000000007492c0 in LockRelationOid ()
#28 0x0000000000490615 in relation_open ()
#29 0x00000000004f5136 in table_open ()
#30 0x0000000000870ef5 in RelationBuildDesc ()
#31 0x0000000000871e7b in RelationClearRelation ()
#32 0x00000000008733bd in RelationIdGetRelation ()
#33 0x00000000004905e3 in relation_open ()
#34 0x00000000004f5136 in table_open ()
#35 0x0000000000866e74 in SearchCatCacheMiss ()
#36 0x00000000008676c9 in SearchCatCache ()
#37 0x0000000000878746 in GetSysCacheOid ()
#38 0x000000000052ded5 in get_namespace_oid ()
#39 0x00007f33d461d4ef in ts_catalog_table_info_init () from /usr/pgsql-12/lib/timescaledb-1.7.4.so
#40 0x00007f33d461d6e4 in ts_catalog_get () from /usr/pgsql-12/lib/timescaledb-1.7.4.so
#41 0x00007f33d461d1ed in cache_invalidate_callback () from /usr/pgsql-12/lib/timescaledb-1.7.4.so
#42 0x00000000008698ca in LocalExecuteInvalidationMessage ()
#43 0x0000000000746188 in ReceiveSharedInvalidMessages ()
#44 0x00000000007492c0 in LockRelationOid ()
#45 0x0000000000490615 in relation_open ()

@WGH-
Copy link

WGH- commented Oct 22, 2020

We had a crash similar to the original post.

It occured just after we finished moving lots of old data to TimescaleDB hypertables. The crash seemingly occured during the execution on an unrelated query.

When we moved data, we disabled compression job, did INSERT INTO new_table SELECT * FROM old_table, did synchronous compress_chunk for old data, and restarted jobs.

0  0x000055e80bfd7c30 in ?? ()
#1  0x000055e80bfd85c1 in SearchCatCache ()
#2  0x000055e80bfeb3fe in GetSysCacheOid ()
#3  0x000055e80bc8fbd1 in get_namespace_oid ()
#4  0x00007f102af6ca0e in ts_catalog_get () from ./usr/lib/postgresql/11/lib/timescaledb-1.7.4.so
#5  0x00007f102af6c52d in ?? () from ./usr/lib/postgresql/11/lib/timescaledb-1.7.4.so
#6  0x000055e80bfda9bf in LocalExecuteInvalidationMessage ()
#7  0x000055e80bebad70 in ReceiveSharedInvalidMessages ()
#8  0x000055e80bebe2f2 in LockRelationOid ()
#9  0x000055e80bc218ed in relation_open ()
#10 0x000055e80bc21afe in heap_open ()
#11 0x000055e80bfd7cd1 in ?? ()
#12 0x000055e80bfd85c1 in SearchCatCache ()
#13 0x000055e80bfeb3fe in GetSysCacheOid ()
#14 0x000055e80bc8fbd1 in get_namespace_oid ()
#15 0x00007f102af6ca0e in ts_catalog_get () from ./usr/lib/postgresql/11/lib/timescaledb-1.7.4.so
#16 0x00007f102af6c52d in ?? () from ./usr/lib/postgresql/11/lib/timescaledb-1.7.4.so
#17 0x000055e80bfda9bf in LocalExecuteInvalidationMessage ()
#18 0x000055e80bebad70 in ReceiveSharedInvalidMessages ()
#19 0x000055e80bebe2f2 in LockRelationOid ()
#20 0x000055e80bc218ed in relation_open ()
#21 0x000055e80bc21afe in heap_open ()
#22 0x000055e80bfd7cd1 in ?? ()
...
#82237 0x000055e80bfd85c1 in SearchCatCache ()
#82238 0x000055e80bfeb3fe in GetSysCacheOid ()
#82239 0x000055e80bc8fbd1 in get_namespace_oid ()
#82240 0x00007f102af6ca0e in ts_catalog_get () from ./usr/lib/postgresql/11/lib/timescaledb-1.7.4.so
#82241 0x00007f102af6c52d in ?? () from ./usr/lib/postgresql/11/lib/timescaledb-1.7.4.so
#82242 0x000055e80bfda9bf in LocalExecuteInvalidationMessage ()
#82243 0x000055e80bebad70 in ReceiveSharedInvalidMessages ()
#82244 0x000055e80bebe2f2 in LockRelationOid ()
#82245 0x000055e80bc218ed in relation_open ()
#82246 0x000055e80bc21afe in heap_open ()
#82247 0x000055e80bfd7cd1 in ?? ()
#82248 0x000055e80bfd85c1 in SearchCatCache ()
#82249 0x000055e80bfeb3fe in GetSysCacheOid ()
#82250 0x000055e80bc8fbd1 in get_namespace_oid ()
#82251 0x00007f102af6ca0e in ts_catalog_get () from ./usr/lib/postgresql/11/lib/timescaledb-1.7.4.so
#82252 0x00007f102af6c52d in ?? () from ./usr/lib/postgresql/11/lib/timescaledb-1.7.4.so
#82253 0x000055e80bfda9bf in LocalExecuteInvalidationMessage ()
#82254 0x000055e80bebae18 in ReceiveSharedInvalidMessages ()
#82255 0x000055e80bc21a6f in relation_openrv_extended ()
#82256 0x000055e80bc21c61 in heap_openrv_extended ()
#82257 0x000055e80bcecfd9 in parserOpenTable ()
#82258 0x000055e80bcd853b in setTargetTable ()
#82259 0x000055e80bcb841e in transformStmt ()
#82260 0x000055e80bcb93d5 in transformTopLevelStmt ()
#82261 0x000055e80bcb9448 in parse_analyze ()
#82262 0x000055e80bed6787 in pg_analyze_and_rewrite ()
#82263 0x000055e80bed6ec8 in ?? ()
#82264 0x000055e80bed829f in PostgresMain ()
#82265 0x000055e80bbe4fd7 in ?? ()
#82266 0x000055e80be62692 in PostmasterMain ()
#82267 0x000055e80bbe63b5 in main ()

@WGH-
Copy link

WGH- commented Oct 22, 2020

Perhaps, it's actually #2200 that wasn't fixed after all?

@mfreed
Copy link
Member

mfreed commented Oct 29, 2020

Thanks for the continued reports @akamensky @WGH- . Just to confirm that you are running 1.7.3?

We had thought this was fixed in that, but a quick glance at it seems to be this might be a slightly different issue not triggered by ALTER EXTENSION ? Are your last 2 stack traces believed from similar issue?

We believe this was fixed here: #2259 (which was in 1.7.3), particularly because because cache_invalidate_callback() -> ts_extension_is_loaded() is part of the trace.

@akamensky
Copy link
Author

@mfreed we have all our timescale servers updated to 1.7.4. As you can see above (my last stack trace) it still is happening there. We have all DBs updated as per documentation using "alter extension timescaledb update;". On every upgrade we always check that "\dx" shows correct extension number and the postgres logs don't show any warnings or errors.

@akamensky
Copy link
Author

Hit the same issue again in our production environment (missing all data over the weekend). If the issue supposed to be fixed in 1.7.4 (and we are currently running 1.7.4 in production), is there any issue with fix? Or RPMs that are published do not contain this fix?

$ gdb /usr/pgsql-12/bin/postmaster core.postmaster.1604674805.20524 
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/pgsql-12/bin/postgres...Reading symbols from /usr/pgsql-12/bin/postgres...(no debugging symbols found)...done.
(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New LWP 20524]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: kafkaconnect dbname 10.2.50.239(48300) BI'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000867521 in SearchCatCache ()
Missing separate debuginfos, use: debuginfo-install postgresql12-server-12.3-1PGDG.rhel7.x86_64
(gdb) where
#0  0x0000000000867521 in SearchCatCache ()
#1  0x0000000000878746 in GetSysCacheOid ()
#2  0x000000000052ded5 in get_namespace_oid ()
#3  0x00007f041b9ea4ef in ts_catalog_table_info_init () from /usr/pgsql-12/lib/timescaledb-1.7.4.so
#4  0x00007f041b9ea6e4 in ts_catalog_get () from /usr/pgsql-12/lib/timescaledb-1.7.4.so
#5  0x00007f041b9ea1ed in cache_invalidate_callback () from /usr/pgsql-12/lib/timescaledb-1.7.4.so
#6  0x00000000008698ca in LocalExecuteInvalidationMessage ()
#7  0x0000000000746188 in ReceiveSharedInvalidMessages ()
#8  0x00000000007492c0 in LockRelationOid ()
#9  0x0000000000490615 in relation_open ()
#10 0x00000000004d5e86 in index_open ()
#11 0x00000000004d5951 in systable_beginscan ()
#12 0x000000000086f11a in ScanPgRelation ()
#13 0x0000000000870d04 in RelationBuildDesc ()
#14 0x00000000008733e0 in RelationIdGetRelation ()
#15 0x00000000004905e3 in relation_open ()
#16 0x00000000004f5136 in table_open ()
#17 0x0000000000866e74 in SearchCatCacheMiss ()
#18 0x00000000008676c9 in SearchCatCache ()
#19 0x0000000000878746 in GetSysCacheOid ()
#20 0x000000000052ded5 in get_namespace_oid ()
#21 0x00007f041b9ea4ef in ts_catalog_table_info_init () from /usr/pgsql-12/lib/timescaledb-1.7.4.so
#22 0x00007f041b9ea6e4 in ts_catalog_get () from /usr/pgsql-12/lib/timescaledb-1.7.4.so
#23 0x00007f041b9ea1ed in cache_invalidate_callback () from /usr/pgsql-12/lib/timescaledb-1.7.4.so
#24 0x00000000008698ca in LocalExecuteInvalidationMessage ()
#25 0x0000000000746188 in ReceiveSharedInvalidMessages ()
#26 0x00000000007492c0 in LockRelationOid ()
#27 0x0000000000490615 in relation_open ()
#28 0x00000000004d5e86 in index_open ()

@mkindahl
Copy link
Contributor

mkindahl commented Nov 9, 2020

Perhaps, it's actually #2200 that wasn't fixed after all?

It does indeed look the same as #2200 with the same recursion in the cache invalidation code. However, the recursion is taking a different code path compared to what #2259 added, so investigating what we can do to fix this.

mkindahl added a commit to mkindahl/timescaledb that referenced this issue Nov 9, 2020
Inside the `cache_invalidate_callback` the OID of the namespace is read
using `get_namespace_oid`, which usually goes to the cache. However, if
there is a cache miss inside that function, it will attempt to read the
OID from the heap, which involves invalidating the cache, which further
leads to `cache_invalidate_callback` being called recursively.

This commit prevent an infinite recursion by not doing a recursive call
and instead considering the cache invalidated. This will allow
`get_namespace_oid` (and other functions) to proceed with reading the
OID from the heap tables, fill in the cache, and return it to callers.

Closes timescale#2143
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Nov 9, 2020
Inside the `cache_invalidate_callback` the OID of the namespace is read
using `get_namespace_oid`, which usually goes to the cache. However, if
there is a cache miss inside that function, it will attempt to read the
OID from the heap, which involves invalidating the cache, which further
leads to `cache_invalidate_callback` being called recursively.

This commit prevent an infinite recursion by not doing a recursive call
and instead considering the cache invalidated. This will allow
`get_namespace_oid` (and other functions) to proceed with reading the
OID from the heap tables, fill in the cache, and return it to callers.

Closes timescale#2143
@mfreed
Copy link
Member

mfreed commented Nov 9, 2020

@akamensky If you have any ability to test against source, would be welcome =)

mkindahl added a commit to mkindahl/timescaledb that referenced this issue Nov 10, 2020
Inside the `cache_invalidate_callback` the OID of the namespace is read
using `get_namespace_oid`, which usually goes to the cache. However, if
there is a cache miss inside that function, it will attempt to read the
OID from the heap, which involves invalidating the cache, which further
leads to `cache_invalidate_callback` being called recursively.

This commit prevent an infinite recursion by not doing a recursive call
and instead considering the cache invalidated. This will allow
`get_namespace_oid` (and other functions) to proceed with reading the
OID from the heap tables, fill in the cache, and return it to callers.

Closes timescale#2143
mkindahl added a commit that referenced this issue Nov 10, 2020
Inside the `cache_invalidate_callback` the OID of the namespace is read
using `get_namespace_oid`, which usually goes to the cache. However, if
there is a cache miss inside that function, it will attempt to read the
OID from the heap, which involves invalidating the cache, which further
leads to `cache_invalidate_callback` being called recursively.

This commit prevent an infinite recursion by not doing a recursive call
and instead considering the cache invalidated. This will allow
`get_namespace_oid` (and other functions) to proceed with reading the
OID from the heap tables, fill in the cache, and return it to callers.

Closes #2143
erimatnor pushed a commit that referenced this issue Nov 11, 2020
Inside the `cache_invalidate_callback` the OID of the namespace is read
using `get_namespace_oid`, which usually goes to the cache. However, if
there is a cache miss inside that function, it will attempt to read the
OID from the heap, which involves invalidating the cache, which further
leads to `cache_invalidate_callback` being called recursively.

This commit prevent an infinite recursion by not doing a recursive call
and instead considering the cache invalidated. This will allow
`get_namespace_oid` (and other functions) to proceed with reading the
OID from the heap tables, fill in the cache, and return it to callers.

Closes #2143
erimatnor pushed a commit that referenced this issue Nov 12, 2020
Inside the `cache_invalidate_callback` the OID of the namespace is read
using `get_namespace_oid`, which usually goes to the cache. However, if
there is a cache miss inside that function, it will attempt to read the
OID from the heap, which involves invalidating the cache, which further
leads to `cache_invalidate_callback` being called recursively.

This commit prevent an infinite recursion by not doing a recursive call
and instead considering the cache invalidated. This will allow
`get_namespace_oid` (and other functions) to proceed with reading the
OID from the heap tables, fill in the cache, and return it to callers.

Closes #2143
@akamensky
Copy link
Author

@mfreed I don't have the capacity to test against the source (would need to setup build env and spend time building this). Would appreciate if there were either rpms, or drop-in patched binaries for this to test.

@WGH-
Copy link

WGH- commented Jan 29, 2021

Are you going to release a bug fix 1.7.5 or something? We're kinda hesitant to upgrade all the way to 2.x, fearing other regressions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug segfault Segmentation fault
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants