Commit a60921b

Troubleshooting: Rework section, add dedicated pages for jcmd, JFR, CFR
1 parent 2ddbe16 commit a60921b

8 files changed

+563
-232
lines changed

docs/admin/troubleshooting/cfr.md

+96
@@ -0,0 +1,96 @@
1+
(cfr)=
2+
# CrateDB Flight Recorder (CFR)
3+
4+
:::{rubric} About
5+
:::
6+
In a similar spirit to the [](#jfr), CFR helps collect information about
7+
CrateDB clusters for support requests and self-service debugging.
8+
9+
CFR is a utility application to acquire and export diagnostic information from
10+
CrateDB's [system tables](#systables) into an archive file. You can transmit
11+
this file to support engineers, in order to optimally convey relevant
12+
information about your cluster, mostly for debugging and troubleshooting
13+
purposes.
14+
15+
:::{rubric} Details
16+
:::
17+
The CrateDB Flight Recorder (CFR) is an ETL application dumping all database
18+
tables in the `sys` schema into a timestamped tarball archive file.
19+
On the receiving end, the recording can be imported into another CrateDB
20+
instance, in order to inspect and analyze it.
21+
22+
Flight recordings can be started against any running CrateDB cluster at runtime.
23+
The utility connects to CrateDB like a regular client, talking SQL.
24+
CFR is part of the CrateDB Toolkit (`ctk cfr`), and is also available as a
25+
standalone application `cratedb-cfr(.exe)`.
26+
27+
28+
## Synopsis
29+
30+
:Export:
31+
32+
`cratedb-cfr sys-export` invokes the export operation.
33+
34+
:Import:
35+
36+
`cratedb-cfr sys-import` invokes the import operation.
37+
38+
39+
## Install
40+
41+
Select one of the standalone application bundles, matching the platform
42+
and architecture of the corresponding system where you intend to run CFR.
43+
44+
::::{grid} 1 2 2 2
45+
46+
:::{grid-item-card} {material-outlined}`download_for_offline;1.4em` Linux x64
47+
:link: https://github.com/crate-workbench/cratedb-toolkit/actions/runs/9826830191/artifacts/1674929097
48+
:link-alt: CFR for Linux x64
49+
:padding: 0
50+
:class-title: sd-fs-5
51+
+++
52+
cratedb-cfr-linux-x64.zip
53+
:::
54+
55+
:::{grid-item-card} {material-outlined}`download_for_offline;1.4em` macOS x64
56+
:link: https://github.com/crate-workbench/cratedb-toolkit/actions/runs/9826830191/artifacts/1674929134
57+
:link-alt: CFR for macOS x64
58+
:padding: 0
59+
:class-title: sd-fs-5
60+
+++
61+
cratedb-cfr-macos-x64.zip
62+
:::
63+
64+
:::{grid-item-card} {material-outlined}`download_for_offline;1.4em` Windows x64
65+
:link: https://github.com/crate-workbench/cratedb-toolkit/actions/runs/9826830191/artifacts/1674930132
66+
:link-alt: CFR for Windows x64
67+
:padding: 0
68+
:class-title: sd-fs-5
69+
+++
70+
cratedb-cfr-windows-x64.zip
71+
:::
72+
73+
:::{grid-item-card} {material-outlined}`download_for_offline;1.4em` macOS ARM64
74+
:link: https://github.com/crate-workbench/cratedb-toolkit/actions/runs/9826830191/artifacts/1674927962
75+
:link-alt: CFR for macOS ARM64
76+
:padding: 0
77+
:class-title: sd-fs-5
78+
+++
79+
cratedb-cfr-macos-arm64.zip
80+
:::
81+
82+
::::
83+
84+
85+
86+
## Learn
87+
88+
:::{card} {material-outlined}`library_books;1.6em` CrateDB Cluster Flight Recorder (CFR)
89+
:link: ctk:cfr
90+
:link-type: ref
91+
Learn about the concepts of CFR, and how to use it.
92+
:::
93+
94+
95+
[Java Flight Recorder]: https://en.wikipedia.org/wiki/JDK_Flight_Recorder
96+
[jcmd]: https://docs.oracle.com/en/java/javase/17/docs/specs/man/jcmd.html

docs/admin/troubleshooting/crate-node.rst

+51-53
@@ -2,17 +2,17 @@
22

33
.. _use-crate-node:
44

5-
===============================================
6-
Troubleshooting with the ``crate-node`` command
7-
===============================================
5+
==========================
6+
The ``crate-node`` command
7+
==========================
88

9-
This document shows you how to troubleshoot CrateDB nodes with the
10-
`crate-node`_ command. Using this command, you can:
9+
Use the `crate-node`_ command to troubleshoot CrateDB cluster nodes.
10+
Using this command, you can:
1111

12-
* Repurpose nodes and clean up their old data
12+
* Repurpose nodes and clean up their old data.
1313
* Force the election of a master node (and the creation of a new cluster) in
14-
the event that you lose too many nodes to be able to form a quorum
15-
* Detach nodes from an old cluster so they can be moved to a new cluster
14+
the event that you lose too many nodes to be able to form a quorum.
15+
* Detach nodes from an old cluster so they can be moved to a new cluster.
1616

1717
.. rubric:: Table of contents
1818

@@ -28,38 +28,35 @@ This document shows you how to troubleshoot CrateDB nodes with the
2828
Repurpose a node
2929
================
3030

31+
.. rubric:: About
32+
3133
In a situation where you have irrecoverably lost the majority of the
3234
master-eligible nodes in a cluster, you may need to form a new cluster.
33-
3435
When forming a new cluster, you may have to change the `role`_ of one or more
3536
nodes. Changing the role of a node is referred to as *repurposing* a node.
3637

3738
Each node checks the contents of its :ref:`data path <crate-reference:conf-env>`
38-
at startup. If CrateDB
39-
discovers unexpected data, it will refuse to start. Specifically:
39+
at startup. If CrateDB discovers unexpected data, it will refuse to start.
40+
The specific rules are:
4041

4142
- Nodes configured with `node.data`_ set to ``false`` will refuse to start if
42-
they find any shard data at startup
43+
they find any shard data at startup.
4344

4445
- Nodes configured with both `node.master`_ set to ``false`` and `node.data`_
4546
set to ``false`` will refuse to start if they have any index metadata at
46-
startup
47+
startup.
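
The two startup rules above can be restated as a small predicate. This is only an illustrative sketch: the function name and its boolean parameters are hypothetical and are not part of CrateDB. It models just the two documented conditions, not the actual startup checks.

```python
def refuses_to_start(
    node_data: bool,
    node_master: bool,
    has_shard_data: bool,
    has_index_metadata: bool,
) -> bool:
    """Return True if a node with these settings would refuse to start.

    Hypothetical helper that mirrors only the two rules described above.
    """
    # Rule 1: node.data is false, but shard data is still on disk.
    if not node_data and has_shard_data:
        return True
    # Rule 2: node.master and node.data are both false,
    # but index metadata is still on disk.
    if not node_data and not node_master and has_index_metadata:
        return True
    return False


if __name__ == "__main__":
    # A former data node reconfigured as master-only, with leftover shards.
    print(refuses_to_start(node_data=False, node_master=True,
                           has_shard_data=True, has_index_metadata=True))
```

A node for which this predicate is true is the kind of node the ``repurpose`` command is meant to clean up.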
4748

4849
The `crate-node`_ :ref:`repurpose command <crate-reference:cli-crate-node-commands>`
49-
can help you clean up the necessary
50-
node data so that CrateDB can be restarted with a new role.
50+
can help you clean up the necessary node data, so that CrateDB can be restarted
51+
with a new role.
5152

52-
53-
Procedure
54-
---------
53+
.. rubric:: Procedure
5554

5655
To repurpose a node, first of all, you must stop the node.
57-
5856
Then, update the settings `node.data`_ and `node.master`_ in the ``crate.yml``
5957
:ref:`configuration file <crate-reference:config>` as needed.
60-
6158
The ``node.data`` and ``node.master`` settings can be configured in four
62-
different ways, each corresponding to a different type of node:
59+
different ways, each corresponding to a different type of node.
6360

6461
+-------------------+------------------------+-----------------------------+
6562
| Role | Configuration | After repurposing |
@@ -95,7 +92,7 @@ deleted (i.e., "cleaned up") after repurposing the node to that configuration.
9592
Before running the ``repurpose`` command, make sure that any data you want
9693
to keep is available on other nodes in the cluster.
9794

98-
Then, run the ``repurpose`` command:
95+
Then, invoke the ``repurpose`` command.
9996

10097
.. code-block:: console
10198
@@ -112,33 +109,36 @@ Then, run the ``repurpose`` command:
112109
Node successfully repurposed to master and no data.
113110
114111
As mentioned in the command output, you can pass in ``-v`` to get a more
115-
verbose output, like so:
112+
verbose output.
116113

117114
.. code-block:: console
118115
119116
sh$ ./bin/crate-node repurpose -v
120117
121-
Finally, start the node again.
122-
123-
The node has been successfully repurposed.
118+
Finally, start the node again. After that, the node has been successfully
119+
repurposed.
124120

125121

126122
.. _crate-node-unsafe-bootstrap:
127123

128124
Perform an unsafe cluster bootstrap
129125
===================================
130126

127+
.. rubric:: About
128+
131129
When communication is lost between one or more nodes in a cluster (e.g., during
132-
a *cluster partition*), the situation is assumed to be temporary and safeguards
130+
a `network partition`_), the situation is assumed to be temporary and safeguards
133131
exist to prevent the election of a master node unless a `quorum`_ can be
134132
established.
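
To make the quorum condition concrete, here is a minimal sketch. It assumes a quorum is a strict majority of the master-eligible nodes; the function names are hypothetical, and the real election logic tracks a voting configuration rather than a plain node count.

```python
def quorum_size(master_eligible_nodes: int) -> int:
    """Smallest strict majority of the master-eligible nodes (assumed model)."""
    if master_eligible_nodes < 1:
        raise ValueError("a cluster needs at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


def can_elect_master(reachable_nodes: int, master_eligible_nodes: int) -> bool:
    """True if enough master-eligible nodes are reachable to form a quorum."""
    return reachable_nodes >= quorum_size(master_eligible_nodes)


if __name__ == "__main__":
    # Three master-eligible nodes: two reachable is enough, one is not.
    print(can_elect_master(2, 3))
    print(can_elect_master(1, 3))
```

Under this model, a three-node cluster that permanently loses two master-eligible nodes can never reach quorum again.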
135133

136134
However, if the situation is permanent (i.e., you have irrecoverably lost a
137-
majority of the nodes in your cluster), you will need to force the election of
135+
majority of the nodes in your cluster), also known as a `split-brain`_ situation,
136+
you will need to force the election of
138137
a master. Forcing a master election without quorum is referred to as an *unsafe
139138
cluster bootstrap*.
140139

141-
The `crate-node`_ ``unsafe-bootstrap`` command can help you choose a new master
140+
The :ref:`unsafe-bootstrap command <crate-reference:cli-crate-node-commands>`
141+
can help you choose a new master
142142
node and subsequently perform an unsafe cluster bootstrap.
143143

144144
.. WARNING::
@@ -160,8 +160,7 @@ node and subsequently perform an unsafe cluster bootstrap.
160160
have access to the file system.
161161

162162

163-
Procedure
164-
---------
163+
.. rubric:: Procedure
165164

166165
Before you continue, you must stop all master-eligible nodes in the cluster.
167166

@@ -175,12 +174,11 @@ Before you continue, you must stop all master-eligible nodes in the cluster.
175174
Once all master-eligible nodes in the cluster have been stopped, you can
176175
manually select a new master.
177176

178-
To help you select a new master, the ``unsafe-bootstrap`` command returns
179-
information about the node cluster state as a pair of values in the form
180-
*(term, version)*.
181-
177+
To help you select a new master node, the ``unsafe-bootstrap`` command
178+
returns information about the node cluster state as a pair of values in the
179+
form *(term, version)*.
182180
You can gather this information (safely) by issuing the ``unsafe-bootstrap``
183-
command and answering "no" (``n``) at the confirmation prompt, like so:
181+
command and answering "no" (``n``) at the confirmation prompt.
184182

185183
.. code-block:: console
186184
@@ -211,8 +209,8 @@ value, select any one of them.
211209
that you elect a master node with the freshest state data. This, in turn,
212210
minimizes the potential for data loss and inconsistency.
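
The selection rule can be sketched in a few lines. This assumes you have noted the *(term, version)* pair reported on each master-eligible node by hand; the ``NodeState`` type, the function name, and the node names are hypothetical. The sketch prefers the highest term and breaks ties by the highest version.

```python
from typing import Iterable, NamedTuple


class NodeState(NamedTuple):
    """Cluster state noted for one node (hypothetical record type)."""
    name: str
    term: int
    version: int


def pick_bootstrap_candidate(states: Iterable[NodeState]) -> NodeState:
    """Pick the node with the freshest state: highest term, then version."""
    # Tuples compare element by element, so the term takes precedence.
    return max(states, key=lambda s: (s.term, s.version))


if __name__ == "__main__":
    noted = [
        NodeState("node-1", term=4, version=5),
        NodeState("node-2", term=4, version=12),
        NodeState("node-3", term=3, version=99),
    ]
    print(pick_bootstrap_candidate(noted).name)
```

If several nodes share the same highest pair, ``max`` returns the first one it encounters, which matches the guidance that any of them may be selected.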
213211

214-
Once you have selected a node to elect to master, run the ``unsafe-bootstrap``
215-
command on that node and answer yes (``y``) at the confirmation prompt:
212+
Once you have selected a node to elect to master, invoke the ``unsafe-bootstrap``
213+
command on that node and answer yes (``y``) at the confirmation prompt.
216214

217215
.. code-block:: console
218216
@@ -226,46 +224,45 @@ command on that node and answer yes (``y``) at the confirmation prompt:
226224
227225
Confirm [y/N] y
228226
229-
If the operation was successful, the command will output:
227+
If the operation was successful, the program will acknowledge it.
228+
**Note:** This success message indicates that the operation was completed.
229+
You may still experience data loss and inconsistencies.
230230

231231
.. code-block:: console
232232
233233
Master node was successfully bootstrapped
234234
235-
.. NOTE::
236-
237-
This success message indicates that the operation was completed. You may
238-
still experience data loss and inconsistencies.
239-
240-
Start the bootstrapped node and verify that it has started a new cluster with
235+
Now, start the bootstrapped node and verify that it has started a new cluster with
241236
one node and elected itself as the master.
242237

243238
Before you can add the rest of the nodes to the new cluster, you must detach
244239
them from the old cluster (see the :ref:`next section
245240
<crate-node-detach-cluster>`).
246241

247-
When that's done, start the nodes and verify that they join the new cluster.
242+
After that's done, start the nodes and verify that they join the new cluster.
248243

249244
.. NOTE::
250245

251246
Once the new cluster is up and running and all recoveries are complete, you
252-
are responsible for assessing the cluster for data loss and
253-
inconsistencies.
247+
are advised to assess the database for data loss and inconsistencies.
254248

255249

256250
.. _crate-node-detach-cluster:
257251

258252
Detach a node from its cluster
259253
==============================
260254

255+
.. rubric:: About
256+
261257
To protect nodes from inadvertently rejoining the wrong cluster (e.g., in the
262258
event of a network partition), each node binds to the first cluster it joins.
263259

264260
However, if a cluster has permanently failed (see the :ref:`previous section
265261
<crate-node-unsafe-bootstrap>`) you must detach nodes before you can move them
266262
to a new cluster.
267263

268-
The `crate-node`_ ``detach-cluster`` command can help you move a node to a new
264+
The :ref:`detach-cluster command <crate-reference:cli-crate-node-commands>`
265+
helps you move a node to a new
269266
cluster by resetting the cluster it is bound to (i.e., *detaching* it from its
270267
existing cluster).
271268

@@ -278,8 +275,7 @@ existing cluster).
278275
cluster bootstrap <crate-node-unsafe-bootstrap>`.
279276

280277

281-
Procedure
282-
---------
278+
.. rubric:: Procedure
283279

284280
To detach a node, run:
285281

@@ -293,7 +289,7 @@ To detach a node, run:
293289
294290
Confirm [y/N] y
295291
296-
You should see this:
292+
A corresponding message confirms success.
297293

298294
.. code-block:: console
299295
@@ -304,14 +300,16 @@ When the node is started again, it will be able to join a new cluster.
304300
.. NOTE::
305301

306302
You may also have to update the :ref:`discovery configuration
307-
<crate-reference:conf_discovery>` so that
303+
<crate-reference:conf_discovery>`, so that
308304
nodes are able to find the new cluster.
309305

310306

311307
.. _crate-node: https://cratedb.com/docs/crate/reference/en/latest/cli-tools.html#cli-crate-node
312308
.. _data path: https://cratedb.com/docs/crate/reference/en/latest/config/environment.html#application-variables
309+
.. _network partition: https://en.wikipedia.org/wiki/Network_partition
313310
.. _node.data: https://cratedb.com/docs/crate/reference/en/latest/config/node.html#node-types
314311
.. _node.master: https://cratedb.com/docs/crate/reference/en/latest/config/node.html#node-types
315312
.. _quorum: https://cratedb.com/docs/crate/reference/en/latest/concepts/clustering.html#master-node-election
316313
.. _role: https://cratedb.com/docs/crate/reference/en/latest/config/node.html#node-types
314+
.. _split-brain: https://en.wikipedia.org/wiki/Split-brain_(computing)
317315
.. _UUID: https://en.wikipedia.org/wiki/Universally_unique_identifier
