Skip to content

Commit a26c726

Browse files
committed
(PDB-5696) query.monitor: accommodate query socket reuse
Previously, any query a client issued after the first on a given connection might provoke a CancelledKeyException when it tried to register the query/socket/channel with the monitor because the selection key for a given channel for a given selector is unique, and a cancelled key is only cleaned up during the next call to select for the relevant selector. Any attempt to register a channel with a selector between the cancellation of the previous key for that channel and the clean up in the next select will provoke the CancelledKeyException. To address the issue, adjust the registration code to retry whenever it encounters that exception, until it succeeds. While it's retrying, the monitor select loop, which will have been woken up by the previous query's forget, will issue a new select call, removing the canceled key. Fixes: puppetlabs#3866
1 parent 173dd07 commit a26c726

File tree

2 files changed

+73
-8
lines changed

2 files changed

+73
-8
lines changed

src/puppetlabs/puppetdb/query/monitor.clj

+42-8
Original file line numberDiff line numberDiff line change
@@ -39,12 +39,16 @@
3939
some cases customizably) time out.
4040
4141
Every monitored query will have a SelectionKey associated with it,
42-
and that key should be canceled by the monitor code once the query
43-
has been abandoned or has timed out.
42+
The key is cancelled during forget, but won't be removed from the
43+
selector's cancelled set until the next call to select. During that
44+
time, another query on the same socket/connection could try to
45+
re-register the cancelled key. This will throw an exception, which
46+
we suppress and retry until the select loop finally removes the
47+
cancelled key, and we can re-register the socket.
4448
4549
Every monitored query may also have a postgres pid associated with
4650
it, and whenever it does, that pid should be terminated (in
47-
corrdination with the :terminated promise) once the query has been
51+
coordination with the :terminated promise) once the query has been
4852
abandoned or has timed out.
4953
5054
The terminated promise and :forget value coordinate between the
@@ -56,7 +60,20 @@
5660
The client socket monitoring depends on access to the jetty query
5761
response which (at least at the moment) provides indirect access to
5862
the java socket channel which can be read to determine whether the
59-
client is still connected."
63+
client is still connected.
64+
65+
The current implementation is completely incompatible with http
66+
\"pipelining\", but it looks like that is no longer a realistic
67+
concern:
68+
https://daniel.haxx.se/blog/2019/04/06/curl-says-bye-bye-to-pipelining/
69+
70+
If that turns out to be an incorrect assumption, then we'll have to
71+
reevaluate the implementation and/or feasibility of the monitoring.
72+
That's because so far, the only way we've found to detect a client
73+
disconnection is to attempt to read a byte. At the moment, that's
74+
acceptable because the client shouldn't be sending any data during
75+
the response (which of course wouldn't be true with pipelining,
76+
where it could be sending additional requests)."
6077

6178
(:require
6279
[clojure.tools.logging :as log]
@@ -331,12 +348,29 @@
331348
(.join thread))
332349
(not (.isAlive thread))))))
333350

351+
(defn- register-selector
352+
"Loops until the channel is registered with the selector while
353+
ignoring canceled key exceptions, which should only occur when a
354+
client is re-using the channel, and the previous query has called
355+
forget (and cancelled the key), and the main select loop hasn't
356+
removed it from the cancelled set yet. Returns the new
357+
SelectionKey."
358+
[channel selector ops]
359+
(or (try
360+
(.register ^SelectableChannel channel selector ops)
361+
(catch CancelledKeyException _))
362+
(do
363+
;; The competing forget should have already called wakeup.
364+
;; REVIEW: 10?
365+
(Thread/sleep 10)
366+
(recur channel selector ops))))
367+
334368
(defn stop-query-at-deadline-or-disconnect
335369
[{:keys [^Selector selector queries ^Thread thread] :as _monitor}
336370
id ^SelectableChannel channel deadline-ns db]
337371
(assert (.isAlive thread))
338372
(assert (instance? SelectableChannel channel))
339-
(let [select-key (.register channel selector SelectionKey/OP_READ)
373+
(let [select-key (register-selector channel selector SelectionKey/OP_READ)
340374
info {:query-id id
341375
:selection-key select-key
342376
:deadline-ns deadline-ns
@@ -355,7 +389,7 @@
355389
(-> prev
356390
(update :selector-keys assoc select-key info)
357391
(update :deadlines conj [[deadline-ns select-key] info]))))
358-
(.wakeup selector) ;; so it can recompute deadline
392+
(.wakeup selector) ;; to recompute deadline and start watching select-key
359393
select-key))
360394

361395
(defn register-pg-pid
@@ -386,8 +420,6 @@
386420
[{:keys [queries ^Selector selector ^Thread thread] :as _monitor}
387421
^SelectionKey select-key]
388422
(assert (.isAlive thread))
389-
;; After this some key methods will throw CancelledKeyException
390-
(.cancel select-key)
391423
(let [maybe-await-termination (atom nil)]
392424
(swap! queries
393425
(fn [{:keys [selector-keys] :as state}]
@@ -399,6 +431,8 @@
399431
(update :selector-keys assoc select-key info)
400432
(update :deadlines assoc [deadline-ns select-key] info)))
401433
state)))
434+
(.cancel select-key)
435+
(.wakeup selector) ;; clear out the cancelled keys (see stop-query-at-...)
402436
(if (= ::timeout (some-> @maybe-await-termination (deref 2000 ::timeout)))
403437
::timeout
404438
true)))

test/puppetlabs/puppetdb/query/monitor_test.clj

+31
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,7 @@
11
(ns puppetlabs.puppetdb.query.monitor-test
22
(:require
33
[clj-http.client :as http]
4+
[clojure.java.shell :refer [sh]]
45
[clojure.string :as str]
56
[clojure.test :refer :all]
67
[murphy :refer [try!]]
@@ -280,3 +281,33 @@
280281
(update :info dissoc :query-id)
281282
(update-in [:info :pg-pid] int?))
282283
summary)))))))
284+
285+
(deftest connection-reuse
286+
;; Test multiple queries over the same connection. Run the test
287+
;; multiple times because the first problem we encountered could
288+
;; only occur during a narrow window in the monitor loop if a new
289+
;; request came in between select invocations, after the key had
290+
;; been cancelled.
291+
;; https://github.com/puppetlabs/puppetdb/issues/3866
292+
(with-puppetdb nil
293+
(jdbc/with-db-transaction [] (add-certnames certnames))
294+
;; Just use curl, since it's trivally easy to get it to do what we
295+
;; need, and we already require it for test setup (via ext/). (It
296+
;; looks like both clj-http and the JDK HttpClient have more
297+
;; indirect control over reuse.)
298+
(let [nodes (-> (assoc *base-url* :prefix "/pdb/query/v4/nodes")
299+
base-url->str-with-prefix)
300+
cmd ["curl" "--no-progress-meter" "--show-error" "--fail-early"
301+
"--fail" nodes "-o" "/dev/null"
302+
"--next" "--fail" nodes "-o" "/dev/null"
303+
"--next" "--fail" nodes "-o" "/dev/null"]]
304+
(loop [i 0]
305+
(let [{:keys [exit out err]} (apply sh cmd)]
306+
(prn :loop [cmd exit err])
307+
(when (< i 10)
308+
(if (is (= 0 exit))
309+
(recur (inc i))
310+
(do
311+
(apply println "Failed:" cmd)
312+
(print out)
313+
(print err)))))))))

0 commit comments

Comments
 (0)