Latest Synapse 1.70.0rc1 (2022-10-19) release seems to have regressed /messages response time #14250
Labels
- A-Messages-Endpoint: /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill)
- A-Performance: Performance, both client-facing and admin-facing
- O-Occasional: Affects or can be seen by some users regularly or most users rarely
- S-Minor: Blocks non-critical functionality, workarounds exist.
- T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
- X-Regression: Something broke which worked on a previous release
I think we rolled out the latest Synapse 1.70.0rc1 (2022-10-19) to matrix.org this morning. Looking at the graphs (where the spike on the right is), I noticed there seems to be a big uptick in `/messages` response times. We should check whether this persists tomorrow or is just a blip.
https://grafana.matrix.org/d/dYoRgTgVz/messages-timing?orgId=1&from=now-4d&to=now
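
To double-check the uptick from the client side, here is a minimal timing sketch against the same endpoint; the homeserver URL, access token, and room ID below are placeholders rather than values from this issue, and it assumes an account that can read the room.

```python
import time

import requests

# Placeholders: substitute a real homeserver, access token, and room to benchmark.
HOMESERVER = "https://matrix-client.matrix.org"
ACCESS_TOKEN = "<access token>"
ROOM_ID = "!someroom:example.org"


def time_messages_request(limit: int = 500) -> float:
    """Time one /messages request paginating backwards, as in the benchmark below."""
    url = f"{HOMESERVER}/_matrix/client/v3/rooms/{ROOM_ID}/messages"
    start = time.monotonic()
    resp = requests.get(
        url,
        params={"dir": "b", "limit": limit},
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
        timeout=300,
    )
    elapsed = time.monotonic() - start
    resp.raise_for_status()
    print(f"HTTP {resp.status_code} in {elapsed:.1f}s, {len(resp.json().get('chunk', []))} events")
    return elapsed


if __name__ == "__main__":
    # Take a few samples to see whether the slowdown persists or was just a blip.
    for _ in range(3):
        time_messages_request()
```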

It looks like we're seeing a major regression in the before backfill processing time (synapse/handlers/federation.py#L197-L503), and some in the after backfill processing time (synapse/handlers/federation_event.py#L643-L657).

What changed?
See https://github.com/matrix-org/synapse/blob/develop/CHANGES.md#synapse-1700rc1-2022-10-19
Is it #13816?
One obvious culprit is #13816, but I would expect everything that it touches to show up in the after backfill processing timing. And the goal of that change is to save time, since we bail early if we are trying to process an event we know will fail.
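
As a rough illustration of that bail-early idea (this is not the actual code from #13816; the names here are hypothetical), the expected effect is that known-bad events are skipped before any expensive per-event work:

```python
from typing import Iterable, List, Set


def events_worth_processing(
    pulled_event_ids: Iterable[str],
    previously_failed_event_ids: Set[str],
) -> List[str]:
    """Hypothetical sketch: drop events we have already tried and failed to
    process, so the expensive per-event work only runs for events that might
    succeed. If anything, this should shrink the after backfill processing time."""
    return [
        event_id
        for event_id in pulled_event_ids
        if event_id not in previously_failed_event_ids
    ]
```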
My `/messages?dir=b&limit=500` benchmark request to Matrix HQ still takes 24s to complete (expected). The one thing that stood out to me was a new `recursively fetching redactions` span under `_process_pulled_events` in the trace.
Is it #14164?

This one seems suspicious, given the new `recursively fetching redactions` span I saw under `_process_pulled_events` in the Jaeger trace, which is work that `_get_events_from_db` does and which changed in this PR. And we removed a cache there. This would account for the after backfill processing time. Just smoke though; I don't know how it actually performs. We do end up running `_process_pulled_event` less, since we only do it for new events.
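
To make the cache concern concrete, here is an illustrative sketch rather than Synapse's `_get_events_from_db`; the `get_redaction_ids` lookup and the cache shape are hypothetical. The point is that without a shared cache, overlapping redaction chains get re-fetched from the database for every pulled event:

```python
from typing import Callable, Dict, List, Optional, Set


def fetch_redactions_recursively(
    event_id: str,
    get_redaction_ids: Callable[[str], List[str]],  # hypothetical DB lookup
    cache: Optional[Dict[str, List[str]]] = None,
    seen: Optional[Set[str]] = None,
) -> Set[str]:
    """Collect redactions that apply to `event_id`, then redactions of those
    redactions, and so on.

    With `cache` supplied, each lookup hits the database at most once across
    all pulled events; without it, the same lookups repeat for every event
    sharing part of a chain, which is roughly the worry about the removed cache.
    """
    seen = seen if seen is not None else set()
    if event_id in seen:
        return set()
    seen.add(event_id)

    if cache is not None and event_id in cache:
        redaction_ids = cache[event_id]
    else:
        redaction_ids = get_redaction_ids(event_id)  # one DB round trip
        if cache is not None:
            cache[event_id] = redaction_ids

    collected = set(redaction_ids)
    for redaction_id in redaction_ids:
        collected |= fetch_redactions_recursively(
            redaction_id, get_redaction_ids, cache, seen
        )
    return collected
```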
What else?

What about the before backfill processing time? Needs investigation.
I wish I had a trace that represented one of these long requests; it should be something longer than 180s. But we only recorded two traces over 120s in the past 2 days: https://jaeger.proxy.matrix.org/search?end=1666298780924000&limit=20&lookback=2d&maxDuration&minDuration=120s&operation=RoomMessageListRestServlet&service=matrix.org%20client_reader&start=1666125980924000
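
For pulling those long traces programmatically, something like this should work against the query-service HTTP API that backs the Jaeger UI (the `/api/traces` path and parameters mirror the search URL above; the host is assumed from that URL and any authentication in front of it is not shown):

```python
import requests

JAEGER = "https://jaeger.proxy.matrix.org"  # host taken from the search URL above


def find_slow_messages_traces(min_duration: str = "120s", lookback: str = "2d") -> list:
    """List long-running RoomMessageListRestServlet traces on the client_reader."""
    resp = requests.get(
        f"{JAEGER}/api/traces",
        params={
            "service": "matrix.org client_reader",
            "operation": "RoomMessageListRestServlet",
            "minDuration": min_duration,
            "lookback": lookback,
            "limit": 20,
        },
        timeout=30,
    )
    resp.raise_for_status()
    traces = resp.json().get("data", [])
    for trace in traces:
        longest = max(trace["spans"], key=lambda span: span["duration"])
        # Jaeger reports span durations in microseconds.
        print(trace["traceID"], f"{longest['duration'] / 1_000_000:.0f}s")
    return traces
```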
And they seem more related to `outgoing-federation-request` for `/backfill`, which we don't control (`RequestSendFailed`), and that wouldn't account for the before/after backfill processing times.