SYN-7062 — v2 Paginator Retry / Throttle Carry-over
This guide documents the follow-up to the
v2 export migration that
closes the page-level resilience gap on the v2 export path. The v1
BaseClient._list_all helper used to wrap every page fetch in a
_get_page_with_retry loop. The v2 cut-over inherited the cursor
paginator's single-attempt fetch instead, so a transient 5xx or
read timeout on page N would surface as Max retries exceeded and
fail the entire export. SYN-7062 carries the v1 behaviour over to the
v2 paginator while preserving full backward compatibility for every
existing caller.
The change is additive: every caller that does not explicitly
opt in keeps the previous single-attempt behaviour. The relaxed
BackendV2Client read timeout (15s → 30s) and the urllib3 backoff
factor on the shared BaseClient (1 → 2) are the only silent
behavioural shifts on the v2 path. The v1 BaseClient timeout default
itself stayed at 15s (review F-3: the relaxation is scoped to the v2
subclasses), while the backoff change applies to both v1 and v2
clients. The v2 timeout can be pinned back to 15s with the env vars
documented below.
TL;DR
- Page-level retry on the v2 paginator. SyncCursorPaginator / AsyncCursorPaginator accept max_retries / backoff_factor / retry_on_status / throttle_seconds keyword arguments. The retry loop covers ServerError(5xx) (default 502 / 503 / 504), ClientTimeoutError, and ClientConnectionError.
- 4-key pass-through on 23 v2 resources. Every v2 resource list() accepts page_retries / page_retry_backoff / throttle_seconds / retry_on_status and forwards them to the paginator constructor.
- _collect_then_bulk phase-1 channel via closure. Export handlers forward page_retries / page_retry_backoff / throttle_seconds directly into the v2.<resource>.list(...) closure (see _paginator_pass_through()). The cursor list step inherits the same retry / throttle posture as the bulk-fetch sequence. SYN-7062 review C-2 removed the four short-lived list_* parameters that the helper had been carrying as a Step-6 placeholder; they were never read.
- v2-scoped transport defaults. BackendV2Client / AsyncBackendV2Client default timeout.read 15s → 30s (review F-3, scoped to the v2 subclasses). urllib3 Retry.backoff_factor 1 → 2 with respect_retry_after_header=True and a hard cap (Retry-After ≤ 60s, review C-1, BoundedRetry).
- 503 surface (sync + async). requests.exceptions.RetryError (sync) maps to ServerError(503, ...) with __cause__ preserved. Review F-1 extends the symmetry to async: httpx.NetworkError / httpx.RemoteProtocolError (transient transport hiccups) now surface as ServerError(503) so the paginator's default retry_on_status=(502, 503, 504) catches them.
- Environment overrides. SYNAPSE_BACKEND_V2_TIMEOUT_READ / _CONNECT override the v2 client's per-process timeouts (review F-4, v2 scope only). Invalid or non-positive values fall back to the default with a warning log. The shared BaseClient / AsyncBaseClient do not read these env vars, so v1 caller SLAs are preserved.
Background — what regressed
The v1 export path used BaseClient._list_all's
_get_page_with_retry to absorb transient page failures. The
SYN-6919 cut-over PR
moved every export handler onto the v2 client's cursor paginator,
which fetched each page exactly once. In staging this surfaced as:
- Page 0 returns successfully, page N (N ≥ 1) receives 503 or a read timeout, the paginator raises immediately.
- BackendV2Client._request then converts the underlying urllib3.MaxRetryError into ServerError(500, ...) because RetryError falls through the generic RequestException branch. The backend's 503 signal never reaches the handler.
- The handler treats it as a hard failure and the entire export aborts: no partial progress, no recovery.
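The 500-versus-503 mix-up described above comes down to handler ordering. The following is a minimal sketch with stand-in exception classes (they are not the SDK's or requests' real classes), assuming the retry-exhaustion error is a subclass of the generic request error, as requests.exceptions.RetryError is of RequestException:

```python
class RequestError(Exception):
    """Stand-in for a generic transport error base class."""


class RetryExhaustedError(RequestError):
    """Stand-in for the 'transport retries exhausted' subclass."""


def map_status_generic_only(exc: Exception) -> int:
    """Pre-fix shape: only the generic branch exists, so everything maps to 500."""
    try:
        raise exc
    except RequestError:
        return 500


def map_status_specific_first(exc: Exception) -> int:
    """Post-fix shape: the specific branch is checked before the generic one."""
    try:
        raise exc
    except RetryExhaustedError:
        return 503
    except RequestError:
        return 500
```

With the specific branch in place, a retry-exhaustion error keeps its 503 meaning while other transport errors still map to 500.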
The operational mitigation
(SYNAPSE_FORCE_V1_EXPORT=1)
remains valid but forces every migrated caller back to v1, which
defeats the purpose of the cut-over. SYN-7062 restores the page-level
retry contract on the v2 path so the kill-switch can stay unset by
default.
Backward compatibility
Every existing caller keeps its current behaviour without a single line of code change:
- New paginator constructor keywords default to max_retries=0, throttle_seconds=0.0. The paginator behaves exactly as it did before this change when no opt-in is provided.
- New v2 resource list() keywords default to disabled / zero. client.v2_client.tasks.list(project=42) returns the same CursorPage it did before.
- _collect_then_bulk keeps its previous parameter contract for legacy callers (list_method / bulk_method / ids_per_batch / extract_id / throttle_seconds). The four short-lived list_* parameters added under Step 3 were removed in review C-2; they had no behaviour, and phase-1 retry/throttle is owned by the handler closure. A caller that defensively passed one of the list_* keys now sees TypeError instead of a silent no-op (no SDK-internal callsite ever used them).
- The BackendV2Client timeout bump (15s → 30s) extends, not shortens, the failure ceiling. Callers that explicitly passed timeout={...} keep their value verbatim. To pin the previous 15s default, set SYNAPSE_BACKEND_V2_TIMEOUT_READ=15 or pass timeout={'connect': 5, 'read': 15}. v1 callers via the plain BaseClient continue to default to read=15; the relaxation is intentionally scoped to the v2 path (review F-3).
- The new urllib3 backoff_factor=2 only affects timing during a retry cascade. Calls that never trigger transport retries are unaffected.
- RetryError → ServerError(503) is a refinement, not a regression. The previous ServerError(500) mapping was a side-effect of the generic RequestException branch swallowing it; callers that matched on status code 500 should now match on 503 (or treat both as transient).
If you encounter a behavioural change that is not covered by this section, please file a follow-up ticket.
Retry / throttle chain
A request on the v2 export path now flows through a layered retry / throttle chain. Each layer has its own opt-in surface; the caller picks the level that matches their concern.
Two retry budgets are layered:
- Transport retries: handled inside urllib3 / the BaseClient for connect-level and idempotent 5xx retries with backoff_factor=2. Exhaustion surfaces as ServerError(503).
- Page-level retries: handled inside _fetch_with_retry for the catch list above. Exhaustion re-raises the last exception unchanged.
The two budgets are independent. A single page may consume the full
urllib3 budget, return ServerError(503), and still be retried by
the page-level loop. That layering reproduces the v1
_get_page_with_retry semantics on the v2 path.
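The page-level budget can be pictured with a small standalone loop. This is a sketch under assumptions, not the SDK's _fetch_with_retry: StatusError and fetch_with_retry are hypothetical names, the sleep function is injectable so the loop can be exercised without waiting, and timeout / connection errors are left out for brevity.

```python
import time
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


class StatusError(Exception):
    """Hypothetical stand-in for an HTTP error that carries a status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def fetch_with_retry(
    fetch_page: Callable[[], T],
    max_retries: int = 0,
    backoff_factor: float = 2.0,
    retry_on_status: Iterable[int] = (502, 503, 504),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch_page up to 1 + max_retries times, sleeping between attempts."""
    retryable = frozenset(retry_on_status)
    for attempt in range(max_retries + 1):
        try:
            return fetch_page()
        except StatusError as exc:
            # Non-retryable status or exhausted budget: re-raise unchanged.
            if exc.status not in retryable or attempt == max_retries:
                raise
            sleep(backoff_factor * (2 ** attempt))
    raise AssertionError("unreachable: the loop always returns or raises")
```

Whatever fetch_page raises after its own transport-level retries is what this loop sees, which is why the two budgets stay independent. With max_retries=0 the loop reduces to a single attempt, matching the opt-in default.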
Page-level retry timeline
The page-level loop sleeps backoff_factor * (2 ** attempt) seconds
between attempts. With the recommended page_retries=3 /
page_retry_backoff=2.0 the worst-case extra wall time per page is
2 + 4 + 8 = 14 seconds before the loop gives up.
If page N consumes all four attempts (initial + 3 retries) and still
fails, the loop re-raises the last ServerError(503) unchanged so
the handler can surface a meaningful failure code instead of
Max retries exceeded.
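The arithmetic above can be checked with a few lines. This sketch assumes the attempt counter starts at 0, which is what makes the first sleep equal to the backoff factor itself; backoff_schedule is an illustrative helper, not an SDK function.

```python
from typing import List


def backoff_schedule(page_retries: int, page_retry_backoff: float) -> List[float]:
    """Sleep durations, in seconds, between successive attempts on one page."""
    return [page_retry_backoff * (2 ** attempt) for attempt in range(page_retries)]


schedule = backoff_schedule(page_retries=3, page_retry_backoff=2.0)
worst_case_extra_seconds = sum(schedule)
```

Throttle time and the duration of each failed attempt come on top of this figure, so the schedule is a lower bound on the wall time of a page that exhausts its budget.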
Using the new options
Direct paginator construction
For callers that build a paginator by hand (analytics scripts, custom export pipelines):
from synapse_sdk.clients.backend_v2 import BackendV2Client
from synapse_sdk.clients.backend_v2.pagination import SyncCursorPaginator
client = BackendV2Client('https://api.test.synapse.sh',
                         access_token='syn_...', tenant='acme')
# Page 0 inline; subsequent pages share the same retry policy.
for row in client.v2.tasks.list(
    project=42,
    list_all=True,
    page_retries=3,
    page_retry_backoff=2.0,
    throttle_seconds=0.2,
):
    process(row)
The pass-through keyword arguments are forwarded to the paginator's
constructor. The fourth key, retry_on_status, is omitted in this
example and defaults to (502, 503, 504); override it to add 429 or to
drop 502 if your backend treats it as a permanent error.
_collect_then_bulk (export handler pattern)
Export handlers continue to use _collect_then_bulk. The helper
now exposes a phase-1 channel so the cursor list step inherits the
same retry / throttle posture as the bulk-fetch sequence:
from synapse_sdk.plugins.actions._v2_switch import _collect_then_bulk
# Closure threads the three paginator pass-through keys to the resource
# list step. The handler's _paginator_pass_through() returns the same
# dict and is the SSOT for this pattern.
list_paginator_kwargs = self._paginator_pass_through()
# Resolves to {
#     'page_retries': self.EXPORT_PAGE_RETRIES,              # default 3
#     'page_retry_backoff': self.EXPORT_PAGE_RETRY_BACKOFF,  # default 2.0
#     'throttle_seconds': self.EXPORT_THROTTLE_SECONDS,      # 0.1 / 0.2
# }
rows = _collect_then_bulk(
    list_method=lambda: v2.tasks.list(
        list_all=True, **slim_params, **list_paginator_kwargs,
    ),
    bulk_method=lambda ids: v2.tasks.bulk_fetch(ids),
    ids_per_batch=self.EXPORT_PAGE_SIZE,
    # Phase 2 (inter-bulk-fetch) throttle. Phase 1 throttle is delivered
    # via the closure into v2.tasks.list above.
    throttle_seconds=self.EXPORT_THROTTLE_SECONDS,
)
Plugin authors that build their own _collect_then_bulk callsites
should follow the same closure pattern — drop throttle_seconds /
page_retries / page_retry_backoff / retry_on_status keys
from any caller-supplied slim_params first. The SDK ships a public
helper for this — strip_paginator_reserved_keys in
synapse_sdk.plugins.actions._v2_switch — applied across all five
v2-migrated callsites (3 export handlers, dataset/action.py,
to_task/steps/fetch_tasks.py; see SYN-7062 review C-3 / NEW-1).
This avoids the double-spread TypeError and surfaces caller misuse
via a WARNING log line.
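To illustrate why the keys have to be dropped first, here is a minimal standalone sketch. strip_reserved_sketch and list_tasks are hypothetical stand-ins written for this example; the real strip_paginator_reserved_keys helper may have a different signature and log format.

```python
import logging
from typing import Any, Dict, Tuple

logger = logging.getLogger(__name__)

# The four paginator pass-through keys named in this guide.
RESERVED_KEYS: Tuple[str, ...] = (
    "page_retries",
    "page_retry_backoff",
    "throttle_seconds",
    "retry_on_status",
)


def strip_reserved_sketch(params: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of params without paginator keys, warning if any were dropped."""
    dropped = sorted(key for key in params if key in RESERVED_KEYS)
    if dropped:
        logger.warning("dropping paginator keys from slim_params: %s", dropped)
    return {key: value for key, value in params.items() if key not in RESERVED_KEYS}


def list_tasks(**kwargs: Any) -> Dict[str, Any]:
    """Stand-in for a resource list() call; it only echoes its keyword arguments."""
    return kwargs


slim_params = {"project": 42, "throttle_seconds": 5.0}
paginator_kwargs = {"page_retries": 3, "throttle_seconds": 0.2}

# Calling list_tasks(**slim_params, **paginator_kwargs) directly would raise
# TypeError, because throttle_seconds is supplied twice.
merged = list_tasks(**strip_reserved_sketch(slim_params), **paginator_kwargs)
```

After stripping, the handler-owned value of throttle_seconds wins and the caller-supplied duplicate is reported through the log instead of crashing the export.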
The four list_* parameters that previously appeared on
_collect_then_bulk (list_retries / list_retry_backoff /
list_throttle_seconds / list_retry_on_status) were removed
under review C-2 — they were captured but never read, and the
phase-1 channel is owned exclusively by the handler closure.
Async paginator
The async paginator carries the same options. Pass them through
client.v2.tasks.list(...) exactly as in the sync example; the
keyword names are identical.
async with AsyncBackendV2Client(...) as client:
    async for row in client.v2.tasks.list(
        project=42,
        list_all=True,
        page_retries=3,
        throttle_seconds=0.2,
    ):
        await sink(row)
The async loop uses asyncio.sleep for backoff so it cooperates
with the surrounding event loop instead of blocking it.
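The cooperative backoff can be sketched as follows. This is an illustration only: fetch_with_retry_async is a hypothetical name, and the built-in ConnectionError / TimeoutError stand in for the SDK's own exception types.

```python
import asyncio
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def fetch_with_retry_async(
    fetch_page: Callable[[], Awaitable[T]],
    max_retries: int = 3,
    backoff_factor: float = 2.0,
    retryable: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
) -> T:
    """Await fetch_page up to 1 + max_retries times with non-blocking backoff."""
    for attempt in range(max_retries + 1):
        try:
            return await fetch_page()
        except retryable:
            if attempt == max_retries:
                raise
            # asyncio.sleep yields to the event loop, so other tasks keep running.
            await asyncio.sleep(backoff_factor * (2 ** attempt))
    raise AssertionError("unreachable: the loop always returns or raises")
```

Replacing asyncio.sleep with time.sleep in this loop would stall every other task on the same event loop for the full backoff duration.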
Environment overrides
| Variable | Effect | Default |
|---|---|---|
| SYNAPSE_BACKEND_V2_TIMEOUT_READ | Per-request read timeout (seconds) on BackendV2Client / AsyncBackendV2Client. Plain BaseClient ignores this (review F-4). | 30 |
| SYNAPSE_BACKEND_V2_TIMEOUT_CONNECT | Per-request connect timeout (seconds) on the v2 clients. Plain BaseClient ignores this. | 5 |
| SYNAPSE_FORCE_V1_EXPORT | Inherited from the v2 export migration. Forces every migrated caller back to the v1 path. Still functional and unaffected by SYN-7062. | unset |
Both timeout overrides parse as floats. Invalid input (non-numeric
or ≤ 0) emits a warning via synapse_sdk.clients.utils and
falls back to the default. Precedence (highest first):
- Explicit timeout= argument to the BaseClient / AsyncBaseClient constructor.
- Environment variable.
- Built-in default.
Use the environment channel for incident response (raise read timeout temporarily without redeploying) or local dev (lower the read timeout to surface staging slowness faster).
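The precedence and fallback rules can be sketched in a few lines. resolve_timeout is a hypothetical helper written for this example; the SDK's actual parsing code lives elsewhere and may differ in detail.

```python
import logging
import os
from typing import Mapping, Optional

logger = logging.getLogger(__name__)


def resolve_timeout(
    explicit: Optional[float],
    env_name: str,
    default: float,
    environ: Mapping[str, str] = os.environ,
) -> float:
    """Pick a timeout: explicit argument, then environment variable, then default."""
    if explicit is not None:
        return explicit
    raw = environ.get(env_name)
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        logger.warning("%s=%r is not numeric; using default %s", env_name, raw, default)
        return default
    # "not (value > 0)" rejects zero, negatives, and NaN in one comparison.
    if not (value > 0):
        logger.warning("%s=%r is not positive; using default %s", env_name, raw, default)
        return default
    return value
```

For example, resolve_timeout(None, 'SYNAPSE_BACKEND_V2_TIMEOUT_READ', 30.0) returns 15.0 when the variable is set to 15, and 30.0 when it is unset or invalid.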
Recommended defaults
| Workload | page_retries | page_retry_backoff | throttle_seconds | Notes |
|---|---|---|---|---|
| Small ad-hoc list (< 5 pages) | 0 | n/a | 0 | Opt-out keeps latency low. |
| Production export (Task / Assignment / GroundTruth) | 3 | 2.0 | 0.1–0.2 | Matches the v1 EXPORT_* defaults. Already wired in handlers.py. |
| Bulk analytics over an entire tenant | 3–5 | 2.0 | 0.2–0.5 | Larger throttle reduces backend load when the script runs unsupervised. |
| Streaming consumer (async) | 3 | 2.0 | 0.0 | Throttle only if you observe backend 503 storms; backoff covers transient failures by itself. |
These are starting points, not hard rules. Inspect the backend
Retry-After header on production incidents — urllib3 now respects
it, so the actual wall time may exceed the backoff_factor curve
when the backend issues explicit cooldowns.
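A simplified model of that interaction, assuming a Retry-After value takes precedence over the backoff curve when present and is capped at the 60 seconds mentioned in the TL;DR. transport_wait and the constant are illustrative names, not the SDK's BoundedRetry implementation.

```python
from typing import Optional

# Cap taken from this guide (Retry-After ≤ 60s); the name is illustrative.
RETRY_AFTER_CAP_SECONDS = 60.0


def transport_wait(backoff_seconds: float, retry_after: Optional[float]) -> float:
    """Seconds to wait before the next transport retry in this simplified model."""
    if retry_after is not None and retry_after > 0:
        # Honour the server's cooldown, but never beyond the hard cap.
        return min(retry_after, RETRY_AFTER_CAP_SECONDS)
    return backoff_seconds
```

Under this model a backend asking for a 30 second cooldown stretches a 4 second backoff step to 30 seconds, while a 600 second request is clamped to 60.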
Troubleshooting
ServerError(503, ...) raised directly to my caller
This means urllib3 exhausted its transport retry budget for a single
attempt. Inspect the __cause__ chain — it carries the original
RetryError and underlying MaxRetryError so you can read the
last response or connection failure. A direct 503 surface is
usually a sign that:
- The backend is genuinely overloaded (check operational dashboards).
- The endpoint returned a Retry-After header that exceeded the client-side budget.
- A network partition prevented all transport retries from succeeding.
If you want the page-level retry loop to absorb it, ensure
page_retries (or max_retries if you build the paginator
directly) is > 0.
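Walking the __cause__ chain needs no SDK support. The sketch below uses built-in exceptions as stand-ins for ServerError, RetryError and MaxRetryError; cause_chain is a hypothetical helper.

```python
from typing import Iterator, Optional


def cause_chain(exc: BaseException) -> Iterator[BaseException]:
    """Yield exc, then each exception linked through __cause__, outermost first."""
    seen = set()
    current: Optional[BaseException] = exc
    while current is not None and id(current) not in seen:
        seen.add(id(current))  # guard against accidental cycles
        yield current
        current = current.__cause__


try:
    try:
        raise TimeoutError("read timed out")
    except TimeoutError as low_level:
        # "raise ... from ..." is what populates __cause__ on the new exception.
        raise RuntimeError("transport retries exhausted") from low_level
except RuntimeError as err:
    names = [type(e).__name__ for e in cause_chain(err)]
```

In a real handler the same loop, applied to the caught ServerError, lists each wrapped error so the last response or connection failure can be logged.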
Page-level retry loop terminates on the first attempt
Check that retry_on_status covers the status code returned by the
backend. Defaults are (502, 503, 504) — codes outside that set
(for example 429) raise immediately. Pass a wider tuple if your
deployment uses 429 for soft rate limiting.
Async retries appear to deadlock
The async paginator uses asyncio.sleep for backoff. If the
surrounding event loop is starved (long synchronous CPU work, a
run_until_complete blocked by a sync call) the sleep will appear
to hang. Audit the calling task for synchronous bottlenecks.
Timeout default change broke a test fixture
The v2 client read-timeout default moved from 15s to 30s. Set
SYNAPSE_BACKEND_V2_TIMEOUT_READ=15 in the test environment or
pass timeout={'connect': 5, 'read': 15} explicitly when
constructing the client to pin the previous value.
Acceptance contract reference
The following spec lines from
specs/syn-7062-v2-paginator-retry-carryover/
are the source of truth for the behaviour above:
- requirements.md FR-1 ~ FR-6: paginator retry, resource pass-through, _collect_then_bulk channel, timeout / backoff defaults, 503 surface, regression guards.
- specs.md TS-1 ~ TS-6: exact signatures, env precedence, urllib3 config diff, exception ladder change.
- plans.md Step 1 ~ Step 7: atomic commit topology (structure → behaviour → docs).
Related
- v2 export migration: base cut-over PR. SYN-7062 is the follow-up that closes its page-level resilience gap.
- synapse_sdk/clients/backend_v2/INVENTORY.md: full endpoint catalogue for the 23 resources that gained the 4-key pass-through.
- synapse_sdk/clients/backend_v2/README.md: quick-start for building a paginator directly.