SYN-7062 — v2 Paginator Retry / Throttle Carry-over

This guide documents the follow-up to the v2 export migration that closes the page-level resilience gap on the v2 export path. The v1 BaseClient._list_all helper used to wrap every page fetch in a _get_page_with_retry loop. The v2 cut-over inherited the cursor paginator's single-attempt fetch instead, so a transient 5xx or read timeout on page N would surface as Max retries exceeded and fail the entire export. SYN-7062 carries the v1 behaviour over to the v2 paginator while preserving full backward compatibility for every existing caller.

The change is additive: every caller that does not explicitly opt in keeps the previous single-attempt behaviour. The longer BackendV2Client read timeout and the urllib3 backoff factor on the shared BaseClient are the only silent behavioural shifts on the v2 path. The v1 BaseClient timeout default itself stayed at 15s (review F-3: the relaxation is scoped to the v2 subclasses); the backoff change applies to both v1 and v2 clients. The v2 timeout change can be pinned back with the env vars documented below.

TL;DR

Page-level retry on the v2 paginator. SyncCursorPaginator / AsyncCursorPaginator accept max_retries / backoff_factor / retry_on_status / throttle_seconds keyword arguments. The retry loop covers ServerError(5xx) (default 502 / 503 / 504), ClientTimeoutError, and ClientConnectionError.
4-key pass-through on 23 v2 resources. Every v2 resource list() accepts page_retries / page_retry_backoff / throttle_seconds / retry_on_status and forwards them to the paginator constructor.
_collect_then_bulk phase-1 channel via closure. Export handlers forward page_retries / page_retry_backoff / throttle_seconds directly into the v2.<resource>.list(...) closure (see _paginator_pass_through()). The cursor list step inherits the same retry / throttle posture as the bulk-fetch sequence. SYN-7062 review C-2 removed the four short-lived list_* parameters that the helper had been carrying as a Step-6 placeholder — they were never read.
v2-scoped transport defaults. BackendV2Client / AsyncBackendV2Client default timeout.read 15s → 30s (review F-3: scoped to the v2 subclasses). urllib3 Retry.backoff_factor 1 → 2 with respect_retry_after_header=True and a hard cap on Retry-After (≤ 60s, review C-1, see BoundedRetry).
503 surface (sync + async). requests.exceptions.RetryError (sync) maps to ServerError(503, ...) with __cause__ preserved. Review F-1 extends the symmetry to async — httpx.NetworkError / httpx.RemoteProtocolError (transient transport hiccups) now surface as ServerError(503) so the paginator's default retry_on_status=(502, 503, 504) catches them.
Environment overrides. SYNAPSE_BACKEND_V2_TIMEOUT_READ / _CONNECT override the v2 client's per-process timeouts (review F-4 — v2 scope only). Invalid or non-positive values fall back to the default with a warning log. The shared BaseClient / AsyncBaseClient do not read these env vars so v1 caller SLAs are preserved.

Background — what regressed

The v1 export path used BaseClient._list_all's _get_page_with_retry to absorb transient page failures. The SYN-6919 cut-over PR moved every export handler onto the v2 client's cursor paginator, which fetched each page exactly once. In staging this surfaced as:

Page 0 returns successfully, page N (N ≥ 1) receives 503 or a read timeout, the paginator raises immediately.
BackendV2Client._request then converts the underlying urllib3.MaxRetryError into ServerError(500, ...) because RetryError falls through the generic RequestException branch. The backend's 503 signal never reaches the handler.
The handler treats it as a hard failure and the entire export aborts — no partial progress, no recovery.

The operational mitigation (SYNAPSE_FORCE_V1_EXPORT=1) remains valid but forces every migrated caller back to v1, which defeats the purpose of the cut-over. SYN-7062 restores the page-level retry contract on the v2 path so the kill-switch can stay unset by default.

Backward compatibility

Every existing caller keeps its current behaviour without a single line of code change:

New paginator constructor keywords default to max_retries=0, throttle_seconds=0.0. The paginator behaves exactly as it did before this change when no opt-in is provided.
New v2 resource list() keywords default to disabled / zero. client.v2_client.tasks.list(project=42) returns the same CursorPage it did before.
_collect_then_bulk keeps its previous parameter contract for legacy callers (list_method / bulk_method / ids_per_batch / extract_id / throttle_seconds). The four short-lived list_* parameters added under Step 3 were removed in review C-2 — they had no behaviour; phase-1 retry/throttle is owned by the handler closure. A caller that defensively passed one of the list_* keys now sees TypeError instead of a silent no-op (no SDK-internal callsite ever used them).
The BackendV2Client timeout bump (15s → 30s) extends, not shortens, the failure ceiling. Callers that explicitly passed timeout={...} keep their value verbatim. To pin the previous 15s default, set SYNAPSE_BACKEND_V2_TIMEOUT_READ=15 or pass timeout={'connect': 5, 'read': 15}. v1 callers via the plain BaseClient continue to default to read=15 — the relaxation is intentionally scoped to the v2 path (review F-3).
The new urllib3 backoff_factor=2 only affects timing during a retry cascade. Calls that never trigger transport retries are unaffected.
RetryError → ServerError(503) is a refinement, not a regression. The previous ServerError(500) mapping was a side-effect of the generic RequestException branch swallowing it; callers that matched on status code 500 should now match on 503 (or treat both as transient).

If you encounter a behavioural change that is not covered by this section, please file a follow-up ticket.

Retry / throttle chain

A request on the v2 export path now flows through a layered retry / throttle chain. Each layer has its own opt-in surface; the caller picks the level that matches their concern.

Two retry budgets are layered:

Transport retries: handled inside urllib3 / the BaseClient for connect-level and idempotent 5xx retries with backoff_factor=2. Exhaustion surfaces as ServerError(503).
Page-level retries: handled inside _fetch_with_retry for ServerError with a status in retry_on_status, ClientTimeoutError, and ClientConnectionError. Exhaustion re-raises the last exception unchanged.

The two budgets are independent. A single page may consume the full urllib3 budget, return ServerError(503), and still be retried by the page-level loop. That layering reproduces the v1 _get_page_with_retry semantics on the v2 path.

Page-level retry timeline

The page-level loop sleeps backoff_factor * (2 ** attempt) seconds between attempts. With the recommended page_retries=3 / page_retry_backoff=2.0 the worst-case extra wall time per page is 2 + 4 + 8 = 14 seconds before the loop gives up.
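The arithmetic above can be checked in a few lines. This is a minimal sketch of the stated formula, not the SDK's code: `page_backoff_schedule` is a hypothetical helper name, and it assumes `attempt` counts from 0.

```python
from typing import List


def page_backoff_schedule(page_retries: int, page_retry_backoff: float) -> List[float]:
    """Sleep durations between attempts: backoff * 2**attempt, attempt from 0 (assumed)."""
    return [page_retry_backoff * (2 ** attempt) for attempt in range(page_retries)]


# Recommended export settings from this guide.
schedule = page_backoff_schedule(page_retries=3, page_retry_backoff=2.0)
worst_case_sleep = sum(schedule)
print(schedule, worst_case_sleep)
```

The total covers sleep time only; the time spent inside each failed fetch (including any transport-level retries) comes on top of it.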

If page N consumes all four attempts (initial + 3 retries) and still fails, the loop re-raises the last ServerError(503) unchanged so the handler can surface a meaningful failure code instead of Max retries exceeded.
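The loop described in this section can be sketched as follows. This is an illustration under stated assumptions, not the SDK's `_fetch_with_retry`: the three exception classes are local stand-ins for the SDK types named in this guide, `fetch_with_retry` and its `sleep` parameter are hypothetical, and the attempt counter is assumed to start at 0.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


# Local stand-ins for the SDK exception types (assumed shape: ServerError carries a status).
class ServerError(Exception):
    def __init__(self, status: int, message: str = "") -> None:
        super().__init__(status, message)
        self.status = status


class ClientTimeoutError(Exception):
    pass


class ClientConnectionError(Exception):
    pass


def fetch_with_retry(
    fetch_page: Callable[[], T],
    *,
    max_retries: int = 0,
    backoff_factor: float = 2.0,
    retry_on_status: Tuple[int, ...] = (502, 503, 504),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch_page up to 1 + max_retries times, sleeping between attempts."""
    attempt = 0
    while True:
        try:
            return fetch_page()
        except ServerError as exc:
            # Non-retryable status or exhausted budget: re-raise unchanged.
            if exc.status not in retry_on_status or attempt >= max_retries:
                raise
        except (ClientTimeoutError, ClientConnectionError):
            if attempt >= max_retries:
                raise
        sleep(backoff_factor * (2 ** attempt))
        attempt += 1


# Usage: a page that fails twice with 503, then succeeds.
calls = {"n": 0}


def flaky_page():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ServerError(503, "unavailable")
    return ["row"]


slept = []  # record sleeps instead of actually waiting
result = fetch_with_retry(flaky_page, max_retries=3, sleep=slept.append)
print(result, slept)
```

With `max_retries=0` the function makes a single attempt, which matches the opt-out default described under backward compatibility.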

Using the new options

Direct paginator construction

For callers that build a paginator by hand (analytics scripts, custom export pipelines):

from synapse_sdk.clients.backend_v2 import BackendV2Client
from synapse_sdk.clients.backend_v2.pagination import SyncCursorPaginator

client = BackendV2Client('https://api.test.synapse.sh',
                        access_token='syn_...', tenant='acme')

# Page 0 inline; subsequent pages share the same retry policy.
for row in client.v2.tasks.list(
    project=42,
    list_all=True,
    page_retries=3,
    page_retry_backoff=2.0,
    throttle_seconds=0.2,
):
    process(row)

The example passes three of the four pass-through keyword arguments; all four are forwarded to the paginator's constructor. retry_on_status defaults to (502, 503, 504); override it to add 429 or to drop 502 if your backend treats it as a permanent error.

`_collect_then_bulk` (export handler pattern)

Export handlers continue to use _collect_then_bulk. Phase-1 retry / throttle is delivered through the list_method closure, so the cursor list step inherits the same retry / throttle posture as the bulk-fetch sequence:

from synapse_sdk.plugins.actions._v2_switch import _collect_then_bulk

# Closure threads the three paginator pass-through keys to the resource
# list step. The handler's _paginator_pass_through() returns the same
# dict and is the SSOT for this pattern.
list_paginator_kwargs = self._paginator_pass_through()
# Resolves to {
#     'page_retries':        self.EXPORT_PAGE_RETRIES,        # default 3
#     'page_retry_backoff':  self.EXPORT_PAGE_RETRY_BACKOFF,  # default 2.0
#     'throttle_seconds':    self.EXPORT_THROTTLE_SECONDS,    # 0.1 / 0.2
# }

rows = _collect_then_bulk(
    list_method=lambda: v2.tasks.list(
        list_all=True, **slim_params, **list_paginator_kwargs,
    ),
    bulk_method=lambda ids: v2.tasks.bulk_fetch(ids),
    ids_per_batch=self.EXPORT_PAGE_SIZE,
    # Phase 2 (inter-bulk-fetch) throttle. Phase 1 throttle is delivered
    # via the closure into v2.tasks.list above.
    throttle_seconds=self.EXPORT_THROTTLE_SECONDS,
)

Plugin authors that build their own _collect_then_bulk callsites should follow the same closure pattern — drop throttle_seconds / page_retries / page_retry_backoff / retry_on_status keys from any caller-supplied slim_params first. The SDK ships a public helper for this — strip_paginator_reserved_keys in synapse_sdk.plugins.actions._v2_switch — applied across all five v2-migrated callsites (3 export handlers, dataset/action.py, to_task/steps/fetch_tasks.py; see SYN-7062 review C-3 / NEW-1). This avoids the double-spread TypeError and surfaces caller misuse via a WARNING log line.
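To illustrate why the reserved keys must be stripped before the closure spreads both dicts, here is a minimal sketch of the idea. It is not the SDK's `strip_paginator_reserved_keys`; the function name `strip_reserved_keys_sketch`, the constant, and the log message are hypothetical, and only the four key names come from this guide.

```python
import logging
from typing import Any, Dict, Mapping

logger = logging.getLogger("paginator_keys_sketch")

# The four pass-through keys named in this guide.
RESERVED_PAGINATOR_KEYS = frozenset(
    {"throttle_seconds", "page_retries", "page_retry_backoff", "retry_on_status"}
)


def strip_reserved_keys_sketch(params: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a copy of params without paginator-reserved keys, warning on misuse."""
    dropped = sorted(key for key in params if key in RESERVED_PAGINATOR_KEYS)
    if dropped:
        logger.warning("Ignoring paginator-reserved keys in caller params: %s", dropped)
    return {key: value for key, value in params.items() if key not in RESERVED_PAGINATOR_KEYS}


# A caller-supplied dict that (incorrectly) carries a reserved key.
slim_params = strip_reserved_keys_sketch({"project": 42, "page_retries": 9})
paginator_kwargs = {"page_retries": 3, "page_retry_backoff": 2.0, "throttle_seconds": 0.1}

# Without stripping, spreading both dicts would raise TypeError (duplicate keyword).
merged = dict(**slim_params, **paginator_kwargs)
print(merged)
```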

The four list_* parameters that previously appeared on _collect_then_bulk (list_retries / list_retry_backoff / list_throttle_seconds / list_retry_on_status) were removed under review C-2 — they were captured but never read, and the phase-1 channel is owned exclusively by the handler closure.

Async paginator

The async paginator carries the same options. Pass them through client.v2.tasks.list(...) exactly as in the sync example; the keyword names are identical.

async with AsyncBackendV2Client(...) as client:
    async for row in client.v2.tasks.list(
        project=42,
        list_all=True,
        page_retries=3,
        throttle_seconds=0.2,
    ):
        await sink(row)

The async loop uses asyncio.sleep for backoff so it cooperates with the surrounding event loop instead of blocking it.
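The cooperative behaviour can be seen in a small standalone demo. This sketch does not use the SDK: `backoff_then_fetch` and `ticker` are hypothetical coroutines that stand in for a paginator backing off and for any other task sharing the event loop.

```python
import asyncio
from typing import List, Tuple


async def backoff_then_fetch(delay: float) -> str:
    """Stand-in for a page fetch that backs off once before succeeding."""
    await asyncio.sleep(delay)  # yields control to the event loop while waiting
    return "page"


async def ticker(ticks: List[int], interval: float, stop: asyncio.Event) -> None:
    """Unrelated task on the same loop; it keeps running during the backoff."""
    while not stop.is_set():
        ticks.append(1)
        await asyncio.sleep(interval)


async def main() -> Tuple[str, int]:
    ticks: List[int] = []
    stop = asyncio.Event()
    ticker_task = asyncio.create_task(ticker(ticks, 0.01, stop))
    result = await backoff_then_fetch(0.05)
    stop.set()
    await ticker_task
    return result, len(ticks)


result, tick_count = asyncio.run(main())
print(result, tick_count)
```

Replacing `asyncio.sleep` in `backoff_then_fetch` with a blocking `time.sleep` would stall the ticker for the whole delay, which is the starvation pattern described under troubleshooting.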

Environment overrides

| Variable | Effect | Default |
| --- | --- | --- |
| `SYNAPSE_BACKEND_V2_TIMEOUT_READ` | Per-request read timeout (seconds) on `BackendV2Client` / `AsyncBackendV2Client`. Plain `BaseClient` ignores this (review F-4). | `30` |
| `SYNAPSE_BACKEND_V2_TIMEOUT_CONNECT` | Per-request connect timeout (seconds) on the v2 clients. Plain `BaseClient` ignores this. | `5` |
| `SYNAPSE_FORCE_V1_EXPORT` | Inherited from the v2 export migration. Forces every migrated caller back to the v1 path. Still functional and unaffected by SYN-7062. | unset |

Both timeout overrides parse as floats. Invalid input (non-numeric or ≤ 0) emits a warning via synapse_sdk.clients.utils and falls back to the default. Precedence (highest first):

Explicit timeout= argument to the BackendV2Client / AsyncBackendV2Client constructor.
Environment variable.
Built-in default.
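The precedence and fallback rules above can be sketched as a single resolver. This is an illustration, not the SDK's implementation: `resolve_timeout` is a hypothetical name, the logger name is made up, and rejecting NaN / infinity is an extra assumption beyond the "non-numeric or ≤ 0" rule stated in this guide.

```python
import logging
import math
import os
from typing import Optional

logger = logging.getLogger("timeout_sketch")


def resolve_timeout(explicit: Optional[float], env_name: str, default: float) -> float:
    """Resolve one timeout value: explicit argument > environment variable > default."""
    if explicit is not None:
        return float(explicit)  # explicit values are kept verbatim
    raw = os.environ.get(env_name)
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        value = None
    # Assumption: NaN and infinity are treated as invalid alongside non-positive values.
    if value is None or not math.isfinite(value) or value <= 0:
        logger.warning("Invalid %s=%r; falling back to %s", env_name, raw, default)
        return default
    return value


ENV_READ = "SYNAPSE_BACKEND_V2_TIMEOUT_READ"

os.environ[ENV_READ] = "45"
from_env = resolve_timeout(None, ENV_READ, 30.0)   # env value wins over the default
pinned = resolve_timeout(15, ENV_READ, 30.0)       # explicit argument wins over env

os.environ[ENV_READ] = "-1"
fallback = resolve_timeout(None, ENV_READ, 30.0)   # invalid value falls back with a warning
print(from_env, pinned, fallback)
```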

Use the environment channel for incident response (raise read timeout temporarily without redeploying) or local dev (lower the read timeout to surface staging slowness faster).

Recommended defaults

| Workload | `page_retries` | `page_retry_backoff` | `throttle_seconds` | Notes |
| --- | --- | --- | --- | --- |
| Small ad-hoc list (< 5 pages) | `0` | n/a | `0` | Opt-out keeps latency low. |
| Production export (Task / Assignment / GroundTruth) | `3` | `2.0` | `0.1`–`0.2` | Matches the v1 `EXPORT_*` defaults. Already wired in `handlers.py`. |
| Bulk analytics over an entire tenant | `3`–`5` | `2.0` | `0.2`–`0.5` | Larger throttle reduces backend load when the script runs unsupervised. |
| Streaming consumer (async) | `3` | `2.0` | `0.0` | Throttle only if you observe backend `503` storms; backoff covers transient failures by itself. |

These are starting points, not hard rules. Inspect the backend Retry-After header on production incidents — urllib3 now respects it, so the actual wall time may exceed the backoff_factor curve when the backend issues explicit cooldowns.
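To estimate how long a backend cooldown can stretch a retry, it helps to see how a Retry-After value is bounded. The sketch below is a standard-library illustration of the 60-second cap mentioned in the TL;DR, not the SDK's BoundedRetry: `bounded_retry_after` is a hypothetical helper, and it assumes the header is either delta-seconds or an HTTP-date.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def bounded_retry_after(
    header_value: str,
    cap_seconds: float = 60.0,
    now: Optional[datetime] = None,
) -> Optional[float]:
    """Parse a Retry-After header and clamp the wait to [0, cap_seconds].

    Returns None when the header cannot be parsed.
    """
    value = header_value.strip()
    if value.isascii() and value.isdigit():
        seconds = float(value)  # delta-seconds form, e.g. "120"
    else:
        try:
            when = parsedate_to_datetime(value)  # HTTP-date form
        except (TypeError, ValueError):
            return None
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        current = now if now is not None else datetime.now(timezone.utc)
        seconds = (when - current).total_seconds()
    return min(max(seconds, 0.0), cap_seconds)


capped = bounded_retry_after("120")   # longer than the cap, so clamped
short = bounded_retry_after("5")      # under the cap, passed through
print(capped, short)
```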

Troubleshooting

`ServerError(503, ...)` raised directly to my caller

This means urllib3 exhausted its transport retry budget for a single attempt. Inspect the __cause__ chain — it carries the original RetryError and underlying MaxRetryError so you can read the last response or connection failure. A direct 503 surface is usually a sign that:

The backend is genuinely overloaded (check operational dashboards).
The endpoint returned a Retry-After header that exceeded the client-side budget.
A network partition prevented all transport retries from succeeding.

If you want the page-level retry loop to absorb it, ensure page_retries (or max_retries if you build the paginator directly) is > 0.
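Walking the exception chain is straightforward in plain Python. The sketch below uses local stand-in classes rather than the real requests / urllib3 / SDK exceptions, and `cause_chain` is a hypothetical helper. It falls back to `__context__` when `__cause__` is not set, since intermediate libraries do not always use `raise ... from`.

```python
from typing import List


# Stand-ins for the exception types named in this guide (not the real classes).
class ServerError(Exception):
    def __init__(self, status: int, message: str = "") -> None:
        super().__init__(status, message)
        self.status = status


class RetryErrorStandIn(Exception):
    pass


class MaxRetryErrorStandIn(Exception):
    pass


def cause_chain(exc: BaseException) -> List[BaseException]:
    """Return exc followed by each underlying exception, outermost first."""
    chain: List[BaseException] = []
    current = exc
    while current is not None and current not in chain:
        chain.append(current)
        current = current.__cause__ or current.__context__
    return chain


def failing_request() -> None:
    """Simulate transport retry exhaustion surfacing as ServerError(503)."""
    try:
        try:
            raise MaxRetryErrorStandIn("too many 503 error responses")
        except MaxRetryErrorStandIn as inner:
            raise RetryErrorStandIn("retries exhausted") from inner
    except RetryErrorStandIn as exc:
        raise ServerError(503, "transport retries exhausted") from exc


try:
    failing_request()
except ServerError as err:
    names = [type(item).__name__ for item in cause_chain(err)]
    print(names)
```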

Page-level retry loop terminates on the first attempt

Check that retry_on_status covers the status code returned by the backend. Defaults are (502, 503, 504) — codes outside that set (for example 429) raise immediately. Pass a wider tuple if your deployment uses 429 for soft rate limiting.

Async retries appear to deadlock

The async paginator uses asyncio.sleep for backoff. If the surrounding event loop is starved (long synchronous CPU work, a run_until_complete blocked by a sync call) the sleep will appear to hang. Audit the calling task for synchronous bottlenecks.

Timeout default change broke a test fixture

The v2 client default read timeout moved from 15s to 30s. Set SYNAPSE_BACKEND_V2_TIMEOUT_READ=15 in the test environment or pass timeout={'connect': 5, 'read': 15} explicitly when constructing the client to pin the previous value.

Acceptance contract reference

The following spec documents under specs/syn-7062-v2-paginator-retry-carryover/ are the source of truth for the behaviour above:

requirements.md FR-1 ~ FR-6 — paginator retry, resource pass-through, _collect_then_bulk channel, timeout / backoff defaults, 503 surface, regression guards.
specs.md TS-1 ~ TS-6 — exact signatures, env precedence, urllib3 config diff, exception ladder change.
plans.md Step 1 ~ Step 7 — atomic commit topology (structure → behaviour → docs).

Related

v2 export migration — base cut-over PR. SYN-7062 is the follow-up that closes its page-level resilience gap.
synapse_sdk/clients/backend_v2/INVENTORY.md — full endpoint catalogue for the 23 resources that gained the 4-key pass-through.
synapse_sdk/clients/backend_v2/README.md — quick-start for building a paginator directly.
