QA · Backend

Stability hardening: persist privileged-access audit logs and stop intermittent 500s on cached read endpoints under load

PR #197 · StrangeNoob · Jun 18, 2026 · 10:34 UTC
QA · Jun 18, 2026

Executive summary

Reaching QA (testing). This release fixes three reliability and compliance issues surfaced by a 500-user / 10-minute stress test. Audit records for elevated (global-scope) access to sensitive data are now saved reliably instead of being silently dropped, and read endpoints that serve cached results no longer fail intermittently with HTTP 500 errors when traffic is high.

Why this was needed

The QA stress test exposed three latent defects. First, the high-stakes audit event written when a user is granted GLOBAL_SCOPE access to sensitive data on certain read surfaces (grid, column-values, etc.) was never committed, so the compliance record was rolled back and silently lost on every such request. Second, the cached-results layer returns a typed object on a cache miss but a plain dictionary on a cache hit; the /admin/api/rules handler accessed an attribute on that result and crashed once the cache warmed up (39 of 40 concurrent requests returned 500). Third, this dictionary-on-hit mismatch was a systemic risk affecting any cached endpoint, not just /rules.
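The second defect can be reproduced in a few lines. The following is a minimal sketch, not the project's code: `RuleList`, `_cache`, and this `list_rules` are hypothetical stand-ins, and the cache is assumed to store JSON, which is one common way a typed object ends up coming back as a plain dictionary.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RuleList:
    """Hypothetical stand-in for a typed response object."""
    total: int
    items: list


# Hypothetical stand-in for the shared cache; values are stored as JSON text.
_cache: dict = {}


def list_rules():
    """Return a typed object on a miss, but whatever the cache holds on a hit."""
    if "rules" in _cache:
        return json.loads(_cache["rules"])   # cache hit: plain dict
    result = RuleList(total=2, items=["a", "b"])
    _cache["rules"] = json.dumps(asdict(result))
    return result                            # cache miss: typed object


first = list_rules()    # miss, attribute access works
second = list_rules()   # hit, attribute access fails

print(type(first).__name__, first.total)
try:
    second.total
except AttributeError as exc:
    print("cache hit failed:", exc)
```

Only the first caller sees the typed object, which matches the observed pattern where requests start failing once the cache has warmed up.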

Client / user impact

  • Privileged-access (GLOBAL_SCOPE) audit trail is now durable, satisfying security and compliance expectations that elevated access to sensitive data is always recorded.
  • Admin Rules listing (/admin/api/rules) is stable under concurrency: 40/40 concurrent requests now succeed (was 39/40 failing with 500).
  • The same crash class is closed across all cached endpoints, so other read APIs that return typed responses from cache no longer risk intermittent 500s as traffic scales.
  • A secondary benefit: per-request audit de-duplication reduces redundant write-session churn on affected read calls.

Technical scope

  • app/services/auth/tenant_scope_policy.py: added await write_db.commit() in _maybe_audit_global_scope so the SENSITIVE_DATA_ACCESS audit row persists (previously the row was only flushed and then rolled back). Added per-request de-duplication keyed on (user_id, surface, reason) via the request cache so one request audits a given elevation at most once. The de-dup marker is set only when a real row id is returned, so a transient logging failure stays retryable rather than silently suppressing the audit.
  • app/admin_portal/routers/admin_rules.py: contained fix (B1) coercing a cached dict back to RuleListResponse via model_validate before attribute access.
  • app/utils/cache/manager.py: systemic fix (B2) generalizing the coercion into the cache_result decorator. New _rehydrate_cached_value / _cached_return_model helpers read the wrapped function's declared Pydantic return type and re-hydrate a cached dict on hit; applied to all cache-hit return paths (request cache, local L1, stale-while-revalidate stale/fresh, non-SWR, single-flight poll, post-lock recheck). Handles Optional/PEP 604 unions and forward refs, and fails open (returns the raw dict, never raises) on schema drift. Lint follow-up swapped @lru_cache(maxsize=None) to @functools.cache.
  • Tests added: tests/unit/test_cache_rehydration.py, tests/unit/test_admin_rules_cache_rehydration.py, tests/unit/services/auth/test_global_scope_audit_dedup.py.
  • Two design docs added under docs/superpowers/specs/ (B1 spec and a Phase 2 session-chattiness design). Phase 2 is design-only with no code in this PR. No database migrations and no IAM/permission changes.

Risk & mitigation

Moderate, well-mitigated. The audit change touches the tenant-scope authority (a security-sensitive area) but only adds a commit on a dedicated write session and a read-only de-dup guard, leaving scope decisions and read-only sessions unchanged. The cache re-hydration is defensive and fails open, so schema drift degrades to the prior raw-dict behavior rather than raising new errors. Risk is concentrated in the broad blast radius of the shared cache decorator (every cached function) and the auth path; mitigated by regression unit tests and the existing tenant-scope lint passing with zero violations. Base branch is QA (cut from QA, deploying to the qa namespace), not production.

QA validation focus

  • Confirm a SENSITIVE_DATA_ACCESS audit row is written when a global-scope user hits an audited read surface (grid, column-values), and that the same elevation within one request produces exactly one audit row.
  • Verify a transient audit-write failure does not suppress later audits in the same request (no silent gaps).
  • Hammer /admin/api/rules and /api/v1/rules with concurrent requests (cache warmed); expect all 200s with correct pagination/counts, no AttributeError/500.
  • Smoke-test other cached read endpoints returning typed responses to confirm no regressions from the decorator-level re-hydration.
  • Confirm tenant isolation is unaffected: no scope widening and no cross-tenant data exposure on affected read paths.
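The first two checks above (exactly one audit row per request, and no suppression after a transient failure) can be expressed as a small model of the de-dup logic. This is a sketch under assumptions: `FakeWriteSession`, its `log_audit` method, and the free function `maybe_audit_global_scope` are hypothetical stand-ins, and the request cache is modeled as a plain dict. The real _maybe_audit_global_scope may differ in signature and detail.

```python
import asyncio


class FakeWriteSession:
    """Hypothetical stand-in for the dedicated write session."""

    def __init__(self, fail_first: bool = False):
        self.pending, self.committed = [], []
        self._fail_next = fail_first

    async def log_audit(self, event: dict):
        """Return a row id on success, or None on a transient failure."""
        if self._fail_next:
            self._fail_next = False
            return None
        self.pending.append(event)
        return len(self.committed) + len(self.pending)

    async def commit(self):
        self.committed.extend(self.pending)
        self.pending.clear()


async def maybe_audit_global_scope(write_db, request_cache, user_id, surface, reason):
    """Audit one elevation at most once per request; stay retryable on failure."""
    key = ("global_scope_audit", user_id, surface, reason)
    if request_cache.get(key):
        return False  # already audited in this request
    row_id = await write_db.log_audit(
        {"type": "SENSITIVE_DATA_ACCESS", "user_id": user_id,
         "surface": surface, "reason": reason}
    )
    if row_id is None:
        return False  # no marker set, so a later call can retry
    await write_db.commit()  # without the commit the row would not persist
    request_cache[key] = True
    return True


async def demo():
    # Same elevation three times within one request: expect one committed row.
    db, cache = FakeWriteSession(), {}
    for _ in range(3):
        await maybe_audit_global_scope(db, cache, 7, "grid", "GLOBAL_SCOPE")

    # First write fails transiently: the second attempt must still be audited.
    flaky, flaky_cache = FakeWriteSession(fail_first=True), {}
    for _ in range(2):
        await maybe_audit_global_scope(flaky, flaky_cache, 7, "grid", "GLOBAL_SCOPE")

    return len(db.committed), len(flaky.committed)


one_per_request, retried = asyncio.run(demo())
print(one_per_request, retried)
```

Both counts should be 1: repeated elevations collapse to a single row, and a failed first write does not leave a marker that hides the retry.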