Stability hardening: persist privileged-access audit logs and stop intermittent 500s on cached read endpoints under load
Executive summary
Reaching QA (testing). This release fixes three reliability and compliance issues surfaced by a 500-user / 10-minute stress test. Audit records for elevated (global-scope) access to sensitive data are now saved reliably instead of being silently dropped, and read endpoints that serve cached results no longer fail intermittently with HTTP 500 errors when traffic is high.
Why this was needed
The QA stress test exposed three latent defects. First, the high-stakes audit event written when a user is granted GLOBAL_SCOPE access to sensitive data on certain read surfaces (grid, column-values, etc.) was never committed, so the compliance record was rolled back and silently lost on every such request. Second, the cached-results layer returns a typed object on a cache miss but a plain dictionary on a cache hit; the /admin/api/rules handler accessed an attribute on that result and crashed once the cache warmed up (39 of 40 concurrent requests returned 500). Third, this dictionary-on-hit mismatch was a systemic risk affecting any cached endpoint, not just /rules.
Client / user impact
- Privileged-access (GLOBAL_SCOPE) audit trail is now durable, satisfying security and compliance expectations that elevated access to sensitive data is always recorded.
- Admin Rules listing (/admin/api/rules) is stable under concurrency: 40/40 concurrent requests now succeed (was 39/40 failing with 500).
- The same crash class is closed across all cached endpoints, so other read APIs that return typed responses from cache no longer risk intermittent 500s as traffic scales.
- A secondary benefit: per-request audit de-duplication reduces redundant write-session churn on affected read calls.
Technical scope
- app/services/auth/tenant_scope_policy.py: added await write_db.commit() in _maybe_audit_global_scope so the SENSITIVE_DATA_ACCESS audit row persists (it previously only flushed and rolled back). Added per-request de-duplication keyed on (user_id, surface, reason) via the request cache so one request audits a given elevation at most once. The de-dup marker is set only when a real row id is returned, so a transient logging failure stays retryable rather than silently suppressing the audit.
- app/admin_portal/routers/admin_rules.py: contained fix (B1) coercing a cached dict back to RuleListResponse via model_validate before attribute access.
- app/utils/cache/manager.py: systemic fix (B2) generalizing the coercion into the cache_result decorator. New _rehydrate_cached_value / _cached_return_model helpers read the wrapped function's declared Pydantic return type and re-hydrate a cached dict on hit; applied to all cache-hit return paths (request cache, local L1, stale-while-revalidate stale/fresh, non-SWR, single-flight poll, post-lock recheck). Handles Optional/PEP 604 unions and forward refs, and fails open (returns the raw dict, never raises) on schema drift. Lint follow-up swapped @lru_cache(maxsize=None) to @functools.cache.
- Tests added: tests/unit/test_cache_rehydration.py, tests/unit/test_admin_rules_cache_rehydration.py, tests/unit/services/auth/test_global_scope_audit_dedup.py.
- Two design docs added under docs/superpowers/specs/ (B1 spec and a Phase 2 session-chattiness design). Phase 2 is design-only with no code in this PR. No database migrations and no IAM/permission changes.
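The audit flow in the first item above (commit, per-request de-dup, marker set only on a real row id) can be sketched as follows. This is not the code in tenant_scope_policy.py: FakeWriteSession, add_audit_row, the dict-based request cache and the function signature are all assumptions made so the sketch is self-contained.

```python
import asyncio
from typing import Optional


class FakeWriteSession:
    """Hypothetical stand-in for the dedicated write session (write_db)."""

    def __init__(self, fail_first: bool = False):
        self.pending: list = []
        self.committed: list = []
        self._fail_next = fail_first

    async def add_audit_row(self, event: str, key: tuple) -> Optional[int]:
        if self._fail_next:  # simulate one transient logging failure
            self._fail_next = False
            return None
        self.pending.append((event, key))
        return len(self.committed) + len(self.pending)

    async def commit(self) -> None:
        self.committed.extend(self.pending)
        self.pending.clear()


async def maybe_audit_global_scope(write_db, request_cache: dict,
                                   user_id: int, surface: str, reason: str) -> bool:
    """Audit a given elevation at most once per request; True if a row was written."""
    key = ("global_scope_audited", user_id, surface, reason)
    if request_cache.get(key):
        return False  # already audited earlier in this request
    row_id = await write_db.add_audit_row("SENSITIVE_DATA_ACCESS", key[1:])
    if row_id is None:
        return False  # no marker is set, so a later call can retry
    await write_db.commit()  # without the commit the row would be rolled back
    request_cache[key] = True
    return True
```

Because the marker is written only after a row id comes back, a failed first attempt leaves the request cache untouched and the next call in the same request still produces the audit row.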
Risk & mitigation
Moderate, well-mitigated. The audit change touches the tenant-scope authority (a security-sensitive area) but only adds a commit on a dedicated write session and a read-only de-dup guard, leaving scope decisions and read-only sessions unchanged. The cache re-hydration is defensive and fails open, so schema drift degrades to the prior raw-dict behavior rather than raising new errors. Risk is concentrated in the broad blast radius of the shared cache decorator (every cached function) and the auth path; mitigated by regression unit tests and the existing tenant-scope lint passing with zero violations. Base branch is QA (cut from QA, deploying to the qa namespace), not production.
QA validation focus
- Confirm a SENSITIVE_DATA_ACCESS audit row is written when a global-scope user hits an audited read surface (grid, column-values), and that the same elevation within one request produces exactly one audit row.
- Verify a transient audit-write failure does not suppress later audits in the same request (no silent gaps).
- Hammer /admin/api/rules and /api/v1/rules with concurrent requests (cache warmed); expect all 200s with correct pagination/counts, no AttributeError/500.
- Smoke-test other cached read endpoints returning typed responses to confirm no regressions from the decorator-level re-hydration.
- Confirm tenant isolation is unaffected: no scope widening and no cross-tenant data exposure on affected read paths.
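For the concurrency check, a small harness along these lines may help. It is a sketch only: hammer and fake_rules_endpoint are hypothetical names, and in a real run the call argument would wrap an HTTP request against the QA host with whatever client the team already uses, which is not shown here.

```python
import asyncio
from collections import Counter
from typing import Awaitable, Callable


async def hammer(call: Callable[[], Awaitable[int]], concurrency: int = 40) -> Counter:
    """Fire `concurrency` calls at once and tally the status codes returned."""

    async def one() -> int:
        try:
            return await call()
        except Exception:
            return 500  # count an unhandled error as a 500

    statuses = await asyncio.gather(*(one() for _ in range(concurrency)))
    return Counter(statuses)


async def fake_rules_endpoint() -> int:
    """Hypothetical stand-in for a GET against the rules endpoint."""
    await asyncio.sleep(0)  # yield so the calls interleave
    return 200
```

Running the harness once to warm the cache and then again for the measured pass keeps the check on the cache-hit path, which is where the original crash occurred.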