All updates
QABackend

Backend Hardening: Search Indexing, DB/Redis Resilience, Security Auditing, and Internal Dashboard Removal (QA)

PR #95aisupport-datagainSep 19, 2025 · 20:09 UTC
QASep 19, 2025

Executive summary

This QA promotion is an infrastructure, performance, and security cleanup batch rather than a user-facing feature drop. It adds database search indexing and full-text triggers for faster mail/label/upload lookups, hardens database and Redis connection handling, introduces a graded security-audit service, and removes a legacy internal sync dashboard (including credential files that had been committed to the repo).

Why this was needed

Search and list queries on mail, labels, upload centers, and notifications were unindexed in places, and database/Redis connections lacked consistent timeouts and resilience, risking slow queries and stuck transactions under load. Separately, an unused internal sync_dashboard module remained in the codebase along with committed OAuth and Google service-account credential files, which is both dead weight and a security exposure that needed to be removed.

Client / user impact

  • Faster, more responsive list and search experiences (mail labels, upload centers, status lookups, notifications) once indexes/migrations are applied.
  • More stable backend behavior under concurrency: bounded statement, idle-in-transaction, and lock timeouts reduce stuck connections and long lock waits.
  • Stronger security posture: graded security-event logging/alerting (failed logins, unauthorized access, brute-force) and removal of credentials that were checked into source control.
  • No intended change to client-facing app behavior; this is groundwork shipping to QA.

Technical scope

  • Search performance migrations: new Alembic revisions add FTS search_vector triggers for MailLabels and UploadCenter, plus trigram/composite indexes for Status_lookup, UploadCenter, MailLabels.name, and user_notifications (created CONCURRENTLY, with pg_trgm enabled best-effort).
  • DB tuning & view fix: a migration sets DB-level GUC defaults (idle-in-transaction timeout, parallel workers) and adds a partial MetaEntities index; another rewrites the v_mails_list_permitted view using NULLIF(...)::type guards so unset session settings no longer cause invalid casts (view is only replaced if it already exists).
  • Connection resilience: app/core/database.py reworked for SQLAlchemy 2.x async + psycopg3 with keepalives and per-session statement/idle/lock timeouts, and unifies on DATABASE_URL (with rollout fallback alias). app/core/redis.py expanded with an Upstash REST client path and leader-elected startup diagnostics.
  • Security auditing: new SecurityAuditService with severity levels and threshold/cooldown alert configs for events like login failures, unauthorized access, and brute-force attempts.
  • Cleanup: entire app/sync_dashboard/ module deleted (~9,700 lines), including committed oauth_client_secret.json and a Google service-account key JSON, plus cached schema/log data; app/main.py rewired accordingly.
  • Broad, mostly mechanical normalization across many app/api/v1/* routers and a net simplification of identity_cache_service.py. Dockerfile/Makefile and docs (CLAUDE.md, README.Docker.md) also updated.
  • Note: PR has no description and generic commit messages; scope above is derived from the actual diff and changed files.

Risk & mitigation

Medium. The main risks are migration-related: CONCURRENTLY index builds and DB-level GUC/view changes touch shared database state and should be applied during a low-traffic window with backups. The database/Redis connection rewrite and DATABASE_URL unification could surface environment/config mismatches. Mitigations: migrations are guarded (skip on insufficient privileges, don't create the view if absent, best-effort extension creation) and downgrades are provided; deleted credentials should be treated as compromised and rotated regardless of removal. Validate timeout env vars per environment before promoting beyond QA.

QA validation focus

  • Run Alembic migrations on a QA-representative dataset; confirm indexes/triggers/view all apply cleanly and verify rollback (downgrade) works.
  • Exercise mail, label, upload-center, status, and notification list/search endpoints; confirm correct results and improved or stable latency.
  • Verify v_mails_list_permitted-backed mail listings still respect permissions and return no cast errors when session settings are unset.
  • Smoke-test app startup, DB connectivity, and Redis under the new connection settings; confirm no stuck/idle transactions appear under load.
  • Confirm the removed sync dashboard routes are gone (404) and the app boots without them.
  • Trigger security-audit scenarios (repeated login failures, unauthorized access) and confirm events/alerts are recorded as expected.
  • Confirm the removed credential files are no longer reachable and that the corresponding secrets have been rotated.