Backend Concurrency, Connection-Pool, and Grid Performance Hardening (QA)

PR #81 · aisupport-datagain · QA · Sep 5, 2025 · 20:55 UTC

Executive summary

A backend stability and performance release has reached QA. It reworks how the API handles database concurrency and connection pooling, aligns default grid sorting with database indexes, and reduces per-request overhead from logging and monitoring. No user-facing features change; the goal is faster, more reliable list views and fewer connection-related errors under load.

Why this was needed

Under concurrent load, the API risked exhausting the database connection limit, and some list endpoints (mails, batches, documents, folders) used default sort columns that did not match existing indexes, producing slower queries. Hard-coded assumptions (a fixed 103-connection ceiling, always-on file logging, always-on performance monitoring, and pool_pre_ping enabled even behind PgBouncer) added avoidable overhead and made the service harder to tune per environment.

Client / user impact

  • List/grid views for mails, inbox, batches, documents, document types, folders, and organizations should load faster because default sorts now use indexed columns.
  • Fewer connection-pool exhaustion errors under heavy concurrent usage, with limits now configurable per environment.
  • Lower request latency and less log noise, because request logging and performance monitoring are now opt-in outside development.
  • Permission and user-context checks are de-duplicated within a single request, reducing redundant database queries.
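
The request-scoped de-duplication described above can be sketched roughly as follows. This is a minimal illustration, not the contents of app/utils/request_cache.py: the names reset_request_cache, request_cached, and load_user_context are hypothetical, and the use of contextvars is an assumption about how a per-request store might be isolated.

```python
import contextvars

# Hypothetical per-request store; the real module may be organized differently.
_request_cache = contextvars.ContextVar("request_cache")


def reset_request_cache():
    """Start with an empty cache; a middleware would call this once per request."""
    _request_cache.set({})


def request_cached(key, loader):
    """Return the value for key, calling loader at most once per request."""
    try:
        cache = _request_cache.get()
    except LookupError:
        # No request scope is active, so fall back to an uncached call.
        return loader()
    if key not in cache:
        cache[key] = loader()
    return cache[key]


# Usage: the second lookup within the same request reuses the first result.
calls = []


def load_user_context():
    calls.append("query")  # stands in for a database round trip
    return {"user_id": 7}


reset_request_cache()
first = request_cached(("user_context", 7), load_user_context)
second = request_cached(("user_context", 7), load_user_context)
```

Including the user ID in the key keeps one user's cached context from being returned for another if a single request ever resolves more than one user.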

Technical scope

Grounded in the diff (+1579 / -5398 across 44 files):

  • New app/services/db_concurrency.py: helpers (dms_execute, identity_execute, with_new_dms_session, etc.) so parallel batch loaders run on separate sessions instead of sharing one.
  • New app/utils/request_cache.py + middleware reset in main.py: request-scoped memoization for get_user_context, get_entity_type, accessible-entity IDs, and permission sets in dynamic_permission_service_optimized.py.
  • grid_columns.py: index-friendly default sorts per module (e.g. mails/batches/documents created_on desc, folders/document types name asc); strips self-referencing filters in column-value discovery.
  • database.py / pool_monitor.py: configurable DB_MAX_CONNECTIONS/DB_RESERVED_CONNECTIONS (removes hard-coded 103), disables pool_pre_ping when USE_PGBOUNCER.
  • main.py / logging_config.py / monitoring.py: request logging, file logging, and performance middleware/monitoring are now env-gated (default off in production); metric persistence is sampled/slow-only; blocking psutil and Celery health checks offloaded to threads.
  • Cache-key fixes in batch, audit, and permissions services to avoid cross-user/global key collisions.
  • Deletes 12 internal docs/*.md planning files (~5,200 lines); minor schema fix (assigned_by int) and an added requirements.txt dependency.
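
To illustrate the separate-session pattern from the first bullet, the sketch below runs two loaders in parallel, each on its own session. It does not reproduce the signatures of dms_execute or with_new_dms_session, which are not shown in this summary; FakeSession, new_session, run_on_new_session, and the two loaders are hypothetical stand-ins so the example runs without a database.

```python
import asyncio
from contextlib import asynccontextmanager


class FakeSession:
    """Stand-in for a database session (hypothetical, not a real driver object)."""

    _next_id = 0

    def __init__(self):
        FakeSession._next_id += 1
        self.id = FakeSession._next_id
        self.closed = False

    async def close(self):
        self.closed = True


@asynccontextmanager
async def new_session():
    """Open a fresh session and always close it when the caller is done."""
    session = FakeSession()
    try:
        yield session
    finally:
        await session.close()


async def run_on_new_session(loader):
    """Run one loader on its own session rather than a shared one."""
    async with new_session() as session:
        return await loader(session)


async def load_counts(session):
    await asyncio.sleep(0)  # stands in for an awaited query
    return ("counts", session.id)


async def load_owners(session):
    await asyncio.sleep(0)
    return ("owners", session.id)


async def load_batch_page():
    # Each parallel loader gets its own session, so no session is used
    # by two tasks at the same time.
    return await asyncio.gather(
        run_on_new_session(load_counts),
        run_on_new_session(load_owners),
    )


results = asyncio.run(load_batch_page())
```

The trade-off is that each parallel loader holds its own connection while it runs, which is one reason the pool limits in database.py matter for this change.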

Risk & mitigation

Medium. The DB-session and connection-pool changes touch hot paths shared by many endpoints. Disabling pool_pre_ping behind PgBouncer, combined with the new env defaults, could surface stale-connection errors or other behavior differences if environment variables are misconfigured. Mitigation: changes are largely additive and env-gated; default sort changes are deterministic per module; verify QA environment variables (DB_MAX_CONNECTIONS, USE_PGBOUNCER, monitoring/logging toggles) before promoting.
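
A pre-promotion check of the connection settings might look like the sketch below. The variable names DB_MAX_CONNECTIONS, DB_RESERVED_CONNECTIONS, and USE_PGBOUNCER come from the diff summary; the default values, the subtraction used to derive usable connections, the accepted boolean spellings, and the function name pool_settings are all assumptions for illustration and may not match database.py.

```python
import os


def _as_bool(value):
    """Interpret common truthy spellings; the accepted set is an assumption."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def pool_settings(env=None):
    """Derive pool settings from environment variables and fail fast on bad values."""
    env = os.environ if env is None else env
    # Defaults below are placeholders, not the project's actual defaults.
    max_connections = int(env.get("DB_MAX_CONNECTIONS", "100"))
    reserved = int(env.get("DB_RESERVED_CONNECTIONS", "3"))
    if reserved < 0 or max_connections <= reserved:
        raise ValueError(
            "DB_MAX_CONNECTIONS must be greater than DB_RESERVED_CONNECTIONS, "
            "and DB_RESERVED_CONNECTIONS must not be negative"
        )
    use_pgbouncer = _as_bool(env.get("USE_PGBOUNCER", "false"))
    return {
        # Assumed relationship: reserved connections are held back from the pool.
        "usable_connections": max_connections - reserved,
        # Per the diff summary, pre-ping is disabled when PgBouncer is in use.
        "pool_pre_ping": not use_pgbouncer,
    }


qa_settings = pool_settings(
    {
        "DB_MAX_CONNECTIONS": "50",
        "DB_RESERVED_CONNECTIONS": "5",
        "USE_PGBOUNCER": "true",
    }
)
```

Passing an explicit mapping instead of reading os.environ directly makes the same check usable in a test or a deployment script.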

QA validation focus

  • Open grid/list views for mails, inbox, batches, documents, document types, folders, and organizations; confirm correct default ordering and improved load times.
  • Drive concurrent batch/list traffic and watch for connection-pool errors or stalls; confirm pool metrics/alerts report sensible limits.
  • Confirm permission-gated flows (admin hub access, tab access, system folders) still resolve correctly with request-scoped caching.
  • Verify column-value/filter dropdowns still populate after the self-filter change.
  • Confirm startup health logs and Celery worker detection still work with monitoring/logging defaults in QA.
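
For the default-ordering check in the first bullet, a small helper along these lines could compare a returned page against the expected sort. The column and direction pairs are the examples given under Technical scope; the module keys, the row shape, and the function name are hypothetical, and modules whose defaults are not stated in this summary (inbox, organizations) are deliberately left out.

```python
# Expected defaults taken from the examples under Technical scope.
# Module keys are hypothetical and may not match grid_columns.py.
DEFAULT_SORTS = {
    "mails": ("created_on", "desc"),
    "batches": ("created_on", "desc"),
    "documents": ("created_on", "desc"),
    "folders": ("name", "asc"),
    "document_types": ("name", "asc"),
}


def is_sorted_by_default(module, rows):
    """Return True if rows are ordered by the module's expected default sort."""
    column, direction = DEFAULT_SORTS[module]
    values = [row[column] for row in rows]
    return values == sorted(values, reverse=(direction == "desc"))


# ISO-8601 timestamps compare correctly as plain strings.
mails_page = [
    {"created_on": "2025-09-05T20:55:00"},
    {"created_on": "2025-09-04T08:10:00"},
]
folders_page = [{"name": "Archive"}, {"name": "Invoices"}]

mails_ok = is_sorted_by_default("mails", mails_page)
folders_ok = is_sorted_by_default("folders", folders_page)
```

This only checks ordering on one page of results; it says nothing about whether the query actually used an index, which still needs a query plan or timing comparison.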