QA · Backend

Backend Scalability: Redis Consolidation, Dynamic Worker Autoscaling, and Faster Document Filtering

PR #73 · aisupport-datagain · Sep 4, 2025 · 11:05 UTC
QA · Sep 4, 2025

Executive summary

This release promotes a batch of backend infrastructure improvements to QA, centered on making document and mail processing scale more smoothly under load. It consolidates the platform onto a single high-performance Redis connection, adds automatic scaling of background workers, and speeds up document grid filtering. There are no user-facing screen changes; the impact is performance, stability, and operational control.

Why this was needed

The backend previously supported two parallel Redis connection paths (a native TCP path and a slower Upstash HTTP/REST path), which added complexity and a known performance penalty. Background worker capacity was also fixed, so processing could fall behind during upload/processing spikes while capacity sat unused when idle. In addition, the mail-number filter was applied as a post-filter after results had been fetched rather than in the database query, and grid cache lookups could miss because module names were used case-sensitively in cache keys.
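To illustrate the cache-miss problem, here is a minimal sketch of case-insensitive cache keys. The function name, key format, and user_id parameter are assumptions for illustration only; the actual key layout in grid_columns.py is not shown in this summary.

```python
def grid_cache_key(module_name: str, user_id: int) -> str:
    """Build a grid cache key that does not depend on module-name casing.

    Hypothetical key layout: "grid_columns:<module>:<user_id>".
    """
    normalized = module_name.lower()  # "Documents" and "documents" share one key
    return f"grid_columns:{normalized}:{user_id}"


# A plain dict stands in for Redis in this sketch.
cache: dict[str, list[str]] = {}
cache[grid_cache_key("Documents", 7)] = ["mail_number", "title"]

# A caller that spells the module name differently still hits the same entry.
cached_columns = cache.get(grid_cache_key("documents", 7))
```

Without the lower() call, the two spellings would produce different keys and the second lookup would miss.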

Client / user impact

  • More consistent performance and fewer cache misses for document grids and identity/auth lookups.
  • Background jobs (uploads, processing) can scale worker capacity up automatically when a backlog builds and scale down when idle, without interrupting in-flight tasks.
  • Filtering documents by mail number runs more efficiently at the database level.
  • System administrators gain operational endpoints to inspect and control worker pools and to drain queues safely before maintenance.
  • No visible UI changes; existing screens and API consumers behave the same.

Technical scope

  • Redis consolidation: removed the Upstash HTTP/REST client (app/utils/upstash_redis_client.py, ~410 lines) and all use_upstash branching in app/core/redis.py and identity_cache_service.py; the app now requires a single TCP/TLS REDIS_URL and adds tuning env vars (max connections, timeouts, health-check interval, client name).
  • Celery worker autoscaling: new app/vendor_portal/services/autoscale/celery_autoscale_service.py polls active/reserved/scheduled task counts and CPU, growing/shrinking the pool with hysteresis and cooldowns; started from app lifespan in app/main.py behind CELERY_AUTOSCALE_ENABLED.
  • Admin control API: new app/sync_dashboard/routers/celery_admin.py exposes /admin/celery/{inspect,autoscale,grow,shrink,drain_shutdown}, all guarded by a new require_system_admin RBAC dependency in app/api/dependencies/auth.py.
  • Concurrency auto-tuning: celery_config.py and start.py compute worker counts from cgroup CPU/memory limits when concurrency is set to auto.
  • Document filtering: mail_number moved from post-filter to DB-level pushdown in document_column_filter.py and document_query_service_optimized.py.
  • Grid caching: module names lowercased for stable cache keys in grid_columns.py.
  • API responses: document_type_columns.py standardized onto the shared APIResponse envelope.
  • CI/build: GitHub Actions trigger and ECR repo switched from Dev to QA; Docker build switched to Dockerfile; .env.example removed and .gitignore tightened.
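The autoscaling behavior described above (grow on backlog, shrink when idle, with hysteresis and cooldowns) can be sketched as a pure decision function. This is an illustration under stated assumptions, not the code in celery_autoscale_service.py: the AutoscalePolicy class, the decide_pool_change function, and every threshold value are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoscalePolicy:
    """Hypothetical tuning knobs; the real service's settings are not shown here."""
    min_workers: int = 2
    max_workers: int = 8
    scale_up_backlog: int = 20      # grow when backlog reaches this
    scale_down_backlog: int = 2     # shrink when backlog falls to this
    max_cpu_percent: float = 85.0   # never grow while CPU is at or above this
    cooldown_seconds: float = 60.0  # minimum gap between pool changes


def decide_pool_change(
    policy: AutoscalePolicy,
    current_workers: int,
    backlog: int,
    cpu_percent: float,
    seconds_since_last_change: float,
) -> int:
    """Return +1 to grow the pool, -1 to shrink it, or 0 to hold.

    `backlog` is assumed to be active + reserved + scheduled task counts.
    """
    # Cooldown: ignore all signals shortly after the previous change.
    if seconds_since_last_change < policy.cooldown_seconds:
        return 0
    # Grow only when there is a real backlog, CPU headroom, and room in the pool.
    if (
        backlog >= policy.scale_up_backlog
        and cpu_percent < policy.max_cpu_percent
        and current_workers < policy.max_workers
    ):
        return 1
    # Shrink only when nearly idle and above the floor.
    if backlog <= policy.scale_down_backlog and current_workers > policy.min_workers:
        return -1
    # Hysteresis band: backlog between the two thresholds changes nothing.
    return 0
```

The gap between scale_down_backlog and scale_up_backlog is what provides hysteresis: a backlog hovering around a single threshold would otherwise cause the pool to grow and shrink repeatedly.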

Risk & mitigation

Medium. Redis is now mandatory and single-path: if REDIS_URL is misconfigured, startup fails fast rather than degrading, so the QA environment must have valid TCP/TLS Redis credentials. The Celery autoscaler and the automatic concurrency tuning both change runtime worker behavior; both are env-gated and default conservatively, and pool shrink/drain are graceful (no task termination). Mitigation: autoscaling can be disabled via CELERY_AUTOSCALE_ENABLED=false, concurrency can be pinned to fixed integers, and admin endpoints are restricted to System Administrators. Note: a .env file containing live-looking Redis/broker URLs is committed in this diff; those credentials should be rotated and the file secured.
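As a rough illustration of the fail-fast behavior, the following sketch validates REDIS_URL before anything tries to connect. The RedisConfigError class and both function names are hypothetical, and the accepted schemes (redis:// for TCP, rediss:// for TLS) are an assumption about what "TCP/TLS" means here; the actual checks in app/core/redis.py may differ.

```python
import os
from typing import Optional
from urllib.parse import urlparse


class RedisConfigError(RuntimeError):
    """Raised at startup when REDIS_URL is missing or unusable (hypothetical)."""


def validate_redis_url(url: Optional[str]) -> str:
    """Return the URL unchanged if it looks like a TCP/TLS Redis URL, else raise."""
    if not url:
        raise RedisConfigError("REDIS_URL is not set")
    parsed = urlparse(url)
    # An HTTP/REST endpoint (the removed Upstash path) is rejected here.
    if parsed.scheme not in ("redis", "rediss"):
        # Report only the scheme so credentials in the URL are never logged.
        raise RedisConfigError(f"unsupported REDIS_URL scheme: {parsed.scheme!r}")
    if not parsed.hostname:
        raise RedisConfigError("REDIS_URL has no host")
    return url


def load_redis_url() -> str:
    """Read REDIS_URL from the environment; intended to run once during startup."""
    return validate_redis_url(os.environ.get("REDIS_URL"))
```

Raising during startup, rather than falling back to a degraded mode, makes a bad QA configuration visible immediately instead of surfacing later as cache misses.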

QA validation focus

  • Confirm the app and Celery workers start cleanly against QA Redis (valid REDIS_URL); verify a missing/bad URL fails fast as expected.
  • Verify identity/auth caching and document grid caching still work (cache hits, no stale data) after the Redis consolidation.
  • Exercise document grids filtering by mail number and confirm correct results and counts.
  • Trigger an upload/processing backlog and observe workers scaling up, then scaling down when idle, with no dropped or killed in-flight tasks.
  • As a System Administrator, test /admin/celery/inspect, autoscale, grow, shrink, and drain_shutdown; confirm non-admins receive 403.
  • Smoke-test the standardized document-type column endpoints (list, delete, column document types) for the expected response shape.
  • Verify the QA CI/CD pipeline builds and pushes to the QA ECR repo on merge.
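For the mail-number check in the list above, one way to reason about "correct results and counts" is to compare a post-fetch filter with a database-level filter on the same data. The sketch below uses an in-memory SQLite table as a stand-in; the documents table, its columns, and the sample rows are hypothetical and do not reflect the real schema or the queries in document_query_service_optimized.py.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create a throwaway in-memory table with a few sample documents."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, mail_number TEXT)"
    )
    conn.executemany(
        "INSERT INTO documents (title, mail_number) VALUES (?, ?)",
        [
            ("Drawing A", "MAIL-001"),
            ("Drawing B", "MAIL-002"),
            ("Spec C", "MAIL-001"),
            ("Spec D", None),  # a document with no mail number
        ],
    )
    return conn


def filter_post_fetch(conn: sqlite3.Connection, mail_number: str) -> list:
    """Old style: fetch every row, then filter in Python."""
    rows = conn.execute(
        "SELECT id, title, mail_number FROM documents ORDER BY id"
    ).fetchall()
    return [row for row in rows if row[2] == mail_number]


def filter_pushdown(conn: sqlite3.Connection, mail_number: str) -> list:
    """New style: let the database apply the filter via a parameterized WHERE."""
    return conn.execute(
        "SELECT id, title, mail_number FROM documents "
        "WHERE mail_number = ? ORDER BY id",
        (mail_number,),
    ).fetchall()
```

Both functions should return the same rows for any mail number, including values that match nothing; a difference in rows or counts between the two approaches would indicate a regression in the pushdown.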