Backend Scalability: Redis Consolidation, Dynamic Worker Autoscaling, and Faster Document Filtering
Executive summary
This release promotes a batch of backend infrastructure improvements to QA, centered on making document and mail processing scale more smoothly under load. It consolidates the platform onto a single high-performance Redis connection, adds automatic scaling of background workers, and speeds up document grid filtering. There are no user-facing screen changes; the impact is performance, stability, and operational control.
Why this was needed
The backend previously supported two parallel Redis connection paths (a native TCP path and a slower Upstash HTTP/REST path), which added complexity and a known performance penalty. Background worker capacity was also fixed, so processing could fall behind during upload/processing spikes and capacity sat unused when idle. In addition, the mail-number document filter was applied in application code after results had already been fetched, and grid cache lookups could miss because module names were used case-sensitively in cache keys.
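To illustrate the cache-miss problem, here is a minimal sketch of a case-insensitive cache key. The function name, key prefix, and user_id parameter are hypothetical and are not taken from grid_columns.py; only the idea of lowercasing the module name comes from this release.

```python
def grid_cache_key(module: str, user_id: int) -> str:
    """Build a grid cache key that does not depend on the caller's casing.

    Hypothetical helper: the key layout is an assumption for illustration.
    """
    # Normalizing once, at key-construction time, means "Documents",
    # "documents" and " DOCUMENTS " all resolve to the same cache entry.
    normalized = module.strip().lower()
    return f"grid_columns:{normalized}:{user_id}"
```

Without the normalization step, a request for "Documents" and a request for "documents" would populate and read two different entries, which shows up as avoidable cache misses.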
Client / user impact
- More consistent performance and fewer cache misses for document grids and identity/auth lookups.
- Background jobs (uploads, processing) can scale worker capacity up automatically when a backlog builds and scale down when idle, without interrupting in-flight tasks.
- Filtering documents by mail number runs more efficiently at the database level.
- System administrators gain operational endpoints to inspect and control worker pools and to drain queues safely before maintenance.
- No visible UI changes; existing screens and API consumers behave the same.
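The scale-up/scale-down behavior described above can be sketched as a pure decision function. This is an illustration of hysteresis plus a cooldown, not the code in celery_autoscale_service.py: the class name, field names, and every threshold value below are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoscalePolicy:
    """Hypothetical tuning knobs; all defaults are illustrative only."""
    min_workers: int = 2
    max_workers: int = 8
    scale_up_backlog: int = 20     # grow when backlog is at or above this
    scale_down_backlog: int = 5    # shrink only when backlog is at or below this
    max_cpu_percent: float = 85.0  # do not grow while CPU is already saturated
    cooldown_seconds: float = 60.0


def decide(policy: AutoscalePolicy, current_workers: int, backlog: int,
           cpu_percent: float, now: float, last_change_at: float) -> int:
    """Return +1 to grow the pool, -1 to shrink it, 0 to hold.

    `backlog` stands for active + reserved + scheduled task counts.
    """
    # Cooldown: ignore all signals shortly after the previous resize.
    if now - last_change_at < policy.cooldown_seconds:
        return 0
    if (backlog >= policy.scale_up_backlog
            and cpu_percent < policy.max_cpu_percent
            and current_workers < policy.max_workers):
        return 1
    if (backlog <= policy.scale_down_backlog
            and current_workers > policy.min_workers):
        return -1
    # Hysteresis band: between the two thresholds nothing changes,
    # which keeps the pool from flapping around a single cutoff.
    return 0
```

Keeping the decision separate from the code that actually resizes the pool makes the thresholds easy to unit-test without a running broker.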
Technical scope
- Redis consolidation: removed the Upstash HTTP/REST client (app/utils/upstash_redis_client.py, ~410 lines) and all use_upstash branching in app/core/redis.py and identity_cache_service.py; the app now requires a single TCP/TLS REDIS_URL and adds tuning env vars (max connections, timeouts, health-check interval, client name).
- Celery worker autoscaling: new app/vendor_portal/services/autoscale/celery_autoscale_service.py polls active/reserved/scheduled task counts and CPU, growing/shrinking the pool with hysteresis and cooldowns; started from app lifespan in app/main.py behind CELERY_AUTOSCALE_ENABLED.
- Admin control API: new app/sync_dashboard/routers/celery_admin.py exposes /admin/celery/{inspect,autoscale,grow,shrink,drain_shutdown}, all guarded by a new require_system_admin RBAC dependency in app/api/dependencies/auth.py.
- Concurrency auto-tuning: celery_config.py and start.py compute worker counts from cgroup CPU/memory limits when concurrency is set to auto.
- Document filtering: mail_number moved from post-filter to DB-level pushdown in document_column_filter.py and document_query_service_optimized.py.
- Grid caching: module names lowercased for stable cache keys in grid_columns.py.
- API responses: document_type_columns.py standardized onto the shared APIResponse envelope.
- CI/build: GitHub Actions trigger and ECR repo switched from Dev to QA; Docker build switched to Dockerfile; .env.example removed and .gitignore tightened.
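As a rough sketch of the concurrency auto-tuning idea, the function below derives a worker count from cgroup v2 style limit strings (the contents of cpu.max and memory.max). The function names and the 512 MiB per-worker memory budget are assumptions; the actual formula in celery_config.py and start.py may differ.

```python
import math


def cpus_from_cpu_max(text: str, fallback: int) -> float:
    """Parse cgroup v2 cpu.max content such as "200000 100000" or "max 100000"."""
    quota, _, period = text.strip().partition(" ")
    if quota == "max" or not period:
        # No CPU limit is set, so fall back to the host CPU count.
        return float(fallback)
    return int(quota) / int(period)


def resolve_concurrency(setting: str, cpu_max_text: str, memory_max_text: str,
                        host_cpus: int, mb_per_worker: int = 512) -> int:
    """Return a worker count; `mb_per_worker` is an illustrative assumption."""
    if setting != "auto":
        # A fixed integer pins concurrency and bypasses auto-tuning.
        return int(setting)
    cpu_bound = max(1, math.floor(cpus_from_cpu_max(cpu_max_text, host_cpus)))
    mem = memory_max_text.strip()
    if mem == "max":
        # No memory limit is set, so CPU is the only constraint.
        return cpu_bound
    mem_bound = max(1, int(mem) // (mb_per_worker * 1024 * 1024))
    # Take the tighter of the two limits so neither is oversubscribed.
    return min(cpu_bound, mem_bound)
```

For example, a container limited to two CPUs (cpu.max of "200000 100000") with no memory limit resolves to 2 workers, regardless of how many cores the host has.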
Risk & mitigation
Medium. Redis is now mandatory and single-path: if REDIS_URL is misconfigured, startup fails fast rather than degrading, so the QA environment must have valid TCP/TLS Redis credentials. The Celery autoscaler and the auto concurrency tuning change runtime worker behavior; both are env-gated and default conservatively, and pool shrink/drain are graceful (no task termination). Mitigation: autoscaling can be disabled via CELERY_AUTOSCALE_ENABLED=false, concurrency can be pinned to fixed integers, and admin endpoints are restricted to System Administrators. Note: a .env containing live-looking Redis/broker URLs is committed in this diff and should be rotated/secured.
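The fail-fast behavior can be pictured as a startup check along the following lines. This is a sketch under assumptions: the function and exception names are hypothetical, and the real check in app/core/redis.py may validate more (or less) than the URL scheme and host.

```python
import os
from typing import Mapping
from urllib.parse import urlparse


class RedisConfigError(RuntimeError):
    """Raised at startup when REDIS_URL is unusable (hypothetical name)."""


def validated_redis_url(env: Mapping[str, str] = os.environ) -> str:
    """Return REDIS_URL if it looks like a TCP/TLS Redis URL, else raise."""
    url = env.get("REDIS_URL", "").strip()
    if not url:
        raise RedisConfigError("REDIS_URL is not set")
    parsed = urlparse(url)
    # redis:// is plain TCP and rediss:// is TLS; an https:// REST endpoint
    # (the removed Upstash path) is rejected here.
    if parsed.scheme not in ("redis", "rediss"):
        # Report only the scheme so credentials never reach the logs.
        raise RedisConfigError(f"unsupported REDIS_URL scheme: {parsed.scheme!r}")
    if not parsed.hostname:
        raise RedisConfigError("REDIS_URL has no host")
    return url
```

Calling a check like this before the application starts serving turns a misconfigured environment into an immediate, readable startup error instead of a later runtime failure.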
QA validation focus
- Confirm the app and Celery workers start cleanly against QA Redis (valid REDIS_URL); verify a missing/bad URL fails fast as expected.
- Verify identity/auth caching and document grid caching still work (cache hits, no stale data) after the Redis consolidation.
- Exercise document grids filtering by mail number and confirm correct results and counts.
- Trigger an upload/processing backlog and observe workers scaling up, then scaling down when idle, with no dropped or killed in-flight tasks.
- As a System Administrator, test /admin/celery/inspect, autoscale, grow, shrink, and drain_shutdown; confirm non-admins receive 403.
- Smoke-test the standardized document-type column endpoints (list, delete, column document types) for the expected response shape.
- Verify the QA CI/CD pipeline builds and pushes to the QA ECR repo on merge.
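For the mail-number check in the list above, one way to reason about correctness is that the DB-level filter must return exactly what the old post-filter returned. The sketch below demonstrates that equivalence on an in-memory SQLite table; the documents table, its columns, and the function names are hypothetical stand-ins, not the application's schema or query services.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory table standing in for the documents store."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (id INTEGER PRIMARY KEY, mail_number TEXT)"
    )
    conn.executemany(
        "INSERT INTO documents (mail_number) VALUES (?)",
        [("M-001",), ("M-002",), ("M-001",), (None,)],
    )
    return conn


def post_filter(conn: sqlite3.Connection, mail_number: str) -> list:
    """Old approach: fetch every row, then filter in application code."""
    rows = conn.execute(
        "SELECT id, mail_number FROM documents ORDER BY id"
    ).fetchall()
    return [row for row in rows if row[1] == mail_number]


def pushdown(conn: sqlite3.Connection, mail_number: str) -> list:
    """New approach: the database applies the filter in the WHERE clause."""
    return conn.execute(
        "SELECT id, mail_number FROM documents WHERE mail_number = ? ORDER BY id",
        (mail_number,),
    ).fetchall()
```

A QA comparison in this spirit would run both strategies against the same data and assert that the rows and counts match, including for mail numbers that match nothing.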