QA · Backend

Background Job Scaling, Worker Health Resilience, and Deployment Config Hardening (QA)

PR #79 · aisupport-datagain · Sep 4, 2025 · 16:04 UTC

Executive summary

This release promotes a set of backend infrastructure improvements to QA focused on the background-processing (Celery) tier: smarter worker auto-scaling, more reliable worker health checks, standardized Docker deployment configuration, and a security-hygiene change that stops the live environment file from being tracked. There are no end-user feature changes here; the goal is steadier throughput under load and safer, more consistent deployments.

Why this was needed

As document and vendor-upload volumes grow, a fixed pool of background workers either sits idle (wasting capacity) or falls behind during spikes. The previous health check could also report "no workers" during startup or when the broker was momentarily slow, producing false alarms. Separately, deployment scripts referenced inconsistent Dockerfile names, and the real environment file (with secrets) was committed to the repository, which is both a security and a reproducibility concern.

Client / user impact

  • More resilient processing of background jobs (e.g., vendor uploads) during traffic spikes, with workers scaling up and back down automatically.
  • Fewer false "workers offline" health alerts during startup or transient broker latency.
  • More consistent, reproducible deployments across environments.
  • Improved security posture: live secrets are no longer tracked in source control, and a documented .env.example makes safe configuration easier.
  • No visible change to client-facing application behavior.

Technical scope

Net changes against the QA base:

  • Celery autoscale (celery_autoscale_service.py): backlog threshold is now a configurable ratio of current concurrency (CELERY_AUTOSCALE_BACKLOG_RATIO, replacing a hardcoded 1.5x); auto-detects whether native autoscale is supported and transparently falls back to pool grow/shrink; discovers live worker nodes so control messages skip non-workers (beat/flower) and avoid pidbox errors.
  • Worker health (celery_health.py): reuses the configured Celery app (matching SSL/transport options), adds configurable control-plane timeout, total wait, and retry interval to avoid false negatives.
  • Deployment (docker-compose.all-in-one.yml, docker-start.sh, docker-quick-start.sh): standardize on Dockerfile instead of Dockerfile.all-in-one; add --autoscale with min/max worker bounds; run services as non-root user 1000:1000; remove inter-service depends_on.
  • Worker startup (scripts/start_worker.sh): optional --autoscale flag driven by env; removed --without-heartbeat.
  • Security/config: added .env.example (~205 documented keys, no values); renamed tracked .env to .enva to remove live secrets from version control.
  • Cleanup (app/main.py): removed the unused asyncio import.
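
The configurable backlog threshold from the autoscale item can be illustrated with a small sketch. This is not the code in celery_autoscale_service.py: the function names, the one-step grow/shrink policy, and the use of 1.5 as the env default are assumptions made for illustration. Only the CELERY_AUTOSCALE_BACKLOG_RATIO key and the former hardcoded 1.5x come from this release.

```python
import os


def backlog_threshold(concurrency: int, ratio: float) -> float:
    """Queue depth above which the pool counts as backlogged."""
    return concurrency * ratio


def desired_concurrency(queue_depth: int, current: int,
                        min_workers: int, max_workers: int,
                        ratio: float) -> int:
    """Return the next pool size, clamped to [min_workers, max_workers].

    Hypothetical policy: grow by one when the backlog exceeds the
    threshold, shrink by one when the queue is empty, otherwise hold.
    """
    if queue_depth > backlog_threshold(current, ratio):
        target = current + 1
    elif queue_depth == 0:
        target = current - 1
    else:
        target = current
    return max(min_workers, min(max_workers, target))


# Assumption: fall back to the previously hardcoded 1.5 when the key is unset.
RATIO = float(os.environ.get("CELERY_AUTOSCALE_BACKLOG_RATIO", "1.5"))
```

With a ratio of 1.5 and four workers, the threshold is a queue depth of 6, so a depth of 7 would grow the pool to five while a depth of 6 would leave it unchanged. A lower ratio makes scaling more eager; a higher one tolerates more backlog per worker before growing.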

Note: the PR's commit history references earlier feature work (filtering, folder fixes, document types, batch document count), but those already existed in QA; the actual code delta in this promotion is the infrastructure work above.

Risk & mitigation

Risk is moderate and limited to infrastructure. Autoscaling and worker-command changes affect throughput and resource use: misconfigured min/max bounds or a poorly chosen backlog ratio could over- or under-provision workers, and removing --without-heartbeat slightly changes broker chatter. The Docker filename switch and the non-root user: 1000:1000 setting could break container builds or volume permissions if the environment is misaligned. Mitigations: all autoscale and timeout values are env-configurable with sensible defaults; native autoscale falls back to pool grow/shrink automatically; health-check failures are non-fatal and logged. Verify that no deployment still relies on a tracked .env now that the file has been renamed to .enva.
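
The retry behavior that guards against false "no workers" reports can be sketched as a bounded polling loop. This is an illustration under stated assumptions, not the code in celery_health.py: wait_for_workers and its parameters are hypothetical names, and the ping callable stands in for whatever the configured Celery app exposes. With Celery, a plausible choice is app.control.inspect(timeout=...).ping(), which returns a dict of replying nodes or None.

```python
import time
from typing import Callable, Dict, Optional


def wait_for_workers(ping: Callable[[], Optional[Dict]],
                     total_wait: float = 10.0,
                     retry_interval: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> Dict:
    """Poll `ping` until a worker replies or `total_wait` seconds elapse.

    Returns the replies, or an empty dict if nothing answered in time.
    A failure is reported to the caller instead of raised, so the check
    stays non-fatal.
    """
    deadline = clock() + total_wait
    while True:
        replies = ping()
        if replies:
            return replies
        # Stop if another sleep would carry us past the deadline.
        if clock() + retry_interval > deadline:
            return {}
        sleep(retry_interval)


# Example wiring with Celery (assumes an existing configured `app`):
#   replies = wait_for_workers(lambda: app.control.inspect(timeout=2.0).ping())
```

Separating the per-attempt timeout (inside ping) from the total wait and retry interval lets a slow broker or a still-starting worker answer on a later attempt instead of failing the first one.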

QA validation focus

  • Confirm background jobs (vendor uploads, document processing) complete normally and the worker pool scales up under load and shrinks when idle.
  • Verify the worker health/status endpoint reports workers online (no false "no workers" errors) during startup and steady state.
  • Bring the all-in-one Docker stack up using the standardized Dockerfile; confirm worker, beat, and flower start, run as user 1000:1000, and have correct volume/log write permissions.
  • Sanity-check Flower for worker registration and that autoscale control messages are not sent to beat/flower.
  • Confirm the application still loads its environment correctly and that no secrets are present in tracked files (.env is untracked; .env.example has keys only).
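
The "keys only" check on .env.example can be automated with a short script. The sketch below is a hypothetical helper, not part of the PR; it assumes plain KEY=value lines with # comments and treats empty or empty-quoted values as acceptable.

```python
from typing import List


def keys_with_values(text: str) -> List[str]:
    """Return the keys in an env-style file that carry a non-empty value."""
    offenders = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines that are not assignments.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Treat KEY=, KEY="" and KEY='' as empty.
        if value.strip().strip("'\""):
            offenders.append(key.strip())
    return offenders


if __name__ == "__main__":
    with open(".env.example", encoding="utf-8") as fh:
        leaked = keys_with_values(fh.read())
    print("keys with values:", leaked or "none")
```

To confirm that .env itself is untracked, running git ls-files --error-unmatch .env in the repository should exit with a non-zero status.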