Debugging guide for Waldur Mastermind
Introduction
This guide provides systematic approaches for troubleshooting Waldur Mastermind deployments. Use these techniques to diagnose issues with API, worker processes, email delivery, and resource provisioning.
Common debugging scenarios
Symptom-based troubleshooting guide
| Symptom | First Check | Next Steps |
|---|---|---|
| Users not receiving emails | Email logs | SMTP configuration, email templates |
| API returning errors | API logs | HTTP status codes, request parameters |
| Resources stuck in provisioning | Worker logs | Backend connectivity, quota issues |
| Slow performance | Database logs | Query performance, connection pooling |
| Authentication failures | API logs with auth filter | IdP configuration, token issues |
Accessing and filtering logs
Email-related events
Docker Compose
Check your specific deployment logs location, typically:
Helm
Component logs
API Logs in Helm
Worker Logs in Helm
Advanced log filtering
Troubleshooting component issues
API server problems
- Check if the API container is running:
- Verify API health endpoint:
- Check for configuration issues:
Worker issues
- Check Celery worker status:
- Verify task queue connectivity:
Database troubleshooting
- Check database connectivity:
- Identify slow queries (requires database access rights):
API endpoint issues
- Test endpoint with curl (store tokens in environment variables for security):
- Common HTTP status codes:
  - 401/403: Authentication/authorization issue
  - 404: Resource not found
  - 500: Server error (check logs)
- For development environments only (⚠️ NEVER in production): Temporarily increase debug verbosity through admin settings panel or by editing configuration files.
Authentication problems
- Check token validity:
- Verify IdP configuration:
Resource provisioning failures
- Check resource state:
- Check backend connectivity (if you have the right plugin installed):
Log management
Log rotation
Both Docker Compose and Kubernetes deployments typically have configured log rotation:
Centralized logging
For production deployments, consider:
- Fluentd for log collection
- Elasticsearch for storage and search
- Kibana for visualization
Troubleshooting broker / Celery worker timeouts
This section covers Sentry events of the form SystemExit: 1 originating in
waldur_mastermind/policy/handlers.py (or any handler chain that ends in
kombu/basic_publish_confirm), and HTTP 500s on high-frequency endpoints such
as set_usage and set_state_done. The pattern is gunicorn killing a worker
that is parked in recv() waiting for an AMQP publish ACK that never arrives.
Symptom checklist
The following combination strongly suggests a broker / connection-state issue rather than slow application code:
- Sentry error type is SystemExit: 1 (not an application exception class)
- The deepest in-app frame is in policy/handlers.py (or marketplace/handlers.py) calling tasks.X.delay(...)
- The bottom of the stack is kombu/messaging.py → amqp/channel.py:basic_publish_confirm → amqp/transport.py:_read → recv
- HTTP transaction shows the worker dying right at the gunicorn worker timeout (default 30 s)
- Restarting the broker temporarily reduces the rate but does not eliminate it
Layered diagnostic plan
When chasing these failures, isolate one layer at a time. The first layer that behaves correctly is the layer above where the bug lives.
Layer 0: confirm the failing call path is broker-bound
Capture AMQP traffic on a single API pod during a failure window:
If publish count equals basic-ack count, the broker is sending ACKs the
publisher fails to receive (move to Layer 1). If publish exceeds basic-ack,
the broker is slow to ACK or never ACKs (move to Layer 2).
Layer 1: kernel-level socket health on the publisher
Inspect TCP state of the API pod's AMQP connections while a recv() is stuck:
Pay attention to:
- State: ESTABLISHED with a stuck recv() indicates the connection silently went half-open
- unacked and retrans columns: non-zero values mean the kernel is retransmitting and the peer is not responding
- Recv-Q: bytes sitting in the receive buffer but not consumed; extremely unusual when recv() is blocked, and suggests an application-level deadlock
Check whether TCP keepalive is enabled on AMQP sockets (it should be, because
py-amqp sets SO_KEEPALIVE=1 unconditionally):
Layer 2: broker-side per-publish latency
Measure confirm RTT directly from inside the API pod against the real broker.
The probe_broker_latency management command publishes N no-op messages with
a long countdown (so they land in a delay queue without executing) and
reports percentiles:
Interpretation:
- p99 < 50 ms — broker is fast. Failures live in the network / connection layer (Layer 1)
- p99 50–500 ms — broker is slow but usually tolerable; combined with publish amplification can still exceed the worker timeout
- p99 > 1 s — broker has a real latency problem; investigate quorum WAL, disk I/O on the broker PVC, or memory pressure
- max close to the worker timeout: occasional confirms are nearly fatal; an upper bound on confirm wait is required
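The bands above can be applied mechanically to a list of confirm round-trip samples. This is a minimal sketch, not the probe_broker_latency implementation: the function name, the return shape, and the 80 % "near the worker timeout" threshold are assumptions made for illustration.

```python
import statistics


def classify_confirm_latency(samples_ms, worker_timeout_s=30.0):
    """Map publish-confirm round-trip samples (in ms) onto the bands listed above."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    p99 = statistics.quantiles(samples_ms, n=100, method="inclusive")[98]
    worst = max(samples_ms)
    if p99 < 50:
        band = "fast"
    elif p99 <= 500:
        band = "slow-but-tolerable"
    elif p99 > 1000:
        band = "broker-latency-problem"
    else:
        band = "unclassified"  # 500 ms to 1 s is not covered by the list above
    # Assumption: "close to the worker timeout" means within 80 % of it.
    near_timeout = worst >= 0.8 * worker_timeout_s * 1000
    return {"p99_ms": p99, "max_ms": worst, "band": band, "max_near_timeout": near_timeout}
```

Feed it the per-message timings collected during a probe run; a "fast" band combined with ongoing worker timeouts points back at Layer 1.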
The probe payload uses an invalid scope_id so the task body is a no-op
even when it eventually executes. With countdown=86400 the messages land
in a deep celery_delayed_* bucket and will rotate down naturally; no
manual cleanup is required.
Layer 3: Kubernetes network / conntrack on the worker nodes
If broker confirms are fast but a percentage of requests still time out, something on the kernel network path is dropping the ACK frame in transit. The most common cause in busy Kubernetes clusters is conntrack table exhaustion.
A count / max ratio above ~0.7, or any nf_conntrack: table full,
dropping packets log line, is the signal. Resolution is a node-level
sysctl bump (net.netfilter.nf_conntrack_max) and tighter nf_conntrack_*
TCP timeouts.
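The count / max check can be scripted on a node. A minimal sketch, assuming the conntrack module is loaded so that the two procfs files exist; the helper names are hypothetical and the 0.7 threshold is the rule of thumb stated above.

```python
from pathlib import Path

# Standard procfs location for the netfilter conntrack counters on Linux.
PROC_NETFILTER = Path("/proc/sys/net/netfilter")


def read_conntrack(proc_dir=PROC_NETFILTER):
    """Return (current entries, table size) from the kernel counters."""
    count = int((proc_dir / "nf_conntrack_count").read_text())
    maximum = int((proc_dir / "nf_conntrack_max").read_text())
    return count, maximum


def is_near_exhaustion(count, maximum, threshold=0.7):
    """True when the table is fuller than the given fraction."""
    if maximum <= 0:
        raise ValueError("nf_conntrack_max must be positive")
    return count / maximum > threshold
```

Run it on the node itself (or in a host-namespace debug pod); inside an ordinary pod the files may be missing or reflect a different network namespace.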
If kubectl debug node/<node> is permitted in your cluster, the
equivalent is:
Layer 4: gunicorn and Celery configuration
Verify gunicorn's effective timeout and Celery's actual broker config from inside a running API pod (config values flow through Django settings):
If broker_heartbeat is None, the negotiation falls back to the broker's
default — typically 60 s. Combined with two-missed-heartbeat detection,
that means dead connections are detected ~120 s after the fact, which is
after gunicorn has already killed the worker.
Configuration knobs that actually fix this
These are the Celery / kombu settings that matter for broker resilience. Each carries a gotcha that is easy to get wrong.
Heartbeat: must be set via broker_transport_options
The top-level CELERY_BROKER_HEARTBEAT setting is silently dropped on
the publisher path. Celery does not propagate it to
app.broker_connection() or app.producer_pool. To actually reach the
kombu Connection, the value must live inside broker_transport_options:
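A minimal sketch of such a settings fragment, assuming the Django setting name CELERY_BROKER_TRANSPORT_OPTIONS used elsewhere in this guide. The values 30 and True are the ones this guide discusses, not recommendations for every deployment.

```python
# Django settings fragment (sketch): the heartbeat lives inside the transport
# options dict rather than in a top-level CELERY_BROKER_HEARTBEAT setting.
CELERY_BROKER_TRANSPORT_OPTIONS = {
    "heartbeat": 30,          # seconds; value discussed in this guide
    "confirm_publish": True,  # wait for the broker's ACK on every publish
}
```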
py-amqp sends a heartbeat frame at half the configured interval. With
heartbeat=30, the broker drops the connection after roughly 30 s of
missed heartbeats — early enough to interrupt a stuck publisher before
gunicorn's 30 s worker timeout.
Heartbeats do not pump on idle pool connections
py-amqp only sends heartbeat frames from inside drain_events() /
the connection's event loop. A producer connection sitting idle in
the kombu pool between publishes has nothing pumping that loop, so
no client-side heartbeat ever fires. Once the connection is idle for
2 × heartbeat, the broker tears it down. The publisher does not
learn about the close until its next publish — and if the FIN never
arrived (NAT eviction, conntrack drop, replica failover), the next
publish hangs in recv() waiting for an ACK from a peer that is
gone.
In other words: the heartbeat setting is necessary but not sufficient for the publisher path. The TCP keepalive settings below are what actually catch half-open pool connections, because they run at the kernel layer regardless of application activity.
Background: kombu#621, kombu#59 (open since 2011), celery#9259.
TCP keepalive: keys must be integer socket constants
py-amqp enables SO_KEEPALIVE unconditionally but ships defaults of
TCP_KEEPIDLE=60, TCP_KEEPINTVL=10, TCP_KEEPCNT=9 — a total ~150 s
detection window that fires after the gunicorn timeout. Override with
tighter values so the kernel closes a stuck socket in ~25 s:
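A minimal sketch of such an override. The idle/interval/count values of 10/5/3 are illustrative assumptions chosen to add up to 25 s, and the helper names are hypothetical; the resulting dict is meant to sit under the socket_settings key of the transport options.

```python
import socket


def build_keepalive_settings(idle=10, interval=5, count=3):
    """Build a socket_settings dict keyed by integer socket constants."""
    wanted = {"TCP_KEEPIDLE": idle, "TCP_KEEPINTVL": interval, "TCP_KEEPCNT": count}
    settings = {}
    for name, value in wanted.items():
        # getattr guard: skip constants this platform's socket module lacks.
        opt = getattr(socket, name, None)
        if opt is not None:
            settings[opt] = value
    return settings


def detection_window_s(idle, interval, count):
    """Seconds until the kernel gives up: idle time plus all failed probes."""
    return idle + interval * count


# Goes under the "socket_settings" key of CELERY_BROKER_TRANSPORT_OPTIONS.
SOCKET_SETTINGS = build_keepalive_settings()
```

With the py-amqp defaults, detection_window_s(60, 10, 9) gives 150; with the values above it gives 25.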
The socket_settings keys must be integer constants from the
socket module — py-amqp passes them straight to
setsockopt(SOL_TCP, opt, val). String keys raise
TypeError: 'str' object cannot be interpreted as an integer on the
first connection attempt. Some constants (TCP_KEEPIDLE in
particular) are Linux-only; the getattr guard above keeps the
settings importable on macOS dev machines.
Watch for overlay files that re-assign CELERY_BROKER_TRANSPORT_OPTIONS
Django settings are merged by import order. Any plain assignment to
CELERY_BROKER_TRANSPORT_OPTIONS
in a settings overlay (image-side /etc/waldur/settings.py, helm
override.conf.py, an environment-specific
local_settings.py, …) loaded after the base
celery_settings.py replaces the dict wholesale. The heartbeat
and socket_settings keys defined upstream are silently dropped. The
process starts cleanly, the broker connects, every publish appears to
succeed under nominal load — and the resilience knobs that catch
half-open connections are simply not there. The first symptom is a
slow drift of WORKER TIMEOUT events in gunicorn logs and audit
events terminating about 30 s after their start.
To avoid the trap, always merge into the existing dict in any overlay layer:
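One way to do the merge is a small helper, sketched here under stated assumptions: merge_transport_options and example_overlay_option are hypothetical names, and the first assignment is only a stand-in for the dict the base settings file would already have defined.

```python
def merge_transport_options(existing, overrides):
    """Return a new dict: `existing` plus `overrides`, keeping keys set upstream."""
    merged = dict(existing or {})
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge one level deep so nested dicts such as socket_settings survive.
            merged[key] = {**merged[key], **value}
        else:
            merged[key] = value
    return merged


# Stand-in for the value defined by the base settings file.
CELERY_BROKER_TRANSPORT_OPTIONS = {"heartbeat": 30, "confirm_publish": True}

# Overlay layer: extend the existing value instead of assigning a fresh literal.
CELERY_BROKER_TRANSPORT_OPTIONS = merge_transport_options(
    CELERY_BROKER_TRANSPORT_OPTIONS,
    {"example_overlay_option": 1},  # hypothetical key, for illustration only
)
```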
Or, preferred, keep the full dict in one place (the base settings
file) and stop mentioning CELERY_BROKER_TRANSPORT_OPTIONS in
overlays at all. The audit_broker_config management command (see
below) detects the missing keys, but the static audit only catches
the issue if you run it after the overlay has been applied — i.e.
inside a running pod, not against a checkout of the source tree.
The same caution applies to any Django setting whose value is a dict
or list (DATABASES, CACHES, LOGGING, INSTALLED_APPS,
MIDDLEWARE, …). Replacement is the default; merging is opt-in.
Worker recycling: cost of low max_tasks_per_child
A low CELERY_WORKER_MAX_TASKS_PER_CHILD (e.g. 100) is often set
defensively against memory leaks, but each recycle costs a fork +
full broker reconnect handshake. Under heavy publish load this churn
measurably outweighs the leak protection. Values in the low
thousands (1000–2000) are typical for production celery workers
without leaks.
Per-publish confirm_timeout is plumbed end-to-end but has no global default
In current pinned versions (celery 5.5.x, kombu 5.5.x, amqp 5.3.x) the
confirm_timeout keyword reaches all the way from
task.apply_async(confirm_timeout=N) down into
amqp.Channel.basic_publish_confirm(..., confirm_timeout=N) →
self.wait([Basic.Ack, Basic.Nack], timeout=N). A timed-out wait
raises RecoverableChannelError, which kombu treats as a recoverable
publish error.
There is, however, no Celery configuration setting that supplies a
default confirm_timeout for every publish. Three options to apply
one cluster-wide:
- Wrap apply_async / delay in a project helper that always sets confirm_timeout=N (smallest blast radius)
- Subclass kombu.Producer to inject confirm_timeout in publish() and wire the subclass into the Celery app's producer pool (one place to maintain, covers all callsites)
- Set CELERY_BROKER_POOL_LIMIT = 0 to disable the producer pool entirely so every publish opens a fresh AMQP connection. This sidesteps pool poisoning at the cost of a ~5–10 ms TCP+AMQP handshake per publish (the workaround used by the original reporter of celery#9259)
The heartbeat + TCP keepalive combination above remains the primary
defense; per-publish confirm_timeout is belt-and-braces for when a
half-open is hit between keepalive probes.
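The first option in the list above could look roughly like this. It is a sketch only: publish_with_confirm_timeout and the 5-second default are hypothetical, and it relies on the statement above that apply_async forwards confirm_timeout down to the publish call.

```python
DEFAULT_CONFIRM_TIMEOUT = 5.0  # seconds; illustrative value, tune per deployment


def publish_with_confirm_timeout(task, *args, confirm_timeout=DEFAULT_CONFIRM_TIMEOUT, **kwargs):
    """Hypothetical project helper: like task.delay(), with a bounded confirm wait."""
    # `task` is any Celery task; positional and keyword arguments are passed
    # through as the task's own arguments.
    return task.apply_async(args=args, kwargs=kwargs, confirm_timeout=confirm_timeout)
```

Call sites then use publish_with_confirm_timeout(some_task, arg1, arg2) in place of some_task.delay(arg1, arg2). A task that itself takes a keyword argument named confirm_timeout would need a differently named parameter on the helper.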
Known upstream caveat
py-amqp#452 (open):
when confirm_timeout fires, py-amqp raises
RecoverableChannelError rather than
RecoverableConnectionError, which can prevent kombu's
ensure() from picking a fresh connection on retry. In practice
the next publish acquires a different connection from the pool,
but a tight retry_policy loop pinned to one connection can
exhaust retries on the same dead socket.
Correlating broker logs with worker timeouts
When a half-open connection is the root cause, the broker logs a
warning each time it observes a client socket close without the
matching AMQP connection.close handshake. Counting these warnings
over time, and correlating against worker-timeout events in the API
logs, gives a quick independent confirmation that publishers are
dying mid-confirm rather than failing in some other layer.
The sed strips ANSI colour codes — RabbitMQ tags warning lines with
them and naive grep/awk counters drop the timestamps otherwise.
Interpretation:
- Cluster-wide baseline of tens per hour: routine kombu pool churn (short-lived publishers, worker recycles). Heartbeat negotiation is not happening; look for zero "missed heartbeats" warnings as confirmation. A count of zero across 24 h means no client has negotiated heartbeats with this cluster.
- Hundreds-to-thousands per hour, sustained: connection storm. Most often a publisher with broker_pool_limit=0 set on a busy service, or a runaway reconnect loop. Cross-check API and worker pods for Connection.connect() log lines.
- Burst clustered around a known incident: a broker replica flapped, evicted all existing connections, and publishers reconnected. Healthy and expected after a broker restart.
To check whether a specific worker-timeout event correlates with a
broker-side close on the same connection, line up timestamps. The
modal Δ between an audit-log row that ends in failed and the
nearest preceding broker client unexpectedly closed warning on the
same node is the publisher's recv() budget — i.e. the gunicorn
worker timeout. A consistent Δ of ~30 s across many events is a
strong signal that worker timeouts are upstream of (and caused by)
publishers stuck in basic_publish_confirm.
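The timestamp line-up can be automated once both event streams are parsed into datetime objects. A minimal sketch: the function name and the 5-second bucketing are assumptions, and extracting timestamps from the audit log and broker log is left out because it depends on the log formats in use.

```python
from bisect import bisect_right
from collections import Counter


def modal_delta_seconds(failed_at, broker_closed_at, bucket_s=5):
    """Most common gap between each failure and the nearest preceding broker close."""
    closes = sorted(broker_closed_at)
    deltas = []
    for ts in failed_at:
        i = bisect_right(closes, ts)  # closes[:i] are at or before ts
        if i == 0:
            continue  # no preceding broker-side close for this event
        delta = (ts - closes[i - 1]).total_seconds()
        # Bucket to the nearest multiple of bucket_s so 29 s and 31 s count together.
        deltas.append(int(round(delta / bucket_s)) * bucket_s)
    if not deltas:
        return None
    return Counter(deltas).most_common(1)[0][0]
```

A result that sits at the gunicorn timeout across many events matches the pattern described above; a scattered distribution suggests the two event streams are unrelated.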
Verifying a configuration change end-to-end
The wrong setting will silently no-op and ship broken. Run the
audit_broker_config management command for a static check, then
probe_broker_latency to confirm the broker actually responds in
the expected latency band:
The audit command catches each of the gotchas listed above (top-level
BROKER_HEARTBEAT being ignored, string-keyed socket_settings,
missing confirm_publish, etc.) and prints a human-readable summary
plus a remediation hint per finding.
For a deeper inspection of the actual kombu connection state — useful when verifying a brand-new transport option py-amqp may or may not have picked up — drop into the Django shell and introspect the live connection:
heartbeat should be non-zero, confirm_publish should be True,
socket_settings should be a dict of integer keys, and the kernel
values should reflect the configured TCP keepalive parameters.
Common signal handler / publish patterns that compound the problem
Independent of the transport layer, application code that publishes many synchronous Celery tasks per HTTP request multiplies any tail latency by the publish count. Two patterns are worth knowing:
Synchronous fan-out in post_save handlers
If multiple signal handlers are wired to the same model's post_save
and each issues .delay() synchronously, one HTTP request becomes N
broker round-trips. Each round-trip is a chance to hit the slow path.
Debounce repeated triggers on a per-scope cache key (see
waldur_mastermind/policy/handlers.py:_debounced_call) so a burst of
saves collapses to a single Celery publish per scope per debounce
window.
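The debounce idea can be sketched independently of Waldur's code. This is not the _debounced_call implementation: TTLCache is an in-memory stand-in for the add() method of Django's cache API, and debounced_call, the key prefix, and the 10-second window are hypothetical.

```python
import time


class TTLCache:
    """In-memory stand-in for the subset of Django's cache API used here."""

    def __init__(self, clock=time.monotonic):
        self._expiry = {}
        self._clock = clock

    def add(self, key, value, timeout):
        """Store the key only if absent or expired; return True when stored."""
        now = self._clock()
        if self._expiry.get(key, float("-inf")) > now:
            return False
        self._expiry[key] = now + timeout
        return True


def debounced_call(cache, scope_key, publish, window_s=10):
    """Run publish() at most once per scope per debounce window."""
    if cache.add(f"debounce:{scope_key}", 1, window_s):
        publish()
        return True
    return False
```

In a handler, publish would be something like lambda: task.delay(scope_id), and cache would be Django's shared cache so the window holds across worker processes.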
Deferring publishes to transaction.on_commit
When the trigger occurs inside a database transaction, publishing
through transaction.on_commit(lambda: task.delay(...)) moves the
publish out of the transaction's critical section. This also avoids
the race where a Celery worker picks up the task before the DB row
that justifies it is committed.
Cross-references
- Architecture overview — where the broker fits in the topology
- Hardware requirements — broker sizing and PVC recommendations
- Troubleshooting gunicorn workers — complementary section below; broker timeouts manifest as gunicorn worker deaths
Troubleshooting gunicorn workers
The gunicorn master spawns N worker processes. Workers handle HTTP
requests; if a worker doesn't return within --timeout seconds (default
30 s) the master sends it SIGABRT and logs WORKER TIMEOUT (pid:…),
then forks a replacement. Long-running publishes, slow DB queries,
deadlocks, and certain library hangs all manifest as worker timeouts.
This section covers diagnosing what workers are actually doing and why
they die.
Inspect effective gunicorn configuration
Configuration can live in /etc/waldur/gunicorn.conf.py, in command-line
flags on the master process, or in environment variables. Verify all
three:
Settings to confirm:
- timeout: kill-worker threshold; default 30 s
- graceful_timeout: SIGTERM grace before SIGKILL (default 30 s)
- workers: process count
- worker_class: sync (default), gthread, gevent, eventlet
- threads: only meaningful for gthread
- preload_app: if true, the app is imported in the master and inherited by workers via fork
- max_requests: requests per worker before recycle (0 = never)
- max_requests_jitter: randomisation to avoid synchronised recycles
Inventory of worker processes and their ages
Sudden gaps in worker ages (one is 30 s old, others are minutes/hours) mean workers are being killed and respawned. Steady ages mean they're healthy.
If ages are all < gunicorn timeout, workers are dying continuously —
the symptom of stuck requests at the timeout boundary. Cross-check with
the worker-timeout log count below.
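The age heuristic reduces to a small classification. A sketch with hypothetical names and labels; the ages are assumed to have been collected already (for example from process elapsed times), and a freshly started pod will also report every worker as young.

```python
def classify_worker_ages(ages_s, timeout_s=30.0):
    """Classify gunicorn worker ages (seconds) against the worker timeout."""
    if not ages_s:
        raise ValueError("no worker ages supplied")
    young = sum(1 for age in ages_s if age < timeout_s)
    if young == len(ages_s):
        return "all-respawning"   # every worker is younger than the timeout
    if young:
        return "some-respawning"  # gaps: a few workers were recently replaced
    return "steady"
```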
Count WORKER TIMEOUT events in logs
The master process logs every timeout-kill. Count them over a window to get a rate:
A handful per hour can be normal under heavy load. Dozens per hour across pods is a real problem and warrants the deeper investigation in the next sections.
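To turn a saved log into an hourly rate, the timeout lines can be bucketed by timestamp. A minimal sketch, assuming gunicorn's default error-log layout ([YYYY-MM-DD HH:MM:SS +ZZZZ] [pid] [LEVEL] message); the function name is hypothetical and a customised log format needs a different pattern.

```python
import re
from collections import Counter

# Captures "YYYY-MM-DD HH" from the timestamp of a WORKER TIMEOUT line.
TIMEOUT_RE = re.compile(
    r"\[(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2} [+-]\d{4}\].*WORKER TIMEOUT"
)


def timeouts_per_hour(lines):
    """Count WORKER TIMEOUT log lines per clock hour."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return dict(counts)
```

Pass it the lines of a saved log file, for example timeouts_per_hour(open(path)), and compare the per-hour counts against the thresholds above.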
Identify what a worker is currently stuck on
When a Sentry event arrives mid-request the stack at SIGABRT time is
captured automatically. To inspect a still-running worker mid-flight,
you have three tiers of options depending on the deployment's
security posture.
Why kubectl exec ... py-spy/strace usually fails
Production waldur containers ship hardened: they run as a non-root UID
with the bounding capability set stripped (no CAP_SYS_PTRACE), and
the kernel's ptrace_scope is typically 1 (restricted). Both
py-spy and strace use the ptrace(2) syscall, which the kernel
will reject with EPERM in this setup.
Confirm the limitation in your cluster before reaching for a workaround:
- CapEff: 0000000000000000 → zero effective capabilities; ptrace will fail
- CapBnd missing bit 19 (0x80000) → CAP_SYS_PTRACE cannot be raised
- ptrace_scope: 1 → restricted; cannot attach to non-descendant processes
- Non-root id → no admin override
If all four apply you must use one of the elevated-privilege patterns below.
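The four checks can be evaluated from the text of /proc/<pid>/status plus the ptrace_scope value and the UID. A sketch with hypothetical helper names; it only decodes the masks and does not attempt a ptrace call.

```python
CAP_SYS_PTRACE = 19  # bit position; 1 << 19 == 0x80000


def parse_cap_masks(status_text):
    """Extract the Cap* hex masks from the text of /proc/<pid>/status."""
    masks = {}
    for line in status_text.splitlines():
        name, _, value = line.partition(":")
        if name in ("CapInh", "CapPrm", "CapEff", "CapBnd", "CapAmb"):
            masks[name] = int(value.strip(), 16)
    return masks


def has_cap(mask, bit=CAP_SYS_PTRACE):
    """True when the given capability bit is set in the mask."""
    return bool(mask & (1 << bit))


def ptrace_blocked(status_text, ptrace_scope, uid):
    """True when all four conditions from the list above hold."""
    masks = parse_cap_masks(status_text)
    return (
        not has_cap(masks.get("CapEff", 0))
        and not has_cap(masks.get("CapBnd", 0))
        and ptrace_scope >= 1
        and uid != 0
    )
```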
Pattern A — Ephemeral debug container with --profile=general
kubectl debug injects a sidecar into a running pod. With
--profile=general, recent kubectl versions add the capabilities and
shared namespaces needed for ptrace tools. Permission to use this
requires pods/ephemeralcontainers in the namespace RBAC and a
PodSecurityStandard that allows the elevated profile.
If kubectl debug is denied with cannot create resource ... in
namespace, jump to pattern B.
Pattern B — Standalone diagnostic pod pinned to the same node
When ephemeral containers are blocked, deploy a one-off privileged pod on the same node as the target API pod, sharing the host's PID namespace. The debug pod sees the API worker processes and can ptrace them.
Apply, then exec in and run the tool:
Things to watch:
- Creating a pod with hostPID: true and SYS_PTRACE typically requires the privileged Pod Security Standard on the namespace, because the baseline level rejects host namespaces. If denied, work with the cluster admin to apply a temporary Pod Security exemption.
- Pinning to nodeName is required so the pod sees the API worker in its (host) PID namespace.
- Keep the debug pod's lifetime short: delete it as soon as the dump is captured.
Pattern C — Ptrace-free alternative inside the regular pod
When neither pattern A nor B is available, you can still inspect Python state from inside the regular waldur pod by sending the process a signal it has installed a handler for, or by attaching via Django's shell to introspect threading and module state. This won't give you a stack of a currently blocked gunicorn worker — for that you need ptrace. But it covers many "what's the app doing right now" questions:
For a more comprehensive picture without ptrace, write a long-running
diagnostic that the worker imports voluntarily (a debug middleware
that periodically dumps sys._current_frames() to a file when an
env var is set). Outside the scope of this guide.
Syscalls to look for in py-spy / strace output
Once you have a stack from any of the patterns above:
- Repeated recvfrom(<fd> ... <unfinished ...>: worker is parked in a blocking read (broker confirm, HTTP upstream, slow DB)
- Repeated poll([...], timeout=…): async event loop waiting
- futex(... FUTEX_WAIT …): Python GIL or threading lock contention
- py-spy Python frame output names the exact file/function/line in the worker's Python stack, which is typically the fastest path to "what's slow"
Worker memory and file descriptor usage
Memory leaks and FD leaks are common silent killers. Both grow until
the worker is OOM-killed by the kernel or hits RLIMIT_NOFILE.
If RSS grows monotonically with age and resets when the worker recycles,
the app has a leak. max_requests (with jitter) is the configuration
workaround; the real fix is finding and closing the leak.
If FD count grows with age, sockets or files are being opened without
being closed. ls -l /proc/$pid/fd shows what types of FDs they are.
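Both figures can be sampled from procfs without extra tooling. A minimal sketch with hypothetical function names; reading another process's fd directory requires running as the same user or as root.

```python
import os
from pathlib import Path


def rss_kib(status_text):
    """Return the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None  # kernel threads and zombies report no VmRSS


def snapshot(pid, proc=Path("/proc")):
    """Sample resident memory and open-FD count for one worker process."""
    status = (proc / str(pid) / "status").read_text()
    fd_count = len(os.listdir(proc / str(pid) / "fd"))
    return {"pid": pid, "rss_kib": rss_kib(status), "fd_count": fd_count}
```

Taking a snapshot of each worker every few minutes and plotting the values against worker age shows whether either figure grows monotonically.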
preload_app and post-fork state
When preload_app = True, the master imports the Django/Celery app
once, then forks workers. Module-level state survives the fork and is
shared across workers via COW pages. This is good for memory (shared
code) but dangerous for:
- Open sockets (broker connections, DB connections) — multiple workers may inadvertently share a socket FD and corrupt the protocol
- RNG state — workers start with identical seeds unless re-seeded after fork
- Module-global mutable singletons (caches, connection pools)
Kombu and most DB drivers detect os.fork() and reset state, but only
if their register_at_fork hook was actually registered before the
fork. With preload_app = False (the default in waldur), each worker
imports the app independently after fork — no shared state, but
slightly slower boot. This is the safer baseline; deviate only with
specific reason.
Worker-class implications for blocking I/O
| Worker class | Concurrency | Blocking I/O is OK? |
|---|---|---|
| sync (default) | 1 request/worker | Yes: each blocking call blocks one worker |
| gthread | N threads/worker | Yes: Python releases the GIL on I/O |
| gevent / eventlet | hundreds via monkey-patching | No: must use async-aware libraries everywhere |
If the deployment uses gevent or eventlet, any synchronous
network library that wasn't monkey-patched at startup will block the
entire worker's event loop, not just one greenlet. Common culprits:
psycopg, kombu pre-monkey-patch, native extensions. Always pair these
worker classes with monkey.patch_all() very early in app boot.
sync is the safest baseline. waldur ships with sync.
Signal flow when a worker times out
- Worker is parked > timeout seconds since last notify() to master
- Master sends SIGABRT to the worker (NOT SIGKILL)
- The SIGABRT handler that gunicorn installs in the worker calls sys.exit(1) → Sentry captures SystemExit: 1 with the current stack
- Kernel cleans up worker's open sockets: RST sent to peers, hence the broker logs "client unexpectedly closed TCP connection"
- Master forks a replacement worker
The "SystemExit" type in Sentry is the giveaway that gunicorn killed the worker; if it were the application itself, you'd see a real exception class.
Raising the gunicorn timeout only delays the kill — it does not
fix the underlying stuck request. The right fix is not having
requests that block longer than the timeout; see the broker /
Celery section above for one concrete class of fix.
Emulating a Celery publisher / worker in a side pod
Many of the broker tests in the earlier section (publish latency probe, heartbeat verification, socket option introspection) can be run from the production API pod, but you don't want to risk side effects on real traffic. A safer pattern is to deploy a short-lived "emulator" pod that uses the same image and broker credentials as the production API but is isolated from the deployment. Run experiments there, delete when done.
Standalone publisher / worker emulator
This pod mirrors a real API pod's broker config (image, env vars, ConfigMaps, Secrets) but does not register as part of the service. Useful for:
- Measuring broker publish latency without polluting Sentry
- A/B testing new BROKER_TRANSPORT_OPTIONS before rolling them out
- Verifying that a settings change actually flows through to kombu and the kernel socket (see the "Verifying a configuration change end-to-end" recipe in the broker section)
- Running the test-driver under py-spy or strace (the emulator pod can be deployed with CAP_SYS_PTRACE)
Exec in and run experiments. The emulator does not consume requests or scheduled tasks (no celery worker is started), so it has no visible side effect on the running deployment:
Adding a worker arm to the emulator
If you want to also test the consume side (e.g., verify the new heartbeat survives idle periods on a long-lived consumer connection), extend the emulator with a second container that runs a celery worker against a sandboxed queue. The worker consumes only from a temporary queue you created for the test, so it never picks up real tasks:
Then call tasks.evaluate_policies_async.apply_async(args=[...],
queue='debug-temp') to route a test task to the emulator's worker
specifically. The real workers will not see it.
Cross-references for gunicorn diagnostics
- Troubleshooting broker / Celery worker timeouts — the most common cause of gunicorn worker timeouts in waldur
- API server problems — higher-level HTTP-side debugging
Debug mode activation
Production caveat
Debug mode should ONLY be used in development environments.
For development deployments, you can enable debug mode:
For production troubleshooting, use targeted logging instead of enabling full debug mode.