Hi @stesen-wefunder! Thanks for reaching out!
These are some great questions -- exactly the kinds that I'd hoped this library's documentation could prompt! Here's my attempt to provide a few answers:
> Are all these built-in jobs already idempotent? Is there something that needs to be done to make them idempotent?
ActiveStorage jobs should be idempotent already, insofar as `AnalyzeJob`s and variant processing should produce deterministic (or at least functionally equivalent) results every time. ActionMailer jobs, on the other hand, don't seem to go out of their way to stamp any kind of message ID eagerly into the serialized job data, which likely means that they are not idempotent (unless an underlying delivery API gem does something particularly fancy involving patching the `deliver_later` method, but I'm not aware of any gem that does this).
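To illustrate what stamping an ID eagerly could buy you, here's a plain-Ruby sketch (the helper and domain are hypothetical, not part of ActionMailer or `delayed`): deriving a deterministic Message-ID from the job's arguments means a double-delivered email could at least be deduplicated downstream, or by a delivery API that honors idempotency keys.

```ruby
require "digest"

# Hypothetical helper: derive a stable Message-ID from the mailer, action,
# and arguments, so two deliveries of the same logical email carry the
# same identifier and can be deduplicated downstream.
def deterministic_message_id(mailer:, action:, args:)
  digest = Digest::SHA256.hexdigest([mailer, action, *args].join("|"))
  "<#{digest}@jobs.example.com>"
end

first  = deterministic_message_id(mailer: "ReceiptMailer", action: "receipt", args: [42])
second = deterministic_message_id(mailer: "ReceiptMailer", action: "receipt", args: [42])
first == second # => true: a retried send produces the same Message-ID
```

The key property is that the ID depends only on the job's logical identity, not on when or where it runs, so a retry regenerates the same value.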
> Or can `delayed` not be used with these jobs?
I wouldn't say that by any stretch. The real-world impact of an errant double-send is likely very low in practice, and particularly for emails (but for many jobs in general), at-least-once delivery is still far preferable to at-most-once delivery, even without any idempotency guarantees. So I'd argue that you still should generally use `delayed` with these kinds of jobs.
In the rare case where idempotency is not possible and you'd rather intervene than chance a double-send, you could set `max_attempts` to `1`. (Erroring where `attempts == max_attempts` should leave a failed job row in the table until a human intervenes, but see below for caveats.)
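For instance, a minimal sketch; this assumes `delayed` honors a per-job `max_attempts` method the way `delayed_job` does, so double-check against the docs for the version you're running (the job class itself is hypothetical):

```ruby
# Hypothetical non-idempotent job that should never be retried automatically.
class SendWireTransferJob < ApplicationJob
  self.queue_adapter = :delayed

  # Cap attempts at 1: a failure leaves a failed row in the jobs table
  # for a human to inspect, rather than risking an automatic double-send.
  def max_attempts
    1
  end

  def perform(transfer_id)
    # ...side-effectful work that cannot be made idempotent...
  end
end
```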
> Under what circumstances do successful jobs run twice? And what's the time horizon on this? Can a job only be picked up a second time before it's completed? Or is there a window of time after some worker has fully completed the job that it can be picked up again? (This question is motivated by the idea of using Redis as an external locking mechanism to aid with idempotency with e.g., jobs that make external service calls. If a job releases the lock in Redis in `after_perform`, would there still be a race condition?)
So, to unpack this, there are a couple of things worth knowing.

Firstly, the worker's pickup query does safely lock jobs, so you shouldn't need to use any kind of external locking mechanism to prevent double execution. (Plus, I wouldn't expect an external datastore to provide meaningful sequencing guarantees beyond what `delayed` can already provide with its own DB connection, which benefits from all of the database's desirable ACID properties.)
Secondly, jobs have a maximum runtime, after which they will be forcibly stopped. Therefore, jobs that have been locked for longer than the maximum runtime (plus a hardcoded 30 second buffer) can be assumed to either be in a terminal "failed" state (`attempts == max_attempts`) or to have timed out and become available again for pickup. This results in the very small possibility that a job could succeed but then fail to dequeue properly, resulting in another worker eventually picking it up again. (For example, it's possible for a worker to completely crash in between job completion and dequeue.)[^1]
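To make that window concrete, here's a plain-Ruby sketch of the reclaim condition (the constant names are illustrative, not `delayed` internals; the 20-minute figure matches the recommended timeout discussed later in this thread):

```ruby
# Illustrative model of when a locked job becomes eligible for pickup again.
MAX_RUN_TIME = 20 * 60 # recommended 20-minute timeout, in seconds
BUFFER       = 30      # the hardcoded 30-second buffer mentioned above

# A job locked longer ago than max_run_time + buffer is assumed to have
# timed out (or terminally failed) and may be picked up by another worker.
def reclaimable?(locked_at, now: Time.now)
  (now - locked_at) > (MAX_RUN_TIME + BUFFER)
end

now = Time.now
reclaimable?(now - (5 * 60),  now: now) # => false (locked 5 minutes ago)
reclaimable?(now - (21 * 60), now: now) # => true  (past the 20m30s window)
```

The double-execution window is therefore bounded: a completed-but-not-dequeued job only becomes visible to another worker after the full timeout-plus-buffer has elapsed.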
Because of this dequeue issue (i.e. the job retry/cleanup process), there is an extremely small chance that a job may be attempted twice, even if its `max_attempts` is `1`, though in practice I've never seen this edge case occur. Nonetheless, tightening up the dequeue guarantees is something I'd like to dig deeper into for a future iteration. In the meantime, I'd suggest keeping an eye on job timeouts (bubbled out as `Delayed::WorkerTimeout` exceptions) and on worker crashes in general (which we monitor via our deployment infrastructure), as these are the likeliest sources of undesired double-sends for non-erroring jobs.
[^1]: As far as I'm aware, the only backend that makes an attempt to solve for this is `que`, with its `destroy` API that you must call from within the job's `run` method (allowing the dequeue to be co-transactional with the job's business logic). However, this API is not available if you use `que` as an ActiveJob backend, and I'm not aware of an ActiveJob-compatible workaround.
Thanks for the detailed response! :D
So it sounds like a job completing twice is actually an exceptionally rare case? More of a "be aware there's a chance this could happen" than an "expect and prepare for this to happen regularly"?
I had kind of misinterpreted things as a need for idempotency being the trade-off made to get better performance out of `delayed_job`. Like some race condition had been intentionally accepted.
So in practice, would you say most users could drop `delayed` in as a replacement for `delayed_job` and not have too many worries about this? I was preparing for a big effort to go through the whole codebase and try to add some sort of idempotency guard over hundreds of different jobs before we could make the switch (thus trying to look for easy guards like using Redis or perhaps Postgres advisory locks).
P.S., How did you make a footnote in a GitHub comment?? 🤯
> More of a "be aware there's a chance this could happen" than an "expect and prepare for this to happen regularly"?
Yes, that's a good way of putting it. Though I'd argue that similar idempotency/retryability strategies will apply if your job raises an exception somewhere in the middle (especially if there are side effects external to the local DB that cannot be rolled back), which may be a more regular occurrence.
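If you do want a guard for jobs with external side effects, the usual shape is an idempotency key that is checked and recorded before the side effect runs. Here's a toy plain-Ruby sketch (all names are hypothetical; in production the `Set` would be a DB table with a unique index, so the check-and-insert is atomic and survives process crashes):

```ruby
require "set"

# Toy idempotency guard: runs a block at most once per key.
class IdempotencyGuard
  def initialize
    @performed = Set.new # stand-in for a DB table with a unique index
  end

  def perform_once(key)
    # Set#add? returns nil when the key was already present.
    return :skipped unless @performed.add?(key)
    yield
    :performed
  end
end

guard = IdempotencyGuard.new
sends = 0
guard.perform_once("receipt-email:42") { sends += 1 } # => :performed
guard.perform_once("receipt-email:42") { sends += 1 } # => :skipped
sends # => 1
```

Note the trade-off this toy version glosses over: if the process crashes between recording the key and running the block, the side effect is skipped forever. Real implementations record completion transactionally alongside the work, which is exactly the co-transactional dequeue idea from the footnote above.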
> So in practice, would you say most users could drop `delayed` in as a replacement for `delayed_job` and not have too many worries about this?
Yes, that was very much the intent! While the two libraries have drifted a bit in terms of functionality and querying strategy, the default resiliency of `delayed` is in many ways just a more opinionated set of default configs (e.g. not deleting failed jobs by default).
> P.S., How did you make a footnote in a GitHub comment?? 🤯
Instructions here. It's one of my favorite features that they've added in the last couple years! 😄
Wonderful! Thanks for all the info! I'm moving forward with a renewed sense of ease, hahaha.
(Sorry for going quiet for a bit there, haha. I do have one other unrelated question, but I'll open another thread for that.)
I'll let you decide whether to leave this issue open for others to come across, or as a reminder to add to the docs, or you can just close it. Whatever works for you. :)
This was a very helpful conversation. I'm also considering switching from `delayed_job` to `delayed`, but concerned about the need for idempotency with `delayed`.

A few follow-up questions to dig into this deeper @stesen-wefunder ...

- Does `delayed_job` also have an at-least-once delivery guarantee?
- More importantly, does `delayed_job` also have the behaviour that under certain conditions a job may run more than once, and if so, are these conditions the same ones that `delayed` has? I'm mostly concerned about the possibility of introducing new edge-case behaviour wherein a job can finish successfully more than once. Your post indicates that this behaviour can occur with the `delayed` gem. Can this behaviour occur with `delayed_job` too? Your footnote above seems to have spoken to this question/concern, but I want to ensure I understood correctly that the behaviour can occur with `delayed_job` too.
- If this behaviour can occur with `delayed_job` too, is it just as likely to occur with `delayed` as it is with `delayed_job`? In other words, with `delayed`, is it as common to have jobs finish successfully more than once as with `delayed_job`?
Jobs being worked more than once is possible in the default configurations of both `delayed_job` and `delayed`. `delayed`'s opinionated worker draining should in fact reduce the incidence of jobs being worked multiple times vs. `delayed_job`, assuming that jobs never take longer than the recommended 20-minute timeout to complete.
Thanks @jmileham ! How about if jobs take longer than 20 minutes? We have jobs that take many hours to run, and it would be a very large amount of work to decompose them into smaller jobs.
You can tune your timeout settings, but `delayed`'s default tunings are designed to:

- support frequent deploys (including our rule to have no more than two versions of the app "live" at the same time during deployment)
- resume work with minimal waste/latency impact

which leads us to recommend decomposing work into short-running jobs and setting a short timeout. If you're okay leaving your old workers to drain for longer than your longest-running job, that could work, but you'll be subject to all the usual challenges if a worker dies for any other reason.
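For reference, a tuning sketch (the setting name is carried over from `delayed_job`'s configuration; verify it against the version of `delayed` you're running, and the values are illustrative only):

```ruby
# config/initializers/delayed.rb -- illustrative values, not recommendations
# Raise the per-job ceiling to cover your longest-running job...
Delayed::Worker.max_run_time = 6.hours
# ...remembering that deploys must then drain old workers for at least
# that long, or long-running jobs will be killed and later retried.
```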