Comments (37)
Just for reference, since you mentioned in a previous issue that you were new to elixir: http://erlang.org/doc/apps/erts/time_correction.html
:)
from honeydew.
Hey Martin,
Can I see your Honeydew.start_queue/2
?
from honeydew.
Here it is:
Honeydew.start_queue(login,
queue: {Honeydew.Queue.Mnesia, [disc_copies: nodes]},
failure_mode: {Honeydew.FailureMode.ExponentialRetry, times: 15, base: 2}
)
from honeydew.
I've just changed the times
param to 15 and restarted the server. It was 10 prior to that.
from honeydew.
Huh, that's quite strange. It's either that the job is delayed by a large amount of time for some reason, or the queue's :finally
option didn't abandon the job correctly.
Can you show me the result of:
:mnesia.activity(:async_dirty, fn ->
:mnesia.foldl(fn wrapped_job_record, list ->
[wrapped_job_record | list]
end, [], :"honeydew_\"my_queue\"")
end)
That should show us when the job is meant to run next, it includes the delay information.
from honeydew.
Oh, and also, System.system_time(:millisecond)
when you run the above, please.
from honeydew.
@koudelka, The job got processed at 17h47. About 3hrs after the last retry. So I can't give you more information since the command that you ask me to run was empty. I am planning to test my app on production Monday, I will probably get the issue. Thus I will be able to give you more information.
from honeydew.
I am running the app on my dev computer and I am having the issue. When I run the mnesia.foldl function I get the following
[
{:wrapped_job, {-576413803, -576460752303419643},
%Honeydew.Job{
by: nil,
completed_at: nil,
delay_secs: 8191,
enqueued_at: 1581829711489,
failure_private: 14,
from: nil,
job_monitor: nil,
private: -576460752303419643,
queue: "my_queue",
result: nil,
started_at: nil,
task: {"send_msg",
[
%Smsc.Mt.Payload{
delivery_method: nil,
delivery_url: nil,
dest: nil,
login: nil,
msg: nil,
msg_id: nil,
orig: nil,
smsc_id: nil
}
]}
}},
{:wrapped_job, {-576415451, -576460752303419515},
%Honeydew.Job{
by: nil,
completed_at: nil,
delay_secs: 4095,
enqueued_at: 1581829707732,
failure_private: 13,
from: nil,
job_monitor: nil,
private: -576460752303419515,
queue: "my_queue",
result: nil,
started_at: nil,
task: {"send_msg",
[
%Smsc.Mt.Payload{
delivery_method: nil,
delivery_url: nil,
dest: nil,
login: nil,
msg: nil,
msg_id: nil,
orig: nil,
smsc_id: nil
}
]}
}},
{:wrapped_job, {-576418152, -576460752303420214},
%Honeydew.Job{
by: nil,
completed_at: nil,
delay_secs: 4095,
enqueued_at: 1581829872436,
failure_private: 13,
from: nil,
job_monitor: nil,
private: -576460752303420214,
queue: "my_queue",
result: nil,
started_at: nil,
task: {"send_msg",
[
%Smsc.Mt.Payload{
delivery_method: nil,
delivery_url: nil,
dest: nil,
login: nil,
msg: nil,
msg_id: nil,
orig: nil,
smsc_id: nil
}
]}
}},
{:wrapped_job, {-576418152, -576460752303421375},
%Honeydew.Job{
by: nil,
completed_at: nil,
delay_secs: 4095,
enqueued_at: 1581829868556,
failure_private: 13,
from: nil,
job_monitor: nil,
private: -576460752303421375,
queue: "my_queue",
result: nil,
started_at: nil,
task: {"send_msg",
[
%Smsc.Mt.Payload{
delivery_method: nil,
delivery_url: nil,
dest: nil,
login: nil,
msg: nil,
msg_id: nil,
orig: nil,
smsc_id: nil
}
]}
}},
{:wrapped_job, {-576419177, -576460752303419579},
%Honeydew.Job{
by: nil,
completed_at: nil,
delay_secs: 2047,
enqueued_at: 1581829714651,
failure_private: 12,
from: nil,
job_monitor: nil,
private: -576460752303419579,
queue: "my_queue",
result: nil,
started_at: nil,
task: {"send_msg",
[
%Smsc.Mt.Payload{
delivery_method: nil,
delivery_url: nil,
dest: nil,
login: nil,
msg: nil,
msg_id: nil,
orig: nil,
smsc_id: nil
}
]}
}}
]
from honeydew.
I think that if the Job cannot be processed at the time specified it will never be processed.
I have configured a delayed job that would occur in 1 min. Right after I've shutdown the application for 2 mins. the job never got picked up, so it stayed in the queue.
Maybe my problem is related to this. My theory is that a job that failed was postponed to a later time. At that later time, the workers were busy thus the failed job skipped the time it was supposed to be processed, thus it never got processed.
from honeydew.
@koudelka, I still have few kinds of stuck job issues. I am just wondering, in what order jobs are processed. Does the system processes them in the enqueued_at + delay_in_secs
order ?
How can I list the retried jobs ?
from honeydew.
I have 18 jobs that are stuck in the queue. I don't understand why because the queue is still processing jobs even if I have only one worker configured. So it seems that it's not related to the worker.
from honeydew.
I am pretty sure that the issue was caused by a shutdown.
from honeydew.
Are you able to create a stand-alone reproduction of the issue?
from honeydew.
I can't understand how the delay works. If I run the following code in the iex console:
defmodule Worker do
@behaviour Honeydew.Worker
def hello(thing) do
IO.puts("Processing #{thing}")
:timer.sleep(5000)
IO.puts("Completed #{thing |> inspect}!")
end
end
defmodule App do
def start do
nodes = [node()]
:ok =
Honeydew.start_queue(:my_queue,
queue: {Honeydew.Queue.Mnesia, [disc_copies: nodes]},
failure_mode: {Honeydew.FailureMode.ExponentialRetry, times: 15, base: 2}
)
:ok = Honeydew.start_workers(:my_queue, Worker, num: 1)
end
end
App.start()
{:hello, [1]} |> Honeydew.async(:my_queue, delay_secs: 30)
{:hello, [2]} |> Honeydew.async(:my_queue, delay_secs: 40)
{:hello, [3]} |> Honeydew.async(:my_queue, delay_secs: 50)
And I shut down the the app right after the last command then I restart it 2 mins after, it takes a while before the jobs gets processed.
from honeydew.
When you say that you "shut down the app", do you mean that you terminated the entire BEAM, or you shut down the honeydew application?
from honeydew.
The entire beam, ctrl-c twice in the iex console.
from honeydew.
Honeydew uses the BEAM's monotonic clock (an arbitrary time since the OS process was started), it doesn't use the wall clock time, since that's prone to skews. The monotonic clock keeps ticking when the BEAM timewarps, and hence should keeps delays working properly while the system is running, rather than optimizing for the case that the BEAM is restarting often.
I believe you're seeing this behaviour because the monotonic clock value + the delay value is persisted in mnesia and then read again when the system restarts (even though that clock value is meaningless).
I'm going to dig in a bit more and get back you.
from honeydew.
I've just pushed a fix to the reset-mnesia-run-at
branch, want to give that a spin and let me know how it goes?
from honeydew.
Thanks @koudelka, now the job execution are repeatable however it seems that the jobs delay are now counted from the time the application starts.
I am not sure that it's the best solution since in my case I can have some retry that have 32738 secs delay. In those cases, if I shutdown the application 1 second before the end of that delay and I restart the app, the job will only gets executed only after another 32738 seconds.
Does this makes sense to you ?
from honeydew.
Yep, that makes sense, thanks for pointing that out.
I've just pushed a commit to reset their delay times. Give 'er a spin. :)
from honeydew.
If I run the following in the console:
enqueue = fn data, delay ->
opts =
case delay do
0 -> []
a -> [delay_secs: a]
end
{:hello, ["#{data |> inspect} enqueued at #{NaiveDateTime.local_now()}, delay: #{delay}"]}
|> Honeydew.async(:my_queue, opts)
end
1 |> enqueue.(30)
2 |> enqueue.(40)
3 |> enqueue.(50)
And I restart iex. I get the following
Application started at ~N[2020-02-20 16:21:12], queue count 3 before starting the workers
Worker started
Processing "3 enqueued at 2020-02-20 16:20:57, delay: 50", processing at: ~N[2020-02-20 16:21:12]
Completed "3 enqueued at 2020-02-20 16:20:57, delay: 50"!
Processing "1 enqueued at 2020-02-20 16:20:57, delay: 30", processing at: ~N[2020-02-20 16:21:17]
Completed "1 enqueued at 2020-02-20 16:20:57, delay: 30"!
Processing "2 enqueued at 2020-02-20 16:20:57, delay: 40", processing at: ~N[2020-02-20 16:21:22]
Completed "2 enqueued at 2020-02-20 16:20:57, delay: 40"!
If you look at Job 3
, it was started at 16:21:17
. I think it should to be run at 16:21:47
.
Note that have only one worker.
from honeydew.
That looks about right to me. Since Honeydew isn't using wall clock time, and only the monotonic clock, it has no concept of time between BEAM instances.
That means that Honeydew doesn't know how much time has passed between when the BEAM was shut down, and when it was started again, so it can't know if the delay interval has already lapsed, so I just assumes that all jobs need to be run.
Does that make sense?
from honeydew.
I wondering if can we use the enqueued_at
and delay_secs
field to recalculate the remaining time. If the time has lapsed, we can put 0 if not we can recalculate the delta. There's probably something that I don't understand. I need to read more about the monotonic clock.
from honeydew.
Yeah check out the docs I linked above, especially the bit about monotonic clock freezes and absolute time warps.
The choice here is relative time (since the BEAM started) vs absolute wall clock time.
My thinking is that the queue is spending the majority of its time operational, rather than the majority of the time switched off, so that's the case I'd prefer to optimize for.
-
With absolute wall clock time the worst case scenario is that a machine's clock leaps far into the future, the user enqueues a job, and then the clock corrects back. The job wouldn't run until it reaches that point in time, which would lead to a large service degradation.
-
With the relative monotonic clock, the worst case scenario is that a delayed job runs too soon after a queue restart.
From my perspective, in the pathological worst case, I'd much rather run a job too soon than not at all. The former is sub-optimal (job ran too early), the latter is a constraint violation (job will never run).
from honeydew.
If you do something like:
def reset_run_at(
%__MODULE__{job: %Job{delay_secs: delay_secs, enqueued_at: enqueued_at}} = wrapped_job
) do
elapsed = System.system_time(:millisecond) - (enqueued_at + delay_sec * 1_000)
new_delay = if elapsed >= 0, do: elapsed, else: 0
run_at = now() + delay_secs
%__MODULE__{wrapped_job | run_at: run_at}
end
It would still use the relative monotonic clock for operational purpose but use the OS clock just for boot time when recalculating the run_at
field. I think that this is reasonable since this calculation occurs only at boot time. I understand that the OS clock can be adjusted by NTP or by other time drift when the application is down, but not very likely to cause an issue (I think). It will only impact delayed jobs that survives a reboot.
from honeydew.
I'd much prefer to just run delayed jobs immediately when the application starts, so honeydew can guarantee that a job will run regardless of bad behaviour by the system clock. If the user would prefer to rely on the system clock, they can easily write that into their jobs.
from honeydew.
It's not exactly what I wished because sometimes I have hundreds of retries. When the app boots , the jobs will all be retried at the same time. If they fail a second time (this happens in my case), some of them might still run at the same time thus interrupting the normal flow.
Personally I think that clock don't get that much out of sync and issues with the clock are due to time adjust with NTP, which if it occurs during a shutdown won't affect much the delivery of jobs after a boot. The big problem with the wall clock is mostly when the app is running.
Also, It's confusing for newcomers to see that jobs are not executed at enqueued_at + delay_secs
.
from honeydew.
I definitely hear you, I don't think that either end of the spectrum is ideal for all use cases. I do suspect that massively damaging NTP skews should be rare, but I would also like to provide a guarantee rather than a probability.
That being said, what do you think about allowing the user to choose if they want to use the monotonic clock or the wall clock for their queue? That way we can leave it up to the user to decide if the risk tradeoff makes sense for their use case.
from honeydew.
I do like your hybrid approach, too. Using the monotonic clock during normal operations, but trusting the system clock between beam restarts. That'd keep the complexity for the user to a minimum too.
from honeydew.
I started using Honeydew because it was a queue that had persistency. I wanted an easy way to restart the application at any time for continuous deployment. But, every time that I deployed, all failed jobs were retried at the same time on boot. This was making the queue unusable in my case since hundreds of jobs were restarted at the same time this causing issues with my clients.
It was also surprising to me since I was not expecting that behavior at all.
The other point that I would like to make is that most Elixir systems are running 24/7, which is a good point for using a monotonic clock. If we get a time change (from standard time to daylight time) the scheduled job will be run at the proper time.
The time between restart is mostly short thus making the likelihood of time shift of wall clock less likely to occur. Also if this occurs during time change (from daylight time to normal time) the person will probably make the link between the events. Even then, the scheduled event will be executed anyways maybe an hour too soon or too late but it will be run.
I have made a fork of the project using the wall clock for recalculating the run_at field on restart and everything plays well for now and I have redeployed at least 20 times since then without any issue. This solution works well for me and I think it would be a good solution because people won't need to know about monotonic and wallclock (which was my case before using this project).
Thus I think it would be better to use the monotonic clock for job scheduling but using the wall clock on restart.
from honeydew.
Hi,
What is the status of this issue? I run version 1.5.0 and I still have this problem. I have a Mnesia Queue and when I shutdown the application (^c^c in iex) with entries in the queue, they remain in the queue when I restart the application. New entries are processed immediately, but the ones that came from disk are not until after some random time. At a sudden moment (after more than 30 minutes) they get processed, but I have no clue what triggers them to get processed finally.
Is there perhaps a way to instruct the queue to force the enqueued entries?
UPDATE: after applying the changes from reset-mnesia-run-at the jobs from disk get processed after the queue is resumed. Maybe someone can make a new release so this can be used by everyone?
from honeydew.
@heinsaris. I have created a pull request that fixes the issues that I was having. Note that if you want to use this branch [email protected]:ISPT/honeydew.git, you'll need to drain your queues before shutdown since I have modified the mnesia job schema. The approach of branch (reset-mnesia-run-at) does not estimate correctly the time at which the original Job was scheduled at.
from honeydew.
@martinos Thanks for your reply. I have forked the master branch and applied the changes on the reset-amnesia-run-at branch, which seems to work for my case. I don’t need delays or scheduled jobs. I use the queue to send packets to a mqtt broker. As soon as the connection to the broker disappears, I suspend the queue, and I “unsuspend" it when the connection is back. I need all packets to arrive on the broker in order, so I only use one worker.
You can find my fork here: github.com/heinsaris/honeydew
from honeydew.
@heinsaris, I think my patch is irrelevant to your case, but in the case that we have exponential retries, the reset-mnesia-run-at was not working as expected.
When a retry is made, a delay is added to the job. When the application restarts (with that branch), that delay is added to the current time, thus moving the execution to a later time. If you have multiple jobs with the same delay (that failed the same number of times), they will be run simultaneously, which in my case, made the workers run tons of failed jobs simultaneously. Which, in return, were basting my customer's APIs.
My branch is recalculating the original retry time on restart. That change spreads the execution of the job in a more natural way.
from honeydew.
@martinos Thanks for the clarification. I will have a look at your solution but as you said it is not relevant in my case. I suspend the queue as soon as the connection to the mqtt broker drops, and normally my app never shuts down. It is only in the rare case that something goes wrong and the app stops while the mqtt connection is down that I need the packets saved on disk. But in normal circumstances the queue is always empty.
from honeydew.
I have an issue with a Flow/stream and a hanging job. Like the symptoms above... I have no returns with :mnesia.activity/2 calls. I can see the worker with the job in a pool of 16 workers but cannot actually even use filter to surface any job like @martinos had demonstrated in the first place. It's simply showing up in the status output as the only job :in_progress. I even ran the job isolated and still has the issue. My hope is that setting the +C multi_time_warp
in our vm.args file for Distillery will solve the issue. I will report my findings today.
from honeydew.
UPDATE: What I have found is that during a refactor I missed moving a File.open/2
call to the Flow.reduce/3
start_fun call. So in my case this just means I didn't realize how many "empty file events" were occurring with this new "reporting date cutoff". So, post-refactor I can see that inotifywait
(now that I open/close file on each chunk I stream into it after it gets piped to our file reduction Flow) is still processing these events which is seemingly creating an I/O block somewhere in our file watch... I think it's just a bi-product of using Honeydew.yield/2
in our file watch GenServer implementation. Hmmmm.... I really appreciate you @koudelka for spending time on this issue as it has created transparency to my own.
from honeydew.
Related Issues (20)
- Clear an entire queue HOT 1
- Disabling clock freeze warning in test environment HOT 1
- Control retry time within job HOT 2
- Support Ecto.Adapters.MyXQL for MySQL HOT 6
- Mnesia as default queue not mentioned in start_queue/2 documentation HOT 1
- Custom queue makes Honeydew crashs by timeout HOT 4
- warning: function equal?/2 required by behaviour Ecto.Type
- How to rerun jobs if the system crashes? HOT 1
- pg2 scheduled for removal HOT 10
- How to cancel Jobs in Ecto Queue? HOT 1
- Doc links to code broken HOT 2
- Creates a behaviour for `Honeydew.Logger` module
- Unexpected behavior: bad_return application start failure using Mnesia outside of Honeydew HOT 5
- no match of right hand side value: [] HOT 2
- API for accessing in_progress jobs
- Redesigning API for concurrency
- Timeout error on application start HOT 1
- ErlangQueue doesn't respect delay_secs option HOT 1
- Shutdown hangs HOT 11
- Error when listing mnesia in mix extra_applications.
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from honeydew.