Comments (37)

koudelka commented on August 12, 2024

Just for reference, since you mentioned in a previous issue that you were new to elixir: http://erlang.org/doc/apps/erts/time_correction.html

:)

koudelka commented on August 12, 2024

Hey Martin,

Can I see your Honeydew.start_queue/2?

martinos commented on August 12, 2024

Here it is:

      Honeydew.start_queue(login,
        queue: {Honeydew.Queue.Mnesia, [disc_copies: nodes]},
        failure_mode: {Honeydew.FailureMode.ExponentialRetry, times: 15, base: 2}
      )
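
For context: with base: 2, the retry delays seen later in this thread (delay_secs of 2047, 4095, 8191 against failure_private of 12, 13, 14) appear to follow 2^(n-1) - 1 seconds after the n-th failure. A quick sanity check in iex — this formula is inferred from those three data points, not from documented Honeydew behaviour:

# Apparent ExponentialRetry delay schedule, inferred from the queue dump
# later in this thread; a guess, not Honeydew's documented API.
delay = fn failures, base -> round(:math.pow(base, failures - 1)) - 1 end

delay.(12, 2)  #=> 2047
delay.(13, 2)  #=> 4095
delay.(14, 2)  #=> 8191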

martinos commented on August 12, 2024

I've just changed the times param to 15 and restarted the server. It was 10 prior to that.

koudelka commented on August 12, 2024

Huh, that's quite strange. It's either that the job is delayed by a large amount of time for some reason, or the queue's :finally option didn't abandon the job correctly.

Can you show me the result of:

# Dirty-read every wrapped-job record out of the queue's mnesia table
# (the table is named after the queue, hence the escaped quotes):
:mnesia.activity(:async_dirty, fn ->
  :mnesia.foldl(fn wrapped_job_record, list ->
    [wrapped_job_record | list]
  end, [], :"honeydew_\"my_queue\"")
end)

That should show us when the job is meant to run next; it includes the delay information.

koudelka commented on August 12, 2024

Oh, and also, System.system_time(:millisecond) when you run the above, please.

martinos commented on August 12, 2024

@koudelka, the job got processed at 17h47, about 3 hrs after the last retry. So I can't give you more information, since the command you asked me to run returned nothing. I am planning to test my app in production on Monday; I will probably hit the issue again, and then I'll be able to give you more information.

martinos commented on August 12, 2024

I am running the app on my dev computer and I am having the issue. When I run the :mnesia.foldl snippet, I get the following:

[
  {:wrapped_job, {-576413803, -576460752303419643},
   %Honeydew.Job{
     by: nil,
     completed_at: nil,
     delay_secs: 8191,
     enqueued_at: 1581829711489,
     failure_private: 14,
     from: nil,
     job_monitor: nil,
     private: -576460752303419643,
     queue: "my_queue",
     result: nil,
     started_at: nil,
     task: {"send_msg",
      [
        %Smsc.Mt.Payload{
          delivery_method: nil,
          delivery_url: nil,
          dest: nil,
          login: nil,
          msg: nil,
          msg_id: nil,
          orig: nil,
          smsc_id: nil
        }
      ]}
   }},
  {:wrapped_job, {-576415451, -576460752303419515},
   %Honeydew.Job{
     by: nil,
     completed_at: nil,
     delay_secs: 4095,
     enqueued_at: 1581829707732,
     failure_private: 13,
     from: nil,
     job_monitor: nil,
     private: -576460752303419515,
     queue: "my_queue",
     result: nil,
     started_at: nil,
     task: {"send_msg",
      [
        %Smsc.Mt.Payload{
          delivery_method: nil,
          delivery_url: nil,
          dest: nil,
          login: nil,
          msg: nil,
          msg_id: nil,
          orig: nil,
          smsc_id: nil
        }
      ]}
   }},
  {:wrapped_job, {-576418152, -576460752303420214},
   %Honeydew.Job{
     by: nil,
     completed_at: nil,
     delay_secs: 4095,
     enqueued_at: 1581829872436,
     failure_private: 13,
     from: nil,
     job_monitor: nil,
     private: -576460752303420214,
     queue: "my_queue",
     result: nil,
     started_at: nil,
     task: {"send_msg",
      [
        %Smsc.Mt.Payload{
          delivery_method: nil,
          delivery_url: nil,
          dest: nil,
          login: nil,
          msg: nil,
          msg_id: nil,
          orig: nil,
          smsc_id: nil
        }
      ]}
   }},
  {:wrapped_job, {-576418152, -576460752303421375},
   %Honeydew.Job{
     by: nil,
     completed_at: nil,
     delay_secs: 4095,
     enqueued_at: 1581829868556,
     failure_private: 13,
     from: nil,
     job_monitor: nil,
     private: -576460752303421375,
     queue: "my_queue",
     result: nil,
     started_at: nil,
     task: {"send_msg",
      [
        %Smsc.Mt.Payload{
          delivery_method: nil,
          delivery_url: nil,
          dest: nil,
          login: nil,
          msg: nil,
          msg_id: nil,
          orig: nil,
          smsc_id: nil
        }
      ]}
   }},
  {:wrapped_job, {-576419177, -576460752303419579},
   %Honeydew.Job{
     by: nil,
     completed_at: nil,
     delay_secs: 2047,
     enqueued_at: 1581829714651,
     failure_private: 12,
     from: nil,
     job_monitor: nil,
     private: -576460752303419579,
     queue: "my_queue",
     result: nil,
     started_at: nil,
     task: {"send_msg",
      [
        %Smsc.Mt.Payload{
          delivery_method: nil,
          delivery_url: nil,
          dest: nil,
          login: nil,
          msg: nil,
          msg_id: nil,
          orig: nil,
          smsc_id: nil
        }
      ]}
   }}
]
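
For reference, the first element of each wrapped-job key appears to be the monotonic run_at in seconds. Assuming that interpretation is correct, it can be converted to wall-clock time — only within the same BEAM instance that produced it — roughly like this:

# Convert a monotonic run_at (seconds) to wall-clock time; the time offset
# is only valid in the BEAM instance that wrote the value.
run_at = -576413803
DateTime.from_unix!(run_at + System.time_offset(:second))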

martinos commented on August 12, 2024

I think that if a job cannot be processed at the time specified, it will never be processed.

I configured a delayed job that would run in 1 min, then shut the application down for 2 mins right after. The job never got picked up, so it stayed in the queue.

Maybe my problem is related to this. My theory is that a job that failed was postponed to a later time; at that later time the workers were busy, so the failed job missed the time it was supposed to be processed at, and thus it never got processed.

martinos commented on August 12, 2024

@koudelka, I still have a few kinds of stuck-job issues. I am just wondering in what order jobs are processed. Does the system process them in enqueued_at + delay_secs order?

How can I list the retried jobs?

martinos commented on August 12, 2024

I have 18 jobs that are stuck in the queue. I don't understand why, because the queue is still processing jobs even though I have only one worker configured, so it doesn't seem to be related to the worker.

martinos commented on August 12, 2024

I am pretty sure that the issue was caused by a shutdown.

koudelka commented on August 12, 2024

Are you able to create a stand-alone reproduction of the issue?

martinos commented on August 12, 2024

I can't understand how the delay works. If I run the following code in the iex console:

defmodule Worker do
  @behaviour Honeydew.Worker

  def hello(thing) do
    IO.puts("Processing #{thing}")
    :timer.sleep(5000)
    IO.puts("Completed #{thing |> inspect}!")
  end
end

defmodule App do
  def start do
    nodes = [node()]

    :ok =
      Honeydew.start_queue(:my_queue,
        queue: {Honeydew.Queue.Mnesia, [disc_copies: nodes]},
        failure_mode: {Honeydew.FailureMode.ExponentialRetry, times: 15, base: 2}
      )

    :ok = Honeydew.start_workers(:my_queue, Worker, num: 1)
  end
end

App.start()
{:hello, [1]} |> Honeydew.async(:my_queue, delay_secs: 30)
{:hello, [2]} |> Honeydew.async(:my_queue, delay_secs: 40)
{:hello, [3]} |> Honeydew.async(:my_queue, delay_secs: 50)

Then, if I shut down the app right after the last command and restart it 2 mins later, it takes a while before the jobs get processed.

koudelka commented on August 12, 2024

When you say that you "shut down the app", do you mean that you terminated the entire BEAM, or you shut down the honeydew application?

martinos commented on August 12, 2024

The entire BEAM, ctrl-c twice in the iex console.

koudelka commented on August 12, 2024

Honeydew uses the BEAM's monotonic clock (an arbitrary time since the OS process started); it doesn't use wall-clock time, since that's prone to skews. The monotonic clock keeps ticking when the BEAM time-warps, and hence should keep delays working properly while the system is running, rather than optimizing for the case where the BEAM restarts often.

I believe you're seeing this behaviour because the monotonic clock value + the delay value is persisted in mnesia and then read again when the system restarts (even though that clock value is meaningless).
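
A minimal illustration of why a persisted monotonic timestamp is meaningless across restarts (example values only, not Honeydew's actual code):

# In the first BEAM instance:
run_at = System.monotonic_time(:second) + 30   # e.g. -576460722
# ...run_at is persisted to mnesia, then the BEAM is restarted...

# The new instance's monotonic clock starts from its own arbitrary origin,
# so comparing against the persisted value says nothing about how much
# real time has passed:
System.monotonic_time(:second) >= run_at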

I'm going to dig in a bit more and get back to you.

koudelka commented on August 12, 2024

I've just pushed a fix to the reset-mnesia-run-at branch. Want to give it a spin and let me know how it goes?

martinos commented on August 12, 2024

Thanks @koudelka, the job executions are now repeatable; however, it seems that the jobs' delays are now counted from the time the application starts.

I am not sure that's the best solution, since in my case some retries have a 32738-sec delay. In those cases, if I shut down the application 1 second before the end of that delay and restart the app, the job only gets executed after another 32738 seconds.

Does this make sense to you?

koudelka commented on August 12, 2024

Yep, that makes sense, thanks for pointing that out.

I've just pushed a commit to reset their delay times. Give 'er a spin. :)

martinos commented on August 12, 2024

If I run the following in the console:

enqueue = fn data, delay ->
  opts =
    case delay do
      0 -> []
      a -> [delay_secs: a]
    end

  {:hello, ["#{data |> inspect} enqueued at #{NaiveDateTime.local_now()}, delay: #{delay}"]}
  |> Honeydew.async(:my_queue, opts)
end

1 |> enqueue.(30)
2 |> enqueue.(40)
3 |> enqueue.(50)

Then I restart iex, and I get the following:

Application started at ~N[2020-02-20 16:21:12], queue count 3 before starting the workers

Worker started
Processing "3 enqueued at 2020-02-20 16:20:57, delay: 50", processing at: ~N[2020-02-20 16:21:12]
Completed "3 enqueued at 2020-02-20 16:20:57, delay: 50"!
Processing "1 enqueued at 2020-02-20 16:20:57, delay: 30", processing at: ~N[2020-02-20 16:21:17]
Completed "1 enqueued at 2020-02-20 16:20:57, delay: 30"!
Processing "2 enqueued at 2020-02-20 16:20:57, delay: 40", processing at: ~N[2020-02-20 16:21:22]
Completed "2 enqueued at 2020-02-20 16:20:57, delay: 40"!

If you look at job 3, it was started at 16:21:12. I think it should have run at 16:21:47.

Note that I have only one worker.

koudelka commented on August 12, 2024

That looks about right to me. Since Honeydew isn't using wall-clock time, only the monotonic clock, it has no concept of time between BEAM instances.

That means Honeydew doesn't know how much time has passed between when the BEAM was shut down and when it was started again, so it can't know whether the delay interval has already lapsed, and it just assumes that all jobs need to be run.

Does that make sense?

martinos commented on August 12, 2024

I'm wondering if we can use the enqueued_at and delay_secs fields to recalculate the remaining time. If the time has lapsed, we can use 0; if not, we can recalculate the delta. There's probably something I don't understand; I need to read more about the monotonic clock.

koudelka commented on August 12, 2024

Yeah, check out the docs I linked above, especially the bits about monotonic clock freezes and absolute time warps.

The choice here is relative time (since the BEAM started) vs absolute wall clock time.

My thinking is that the queue is spending the majority of its time operational, rather than the majority of the time switched off, so that's the case I'd prefer to optimize for.

  • With absolute wall-clock time, the worst-case scenario is that a machine's clock leaps far into the future, the user enqueues a job, and then the clock corrects back. The job wouldn't run until the clock reaches that point again, which would be a large service degradation.

  • With the relative monotonic clock, the worst case scenario is that a delayed job runs too soon after a queue restart.

From my perspective, in the pathological worst case, I'd much rather run a job too soon than not at all. The former is sub-optimal (job ran too early), the latter is a constraint violation (job will never run).

martinos commented on August 12, 2024

If you do something like:

  def reset_run_at(
        %__MODULE__{job: %Job{delay_secs: delay_secs, enqueued_at: enqueued_at}} = wrapped_job
      ) do
    # enqueued_at is wall-clock milliseconds; delay_secs is in seconds.
    elapsed = System.system_time(:millisecond) - (enqueued_at + delay_secs * 1_000)

    # Remaining delay in seconds: zero if the scheduled time has already lapsed.
    remaining_secs = if elapsed >= 0, do: 0, else: div(-elapsed, 1_000)

    run_at = now() + remaining_secs

    %__MODULE__{wrapped_job | run_at: run_at}
  end

It would still use the relative monotonic clock for operational purposes, but use the OS clock just at boot time when recalculating the run_at field. I think this is reasonable, since the calculation occurs only at boot. I understand that the OS clock can be adjusted by NTP or drift while the application is down, but that's not very likely to cause an issue (I think). It will only affect delayed jobs that survive a reboot.

koudelka commented on August 12, 2024

I'd much prefer to just run delayed jobs immediately when the application starts, so honeydew can guarantee that a job will run regardless of bad behaviour by the system clock. If the user would prefer to rely on the system clock, they can easily write that into their jobs.
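
A sketch of what "write that into their jobs" could look like: a job that checks the wall clock itself and re-enqueues until its due time. The due_at argument and do_send/1 helper are illustrative assumptions, not Honeydew API:

# Hypothetical worker function: defer the real work until a wall-clock
# due time (Unix seconds), re-enqueueing itself with the remaining delay.
def send_msg(payload, due_at) do
  now = System.system_time(:second)

  if now < due_at do
    {:send_msg, [payload, due_at]}
    |> Honeydew.async(:my_queue, delay_secs: due_at - now)
  else
    do_send(payload)  # the actual work; do_send/1 is assumed
  end
end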

martinos commented on August 12, 2024

It's not exactly what I'd hoped for, because sometimes I have hundreds of retries. When the app boots, the jobs will all be retried at the same time. If they fail a second time (this happens in my case), some of them might again run at the same time, interrupting the normal flow.

Personally, I think clocks don't get that far out of sync, and clock issues are mostly due to NTP adjustments, which, if they occur during a shutdown, won't affect the delivery of jobs much after boot. The big problem with the wall clock is mostly while the app is running.

Also, it's confusing for newcomers to see that jobs are not executed at enqueued_at + delay_secs.

koudelka commented on August 12, 2024

I definitely hear you; I don't think either end of the spectrum is ideal for all use cases. I do suspect that massively damaging NTP skews should be rare, but I would also like to provide a guarantee rather than a probability.

That being said, what do you think about allowing the user to choose if they want to use the monotonic clock or the wall clock for their queue? That way we can leave it up to the user to decide if the risk tradeoff makes sense for their use case.
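
Hypothetically, such an option might look like this (:clock is an illustrative name, not an actual Honeydew option):

# Hypothetical per-queue clock choice, for illustration only:
nodes = [node()]

Honeydew.start_queue(:my_queue,
  queue: {Honeydew.Queue.Mnesia, [disc_copies: nodes]},
  failure_mode: {Honeydew.FailureMode.ExponentialRetry, times: 15, base: 2},
  clock: :wall  # or :monotonic
)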

koudelka commented on August 12, 2024

I do like your hybrid approach, too: using the monotonic clock during normal operations, but trusting the system clock between BEAM restarts. That'd keep the complexity for the user to a minimum, too.

martinos commented on August 12, 2024

I started using Honeydew because it was a queue with persistence. I wanted an easy way to restart the application at any time for continuous deployment. But every time I deployed, all failed jobs were retried at the same time on boot. This made the queue unusable in my case, since hundreds of jobs restarting simultaneously caused issues with my clients.

It was also surprising to me, since I was not expecting that behavior at all.

The other point I'd like to make is that most Elixir systems run 24/7, which is a good argument for the monotonic clock: if we get a time change (from standard time to daylight time), scheduled jobs still run at the proper time.

The time between restarts is mostly short, which makes a wall-clock shift during one unlikely. And if a shift does occur during a time change (from daylight time to standard time), the person will probably make the link between the events. Even then, the scheduled job will be executed anyway, maybe an hour too soon or too late, but it will run.

I have made a fork of the project that uses the wall clock to recalculate the run_at field on restart, and everything has played well so far; I have redeployed at least 20 times since then without any issue. This solution works well for me, and I think it would be a good approach because people won't need to know about the monotonic and wall clocks (which was my case before using this project).

So I think it would be better to use the monotonic clock for job scheduling but the wall clock on restart.

heinsaris commented on August 12, 2024

Hi,

What is the status of this issue? I'm running version 1.5.0 and I still have this problem. I have a Mnesia queue, and when I shut down the application (^C^C in iex) with entries in the queue, they remain in the queue when I restart it. New entries are processed immediately, but the ones that came from disk aren't until after some seemingly random time. At some point (after more than 30 minutes) they finally get processed, but I have no clue what triggers it.

Is there perhaps a way to instruct the queue to force-process the enqueued entries?

UPDATE: after applying the changes from reset-mnesia-run-at, the jobs from disk get processed once the queue resumes. Maybe someone can cut a new release so everyone can use this?

martinos commented on August 12, 2024

@heinsaris, I have created a pull request that fixes the issues I was having. Note that if you want to use this branch (git@github.com:ISPT/honeydew.git), you'll need to drain your queues before shutdown, since I have modified the mnesia job schema. The reset-mnesia-run-at branch's approach does not correctly estimate the time at which the original job was scheduled.

heinsaris commented on August 12, 2024

@martinos Thanks for your reply. I have forked the master branch and applied the changes from the reset-mnesia-run-at branch, which seems to work for my case. I don't need delays or scheduled jobs. I use the queue to send packets to an MQTT broker. As soon as the connection to the broker disappears, I suspend the queue, and I "unsuspend" it when the connection is back. I need all packets to arrive at the broker in order, so I only use one worker.
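
The suspend/resume pattern described above might look roughly like this in the connection-monitoring GenServer. Honeydew.suspend/1 and Honeydew.resume/1 are Honeydew's queue-pausing calls; the :broker_* messages are assumed names for a hypothetical MQTT client, not a real client API:

# Pause the queue while the broker connection is down, resume it when
# the connection comes back.
def handle_info(:broker_disconnected, state) do
  Honeydew.suspend(:my_queue)
  {:noreply, state}
end

def handle_info(:broker_reconnected, state) do
  Honeydew.resume(:my_queue)
  {:noreply, state}
end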

You can find my fork here: github.com/heinsaris/honeydew

martinos commented on August 12, 2024

@heinsaris, I think my patch is irrelevant to your case, but when exponential retries are involved, reset-mnesia-run-at was not working as expected.

When a retry is made, a delay is added to the job. When the application restarts (with that branch), that delay is added to the current time, moving the execution to a later time. If you have multiple jobs with the same delay (that failed the same number of times), they will all run simultaneously, which in my case made the workers run tons of failed jobs at once, which in turn was blasting my customers' APIs.

My branch recalculates the original retry time on restart. That change spreads the execution of the jobs out in a more natural way.

heinsaris commented on August 12, 2024

@martinos Thanks for the clarification. I will have a look at your solution, but as you said, it is not relevant in my case. I suspend the queue as soon as the connection to the MQTT broker drops, and normally my app never shuts down. It is only in the rare case that something goes wrong and the app stops while the MQTT connection is down that I need the packets saved on disk. But under normal circumstances the queue is always empty.

cigzigwon commented on August 12, 2024

I have an issue with a Flow/stream and a hanging job. Like the symptoms above, I get no results from :mnesia.activity/2 calls. I can see the worker with the job in a pool of 16 workers, but I can't even use a filter to surface any job the way @martinos demonstrated earlier. It simply shows up in the status output as the only job :in_progress. I even ran the job in isolation and it still has the issue. My hope is that setting the +C multi_time_warp flag in our vm.args file for Distillery will solve the issue. I will report my findings today.
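
For reference, +C multi_time_warp is a standard Erlang emulator flag; in vm.args it sits on its own line:

## Handle OS clock changes by adjusting the time offset instead of
## skewing the monotonic clock
+C multi_time_warp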

cigzigwon commented on August 12, 2024

UPDATE: What I found is that during a refactor I missed moving a File.open/2 call into the Flow.reduce/3 start_fun. In my case, this just means I didn't realize how many "empty file events" were occurring with the new "reporting date cutoff". Post-refactor, I can see that inotifywait (now that I open/close the file on each chunk I stream into it after it gets piped to our file-reduction Flow) is still processing these events, which is seemingly creating an I/O block somewhere in our file watch. I think it's just a by-product of using Honeydew.yield/2 in our file-watch GenServer implementation. I really appreciate you, @koudelka, spending time on this issue, as it brought transparency to my own.
