
Comments (16)

glwagner commented on August 20, 2024

Can you put together a minimal working example that illustrates the issue?

from oceananigans.jl.

liuchihl commented on August 20, 2024

Ah, I think you are right! Thanks for pasting that.


glwagner commented on August 20, 2024

Okay, here's my MWE which includes running the first simulation to generate the checkpoint file:

using Oceananigans
using Printf

""" Set up a simple simulation to test picking up from a checkpoint. """
function test_simulation(stop_time, Δt, δt)
    grid = RectilinearGrid(size=(), topology=(Flat, Flat, Flat))
    model = NonhydrostaticModel(; grid)
    simulation = Simulation(model; Δt, stop_time)

    progress_message(sim) = @info string("Iter: ", iteration(sim), ", time: ", prettytime(sim))
    simulation.callbacks[:progress] = Callback(progress_message, TimeInterval(δt))

    checkpointer = Checkpointer(model,
                                schedule = TimeInterval(stop_time),
                                prefix = "test",
                                cleanup = false)

    simulation.output_writers[:checkpointer] = checkpointer

    return simulation
end

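# rm does not expand wildcards, so filter the directory listing instead
foreach(rm, filter(f -> startswith(f, "test_iteration") && endswith(f, ".jld2"), readdir()))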

Δt = 1    # timestep (s)
T1 = 4    # first simulation stop time (s)
T2 = 2T1  # second simulation stop time (s)
δt = 2    # progress message interval (s)

# Run a simulation that saves data to a checkpoint
simulation = test_simulation(T1, Δt, δt)
run!(simulation)

# Now try again, but picking up from the previous checkpoint
N = iteration(simulation)
checkpoint = "test_iteration$N.jld2"
simulation = test_simulation(T2, Δt, δt)
run!(simulation, pickup=checkpoint)

This reproduces the issue because I get

julia> include("test.jl")
[ Info: Initializing simulation...
[ Info: Iter: 0, time: 0 seconds
[ Info:     ... simulation initialization complete (2.697 seconds)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (5.721 seconds).
[ Info: Iter: 2, time: 2 seconds
[ Info: Simulation is stopping after running for 8.786 seconds.
[ Info: Simulation time 4 seconds equals or exceeds stop time 4 seconds.
[ Info: Iter: 4, time: 4 seconds
[ Info: Initializing simulation...
[ Info:     ... simulation initialization complete (529.973 μs)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (575.447 μs).
[ Info: Iter: 5, time: 5 seconds
[ Info: Iter: 6, time: 6 seconds
[ Info: Iter: 7, time: 7 seconds
[ Info: Simulation is stopping after running for 5.469 ms.
[ Info: Simulation time 8 seconds equals or exceeds stop time 8 seconds.
[ Info: Iter: 8, time: 8 seconds

When the second simulation runs, I think we expect to see Iter: 6, and then Iter: 8 after the simulation stops.


glwagner commented on August 20, 2024

I think the problem is basically that the schedules are not saved to the checkpoint. Actually, the outputs themselves are also not saved to the checkpoint, which is also an issue for time averages.

I believe this wasn't previously an issue but recent changes to TimeInterval in #3616 may have created the problem...

Likely we can find some simple way to fix TimeInterval but let's keep in mind that there are some broader challenges to be solved for checkpointing that basically will require a feature that can save callback and output writer states.
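
To make the diagnosis above concrete, here is a minimal sketch in plain Julia of a schedule that counts its own actuations. ToyTimeInterval, should_actuate! and firing_times are hypothetical names for illustration only; they are not the Oceananigans implementation. The sketch assumes the schedule fires once the clock reaches (actuations + 1) * interval, and that a pickup restores the clock but leaves the counter at zero.

```julia
# Toy stand-in for a counter-based time-interval schedule.
# ToyTimeInterval and should_actuate! are hypothetical, not Oceananigans API.
mutable struct ToyTimeInterval
    interval::Float64
    actuations::Int
end

# Fire when the clock has reached the next scheduled time,
# (actuations + 1) * interval, and advance the counter.
function should_actuate!(schedule::ToyTimeInterval, time)
    if time >= (schedule.actuations + 1) * schedule.interval
        schedule.actuations += 1
        return true
    end
    return false
end

# Step a clock from start_time + 1 to stop_time with a time step of 1
# and collect the times at which the schedule fires.
function firing_times(schedule, start_time, stop_time)
    fired = Float64[]
    for time in (start_time + 1):stop_time
        should_actuate!(schedule, time) && push!(fired, time)
    end
    return fired
end

fresh_run       = firing_times(ToyTimeInterval(2, 0), 0, 4)  # clock and counter both start at zero
stale_pickup    = firing_times(ToyTimeInterval(2, 0), 4, 8)  # clock restored to 4, counter left at 0
restored_pickup = firing_times(ToyTimeInterval(2, 2), 4, 8)  # counter restored along with the clock

@show fresh_run stale_pickup restored_pickup
```

Under these assumptions the stale counter fires on every step until it catches up with the clock (times 5, 6, 7, 8), which is the pattern in the log above, while the restored counter fires only at times 6 and 8.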


glwagner commented on August 20, 2024

Here's an even simpler MWE that illustrates the fundamental issue:

using Oceananigans
using Printf

grid = RectilinearGrid(size=(), topology=(Flat, Flat, Flat))
model = NonhydrostaticModel(; grid)
simulation = Simulation(model; Δt=1, stop_time=6)

progress_message(sim) = @info string("Iter: ", iteration(sim), ", time: ", prettytime(sim))
simulation.callbacks[:progress] = Callback(progress_message, TimeInterval(2))

# Run the simulation (no checkpoint is involved in this example)
model.clock.iteration = 1 # we want to start here for some reason
run!(simulation)

which produces

julia> include("test2.jl")
[ Info: Initializing simulation...
[ Info:     ... simulation initialization complete (383.622 μs)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (620.680 μs).
[ Info: Iter: 2, time: 1 second
[ Info: Iter: 3, time: 2 seconds
[ Info: Iter: 5, time: 4 seconds
[ Info: Iter: 7, time: 6 seconds
[ Info: Simulation is stopping after running for 25.701 ms.
[ Info: Simulation time 8 seconds equals or exceeds stop time 8 seconds.
[ Info: Iter: 9, time: 8 seconds

Basically here there is a "spurious actuation" at the first iteration (here iteration 2, because we started from iteration 1).

This fixes the issue:

using Oceananigans
using Printf

grid = RectilinearGrid(size=(), topology=(Flat, Flat, Flat))
model = NonhydrostaticModel(; grid)
simulation = Simulation(model; Δt=1, stop_time=6)

progress_message(sim) = @info string("Iter: ", iteration(sim), ", time: ", prettytime(sim))
progress_cb = Callback(progress_message, TimeInterval(2))
simulation.callbacks[:progress] = progress_cb

# Run the simulation (no checkpoint is involved in this example)
model.clock.iteration = 1 # we want to start here for some reason
progress_cb.schedule.actuations = 1
run!(simulation)

producing

julia> include("test2.jl")
[ Info: Initializing simulation...
[ Info:     ... simulation initialization complete (595.408 μs)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (876.240 μs).
[ Info: Iter: 3, time: 2 seconds
[ Info: Iter: 5, time: 4 seconds
[ Info: Simulation is stopping after running for 51.452 ms.
[ Info: Simulation time 6 seconds equals or exceeds stop time 6 seconds.
[ Info: Iter: 7, time: 6 seconds


glwagner commented on August 20, 2024

Hopefully #3660 works and also doesn't break anything else


liuchihl commented on August 20, 2024

I attempted to reproduce the issue using the 1D diffusion example in the same environment, but I was unable to do so. After picking up the checkpoint, the output saving interval looked normal (not saving every iteration). The simple example is demonstrated as follows: here.

Our initial guess was that it might be related to #3056. However, even after some tests, such as avoiding setting intervals to transcendental numbers, the output is still saved every iteration for a while after picking up the checkpoint (which is not the desired behavior).

I noticed that when I use IterationInterval instead of TimeInterval, the problem is resolved.
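
One plausible reason an iteration-based schedule is unaffected: it can be decided from the restored clock alone, with no counter that needs to survive the restart. The one-liner below is a hypothetical stand-in to illustrate that idea, not the Oceananigans source.

```julia
# Hypothetical stateless check: due whenever the iteration is a multiple of the interval.
iteration_due(iteration, interval) = iteration % interval == 0

# Iterations 5 through 8 after a pickup at iteration 4, with an interval of 2.
due_after_pickup = [i for i in 5:8 if iteration_due(i, 2)]

@show due_after_pickup
```

Because nothing but the iteration number enters the check, picking up at iteration 4 gives the same firing pattern as an uninterrupted run.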


liuchihl commented on August 20, 2024

I attempted to reproduce the issue using the 1D diffusion example in the same environment, but I was unable to do so.

Ok, after changing from IterationInterval to TimeInterval in the 1D diffusion example (source), I am able to reproduce the same problem now.

Here is the progress message after picking up the checkpoint:
image


glwagner commented on August 20, 2024

Can you paste your minimal example here?


liuchihl commented on August 20, 2024

Yes, of course, here is the MWE.


iuryt commented on August 20, 2024

If I understood correctly, I think that @glwagner just wanted to have the code directly pasted here.

using Oceananigans
using Oceananigans.Units
using Printf

Ns = 200    # number of time saves
T = 7200 # simulation stop time (s)
Δt = 1  # timestep (s)

grid = RectilinearGrid(size=(), topology=(Flat, Flat, Flat))

model = NonhydrostaticModel(; grid, timestepper=:RungeKutta3)

simulation = Simulation(model; Δt, stop_time = T)

progress_message(sim) = @printf("Iteration: %03d, time: %s, Δt: %s, wall time: %s\n",
	iteration(sim), prettytime(sim), prettytime(sim.Δt), prettytime(sim.run_wall_time))

simulation.callbacks[:progress] = Callback(progress_message, TimeInterval(T/Ns))

dir = "output/test_MWE"

## checkpoint  
simulation.output_writers[:checkpointer] = Checkpointer(
                            model,
                            schedule=TimeInterval(T),
                            dir=dir,
                            prefix=string("checkpoint"),
                            cleanup=false)
file = string(dir,"/checkpoint_iteration3600.jld2")
run!(simulation,pickup=file)


glwagner commented on August 20, 2024

Nice, thanks. Yes, it helps if we don't have to click links (especially since, if the links aren't permanent, we lose the documentation of this issue).


glwagner commented on August 20, 2024

Here's a clue: this fixes the issue

Δt = 1    # timestep (s)
T1 = 4    # first simulation stop time (s)
T2 = 2T1  # second simulation stop time (s)
δt = 2    # progress message interval (s)

# Run a simulation that saves data to a checkpoint
simulation = test_simulation(T1, Δt, δt)
run!(simulation)

progress_cb = simulation.callbacks[:progress]
@show actuations = progress_cb.schedule.actuations

# Now try again, but picking up from the previous checkpoint
N = iteration(simulation)
checkpoint = "test_iteration$N.jld2"
simulation = test_simulation(T2, Δt, δt)
progress_cb = simulation.callbacks[:progress]
progress_cb.schedule.actuations = actuations
run!(simulation, pickup=checkpoint)
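
If the first simulation object is no longer available, the counter could in principle be derived from the restored clock time instead of copied over. The helper below is a hypothetical sketch that assumes a fixed interval and a first actuation time of zero; whether #3660 does anything like this is not confirmed here.

```julia
# Hypothetical helper: number of interval boundaries already passed at `time`,
# assuming the schedule's first actuation time is zero and the interval is fixed.
completed_actuations(time, interval) = floor(Int, time / interval)

# Picking up at time 4 with a progress interval of 2 gives a counter of 2,
# the value that lets the second run fire at 6 and 8 rather than on every step.
restored_counter = completed_actuations(4, 2)

@show restored_counter
```

With intervals that are not exactly representable in floating point, the division can land just below a whole number, so a tolerance may be needed in practice.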


liuchihl commented on August 20, 2024

@glwagner Thanks for solving this issue, I've tested your PR, and it works great!


glwagner commented on August 20, 2024

Great! FYI, we like to keep the issue open until the PR is actually merged into main. Since the issue is linked, it will be closed automatically when the PR is merged, so there is no need for manual intervention. I've reopened it to let that happen.


liuchihl commented on August 20, 2024

Ah, gotcha!

