zib-iol / FrankWolfe.jl
Julia implementation for various Frank-Wolfe and Conditional Gradient variants
License: MIT License
Add a function to verify user-provided gradients against numerical ones...
Is there a way to ensure that when we call the convert method on the Birkhoff LMO to MOI, it stays a matrix?
using FrankWolfe
using Random
import GLPK
n = 5  # assumed: n was defined elsewhere; a small value for illustration
# initial direction for first vertex
direction_vec = Vector{Float64}(undef, n * n)
randn!(direction_vec)
direction_mat = reshape(direction_vec, n, n)
# takes a matrix and returns a matrix
lmo = FrankWolfe.BirkhoffPolytopeLMO()
x00 = FrankWolfe.compute_extreme_point(lmo, direction_mat)
# switch to the GLPK-backed MOI variant
o = GLPK.Optimizer()
# takes a vector and returns a vector
lmo = FrankWolfe.convert_mathopt(lmo, o, dimension=n)
x00 = FrankWolfe.compute_extreme_point(lmo, direction_vec)
It would be better if it stayed with matrices so that we can simply drop-in replace it. This could probably be done right at the beginning of convert and right before the return.
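A minimal sketch of the idea (hypothetical wrapper, not the package API): reshape to a vector right at the beginning and reshape the vertex back to a matrix right before returning, so the MOI-backed LMO stays a drop-in replacement for the matrix-based one.

struct MatrixShapeLMO{LMO}
    inner::LMO  # vector-based LMO, e.g. the result of convert_mathopt
    n::Int
end
function compute_extreme_point_matrix(lmo::MatrixShapeLMO, direction::AbstractMatrix)
    v = FrankWolfe.compute_extreme_point(lmo.inner, vec(direction))  # flatten at the boundary
    return reshape(v, lmo.n, lmo.n)                                  # back to a matrix before returning
end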
Lazified Conditional Gradients (Frank-Wolfe + Lazification).
EMPHASIS: memory STEPSIZE: adaptive EPSILON: 1.0e-9 max_iteration: 1000 PHIFACTOR: 2 TYPE: Float64
cache_size Inf GREEDYCACHE: false
WARNING: In memory emphasis mode iterates are written back into x0!
─────────────────────────────────────────────────────────────────────────────────────────────────
Type Iteration Primal Dual Dual Gap Time Cache Size
─────────────────────────────────────────────────────────────────────────────────────────────────
ERROR: LoadError: MethodError: Cannot `convert` an object of type
FrankWolfe.RankOneMatrix{Float64,Array{Float64,1},Array{Float64,1}} to an object of type
AbstractArray{T,1} where T
Closest candidates are:
convert(::Type{T}, ::T) where T<:AbstractArray at abstractarray.jl:14
convert(::Type{T}, ::Factorization) where T<:AbstractArray at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/factorization.jl:55
convert(::Type{T}, ::T) where T at essentials.jl:171
Stacktrace:
[1] push!(::Array{AbstractArray{T,1} where T,1}, ::FrankWolfe.RankOneMatrix{Float64,Array{Float64,1},Array{Float64,1}}) at ./array.jl:934
[2] compute_extreme_point(::FrankWolfe.VectorCacheLMO{FrankWolfe.NuclearNormLMO{Float64},AbstractArray{T,1} where T}, ::Array{Float64,2}; threshold::Float64, store_cache::Bool, greedy::Bool, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/spokutta/Code/fwjulia/FrankWolfe.jl/src/oracles.jl:210
[3] lcg(::typeof(f), ::typeof(grad!), ::FrankWolfe.NuclearNormLMO{Float64}, ::FrankWolfe.RankOneMatrix{Float64,Array{Float64,1},Array{Float64,1}}; line_search::FrankWolfe.LineSearchMethod, L::Int64, phiFactor::Int64, cache_size::Float64, greedy_lazy::Bool, epsilon::Float64, max_iteration::Int64, print_iter::Float64, trajectory::Bool, verbose::Bool, linesearch_tol::Float64, emphasis::FrankWolfe.Emphasis, gradient::Nothing) at /home/spokutta/Code/fwjulia/FrankWolfe.jl/src/FrankWolfe.jl:359
[4] top-level scope at /home/spokutta/Code/fwjulia/FrankWolfe.jl/examples/movielens.jl:96
in expression starting at /home/spokutta/Code/fwjulia/FrankWolfe.jl/examples/movielens.jl:96
Think about whether to improve the active-set memory consumption with a memory mode, i.e., for the argmin oracle.
Note: the AFW algorithm is more expensive due to the active set anyway, so we could potentially leave this as is. Decide how necessary it is once we have BCG.
Adjust symbol names to follow the Julia style guide: https://docs.julialang.org/en/v1/manual/style-guide/
Currently many functions are camelCase but should use snake_case (underscores) instead.
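For instance (illustrative, based on names appearing on this page):
benchmarkOracles -> benchmark_oracles
maxIt -> max_iteration
stepSize -> step_size
printIt -> print_iter
dualGap -> dual_gap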
import FrankWolfe
import LinearAlgebra
n = Int(1e3)
k = 10000
xpi = rand(n);
total = sum(xpi);
const xp = xpi ./ total;
f(x) = LinearAlgebra.norm(x-xp)^2
grad(x) = 2 * (x-xp)
lmo = FrankWolfe.UnitSimplexOracle(1.0);
x0 = FrankWolfe.compute_extreme_point(lmo, rand(n))
FrankWolfe.benchmark_oracles(f, grad, lmo, n; k=100, T=Float64)
@time x, v, primal, dualGap, trajectoryM = FrankWolfe.fw(f, grad, lmo, x0, maxIt=k,
    stepSize=FrankWolfe.shortstep, L=2, printIt=k/10, emph=FrankWolfe.blas, verbose=true, trajectory=true, momentum=0.9);
@time x, v, primal, dualGap, trajectoryM = FrankWolfe.fw(f, grad, lmo, x0, maxIt=k,
    stepSize=FrankWolfe.shortstep, L=2, printIt=k/10, emph=FrankWolfe.memory, verbose=true, trajectory=true, momentum=0.9);
The first one works; the second one blows up.
Currently we cannot broadcast with the new sparse types, which somewhat offsets their power.
Have a wrapper function for the different step sizes and strategies, so that we do not have to copy the same code block everywhere.
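A minimal sketch of such a wrapper (hypothetical types and names; the two rules shown are the standard agnostic and short-step ones, with d = x - v):

import LinearAlgebra
abstract type StepSizeRule end
struct Agnostic <: StepSizeRule end
struct ShortStep <: StepSizeRule
    L::Float64  # Lipschitz constant of the gradient
end
# open-loop rule: gamma_t = 2 / (t + 2)
compute_step_size(::Agnostic, t, gradient, d) = 2 / (t + 2)
# short step: gamma_t = <gradient, d> / (L * ||d||^2), clipped to [0, 1]
compute_step_size(rule::ShortStep, t, gradient, d) =
    clamp(LinearAlgebra.dot(gradient, d) / (rule.L * LinearAlgebra.dot(d, d)), 0, 1)

Every algorithm then calls compute_step_size instead of carrying its own copy of the logic.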
simple routine to numerically verify gradients
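A minimal sketch of such a routine (hypothetical name and signature), comparing a user-provided gradient against central finite differences:

function check_gradient(f, grad, x; h=1e-6, atol=1e-4)
    g = grad(x)
    for i in eachindex(x)
        e = zeros(length(x))
        e[i] = h
        g_num = (f(x + e) - f(x - e)) / (2h)  # central difference for the i-th partial
        abs(g[i] - g_num) <= atol || return false
    end
    return true
end
check_gradient(x -> sum(abs2, x), x -> 2x, rand(5))  # true for a correct gradient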
The problem seems to be the n that we pass, which is used for allocating the vectors. Either we could have a function with a different signature, or we could pass the shape, etc. Suggestions welcome.
Example that fails:
FrankWolfe.benchmark_oracles(f, (str, x) -> grad!(str, x), lmo, n; k=100, T=Float64)
added to movielens.jl
This will allow us to access the predefined sets and projections from MathOptSetDistances.jl.
From a user's point of view, this means one could potentially use the algorithms from the package through the https://jump.dev interface.
As discussed, we need an implementation of vanilla FW which maintains the current iterate as a set of atoms and their convex combination.
This could make the nuclear norm problems more stable by explicitly representing a low-rank matrix as a weighted sum of rank-1 matrices, while the current result has a lot of near-zero singular values.
see discussion at the end of #57
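A minimal sketch of the representation (hypothetical types, not the package API):

struct AtomicIterate{AT}
    weights::Vector{Float64}  # convex weights: nonnegative and summing to 1
    atoms::Vector{AT}         # vertices returned by the LMO, e.g. rank-1 matrices
end
# materialize x = sum_i weights[i] * atoms[i] only when a dense iterate is needed
materialize(x::AtomicIterate) = sum(w * a for (w, a) in zip(x.weights, x.atoms))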
lmo = FrankWolfe.LpNormLMO{Float64, 2}(1.0)
x0 = FrankWolfe.compute_extreme_point(lmo, zeros(n))
This results in a NaN answer, presumably because the extreme point of the L2 ball normalizes by the norm of the direction, which is zero here.
carry over the momentum computation from FW to lazified FW
We need both functions to rescale according to the size of the batch so that in expectation we obtain the exact gradient. Right now it seems that we are off by some batch normalization factor in the stochastic case.
This requires discussion; maybe I am missing something.
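For reference, the scaling that makes the minibatch estimator unbiased (a sketch with hypothetical names, assuming the full gradient is the average over all N data points):

# full gradient: (1/N) * sum_i grad_i(θ); for a uniformly drawn batch B,
# E[(1/|B|) * sum_{i in B} grad_i(θ)] = (1/N) * sum_i grad_i(θ),
# so the batch sum must be divided by the batch size, not by N.
batched_gradient(grad_i, θ, batch) = sum(grad_i(θ, i) for i in batch) / length(batch)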
Make FrankWolfe.jl accessible from Python.
When taking a step towards a FW vertex or an away vertex, we need to update the active set. This requires finding whether the vertex exists in the active set. Right now we loop through the active set to check if any vertex in it is equal to the vertex we want to add. There are several occasions on which this is not needed.
Finding the vertex in the convex decomposition can be very costly. An easy fix would be to give active_set_update! an optional index argument, which can be used to update the convex decomposition more quickly in the two cases above. This will likely yield a much bigger improvement if the optimal face is relatively sparse and the active set already contains most of these vertices.
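A sketch of the proposed fix (hypothetical signature, using a weights-plus-atoms representation like the one sketched above): the caller passes the index whenever it already knows it, skipping the linear scan.

function active_set_update!(weights, atoms, lambda, atom; index=nothing)
    idx = index === nothing ? findfirst(a -> a == atom, atoms) : index
    weights .*= (1 - lambda)   # rescale the existing convex weights
    if idx === nothing
        push!(atoms, atom)     # vertex not present: enters the active set
        push!(weights, lambda)
    else
        weights[idx] += lambda # vertex already present: only gains weight
    end
    return weights, atoms
end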
BCG on the movielens example has only two points in the beginning and simply cycles back and forth. Why?
Some of the step sizes require relatively expensive computations. It would be good to make them @emphasis-aware as well and optimize them here and there.
After refactoring AFW and BCG, we lost the memory awareness due to the subroutines. We should restore this. It is not that critical because both are hard on memory anyway, but it will still impact the speed of the iterations, in particular if we do not need to call the LMO.
Most methods consist of a setup part (part 1) and an iteration loop (part 2).
Part 2 could look roughly the same in many algorithms, with the difference being what happens inside each iteration.
For this, an iteration interface could be nice; see the sketch after the resources below. This also lets users debug, inspect, and so on, without us having to anticipate everything they might want to inspect at each iteration. The top-level functions can be kept as-is, but users with high-performance requirements would just do the setup and iteration parts themselves, without allocating or logging.
Resources:
This discussion mentions the approach in Manopt.jl:
https://discourse.julialang.org/t/ann-optimkit-jl-a-blissfully-ignorant-julia-package-for-gradient-optimization/41063
A blog post describes the iterator approach to solve a linear system:
https://lostella.github.io/2018/07/25/iterative-methods-done-right.html
This is not high-priority, but it can make FW more flexible and let users decide what they want to log and how.
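A minimal sketch of what the iteration interface could look like (hypothetical names, not the package API), in the spirit of the posts above:

struct FWIterable{F,G,LMO,XT}
    f::F
    grad!::G
    lmo::LMO
    x0::XT
end
Base.IteratorSize(::Type{<:FWIterable}) = Base.IsInfinite()
function Base.iterate(it::FWIterable, state=(it.x0, 0))
    x, t = state
    gradient = similar(x)
    it.grad!(gradient, x)
    v = FrankWolfe.compute_extreme_point(it.lmo, gradient)
    x_new = x + 2 / (t + 2) * (v - x)  # agnostic step size
    return (x_new, (x_new, t + 1))
end
# the user drives the loop and decides what to log:
# for (t, x) in enumerate(Iterators.take(FWIterable(f, grad!, lmo, x0), 100))
#     t % 10 == 0 && @info "iteration $t" f(x)
# end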
Add examples and check adaptability of the code base to non-trivial atoms.
Easy ones:
More exotic:
E.g., using the stdlib Logger (https://docs.julialang.org/en/v1/stdlib/Logging/) will probably be a good start to log the progression of the algorithm, with users optionally passing a logger to the function to print the log to a file.
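A minimal sketch using only the stdlib Logging API (the assumption here is that the algorithm emits its progress through @info, so a user-supplied logger can redirect it):

using Logging
io = open("fw_progress.log", "w")
with_logger(SimpleLogger(io)) do
    # inside the algorithm, progress would be emitted like this:
    @info "iteration" t=1 primal=0.5 dual_gap=0.1
end
close(io)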
https://github.com/JuliaTesting/ReferenceTests.jl
This could be interesting for things that are hard to test. Most algorithms are not random, so they could be tested against reference outputs to check, for instance, that a change did not alter the per-iteration evolution.
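E.g., a sketch (assuming the trajectory from the runs above is serialized to text; the reference path is hypothetical):

using Test, ReferenceTests
# compares against the stored file and creates it on the first run
@test_reference "references/fw_trajectory.txt" string(trajectoryM)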
A (non-minimal) example:
using FrankWolfe
using LinearAlgebra
using ReverseDiff;
n = Int(1e3);
k = 1000
xpi = rand(n);
total = sum(xpi);
const xp = xpi ./ total;
f(x) = 2 * LinearAlgebra.norm(x - xp)^3 - LinearAlgebra.norm(x)^2
const grad = x -> ReverseDiff.gradient(f, x)
# pick feasible region
lmo = FrankWolfe.ProbabilitySimplexOracle(1);
# compute some initial vertex
x0 = FrankWolfe.compute_extreme_point(lmo, zeros(n));
# benchmarking Oracles
FrankWolfe.benchmarkOracles(f, grad, lmo, n; k=100, T=Float64)
# memory variant
@time x, v, primal, dualGap, trajectory = FrankWolfe.fw(f, grad, lmo, x0, maxIt=k,
    stepSize=FrankWolfe.nonconvex, printIt=k/10, emph=FrankWolfe.memory, verbose=true);
# blas variant ## works only with casting at the moment
# x0 = convert(Vector{promote_type(eltype(x0), Float64)}, x0)
@time x, v, primal, dualGap, trajectory = FrankWolfe.fw(f, grad, lmo, x0, maxIt=k,
    stepSize=FrankWolfe.nonconvex, printIt=k/10, emph=FrankWolfe.blas, verbose=true);
Things to add
needs to be documented properly
Test that the algorithms work with:
For a generic function F(x), one needs to evaluate the gradient at a point, grad(x), as a dense vector, and then perform operations on it, mostly passing it to compute_extreme_point. Could there be an interface to avoid this?
One solution: grad: x -> V{T} is any function-like object that, given a point, returns a vector-like object (it does not have to be a fully materialized vector).
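A minimal sketch of such a function-like object (hypothetical types), here for the gradient 2(x - xp) from the examples above; entries are computed on demand, and the generic AbstractVector fallbacks provide iteration and dot products:

struct LazyGradient{T} <: AbstractVector{T}
    x::Vector{T}
    xp::Vector{T}
end
Base.size(g::LazyGradient) = size(g.x)
Base.getindex(g::LazyGradient, i::Int) = 2 * (g.x[i] - g.xp[i])  # entry on demand
grad = x -> LazyGradient(x, xp)  # "function-like object" returning a vector-like object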
Benchmarking svdl suggests that it needs about 0.5 s per call for the movielens example. This seems way too high. @alejandro-carderera, can you provide some numbers from the Python code for the survey in terms of running time, so that we can compare?
Bring in the lazification from BCG or LCG to AFW so that we can be more efficient for the "localized steps".
Looking at https://github.com/ZIB-IOL/FrankWolfe.jl/blob/master/src/utils.jl#L9, the line search is not doing the same thing: we are not updating gamma in the iteration, only M, so the returned gamma is the initial one.
The algorithms should compute dual prices at near-optimal solutions.
The numerical instabilities seem to come from the gradient being projected onto the probability simplex of the current active set, yielding very low coefficients and high accumulated error.
One could take a few vertices and do a descent step only on those. It needs vertices with both positive and negative dot products with the gradient, since some will be reduced and some increased.
This should especially help BCG in Float64 at large scale, as in the example.
With the current SFW interface, users provide a function that processes one data point; batching happens a level higher, when we call the provided functions.
One possibility would be to make users provide batched functions by default:
f_batched(θ, xs) = sum(f(θ, x_i) for x_i in xs)
g_batched(θ, xs) = sum(g(θ, x_i) for x_i in xs)
What they provide now is the equivalent of the functions f and g above.
There is an example, but some tests should be added so that we can validate changes when we run the tests. This could be verifying the quality of solutions on simple instances:
Permutahedron
Birkhoff Polytope
Nuclear Norm Ball