
Comments (20)

pender commented on July 21, 2024

Here's the paper that proposes stochastic clipping.


halcy commented on July 21, 2024

fyi: The code I use for truncation is doing (or should be doing) exactly the same thing as the original truncation during generation - it's interpolating the dlatents for all layers above a certain point towards the "average" dlatent vector. This is not necessarily good or ideal for encoding, but it seemed to make sense for the generation part of encoding to behave the same way it would during sampling later. I put it in because my encoding attempts tended to end up looking like the source image but with a dlatent representation that's not like the sampled dlatents at all. A loss based on the distance (under some metric) from the dlatents to the mean dlatent vector might be better.
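In NumPy terms, the interpolation is roughly this (a sketch, not the actual repo code; names are illustrative):

```python
import numpy as np

# Sketch only: pull the coarse layers (below `maxlayer`) towards the
# average dlatent, the same lerp the generator applies at sampling time.
def truncate_toward_avg(dlatents, dlatent_avg, psi=0.7, maxlayer=8):
    layer_idx = np.arange(dlatents.shape[1])[np.newaxis, :, np.newaxis]  # (1, 18, 1)
    coefs = np.where(layer_idx < maxlayer, psi, 1.0)
    return dlatent_avg + (dlatents - dlatent_avg) * coefs  # lerp(avg, dlatents, coefs)
```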


pbaylies commented on July 21, 2024

Hi @oneiroid -- glad you like the implementation! You might be interested in this paper from NVIDIA, where they come up with a realism score based on how close a point is to the manifold. I'm not convinced that what I'm doing is the best solution -- it seems to work well in practice, but I am open to suggestions, and code!


oneiroid commented on July 21, 2024

[Attached grids: lerp_ka_trunk_sing, lerp_ka_trunk_avg]

Thanks for the paper! But they don't seem to be dealing with non-identical dlatents.
What I'm saying is that their manifold is a line in 8- or 18-dimensional space (each dimension being a 512-element dlatent vector). By changing some of those 8 dlatent vectors, the face point in latent space moves away from that line - but that's OK; the manifold is forced into a line intentionally.
My suggestion is to use the closest point on that line for dlatent truncation, instead of the average dlatent.
Using the average dlatent is like moving towards the center point of that line - it causes the space distortions I observed...
I've attached grids with a "disgust" emotion shift over the factor range [-3, 3] - the first using the closest point, the second using the average dlatent.
Reference image below.
[Reference image: kaza_WA0005_sing]


pbaylies commented on July 21, 2024

Thanks for the image, that does seem to result in better interpolations! Do you think this would improve training speed? If you can provide some code, I can test it out.


oneiroid commented on July 21, 2024

All the code is yours. :) I just first run encode_images.py with --tile_dlatents, then use the resulting dlatents in the truncation function:

```python
def truncate_fast(dlat, dlat_avg, truncation_psi=0.7, minlayer=0, maxlayer=8, do_clip=False):
    # Interpolate layers [minlayer, maxlayer) towards dlat_avg by truncation_psi;
    # leave the remaining layers untouched.
    layer_idx = np.arange(18)[np.newaxis, :, np.newaxis]
    ones = np.ones(layer_idx.shape, dtype=np.float32)
    coefs = np.where(layer_idx < maxlayer, truncation_psi * ones, ones)
    if minlayer > 0:
        coefs[0, :minlayer, :] = ones[0, :minlayer, :]
    if do_clip:
        return tflib.lerp_clip(dlat_avg, dlat, coefs).eval()
    else:
        return tflib.lerp(dlat_avg, dlat, coefs)

dlats_mod = truncate_fast(dlats_mod, dlat_singular, truncation_psi=0.7, maxlayer=8, do_clip=True)
```


pbaylies commented on July 21, 2024

@oneiroid this looks good to me; I pushed an update and added an option to effnet_train.py -- take a look?


oneiroid commented on July 21, 2024

Ehh, nah, sorry - I failed to explain it properly...
I meant to completely replace dlatent_avg with a dlatent vector of shape (1, 512), which is found by running your encode script with --tile_dlatents=True.

This has to be done for each image that we want to encode (or once for each person).

Then, during the final encoding (and for learning directions in latent space, interpolations, etc.), all truncation should be done using this pre-found dlatent INSTEAD of the dlatent_avg that we fetch from Gs's own vars.

I'll try to finish and share the code when I get home.
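Roughly, the workflow would be something like this (a sketch; the file name is hypothetical):

```python
# Step 1: run encode_images.py with --tile_dlatents=True and save the
#         resulting (1, 512) dlatent for this image/person.
# Step 2: truncate towards that vector instead of Gs's dlatent_avg.
person_dlatent = np.load('person_dlatent.npy')  # (1, 512), from step 1
dlats_mod = truncate_fast(dlats_mod, person_dlatent,
                          truncation_psi=0.7, maxlayer=8, do_clip=True)
```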


pbaylies commented on July 21, 2024

@oneiroid Please do! That makes a bit more sense. I think the easiest way to do that would be to have an option for a numpy array to load to replace dlatent_avg. I think the only place I'm using that in the encoder is for the L1 penalty.


oneiroid commented on July 21, 2024

Done. I'm going to try to implement a dynamic dlatent_avg for truncation (or clipping). I wonder if it can be found using linear-algebra methods for finding the closest point on the manifold axis. Is that possible for an 8-dimensional space with each dimension represented by 512 floats - I mean, finding the closest point on a line from some reference point?
Also, you use stochastic clipping - is it better than truncating (like @halcy does)?
I've noticed a couple of times that the encoding process resets the dlatent variable midway when the L1 penalty is not used...
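(For the closest-point question, it's just an orthogonal projection; a sketch, treating the stacked dlatents as one flat vector:)

```python
import numpy as np

# Closest point on the line through `a` and `b` to a reference point `p`.
# Works for flattened dlatents of any dimensionality (e.g. 8 * 512 floats).
def closest_point_on_line(p, a, b):
    d = (b - a).ravel()
    t = np.dot((p - a).ravel(), d) / np.dot(d, d)  # scalar coordinate along the line
    return a + t * (b - a)
```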


pbaylies commented on July 21, 2024

@oneiroid merged! I got the stochastic clipping from @pender and his branch -- https://github.com/pender/stylegan-encoder -- it really does seem to speed things up and stop the latents from getting too far away from the model's representations.


oneiroid commented on July 21, 2024

Thanks! About this clipping, though - if its purpose is to keep the dlatents from going to extremes, then it seems like overkill; information gets lost in the randomness... Truncating learnable_dlatents plus your L1 penalty seem to be enough, but I still need to check that.


pbaylies commented on July 21, 2024

@oneiroid that's what I thought at first too; it's a bit counterintuitive. But you can clearly see the effect in the training videos. Basically, stochastic clipping keeps the representation simple, so it short-circuits the optimization from getting too complex and finding a local minimum somewhere else, or wasting time looking for one. We already know that all the good learned representations are closer to the center of the parameter space, so this keeps the search focused there.

P.S. Maybe you could find something that looks better on the surface if you searched further out, but interpolation etc. would be broken, because at that point you're well outside the learned representation of the model.


pender commented on July 21, 2024

I think the intuition (which I'm getting from the paper linked above) is that if you truncate the dlatents (either with the hard bound of a clipping function or the soft bound of L2 loss or similar), you will end up with dlatents that have an abnormal proportion of their components pinned against the bound... so even though each component individually will still be within n standard deviations of the average, the dlatent as a whole will be statistically very different from average. Stochastic clipping prevents that altogether.

But now I'm wondering, should we be stochastically clipping values to keep them close to dlatent_avg, rather than close to zero? I.e. instead of this:

```python
clipping_mask = tf.math.logical_or(self.dlatent_variable > 2.0, self.dlatent_variable < -2.0)
clipped_values = tf.where(clipping_mask, tf.random_normal(shape=self.dlatent_variable.shape), self.dlatent_variable)
```

maybe it should be this:

```python
dlatent_avg = Gs.get_var('dlatent_avg')
dlatent_dist_from_avg = dlatent_avg - self.dlatent_variable
clipping_mask = tf.math.logical_or(dlatent_dist_from_avg > 2.0, dlatent_dist_from_avg < -2.0)
clipped_values = tf.where(clipping_mask, tf.random_normal(shape=self.dlatent_variable.shape) + dlatent_avg, self.dlatent_variable)
```

Does dlatent_avg effectively measure the bias of the mapping network -- the mean around which it maps Z-vectors (latents rather than dlatents) that are normally distributed around zero? I guess an even more principled approach would be to backpropagate all the way through the mapping network to optimize the original Z-vector, and then perform stochastic clipping on that Z-vector based on its distance from zero. Doubt it would be worth the extra fragility in the optimization by propagating all the way through the mapping network though.
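Something like this untested sketch, assuming the standard StyleGAN mapping-network API and a label-free model:

```python
# Untested sketch: optimize Z instead of W+, then stochastically clip Z
# against its unit-normal prior. Assumes label_size=0, so the mapping
# network receives an empty labels tensor.
z_variable = tf.get_variable('learnable_z', shape=(1, 512), dtype='float32',
                             initializer=tf.initializers.random_normal())
labels_in = tf.zeros([1, 0])
dlatents = Gs.components.mapping.get_output_for(z_variable, labels_in)  # (1, 18, 512)
# ...feed `dlatents` into the synthesis network / perceptual loss as usual...
clipping_mask = tf.math.logical_or(z_variable > 2.0, z_variable < -2.0)
clipped_z = tf.where(clipping_mask, tf.random_normal(shape=z_variable.shape), z_variable)
```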

I also haven't played around with the boundary itself. Is 2 standard deviations the right threshold for clipping? Not sure, and I haven't done much testing... the paper was written with respect to a GAN where the latent variables were uniformly distributed in [-1, 1] rather than normally distributed, so it was easy for them to decide to clip at a distance of 1.


pbaylies commented on July 21, 2024

@pender I think you might be right that starting distributed around the average would be better, but I don't know if it would make that much of a difference in practice -- a good choice of value will quickly converge, and a bad choice might just get clipped again (Monte Carlo!). I have played with the threshold; I think [-2, 2] is fine in practice - it seems big enough to allow for quite a bit of variation without getting too crazy. It should be easy to experiment with, though, since I have a flag for it, and of course training videos...


oneiroid commented on July 21, 2024

@pbaylies you are right; my mistake, I was wrong - I hadn't read the original paper thoroughly. The learned manifold is not a line; they do style mixing while training StyleGAN. The fact that samples from W+ (18 vectors) are always a set of only 2 unique (1, 512) W vectors must mean something, but I don't know how to use that - if I've finally got it right. :)

@pender this looks similar to what @halcy does:

```python
def create_variable_for_generator(name, batch_size):
    truncation_psi_encode = 0.7
    layer_idx = np.arange(16)[np.newaxis, :, np.newaxis]
    ones = np.ones(layer_idx.shape, dtype=np.float32)
    coefs = tf.where(layer_idx < 8, truncation_psi_encode * ones, ones)
    dlatent_variable = tf.get_variable(
        'learnable_dlatents',
        shape=(1, 16, 512),
        dtype='float32',
        initializer=tf.initializers.zeros()
    )
    # dlatent_avg fetched from Gs (e.g. Gs.get_var('dlatent_avg'))
    dlatent_variable_trunc = tflib.lerp(dlatent_avg, dlatent_variable, coefs)
    return dlatent_variable_trunc
```

But you'll need to use some truncation when you generate a face from them. I used tflib.lerp_clip - and the resulting dlatents were inside the manifold. But isn't forcing the dlatents to group around dlatent_avg counterproductive? We should be forcing them to group around the bias point of our target face image, not the universal average. Please tell me if I am still not getting something...


pbaylies commented on July 21, 2024

@oneiroid so the interesting thing about doing style mixing at a random point during training is that they also do progressive growing during training -- hence the full range of coarse to fine dlatents in the end, so it isn't two unique vectors either! It's definitely possible to optimize with less than the full set of dlatents, but by default I just use the full set.


pender commented on July 21, 2024

@oneiroid

> @pender this looks similar to what @halcy does:...

That does not look like stochastic clipping, it looks like it's taking the first eight layers 30% of the way toward the average.

> But isn't forcing the dlatents to group around dlatent_avg counterproductive? We should be forcing them to group around the bias point of our target face image, not the universal average. Please tell me if I am still not getting something...

I'm not sure what you mean by "the bias point of our target face image." Encoding is necessary because we don't know which dlatents precisely encode the target face image. The purpose of biasing the result toward the dlatent average is that most randomly generated dlatents (random unit-normal zero-centered Z mapped to dlatent space by the mapping network) cluster near that average, so the synthesis net is more likely to do a good job near that average since that is the domain on which it has primarily been trained.
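(That clustering is easy to sanity-check empirically; a quick sketch using the standard StyleGAN API:)

```python
# Map a batch of unit-normal Z vectors and compare where they cluster
# to the model's tracked dlatent_avg.
z = np.random.randn(10000, 512)
w = Gs.components.mapping.run(z, None)  # -> (10000, 18, 512) dlatents
print(np.abs(w.mean(axis=(0, 1)) - Gs.get_var('dlatent_avg')).mean())  # should be small
```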


oneiroid commented on July 21, 2024

> The purpose of biasing the result toward the dlatent average is that most randomly generated dlatents (random unit-normal zero-centered Z mapped to dlatent space by the mapping network) cluster near that average, so the synthesis net is more likely to do a good job near that average

Right, I get that. Nevertheless, my point still stands: when we search for an image's dlatents, we suppose that for that image (or face) there is some perfect set of 18 dlatents (better if identical) that acts as the dlatent_avg with respect to that image/face. It should be a "standard" (1, 512) dlatent vector, and it may produce a face quite different from the target image. Anyway, its effect is marginal compared to the current results.


pbaylies commented on July 21, 2024

Ok; good discussion, I think we've hashed some things out. I'm going to close this issue; feel free to open any more specific issues if you have an idea, or code. :)

