
language_modeling_via_stochastic_processes's People

Contributors

linxueyuanstdio, rosewang2008


language_modeling_via_stochastic_processes's Issues

how to reproduce results in the paper?

Hi Rose and other authors,

I found your work quite interesting, but I'm confused about some details. I hope you can help clarify the following points:

Are there instructions on how to reproduce the results (i.e., the numbers in the tables) in the paper? I understand that you cannot share the large trained models, but could you release scripts and instructions for producing the numbers with small models?

For example, it's not clear to me how you calculated the length mismatch in Table 3. In Appendix E (Wikisection) you wrote "The length mismatch in % used in Table 3 is calculated with respect to the training set lengths", but in your code (language_modeling_via_stochastic_processes/transformers/examples/pytorch/text-generation/generation_metrics.py, lines 308-311), the statistics used for Wikisection match neither the training nor the test statistics. Which split did you actually use? Also, how exactly did you compute the numbers? Did you compare the absolute difference between the average section lengths of each section type, or did you compare the absolute difference for each corresponding pair of examples (since each generation has a corresponding ground truth from which the start latent variables come) and then take the average? Did you average over all section types to produce a single number? And how did you compute the mismatch in the forced long generation setting? I sketch the two interpretations I have in mind below.
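
For concreteness, here is a rough sketch of the two interpretations; the function and variable names are mine and do not come from generation_metrics.py:

    # Hypothetical sketch of the two readings of "length mismatch";
    # none of these names are from the repository.
    import numpy as np

    def mismatch_by_type_means(gen_lengths, ref_lengths):
        # Interpretation 1: compare the average section length per section type,
        # then average the absolute percentage differences into a single number.
        pct_diffs = []
        for section_type in gen_lengths:
            gen_mean = np.mean(gen_lengths[section_type])
            ref_mean = np.mean(ref_lengths[section_type])
            pct_diffs.append(abs(gen_mean - ref_mean) / ref_mean * 100)
        return float(np.mean(pct_diffs))

    def mismatch_per_example(gen_lengths, ref_lengths):
        # Interpretation 2: compare each generation against its own ground-truth
        # section, then average the absolute percentage differences.
        diffs = [abs(g - r) / r * 100 for g, r in zip(gen_lengths, ref_lengths)]
        return float(np.mean(diffs))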

As another example, it is not clear whether the provided generation setting is for the experiments in Table 3 or Table 4: since no_eos is set in the command, it looks like the forced long generation setting, but Table 4 does not use the Wikisection dataset. Can you clarify which experiment the provided generation command corresponds to?

Lastly, I noticed a seeming inconsistency between the paper and the code. In the paper (Appendix C.1) you wrote "We first sample a start and end latent, z_0 ∼ p(z_0), z_T ∼ p(z_T), where p(z_0), p(z_T) are calculated as the density estimates over the training dataset." In the code, however, only z_T is sampled from the Gaussian estimated on the training set; z_0 is taken directly from the encoded first ground-truth sentence: https://github.com/rosewang2008/language_modeling_via_stochastic_processes/blob/main/language_modeling_via_stochastic_processes/transformers/examples/pytorch/text-generation/run_decoding_from_embeddings.py#L371 In fact, the estimated p(z_0) at this line is never used: https://github.com/rosewang2008/language_modeling_via_stochastic_processes/blob/main/language_modeling_via_stochastic_processes/transformers/examples/pytorch/text-generation/run_decoding_from_embeddings.py#L292
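
To make the difference concrete, here is a hedged sketch; latent_dim, the Gaussian parameters, and encode_first_sentence are placeholders rather than the repository's actual API:

    import torch

    latent_dim = 32                                   # placeholder dimension
    mu_0, sigma_0 = torch.zeros(latent_dim), torch.ones(latent_dim)   # fit on train set
    mu_T, sigma_T = torch.zeros(latent_dim), torch.ones(latent_dim)

    # What Appendix C.1 describes: sample both endpoints from the density
    # estimates over the training set.
    z_0 = torch.distributions.Normal(mu_0, sigma_0).sample()
    z_T = torch.distributions.Normal(mu_T, sigma_T).sample()

    # What run_decoding_from_embeddings.py appears to do instead: take z_0
    # from the encoded ground-truth first sentence and only sample z_T.
    def encode_first_sentence():
        return torch.randn(latent_dim)                # stand-in for the trained encoder
    z_0 = encode_first_sentence()
    z_T = torch.distributions.Normal(mu_T, sigma_T).sample()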

Looking forward to your reply!

Thanks,
Yuntian

Code for ROC Stories?

It seems that ROC Stories uses a different setup (text infilling), but I can't find code for infilling in this repository. Am I missing something, or is the code not part of this repo? Thank you in advance!

Issue in data loading

It seems to me that this line should be changed to if 'tm' in self.name, since you use self.start_conversation and self.end_conversation to split the training and test sets for the tm2 dataset (see
https://github.com/rosewang2008/language_modeling_via_stochastic_processes/blob/main/language_modeling_via_stochastic_processes/transformers/src/transformers/data/datasets/language_modeling.py#L1182
). With the current code, the training and test sets appear to end up identical for tm2. A minimal sketch of the check I have in mind is below.
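
Tiny sketch of the proposed check (the helper name is mine, not from language_modeling.py):

    def uses_conversation_split(name: str) -> bool:
        # Proposed check: match any Taskmaster-style dataset name ('tm2', ...)
        # so that self.start_conversation / self.end_conversation are used to
        # split train and test, instead of matching only one exact name.
        return 'tm' in name

    assert uses_conversation_split('tm2')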

Just a simple question on your paper

I just read your paper and found it really impressive. Since I didn't register for ICLR, I want to ask a simple (maybe embarrassing) question here.
I thought the function d() in the loss you proposed computes the distance between the hidden vector of x_t and mu_t (which lies on the line from z_0 to z_T). Does that mean a larger distance between z_t and mu_t gives a larger d()? If so, how does the proposed contrastive loss work? I thought the objective of the loss is to increase d(z_t, mu_t) and to decrease d(z', mu_t).
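
To make the question concrete, here is a rough sketch of how I currently read d() and mu_t; the names and the sign convention are my guesses, which is exactly what I'd like to confirm:

    import torch

    def mu_t(z_0, z_T, t, T):
        # Expected bridge embedding at time t, on the line from z_0 to z_T.
        alpha = t / T
        return (1 - alpha) * z_0 + alpha * z_T

    def d(z, z_0, z_T, t, T):
        # The quantity I am asking about: is d the (positive) distance to mu_t,
        # so that moving z further from mu_t makes d larger, or its negative?
        return torch.norm(z - mu_t(z_0, z_T, t, T)) ** 2

    z_0, z_t, z_T, z_neg = torch.randn(4, 8)      # toy embeddings
    print(d(z_t, z_0, z_T, t=3.0, T=10.0), d(z_neg, z_0, z_T, t=3.0, T=10.0))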

path2huggenface

Hi, how should I change the path2huggenface variable in constant.py?
Thanks a lot!

AttributeError

Thank you very much for open-sourcing your work.

For these two lines, when I run the code I find that the dataset has no attribute A, and the corresponding data_params.dt cannot be found in the brownian_bridge.yaml file.

I'm not sure whether I'm doing something wrong; I look forward to your answer.

error when using batch size > 1 for decoder training

Hello,

Thanks for the amazing work. I run into the following error when using a larger batch size to train the decoder.

  File "/data/khalifam/envs/tc/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/khalifam/tc/language_modeling_via_stochastic_processes/transformers/src/transformers/models/gpt2/modeling_time_gpt2.py", line 1163, in forward
    transformer_outputs = self.transformer(
  File "/data/khalifam/envs/tc/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1148, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/data/khalifam/tc/language_modeling_via_stochastic_processes/transformers/src/transformers/models/gpt2/modeling_time_gpt2.py", line 991, in forward
    outputs = block(
  File "/data/khalifam/envs/tc/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/khalifam/tc/language_modeling_via_stochastic_processes/transformers/src/transformers/models/gpt2/modeling_time_gpt2.py", line 321, in forward
    attn_outputs = self.attn(
  File "/data/khalifam/envs/tc/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/khalifam/tc/language_modeling_via_stochastic_processes/transformers/src/transformers/models/gpt2/modeling_time_gpt2.py", line 247, in forward
    query, key, value = self.c_attn(hidden_states).split(self.split_size, dim=2)
ValueError: not enough values to unpack (expected 3, got 1)

I assume the code only works with bsz=1, but I wanted to make sure.
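
For what it's worth, here is a minimal snippet that reproduces the same unpacking failure; I'm not sure it's the actual root cause with bsz > 1, but it would happen if c_attn's output has last dimension split_size rather than 3 * split_size:

    import torch

    split_size = 768                                  # embed_dim in GPT-2 small

    # When the last dimension is 3 * split_size (query/key/value stacked),
    # .split returns three chunks and the unpacking works:
    q, k, v = torch.randn(2, 10, 3 * split_size).split(split_size, dim=2)

    # When the last dimension is only split_size, .split returns a single
    # chunk and unpacking into three names raises exactly this ValueError:
    try:
        q, k, v = torch.randn(2, 10, split_size).split(split_size, dim=2)
    except ValueError as e:
        print(e)    # not enough values to unpack (expected 3, got 1)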

Issues during encoder training with wikisection data

Hi Rose,

The issue occurred during the testing phase. The error message reads:

ValueError: 'self.log(test_loss, [3.5329943])' was called, but 'ndarray' values cannot be logged

It was raised from pytorch_lightning/core/lightning.py.

Do you have any ideas?
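
In case it helps narrow things down, here is a hedged sketch of the conversion I'm guessing is needed; self.log is commented out because it only exists inside a LightningModule:

    import numpy as np
    import torch

    test_loss = np.array([3.5329943])      # the value from the error message

    # pytorch_lightning's self.log expects a scalar or a tensor, not an ndarray,
    # so converting before logging should avoid the error, e.g.:
    loggable = torch.from_numpy(test_loss).mean()
    # self.log('test_loss', loggable)      # inside the LightningModule's test_step
    print(loggable)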

Thanks so much!

Zhecheng

It seems there is a minor inconsistency between the code and the paper: where does the µt that the negative sample z' should stay away from come from?

Thank you for your excellent work; I still have a question. As shown in a figure from your paper (link below), in the latent space, for a positive triplet (z0, zt, zT) and a negative sample z', the encoder pulls zt close to the expected embedding µt and pushes z' away from µt. My understanding is that µt is computed from z0, zT, and the time t, independently of the negative sample z''s time. In other words, the negative-sample part of the loss should not use the negative sample's own time step.
https://github.com/rosewang2008/language_modeling_via_stochastic_processes/blob/main/images/encoder.png
However, in the code I found that the negative sample's own time t' is used (see t = self.t[idx]), even when this time is greater than T for some triplets in the same batch. This amounts to using z0, zT, and a time t' greater than T to estimate µt' and pushing the negative away from it. I want to know whether this approach differs from the paper and whether it is reasonable; a sketch of the two readings is below. I look forward to your answer, and thank you in advance.
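
To make the two readings concrete, here is a sketch (all names and values are mine):

    import torch

    def mu(z_0, z_T, t, T):
        # Expected bridge embedding at time t between the endpoints.
        return (1 - t / T) * z_0 + (t / T) * z_T

    z_0, z_T, z_prime = torch.randn(3, 8)          # toy embeddings
    t, t_prime, T = 3.0, 11.0, 10.0                # note t_prime > T can occur

    # Reading 1 (the figure, as I understand it): the negative is pushed away
    # from the same mu_t as the positive, i.e. mu computed at the positive's t.
    mu_for_negative_v1 = mu(z_0, z_T, t, T)

    # Reading 2 (the code, via t = self.t[idx]): mu is computed at the
    # negative's own time t_prime, even when t_prime > T.
    mu_for_negative_v2 = mu(z_0, z_T, t_prime, T)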

Question about a recent commit (and also reproducibility questions)

Hi Rose,

The recent commit (cb3d345) changed x_tp1=x_tp to x_tp=x_tp1; are your results based on this new commit or the old one? I ask because I also got different numbers on Wikisection compared to your paper (based on the version before the above commit), as brought up by another person in the latest post of #7. (Just to make sure: the results in the paper are based on GPT2-small, i.e. gpt2 on Hugging Face, not GPT2-large/xl, right?)

Besides, I have a question about simulate_brownian_bridge: in this function, x_tp1 = x_t * (1 - dt/(1-t)) + (dt/(1-t)) * B_T + noise, but why is there a fixed dt = 0.05? According to the Brownian bridge process, shouldn't this be either x_tp1 = x_0 * (1 - t) + t * B_T + noise (if you always interpolate between x_0 and x_T, as in the older version), or x_tp1 = x_t * (1 - 1/(T - num_samples)) + 1/(T - num_samples) * B_T + noise (if you interpolate between x_t and x_T, as in the newer version)? And why is the noise term fixed rather than depending on t and T, as in Equation 1 of the paper? See the sketch below for what I would have expected.
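
For clarity, here is a sketch of what I would have expected based on Equation 1, i.e. interpolation between z_0 and z_T with time-dependent noise; the function and variable names are mine, not the repository's:

    import numpy as np

    def bridge_samples(z_0, z_T, T=1.0, num_samples=20, seed=0):
        # Sample z_t directly from the bridge density I read in Equation 1:
        # mean (1 - t/T) * z_0 + (t/T) * z_T and variance t * (T - t) / T,
        # so the noise scale depends on t and T instead of being fixed.
        rng = np.random.default_rng(seed)
        samples = []
        for t in np.linspace(0.0, T, num_samples):
            mean = (1 - t / T) * z_0 + (t / T) * z_T
            std = np.sqrt(t * (T - t) / T)
            samples.append(mean + std * rng.standard_normal(z_0.shape))
        return samples

    zs = bridge_samples(np.zeros(8), np.ones(8))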

Lastly, I wonder if it's possible for you to share the trained model for one setting (Wikisection, TC-32), since that would also address #7. I understand it's hard to share big files, but Google Drive allows large uploads, and you could remove optimizer.pt to make the checkpoint smaller.

Thanks,
Yuntian
