rosewang2008 / language_modeling_via_stochastic_processes
Language modeling via stochastic processes. Oral @ ICLR 2022.
Hi Rose and other authors,
I found your work quite interesting, but I'm confused about some details and hope you can help clarify them:
Are there instructions on how to reproduce the results (i.e., the numbers in the tables) in the paper? I can understand that you cannot share the large trained models, but could you release scripts and instructions for reproducing the numbers with small models?
For example, it's not clear to me how you calculated the length mismatch in Table 3. In Appendix E (Wikisection) you said "The length mismatch in % used in Table 3 is calculated with respect to the training set lengths", but in your code, at lines 308-311 of language_modeling_via_stochastic_processes/transformers/examples/pytorch/text-generation/generation_metrics.py, the statistics used for Wikisection match neither the training nor the test statistics. Which split did you actually use? Besides, how did you calculate the exact numbers? Did you compare the absolute difference between the average section lengths of each section type, or did you compare the absolute difference between corresponding examples (since each generation has a corresponding ground truth from which the starting latent variables come) and then take the average? Did you average over all section types to produce a single number? And in the forced long generation, how did you compute the mismatch?
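To make the two interpretations concrete, here is a minimal sketch with made-up per-example section lengths (the numbers are purely illustrative, not from the paper or code):

```python
# Hypothetical per-example section lengths for one section type:
# ground-truth lengths vs. lengths of the corresponding generations.
gt_lengths = [120, 80, 100, 95]
gen_lengths = [110, 90, 100, 85]

# Interpretation A: compare the *average* lengths of the section type.
avg_gt = sum(gt_lengths) / len(gt_lengths)
avg_gen = sum(gen_lengths) / len(gen_lengths)
mismatch_avg = abs(avg_gen - avg_gt) / avg_gt * 100

# Interpretation B: compare each generation to its corresponding ground
# truth, then average the per-example relative differences.
mismatch_per_ex = sum(
    abs(gen - gt) / gt for gen, gt in zip(gen_lengths, gt_lengths)
) / len(gt_lengths) * 100

print(mismatch_avg, mismatch_per_ex)
```

The two interpretations can give quite different numbers (here roughly 2.5% vs. 7.8%), which is why knowing which one was used matters for reproducing Table 3.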
As another example, it is not clear whether the provided generation setting is for the experiments in Table 3 or Table 4: since no_eos was set in the command, it looks like the forced long generation setting, but Table 4 doesn't use the Wikisection dataset. Can you clarify which experiment this generation command is for?
Lastly, I noticed a seeming inconsistency between your paper and code. In the paper (Appendix C.1) you said "We first sample a start and end latent, z_0 ∼ p(z_0), z_T ∼ p(z_T), where p(z_0), p(z_T) are calculated as the density estimates over the training dataset." In the code, however, you only sample z_T from the Gaussian estimated on the training set; z_0 is taken directly from the encoded first ground-truth sentence: https://github.com/rosewang2008/language_modeling_via_stochastic_processes/blob/main/language_modeling_via_stochastic_processes/transformers/examples/pytorch/text-generation/run_decoding_from_embeddings.py#L371 In fact, the p(z_0) estimated at this line is never used: https://github.com/rosewang2008/language_modeling_via_stochastic_processes/blob/main/language_modeling_via_stochastic_processes/transformers/examples/pytorch/text-generation/run_decoding_from_embeddings.py#L292
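To illustrate the difference being asked about, here is a hypothetical 1-d sketch (the variable names and the stand-in data are mine, not the repo's):

```python
import random

random.seed(0)
# Stand-in for the encoded first sentences of the training set (1-d latents
# for simplicity; the real z_0 is a vector).
train_z0s = [random.gauss(0.0, 1.0) for _ in range(1000)]

# Paper (Appendix C.1): fit a density estimate over the training z_0s and
# sample a fresh start latent from it.
mu = sum(train_z0s) / len(train_z0s)
var = sum((z - mu) ** 2 for z in train_z0s) / len(train_z0s)
z0_sampled = random.gauss(mu, var ** 0.5)

# Code (run_decoding_from_embeddings.py#L371): take the encoded first
# ground-truth sentence directly, so the fitted p(z_0) is never used.
z0_from_gt = train_z0s[0]  # stand-in for encoder(first ground-truth sentence)
```

The first variant generates bridges from novel start points; the second pins every generation to its ground-truth start, which changes what the evaluation measures.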
Looking forward to your reply!
Thanks,
Yuntian
It seems that ROC Stories uses a different setup, text infilling, but I can't find code for infilling in this repository. Am I missing something, or is the code not part of this repo? Thank you in advance!
It seems to me that this line should be changed to if 'tm' in self.name (using self.start_conversation and self.end_conversation) to split the training and test sets.
Also, the paper text "We pin our trajectory to the start and end latent, and run the Brownian bridge using Equation ??" contains a broken equation reference.
Please check it, thanks!
I read your paper recently and found it really impressive. As I didn't register for ICLR, I want to ask a simple (maybe embarrassing) question here.
I thought the function d() in the loss you proposed computes the distance between the hidden vector of x_t and µ_t (which lies on the line from z_0 to z_T). So does a greater distance between z_t and µ_t make d() bigger? If so, how does the contrastive loss you proposed work? I thought the objective of the loss is to increase d(z_t, µ_t) and to decrease d(z', µ_t).
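For what it's worth, here is a minimal sketch of my reading (not necessarily the authors' exact formulation): if d is taken as a *negative* squared distance, then a z_t closer to µ_t gives a larger d, and the softmax-style objective pulls the positive toward µ_t while pushing negatives away:

```python
import math

def d(z, mu, sigma2=1.0):
    # negative squared distance: larger (closer to 0) when z is near mu
    return -((z - mu) ** 2) / (2 * sigma2)

def contrastive_loss(z_pos, negatives, mu_t):
    # minimize -log softmax: -(d(z_pos, mu_t) - log sum_z' exp(d(z', mu_t)))
    logits = [d(z_pos, mu_t)] + [d(zn, mu_t) for zn in negatives]
    log_denom = math.log(sum(math.exp(l) for l in logits))
    return -(d(z_pos, mu_t) - log_denom)

mu_t = 0.0
# the loss is smaller when the positive sits near mu_t than far from it
near = contrastive_loss(0.1, negatives=[2.0, -1.5], mu_t=mu_t)
far = contrastive_loss(1.9, negatives=[2.0, -1.5], mu_t=mu_t)
print(near < far)  # True
```

Under this sign convention there is no contradiction: maximizing d(z_t, µ_t) relative to the negatives is the same as minimizing the distance of z_t to µ_t.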
Hi, how should I set the path2huggenface variable in constant.py?
Thanks a lot!
Thank you very much for open-sourcing your work.
For these two lines, when I run them, I find that the attribute A of the dataset cannot be found, and the corresponding data_params.dt cannot be found in the brownian_bridge.yaml file.
I don't know if I'm doing something wrong; I look forward to your answer.
Hello,
Thanks for the amazing work. I run into this error when using a larger batch size to train the decoder:
File "/data/khalifam/envs/tc/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
File "/data/khalifam/tc/language_modeling_via_stochastic_processes/transformers/src/transformers/models/gpt2/modeling_time_gpt2.py", line 1163, in forward
    transformer_outputs = self.transformer(
File "/data/khalifam/envs/tc/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1148, in _call_impl
    result = forward_call(*input, **kwargs)
File "/data/khalifam/tc/language_modeling_via_stochastic_processes/transformers/src/transformers/models/gpt2/modeling_time_gpt2.py", line 991, in forward
    outputs = block(
File "/data/khalifam/envs/tc/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
File "/data/khalifam/tc/language_modeling_via_stochastic_processes/transformers/src/transformers/models/gpt2/modeling_time_gpt2.py", line 321, in forward
    attn_outputs = self.attn(
File "/data/khalifam/envs/tc/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
File "/data/khalifam/tc/language_modeling_via_stochastic_processes/transformers/src/transformers/models/gpt2/modeling_time_gpt2.py", line 247, in forward
    query, key, value = self.c_attn(hidden_states).split(self.split_size, dim=2)
ValueError: not enough values to unpack (expected 3, got 1)
I assume the code only works with bsz=1, but I wanted to make sure.
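One torch-free way to see how this unpack can fail (a hypothetical sketch of the mechanics, not a diagnosis of the actual bug): tensor.split(split_size, dim=2) yields one chunk per split_size slice of that dimension, so q, k, v only unpack cleanly if the last dimension is exactly 3 * split_size.

```python
def n_split_chunks(dim_size, split_size):
    # number of chunks torch's Tensor.split would return along one dimension
    return dim_size // split_size + (1 if dim_size % split_size else 0)

embed_dim = 768
# expected case: c_attn projects to 3 * embed_dim, so split gives (q, k, v)
assert n_split_chunks(3 * embed_dim, embed_dim) == 3

# if the hidden states reach attention with only embed_dim in the last
# dimension (e.g., a batching/shape mix-up upstream), split returns a
# single chunk and the three-way unpack raises
try:
    q, k, v = [None] * n_split_chunks(embed_dim, embed_dim)
except ValueError as e:
    print(e)  # not enough values to unpack (expected 3, got 1)
```

So the error message suggests the c_attn output reaching line 247 has the wrong last-dimension size when the batch size exceeds 1.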
Thanks for open-sourcing your code!
However, there seems to be a problem: in run_decoding_from_embeddings.py, line 91 reads x_tp1 = x_t.
This code doesn't work; should the assignment be reversed?
"datasets=2.0.0" → "datasets==2.0.0"
Hi Rose,
The issue occurred during the testing phase. The error message reads:
ValueError: 'self.log(test_loss, [3.5329943])' was called, but 'ndarray' values cannot be logged
and it was raised from pytorch_lightning/core/lightning.py.
Do you have any ideas?
Thanks so much!
Zhecheng
Thank you for your excellent work, but I still have a question. As shown in a figure of your paper (link below), in the latent space, for a positive triplet (z0, zt, zT) and a negative sample z', the encoder pulls zt close to the expected embedding µt and pushes z' away from µt. As I understand it, µt is generated from z0, zT, and time t, which is independent of the time of the negative sample z'. In other words, when calculating the negative-sample part of the loss function, the negative sample's own time step is not used.
https://github.com/rosewang2008/language_modeling_via_stochastic_processes/blob/main/images/encoder.png
However, in the code, I found that the time t' of the negative sample itself is used (see t=self.t[idx]), even when this time is greater than T for some triplets in the same batch. That is equivalent to using z0, zT, and a time t' greater than T to estimate µt' and push the negative away from it. I want to know whether this approach differs from the paper and whether it is reasonable. I look forward to your answers, thank you.
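A hypothetical 1-d sketch of the two readings (using the bridge mean µt as linear interpolation between z0 and zT, per the paper; the numbers are illustrative):

```python
def mu(z0, zT, t, T):
    # Brownian bridge mean: linear interpolation between z0 and zT
    return (1 - t / T) * z0 + (t / T) * zT

z0, zT, T = 0.0, 10.0, 10
t_pos, t_neg = 3, 12  # a negative's time can exceed T when drawn across the batch

# Figure reading: the negative is compared against mu at the *positive's*
# time t, independent of the negative's own time.
mu_paper = mu(z0, zT, t_pos, T)   # 3.0, inside the bridge

# Code reading (t = self.t[idx]): the negative's own time t' is used,
# which extrapolates past zT when t' > T.
mu_code = mu(z0, zT, t_neg, T)    # 12.0, beyond the endpoint zT

print(mu_paper, mu_code)
```

As the sketch shows, when t' > T the "expected embedding" the negative is pushed away from lies outside the bridge segment entirely, which is the behavior the question asks about.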
Hi Rose,
The recent commit changed x_tp1=x_tp to x_tp=x_tp1 (cb3d345), but are your results based on this new commit or the old one? I also got different numbers on Wikisection compared to your paper (based on the version before that commit), as brought up by another person in the latest post of #7. (Just to make sure: the results in the paper are based on GPT2-small, which is called gpt2 in Hugging Face, not GPT2-large/xl, right?)
Besides, I have a question about simulate_brownian_bridge: in this function, x_tp1 = x_t * (1 - dt/(1-t)) + (dt/(1-t)) * B_T + noise, but why is dt fixed at 0.05? According to the Brownian bridge process, shouldn't this be either x_tp1 = x_0 * (1 - t) + t * B_T + noise (if you use the older version, always interpolating between x_0 and x_T), or x_tp1 = x_t * (1 - 1/(T - num_samples)) + 1/(T - num_samples) * B_T + noise (if you use the newer version, interpolating between x_t and x_T)? And why is the noise term fixed rather than depending on t and T as in Equation 1 of the paper?
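For reference, here is a sketch of one discrete bridge step as I understand the standard process (my formulation, not the repo's exact code): conditioning on x_t and the endpoint B_T, the noise variance depends on t and T and shrinks to zero as t approaches T, so the path pins to B_T exactly.

```python
import random

def bridge_step(x_t, B_T, t, T, dt, rng):
    # Conditional mean: pull x_t toward the endpoint B_T.
    drift = x_t * (1 - dt / (T - t)) + (dt / (T - t)) * B_T
    # Conditional variance dt * (T - t - dt) / (T - t): largest mid-bridge,
    # zero on the final step (max() guards against float round-off).
    var = max(dt * (T - t - dt) / (T - t), 0.0)
    return drift + rng.gauss(0.0, var ** 0.5)

rng = random.Random(0)
x, T, dt = 0.0, 1.0, 0.05
B_T = 5.0
n_steps = round(T / dt)
for i in range(n_steps):
    x = bridge_step(x, B_T, i * dt, T, dt, rng)
print(abs(x - B_T))  # ~0: the final step has zero variance, pinning x to B_T
```

With a time-dependent variance like this, a fixed noise scale would indeed disagree with Equation 1's t(T-t)/T behavior, which is what the question points out.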
Lastly, I wonder if it's possible for you to share one trained model setting (Wikisection, TC-32), since that would address issue #7 as well. I understand it's hard to share big files, but Google Drive allows uploading large files, and you could remove optimizer.pt to make the checkpoint smaller.
Thanks,
Yuntian