facebookresearch / adaptive-span Goto Github PK

View Code? Open in Web Editor NEW

606.0 606.0 60.0 107 KB

Transformer training code for sequential tasks

License: Other

Python 86.08% Shell 13.92%

adaptive-span's People

Contributors

Stargazers

Watchers

adaptive-span's Issues

A question about parameter z_t

It's a good work.
I am so sorry to ask a stupid question. From my understanding, the current_val represents the span parameter z_t. Is this correct?

Will adaptive-span have faster predictive speeds than gpt-2?

If adaptive-span and gpt-2 have the same model size and the same context length(e.g. 1024), will adaptive-span predict faster?

did you try to start with maximum possibile cache size

Just curious: would the results (the final span length and pred accuracy) still hold if you start with max cache size -- initialized by torch.ones (need to reduce S) or random number between 0~1 instead of zeros?

adaptive-span/adaptive_span.py

Line 34 in a8d90b8

self.current_val = nn.Parameter(torch.zeros(*shape) + init_val)

Queries about adaptive span

Hi, I had few queries:

Do adaptive span change with time as the model sees more data ? Or is the span static ? In my experiments, they do not seem to change for some reason.
Secondly, as long as the values in current_val lies between [0,1], adaptive span loss won't change right since you are using _clamp(0,1). So how much weight does this loss carry ?

Warning with PyTorch 1.4

UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)

Please convert to a permissive license

Other Facebook projects like react use permissive licenses like MIT, would it be possible to relicense this for commercial use so startups could participate in development also?

Accept a mask to remove padding in batch

@tesatory and team.

Thank you for releasing the Adaptive span transformer. For me is the best version of transformer so far!

On thing I noticed, comparing to another (great) transformer( https://github.com/idiap/fast-transformers), I noticed is that when I set in the forward call the mask of padded items, the model converges much faster.

Is this something that could be on adaptive-span?

Where to find the pretrained checkpoint?

For example, checkpoint on the enwiki8 dataset.

what is the cache_size mean?

cache_size and L means what?

Generate text

How to generate a text giving some seed?

Is there a better way than iterating by one byte with?

adaptive-span/trainer.py

Line 20 in a8d90b8

out, h_cache = model(X, h_cache)

Understanding adaptive-span loss

Hi,

Sorry to bother you. I have gone through the paper several times. I've also looked at the code many times
I just had one query with adaptive span loss. Here's what I interpreted:
This parameter self.current_val = nn.Parameter(torch.zeros(*shape) + init_val) is responsible for calculating loss, mask and span.
In this case, this parameter will be initialized with zero values since as per your config since init_val is kept as 0 (since the mean of all the values of the parameter will be 0).

My question is how is this parameter getting updated ?

When I call adaptive_span.get_loss(), it in turn calls:
self._loss_coeff * self._max_span * self._mask.current_val.mean() which will also return 0.
When I do :
adaptive_span.clamp_param(), nothing will happen since all the values inside the parameter were initialized with 0. These are the only two function calls happening inside train method.
Can you please point out what am I missing ?

The way you preprocess data is different from that of Transformer-XL

I noticed that you add a <eos> tokens at the end of each line:
https://github.com/facebookresearch/adaptive-span/blob/master/data.py#L34

But in Transformer-XL's code, they do not add<eos> for enwik8 and text8:
https://github.com/kimiyoung/transformer-xl/blob/master/pytorch/data_utils.py#L205

According to my experience, in enwik8 (sentence length is short), using <eos> would make the final bpc/bpb about 0.02 lower.
It's better if you use the same setting for fair comparison.

confuse

Are the results of dev and test on the test data set?

Experiment | #params | dev | test
enwik8 | 38M | 1.04 bpb | 1.02 bpb
enwik8_large | 209M | 1.00 bpb | 0.98 bpb
text8 | 39M | 1.05 bpc | 1.11 bpc
text8_large | 209M | 1.01 bpc | 1.07 bpc

Do they need to be evaluated multiple times on the test? When I reproduce the model's train and valid bpcs are much larger than those obtained on the test, is it normal?

why not compare other local attention methods？

e.g. Modeling Localness for Self-Attention Networks
Convolutional self attention

Using mask can reduce FLOPs?

Hi,
Good Paper.
From the source code your idea is implemented by mask(AdaptiveMask) which means there will be no FLOPs saving right?

Understanding graphs from papers

Thanks for replying to my previous questions. In the fig 3 of your paper, i had few queries.

In Average Span vs Span Limit (Central graph), you showed that in case of fixed span model, span increases as span limit increases. I wanted to ask, as per your code base, spans are already monitored by current_val only if adapt_span_enabled is set to True (line). So how did you measure the span of fixed model because in that case, the bool value will be false, and then AdaptiveSpan won't monitor it. How did you measure the span of fixed model ?
In FLOPS vs Span Limit, you showed that FLOPS keep on increasing in the case of fixed span model while in the case adaptive span, FLOPS were constant (approximately linear). After through inspection, FLOPS are constant in adaptive span but they don't see seem to be rising in case of standard attention as well. In both the cases, FLOPS are same. Could you please share some insights.

Thanks

What does batch-size mean using distributed trainning?

Hi, Thanks for your great work.

When I set batch-size=8 on 2 gpus, it means that there are 8 batches on each gpu or 4 batches on each gpu?

Why does the hyper-parameter --batch-sz affect the bpc during evaluation?

When I tune the hyper-parameter --batch-sz , I observe different results.

Compute attention span of individual attention heads

I am working in model interpretability and wish to learn more about what each head is looking at and it's attention span (similar to the graphs from the paper). Could you please share what did you use to get the span of individual head ?

BPC

Scripts in experiments directory calculates bits per byte, not bits per character. Am I right?

It is important when comparing chars or words perplexities.

For example, for English enwiki8 ratio chars to bits is 1.0033040809995477:
BPB: 1.0 -> byte perplexity: 2.718 -> char perplexity: 2.727

For Polish ratio chars to bits is 1.0505100080652954:
BPB: 1.0 -> byte perplexity: 2.718 -> char perplexity: 2.859

Question: How to reduce the memory in this project

Hi, I read your paper ,it's great. I'm very interesting about how to reduce the memory in the real project.

I guess the memory things are:
in

adaptive-span/adaptive_span.py

Line 127 in d882404

key_pe = key_pe[:, :, trim_len:]

But I just see you cut the key_pe and It's just reduce a little memory and wouldn't help for reduce the Q K things I think.

So. can you explain How to reduce the memory in the code?

thanks

facebookresearch / adaptive-span Goto Github PK

adaptive-span's People

Contributors

Stargazers

Watchers

Forkers

adaptive-span's Issues

My question is how is this parameter getting updated ?

Recommend Projects

Recommend Topics

Recommend Org