
Comments (20)

alexmathfb commented on May 22, 2024

To clarify, I list how a few of the config files differ.

Base config file from which all other config files inherit.

lr = 0.0005
emb_dim = 32
num_heads = 1
num_layers = 1
qkv_dim = 32
mlp_dim = 64
dropout_rate = 0.3
attention_dropout_rate = 0.2
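
For reference, a minimal sketch of what this base config could look like as an ml_collections ConfigDict (the field layout and names here are my assumption, mirroring the style of the per-model configs; the actual base config file may organize things slightly differently):

import ml_collections

def get_config():
  # Sketch of the shared base config; values taken from the list above.
  config = ml_collections.ConfigDict()
  config.learning_rate = 0.0005
  config.model = ml_collections.ConfigDict()
  config.model.emb_dim = 32
  config.model.num_heads = 1
  config.model.num_layers = 1
  config.model.qkv_dim = 32
  config.model.mlp_dim = 64
  config.model.dropout_rate = 0.3
  config.model.attention_dropout_rate = 0.2
  return config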

Specific changes to config files

[screenshot: per-model changes to the config files]

What hyperparameters did you use?

I apologize if I misunderstood anything. I really hope that I am misunderstanding something. If this is not the case, I hope we can fix this reproducibility issue. If we do not succeed, I will feel forced to raise these reproducibility concerns on your OpenReview entry so that further research is not negatively impacted.

vanzytay commented on May 22, 2024

Could you remind me again which dataset you're having a problem with? Cifar?

MostafaDehghani commented on May 22, 2024

I replied to your question here, and I explained the root cause of the inconsistency [we will try to update the appendix of the paper with the latest hps we found.]

The source of truth is the code, and I think it's reasonable to try the configs that are in the code, without changing them, before complaining about reproducibility (I wish the FNet authors had also simply tried the configs in the code, instead of changing them to the sub-optimal hps from the paper!).

Please let us know if you have any issues reproducing the results given the configs we shared; we are more than happy to help.

alexmathfb commented on May 22, 2024

Yes, I'm considering the image/cifar config files.

MostafaDehghani commented on May 22, 2024

Do you have any issue reproducing the results of any model on CIFAR with the shared configs (given the title of the issue you opened is "CRITICAL: Not Reproducible")?

alexmathfb commented on May 22, 2024

I'll try to run them when I get the necessary computing resources. That said, I don't expect to encounter any issues when using the config files that specify different hyperparameters for each Transformer.

I think my confusion arose from reading section 3.2 Philosophy of Benchmark.

The large search space motivates us to follow a set of fixed hyperparameters for all models. ... we plan to release the code with all the hyperparameters and implementation details.

This led me to believe that the config files you released in this repository would contain fixed hyperparameters across all the Transformer variants.

Q1. Are the current config files the ones you used to create Table 1?

Model | ListOps | Text | Retrieval | Image | Pathfinder | Path-X
Transformer | 36.37 | 64.27 | 57.46 | 42.44 | 71.40 | FAIL

alexmathfb commented on May 22, 2024

This Google Colab clones this repository and trains a vanilla Transformer on CIFAR10. The result is 31.22% accuracy using the default parameters from the config file, whereas Table 1 claims 42.44%.

I0827 20:13:32.115470 140655253817216 train.py:179] test in step: 34999, loss: 1.9255, acc: 0.3122

I apologize if I ran the code wrong.

Please note that I am able to get 42% accuracy as reported in your paper, but this requires me to manually mix hyperparameters from your article, the FNet paper, and the ones reported in the other GitHub issue.

alexmathfb commented on May 22, 2024

I think text is limiting our ability to communicate clearly about these issues. I'd be happy to do a video chat so we can get to the bottom of this. If you prefer to continue communicating through text that's also fine by me.

MostafaDehghani commented on May 22, 2024

Thank you. I'll double-check the config against our internal code after the weekend and will get back to you.

alexmathfb commented on May 22, 2024

Thanks for the quick reply. I suspect the following is causing the problem.

Transformer: emb_dim: 32, mlp_dim: 64, num_heads: 1, qkv_dim: 32
Performer: emb_dim: 128, mlp_dim: 128, num_heads: 8, qkv_dim: 64

This causes a large difference in the number of parameters.

Transformer # params: 52,266
Performer # params: 248,458

Notably, the Performer model breaks the 10%-more-parameters rule. The hyperparameters you shared here lead to the same number of parameters as the Performer.
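
For a rough sanity check, here is my own back-of-the-envelope estimate of where the gap comes from (this is not the repo's counting code; it assumes the CIFAR10 input is a 1024-long sequence over 256 pixel intensities, covers a single encoder layer, and ignores layer norms and the classifier head, so the totals are only approximate):

def estimate_params(emb_dim, qkv_dim, mlp_dim, vocab_size=256, seq_len=1024):
  # Token + learned positional embeddings.
  embeddings = vocab_size * emb_dim + seq_len * emb_dim
  # Q/K/V projections plus the attention output projection, with biases.
  attention = 3 * (emb_dim * qkv_dim + qkv_dim) + qkv_dim * emb_dim + emb_dim
  # Two-layer feed-forward block, with biases.
  mlp = emb_dim * mlp_dim + mlp_dim + mlp_dim * emb_dim + emb_dim
  return embeddings + attention + mlp

print(estimate_params(emb_dim=32, qkv_dim=32, mlp_dim=64))    # ~49k, vs. 52,266 counted
print(estimate_params(emb_dim=128, qkv_dim=64, mlp_dim=128))  # ~230k, vs. 248,458 counted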

MostafaDehghani commented on May 22, 2024

I dug into our internal experiments a bit, and based on what I found, these configs (on top of those in the base config) should give a test accuracy of 0.42438.

  config.model.num_layers = 1
  config.model.classifier_pool = "CLS"
  config.model.emb_dim = 128
  config.model.mlp_dim = 128
  config.model.num_heads = 8
  config.model.qkv_dim = 64

I will redo the experiment to confirm this and, if that's the case, will send a fix (I'm still not sure what went wrong such that the internal vanilla Transformer config did not propagate to the public repo).
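
Concretely, the fix would amount to something like the following transformer_base.py (a sketch only; the base-config module name and the model_type field are my assumptions about the public repo layout):

from lra_benchmarks.image.configs.cifar10 import base_cifar10_config

def get_config():
  # Sketch: start from the shared base config and apply the overrides above.
  config = base_cifar10_config.get_config()
  config.model_type = "transformer"
  config.model.num_layers = 1
  config.model.classifier_pool = "CLS"
  config.model.emb_dim = 128
  config.model.mlp_dim = 128
  config.model.num_heads = 8
  config.model.qkv_dim = 64
  return config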

alexmathfb commented on May 22, 2024

I'm still not sure what went wrong such that the internal vanilla Transformer config did not propagate to the public repo

This makes me concerned that the same could happen for other files. Would it be possible to make test cases that compare the functionality of the public code against the internal code?
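
For example, a hypothetical consistency test could load each public config and compare the model hyperparameters against a snapshot exported from the internal code (the module path comes from this repo; the snapshot values are just the corrected vanilla Transformer config above):

import importlib

# Hypothetical snapshot of the internal hyperparameters, one entry per config file.
EXPECTED = {
    "transformer_base": dict(num_layers=1, emb_dim=128, mlp_dim=128, num_heads=8, qkv_dim=64),
}

def test_cifar10_configs_match_internal_snapshot():
  for name, expected in EXPECTED.items():
    module = importlib.import_module(f"lra_benchmarks.image.configs.cifar10.{name}")
    config = module.get_config()
    for key, value in expected.items():
      assert config.model[key] == value, (name, key, config.model[key], value)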

MostafaDehghani commented on May 22, 2024

I checked it for all other models and they are fine.
transformer_base.py was the only config that existed before this commit, and it seems we failed to replace it. Configs for the other models were only added with that push to the repo.

MostafaDehghani commented on May 22, 2024

This should solve the problem. Thanks for spotting and reporting it. Let us know if you have any questions.

albertfgu commented on May 22, 2024

Hi, I would like to piggyback on this thread with a few questions:

The default hyperparameter setup (each benchmark should have a config file now). You are not allowed to change hyperparameters such as embedding size, hidden dimensions, number of layers of the new model.

Some parameters are specified to be fixed (such as number of layers), but these are not fixed for all the models in the configs (e.g. Longformer and Reformer have depth 4 instead of 1 on the CIFAR task). Can this be clarified?

The new model should be within at best 10% larger in terms of parameters compared to the base Transformer model in the provided config file.

If this metric is the ultimate test for whether a model is fair, could the number of parameters of the base Transformer model be released (e.g. in the Table in the README)? This would help researchers build on this work and make it easy to check fairness of new models. Otherwise, each work has to individually re-calculate this fixed number which is error-prone.

albertfgu commented on May 22, 2024

The new model should be within at best 10% larger in terms of parameters compared to the base Transformer model in the provided config file.

Upon doing some of the calculations, I am confused about whether this condition is accurate. For example, the Longformer has essentially the same hparams as the base Transformer (same embedding / projection dims), except that it has 4x the layers.

https://github.com/google-research/long-range-arena/blob/main/lra_benchmarks/image/configs/cifar10/transformer_base.py
https://github.com/google-research/long-range-arena/blob/main/lra_benchmarks/image/configs/cifar10/longformer_base.py

My understanding is that each layer of the Longformer has the same parameter count as the Transformer, but just changes the attention computation. Wouldn't the Longformer backbone then have 4x the parameters as the base Transformer?

Related to this, I have a question about the tuning procedure for the models. Was the configuration specified in this config the only Transformer model tried, or were bigger models tried during tuning? (e.g. the base Transformer was swept up to depth 4, but depth 1 was found to be optimal). If this is the case, shouldn't this statement be modified to:

The new model should be within at best 10% larger in terms of parameters compared to the largest base Transformer model that was tried.

If this is the case, we have no way of knowing what the maximum allowed parameter count is without this number being explicitly calculated and released. It is not feasible for researchers to individually count the parameters of all 11 models here and take the maximum.

alexmathfb commented on May 22, 2024

@albertfgu

Disclaimer: I am not an author of this code.

Some parameters are specified to be fixed (such as number of layers), but these are not fixed for all the models in the configs (e.g. Longformer and Reformer have depth 4 instead of 1 on the CIFAR task). Can this be clarified?

It is true that their article claims the number of layers, embedding dim and number of heads are fixed. This seems to be a mistake. In this comment Mostafa explains that the CIFAR10 hyperparameters were chosen based on a hyperparameter search which gave different parameters to each efficient Transformer.

To further complicate matters, hyperparameters are updated whenever better ones are found. To the best of my knowledge there is no overview of how the hyperparameters changed over time. That is, no table showing (hyperparameters X, accuracy Y, date of experiment) for every efficient Transformer.

If this metric is the ultimate test for whether a model is fair, could the number of parameters of the base Transformer model be released (e.g. in the Table in the README)? This would help researchers build on this work and make it easy to check fairness of new models. Otherwise, each work has to individually re-calculate this fixed number which is error-prone.

I think this is a great point and agree 100%. I'm planning to re-implement this repository in PyTorch and to extensively document the hyperparameter search, the hyperparameters, the number of parameters in each model, the different experiments, and more. I may complement this with an article that critically assesses issues related to hyperparameters, etc.

Q. Would you be interested in such code?

Wouldn't the Longformer backbone then have 4x the parameters as the base Transformer?

Great question. This also confused me. I admit I'm not too familiar with the Longformer architecture; however, my parameter counting script (see the recurse function below) returned the following.

Vanilla Transformer: 248,458
Longformer: 545,418

If this is true, the Longformer violates the 10% parameter rule. Alternatively, it may be that we were (again) given the wrong hyperparameters.

Was the configuration specified in this config the only Transformer model tried, or were bigger models tried during tuning? (e.g. the base Transformer was swept up to depth 4, but depth 1 was found to be optimal). If this is the case, shouldn't this statement be modified to ...

This is a very good point, and I'm very interested to know the answer. I think your correction is critical! Indeed, this makes it even more important that the number of parameters be shared.


import numpy as np


def recurse(dct, level=0):
  # Recursively count parameters in a (possibly nested) dict of arrays.
  total = 0
  for k, v in sorted(dct.items()):
    if isinstance(v, dict):
      total += recurse(v, level=level + 1)
    else:
      print(">" * level, k, v.shape, np.prod(v.shape))
      total += np.prod(v.shape)
  return total


# model.params is the parameter dict of the trained model from the Colab above.
print("Total parameters", recurse(model.params))

Please let me know if you find any issues with the script.

vanzytay commented on May 22, 2024

This is a lively conversation about the "rules" and setup of LRA. It's been a bit over a year since this benchmark went out, and I have been observing papers and usage of this benchmark. It's probably time to write some reflections on the journey and the difficulties of maintaining a live benchmark.

I would like to chime in with some (perhaps philosophical) thoughts and also my observations from other papers that report results using LRA. Some parts of this "philosophy" have dynamically evolved over time, partly influenced by discussions with reviewers at ICLR and other folks.

The initial intention of LRA was to maintain a leaderboard where folks could just climb to their hearts' content. Most leaderboards are "anything goes": you don't have to be concerned about what hparams your predecessors used and can just focus on getting the best numbers. But then we realized that LRA is not meant to be a leaderboard and/or a Kaggle competition of any sort.
In our paper, we specifically mention that we envisioned LRA to be a diagnostic set of sorts, and not a benchmark for leaderboard/hill climbing.

Because of the huge number of knobs and hparams of every single model out there, including model-specific hparams, it is almost certain that one can get better results by sweeping the hparam space. Our results in the paper are for reference and are an initial glance at the relative comparison of models. But unfortunately, just as in traditional benchmarking, people tend to simply paste our results into their tables, for the convenience of writing or of showing that their model does better.

We had a lot of discussions about what to do if the authors of model X told us they had found better hparams and convinced us to update the leaderboard. Would that still be fair, given that model X would then have been searched more extensively than the others? This is also the reason why we do not continuously update the leaderboard: we don't want folks to use LRA for hill climbing!

So how do we wish people used LRA? Instead of copying our table, tuning their model more, and reporting a +2% gain for a paper, we wish people would spend time worrying about fair comparisons between their models and their own baselines (yes, by running baselines yourself to show that the new inductive bias you proposed helps!) and use the datasets/tasks at face value. If it helps, we'll add this to the readme so reviewers don't misunderstand.

Back to the 10% rule and the speculation that the configs here are perhaps wrong. This was the initial set of rules we came up with for publishing on the leaderboard, but there is no leaderboard now, and you should be free to use LRA to do whatever you want. Also, because both vision tasks were extremely hard (and no model does reasonably well on them), we had to do a grid search to squeeze out the potential of the different xformer variants.

There is also a lot of conversation on how to build on top of this work. Here are some suggestions.

  1. Focus on the big picture. We would like to see some of these tasks solved, and not by the +2%/+3% increments that many papers report now. Honestly, those gains may have just come from a better hparam search. Focus on inductive bias. Once you find good hparams for your own model, be sure to do the same for the baseline Transformer (or the N other variants). It's also fine if you don't report all 10 Xformers in your paper, and it is better to spend more time tuning the baseline to match the amount of tuning done on your own method.
  2. The 10 models reported in the paper are a snapshot, a first glance. It's been over a year since these experiments were run. I suspect that results may vary slightly, but poking around a bit should give you similar numbers. In that case, do make sure to update the results of the baseline instead of just copying ours.
  3. Table copying is not how we intend this benchmark to be used. You can cite the table as a reference, but be always sure to run the baseline with the same set of hparams that you choose to run your own model with. Report curves and side-by-side comparisons if necessary. There is a chance that you might need 10 layers for a model to solve one of the tasks here; if so, we're not going to let a 4-layer limit restrict you. You'll just have to rerun the baselines with a reasonably equivalent configuration.

Over the next few weeks, perhaps some of these thoughts should be distilled into the main readme, or a follow-up write-up (like this) somewhere more visible.

alexmathfb commented on May 22, 2024

If it helps, we'll add this to the readme so reviewers don't misunderstand.

I think that would help, not only for reviewers but also for authors. Maybe the following will be helpful as inspiration.

README.md
--- 
Goal. The goal of LRA is X. 

To authors. We envision LRA used in different ways. 1) to do C we encourage A, 2) ... 3) ...

To reviewers. We encourage reviewers to reward A and penalize B. ... 

It sounds to me like A may be "testing inductive bias of model and re-running previous models with similar hyperparameters" and B something like "hill-climbing to demonstrate +2% which might've just come from hp search".

... or a follow-up write-up (like this) somewhere more visible.

I think distilling what you learned into something like LRA 2.0 would be very interesting. Making a benchmark like LRA is a very difficult undertaking, but I think such comparisons are immensely important to our community. It would be great to see an article that details the difficulties and the lessons you learned, everything from philosophical difficulties like the ones listed above to simple software-engineering tricks that improve the life of everyone involved.

A few further directions (maybe for LRA 3.0).

  1. Are LRA tasks predictive of other tasks? E.g., is the ranking of X-Transformers on the CIFAR10 task similar to their ranking on ImageNet accuracy?
  2. Scaling laws. Are the scaling laws of small Transformer models on LRA predictive of the scaling laws of larger Transformers?

If this is attainable, we may end up with something like LRA-nano, on which authors can experiment very quickly and whose results hopefully translate to larger models. This might even be used for architecture search.

albertfgu commented on May 22, 2024

Hi Yi,

Thanks for the detailed response. I do understand the difficulties involved here, and I (and my collaborators) certainly appreciate the efforts you have made to provide a useful new benchmark for researchers in this area.

About the specifics of the "rules" involved with using LRA: it sounds frustrating that there isn't a simple way to provide an objective benchmark, as you noted. However, I do think there are (relatively simple) steps that can be taken to make it easier for researchers to use this work fairly and to improve the impact of this benchmark.

As you've noted, it is always possible to find better sets of hyperparameters with more tuning. That is why, instead of simply reporting numbers and "best configs", it would be most helpful to report the exact hyperparameter sweeps done for all models. The original paper lists these sweeps in the Appendix, but the authors have said many times in this repo that those are out of date. Simply having an accurate and easily accessible account of the tuning done for each number reported in the LRA paper makes it possible for future researchers to report those numbers. This enables the tuning procedure reported for a new model to be easily judged against the tuning procedure behind the LRA numbers.

by running baselines yourself to show that the new inductive bias you proposed
better to spend more time tuning the baseline to match the amount of tuning done on your own method
but be always sure to run the baseline with the same set of hparams that you choose to run your own model with

I certainly agree with these sentiments, and that all steps should be taken to be fair to all methods, by reproducing results whenever possible with the same infrastructure and tuning procedures. However, please note that not all researchers have access to the resources needed to perform these hyperparameter searches. Also, having the same numbers vary across follow-up works (if they all report their own version) is not conducive to the field. Having readily reportable numbers is one of the primary benefits of a benchmark such as LRA, and I believe these numbers can be made fair with simple steps such as reporting accurate tuning procedures and parameter counts.

Thanks again for your consideration. Also, I am happy to open these discussions in a new issue if that is more convenient.
