Comments (10)
@bratao if you are curious what my recommendation is, it would be https://github.com/lucidrains/routing-transformer or https://github.com/lucidrains/compressive-transformer-pytorch for now
Wow, I loved it. Thanks.
Sorry for the boldness, but I have to take the opportunity to ask an expert like you, if you don't mind. @lucidrains, if you are busy, feel free to ignore my question.
I'm looking into using transformers for sequence labeling on very large, effectively unbounded documents. Each document is composed of many smaller documents that contain some kind of structured information, and I model this as a sequence labeling task. You can see an example below.
A simple first-order CRF gets very good performance, around 0.98 F1, but misses some examples that require longer context.
I tried LSTM, GRU, and SRU with a CRF head and got good, but lower, performance; SRU reached 0.96 F1 on my dataset.
Then I moved to transformers. I set up a search that trains more than 12 transformer variants, including all of your transformers, TENER, Transformer-XL, R-Transformer, adaptive-attention-span transformers, and linear transformers (from https://github.com/idiap/fast-transformers). But the best result so far is the linear transformer, with an F1 of 0.33. I train with a moving window of 8192 tokens.
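Roughly, the windowing I use looks like this (a simplified sketch; the helper name and the 50% overlap are just illustrative, only the 8192-token window matches my setup):

```python
def window_document(token_ids, labels, window=8192, stride=4096):
    """Split one long document into overlapping windows for sequence labeling.
    token_ids and labels are parallel lists (or 1-D tensors) of the same length."""
    chunks = []
    for start in range(0, len(token_ids), stride):
        end = min(start + window, len(token_ids))
        chunks.append((token_ids[start:end], labels[start:end]))
        if end == len(token_ids):  # last window reached the end of the document
            break
    return chunks
```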
Do you have any recommendation or a pointer to where to look for this problem?
Thank you so much, and sorry to bother you.
Thank you so much @lucidrains ❤️
I will take a closer look at the compressive and routing transformers. I will also let this hyperparameter search run for a while to see if there is any hope with transformers.
But following your advice, I will focus on RNN alternatives such as IndLSTM and QRNN.
@bratao they are not mutually exclusive! Try a bidirectional LSTM in the earlier layers, followed by attention near the top. I guarantee you'll see a gain :)
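Something roughly like this, as a quick sketch (the hyperparameters are made up, and vanilla nn.MultiheadAttention is standing in for whatever attention layer you end up choosing):

```python
import torch
from torch import nn

class BiLSTMThenAttention(nn.Module):
    """Bidirectional LSTM layers for local context, self-attention near the top
    for global context, then a linear layer producing per-token tag logits."""
    def __init__(self, vocab_size, num_tags, dim=256, lstm_layers=2, heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim // 2, num_layers=lstm_layers,
                            batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.to_tags = nn.Linear(dim, num_tags)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        x = self.embed(tokens)                  # (batch, seq_len, dim)
        x, _ = self.lstm(x)                     # local context from the BiLSTM
        attended, _ = self.attn(x, x, x)        # global context from attention
        x = self.norm(x + attended)             # residual connection + layer norm
        return self.to_tags(x)                  # (batch, seq_len, num_tags)
```

You could still put a CRF on top of the logits, exactly as you do with your current models.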
@bratao you will like this! https://openreview.net/forum?id=qVyeW-grC2k
Performer isn't quite ready yet. Linear attention variants come with some drawbacks that are often swept under the rug (as most papers do). There is a memory cost in the autoregressive case; the authors solve it with Jax, but for PyTorch I may need to reach for CUDA code. Also, the quadratic cost still exists, it is just shifted to the head embedding dimension. I would avoid using this repository until that is solved!
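To make the "shifted to the head dimension" point concrete, here is a minimal non-causal linear attention sketch (using a simple elu + 1 feature map as a stand-in for Performer's FAVOR+ random features). The seq_len × seq_len attention matrix is never formed, but a dim_head × dim_head summary is, so the cost becomes O(seq_len · dim_head²); the causal variant has to keep a running prefix of that summary per position, which is where the extra memory cost comes from:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, dim_head)
    q = F.elu(q) + 1   # positive feature map (stand-in for FAVOR+ random features)
    k = F.elu(k) + 1
    # (dim_head x dim_head) summary of keys and values -- this replaces the
    # (seq_len x seq_len) attention matrix of softmax attention
    kv = torch.einsum('bhnd,bhne->bhde', k, v)
    normalizer = torch.einsum('bhnd,bhd->bhn', q, k.sum(dim=2)) + 1e-6
    return torch.einsum('bhnd,bhde->bhne', q, kv) / normalizer.unsqueeze(-1)
```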
@bratao how big is your dataset? how many tokens is your average document?
@lucidrains My dataset has approximately 150 documents, with about 12k tokens on average. I'm training with a batch size of 3 (that is what fits on the T4 GPU), with the optimizer as a search hyperparameter that picks between Ranger, SGD, and Adam.
@bratao ahhh, I see, you are in the low-data regime. I think it is best to pull a pre-trained model off the shelf from Huggingface, as training a sparse attention network from scratch on so little data will not be enough. Attention is all we need, given enough data and compute :)
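For example, something like this as a starting point (just a sketch; the checkpoint and the tag count are placeholders, and a long-context model such as Longformer gives you a 4096-token window instead of BERT's 512):

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "allenai/longformer-base-4096"   # placeholder long-context checkpoint
num_tags = 5                                  # replace with the size of your tag set

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=num_tags)

text = "one window of your long structured document"
inputs = tokenizer(text, truncation=True, max_length=4096, return_tensors="pt")
logits = model(**inputs).logits               # (1, seq_len, num_tags) per-token logits
```

Then fine-tune on your 150 documents, window by window, and keep your CRF on top if it helps.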
@bratao if you are interested in pre-training a sparse attention network yourself, you can use https://github.com/lucidrains/electra-pytorch, which is the most effective way to do so as of today
Nice, I will try it and report back as soon as I get the results!