Comments (10)
@bratao if you are curious what my recommendation is, it would be https://github.com/lucidrains/routing-transformer or https://github.com/lucidrains/compressive-transformer-pytorch for now
Wow, I loved it. Thanks.
Sorry for the boldness, but I have to take the opportunity to ask an expert like you, if you don't mind. @lucidrains, if you are busy, feel free to ignore my question.
I'm looking into using transformers for sequence labeling on very large, effectively unbounded documents. Each document is composed of many smaller documents that contain some kind of structured information, and I model this as a sequence labeling task. You can see an example below.
A simple first-order CRF gets very good performance, around 0.98 F1, but misses some examples that require longer context.
I tried LSTM, GRU, and SRU with a CRF head and got good, but lower, performance; SRU reached 0.96 F1 on my dataset.
Then I moved to transformers. I set up a search that trains more than 12 transformer variants, including all of your transformers, TENER, Transformer-XL, R-Transformer, adaptive-attention-span transformers, and linear transformers (from https://github.com/idiap/fast-transformers). But the best result so far is the linear transformer, with an F1 of 0.33. I train with a moving window of 8192 tokens.
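Roughly, the windowing I use looks like this (a simplified sketch; the helper name and the 50% overlap are just illustrative, only the 8192-token window matches my setup):

```python
def window_document(token_ids, labels, window=8192, stride=4096):
    """Split one long document into overlapping windows for sequence labeling.
    token_ids and labels are parallel lists (or 1-D tensors) of the same length."""
    chunks = []
    for start in range(0, len(token_ids), stride):
        end = min(start + window, len(token_ids))
        chunks.append((token_ids[start:end], labels[start:end]))
        if end == len(token_ids):  # last window reached the end of the document
            break
    return chunks
```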
Do you have any recommendation or a pointer to where to look for this problem?
Thank you so much, and sorry to bother you.
Thank you so much @lucidrains ❤️
I will take a closer look at the compressive and routing transformers. I will also let this hyperparameter search run for a while to see if there is any hope with transformers.
But following your advice, I will focus on RNN alternatives such as IndLSTM and QRNN.
@bratao they are not mutually exclusive! Try a bidirectional LSTM in the earlier layers, followed by attention near the top. I guarantee you'll see a gain :)
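Something roughly like this, as a quick sketch (the hyperparameters are made up, and vanilla nn.MultiheadAttention is standing in for whatever attention layer you end up choosing):

```python
import torch
from torch import nn

class BiLSTMThenAttention(nn.Module):
    """Bidirectional LSTM layers for local context, self-attention near the top
    for global context, then a linear layer producing per-token tag logits."""
    def __init__(self, vocab_size, num_tags, dim=256, lstm_layers=2, heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim // 2, num_layers=lstm_layers,
                            batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.to_tags = nn.Linear(dim, num_tags)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        x = self.embed(tokens)                  # (batch, seq_len, dim)
        x, _ = self.lstm(x)                     # local context from the BiLSTM
        attended, _ = self.attn(x, x, x)        # global context from attention
        x = self.norm(x + attended)             # residual connection + layer norm
        return self.to_tags(x)                  # (batch, seq_len, num_tags)
```

You could still put a CRF on top of the logits, exactly as you do with your current models.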
@bratao you will like this! https://openreview.net/forum?id=qVyeW-grC2k
Performer isn't quite ready yet. Linear attention variants come with some drawbacks that are often swept under the rug (as most papers do). There is a memory cost in the autoregressive case; the authors solve it with Jax, but for PyTorch I may need to reach for CUDA code. Also, the quadratic cost still exists, it is just shifted to the head embedding dimension. I would avoid using this repository until that is solved!
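To make the "shifted to the head dimension" point concrete, here is a minimal non-causal linear attention sketch (using a simple elu + 1 feature map as a stand-in for Performer's FAVOR+ random features). The seq_len × seq_len attention matrix is never formed, but a dim_head × dim_head summary is, so the cost becomes O(seq_len · dim_head²); the causal variant has to keep a running prefix of that summary per position, which is where the extra memory cost comes from:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, dim_head)
    q = F.elu(q) + 1   # positive feature map (stand-in for FAVOR+ random features)
    k = F.elu(k) + 1
    # (dim_head x dim_head) summary of keys and values -- this replaces the
    # (seq_len x seq_len) attention matrix of softmax attention
    kv = torch.einsum('bhnd,bhne->bhde', k, v)
    normalizer = torch.einsum('bhnd,bhd->bhn', q, k.sum(dim=2)) + 1e-6
    return torch.einsum('bhnd,bhde->bhne', q, kv) / normalizer.unsqueeze(-1)
```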
@bratao how big is your dataset? how many tokens is your average document?
@lucidrains My dataset has approximately 150 documents, with about 12k tokens on average. I'm training with a batch size of 3 (that is what fits on the T4 GPU), with the optimizer as a search hyperparameter that picks between Ranger, SGD, and Adam.
@bratao ahhh, I see, you are in the low-data regime. I think it is best to pull a pre-trained model off the shelf from Huggingface, as training a sparse attention network from scratch on so little data will not be enough. Attention is all we need, given enough data and compute :)
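For example, something like this as a starting point (just a sketch; the checkpoint and the tag count are placeholders, and a long-context model such as Longformer gives you a 4096-token window instead of BERT's 512):

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "allenai/longformer-base-4096"   # placeholder long-context checkpoint
num_tags = 5                                  # replace with the size of your tag set

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=num_tags)

text = "one window of your long structured document"
inputs = tokenizer(text, truncation=True, max_length=4096, return_tensors="pt")
logits = model(**inputs).logits               # (1, seq_len, num_tags) per-token logits
```

Then fine-tune on your 150 documents, window by window, and keep your CRF on top if it helps.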
@bratao if you are interested in pre-training a sparse attention network yourself, you can use https://github.com/lucidrains/electra-pytorch, which is the most effective way to do so as of today
Nice, I will try it and report back as soon as I get the results!