
dnabert_2: the model hosted on Hugging Face (`huggingface.co`) cannot handle sequences longer than 512 tokens HOT 3 OPEN

basehc commented on May 26, 2024
Despite multiple trials and examining the model configuration, it seems that the model hosted on Hugging Face (`huggingface.co`) cannot handle sequences that exceed a length of 512 tokens. I've provided the relevant code below for clarity.

from dnabert_2.

Comments (3)

akd13 commented on May 26, 2024

Hi @basehc, I pointed this out in a closed issue as well. Let me know if this works for you: #26


basehc commented on May 26, 2024

Dear @akd13, thank you for mentioning it.
Actually, I checked your issue long before I opened this one, but I have no idea whether

```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
tokens = tokenizer(sequence * 10, return_tensors='pt', padding='max_length', truncation=True, max_length=2000)

# Load the config first, then pass it to the model, so the larger
# position limit is applied (the original snippet read dnabert.config
# before dnabert was defined)
config = AutoConfig.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
config.max_position_embeddings = 2000
dnabert = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", config=config, trust_remote_code=True)  # pretrained model

input_ids = tokens.input_ids
attention_mask = tokens.attention_mask
hidden_states = dnabert(input_ids, attention_mask)
```

the code above can work. I may try it if I have time.
However, according to the original paper and the author's comments, the tokenizer from DNABERT-2 should handle sequences longer than 512 tokens automatically. In the experiments from the DNABERT-2 paper, the authors show results on virus classification with sequences about 1000 bp long, so I am confused.
My guess is that the DNABERT-2 model loaded from Hugging Face may be based on the original BERT model, but I will check.
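(Not from this thread.) Until the length limit is resolved, one common workaround is to split a long sequence into overlapping windows that each fit within the model's limit, embed each window, and pool the results. A minimal sketch of the splitting step, with illustrative window/stride values:

```python
def chunk_sequence(seq: str, window: int = 512, stride: int = 256):
    """Split a long DNA sequence into overlapping windows of at most
    `window` characters; consecutive windows start `stride` apart, so
    neighbors overlap by `window - stride` characters."""
    chunks = []
    start = 0
    while True:
        chunks.append(seq[start:start + window])
        if start + window >= len(seq):
            break
        start += stride
    return chunks

windows = chunk_sequence("ACGT" * 1000)  # a 4000 bp toy sequence
print(len(windows), max(len(w) for w in windows))  # prints: 15 512
```

Each window can then be tokenized and embedded separately, with the per-window embeddings averaged (or max-pooled) into one sequence-level vector.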


akd13 commented on May 26, 2024

You're right. If I add the `trust_remote_code=True` flag and change the config, it highly likely defaults to the original BERT model.
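This would explain the 512-token cap: BERT-style models with learned absolute position embeddings are hard-limited by `max_position_embeddings`, whereas ALiBi-style models (which the DNABERT-2 paper describes) impose no such positional limit. An illustrative sketch of that distinction (the helper and the `"alibi"` value are assumptions for illustration, not DNABERT-2's actual API):

```python
from types import SimpleNamespace

def effective_max_length(config):
    # Illustrative only: learned absolute position embeddings cap input
    # length at max_position_embeddings; ALiBi-style biases do not.
    if getattr(config, "position_embedding_type", "absolute") == "alibi":
        return None  # no hard positional limit
    return getattr(config, "max_position_embeddings", None)

bert_like = SimpleNamespace(position_embedding_type="absolute", max_position_embeddings=512)
alibi_like = SimpleNamespace(position_embedding_type="alibi")
print(effective_max_length(bert_like))   # 512
print(effective_max_length(alibi_like))  # None
```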

