
Comments (10)

wxp16 commented on May 21, 2024

The trained model is uncased, so the value returned for do_lower_case in create_tokenizer_from_hub_module() is True.

But in the FullTokenizer class, when spm_model_file is not None, the current code ignores the value of do_lower_case. To fix this, first add the line self.do_lower_case = do_lower_case in the constructor of FullTokenizer; then, in def tokenize(self, text), lowercase the text when using the SentencePiece model, i.e.

    if self.sp_model:
      if self.do_lower_case:
        text = text.lower()

Hope this works.
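A minimal standalone sketch of the patch shape (this simplified FullTokenizer and DummySPM are stand-ins for illustration, not the actual albert code; a real SentencePiece model would be loaded from the .model file):

```python
class DummySPM:
    """Stand-in for a loaded SentencePiece model; the real object would be a
    sentencepiece SentencePieceProcessor. Splits on whitespace for illustration."""
    def EncodeAsPieces(self, text):
        return text.split()

class FullTokenizer:
    def __init__(self, do_lower_case=True, sp_model=None):
        self.do_lower_case = do_lower_case  # the proposed fix: remember the flag
        self.sp_model = sp_model

    def tokenize(self, text):
        if self.sp_model:
            if self.do_lower_case:
                text = text.lower()  # the proposed fix: lowercase before SPM encoding
            return self.sp_model.EncodeAsPieces(text)
        # Fallback path (WordPiece in the real code), simplified here:
        return text.lower().split() if self.do_lower_case else text.split()

tokenizer = FullTokenizer(do_lower_case=True, sp_model=DummySPM())
print(tokenizer.tokenize("Hello How Are You"))  # ['hello', 'how', 'are', 'you']
```

With the flag stored, the SPM path now lowercases exactly when do_lower_case is True, matching the behavior of the non-SPM path.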

from albert.

s4sarath commented on May 21, 2024

Download the model from TensorFlow Hub. The downloaded model has an assets folder containing a .vocab and a .model file; the .model file is the SentencePiece (SPM) model.

With no SPM Model

vocab_file = '/albert_base/assets/30k-clean.vocab'
spm_model_file = None
tokenizer = tokenization.FullTokenizer(
      vocab_file=vocab_file, do_lower_case=True,
      spm_model_file=spm_model_file)
text_a   = "Hello how are you"
tokens_a = tokenizer.tokenize(text_a)

Output

['hello', 'how', 'are', 'you']

With SPM Model

vocab_file = '/albert_base/assets/30k-clean.vocab'
spm_model_file ='/albert_base/assets/30k-clean.model'
tokenizer = tokenization.FullTokenizer(
      vocab_file=vocab_file, do_lower_case=True,
      spm_model_file=spm_model_file)
text_a   = "Hello how are you"
tokens_a = tokenizer.tokenize(text_a)

Output

['▁', 'H', 'ello', '▁how', '▁are', '▁you']
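Note the '▁' (U+2581) character in the SPM output: SentencePiece uses it to mark word boundaries, so the original string, including its case, can be recovered from the pieces. This small check shows the SPM path above did not lowercase "Hello":

```python
# Pieces produced by the SPM tokenizer above; '▁' (U+2581) marks a word start.
pieces = ['▁', 'H', 'ello', '▁how', '▁are', '▁you']

# Joining the pieces and mapping '▁' back to spaces recovers the input text.
text = ''.join(pieces).replace('▁', ' ').strip()
print(text)  # Hello how are you
```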


np-2019 commented on May 21, 2024

I had a similar issue,
my workaround was to use convert_examples_to_features from XLNet's run_squad.py and prepare_utils, with the necessary changes. This let me bypass the problem.


Rachnas commented on May 21, 2024

Thanks @s4sarath and @np-2019, I am able to process the data with 30k-clean.model. I also incorporated convert_examples_to_features from XLNet with other changes; I am not bypassing the SP model.


s4sarath commented on May 21, 2024

@np-2019 - It is better not to use the XLNet preprocessing; things are a bit different here. The provided code runs without any error. If you are familiar with BERT preprocessing, this is very close to it, except for the use of the SentencePiece model.


Rachnas commented on May 21, 2024

Thanks @wxp16, your fix (storing do_lower_case in the FullTokenizer constructor and lowercasing the text in tokenize() when using the SentencePiece model) helped.


Rachnas commented on May 21, 2024

Sharing my learning: using the XLNet preprocessing will not help, because the sequence of tokens in XLNet and ALBERT differs. SQuAD 2.0 will get preprocessed, but training will not converge. It is better to make selective changes in the ALBERT code only.


np-2019 commented on May 21, 2024

FYI @Rachnas and @s4sarath, using the XLNet preprocessing I could achieve the following results on SQuAD 2.0:
[Screenshot of SQuAD 2.0 results, 2019-11-19]


s4sarath commented on May 21, 2024

@np-2019 - Those are pretty good results. Which ALBERT model (large, xlarge) and version (v1 or v2) did you use?


Rachnas commented on May 21, 2024

@np-2019, it is very nice that you are able to reproduce the results successfully.

According to the XLNet paper, section 2.5: "We only reuse the memory that belongs to the same context. Specifically, the input to our model is similar to BERT: [A, SEP, B, SEP, CLS]."
According to the ALBERT paper, section 4.1: "We format our inputs as "[CLS] x1 [SEP] x2 [SEP]"."

As we can see, the CLS token is in a different position. Will it not cause any problem if we format the data according to XLNet?
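The difference in layout can be sketched as follows (token order only; the segment packing and special-token strings are illustrative, not the actual code of either repo):

```python
def xlnet_input(tokens_a, tokens_b):
    # XLNet paper, sec. 2.5: [A, SEP, B, SEP, CLS] -- CLS comes last
    return tokens_a + ['<sep>'] + tokens_b + ['<sep>', '<cls>']

def albert_input(tokens_a, tokens_b):
    # ALBERT paper, sec. 4.1: [CLS] x1 [SEP] x2 [SEP] -- CLS comes first
    return ['[CLS]'] + tokens_a + ['[SEP]'] + tokens_b + ['[SEP]']

a, b = ['question'], ['context']
print(xlnet_input(a, b))   # ['question', '<sep>', 'context', '<sep>', '<cls>']
print(albert_input(a, b))  # ['[CLS]', 'question', '[SEP]', 'context', '[SEP]']
```

Since ALBERT's pretraining always saw CLS in the first position, feeding it XLNet-ordered inputs would put the classification token somewhere the model never learned to expect it.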

