Comments (3)

tRosenflanz commented on May 10, 2024

Any news on this? Training from token counts would be very nice.
For instance, in my use case the original data is processed in Spark, so obtaining word counts is easy, while dumping all of the strings into files would be very slow.

from tokenizers.

n1t0 commented on May 10, 2024

I'm really not sure that giving the ability to feed word counts is a good idea: the PreTokenizer is in charge of the pre-tokenization, so counts produced outside of it could introduce discrepancies between training and tokenization. Also, this trainer API is subject to change in the future and may not be compatible with word counts anymore.
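To make the discrepancy concern concrete, here is a minimal sketch in plain Python. It is only an illustration: `upstream_word_counts` and `pre_tokenize` are hypothetical stand-ins (a whitespace split for the external job, a simple regex for the pre-tokenizer), not the library's actual behavior.

```python
import re
from collections import Counter


def upstream_word_counts(lines):
    # Hypothetical external job (e.g. a Spark aggregation): whitespace split only.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def pre_tokenize(text):
    # Hypothetical stand-in for a pre-tokenizer: punctuation becomes its own unit.
    return re.findall(r"\w+|[^\w\s]", text)


def pre_tokenized_counts(lines):
    counts = Counter()
    for line in lines:
        counts.update(pre_tokenize(line))
    return counts


corpus = ["hello, world", "hello world!"]
upstream = upstream_word_counts(corpus)
expected = pre_tokenized_counts(corpus)

# "Words" present in the external counts that the pre-tokenizer would never
# produce at tokenization time.
mismatched = sorted(set(upstream) - set(expected))
print(mismatched)
```

With counts like these, a trainer would learn from units such as `hello,` that never appear once the pre-tokenizer runs on real input, and the frequency of `hello` itself would be undercounted.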

What about being able to stream raw strings instead?
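As a rough sketch of what streaming could look like from the caller's side, the generator below yields raw strings in batches without writing them to files. Everything here is an assumption for illustration: `stream_batches` and `consume` are hypothetical names, and `consume` merely stands in for a trainer that would accept an iterator of strings.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def stream_batches(rows: Iterable[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield lists of raw strings without materializing the whole dataset."""
    batch: List[str] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter batch.
        yield batch


def consume(batches: Iterable[List[str]]) -> Counter:
    # Hypothetical stand-in for a trainer that reads batches of raw strings.
    counts = Counter()
    for batch in batches:
        for text in batch:
            counts.update(text.split())
    return counts


def make_rows() -> Iterator[str]:
    # Stand-in for rows pulled lazily from an upstream source.
    return (f"event_{i % 3} started" for i in range(7))


batch_sizes = [len(b) for b in stream_batches(make_rows(), batch_size=3)]
word_counts = consume(stream_batches(make_rows(), batch_size=3))
```

Because the consumer sees raw strings, whatever pre-tokenization it applies during training is the same one applied later at tokenization time.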

tRosenflanz commented on May 10, 2024

That would be useful too, albeit more cumbersome.

Would you mind expanding on the discrepancies you foresee? Currently, I use a custom loop built on tfds SubwordTextEncoder methods that takes in word counts, iterates over the words and tokenizes them, increments a token counter by the source word's count for each token, and then builds the encoder vocab from the token counter.

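    import os
    from collections import defaultdict

    import tensorflow_datasets as tfds

    # text_encoder, _UNDERSCORE_REPLACEMENT and _prepare_tokens_for_encode are
    # tfds internals imported elsewhere; events, min_count and output_path are
    # defined earlier in the job.

    # Word counts come straight from Spark: one row per distinct event_code.
    code_count = events.groupby("event_code").count().collect()
    print("Starting indexing")
    tokenizer = text_encoder.Tokenizer(
        alphanum_only=False, reserved_tokens=[_UNDERSCORE_REPLACEMENT]
    )
    # Credit each token with the count of the word it came from.
    token_counts = defaultdict(int)
    for row in code_count:
        tokens = tokenizer.tokenize(row["event_code"])
        tokens = _prepare_tokens_for_encode(tokens)
        for token in tokens:
            token_counts[token] += row["count"]
    # Private tfds method: builds the subword vocabulary from the token counts.
    encoder = tfds.features.text.SubwordTextEncoder._build_from_token_counts(
        token_counts, min_count, [], 4, 40
    )
    encoder.save_to_file(os.path.join(output_path, "tensorflow_vocabulary_file"))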

But this is a) really slow when encoding and b) ugly and hard to customize, since it relies on a private method.
