Comments (5)

chengchingwen commented on May 22, 2024

@SeanLee97 Actually, we already have a WordPiece tokenizer in Transformers.jl. See here and here.

from transformers.jl.

chengchingwen commented on May 22, 2024

It's on the roadmap, but it won't be covered by the GSoC project. I'm considering wrapping the Rust implementation in huggingface/tokenizers with BinaryBuilder.jl.

freddycct commented on May 22, 2024

This project needs more love from the community. Looks like progress is stalled? @chengchingwen

chengchingwen commented on May 22, 2024

Yes. I'd love to see more people contributing to this project. Currently I'm quite busy (working and studying), so I can't spend much effort on it. I do have some unreleased code snippets for the tokenizer, but most of them are not well tested. I'll try to squeeze out some time to release them (probably in 1-2 weeks).

Some updates and thoughts on tokenizers:

  1. The Rust implementation doesn't have a stable API, or even a public C API, for making bindings. We would need to either come up with a C API or find some way to call Rust functions directly from Julia.
  2. The easiest way would actually be to use PyCall.jl to load huggingface/transformers or huggingface/tokenizers and use those tokenizers from Python, but I personally dislike this approach. I avoid Python like the plague (that's why there is a Pickle.jl).
  3. Ideally I would love to see a native Julia implementation, but then a lot of things would need to be reimplemented. Currently we only have some basic tokenizers that I adapted from the original Python implementations (the WordPiece in src/bert, the Bpe from BytePairEncoding.jl) and a lot of other pieces from WordTokenizers.jl.
  4. Bindings are a desirable approach, but the problem is that most tokenizers are implemented in a binding-unfriendly way, such as Cython/Python, or C++/Rust without a C API.
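
To make point 1 concrete, here is a minimal sketch of what "coming up with a C API" over Rust code could look like. Everything in it is an assumption for illustration: `toy_tokenize` and `toy_free_string` are hypothetical names, and the whitespace splitter is a toy stand-in, not the huggingface/tokenizers API.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

/// Toy stand-in for a real tokenizer: splits on whitespace.
fn tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(str::to_owned).collect()
}

/// Hypothetical C entry point. Tokenizes a NUL-terminated UTF-8 string and
/// returns the tokens joined by '\n' in a newly allocated C string.
/// Returns null if the input is null or not valid UTF-8.
///
/// # Safety
/// `text` must be null or point to a valid NUL-terminated string. The result
/// must be released with `toy_free_string`, not with the C `free`.
#[no_mangle]
pub unsafe extern "C" fn toy_tokenize(text: *const c_char) -> *mut c_char {
    if text.is_null() {
        return std::ptr::null_mut();
    }
    let input = match unsafe { CStr::from_ptr(text) }.to_str() {
        Ok(s) => s,
        Err(_) => return std::ptr::null_mut(), // not valid UTF-8
    };
    let joined = tokenize(input).join("\n");
    match CString::new(joined) {
        // Hand ownership of the buffer to the caller.
        Ok(c) => c.into_raw(),
        Err(_) => std::ptr::null_mut(),
    }
}

/// Hypothetical counterpart that gives the buffer back to Rust to be freed.
///
/// # Safety
/// `ptr` must be null or a pointer returned by `toy_tokenize` that has not
/// been freed yet.
#[no_mangle]
pub unsafe extern "C" fn toy_free_string(ptr: *mut c_char) {
    if !ptr.is_null() {
        drop(unsafe { CString::from_raw(ptr) });
    }
}
```

Built as a `cdylib`, functions of this shape could in principle be reached from Julia through `ccall`. On the Rust 2024 edition the export attribute is spelled `#[unsafe(no_mangle)]`. The hard part in practice is not the wrapper itself but deciding on a stable set of such functions, which is the gap described above.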

SeanLee97 commented on May 22, 2024

I implemented a WordPiece tokenizer in native Julia, named BertWordPieceTokenizer, and I've registered it on JuliaHub.

Currently it can load WordPiece tokenizers (e.g. BERT, RoBERTa), but it fails for SentencePiece tokenizers such as ALBERT.
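
For readers unfamiliar with it, the matching step at the heart of a WordPiece tokenizer can be sketched roughly as below. This is a generic illustration of greedy longest-match-first splitting, written in Rust; it is not the code of BertWordPieceTokenizer or Transformers.jl, and the function name `wordpiece`, the `##` continuation prefix, and the tiny vocabulary are assumptions chosen to mirror BERT-style vocabularies.

```rust
use std::collections::HashSet;

/// Greedy longest-match-first WordPiece split of a single word.
/// Non-initial pieces carry the "##" continuation prefix.
/// If some position cannot be matched, the whole word becomes `unk`.
fn wordpiece(word: &str, vocab: &HashSet<&str>, unk: &str) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let mut end = chars.len();
        let mut found = None;
        // Shrink the candidate from the right until it is in the vocabulary.
        while start < end {
            let mut cand: String = chars[start..end].iter().collect();
            if start > 0 {
                cand.insert_str(0, "##");
            }
            if vocab.contains(cand.as_str()) {
                found = Some(cand);
                break;
            }
            end -= 1;
        }
        match found {
            Some(piece) => {
                pieces.push(piece);
                start = end; // continue after the matched piece
            }
            // Nothing matched at this position: give up on the word.
            None => return vec![unk.to_string()],
        }
    }
    pieces
}
```

With a vocabulary containing `un`, `##aff` and `##able`, the word `unaffable` splits into `un`, `##aff`, `##able`, while a word with no matching prefix maps to the unknown token. SentencePiece models work differently (they operate on raw text with a learned unigram or BPE model), which is consistent with a WordPiece-only loader not handling them.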
