Comments (5)
@SeanLee97 Actually, we already have a WordPiece tokenizer in Transformers.jl. See here and here.
from transformers.jl.
It's on the roadmap, but it won't be covered by the GSoC project. I'm considering wrapping the Rust implementation in huggingface/tokenizers with BinaryBuilder.jl.
This project needs more love from the community. Looks like progress is stalled? @chengchingwen
Yes, I'd love to see more people contributing to this project. I'm currently quite busy (working and studying), so I can't spend much time on it. I do have some unreleased tokenizer code snippets, but most of them are not well tested. I'll try to squeeze out some time to release them (probably in 1-2 weeks).
Some updates/thoughts about tokenizers:
- The Rust implementation has neither a stable API nor a public C API for making bindings. We would need to either come up with a C API or find some way to hook Rust functions directly from Julia.
- The easiest way would actually be to use PyCall.jl to load huggingface/transformers or huggingface/tokenizers and use those tokenizers from Python, but I personally dislike this approach. I avoid Python like the plague (that's why there is a Pickle.jl).
- Ideally I would love to see a native Julia implementation, but then lots of stuff would need to be reimplemented. Currently we only have some basic tokenizers that I modified from the original Python implementation (the WordPiece in src/bert, the Bpe from BytePairEncoding.jl) and many other things from WordTokenizers.jl.
- Binding is a desirable approach, but the problem is that most tokenizers are implemented in a binding-unfriendly way, like Cython/Python or C++/Rust without a C API.
I implemented a WordPiece tokenizer in native Julia, named BertWordPieceTokenizer, and registered it on JuliaHub.
Currently, it can load WordPiece tokenizers (e.g. BERT, RoBERTa), but it fails for SentencePiece tokenizers such as ALBERT.
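For readers unfamiliar with the term, WordPiece tokenization is usually described as a greedy longest-match-first split of each word against a fixed vocabulary, with continuation pieces marked by a `##` prefix. Below is a minimal sketch of that idea in Python. It is not taken from BertWordPieceTokenizer or Transformers.jl; the function name, the toy vocabulary, and the `max_chars` limit are all assumptions made for illustration.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", max_chars=100):
    """Split one word into WordPiece sub-tokens (illustrative sketch).

    `vocab` is a set of known pieces; pieces that continue a word
    are stored with a leading "##".
    """
    # Very long words are mapped straight to the unknown token.
    if len(word) > max_chars:
        return [unk_token]

    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        # If no piece matches at this position, the whole word is unknown.
        if match is None:
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"un", "##aff", "##able", "[UNK]"}
print(wordpiece_tokenize("unaffable", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

SentencePiece models such as ALBERT's work differently: they operate on raw text, including whitespace, rather than on pre-split words, so a splitter like this one does not apply to them.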