polm commented on June 12, 2024

There isn't a single standard way to do this.

The most common approach is to build a fixed vocabulary, assign every word an integer index, and use those indices. You can also use fixed-size hashes if you're reasonably sure they won't collide, which is what spaCy does; for example, you can read about how the Vocab works.
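As a rough illustration of both options, here is a minimal sketch in plain Python. The function names (`build_vocab`, `encode`, `hash_id`) and the `<unk>` convention are my own assumptions for the example, and the hash uses `hashlib.blake2b` as a stand-in; this is not how spaCy implements its Vocab.

```python
import hashlib


def build_vocab(tokenized_docs, unk_token="<unk>"):
    """Assign each distinct token an integer index; index 0 is reserved for unknowns."""
    vocab = {unk_token: 0}
    for doc in tokenized_docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode(tokens, vocab, unk_token="<unk>"):
    """Map tokens to indices, falling back to the unknown index."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]


def hash_id(token):
    """Derive a stable 64-bit ID from the token text, with no vocabulary needed.

    blake2b is an assumption made for this sketch, not spaCy's hash function.
    """
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


# Toy, already-tokenized documents (hypothetical data).
docs = [["猫", "が", "好き"], ["犬", "が", "好き"]]
vocab = build_vocab(docs)

# "嫌い" was never seen, so it maps to the unknown index 0.
ids = encode(["猫", "が", "嫌い"], vocab)
print(ids)  # [1, 2, 0]

# Hashing gives every string an ID, seen or not, at the cost of possible collisions.
print(hash_id("嫌い"))
```

The trade-off is that the indexed vocabulary needs an explicit policy for unseen words, while hashing never fails to produce an ID but cannot be reversed without also storing the strings.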

Usually the tricky part is not the vectorization but building the vocabulary. The simplest thing is to use BPE, as with SentencePiece, but that has been criticized, and the right way to handle it is an area of active research. It's also easier to run into issues in Japanese than in English because of the larger number of characters used. You can see a variety of strategies used in the awesome-bert-japanese repo, or see some details of how GPT works with Japanese in this recent article by @passaglia.
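To make the BPE idea concrete, here is a toy sketch of learning merges from a tiny word list. It is a simplification written for illustration: it works on pre-split words and starts from individual characters, whereas SentencePiece trains on raw text, and the names `learn_bpe`, `merge_pair`, and `apply_bpe` are assumptions of this example rather than any library's API.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the joined symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges by repeatedly joining the most frequent adjacent pair."""
    corpus = Counter(tuple(word) for word in words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Ties are broken by the pair itself so the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus.
words = ["東京"] * 3 + ["東京都"] + ["京都"] * 2
merges = learn_bpe(words, num_merges=2)
print(merges)                       # [('東', '京'), ('京', '都')]
print(apply_bpe("東京都", merges))  # ['東京', '都']
print(apply_bpe("大阪", merges))    # ['大', '阪']
```

Note that the starting alphabet here is every distinct character in the corpus. With Japanese text that set is much larger than with English, which is one reason vocabulary building is harder.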

Also, your question assumes you are lemmatizing text before vectorizing it. You can definitely do that, but replacing words with lemmas is not common in modern large models, which generally have enough parameters to learn from unlemmatized text. Lemmatization was more important in older models with a limited number of features.
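For a sense of why lemmatization mattered when features were scarce, here is a small sketch comparing vocabulary sizes. The (surface, lemma) pairs are hand-written for this example and only stand in for what a tokenizer might produce; they are not actual fugashi output.

```python
# Hand-written (surface, lemma) pairs for 食べた / 食べる / 食べれば.
# These are assumptions for illustration, not real tokenizer output.
tokens = [
    ("食べ", "食べる"), ("た", "た"),
    ("食べる", "食べる"),
    ("食べれ", "食べる"), ("ば", "ば"),
]

surface_vocab = {surface for surface, _ in tokens}
lemma_vocab = {lemma for _, lemma in tokens}

# Three conjugated forms of one verb collapse into a single lemma entry.
print(len(surface_vocab))  # 5
print(len(lemma_vocab))    # 3
```

With a small feature budget, collapsing conjugations like this frees up slots for other words; a large model can instead afford to keep the surface forms and learn the relationship itself.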

from fugashi.
