Comments (1)
There isn't a single standard way to do this.
The most common approach is to build a fixed vocabulary, assign every word an integer index, and use those indices. You can also use fixed-size hashes if you're reasonably sure they won't collide, which is what spaCy does; for example, you can read about how the Vocab works.
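As a rough illustration of both options, here is a minimal sketch in plain Python. The function names (`build_vocab`, `encode`, `hash_id`) and the `<unk>` token are made up for this example, and `hashlib.blake2b` is only a stand-in for a 64-bit hash; it is not the hash function spaCy uses internally. The input is assumed to be already tokenized.

```python
import hashlib


def build_vocab(tokenized_docs, unk_token="<unk>"):
    """Assign each distinct token an integer index; index 0 is reserved for unknowns."""
    vocab = {unk_token: 0}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode(tokens, vocab, unk_token="<unk>"):
    """Map tokens to indices, falling back to the unknown index for unseen tokens."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]


def hash_id(token):
    """Derive a stable 64-bit ID from the token's UTF-8 bytes (no vocabulary needed)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


# Hypothetical pre-tokenized sentences.
docs = [["猫", "が", "好き", "です"], ["犬", "が", "好き", "です"]]
vocab = build_vocab(docs)

# "嫌い" was never seen while building the vocabulary, so it maps to index 0.
ids = encode(["猫", "が", "嫌い", "です"], vocab)

# Hash IDs need no vocabulary, but they can collide in principle.
cat_id = hash_id("猫")
```

The index approach needs the vocabulary to be stored alongside the model, while the hash approach trades that for a small collision risk and the need for a separate table if you want to map IDs back to strings.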
Usually the tricky part is not the vectorization but building the vocabulary. The simplest thing is to use BPE, as with SentencePiece, but that has been criticized, and the right way to handle it is an area of active research. It's also easier to run into issues in Japanese than in English because of the larger number of characters used. You can see a variety of strategies used in the awesome-bert-japanese repo, or see some details of how GPT works with Japanese in this recent article by @passaglia.
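To make the BPE idea concrete, here is a toy character-level sketch. It is not how SentencePiece is implemented, and the helper names (`learn_bpe`, `merge_pair`, `segment`) and the word list are invented for illustration. It repeatedly merges the most frequent adjacent pair of symbols, so the vocabulary ends up being the starting characters plus the learned merges.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules, starting from single characters."""
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge to every word in the corpus.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def segment(word, merges):
    """Split a word into subword units by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical training words; the pair (東, 京) is the most frequent.
words = ["東京", "東京都", "京都", "東京駅"]
merges = learn_bpe(words, num_merges=1)
pieces = segment("東京駅", merges)
```

Because the starting symbols are individual characters, a Japanese corpus begins with thousands of distinct symbols where English begins with a few dozen, which is one reason vocabulary building runs into more trouble in Japanese.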
Also, your question assumes you are lemmatizing text before vectorizing it. You can definitely do that, but replacing words with lemmas is not common in modern large models, which generally have enough parameters to learn from unlemmatized text. Lemmatization was more important in older models with a limited number of features.
from fugashi.