Comments (8)
@polm I found a good example: 国家公務員! I'll update soon.
from sudachipy.
This fully connected (unsplit) result comes from the full dictionary.
Oh wow, I completely missed the comment explaining that - sorry for the false alarm!
Would it be better to use a string whose tokenization is the same with both the full and core dictionaries?
I appreciate any concern, thank you.
I mostly agree. In fact, one user (you) was confused.
An example where
- modes A, B, and C return different tokenized results
- all dictionaries return the same tokenization for a given mode
would be the best example.
But I think an example for the core dictionary is enough, as long as the different modes return different results, because the core dictionary is the default dictionary.
Hi, for non-Japanese speakers, split modes are a bit confusing. Could you elaborate on what they correspond to? :)
Also, how can we get output similar to MeCab's from the command line (mecab -O wakati input_file -o output_file returns a file with tokenized strings)?
I want to get rid of the NER information, etc. in the output file produced by sudachipy tokenize input_file -o output_file
Thank you!
The split modes are invented and defined by us, and there isn't really anything that each mode corresponds to, but in our paper, "Sudachi: a Japanese Tokenizer for Business" (Takaoka et al., LREC2018), we state that
- A Unit (short): Compatible with UniDic
- B Unit (middle): Similar to the “words” in general sense
- C Unit (NE): Named entities
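As a rough illustration of how the three units differ in practice, the sketch below tokenizes one string in each mode and collects the surface forms. It assumes SudachiPy and a dictionary package (e.g. SudachiDict-core) are installed; the helper names surfaces_by_mode and load_sudachi are made up for this example, and the actual splits depend on which dictionary is in use.

```python
def surfaces_by_mode(tokenizer_obj, text, modes):
    """Tokenize `text` once per split mode and return {label: [surface, ...]}.

    `modes` maps a label such as "A" to a split-mode value accepted by
    tokenizer_obj.tokenize(text, mode).
    """
    return {
        label: [m.surface() for m in tokenizer_obj.tokenize(text, mode)]
        for label, mode in modes.items()
    }


def load_sudachi():
    """Create a SudachiPy tokenizer and the A/B/C split modes.

    Imported lazily so surfaces_by_mode() can be used without SudachiPy.
    """
    from sudachipy import dictionary, tokenizer

    split_mode = tokenizer.Tokenizer.SplitMode
    modes = {"A": split_mode.A, "B": split_mode.B, "C": split_mode.C}
    return dictionary.Dictionary().create(), modes


def main():
    tokenizer_obj, modes = load_sudachi()
    # One line per mode; shorter units (A) should give more tokens than C.
    for label, surfaces in surfaces_by_mode(tokenizer_obj, "国家公務員", modes).items():
        print(label, surfaces)
```

Calling main() prints one line per mode, which makes it easy to compare the A, B, and C splits for any input string.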
To get only the tokenized words without part-of-speech tags, use SudachiPy as a Python library and call the surface() method.
e.g.,
>>> from sudachipy import dictionary
>>> tokenizer_obj = dictionary.Dictionary().create()
>>> [m.surface() for m in tokenizer_obj.tokenize("これはテストの文です。")]
['これ', 'は', 'テスト', 'の', '文', 'です', '。']
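For the file-to-file case asked about above, the same idea could be wrapped in a small script that writes space-separated surfaces line by line, roughly like mecab -O wakati. This is only a sketch: to_wakati and wakati_file are hypothetical helpers, not part of SudachiPy, and dropping whitespace-only tokens is a design choice made here.

```python
def to_wakati(tokenizer_obj, line):
    """Return the surface forms of `line` joined by single spaces."""
    surfaces = (m.surface() for m in tokenizer_obj.tokenize(line))
    # Drop whitespace-only tokens so separators stay single spaces.
    return " ".join(s for s in surfaces if s.strip())


def wakati_file(tokenizer_obj, input_path, output_path):
    """Tokenize input_path line by line and write space-separated output."""
    with open(input_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.rstrip("\n")
            # Keep empty lines empty so input and output stay line-aligned.
            dst.write((to_wakati(tokenizer_obj, line) if line else "") + "\n")


def main(input_path, output_path):
    # Imported lazily so the helpers above work without SudachiPy installed.
    from sudachipy import dictionary

    wakati_file(dictionary.Dictionary().create(), input_path, output_path)
```

For example, main("input_file", "output_file") would write one tokenized line per input line, with no part-of-speech or other fields.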
Also, your questions are not related to this issue #70; it would be nice if you could create a new issue for the questions next time. Thanks!